Entity Tagging Methods for Named-Entity Recognition Datasets
Named entities are commonly tagged with one of two schemes: 1) BIOES 2) BIO.
The entity categories themselves can be changed to fit your task; raw data is usually annotated in the BIO scheme. I wrote a small tagging script myself, shared below for reference.
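To illustrate the difference between the two schemes, here is a hand-made example (the sentence and its tokenization are invented for illustration; the category name Material matches the annotations used later):

```python
# BIO vs. BIOES tagging of the same token sequence.
# Entities: "gel" (single token) and "polar viscous gels" (multi-token).
tokens = ['the', 'gel', 'contains', 'polar', 'viscous', 'gels', '.']

# BIO: B- marks the first token of an entity, I- the remaining tokens,
# O everything outside any entity. A single-token entity is just B-.
bio = ['O', 'B-Material', 'O', 'B-Material', 'I-Material', 'I-Material', 'O']

# BIOES adds two tags: E- for the last token of a multi-token entity,
# and S- for a single-token entity.
bioes = ['O', 'S-Material', 'O', 'B-Material', 'I-Material', 'E-Material', 'O']

assert len(tokens) == len(bio) == len(bioes)
```

BIOES carries strictly more boundary information than BIO, which is why some sequence-labeling models prefer it; BIO is the more common interchange format for raw data.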
Source text: 1.txt
Inspired by energy-fueled phenomena such as cortical cytoskeleton flows [46,45,32] during biological morphogenesis, the theory of active polar viscous gels has been developed [37,33]. The theory models the continuum, macroscopic mechanics of a collection of uniaxial active agents, embedded in a viscous bulk medium, in which internal stresses are induced due to dissipation of energy [41,58]. The energy-consuming uniaxial polar agents constituting the gel are modeled as unit vectors. The average of unit vectors in a small local volume at each point defines the macroscopic directionality of the agents and is described by a polarization field. The polarization field is governed by an equation of motion accounting for energy consumption and for the strain rate in the fluid. The relationship between the strain rate and the stress in the fluid is provided by a constitutive equation that accounts for anisotropic, polar agents and consumption of energy. These equations, along with conservation of momentum, provide a continuum hydrodynamic description modeling active polar viscous gels as an energy consuming, anisotropic, non-Newtonian fluid [37,33,32,41]. The resulting partial differential equations governing the hydrodynamics of active polar viscous gels are, however, in general analytically intractable.
Manually annotated file: 1.ann
T1 Task 120 155 theory of active polar viscous gels
T2 Process 195 238 models the continuum, macroscopic mechanics
T3 Material 137 155 polar viscous gels
T4 Material 258 280 uniaxial active agents
T6 Material 296 315 viscous bulk medium
T7 Material 415 436 uniaxial polar agents
T8 Material 454 457 gel
* Synonym-of T7 T8
T9 Material 1074 1092 polar viscous gels
T10 Material 1099 1149 energy consuming, anisotropic, non-Newtonian fluid
* Synonym-of T9 T10
T11 Material 1241 1266 active polar viscous gels
T12 Process 628 646 polarization field
T13 Process 652 670 polarization field
T14 Process 689 707 equation of motion
R1 Hyponym-of Arg1:T13 Arg2:T14
T15 Process 866 887 constitutive equation
T16 Process 1023 1057 continuum hydrodynamic description
T17 Process 959 1011 These equations, along with conservation of momentum
* Synonym-of T17 T16
T18 Process 44 71 cortical cytoskeleton flows
T19 Process 90 114 biological morphogenesis
T20 Material 773 778 fluid
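Each `T` line in the .ann file is tab-separated into three fields: the entity id, then "Category start end" (character offsets into 1.txt), then the surface text. A minimal sketch of parsing one such line (the sample line is copied from 1.ann above):

```python
line = 'T1\tTask 120 155\ttheory of active polar viscous gels'

# Field 1: entity id; field 2: category plus character span; field 3: text
ent_id, span_info, surface = line.split('\t')
category, start, end = span_info.split(' ')
start, end = int(start), int(end)

# The span length must match the surface text
assert end - start == len(surface)
print(ent_id, category, start, end, surface)
# T1 Task 120 155 theory of active polar viscous gels
```

Relation lines (`R1 Hyponym-of ...`) and attribute lines (`* Synonym-of ...`) carry no character spans, so a BIO tagger only needs the `T` lines.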
Now to tag the dataset in batch; the reference code is as follows:
import spacy


def extract_entity():
    with open('./data/1.txt', 'r', encoding='utf8') as tfr, \
            open('./data/1.ann', 'r', encoding='utf8') as afr:
        content = tfr.read()
        ann = afr.readlines()

    nlp = spacy.load('en')
    doc = nlp(content)
    # Sentence segmentation: record each sentence's start/end character offsets
    sent_index_dict = {}
    for each_sent in doc.sents:
        sent_index_dict[each_sent] = (each_sent.start_char, each_sent.end_char)

    # Process each annotation line
    for each_ann in ann:
        fields = each_ann.strip().split('\t')
        # Only entity lines (id starts with 'T') carry character spans;
        # relation ('R') and attribute ('*') lines are skipped
        if not fields[0].startswith('T'):
            continue
        category, start, end = fields[1].split(' ')
        task_start, task_end = int(start), int(end)
        task_text = fields[2]

        # Find the sentence containing this entity span by its offsets
        for sent, (sent_start, sent_end) in sent_index_dict.items():
            if task_start >= sent_start and task_end <= sent_end:
                s = sent.text
                sentence_token = [token.text for token in nlp(s)]
                if sentence_token and sentence_token[-1] == '\n':
                    sentence_token = sentence_token[:-1]

                # Sanity check: the annotated text must appear in the sentence
                start_info = s.find(task_text)
                assert s[start_info:start_info + len(task_text)] == task_text

                # Tag every token O first, then overwrite the entity's tokens
                sentence_tag = ['O'] * len(sentence_token)

                # BIO scheme: the entity's first token gets B-, the rest I-.
                # (For BIOES, the last token of a multi-token entity would get
                # E- instead, and a single-token entity would get S-.)
                entity_str = [token.text for token in nlp(task_text)]
                entity_tag = ['B-' + category] + \
                             ['I-' + category] * (len(entity_str) - 1)

                # Locate the entity in the sentence's token list. Note that
                # index() returns the first match, so the sentence can be
                # mislabeled if the entity's first token also occurs earlier.
                entity_index = sentence_token.index(entity_str[0])
                sentence_tag[entity_index:entity_index + len(entity_str)] = entity_tag

                print('Tokens: {}'.format(sentence_token))
                print('Tags:   {}'.format(sentence_tag))
                assert len(sentence_tag) == len(sentence_token)
                break


if __name__ == '__main__':
    extract_entity()
Note: spaCy is used here to process the English text. It is a solid tool, billed as industrial-strength NLP.
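One caveat with the script above: locating the entity via `find()`/`index()` can mislabel a sentence in which the entity's first token also occurs earlier. Since the .ann file already provides character offsets, and spaCy exposes each token's character offset as `token.idx`, a more robust sketch aligns the annotation span to tokens directly (plain Python below, so it runs without a spaCy model; in practice the pairs would come from `[(t.text, t.idx) for t in nlp(sentence)]`):

```python
def char_span_to_bio(tokens_with_offsets, ent_start, ent_end, category):
    """tokens_with_offsets: list of (token_text, start_char) pairs."""
    tags = ['O'] * len(tokens_with_offsets)
    inside = False
    for i, (text, start) in enumerate(tokens_with_offsets):
        end = start + len(text)
        # A token belongs to the entity if it lies within the character span
        if start >= ent_start and end <= ent_end:
            tags[i] = ('I-' if inside else 'B-') + category
            inside = True
    return tags

# "the gel flows" with entity "gel" at character span (4, 7)
toks = [('the', 0), ('gel', 4), ('flows', 8)]
print(char_span_to_bio(toks, 4, 7, 'Material'))
# ['O', 'B-Material', 'O']
```

This removes both the `find()` ambiguity and the need to re-tokenize the entity text separately.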