Entity Tagging Methods for Named-Entity Recognition Datasets
Named entities are commonly tagged with one of two schemes: 1) BIOES 2) BIO.
The entity categories themselves can be changed to fit your task; raw data is usually annotated in the BIO scheme. I wrote a small tagging script myself, shared below for reference.
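To illustrate the difference between the two schemes, here is a hand-made example (the sentence and its tokenization are invented for illustration; the category name Material matches the annotations used later):

```python
# BIO vs. BIOES tagging of the same token sequence.
# Entities: "gel" (single token) and "polar viscous gels" (multi-token).
tokens = ['the', 'gel', 'contains', 'polar', 'viscous', 'gels', '.']

# BIO: B- marks the first token of an entity, I- the remaining tokens,
# O everything outside any entity. A single-token entity is just B-.
bio = ['O', 'B-Material', 'O', 'B-Material', 'I-Material', 'I-Material', 'O']

# BIOES adds two tags: E- for the last token of a multi-token entity,
# and S- for a single-token entity.
bioes = ['O', 'S-Material', 'O', 'B-Material', 'I-Material', 'E-Material', 'O']

assert len(tokens) == len(bio) == len(bioes)
```

BIOES carries strictly more boundary information than BIO, which is why some sequence-labeling models prefer it; BIO is the more common interchange format for raw data.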
Source text: 1.txt
Inspired by energy-fueled phenomena such as cortical cytoskeleton flows [46,45,32] during biological morphogenesis, the theory of active polar viscous gels has been developed [37,33]. The theory models the continuum, macroscopic mechanics of a collection of uniaxial active agents, embedded in a viscous bulk medium, in which internal stresses are induced due to dissipation of energy [41,58]. The energy-consuming uniaxial polar agents constituting the gel are modeled as unit vectors. The average of unit vectors in a small local volume at each point defines the macroscopic directionality of the agents and is described by a polarization field. The polarization field is governed by an equation of motion accounting for energy consumption and for the strain rate in the fluid. The relationship between the strain rate and the stress in the fluid is provided by a constitutive equation that accounts for anisotropic, polar agents and consumption of energy. These equations, along with conservation of momentum, provide a continuum hydrodynamic description modeling active polar viscous gels as an energy consuming, anisotropic, non-Newtonian fluid [37,33,32,41]. The resulting partial differential equations governing the hydrodynamics of active polar viscous gels are, however, in general analytically intractable.
Manually annotated file: 1.ann
T1 Task 120 155 theory of active polar viscous gels
T2 Process 195 238 models the continuum, macroscopic mechanics
T3 Material 137 155 polar viscous gels
T4 Material 258 280 uniaxial active agents
T6 Material 296 315 viscous bulk medium
T7 Material 415 436 uniaxial polar agents
T8 Material 454 457 gel
* Synonym-of T7 T8
T9 Material 1074 1092 polar viscous gels
T10 Material 1099 1149 energy consuming, anisotropic, non-Newtonian fluid
* Synonym-of T9 T10
T11 Material 1241 1266 active polar viscous gels
T12 Process 628 646 polarization field
T13 Process 652 670 polarization field
T14 Process 689 707 equation of motion
R1 Hyponym-of Arg1:T13 Arg2:T14
T15 Process 866 887 constitutive equation
T16 Process 1023 1057 continuum hydrodynamic description
T17 Process 959 1011 These equations, along with conservation of momentum
* Synonym-of T17 T16
T18 Process 44 71 cortical cytoskeleton flows
T19 Process 90 114 biological morphogenesis
T20 Material 773 778 fluid
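Each `T` line in the .ann file is tab-separated into three fields: the entity id, then "Category start end" (character offsets into 1.txt), then the surface text. A minimal sketch of parsing one such line (the sample line is copied from 1.ann above):

```python
line = 'T1\tTask 120 155\ttheory of active polar viscous gels'

# Field 1: entity id; field 2: category plus character span; field 3: text
ent_id, span_info, surface = line.split('\t')
category, start, end = span_info.split(' ')
start, end = int(start), int(end)

# The span length must match the surface text
assert end - start == len(surface)
print(ent_id, category, start, end, surface)
# T1 Task 120 155 theory of active polar viscous gels
```

Relation lines (`R1 Hyponym-of ...`) and attribute lines (`* Synonym-of ...`) carry no character spans, so a BIO tagger only needs the `T` lines.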
Now to tag the dataset in batch; the reference code is as follows:
import spacy


def extract_entity():
    with open('./data/1.txt', 'r', encoding='utf8') as tfr, \
            open('./data/1.ann', 'r', encoding='utf8') as afr:
        content = tfr.read()
        ann = afr.readlines()

    nlp = spacy.load('en')
    doc = nlp(content)
    # Sentence segmentation: record each sentence's start/end character offsets
    sent_index_dict = {}
    for each_sent in doc.sents:
        sent_index_dict[each_sent] = (each_sent.start_char, each_sent.end_char)

    # Process each annotation line
    for each_ann in ann:
        fields = each_ann.strip().split('\t')
        # Only entity lines (id starts with 'T') carry character spans;
        # relation ('R') and attribute ('*') lines are skipped
        if not fields[0].startswith('T'):
            continue
        category, start, end = fields[1].split(' ')
        task_start, task_end = int(start), int(end)
        task_text = fields[2]

        # Find the sentence containing this entity span by its offsets
        for sent, (sent_start, sent_end) in sent_index_dict.items():
            if task_start >= sent_start and task_end <= sent_end:
                s = sent.text
                sentence_token = [token.text for token in nlp(s)]
                if sentence_token and sentence_token[-1] == '\n':
                    sentence_token = sentence_token[:-1]

                # Sanity check: the annotated text must appear in the sentence
                start_info = s.find(task_text)
                assert s[start_info:start_info + len(task_text)] == task_text

                # Tag every token O first, then overwrite the entity's tokens
                sentence_tag = ['O'] * len(sentence_token)

                # BIO scheme: the entity's first token gets B-, the rest I-.
                # (For BIOES, the last token of a multi-token entity would get
                # E- instead, and a single-token entity would get S-.)
                entity_str = [token.text for token in nlp(task_text)]
                entity_tag = ['B-' + category] + \
                             ['I-' + category] * (len(entity_str) - 1)

                # Locate the entity in the sentence's token list. Note that
                # index() returns the first match, so the sentence can be
                # mislabeled if the entity's first token also occurs earlier.
                entity_index = sentence_token.index(entity_str[0])
                sentence_tag[entity_index:entity_index + len(entity_str)] = entity_tag

                print('Tokens: {}'.format(sentence_token))
                print('Tags:   {}'.format(sentence_tag))
                assert len(sentence_tag) == len(sentence_token)
                break


if __name__ == '__main__':
    extract_entity()
Note: spaCy is used here to process the English text. It is a solid tool, billed as industrial-strength NLP.
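One caveat with the script above: locating the entity via `find()`/`index()` can mislabel a sentence in which the entity's first token also occurs earlier. Since the .ann file already provides character offsets, and spaCy exposes each token's character offset as `token.idx`, a more robust sketch aligns the annotation span to tokens directly (plain Python below, so it runs without a spaCy model; in practice the pairs would come from `[(t.text, t.idx) for t in nlp(sentence)]`):

```python
def char_span_to_bio(tokens_with_offsets, ent_start, ent_end, category):
    """tokens_with_offsets: list of (token_text, start_char) pairs."""
    tags = ['O'] * len(tokens_with_offsets)
    inside = False
    for i, (text, start) in enumerate(tokens_with_offsets):
        end = start + len(text)
        # A token belongs to the entity if it lies within the character span
        if start >= ent_start and end <= ent_end:
            tags[i] = ('I-' if inside else 'B-') + category
            inside = True
    return tags

# "the gel flows" with entity "gel" at character span (4, 7)
toks = [('the', 0), ('gel', 4), ('flows', 8)]
print(char_span_to_bio(toks, 4, 7, 'Material'))
# ['O', 'B-Material', 'O']
```

This removes both the `find()` ambiguity and the need to re-tokenize the entity text separately.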