
Natural Language Processing Academic Digest [12.14]


cs.CL, 59 papers in total today

Transformer (2 papers)

【1】 Dependency Learning for Legal Judgment Prediction with a Unified Text-to-Text Transformer. Link: https://arxiv.org/abs/2112.06370

Authors: Yunyun Huang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Bin Luo. Affiliation: Software Institute, Nanjing University. Note: The first two authors contributed equally.
Abstract: Given the fact of a case, Legal Judgment Prediction (LJP) involves a series of sub-tasks such as predicting violated law articles, charges and term of penalty. We propose leveraging a unified text-to-text Transformer for LJP, where the dependencies among sub-tasks can be naturally established within the auto-regressive decoder. Compared with previous works, it has three advantages: (1) it fits in the pretraining pattern of masked language models, and thereby can benefit from the semantic prompts of each sub-task rather than treating them as atomic labels, (2) it utilizes a single unified architecture, enabling full parameter sharing across all sub-tasks, and (3) it can incorporate both classification and generative sub-tasks. We show that this unified transformer, albeit pretrained on general-domain text, outperforms pretrained models tailored specifically for the legal domain. Through an extensive set of experiments, we find that the best order to capture dependencies is different from human intuitions, and the most reasonable logical order for humans can be sub-optimal for the model. We further include two more auxiliary tasks: court view generation and article content prediction, showing they can not only improve the prediction accuracy, but also provide interpretable explanations for model outputs even when an error is made. With the best configuration, our model outperforms both previous SOTA and a single-tasked version of the unified transformer by a large margin.
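
The sub-task dependencies here are expressed simply by the order in which the sub-task answers appear in the decoder's target string. Below is a minimal sketch of how such a target might be linearized for a T5-style text-to-text model; the field names, separators, and example order are illustrative assumptions, not the paper's exact format.

```python
# Illustrative linearization of LJP sub-tasks into one text-to-text target string.
# The field order fixes the dependency order seen by the auto-regressive decoder.

def build_target(articles, charges, penalty, order=("articles", "charges", "penalty")):
    fields = {
        "articles": "violated articles: " + ", ".join(articles),
        "charges": "charges: " + ", ".join(charges),
        "penalty": "term of penalty: " + penalty,
    }
    return " ; ".join(fields[name] for name in order)

fact = "The defendant broke into the victim's home at night and stole jewellery."
source = "predict judgment: " + fact
# The best ordering is found empirically; the human-intuitive order may be sub-optimal.
target = build_target(["Article 264"], ["theft"], "eight months",
                      order=("charges", "articles", "penalty"))
print(source)
print(target)
```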

【2】 Towards More Efficient Insertion Transformer with Fractional Positional Encoding. Link: https://arxiv.org/abs/2112.06295

Authors: Zhisong Zhang, Yizhe Zhang, Bill Dolan. Affiliations: Carnegie Mellon University, Meta AI, Microsoft Research.
Abstract: Auto-regressive neural sequence models have been shown to be effective across text generation tasks. However, their left-to-right decoding order prevents generation from being parallelized. Insertion Transformer (Stern et al., 2019) is an attractive alternative that allows outputting multiple tokens in a single generation step. Nevertheless, due to the incompatibility of absolute positional encoding and insertion-based generation schemes, it needs to refresh the encoding of every token in the generated partial hypotheses at each step, which could be costly. We design a novel incremental positional encoding scheme for insertion transformers called Fractional Positional Encoding (FPE), which allows reusing representations calculated in previous steps. Empirical studies on various language generation tasks demonstrate the effectiveness of FPE, which leads to reduction of floating point operations and latency improvements on batched decoding.
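
The underlying idea is that a token inserted between two existing tokens can take a position that interpolates its neighbours' positions, so encodings computed in earlier steps never have to be refreshed. The bookkeeping below is only one plausible way to realize such fractional positions and is not taken from the paper.

```python
from fractions import Fraction

def insert_token(tokens, positions, index, new_token):
    """Insert new_token at `index`, giving it the midpoint of its neighbours' positions.
    Existing positions (and any encodings keyed on them) are left untouched."""
    left = positions[index - 1] if index > 0 else Fraction(0)
    right = positions[index] if index < len(positions) else left + 1
    new_pos = (left + right) / 2
    tokens.insert(index, new_token)
    positions.insert(index, new_pos)
    return new_pos

tokens = ["the", "sat", "mat"]
positions = [Fraction(1), Fraction(2), Fraction(3)]
insert_token(tokens, positions, 1, "cat")   # "the cat sat mat"
insert_token(tokens, positions, 3, "on")    # "the cat sat on mat"
print(list(zip(tokens, positions)))
```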

BERT (1 paper)

【1】 Roof-BERT: Divide Understanding Labour and Join in Work. Link: https://arxiv.org/abs/2112.06736

Authors: Wei-Lin Liao, Wei-Yun Ma. Affiliation: Academia Sinica.
Abstract: Recent work on enhancing BERT-based language representation models with knowledge graphs (KGs) has promising results on multiple NLP tasks. State-of-the-art approaches typically integrate the original input sentences with triples in KGs, and feed the combined representation into a BERT model. However, as the sequence length of a BERT model is limited, the framework cannot contain much knowledge besides the original input sentences and is thus forced to discard some knowledge. The problem is especially severe for those downstream tasks whose input is a long paragraph or even a document, such as QA or reading comprehension tasks. To address the problem, we propose Roof-BERT, a model with two underlying BERTs and a fusion layer on them. One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences, and the fusion layer, like a roof, integrates both BERTs' encodings. Experiment results on the QA task reveal the effectiveness of the proposed model.
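
A rough PyTorch sketch of the two-encoders-plus-roof idea: one BERT encodes the input sentences, another encodes the knowledge text, and a single transformer layer acts as the roof over the concatenated token encodings. The fusion design, layer sizes, and classification head below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RoofFusion(nn.Module):
    """Two underlying encoders joined by a 'roof' fusion layer (sketch)."""
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained(model_name)
        self.know_encoder = BertModel.from_pretrained(model_name)
        hidden = self.text_encoder.config.hidden_size
        roof_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.roof = nn.TransformerEncoder(roof_layer, num_layers=1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, text_inputs, know_inputs):
        t = self.text_encoder(**text_inputs).last_hidden_state
        k = self.know_encoder(**know_inputs).last_hidden_state
        fused = self.roof(torch.cat([t, k], dim=1))  # roof attends across both encodings
        return self.classifier(fused[:, 0])          # classify from the first token
```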

QA | VQA | Question Answering | Dialogue (2 papers)

【1】 Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection. Link: https://arxiv.org/abs/2112.06888

Authors: Diego Garcia-Olano, Yasumasa Onoe, Joydeep Ghosh. Affiliation: University of Texas at Austin.
Abstract: Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image. Recent single modality text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an existing VQA system's performance on the KBVQA task. We experiment with two large publicly available VQA datasets, (1) KVQA, which contains mostly rare Wikipedia entities, and (2) OKVQA, which is less entity-centric and more aligned with common sense reasoning. Both lack explicit entity spans, and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally we analyze how recently proposed bi-modal and single modal attention explanations are affected by the incorporation of such entity enhanced representations. Our results show substantially improved performance on the KBVQA task without the need for additional costly pre-training, and we provide insights for when entity knowledge injection helps improve a model's understanding. We provide code and enhanced datasets for reproducibility.

【2】 Injecting Numerical Reasoning Skills into Knowledge Base Question Answering Models. Link: https://arxiv.org/abs/2112.06109

Authors: Yu Feng, Jing Zhang, Xiaokang Zhang, Lemao Liu, Cuiping Li, Hong Chen. Affiliations: Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education; School of Information, Renmin University of China; Tencent AI Lab.
Abstract: Embedding-based methods are popular for Knowledge Base Question Answering (KBQA), but few current models have numerical reasoning skills and thus struggle to answer ordinal constrained questions. This paper proposes a new embedding-based KBQA framework which particularly takes numerical reasoning into account. We present NumericalTransformer on top of NSM, a state-of-the-art embedding-based KBQA model, to create NT-NSM. To enable better training, we propose two pre-training tasks with explicit numerical-oriented loss functions on two generated training datasets and a template-based data augmentation method for enriching ordinal constrained QA datasets. Extensive experiments on KBQA benchmarks demonstrate that with the help of our training algorithm, NT-NSM is empowered with numerical reasoning skills and substantially outperforms the baselines in answering ordinal constrained questions.

Machine Translation (2 papers)

【1】 Communication-Efficient Federated Learning for Neural Machine Translation. Link: https://arxiv.org/abs/2112.06135

Authors: Tanya Roosta, Peyman Passban, Ankit Chadha. Affiliation: Amazon. Note: The first two authors contributed equally.
Abstract: Training neural machine translation (NMT) models in federated learning (FL) settings could be inefficient both computationally and communication-wise, due to the large size of translation engines as well as the multiple rounds of updates required to train clients and a central server. In this paper, we explore how to efficiently build NMT models in an FL setup by proposing a novel solution. In order to reduce the communication overhead, out of all neural layers we only exchange what we term "Controller" layers. Controllers are a small number of additional neural components connected to our pre-trained architectures. These new components are placed in between original layers. They act as liaisons to communicate with the central server and learn minimal information that is sufficient enough to update clients. We evaluated the performance of our models on five datasets from different domains to translate from German into English. We noted that the models equipped with Controllers perform on par with those trained in a central and non-FL setting. In addition, we observed a substantial reduction in the communication traffic of the FL pipeline, which is a direct consequence of using Controllers. Based on our experiments, Controller-based models are ~6 times less expensive than their other peers. This reduction is significantly important when we consider the number of parameters in large models, and it becomes even more critical when such parameters need to be exchanged for multiple rounds in FL settings.
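
Because only the Controller layers are exchanged, each client only needs to send the slice of its state dict that belongs to those layers. A minimal sketch of that filtering and of plain federated averaging over the Controller tensors is shown below; recognizing Controllers by a name prefix such as "controller" is an illustrative convention, not the paper's.

```python
import torch

def controller_state(model, prefix="controller"):
    """Extract only the Controller parameters for communication with the server."""
    return {name: p.detach().cpu() for name, p in model.state_dict().items()
            if prefix in name}

def federated_average(client_states):
    """Average the (small) Controller tensors received from each client."""
    keys = client_states[0].keys()
    return {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
            for k in keys}

def load_controller_state(model, averaged):
    """Merge the averaged Controller parameters back into a local model."""
    state = model.state_dict()
    state.update(averaged)
    model.load_state_dict(state)
```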

【2】 Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts. Link: https://arxiv.org/abs/2112.06096

Authors: Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Pieter Spronck. Affiliation: Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg, The Netherlands. Note: Accepted to the CLIN Journal on Dec 6, 2021.
Abstract: Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
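
A toy version of the ranking step: represent the monolingual in-domain set and each general-domain source sentence in a shared vector space, score every sentence by cosine similarity to the domain centroid, and keep the top-K parallel pairs. TF-IDF vectors are used here only to keep the example self-contained; the representation used in the paper may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_in_domain(parallel_src, parallel_tgt, mono_domain, k=2):
    """Rank parallel pairs by source-side cosine similarity to the in-domain centroid."""
    vec = TfidfVectorizer().fit(parallel_src + mono_domain)
    centroid = np.asarray(vec.transform(mono_domain).mean(axis=0))
    scores = cosine_similarity(vec.transform(parallel_src), centroid).ravel()
    top = np.argsort(-scores)[:k]
    return [(parallel_src[i], parallel_tgt[i], float(scores[i])) for i in top]

src = ["the patient received a dose of antibiotics",
       "the parliament voted on the new budget",
       "symptoms improved after two days of treatment"]
tgt = ["der Patient erhielt eine Dosis Antibiotika",
       "das Parlament stimmte über den neuen Haushalt ab",
       "die Symptome besserten sich nach zwei Tagen Behandlung"]
mono = ["clinical treatment of bacterial infections", "antibiotics dosage guidelines"]
print(select_in_domain(src, tgt, mono))
```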

Semantic Analysis (3 papers)

【1】 A cognitively driven weighted-entropy model for embedding semantic categories in hyperbolic geometry. Link: https://arxiv.org/abs/2112.06876

Authors: Eugene Yu Ji. Affiliation: Department of Psychology, The University of Chicago. Note: 13 pages.
Abstract: In this paper, an unsupervised and cognitively driven weighted-entropy method for embedding semantic categories in hyperbolic geometry is proposed. The model is driven by two fields of research in cognitive linguistics: the first is the statistical learning theory of language acquisition and the proposal of using high-dimensional networks to represent semantic knowledge in cognition, and the second is the domain-specific informativeness approach to semantic communication. Weighted conditional entropy of word co-occurrence is proposed as the embedding metric, and the two weighting parameters are collocation diversity and conditional probability ranking in the corresponding statistical distribution. The Boltzmann distribution is then used on the weighted-entropy metric and embedded into a hyperbolic Poincare disk model. Testing has been mainly performed in the domains of basic color and kinship words, which belong to the classes that domain-specificity focused research in cognitive semantics has most intensively investigated. Results show that this new approach can successfully model and map the semantic relationships of popularity and similarity for most of the basic color and kinship words in English and has potential to be generalized to other semantic domains and different languages. Generally, this paper contributes to both computational cognitive semantics and the research on network- and geometry-driven language embedding in computational linguistics and NLP.

【2】 Context vs Target Word: Quantifying Biases in Lexical Semantic Datasets. Link: https://arxiv.org/abs/2112.06733

Authors: Qianchu Liu, Diana McCarthy, Anna Korhonen. Affiliation: Language Technology Lab, TAL, University of Cambridge, UK.
Abstract: State-of-the-art contextualized models such as BERT use tasks such as WiC and WSD to evaluate their word-in-context representations. This inherently assumes that performance in these tasks reflects how well a model represents the coupled word and context semantics. This study investigates this assumption by presenting the first quantitative analysis (using probing baselines) of the context-word interaction being tested in major contextual lexical semantic tasks. Specifically, based on the probing baseline performance, we propose measures to calculate the degree of context or word biases in a dataset, and plot existing datasets on a continuum. The analysis shows most existing datasets fall into the extreme ends of the continuum (i.e. they are either heavily context-biased or target-word-biased), while only AM2iCo and Sense Retrieval challenge a model to represent both the context and target words. Our case study on WiC reveals that human subjects do not share models' strong context biases in the dataset (humans found semantic judgments much more difficult when the target word is missing) and models are learning spurious correlations from context alone. This study demonstrates that models are usually not being tested for word-in-context representations as such in these tasks, and results are therefore open to misinterpretation. We recommend our framework as a sanity check for context and target word biases in future task design and applications in lexical semantics.

【3】 Predicting Above-Sentence Discourse Structure using Distant Supervision from Topic Segmentation. Link: https://arxiv.org/abs/2112.06196

Authors: Patrick Huber, Linzi Xing, Giuseppe Carenini. Affiliation: Department of Computer Science, University of British Columbia, Vancouver, BC, Canada. Note: AAAI 2022.
Abstract: RST-style discourse parsing plays a vital role in many NLP tasks, revealing the underlying semantic/pragmatic structure of potentially complex and diverse documents. Despite its importance, one of the most prevailing limitations in modern day discourse parsing is the lack of large-scale datasets. To overcome the data sparsity issue, distantly supervised approaches from tasks like sentiment analysis and summarization have been recently proposed. Here, we extend this line of research by exploiting distant supervision from topic segmentation, which can arguably provide a strong and oftentimes complementary signal for high-level discourse structures. Experiments on two human-annotated discourse treebanks confirm that our proposal generates accurate tree structures on sentence and paragraph level, consistently outperforming previous distantly supervised models on the sentence-to-document task and occasionally reaching even higher scores on the sentence-to-paragraph level.

Graph | Knowledge Graph | Knowledge (5 papers)

【1】 Plurality and Quantification in Graph Representation of Meaning. Link: https://arxiv.org/abs/2112.06448

Authors: Yu Cao. Affiliation: Graduate Program in Linguistics, Rutgers, The State University of New Jersey (PhD dissertation written under the direction of Simon Charlow). Note: Author's PhD dissertation accepted to the School of Graduate Studies of Rutgers University.
Abstract: In this thesis we present a semantic representation formalism based on directed graphs and explore its linguistic adequacy and explanatory benefits in the semantics of plurality and quantification. Our graph language covers the essentials of natural language semantics using only monadic second-order variables. We define its model-theoretical interpretation in terms of graph traversal, where the relative scope of variables arises from their order of valuation. We present a unification-based mechanism for constructing semantic graphs at a simple syntax-semantics interface, where syntax as a partition function on discourse referents is implemented with categorial grammars by establishing a partly deterministic relation between semantics and syntactic distribution. This mechanism is automated to facilitate future exploration. The present graph formalism is applied to linguistic issues in distributive predication, cross-categorial conjunction, and scope permutation of quantificational expressions, including the exceptional scoping behaviors of indefinites.

【2】 Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification. Link: https://arxiv.org/abs/2112.06386

Authors: Yinhua Piao, Sangseon Lee, Dohoon Lee, Sun Kim. Affiliations: Department of Computer Science and Engineering, Seoul National University; Institute of Computer Technology, Seoul National University; Bioinformatics Institute, Seoul National University; AIGENDRUG Co., Ltd. Note: Accepted by AAAI 2022.
Abstract: Recently, graph neural networks (GNNs) have been widely used for document classification. However, most existing methods are based on static word co-occurrence graphs without sentence-level information, which poses three challenges: (1) word ambiguity, (2) word synonymity, and (3) dynamic contextual dependency. To address these challenges, we propose a novel GNN-based sparse structure learning model for inductive document classification. Specifically, a document-level graph is initially generated by a disjoint union of sentence-level word co-occurrence graphs. Our model collects a set of trainable edges connecting disjoint words between sentences and employs structure learning to sparsely select edges with dynamic contextual dependencies. Graphs with sparse structures can jointly exploit local and global contextual information in documents through GNNs. For inductive learning, the refined document graph is further fed into a general readout function for graph-level classification and optimization in an end-to-end manner. Extensive experiments on several real-world datasets demonstrate that the proposed model outperforms most state-of-the-art results, and reveal the necessity to learn sparse structures for each document.

【3】 Graph-based hierarchical record clustering for unsupervised entity resolution. Link: https://arxiv.org/abs/2112.06331

Authors: Islam Akef Ebeid, John R. Talburt, Md Abdus Salam Siddique. Affiliation: Department of Information Science, University of Arkansas at Little Rock, Little Rock, Arkansas, USA. Note: Accepted at the 19th International Conference on Information Technology: New Generations (ITNG 2022).
Abstract: Here we study the problem of matched record clustering in unsupervised entity resolution. We build upon a state-of-the-art probabilistic framework named the Data Washing Machine (DWM). We introduce a graph-based hierarchical 2-step record clustering method (GDWM) that first identifies large, connected components or, as we call them, soft clusters in the matched record pairs using a graph-based transitive closure algorithm utilized in the DWM. That is followed by breaking down the discovered soft clusters into more precise entity clusters in a hierarchical manner using an adapted graph-based modularity optimization method. Our approach provides several advantages over the original implementation of the DWM, mainly a significant speed-up, increased precision, and overall increased F1 scores. We demonstrate the efficacy of our approach using experiments on multiple synthetic datasets. Our results also provide evidence of the utility of graph theory-based algorithms despite their sparsity in the literature on unsupervised entity resolution.
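
The two-step structure is easy to picture with networkx: connected components of the matched-pair graph give the coarse soft clusters, and a modularity-based community detection pass then splits each component into finer entity clusters. The sketch below is a schematic re-creation under those assumptions (greedy modularity stands in for the paper's adapted optimization), not the authors' implementation.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def two_step_clusters(matched_pairs):
    """matched_pairs: iterable of (record_a, record_b, similarity) tuples."""
    g = nx.Graph()
    g.add_weighted_edges_from(matched_pairs)
    entity_clusters = []
    # Step 1: soft clusters = connected components (transitive closure of matches).
    for component in nx.connected_components(g):
        sub = g.subgraph(component)
        # Step 2: refine each soft cluster with modularity-based communities.
        if sub.number_of_nodes() > 2:
            entity_clusters.extend(set(c) for c in
                                   greedy_modularity_communities(sub, weight="weight"))
        else:
            entity_clusters.append(set(component))
    return entity_clusters

pairs = [("r1", "r2", 0.9), ("r2", "r3", 0.8), ("r3", "r4", 0.2),
         ("r4", "r5", 0.9), ("r6", "r7", 0.95)]
print(two_step_clusters(pairs))
```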

【4】 Efficient Document-level Event Extraction via Pseudo-Trigger-aware Pruned Complete Graph. Link: https://arxiv.org/abs/2112.06013

Authors: Tong Zhu, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, Nicholas Jing Yuan, Min Zhang. Affiliations: Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, China; Huawei Cloud, China.
Abstract: There are two main challenges in document-level event extraction: 1) argument entities are scattered in different sentences, and 2) event triggers are often not available. To address these challenges, most previous studies mainly focus on building argument chains in an autoregressive way, which is inefficient in both training and inference. In contrast to the previous studies, we propose a fast and lightweight model named PTPCG. We design a non-autoregressive decoding algorithm to perform event argument combination extraction on pruned complete graphs, which are constructed under the guidance of the automatically selected pseudo triggers. Compared to the previous systems, our system achieves competitive results with lower resource consumption, taking only 3.6% GPU time (pfs-days) for training and being up to 8.5 times faster for inference. Besides, our approach shows superior compatibility for the datasets with (or without) triggers, and the pseudo triggers can be supplements for annotated triggers to make further improvements.

【5】 TempoQR: Temporal Question Reasoning over Knowledge Graphs. Link: https://arxiv.org/abs/2112.05785

Authors: Costas Mavromatis, Prasanna Lakkur Subramanyam, Vassilis N. Ioannidis, Soji Adeshina, Phillip R. Howard, Tetiana Grinberg, Nagib Hakim, George Karypis. Affiliations: University of Minnesota, University of Massachusetts Amherst, Amazon Web Services, Intel Labs. Note: AAAI 2022.
Abstract: Knowledge Graph Question Answering (KGQA) involves retrieving facts from a Knowledge Graph (KG) using natural language queries. A KG is a curated set of facts consisting of entities linked by relations. Certain facts also include temporal information, forming a Temporal KG (TKG). Although many natural questions involve explicit or implicit time constraints, question answering (QA) over TKGs has been a relatively unexplored area. Existing solutions are mainly designed for simple temporal questions that can be answered directly by a single TKG fact. This paper puts forth a comprehensive embedding-based framework for answering complex questions over TKGs. Our method, termed temporal question reasoning (TempoQR), exploits TKG embeddings to ground the question to the specific entities and time scope it refers to. It does so by augmenting the question embeddings with context, entity and time-aware information by employing three specialized modules. The first computes a textual representation of a given question, the second combines it with the entity embeddings for entities involved in the question, and the third generates question-specific time embeddings. Finally, a transformer-based encoder learns to fuse the generated temporal information with the question representation, which is used for answer predictions. Extensive experiments show that TempoQR improves accuracy by 25-45 percentage points on complex temporal questions over state-of-the-art approaches and it generalizes better to unseen question types.
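
Schematically, the three modules produce a textual, an entity-aware, and a time-aware embedding of the question, and a transformer encoder fuses them into one representation that is scored against candidate answers. The PyTorch sketch below only mirrors that information flow; the dimensions, pooling, and dot-product scoring head are assumptions.

```python
import torch
import torch.nn as nn

class TempoFusion(nn.Module):
    """Fuse question, entity, and time embeddings with a transformer encoder (sketch)."""
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, q_emb, ent_emb, time_emb, candidate_emb):
        # Stack the three views as a length-3 "sequence" and let attention mix them.
        fused = self.encoder(torch.stack([q_emb, ent_emb, time_emb], dim=1))
        question_repr = fused.mean(dim=1)            # (batch, dim)
        return question_repr @ candidate_emb.t()     # scores over candidate answers

model = TempoFusion()
scores = model(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256),
               torch.randn(100, 256))                # 100 candidate entities/timestamps
print(scores.shape)                                  # torch.Size([2, 100])
```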

Reasoning | Analysis | Understanding | Explanation (4 papers)

【1】 Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding. Link: https://arxiv.org/abs/2112.06743

Authors: Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel. Affiliation: Alexa Speech, Amazon.
Abstract: Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into the state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a noncontextual baseline.
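
The carryover mechanism reduces to letting the current-turn representation attend over encodings of previous utterances and dialogue acts. A compact PyTorch illustration of that attention step is given below, with made-up dimensions; it is not the production architecture described in the paper.

```python
import torch
import torch.nn as nn

class ContextCarryover(nn.Module):
    """Current-turn states attend over encoded dialogue history (utterances + acts)."""
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, turn_states, history_states):
        # turn_states: (batch, turn_len, dim); history_states: (batch, hist_len, dim)
        context, _ = self.attn(turn_states, history_states, history_states)
        return self.out(torch.cat([turn_states, context], dim=-1))

layer = ContextCarryover()
enhanced = layer(torch.randn(2, 20, 256), torch.randn(2, 50, 256))
print(enhanced.shape)  # torch.Size([2, 20, 256])
```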

【2】 Understanding and Improving the Exemplar-based Generation for Open-domain Conversation. Link: https://arxiv.org/abs/2112.06723

Authors: Seungju Han, Beomsu Kim, Seokjun Seo, Enkhbayar Erdenee, Buru Chang. Affiliation: Hyperconnect.
Abstract: Exemplar-based generative models for open-domain conversation produce responses based on the exemplars provided by the retriever, taking advantage of generative models and retrieval models. However, they often ignore the retrieved exemplars while generating responses or produce responses over-fitted to the retrieved exemplars. In this paper, we argue that these drawbacks are derived from the one-to-many problem of open-domain conversation. When the retrieved exemplar is relevant to the given context yet significantly different from the gold response, the exemplar-based generative models are trained to ignore the exemplar, since the exemplar is not helpful for generating the gold response. On the other hand, when the retrieved exemplar is lexically similar to the gold response, the generative models are trained to rely on the exemplar highly. Therefore, we propose a training method selecting exemplars that are semantically relevant to the gold response but lexically distanced from the gold response to mitigate the above disadvantages. In the training phase, our proposed training method first uses the gold response instead of dialogue context as a query to select exemplars that are semantically relevant to the gold response. It then eliminates the exemplars that lexically resemble the gold responses to alleviate the dependency of the generative models on those exemplars. The remaining exemplars could be irrelevant to the given context since they are searched depending on the gold response. Thus, our proposed training method further utilizes the relevance scores between the given context and the exemplars to penalize the irrelevant exemplars. Extensive experiments demonstrate that our proposed training method alleviates the drawbacks of the existing exemplar-based generative models and significantly improves the performance in terms of appropriateness and informativeness.

【3】 Contextualized Scene Imagination for Generative Commonsense Reasoning. Link: https://arxiv.org/abs/2112.06318

Authors: PeiFeng Wang, Jonathan Zamora, Junfeng Liu, Filip Ilievski, Muhao Chen, Xiang Ren. Affiliations: Department of Computer Science, University of Southern California; Department of Computer Science, University of San Diego; Information Sciences Institute, University of Southern California.
Abstract: Humans use natural language to compose common concepts from their environment into plausible, day-to-day scene descriptions. However, such generative commonsense reasoning (GCSR) skills are lacking in state-of-the-art text generation methods. Descriptive sentences about arbitrary concepts generated by neural text generation models (e.g., pre-trained text-to-text Transformers) are often grammatically fluent but may not correspond to human common sense, largely due to their lack of mechanisms to capture concept relations, to identify implicit concepts, and to perform generalizable reasoning about unseen concept compositions. In this paper, we propose an Imagine-and-Verbalize (I&V) method, which learns to imagine a relational scene knowledge graph (SKG) with relations between the input concepts, and leverage the SKG as a constraint when generating a plausible scene description. We collect and harmonize a set of knowledge resources from different domains and modalities, providing a rich auxiliary supervision signal for I&V. The experiments demonstrate the effectiveness of I&V in improving language models on both concept-to-sentence and concept-to-story generation tasks, while enabling the model to learn well from fewer task examples and generate SKGs that make common sense to human annotators.

【4】 Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations. Link: https://arxiv.org/abs/2112.06204

Authors: Yordan Yordanov, Vid Kocijan, Thomas Lukasiewicz, Oana-Maria Camburu. Affiliations: Department of Computer Science, University of Oxford; Alan Turing Institute, London. Note: Accepted at the Deep Generative Models and Downstream Applications Workshop at NeurIPS 2021.
Abstract: Recently, there has been an increasing interest in models that generate natural language explanations (NLEs) for their decisions. However, training a model to provide NLEs requires the acquisition of task-specific NLEs, which is time- and resource-consuming. A potential solution is the out-of-domain transfer of NLEs from a domain with a large number of NLEs to a domain with scarce NLEs but potentially a large number of labels, via few-shot transfer learning. In this work, we introduce three vanilla approaches for few-shot transfer learning of NLEs for the case of few NLEs but abundant labels, along with an adaptation of an existing vanilla fine-tuning approach. We transfer explainability from the natural language inference domain, where a large dataset of human-written NLEs exists (e-SNLI), to the domains of (1) hard cases of pronoun resolution, where we introduce a small dataset of NLEs on top of the WinoGrande dataset (small-e-WinoGrande), and (2) commonsense validation (ComVE). Our results demonstrate that the transfer of NLEs outperforms the single-task methods, and establish the best strategies out of the four identified training regimes. We also investigate the scalability of the best methods, both in terms of training data and model size.

GAN | Adversarial | Attack | Generation (6 papers)

【1】 Keyphrase Generation Beyond the Boundaries of Title and Abstract. Link: https://arxiv.org/abs/2112.06776

Authors: Krishna Garg, Jishnu Ray Chowdhury, Cornelia Caragea. Affiliation: Computer Science Department, University of Illinois Chicago. Note: 8 pages, 1 figure, 6 tables.
Abstract: Keyphrase generation aims at generating phrases (keyphrases) that best describe a given document. In scholarly domains, current approaches to this task are neural approaches and have largely worked with only the title and abstract of the articles. In this work, we explore whether the integration of additional data from semantically similar articles or from the full text of the given article can be helpful for a neural keyphrase generation model. We discover that adding sentences from the full text, particularly in the form of a summary of the article, can significantly improve the generation of both types of keyphrases that are either present or absent from the title and abstract. The experimental results on three acclaimed models, along with one of the latest transformer models suitable for longer documents, Longformer Encoder-Decoder (LED), validate the observation. We also present a new large-scale scholarly dataset, FullTextKP, for keyphrase generation, which we use for our experiments. Unlike prior large-scale datasets, FullTextKP includes the full text of the articles alongside title and abstract. We will release the source code to stimulate research on the proposed ideas.
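
The input-augmentation idea can be sketched as: summarize the article body, then append the summary to the usual title-plus-abstract input of a seq2seq keyphrase generator. The summarization checkpoint and the separator token below are illustrative choices, not the paper's exact setup.

```python
from transformers import pipeline

# A generic pretrained summarizer stands in for whatever summarization step is used
# to compress the article body before keyphrase generation.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def build_keyphrase_input(title, abstract, body):
    body_summary = summarizer(body, max_length=96, min_length=24,
                              do_sample=False)[0]["summary_text"]
    return f"{title} <sep> {abstract} <sep> {body_summary}"

# The augmented string would then be fed to a seq2seq keyphrase generator
# (e.g., a fine-tuned BART or LED model) in place of title + abstract alone.
```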

【2】 Step-unrolled Denoising Autoencoders for Text Generation. Link: https://arxiv.org/abs/2112.06749

Authors: Nikolay Savinov, Junyoung Chung, Mikołaj Bińkowski, Erich Elsen, Aäron van den Oord. Affiliation: DeepMind, London, UK.
Abstract: In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.

【3】 Surfer100: Generating Surveys From Web Resources on Wikipedia-style. Link: https://arxiv.org/abs/2112.06377

Authors: Irene Li, Alexander Fabbri, Rina Kawamura, Yixin Liu, Xiangru Tang, Jaesung Tae, Chang Shen, Sally Ma, Tomoe Mizutani, Dragomir Radev. Affiliation: Department of Computer Science, Yale University.
Abstract: Fast-developing fields such as Artificial Intelligence (AI) often outpace the efforts of encyclopedic sources such as Wikipedia, which either do not completely cover recently-introduced topics or lack such content entirely. As a result, methods for automatically producing content are valuable tools to address this information overload. We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation. We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys. This is, to the best of our knowledge, the first study on utilizing web resources for long Wikipedia-style summaries.

【4】 Improving Code-switching Language Modeling with Artificially Generated Texts using Cycle-consistent Adversarial Networks. Link: https://arxiv.org/abs/2112.06327

Authors: Chia-Yu Li, Ngoc Thang Vu. Affiliation: Institute for Natural Language Processing (IMS), University of Stuttgart, Germany. Note: 4 pages, 1 figure, Interspeech 2020.
Abstract: This paper presents our latest effort on improving Code-switching language models that suffer from data scarcity. We investigate methods to augment Code-switching training text data by artificially generating them. Concretely, we propose a cycle-consistent adversarial networks based framework to transfer monolingual text into Code-switching text, considering Code-switching as a speaking style. Our experimental results on the SEAME corpus show that utilising artificially generated Code-switching text data improves consistently the language model as well as the automatic speech recognition performance.

【5】 Improving Logical-Level Natural Language Generation with Topic-Conditioned Data Augmentation and Logical Form Generation. Link: https://arxiv.org/abs/2112.06240

Authors: Ao Liu, Congjian Luo, Naoaki Okazaki. Affiliations: Tokyo Institute of Technology; University of Electronic Science and Technology of China.
Abstract: Logical Natural Language Generation, i.e., generating textual descriptions that can be logically entailed by a structured table, has been a challenge due to the low fidelity of the generation. Chen et al. (2020) have addressed this problem by annotating interim logical programs to control the generation contents and semantics, and presented the task of table-aware logical form to text (Logic2text) generation. However, although table instances are abundant in the real world, logical forms paired with textual descriptions require costly human annotation work, which limits the performance of neural models. To mitigate this, we propose topic-conditioned data augmentation (TopicDA), which utilizes GPT-2 to generate unpaired logical forms and textual descriptions directly from tables. We further introduce logical form generation (LG), a dual task of Logic2text that requires generating a valid logical form based on a text description of a table. We also propose a semi-supervised learning approach to jointly train a Logic2text and an LG model with both labeled and augmented data. The two models benefit from each other by providing extra supervision signals through back-translation. Experimental results on the Logic2text dataset and the LG task demonstrate that our approach can effectively utilize the augmented data and outperform supervised baselines by a substantial margin.

【6】 Show and Write: Entity-aware News Generation with Image Information. Link: https://arxiv.org/abs/2112.05917

Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer. Affiliation: Boston University.
Abstract: Automatically writing long articles is a complex and challenging language generation task. Prior work has primarily focused on generating these articles using a human-written prompt to provide some topical context and some metadata about the article. That said, for many applications, such as generating news stories, these articles are often paired with images and their captions or alt-text, which in turn are based on real-world events and may reference many different named entities that are difficult to be correctly recognized and predicted by language models. To address these two problems, this paper introduces an Entity-aware News Generation method with Image iNformation, Engin, to incorporate news image information into language models. Engin produces news articles conditioned on both metadata and information such as captions and named entities extracted from images. We also propose an Entity-aware mechanism to help our model better recognize and predict the entity names in news. We perform experiments on two public large-scale news datasets, GoodNews and VisualNews. Quantitative results show that our approach improves article perplexity by 4-5 points over the base models. Qualitative results demonstrate the text generated by Engin is more consistent with news images. We also perform an article quality annotation experiment on the generated articles to validate that our model produces higher-quality articles. Finally, we investigate the effect Engin has on methods that automatically detect machine-generated articles.

Semi-/Weakly-/Unsupervised | Uncertainty (2 papers)

【1】 Weakly Supervised Mapping of Natural Language to SQL through Question Decomposition. Link: https://arxiv.org/abs/2112.06311

Authors: Tomer Wolfson, Jonathan Berant, Daniel Deutch. Affiliations: Tel Aviv University; Allen Institute for AI. Note: Preprint.
Abstract: Natural Language Interfaces to Databases (NLIDBs), where users pose queries in Natural Language (NL), are crucial for enabling non-experts to gain insights from data. Developing such interfaces, by contrast, is dependent on experts who often code heuristics for mapping NL to SQL. Alternatively, NLIDBs based on machine learning models rely on supervised examples of NL to SQL mappings (NL-SQL pairs) used as training data. Such examples are again procured using experts, which typically involves more than a one-off interaction. Namely, each data domain in which the NLIDB is deployed may have different characteristics and therefore require either dedicated heuristics or domain-specific training examples. To this end, we propose an alternative approach for training machine learning-based NLIDBs, using weak supervision. We use the recently proposed question decomposition representation called QDMR, an intermediate between NL and formal query languages. Recent work has shown that non-experts are generally successful in translating NL to QDMR. We consequently use NL-QDMR pairs, along with the question answers, as supervision for automatically synthesizing SQL queries. The NL questions and synthesized SQL are then used to train NL-to-SQL models, which we test on five benchmark datasets. Extensive experiments show that our solution, requiring zero expert annotations, performs competitively with models trained on expert annotated data.

【2】 Sequence-level self-learning with multiple hypotheses. Link: https://arxiv.org/abs/2112.05826

Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng. Affiliation: Microsoft, WA, USA. Note: Published at Interspeech 2020.
Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance, especially in the case that multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework where the n-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find the common representation that can cover multiple hypotheses. By doing so, the effect of the hard-decision errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments in an accent adaptation task between US and British English speech. Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.
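
The multi-task idea can be illustrated as averaging the sequence loss over the n-best ASR hypotheses used as pseudo-labels, rather than committing to the single 1-best hypothesis. The toy sketch below assumes all hypotheses share the same length and uses equal task weights; the paper's actual MTL heads and weighting may differ.

```python
import torch
import torch.nn.functional as F

def multi_hypothesis_loss(log_probs, nbest_hypotheses, weights=None):
    """log_probs: (T, vocab) decoder output log-probabilities for one utterance.
    nbest_hypotheses: list of token-id sequences of length T (toy assumption)."""
    n = len(nbest_hypotheses)
    weights = weights or [1.0 / n] * n
    loss = 0.0
    for w, hyp in zip(weights, nbest_hypotheses):
        loss = loss + w * F.nll_loss(log_probs, torch.tensor(hyp))
    return loss

log_probs = torch.log_softmax(torch.randn(5, 100), dim=-1)   # 5 steps, vocab of 100
nbest = [[4, 17, 23, 9, 2], [4, 17, 30, 9, 2], [4, 12, 23, 9, 2]]
print(multi_hypothesis_loss(log_probs, nbest))
```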

Detection (3 papers)

【1】 Detecting Emotion Carriers by Combining Acoustic and Lexical Representations. Link: https://arxiv.org/abs/2112.06603

Authors: Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi. Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm, Germany; Signals and Interactive Systems Lab, University of Trento. Note: Accepted at ASRU 2021.
Abstract: Personal narratives (PN) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. However, in this work, we focus on Emotion Carriers (EC), defined as the segments (speech or text) that best explain the emotional state of the narrator ("loss of father", "made me choose"). Once extracted, such EC can provide a richer representation of the user state to improve natural language understanding and dialogue modeling. In previous work, it has been shown that EC can be identified using lexical features. However, spoken narratives should provide a richer description of the context and the users' emotional state. In this paper, we leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives. For the acoustic word-level representations, we use Residual Neural Networks (ResNet) pretrained on separate speech emotion corpora and fine-tuned to detect EC. Experiments with different fusion and system combination strategies show that late fusion leads to significant improvements for this task.
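
Late fusion in this token-tagging setting can be pictured as two independent streams, one over textual word embeddings and one over acoustic word-level features, whose per-token predictions are combined at the end. The sketch below averages the two streams' logits; the dimensions and the ResNet-based acoustic features are replaced by placeholders.

```python
import torch
import torch.nn as nn

class LateFusionTagger(nn.Module):
    """Token-level emotion-carrier tagger with late fusion of text and audio streams (sketch)."""
    def __init__(self, text_dim=300, audio_dim=128, hidden=128, num_tags=2):
        super().__init__()
        self.text_rnn = nn.LSTM(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.text_head = nn.Linear(2 * hidden, num_tags)
        self.audio_head = nn.Linear(2 * hidden, num_tags)

    def forward(self, text_feats, audio_feats):
        t_logits = self.text_head(self.text_rnn(text_feats)[0])
        a_logits = self.audio_head(self.audio_rnn(audio_feats)[0])
        return (t_logits + a_logits) / 2   # late fusion: combine the two streams' outputs

model = LateFusionTagger()
out = model(torch.randn(2, 40, 300), torch.randn(2, 40, 128))  # 40 aligned words
print(out.shape)  # torch.Size([2, 40, 2])
```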

【2】 Automated Evidence Collection for Fake News Detection. Link: https://arxiv.org/abs/2112.06507

Authors: Mrinal Rawat, Diptesh Kanojia. Affiliations: UpGrad Education Pvt. Ltd., Mumbai, India; Liverpool John Moores University, Liverpool, United Kingdom; Centre for Translation Studies, University of Surrey, Surrey, United Kingdom. Note: Accepted at ICON 2021.
Abstract: Fake news, misinformation, and unverifiable facts on social media platforms propagate disharmony and affect society, especially when dealing with an epidemic like COVID-19. The task of Fake News Detection aims to tackle the effects of such misinformation by classifying news items as fake or real. In this paper, we propose a novel approach that improves over the current automatic fake news detection approaches by automatically gathering evidence for each claim. Our approach extracts supporting evidence from the web articles and then selects appropriate text to be treated as evidence sets. We use a pre-trained summarizer on these evidence sets and then use the extracted summary as supporting evidence to aid the classification task. Our experiments, using both machine learning and deep learning-based methods, help perform an extensive evaluation of our approach. The results show that our approach outperforms the state-of-the-art methods in fake news detection, achieving an F1-score of 99.25 on the dataset provided for the CONSTRAINT-2021 Shared Task. We also release the augmented dataset, our code and models for any further research.
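
The evidence-gathering step can be approximated as: given retrieved web articles, pick the sentences most relevant to the claim as the evidence set (which the paper then feeds to a pretrained summarizer before classification). The TF-IDF relevance ranking below is a self-contained stand-in for that selection step, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def collect_evidence(claim, articles, top_k=3):
    """Select the sentences most relevant to the claim from retrieved web articles."""
    sentences = [s.strip() for a in articles for s in a.split(".") if s.strip()]
    vec = TfidfVectorizer().fit(sentences + [claim])
    sims = cosine_similarity(vec.transform([claim]), vec.transform(sentences)).ravel()
    ranked = sorted(zip(sims, sentences), reverse=True)[:top_k]
    return [s for _, s in ranked]

claim = "Drinking hot water cures the virus"
articles = ["Health agencies state that drinking water does not cure viral infections. "
            "Vaccination remains the recommended protection.",
            "A viral post claims hot beverages kill the virus. Experts have debunked this."]
print(collect_evidence(claim, articles))
```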

【3】 Topic Detection and Tracking with Time-Aware Document Embeddings. Link: https://arxiv.org/abs/2112.06166

Authors: Hang Jiang, Doug Beeferman, Weiquan Mao, Deb Roy. Affiliations: MIT, Stanford University.
Abstract: The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
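
One way to picture the fusion is to concatenate a document's text embedding with a periodic encoding of its timestamp, project the result, and train with a triplet loss that pulls same-event documents together. The encoding scheme and dimensions below are assumptions; the sketch only shows the overall shape of such a model.

```python
import torch
import torch.nn as nn

class TimeAwareEncoder(nn.Module):
    """Fuse a text embedding with a simple periodic encoding of publication time (sketch)."""
    def __init__(self, text_dim=768, time_dim=32, out_dim=256):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(time_dim // 2), requires_grad=False)
        self.proj = nn.Linear(text_dim + time_dim, out_dim)

    def time_encoding(self, timestamps):
        t = (timestamps / 86400.0).unsqueeze(-1) * self.freqs   # seconds -> days
        return torch.cat([torch.sin(t), torch.cos(t)], dim=-1)

    def forward(self, text_emb, timestamps):
        return self.proj(torch.cat([text_emb, self.time_encoding(timestamps)], dim=-1))

encoder = TimeAwareEncoder()
triplet = nn.TripletMarginLoss(margin=1.0)
anchor = encoder(torch.randn(8, 768), torch.rand(8) * 1e9)
positive = encoder(torch.randn(8, 768), torch.rand(8) * 1e9)   # same event, nearby time
negative = encoder(torch.randn(8, 768), torch.rand(8) * 1e9)   # different event
print(triplet(anchor, positive, negative).item())
```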

识别/分类(9篇)

【1】 Khmer Text Classification Using Word Embedding and Neural Networks 标题:基于词嵌入和神经网络的高棉文文本分类 链接:https://arxiv.org/abs/2112.06748

作者:Rina Buoy,Nguonly Taing,Sovisal Chenda 机构:Techo Startup Center (TSC) 摘要:文本分类是自然语言处理中标记开放文本的基本任务之一,对于情感分析等各种应用非常有用。在本文中,我们讨论了高棉文本的各种分类方法,从带有支持向量机分类器的经典TF-IDF算法到现代基于单词嵌入的神经网络分类器,包括线性层模型、递归神经网络和卷积神经网络。在3000万高棉词语料库上训练了一个高棉词嵌入模型,以构建用于训练三种不同神经网络分类器的词向量表示。我们在新闻文章数据集上评估了不同方法在多类别和多标签文本分类任务中的性能。结果表明,使用单词嵌入模型的神经网络分类器始终优于使用TF-IDF的传统分类器。与卷积网络和线性层网络相比,递归神经网络分类器提供了稍好的结果。 摘要:Text classification is one of the fundamental tasks in natural language processing to label an open-ended text and is useful for various applications such as sentiment analysis. In this paper, we discuss various classification approaches for Khmer text, ranging from a classical TF-IDF algorithm with support vector machine classifier to modern word embedding-based neural network classifiers including linear layer model, recurrent neural network and convolutional neural network. A Khmer word embedding model is trained on a 30-million-Khmer-word corpus to construct word vector representations that are used to train three different neural network classifiers. We evaluate the performance of different approaches on a news article dataset for both multi-class and multi-label text classification tasks. The result suggests that neural network classifiers using a word embedding model consistently outperform the traditional classifier using TF-IDF. The recurrent neural network classifier provides a slightly better result compared to the convolutional network and the linear layer network.
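作为参照,下面是摘要中提到的经典基线"TF-IDF + 线性SVM"在 scikit-learn 中的一个最小可运行示意(示例文本与标签为占位数据;高棉语实际使用时需先分词并以空格连接):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# 占位示例数据:实际应替换为分词后的高棉语文本与类别标签
train_docs = ["word1 word2 word3", "word4 word5", "word1 word5 word6", "word2 word4"]
train_labels = ["sports", "politics", "sports", "politics"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # 一元与二元词组的 TF-IDF 特征
    ("svm", LinearSVC()),                            # 线性支持向量机分类器
])
clf.fit(train_docs, train_labels)
print(clf.predict(["word1 word3"]))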

【2】 PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeing Unit Training for Robust Uyghur E2E Speech Recognition 标题:PM-MMUT:基于多建模单元训练增强音素掩码数据增广的稳健维吾尔语E2E语音识别 链接:https://arxiv.org/abs/2112.06721

作者:Guodong Ma,Pengfei Hu,Nurmemet Yolwas,Shen Huang,Hao Huang 机构:School of Information Science and Engineering, Xinjiang University, Urumqi, China, Tencent Minority-Mandarin Translation, Beijing, China, Xinjiang Provincial Key Laboratory of Multi-lingual Information Technology, Urumqi, China 备注:Subbmitted to ICASSP 2022 摘要:维吾尔语语音中经常出现辅音和元音弱化,这可能导致维吾尔语自动语音识别(ASR)的性能下降。我们最近提出的基于掩蔽的学习策略,即音素掩蔽训练(Phone Masking Training, PMT),缓解了这种现象对维吾尔语ASR的影响。虽然PMT取得了显著的改进,但由于PMT的掩蔽单元(音素)与建模单元(词块)之间的粒度不匹配,仍然存在进一步的改进空间。为了提高PMT的性能,我们提出了将多建模单元训练(MMUT)架构与PMT融合的方法(PM-MMUT)。MMUT框架的思想是将编码器分成两部分,包括声学特征序列到音素级表示(AF-to-PLR)和音素级表示到词片级表示(PLR-to-WPLR)。它允许AF-to-PLR通过基于中间音素的CTC损失进行优化,以学习PMT带来的丰富音素级上下文信息。在维吾尔语ASR上的实验结果表明,所提出的方法带来显著改善,优于纯PMT(Read-Test上WER从24.0降至23.7,Oral-Test上从38.4降至36.8)。我们还使用ESPnet1在960小时的Librispeech基准上进行了实验,与最新的官方ESPnet1预训练模型相比,在没有LM融合的情况下,所有测试集上的相对WER降低了约10%。 摘要:Consonant and vowel reduction are often encountered in Uyghur speech, which might cause performance degradation in Uyghur automatic speech recognition (ASR). Our recently proposed learning strategy based on masking, Phone Masking Training (PMT), alleviates the impact of such phenomenon in Uyghur ASR. Although PMT achieves remarkably improvements, there still exists room for further gains due to the granularity mismatch between masking unit of PMT (phoneme) and modeling unit (word-piece). To boost the performance of PMT, we propose multi-modeling unit training (MMUT) architecture fusion with PMT (PM-MMUT). The idea of MMUT framework is to split the Encoder into two parts including acoustic feature sequences to phoneme-level representation (AF-to-PLR) and phoneme-level representation to word-piece-level representation (PLR-to-WPLR). It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss to learn the rich phoneme-level context information brought by PMT. Experimental results on Uyghur ASR show that the proposed approaches improve significantly, outperforming the pure PMT (reduction WER from 24.0 to 23.7 on Read-Test and from 38.4 to 36.8 on Oral-Test respectively). We also conduct experiments on the 960-hour Librispeech benchmark using ESPnet1, which achieves about 10% relative WER reduction on all the test sets without LM fusion comparing with the latest official ESPnet1 pre-trained model.

【3】 ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition 标题:ITA:面向多模态命名实体识别的图文对齐 链接:https://arxiv.org/abs/2112.06482

作者:Xinyu Wang,Min Gui,Yong Jiang,Zixia Jia,Nguyen Bach,Tao Wang,Zhongqiang Huang,Fei Huang,Kewei Tu 机构:⋄School of Information Science and Technology, ShanghaiTech University, Shanghai Engineering Research Center of Intelligent Vision and Imaging, †DAMO Academy, Alibaba Group 备注:10 pages 摘要:近年来,多模态命名实体识别(MNER)引起了人们的广泛关注。大部分工作通过从预训练对象检测器获得的区域级视觉表示来利用图像信息,并依赖注意机制来建模图像和文本表示之间的交互。然而,很难对此类交互进行建模,因为图像和文本表示分别基于各自模态的数据进行训练,并且不在同一空间中对齐。由于文本表示在MNER中扮演着最重要的角色,本文提出了图文对齐(Image-text Alignments, ITA),将图像特征对齐到文本空间,以便更好地利用基于变换器的预训练文本嵌入中的注意机制。ITA首先将区域对象标记和图像级标题作为视觉上下文进行局部和全局对齐,将它们与输入文本连接起来作为新的跨模态输入,然后将其输入到预先训练的文本嵌入模型中。这使得预先训练的文本嵌入模型的注意模块更容易模拟两种模式之间的交互,因为它们都在文本空间中表示。ITA进一步调整从跨模态输入和文本输入视图预测的输出分布,从而使MNER模型更实用,对图像噪声更鲁棒。在我们的实验中,我们表明ITA模型可以在多模态命名实体识别数据集上实现最先进的精度,即使没有图像信息。 摘要:Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality and are not aligned in the same space. As text representations take the most important role in MNER, in this paper, we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first locally and globally aligns regional object tags and image-level captions as visual contexts, concatenates them with the input texts as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views so that the MNER model can be more practical and robust to noises from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
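ITA 的第一步是把检测到的物体标签与图像描述作为视觉上下文拼接到文本输入之后,再整体送入预训练文本模型。下面是这一步的一个玩具示意(分隔符与字段顺序为此处假设,并非论文的确切格式):

def build_cross_modal_input(tokens, object_tags, caption, sep="[SEP]"):
    """tokens: 原始文本词序列; object_tags: 检测到的物体标签; caption: 图像级描述。"""
    visual_context = " ".join(object_tags) + f" {sep} " + caption
    return " ".join(tokens) + f" {sep} " + visual_context

print(build_cross_modal_input(
    ["Kobe", "Bryant", "at", "the", "stadium"],
    ["person", "basketball", "jersey"],
    "a basketball player standing on a court",
))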

【4】 A Survey of Toxic Comment Classification Methods 标题:有毒评论分类方法综述 链接:https://arxiv.org/abs/2112.06412

作者:Kehan Wang,Jiaxi Yang,Hongjun Wu 备注:5 pages, 3 figures, 2 tables, for Cornell Tech Applied Machine Learning 摘要:虽然在现实生活中,每个人至少在某种程度上都会表现出自己的行为,但期望人们在互联网上表现出自己的行为要困难得多,因为向他人发布有毒内容几乎没有检查或后果。然而,对于另一方的人来说,有毒文本往往会导致严重的心理后果。检测此类有毒文本具有挑战性。在本文中,我们尝试使用机器学习方法(包括CNN、朴素贝叶斯模型以及LSTM)构建毒性检测器。虽然其他人已经打下了许多基础,但我们的目标是建立比以前更精确的模型。我们使用LSTM和CNN生成了非常高精度的模型,并将它们与语言处理中的go-to解决方案NaiveBayes模型进行了比较。我们还采用了单词嵌入方法来增强模型的准确性。 摘要:While in real life everyone behaves themselves at least to some extent, it is much more difficult to expect people to behave themselves on the internet, because there are few checks or consequences for posting something toxic to others. Yet, for people on the other side, toxic texts often lead to serious psychological consequences. Detecting such toxic texts is challenging. In this paper, we attempt to build a toxicity detector using machine learning methods including CNN, Naive Bayes model, as well as LSTM. While there has been numerous groundwork laid by others, we aim to build models that provide higher accuracy than the predecessors. We produced very high accuracy models using LSTM and CNN, and compared them to the go-to solutions in language processing, the Naive Bayes model. A word embedding approach is also applied to empower the accuracy of our models.
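下面给出摘要中 LSTM 分类器思路的一个最小 PyTorch 示意(词表大小、隐藏维度等超参数均为假设值,仅演示结构):

import torch
import torch.nn as nn

class LSTMToxicClassifier(nn.Module):
    """词嵌入 + 双向 LSTM + 线性层的二分类器(示意)。"""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, 1)

    def forward(self, token_ids):
        x = self.emb(token_ids)                 # (B, T, E)
        _, (h, _) = self.lstm(x)                # h: (2, B, H),前向与后向末状态
        h = torch.cat([h[0], h[1]], dim=-1)     # 拼接两个方向的末状态
        return self.fc(h).squeeze(-1)           # 输出 logit

model = LSTMToxicClassifier()
loss_fn = nn.BCEWithLogitsLoss()
logits = model(torch.randint(1, 30000, (4, 50)))   # 4 条长度为 50 的样本
print(loss_fn(logits, torch.tensor([1., 0., 0., 1.])))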

【5】 Reading Task Classification Using EEG and Eye-Tracking Data 标题:基于脑电和眼动数据的阅读任务分类 链接:https://arxiv.org/abs/2112.06310

作者:Nora Hollenstein,Marius Tröndle,Martyna Plomecka,Samuel Kiegeland,Yilmazcan Özyurt,Lena A. Jäger,Nicolas Langer 机构:Center for Language Technology, University of Copenhagen, Department of Psychology, University of Zurich, Department of Computer Science, ETH Zurich, Department of Computational Linguistics, University of Zurich 摘要:苏黎世认知语言处理语料库(ZuCo)提供来自两种阅读范式的眼睛跟踪和EEG信号,即正常阅读和任务特定阅读。我们分析了机器学习方法是否能够利用眼睛跟踪和脑电特征对这两个任务进行分类。我们实现了具有聚合句子级特征和细粒度单词级特征的模型。我们在主题内和跨主题评估场景中测试模型。所有模型均在ZuCo 1.0和ZuCo 2.0数据子集上进行测试,这些数据子集的特点是记录程序不同,因此具有不同的通用性。最后,我们提供了一系列的控制实验来更详细地分析结果。 摘要:The Zurich Cognitive Language Processing Corpus (ZuCo) provides eye-tracking and EEG signals from two reading paradigms, normal reading and task-specific reading. We analyze whether machine learning methods are able to classify these two tasks using eye-tracking and EEG features. We implement models with aggregated sentence-level features as well as fine-grained word-level features. We test the models in within-subject and cross-subject evaluation scenarios. All models are tested on the ZuCo 1.0 and ZuCo 2.0 data subsets, which are characterized by differing recording procedures and thus allow for different levels of generalizability. Finally, we provide a series of control experiments to analyze the results in more detail.

【6】 Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN 标题:基于多鉴别器CycleGAN的语音增强改善含噪语音识别 链接:https://arxiv.org/abs/2112.06309

作者:Chia-Yu Li,Ngoc Thang Vu 机构:Institute of Natural Language Processing, University of Stuttgart, Germany 备注:6 pages, 9 figures, ASRU 2021 摘要:本文介绍了我们通过语音增强改进含噪语音自动识别的最新研究。我们提出了一种名为多鉴别器CycleGAN的新方法,用于降低输入语音的噪声,从而提高自动语音识别的性能。我们提出的方法利用CycleGAN框架进行语音增强,无需任何并行数据,并通过引入检查不同频率区域的多个鉴别器对其进行改进。此外,我们还证明了在训练数据的同质子集上训练多个生成器比在所有训练数据上训练一个生成器要好。我们在CHiME-3数据集上评估了我们的方法,观察到开发集上相对WER改善高达10.03%,评估集上高达14.09%。 摘要:This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement. We propose a novel method named Multi-discriminators CycleGAN to reduce noise of input speech and therefore improve the automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without any parallel data and improve it by introducing multiple discriminators that check different frequency areas. Furthermore, we show that training multiple generators on homogeneous subset of the training data is better than training one generator on all the training data. We evaluate our method on CHiME-3 data set and observe up to 10.03% relatively WER improvement on the development set and up to 14.09% on the evaluation set.

【7】 Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition 标题:利用稀疏门控混合专家构建优秀的多语种语音识别教师模型 链接:https://arxiv.org/abs/2112.05820

作者:Kenichi Kumatani,Robert Gmyr,Felipe Cruz Salinas,Linquan Liu,Wei Zuo,Devang Patel,Eric Sun,Yu Shi 机构:Microsoft 摘要:稀疏门控混合专家(MoE)可以在只增加少量计算复杂度的情况下扩大网络容量。在这项工作中,我们研究了如何通过简单的路由算法来扩展多语言自动语音识别(ASR)网络,以获得更好的准确度。更具体地说,我们将稀疏门控MoE技术应用于两种类型的网络:序列到序列Transformer(S2S-T)和Transformer Transducer(T-T)。我们通过在多语言数据上的一组ASR实验证明,使用S2S-T和T-T,MoE网络可以分别将相对单词错误率降低16.5%和4.7%。此外,我们还深入研究了MoE在不同条件下对T-T体系结构的影响:流模式、非流模式、使用语言ID以及使用MoE的标签解码器。 摘要:The sparsely-gated Mixture of Experts (MoE) can magnify a network capacity with a little computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that the MoE networks can reduce the relative word error rates by 16.5% and 4.7% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID and the label decoder with the MoE.
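稀疏门控 MoE 的核心是一个门控网络为每个 token 选出得分最高的 k 个专家,只计算这几个专家的前馈输出并加权求和。下面是一个与具体 ASR 框架无关的朴素 PyTorch 示意(未包含负载均衡损失等工程细节,维度均为假设值):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """稀疏门控 MoE 前馈层:每个 token 只路由到得分最高的 k 个专家(示意)。"""
    def __init__(self, d_model=256, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (N, d_model),N 为 token 数
        scores = self.gate(x)                    # (N, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)        # 只在被选中的 k 个专家上归一化
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e        # 该槽位被路由到专家 e 的 token
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(10, 256)).shape)   # torch.Size([10, 256])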

【8】 Computer-Assisted Creation of Boolean Search Rules for Text Classification in the Legal Domain 标题:法律领域文本分类布尔搜索规则的计算机辅助生成 链接:https://arxiv.org/abs/2112.05807

作者:Hannes Westermann,Jaromir Savelka,Vern R. Walker,Kevin D. Ashley,Karim Benyekhlef 机构:Cyberjustice Laboratory, Facult´e de droit, Universit´e de Montr´eal, ISP, School of Computing and Information, University of Pittsburgh, LLT Lab, Maurice A. Deane School of Law, Hofstra University 备注:None 摘要:在本文中,我们提出了一种以布尔搜索规则的形式构建强大的、可解释的分类器的方法。我们开发了一个称为CASE(计算机辅助语义探索)的交互式环境,该环境利用单词共现来指导注释者选择相关的搜索词。该系统无缝地促进了分类规则的迭代评估和改进。该过程使人类注释者能够利用统计信息的优势,同时将他们的专家直觉融入到此类规则的创建中。我们在4个数据集上评估了使用我们的案例系统创建的分类器,并将结果与机器学习方法进行比较,包括SKOPE规则、随机森林、支持向量机和fastText分类器。这些结果推动了关于布尔搜索规则优越的紧凑性、简单性和直观性与用于文本分类的最先进的机器学习模型的更好性能之间的权衡的讨论。 摘要:In this paper, we present a method of building strong, explainable classifiers in the form of Boolean search rules. We developed an interactive environment called CASE (Computer Assisted Semantic Exploration) which exploits word co-occurrence to guide human annotators in selection of relevant search terms. The system seamlessly facilitates iterative evaluation and improvement of the classification rules. The process enables the human annotators to leverage the benefits of statistical information while incorporating their expert intuition into the creation of such rules. We evaluate classifiers created with our CASE system on 4 datasets, and compare the results to machine learning methods, including SKOPE rules, Random forest, Support Vector Machine, and fastText classifiers. The results drive the discussion on trade-offs between superior compactness, simplicity, and intuitiveness of the Boolean search rules versus the better performance of state-of-the-art machine learning models for text classification.
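布尔检索规则分类器本身可以非常简单:若文本命中任意一个"AND 子句"(子句内所有词都出现)即判为正类。下面是一个玩具示意(规则词项为虚构示例,并非 CASE 系统的实际输出):

import re

# 一条布尔规则:(fracture AND fall) OR (broken AND arm)
RULE = [["fracture", "fall"], ["broken", "arm"]]

def matches(rule, text):
    """任一子句内的所有词都出现在文本中则返回 True。"""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(all(term in words for term in clause) for clause in rule)

docs = [
    "The patient suffered a fall resulting in a hip fracture.",
    "Routine checkup, no complaints.",
]
print([matches(RULE, d) for d in docs])   # [True, False]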

【9】 Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech 标题:用于长格式会话语音自动识别的有向语音分离 链接:https://arxiv.org/abs/2112.05863

作者:Rohit Paturi,Sundararajan Srinivasan,Katrin Kirchhoff 机构:Amazon AWS AI 摘要:语音分离的许多最新进展主要是针对具有高度重叠的短音频话语的合成混合。这些数据集与真实会话数据有显著差异,因此,在这些数据集上训练和评估的模型不能推广到真实会话场景。对长格式语音使用这些模型的另一个问题是,由于时频掩码的无监督聚类或排列不变训练(PIT)损失,分离语音段的顺序不确定。这导致难以为自动语音识别(ASR)等下游任务准确拼接同质说话人片段。在本文中,我们提出了一种基于直接从混合信号中提取的说话人嵌入的说话人条件分离器。我们使用定向损失来训练该模型,该定向损失调节分离段的顺序。使用该模型,我们在不需要额外的重新拼接步骤的情况下,显著提高了真实会话数据的字错误率(WER)。 摘要:Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. These datasets significantly differ from the real conversational data and hence, the models trained and evaluated on these datasets do not generalize to real conversational scenarios. Another issue with using most of these models for long form speech is the nondeterministic ordering of separated speech segments due to either unsupervised clustering for time-frequency masks or Permutation Invariant training (PIT) loss. This leads to difficulty in accurately stitching homogenous speaker segments for downstream tasks like Automatic Speech Recognition (ASR). In this paper, we propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model using a directed loss which regulates the order of the separated segments. With this model, we achieve significant improvements on Word error rate (WER) for real conversational data without the need for an additional re-stitching step.

Zero/Few/One-Shot|迁移|自适应(1篇)

【1】 VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks 标题:VL-Adapter:视觉和语言任务的参数高效迁移学习 链接:https://arxiv.org/abs/2112.06825

作者:Yi-Lin Sung,Jaemin Cho,Mohit Bansal 机构:UNC Chapel Hill 备注:13 pages 摘要:最近,微调在大型文本语料库上预训练的语言模型,已经在视觉和语言(V&L)任务以及纯语言任务上带来了巨大的改进。然而,由于模型尺寸正在快速增长,因此微调预训练模型的整个参数集变得不切实际。因此,在本文中,我们将基于适配器的参数高效迁移学习技术引入V&L模型,如VL-BART和VL-T5。我们在一个统一的多任务设置中对四种不同的V&L任务(VQAv2、GQA、NLVR2和MSCOCO图像字幕)评估我们的方法。通过仔细的训练和彻底的实验,我们将三种流行的基于适配器的方法(Adapter、Hyperformer、Compacter)与标准的完全微调和最近提出的提示调优(prompt-tuning)方法进行了对比。我们还通过共享适配器的权重来跨任务获取知识,从而提高适配器的效率和性能。我们的结果表明,使用权重共享技术(占总参数的4.4%)训练适配器可以匹配微调整个模型的性能。最后,我们提出了一个全面的分析,包括适配器和任务特定提示的组合以及V&L预训练对适配器的影响。我们的代码可从以下网址获得:https://github.com/ylsung/VL_adapter. 摘要:Recently, fine-tuning language models pre-trained on large text corpora have provided huge improvements on vision-and-language (V&L) tasks as well as on pure language tasks. However, fine-tuning the entire parameter set of pre-trained models becomes impractical since the model size is growing rapidly. Hence, in this paper, we introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our methods in a unified multi-task setup on four diverse V&L tasks: VQAv2, GQA, NLVR2, and MSCOCO image captioning. With careful training and thorough experiments, we benchmark three popular adapter-based methods (Adapter, Hyperformer, Compacter) against the standard full fine-tuning and the recently proposed prompt-tuning approach. We also enhance the efficiency and performance of adapters by sharing their weights to attain knowledge across tasks. Our results demonstrate that training the adapter with the weight-sharing technique (4.4% of total parameters) can match the performance of fine-tuning the entire model. Lastly, we present a comprehensive analysis including the combination of adapter and task-specific prompts and the impact of V&L pre-training on adapters. Our code is available at: https://github.com/ylsung/VL_adapter.
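Adapter 的基本单元是插入在 Transformer 层内的瓶颈结构:下投影、非线性、上投影再加残差,训练时只更新这些少量参数。以下是一个与具体 V&L 模型无关的最小示意(维度为假设值):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """瓶颈式 Adapter:下投影 -> 非线性 -> 上投影 -> 残差(示意)。"""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

# 用法示意:冻结预训练主干,仅 Adapter 参数参与训练
backbone_out = torch.randn(2, 16, 768)      # (batch, seq_len, d_model)
adapter = Adapter()
print(adapter(backbone_out).shape)           # torch.Size([2, 16, 768])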

语料库(1篇)

【1】 Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus 标题:基于SautiDB-Naija语料库的尼日利亚口音嵌入学习初步结果 链接:https://arxiv.org/abs/2112.06199

作者:Tejumade Afonja,Oladimeji Mudele,Iroro Orife,Kenechi Dukor,Lawrence Francis,Duru Goodness,Oluwafemi Azeez,Ademola Malomo,Clinton Mbataku 机构: Niger-Volta Language Technologies Institute 摘要:本文描述了SautiDB Naija的基础性工作,这是一个新的非母语(L2)尼日利亚英语语音语料库。我们描述了语料库是如何创建和管理的,以及口音分类和学习尼日利亚口音嵌入的初步实验。语料库的初始版本包括来自尼日利亚语言的第二语言英语使用者的900多条录音,如约鲁巴语、伊博语、江户语、埃菲克·伊比比奥语和伊加拉语。我们进一步演示了如何在预先训练好的模型(如wav2vec)上进行微调,以产生适合相关语音任务(如口音分类)的表示。SautiDB Naija已发布给Zenodo,以供灵活的知识共享许可证下的一般使用。 摘要:This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.

表征(1篇)

【1】 Representation Learning for Conversational Data using Discourse Mutual Information Maximization 标题:基于话语互信息最大化的会话数据表征学习 链接:https://arxiv.org/abs/2112.05787

作者:Bishal Santra,Sumegh Roychowdhury,Aishik Mandal,Vasu Gurram,Atharva Naik,Manish Gupta,Pawan Goyal 机构: Computer Science and Engineering Dept., Indian Institute of Technology Kharagpur, India, Microsoft, India 备注:Preprint, 15 pages 摘要:尽管存在许多针对文本或图像的预训练模型,但专门针对对话理解训练表示的尝试相对较少。以前的工作通常依赖于基于通用文本表示模型(如BERT或GPT-2)的精细表示。但是,现有的预训练目标没有考虑到文本的结构信息。虽然生成性对话模型也可以学习结构特征,但我们认为不知道结构的逐字生成不适合有效的对话建模。我们的经验表明,这种表述在不同的对话理解任务中并不一致。因此,我们提出了一种基于结构感知互信息的损失函数DMI(话语互信息),用于训练对话表示模型,该函数还捕获了响应预测中固有的不确定性。对九个不同对话建模任务的广泛评估表明,我们提出的基于DMI的模型在很大程度上优于强基线,即使是小规模的预训练。我们的模型在对话评估任务DailyDialog++上显示了最有希望的性能,无论是在随机的还是敌对的负面场景中。 摘要:Although many pretrained models exist for text or images, there have been relatively fewer attempts to train representations specifically for dialog understanding. Prior works usually relied on finetuned representations based on generic text representation models like BERT or GPT-2. But, existing pretraining objectives do not take the structural information of text into consideration. Although generative dialog models can learn structural features too, we argue that the structure-unaware word-by-word generation is not suitable for effective conversation modeling. We empirically demonstrate that such representations do not perform consistently across various dialog understanding tasks. Hence, we propose a structure-aware Mutual Information based loss-function DMI (Discourse Mutual Information) for training dialog-representation models, that additionally captures the inherent uncertainty in response prediction. Extensive evaluation on nine diverse dialog modeling tasks shows that our proposed DMI-based models outperform strong baselines by significant margins, even with small-scale pretraining. Our models show the most promising performance on the dialog evaluation task DailyDialog++, in both random and adversarial negative scenarios.

Word2Vec|文本|单词(1篇)

【1】 ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts 标题:ANEA:德语领域特定文本的自动(命名)实体标注 链接:https://arxiv.org/abs/2112.06724

作者:Anastasia Zhukova,Felix Hamborg,Bela Gipp 机构:University of Wuppertal, Germany, University of Konstanz 备注:None 摘要:命名实体识别(NER)是一项重要任务,旨在解决命名实体的通用类别,例如人员、地点、组织和时间。尽管NER在许多用例中都有常见且可行的用途,但它几乎不适用于一般类别不理想的领域,如工程或医学。为了促进特定领域类型的NER,我们提出了一种自动(命名)实体注释器ANEA,当给定一组特定领域文本时,它可以帮助人类注释器为德语文本集合创建特定领域的NER语料库。在我们的评估中,我们发现ANEA自动识别最能代表文本内容的术语,识别连贯术语组,并为这些组提取和分配描述性标签,即将文本数据集注释到域(命名)实体中。 摘要:Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its common and viable use in many use cases, NER is barely applicable in domains where general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator to assist human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., annotates text datasets into the domain (named) entities.

其他神经网络|深度学习|模型|建模(4篇)

【1】 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts 标题:GLAM:使用混合专家高效扩展语言模型 链接:https://arxiv.org/abs/2112.06905

作者:Nan Du,Yanping Huang,Andrew M. Dai,Simon Tong,Dmitry Lepikhin,Yuanzhong Xu,Maxim Krikun,Yanqi Zhou,Adams Wei Yu,Orhan Firat,Barret Zoph,Liam Fedus,Maarten Bosma,Zongwei Zhou,Tao Wang,Yu Emma Wang,Kellie Webster,Marie Pellat,Kevin Robinson,Kathy Meier-Hellstern,Toju Duke,Lucas Dixon,Kun Zhang,Quoc V Le,Yonghui Wu,Zhifeng Chen,Claire Cui 机构:Google Inc 摘要:用更多的数据、计算和参数扩展语言模型,推动了自然语言处理的重大进展。例如,得益于规模化,GPT-3能够在上下文学习(in-context learning)任务上取得显著成绩。然而,训练这些大型密集模型需要大量的计算资源。在本文中,我们提出并开发了一系列名为GLaM(通才语言模型)的语言模型,该模型使用稀疏激活的混合专家体系结构来扩展模型容量,同时与密集变体相比,产生的训练成本也大大降低。最大的GLaM有1.2万亿个参数,大约比GPT-3大7倍。它只消耗训练GPT-3所用能量的1/3,推理只需一半的计算量(FLOPs),同时在29个NLP任务上仍能获得更好的整体零样本(zero-shot)和单样本(one-shot)性能。 摘要:Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.

【2】 Sparse Interventions in Language Models with Differentiable Masking 标题:具有可微掩蔽的语言模型中的稀疏干预 链接:https://arxiv.org/abs/2112.06837

作者:Nicola De Cao,Leon Schmid,Dieuwke Hupkes,Ivan Titov 机构:University of Amsterdam,University of Edinburgh, University of Osnabrück,Facebook AI Research,Innopolis University 备注:12 pages, 4 figures, 6 tables 摘要:人们对理解语言模型(LMs)的隐藏表示捕获了哪些信息非常感兴趣。通常,解释方法i)不能保证模型实际使用编码信息,ii)不能发现导致所考虑的现象的神经元的小子集。受因果中介分析的启发,我们提出了一种方法,在神经LM中发现负责特定语言现象的一小部分神经元,即引起相应标记发射概率变化的子集。我们使用可微松弛近似搜索组合空间。$L_0$正则化项确保搜索收敛到离散和稀疏解。我们应用我们的方法来分析LSTMs中的主谓数一致性和性别偏见检测。我们观察到,它比替代方案(REINFORCE)更快,并能找到更好的解。我们的实验证实,这些现象中的每一个都是通过一小部分神经元介导的,这些神经元不起任何其他可识别的作用。 摘要:There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An $L_0$ regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomenons is mediated through a small subset of neurons that do not play any other discernible role.

【3】 WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models 标题:Wechsel:用于单语模型跨语言迁移的子词嵌入的有效初始化 链接:https://arxiv.org/abs/2112.06598

作者:Benjamin Minixhofer,Fabian Paischer,Navid Rekabsaz 机构:Institute of Computational Perception, Johannes Kepler University Linz, Institute for Machine Learning, Johannes Kepler University Linz, ELLIS Unit Linz and LIT AI Lab 摘要:近年来,大型预训练语言模型(LMs)得到了广泛的应用。训练这些模型需要更多的计算资源,而且大多数现有的模型只训练英文文本。用其他语言训练这些模型是非常昂贵的。为了缓解这个问题,我们引入了一种称为WECHSEL的方法,将英语模型转换成新的语言。我们将英语模型的标记器与目标语言中的标记器交换,并通过使用覆盖英语和目标语言的多语言静态单词嵌入,初始化标记嵌入,使其接近语义相似的英语标记。我们使用WECHSEL将GPT-2和RoBERTa模型转换为其他4种语言(法语、德语、汉语和斯瓦希里语)。WECHSEL改进了先前提出的跨语言参数转移方法,并优于在目标语言中从头开始训练的大小相当的模型,训练工作量最多减少64倍。我们的方法使得为新语言训练大型语言模型更容易获得,对环境的破坏也更小。我们公开我们的代码和模型。 摘要:Recently, large pretrained language models (LMs) have gained popularity. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a method -- called WECHSEL -- to transfer English models to new languages. We exchange the tokenizer of the English model with a tokenizer in the target language and initialize token embeddings such that they are close to semantically similar English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages (French, German, Chinese and Swahili). WECHSEL improves over a previously proposed method for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
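WECHSEL 的关键一步是利用覆盖两种语言的对齐静态词向量,把目标语言词元的嵌入初始化为若干最相似英文词元嵌入的加权平均。下面是一个形状层面的 numpy 示意(相似度加权方式、k 值等为此处假设,并非论文的精确算法):

import numpy as np

def init_target_embeddings(tgt_tokens, src_tokens, src_emb, static_src, static_tgt, k=10):
    """src_emb: 英文模型的词元嵌入矩阵; static_src/static_tgt: 跨语言对齐的静态词向量字典。"""
    src_mat = np.stack([static_src[t] for t in src_tokens])        # (V_src, d)
    src_mat /= np.linalg.norm(src_mat, axis=1, keepdims=True)
    new_emb = np.zeros((len(tgt_tokens), src_emb.shape[1]), dtype=src_emb.dtype)
    for i, tok in enumerate(tgt_tokens):
        v = static_tgt[tok] / np.linalg.norm(static_tgt[tok])
        sims = src_mat @ v                                         # 与所有英文词元的余弦相似度
        top = np.argsort(-sims)[:k]                                # 取最相似的 k 个英文词元
        w = np.exp(sims[top])
        w /= w.sum()                                               # softmax 加权
        new_emb[i] = (w[:, None] * src_emb[top]).sum(axis=0)
    return new_emb

# 玩具数据,仅演示形状
rng = np.random.default_rng(0)
src_tokens, tgt_tokens = ["cat", "dog"], ["chat", "chien"]
static = {t: rng.normal(size=8) for t in src_tokens + tgt_tokens}
src_emb = rng.normal(size=(2, 16)).astype(np.float32)
print(init_target_embeddings(tgt_tokens, src_tokens, src_emb, static, static, k=2).shape)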

【4】 Am I Me or You? State-of-the-Art Dialogue Models Cannot Maintain an Identity 标题:我是我还是你?最先进的对话模型无法保持身份 链接:https://arxiv.org/abs/2112.05843

作者:Kurt Shuster,Jack Urbanek,Arthur Szlam,Jason Weston 机构:Facebook AI Research 摘要:最先进的对话模型在事实准确性和自相矛盾方面仍然经常出现问题。从一些例子来看,它们无法在整个对话中保持角色身份;更具体地说,它们可能扮演起对话者的角色。在这项工作中,我们对这一缺陷进行了形式化和量化,并通过人类评估实验证明这确实是一个问题。相比之下,我们发现,经过专门训练以识别谁在说话的判别模型可以表现良好;此外,这些模型还可以用作自动评测指标。最后,我们评估了各种缓解方法,包括对模型架构、训练协议和解码策略的更改。根据人类注释者的说法,我们最好的模型减少了近65%的错误识别问题,同时提高了吸引力。尽管有这些结果,我们发现维持角色身份仍然是一个具有挑战性的问题。 摘要:State-of-the-art dialogue models still often stumble with regards to factual accuracy and self-contradiction. Anecdotally, they have been observed to fail to maintain character identity throughout discourse; and more specifically, may take on the role of their interlocutor. In this work we formalize and quantify this deficiency, and show experimentally through human evaluations that this is indeed a problem. In contrast, we show that discriminative models trained specifically to recognize who is speaking can perform well; and further, these can be used as automated metrics. Finally, we evaluate a wide variety of mitigation methods, including changes to model architecture, training protocol, and decoding strategy. Our best models reduce mistaken identity issues by nearly 65% according to human annotators, while simultaneously improving engagingness. Despite these results, we find that maintaining character identity still remains a challenging problem.

其他(12篇)

【1】 Unraveling Social Perceptions & Behaviors towards Migrants on Twitter 标题:解析推特上对移民的社会认知与行为 链接:https://arxiv.org/abs/2112.06642

作者:Aparup Khatua,Wolfgang Nejdl 机构:L3S Research Center, Leibniz Universität Hannover, Hannover, Germany 备注:This work has been accepted to appear at International Conference on Web and Social Media ICWSM-2022 摘要:我们从社会心理学文献中汲取见解,以识别推特上关于移民的讨论的两个方面,即对移民的看法和对移民的行为。我们的理论锚定帮助我们确定了社交媒体用户对移民的两种普遍看法(即同情和反感)和两种主要行为(即团结和敌意)。我们采用了无监督和有监督的方法来识别这些感知和行为。在应用NLP领域,我们的研究为与移民相关的推特讨论提供了细致入微的理解。我们提出的基于Transformer的模型,即BERT+CNN,F1得分为0.76,优于其他模型。此外,我们认为,表达反感或敌意的推特可以被广泛认为是针对移民的仇恨言论,但两者并不相同。因此,我们的方法通过突出仇恨言论在感知与行为层面上的细粒度差异,细化了二元仇恨言论检测任务。 摘要:We draw insights from the social psychology literature to identify two facets of Twitter deliberations about migrants, i.e., perceptions about migrants and behaviors towards migrants. Our theoretical anchoring helped us in identifying two prevailing perceptions (i.e., sympathy and antipathy) and two dominant behaviors (i.e., solidarity and animosity) of social media users towards migrants. We have employed unsupervised and supervised approaches to identify these perceptions and behaviors. In the domain of applied NLP, our study offers a nuanced understanding of migrant-related Twitter deliberations. Our proposed transformer-based model, i.e., BERT + CNN, has reported an F1-score of 0.76 and outperformed other models. Additionally, we argue that tweets conveying antipathy or animosity can be broadly considered hate speech towards migrants, but they are not the same. Thus, our approach has fine-tuned the binary hate speech detection task by highlighting the granular differences between perceptual and behavioral aspects of hate speeches.

【2】 A Study on Token Pruning for ColBERT 标题:Colbert的令牌剪枝研究 链接:https://arxiv.org/abs/2112.06540

作者:Carlos Lassance,Maroua Maachou,Joohee Park,Stéphane Clinchant 机构:†Naver Labs Europe, ‡ Naver Corp 备注:5 pages 摘要:ColBERT模型最近被提出作为一种有效的基于BERT的排序模型。通过采用后期交互机制,ColBERT的一个主要优点是可以预先计算文档表示。然而,该模型最大的缺点是索引大小,它与集合中词元的数量成线性比例。在本文中,我们研究了ColBERT模型的多种设计以解决这一问题。虽然已有压缩技术被用来减小索引大小,但本文研究的是针对ColBERT的词元剪枝技术。我们比较了简单的启发式方法与单层注意力机制,用于选择索引时要保留的词元。我们的实验表明,在MS MARCO passage集合上,ColBERT索引可以被剪枝多达30%,而不会显著降低性能。最后,我们在MS MARCO文档集合上进行了实验,揭示了这种机制面临的若干挑战。 摘要:The ColBERT model has recently been proposed as an effective BERT based ranker. By adopting a late interaction mechanism, a major advantage of ColBERT is that document representations can be precomputed in advance. However, the big downside of the model is the index size, which scales linearly with the number of tokens in the collection. In this paper, we study various designs for ColBERT models in order to attack this problem. While compression techniques have been explored to reduce the index size, in this paper we study token pruning techniques for ColBERT. We compare simple heuristics, as well as a single layer of attention mechanism to select the tokens to keep at indexing time. Our experiments show that ColBERT indexes can be pruned up to 30% on the MS MARCO passage collection without a significant drop in performance. Finally, we experiment on MS MARCO documents, which reveal several challenges for such mechanism.
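词元剪枝的思路可以概括为:给文档中每个词元打一个重要性分数(如简单启发式或单层注意力),只保留得分最高的一部分词元表示进入索引。下面是这一步骤的一个 PyTorch 玩具示意(打分方式与保留比例均为此处假设):

import torch

def prune_doc_embeddings(token_embs, scores, keep_ratio=0.7):
    """token_embs: (T, d) 的词元向量; scores: (T,) 的重要性打分。
    按打分保留得分最高的一部分词元表示,以压缩 ColBERT 式索引(示意)。"""
    k = max(1, int(token_embs.size(0) * keep_ratio))
    keep = scores.topk(k).indices.sort().values     # 保留得分前 k 的词元并维持原顺序
    return token_embs[keep]

doc = torch.randn(100, 128)
importance = torch.rand(100)
print(prune_doc_embeddings(doc, importance, keep_ratio=0.7).shape)  # torch.Size([70, 128])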

【3】 Do Data-based Curricula Work? 标题:基于数据的课程有效吗? 链接:https://arxiv.org/abs/2112.06510

作者:Maxim K. Surkov,Vladislav D. Mosin,Ivan P. Yamshchikov 机构:LEYA Lab, Yandex, Higher School of Economics 摘要:当前最先进的NLP系统使用大型神经网络,需要大量计算资源进行训练。受人类知识获取的启发,研究人员提出了课程学习,即对任务进行排序(基于任务的课程)或对数据集进行排序与采样(基于数据的课程),以促进训练。这项工作调查了基于数据的课程学习对于大型现代语言模型(如BERT和T5)的好处。我们基于一系列复杂度度量和不同的采样策略对多种课程进行了实验。在不同NLP任务上的大量实验表明,基于各种复杂度度量的课程很少带来好处,而随机采样的表现与课程一样好或更好。 摘要:Current state-of-the-art NLP systems use large neural networks that require lots of computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning: sequencing of tasks (task-based curricula) or ordering and sampling of the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curriculum learning for large modern language models such as BERT and T5. We experiment with various curricula based on a range of complexity measures and different sampling strategies. Extensive experiments on different NLP tasks show that curricula based on various complexity measures rarely has any benefits while random sampling performs either as well or better than curricula.
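基于数据的课程学习可以简单理解为:用某个复杂度度量(如句长)给样本排序,训练早期只见到较容易的样本,随后逐步放开更难的数据。下面是一个与具体模型无关的采样器玩具示意(分阶段方式为此处假设):

def curriculum_batches(samples, complexity, n_stages=3, batch_size=4):
    """按复杂度从易到难排序,并分阶段逐步扩大训练时可见的数据池。"""
    ordered = [s for _, s in sorted(zip(map(complexity, samples), samples))]
    stage_len = len(ordered) // n_stages
    for stage in range(n_stages):
        pool = ordered[: stage_len * (stage + 1)] or ordered   # 逐步扩大可见数据
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i:i + batch_size]

texts = ["a b", "a", "a b c d e", "a b c", "a b c d", "a b c d e f"]
for stage, batch in curriculum_batches(texts, complexity=lambda t: len(t.split())):
    print(stage, batch)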

【4】 Native Chinese Reader: A Dataset TowardsNative-Level Chinese Machine ReadingComprehension 标题:汉语母语阅读器:面向汉语母语机器阅读理解的数据集 链接:https://arxiv.org/abs/2112.06494

作者:Shusheng Xu,Yichen Liu,Xiaoyu Yi,Siyuan Zhou,Huizi Li,Yi Wu 机构: IIIS, Tsinghua University, New York University , Shenzhen University, Peking University, Haihua Institute for Frontier Information Technology, Shanghai Qi Zhi Institute 备注:17 pages, 1 figure, accepted by NeurIPS 2021 Track on Datasets and Benchmarks 摘要:我们介绍了一个新的机器阅读理解(MRC)数据集,即"母语汉语读者"(NCR),其中包含篇幅特别长的现代汉语和文言文文章。NCR是从中国高中语文课程的试题中收集的,旨在评估中国本土年轻人的语言水平。现有的中文MRC数据集要么是特定领域的,要么只关注现代汉语中几百个字符的短上下文。相比之下,NCR包含8390篇文档,平均长度为1024个字符,涵盖了广泛的中文写作风格,包括现代文章、古典文学和古典诗歌。这些文档中总共有20477个问题需要很强的推理能力和常识才能找到正确答案。我们使用流行的中文预训练模型实现了多个基线模型,并使用我们的数据集启动了一个在线竞赛,以检查当前方法的局限性。最佳模型的测试准确率为59%,而人工评估的平均准确率为79%,这表明当前的MRC模型与以汉语为母语的人之间存在显著的性能差距。我们在以下位置发布数据集:https://sites.google.com/view/native-chinese-reader/. 摘要:We present Native Chinese Reader (NCR), a new machine reading comprehension (MRC) dataset with particularly long articles in both modern and classical Chinese. NCR is collected from the exam questions for the Chinese course in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth. Existing Chinese MRC datasets are either domain-specific or focusing on short contexts of a few hundreds of characters in modern Chinese only. By contrast, NCR contains 8390 documents with an average length of 1024 characters covering a wide range of Chinese writing styles, including modern articles, classical literature and classical poetry. A total of 20477 questions on these documents also require strong reasoning abilities and common sense to figure out the correct answers. We implemented multiple baseline models using popular Chinese pre-trained models and additionally launched an online competition using our dataset to examine the limit of current methods. The best model achieves 59% test accuracy while human evaluation shows an average accuracy of 79%, which indicates a significant performance gap between current MRC models and native Chinese speakers. We release the dataset at https://sites.google.com/view/native-chinese-reader/.

【5】 Predicting User Code-Switching Level from Sociological and Psychological Profiles 标题:从社会学和心理学特征预测用户语码转换水平 链接:https://arxiv.org/abs/2112.06462

作者:Injy Hamed,Alia El Bolock,Nader Rizk,Cornelia Herbert,Slim Abdennadher,Ngoc Thang Vu 机构:Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany, Applied Emotion and Motivation Psychology, Ulm University, Ulm, Germany, Informatics and Computer Science, German International University, Cairo, Egypt 备注:To be published in the proceedings of the International Conference on Asian Language Information Processing 摘要:说多种语言的人往往在对话中交替使用不同的语言,这种现象被称为“语码转换”(CS)。语篇转换是一种复杂的现象,它不仅包含了语言上的挑战,而且在说话人之间的动态行为方面也包含了大量的复杂性。社会学家和心理学家对这种动态行为进行了研究,确定了影响CS的因素。在这篇文章中,我们提供了一个关于阿拉伯语-英语CS的实证研究,我们展示了用户CS频率和性格特征之间的相关性。我们使用机器学习(ML)来验证研究结果,告知和确认现有的理论。预测模型能够预测用户的CS频率,准确率高于55%,其中旅行经历和个性特征在建模过程中发挥了最大的作用。 摘要:Multilingual speakers tend to alternate between languages within a conversation, a phenomenon referred to as "code-switching" (CS). CS is a complex phenomenon that not only encompasses linguistic challenges, but also contains a great deal of complexity in terms of its dynamic behaviour across speakers. This dynamic behaviour has been studied by sociologists and psychologists, identifying factors affecting CS. In this paper, we provide an empirical user study on Arabic-English CS, where we show the correlation between users' CS frequency and character traits. We use machine learning (ML) to validate the findings, informing and confirming existing theories. The predictive models were able to predict users' CS frequency with an accuracy higher than 55%, where travel experiences and personality traits played the biggest role in the modeling process.

【6】 ValueNet: A New Dataset for Human Value Driven Dialogue System 标题:ValueNet:一种新的人类价值驱动对话系统数据集 链接:https://arxiv.org/abs/2112.06346

作者:Liang Qiu,Yizhou Zhao,Jinchao Li,Pan Lu,Baolin Peng,Jianfeng Gao,Song-Chun Zhu 机构:UCLA Center for Vision, Cognition, Learning, and Autonomy, Microsoft Research, Redmond 备注:Paper accepted by AAAI 2022 摘要:构建一个具有社会智能的代理涉及到许多挑战,其中之一就是教代理像人一样在其价值观的指导下说话。然而,价值驱动的聊天机器人在对话系统领域仍然没有得到充分的研究。大多数现有数据集集中于常识推理或社会规范建模。在这项工作中,我们提出了一个新的大规模人类价值数据集ValueNet,其中包含21374个文本场景中的人类态度。该数据集按十个维度组织,符合跨文化研究中的基本人类价值理论。我们进一步在ValueNet上开发了基于Transformer的价值回归模型,以了解效用分布。综合实证结果表明,学习价值模型可以使广泛的对话任务受益。例如,通过使用强化学习和价值模型的奖励教授生成代理,我们的方法在个性化对话生成数据集:Persona Chat上实现了最先进的性能。现有的情感识别模型以值作为附加特征,能够捕获上下文中丰富的人类情感,从而进一步提高移情对话数据集中移情反应生成的性能。据我们所知,ValueNet是第一个用于人类价值建模的大型文本数据集,我们也是第一个尝试将价值模型纳入情感智能对话系统的人。该数据集可在https://liang-qiu.github.io/ValueNet/. 摘要:Building a socially intelligent agent involves many challenges, one of which is to teach the agent to speak guided by its value like a human. However, value-driven chatbots are still understudied in the area of dialogue systems. Most existing datasets focus on commonsense reasoning or social norm modeling. In this work, we present a new large-scale human value dataset called ValueNet, which contains human attitudes on 21,374 text scenarios. The dataset is organized in ten dimensions that conform to the basic human value theory in intercultural research. We further develop a Transformer-based value regression model on ValueNet to learn the utility distribution. Comprehensive empirical results show that the learned value model could benefit a wide range of dialogue tasks. For example, by teaching a generative agent with reinforcement learning and the rewards from the value model, our method attains state-of-the-art performance on the personalized dialog generation dataset: Persona-Chat. With values as additional features, existing emotion recognition models enable capturing rich human emotions in the context, which further improves the empathetic response generation performance in the EmpatheticDialogues dataset. To the best of our knowledge, ValueNet is the first large-scale text dataset for human value modeling, and we are the first one trying to incorporate a value model into emotionally intelligent dialogue systems. The dataset is available at https://liang-qiu.github.io/ValueNet/.

【7】 ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation 标题:ASCEND:一种用于多话轮会话中语码转换的自发汉英数据集 链接:https://arxiv.org/abs/2112.06223

作者:Holy Lovenia,Samuel Cahyawijaya,Genta Indra Winata,Peng Xu,Xu Yan,Zihan Liu,Rita Frieske,Tiezheng Yu,Wenliang Dai,Elham J. Barezi,Pascale Fung 机构:The Hong Kong University of Science and Technology 摘要:语码转换是说话人在谈话中转换语言的一种言语现象。尽管会话口语中的语码转换是自发的,但现有的大多数研究都是通过朗读语音而不是自发语音来收集语码转换数据的。ASCEND(一个自发汉英数据集)提供了在香港收集的高质量自发多轮对话语码转换语料资源。我们报告了ASCEND的设计和语音数据收集过程,包括本工作中的标注。ASCEND包含23名中英文流利的双语者,共计9.23小时的干净语音语料。 摘要:Code-switching is a speech phenomenon when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data through read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong. We report ASCEND's design and procedure of collecting the speech data, including the annotations in this work. ASCEND includes 23 bilinguals that are fluent in both Chinese and English and consists of 9.23 hours clean speech corpus.

【8】 Prosody Labelled Dataset for Hindi using Semi-Automated Approach 标题:采用半自动方法的印地语韵律标注数据集 链接:https://arxiv.org/abs/2112.05973

作者:Esha Banerjee,Atul Kr. Ojha,Girish Nath Jha 备注:6 摘要:本研究旨在开发一个半自动标记的印地语韵律数据库,以增强ASR和TTS系统中的语调成分,这也有助于建立语音到语音机器翻译系统。虽然印度语中没有单一的韵律标记标准,但研究人员过去在文献中使用了感知和统计方法来推断印度语韵律模式的行为。基于现有的研究和印地语语调理论,本研究试图首先开发一个印地语语音数据的手动注释韵律语料库,然后用于训练预测模型以生成自动韵律标签。共有5000个陈述式和疑问式句子(23500字)被标注。训练模型对音高重音、中间短语边界和重音短语边界的准确率分别为73.40%、93.20%和43%。 摘要:This study aims to develop a semi-automatically labelled prosody database for Hindi, for enhancing the intonation component in ASR and TTS systems, which is also helpful for building Speech to Speech Machine Translation systems. Although no single standard for prosody labelling exists in Hindi, researchers in the past have employed perceptual and statistical methods in literature to draw inferences about the behaviour of prosody patterns in Hindi. Based on such existing research and largely agreed upon theories of intonation in Hindi, this study attempts to first develop a manually annotated prosodic corpus of Hindi speech data, which is then used for training prediction models for generating automatic prosodic labels. A total of 5,000 sentences (23,500 words) for declarative and interrogative types have been labelled. The accuracy of the trained models for pitch accent, intermediate phrase boundaries and accentual phrase boundaries is 73.40%, 93.20%, and 43% respectively.

【9】 An Empirical Study on Relation Extraction in the Biomedical Domain 标题:生物医学领域关系抽取的实证研究 链接:https://arxiv.org/abs/2112.05910

作者:Yongkang Li 机构:Peking University 备注:5 pages 摘要:关系抽取是自然语言处理中的一个基本问题。大多数现有的模型都是为一般领域中的关系提取而定义的。然而,它们在特定领域(如生物医学)的表现尚不清楚。为了填补这一空白,本文对生物医学研究论文中的关系抽取进行了实证研究。具体来说,我们考虑句子级和文档级关系提取,并在几个基准数据集上运行一些最先进的方法。结果表明:(1)现有的文档级关系抽取方法具有较强的泛化能力;(2) 现有的方法需要大量的标记数据用于生物医学中的模型微调。我们的观察结果可能会启发这一领域的人们开发更有效的生物医学关系提取模型。 摘要:Relation extraction is a fundamental problem in natural language processing. Most existing models are defined for relation extraction in the general domain. However, their performance on specific domains (e.g., biomedicine) is yet unclear. To fill this gap, this paper carries out an empirical study on relation extraction in biomedical research articles. Specifically, we consider both sentence-level and document-level relation extraction, and run a few state-of-the-art methods on several benchmark datasets. Our results show that (1) current document-level relation extraction methods have strong generalization ability; (2) existing methods require a large amount of labeled data for model fine-tuning in biomedicine. Our observations may inspire people in this field to develop more effective models for biomedical relation extraction.

【10】 Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems 标题:重新审视对话系统时代ASR与NLU的界限 链接:https://arxiv.org/abs/2112.05842

作者:Manaal Faruqui,Dilek Hakkani-Tür 机构:Google Assistant, Amazon Alexa AI 备注:Accepted to be published at Computational Linguistics Journal 2022 摘要:随着世界各地越来越多的用户在日常生活中与对话代理进行交互,需要更好的语音理解,这就需要重新关注自动语音识别(ASR)和自然语言理解(NLU)研究之间的动态关系。我们简要回顾了这些研究领域,并阐述了它们之间的当前关系。根据我们在本文中所做的观察,我们认为:(1)NLU应该认识到对话系统管道上游使用的ASR模型的存在,(2)ASR应该能够从NLU中发现的错误中学习,(3)需要提供语音输入语义注释的端到端数据集,(4)ASR和NLU研究社区之间应加强合作。 摘要:As more users across the world are interacting with dialog agents in their daily life, there is a need for better speech understanding that calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this paper, we argue that (1) NLU should be cognizant of the presence of ASR models being used upstream in a dialog system's pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end datasets that provide semantic annotations on spoken input, (4) there should be stronger collaboration between ASR and NLU research communities.

【11】 The Hierarchical Organization of Syntax 标题:语法的层次化组织 链接:https://arxiv.org/abs/2112.05783

作者:Babak Ravandi,Valentina Concu 机构:Network Science Institute, Northeastern University, Boston, USA, Department of Physics, Northeastern University, Boston, USA, Department of Foreign Languages, Universidad del Norte, Barranquilla, Colombia 摘要:层次结构是复杂系统的支柱,通过对它们的分析,可以更深入地了解它们的结构以及它们是如何演化的。我们认为语言也是复杂的自适应系统。因此,我们分析了德语历史句法网络的层次结构,这些句法网络是从11世纪到17世纪的文本语料库中创建的。我们跟踪了这些网络中句法结构的出现,并将其映射到特定的交际需求。我们将这些新兴结构命名为交际层次结构。我们假设说话人的交际需要是句法的组织力量。我们认为,这些多重交际层次的出现是形成语法的原因,而这些层次是齐夫定律的前提。交际层次的出现表明语言进化的目标不仅仅是提高信息传递的效率。随着我们作为一个物种的进步,语言也在进化,以提高我们交流更复杂抽象概念的能力。 摘要:Hierarchies are the backbones of complex systems and their analysis allows for a deeper understanding of their structure and how they evolve. We consider languages to be also complex adaptive systems. Hence, we analyzed the hierarchical organization of historical syntactic networks from German that were created from a corpus of texts from the 11th to 17th centuries. We tracked the emergence of syntactic structures in these networks and mapped them to specific communicative needs. We named these emerging structures communicative hierarchies. We hypothesise that the communicative needs of speakers are the organizational force of syntax. We propose that the emergence of these multiple communicative hierarchies is what shapes syntax, and that these hierarchies are the prerequisite to the Zipf's law. The emergence of communicative hierarchies indicates that the objective of language evolution is not only to increase the efficiency of transferring information. Language is also evolving to increase our capacity to communicate more sophisticated abstractions as we advance as a species.

【12】 A Scoping Review of Publicly Available Language Tasks in Clinical Natural Language Processing 标题:临床自然语言处理中可公开语言任务的范围研究综述 链接:https://arxiv.org/abs/2112.05780

作者:Yanjun Gao,Dmitriy Dligach,Leslie Christensen,Samuel Tesch,Ryan Laffin,Dongfang Xu,Timothy Miller,Ozlem Uzuner,Matthew M Churpek,Majid Afshar 机构: ICU Data Science Lab, School of Medicine and Public Health, Department of Computer Science, Loyola University Chicago, Chicago, IL, School of Medicine and Public Health, University of Wisconsin, Madison, WI 备注:Paper submitted to Journal of American Medical Informatics Association (JAMIA) 摘要:目的:对使用来自患者队列的公开电子健康记录数据的临床自然语言处理(NLP)任务论文进行范围综述。材料与方法:我们检索了6个数据库,包括生物医学研究和计算机科学文献数据库。两名评审员进行了一轮标题/摘要筛选和全文筛选。我们的方法遵循系统综述与荟萃分析优先报告条目(PRISMA)指南。结果:2007年至2021年间,共有35篇论文(涉及47项临床NLP任务)符合纳入标准。我们根据NLP问题的类型对任务进行分类,包括命名实体识别、摘要和其他NLP任务。一些任务以临床决策支持应用为主题,如药物滥用、表型分析、临床试验队列选择。我们通过出版物和数据集信息总结了这些任务。讨论:随着语言系统的进步和NLP领域的发展,临床NLP任务的范围不断扩大。然而,通用领域NLP社区与临床信息学社区的研究兴趣存在分歧,数据源的可推广性也存在差距。我们还发现了数据选择和准备中的问题,包括缺乏时间敏感数据,以及问题规模与评估设置的有效性问题。结论:现有的临床NLP任务涵盖了广泛的主题,该领域将继续发展,并吸引通用领域NLP和临床信息学界的更多关注。我们鼓励未来的工作将多学科协作、报告透明度和数据准备标准化结合起来。 摘要:Objective: to provide a scoping review of papers on clinical natural language processing (NLP) tasks that use publicly available electronic health record data from a cohort of patients. Materials and Methods: We searched six databases, including biomedical research and computer science literature database. A round of title/abstract screening and full-text screening were conducted by two reviewers. Our method followed the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. Results: A total of 35 papers with 47 clinical NLP tasks met inclusion criteria between 2007 and 2021. We categorized the tasks by the type of NLP problems, including name entity recognition, summarization, and other NLP tasks. Some tasks were introduced with a topic of clinical decision support applications, such as substance abuse, phenotyping, cohort selection for clinical trial. We summarized the tasks by publication and dataset information. Discussion: The breadth of clinical NLP tasks keeps growing as the field of NLP evolves with advancements in language systems. However, gaps exist in divergent interests between general domain NLP community and clinical informatics community, and in generalizability of the data sources. We also identified issues in data selection and preparation including the lack of time-sensitive data, and invalidity of problem size and evaluation. Conclusions: The existing clinical NLP tasks cover a wide range of topics and the field will continue to grow and attract more attention from both general domain NLP and clinical informatics community. We encourage future work to incorporate multi-disciplinary collaboration, reporting transparency, and standardization in data preparation.