"2021 Language and Intelligence Technology Competition": A Detailed Walkthrough of the Machine Reading Comprehension Baseline System

This notebook shows how to use PaddleNLP to quickly build the LIC2021 machine reading comprehension baseline and then improve on it.


Machine Reading Comprehension (MRC) is the task of having a machine read a text and then answer questions about its content. Reading comprehension is an important frontier topic in natural language processing and artificial intelligence: it is key to raising the intelligence level of machines and to giving them the ability to continuously acquire knowledge, and it has attracted wide attention from academia and industry in recent years.

Natural language understanding places very high demands on every aspect of a machine learning model. However, most current MRC datasets evaluate models with a single metric and lack a fine-grained, multi-dimensional evaluation of language understanding, which makes specific model deficiencies hard to discover and fix. To address this, the China Computer Federation, the Chinese Information Processing Society of China, and Baidu built a fine-grained, multi-dimensional evaluation dataset that probes model weaknesses along dimensions such as vocabulary understanding, phrase understanding, semantic-role understanding, and logical reasoning, pushing reading-comprehension evaluation into a "fine-grained" era. All samples in the dataset come from real application scenarios; they are difficult, cover a rich set of test points, and include many problems that remain hard to solve in practice.

The DuReaderchecklist dataset contains a training set, a development set, and a test set. The development and test sets contain both in-domain samples, drawn from the same distribution as the training set, and samples categorized under the checklist taxonomy. Given a question q, a passage p, and its title t, the system must decide, based on the passage content, whether p contains an answer to the question; if so, it outputs the answer a, otherwise it outputs "no answer". Each sample in the dataset is a quadruple <q, p, t, a>, for example:

Question q: 番石榴汁热量

Passage p: 番石榴性温,味甜、酸、涩…,最重要的是番石榴所含的脂肪热量较低,一个番石榴所含的脂肪约0.9克重或84卡路里。比起苹果,番石榴所含有的脂肪少38%,卡路里少42%。

Title t: 番石榴汁的热量 - 妈妈网百科

Reference answer a: ['一个番石榴所含的脂肪约0.9克重或84卡路里']

Question q: 云南文山市多少人口?

Passage p: 云南省下辖8个市、8个少数民族自治州,面积39万平方千米,总人口4596万人,云南汉族人口为3062.9万人,占云南省总人口的66.63%...

Title t: 云南总人口数多少人,2019年云南人口数量统计(最新)

Reference answer a: ['无答案'] (i.e. no answer)
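As a quick illustration, the two samples above can be represented as Python dicts in the SQuAD-style layout described here. This is a hand-written sketch (make_sample is a hypothetical helper, not part of the baseline); the no-answer case is marked by an empty answer list.

```python
# A minimal sketch (not the official data loader) of one DuReader-checklist
# sample as a <q, p, t, a> quadruple. An empty answer list marks the
# "no answer" case via is_impossible=True.

def make_sample(question, passage, title, answers):
    """Build a sample dict; an empty answer list marks a 'no answer' case."""
    return {
        "question": question,
        "context": passage,
        "title": title,
        "answers": answers,
        "is_impossible": len(answers) == 0,
    }

answerable = make_sample(
    "番石榴汁热量",
    "番石榴性温,味甜、酸、涩…一个番石榴所含的脂肪约0.9克重或84卡路里。",
    "番石榴汁的热量 - 妈妈网百科",
    ["一个番石榴所含的脂肪约0.9克重或84卡路里"],
)
no_answer = make_sample(
    "云南文山市多少人口?",
    "云南省下辖8个市、8个少数民族自治州…",
    "云南总人口数多少人,2019年云南人口数量统计(最新)",
    [],
)
print(answerable["is_impossible"], no_answer["is_impossible"])  # False True
```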

The DuReaderchecklist dataset aims to systematically evaluate where current models fall short by establishing a checklist evaluation framework. The natural language understanding capabilities currently covered by the checklist include vocabulary understanding, phrase understanding, semantic-role understanding, reasoning ability, and more. (figure: checklist framework taxonomy)


Setup
Installing PaddlePaddle

This project requires PaddlePaddle 2.0 or later; see the installation guide for instructions.

Installing PaddleNLP

pip install --upgrade paddlenlp -i https://pypi.org/simple
Environment requirements

Python 3.6 or later

AI Studio will preinstall PaddleNLP by default in the future; until then, install it with the following command

In [1]
!pip install --upgrade paddlenlp -i https://pypi.org/simple
Collecting paddlenlp
  Downloading https://files.pythonhosted.org/packages/a0/a2/64352288e9e8ae98b76edd3c7732fd9ccca8a9a76041ac4ca2441e6af721/paddlenlp-2.0.0rc17-py3-none-any.whl (257kB)
     |████████████████████████████████| 266kB 206kB/s eta 0:00:01
Requirement already satisfied, skipping upgrade: visualdl in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.1.1)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)
Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)
Requirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)
Requirement already satisfied, skipping upgrade: protobuf>=3.11.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (3.14.0)
Requirement already satisfied, skipping upgrade: requests in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (2.22.0)
Requirement already satisfied, skipping upgrade: six>=1.14.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (1.15.0)
Requirement already satisfied, skipping upgrade: Pillow>=7.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (7.1.2)
Requirement already satisfied, skipping upgrade: shellcheck-py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (0.7.1.1)
Requirement already satisfied, skipping upgrade: flake8>=3.7.9 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (3.8.2)
Requirement already satisfied, skipping upgrade: pre-commit in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (1.21.0)
Requirement already satisfied, skipping upgrade: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (1.16.4)
Requirement already satisfied, skipping upgrade: flask>=1.1.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (1.1.1)
Requirement already satisfied, skipping upgrade: bce-python-sdk in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (0.8.53)
Requirement already satisfied, skipping upgrade: Flask-Babel>=1.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp) (1.0.0)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.22.1)
Requirement already satisfied, skipping upgrade: idna<2.9,>=2.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp) (2.8)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp) (2019.9.11)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp) (1.25.6)
Requirement already satisfied, skipping upgrade: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests->visualdl->paddlenlp) (3.0.4)
Requirement already satisfied, skipping upgrade: mccabe<0.7.0,>=0.6.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp) (0.6.1)
Requirement already satisfied, skipping upgrade: pyflakes<2.3.0,>=2.2.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp) (2.2.0)
Requirement already satisfied, skipping upgrade: importlib-metadata; python_version < "3.8" in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp) (0.23)
Requirement already satisfied, skipping upgrade: pycodestyle<2.7.0,>=2.6.0a1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flake8>=3.7.9->visualdl->paddlenlp) (2.6.0)
Requirement already satisfied, skipping upgrade: identify>=1.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (1.4.10)
Requirement already satisfied, skipping upgrade: cfgv>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (2.0.1)
Requirement already satisfied, skipping upgrade: pyyaml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (5.1.2)
Requirement already satisfied, skipping upgrade: nodeenv>=0.11.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (1.3.4)
Requirement already satisfied, skipping upgrade: toml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (0.10.0)
Requirement already satisfied, skipping upgrade: aspy.yaml in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (1.3.0)
Requirement already satisfied, skipping upgrade: virtualenv>=15.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pre-commit->visualdl->paddlenlp) (16.7.9)
Requirement already satisfied, skipping upgrade: Werkzeug>=0.15 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp) (0.16.0)
Requirement already satisfied, skipping upgrade: click>=5.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp) (7.0)
Requirement already satisfied, skipping upgrade: Jinja2>=2.10.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp) (2.10.1)
Requirement already satisfied, skipping upgrade: itsdangerous>=0.24 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from flask>=1.1.1->visualdl->paddlenlp) (1.1.0)
Requirement already satisfied, skipping upgrade: pycryptodome>=3.8.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from bce-python-sdk->visualdl->paddlenlp) (3.9.9)
Requirement already satisfied, skipping upgrade: future>=0.6.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from bce-python-sdk->visualdl->paddlenlp) (0.18.0)
Requirement already satisfied, skipping upgrade: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp) (2019.3)
Requirement already satisfied, skipping upgrade: Babel>=2.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp) (2.8.0)
Requirement already satisfied, skipping upgrade: scipy>=0.17.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.3.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->flake8>=3.7.9->visualdl->paddlenlp) (0.6.0)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Jinja2>=2.10.1->flask>=1.1.1->visualdl->paddlenlp) (1.1.1)
Requirement already satisfied, skipping upgrade: more-itertools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < "3.8"->flake8>=3.7.9->visualdl->paddlenlp) (7.2.0)
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.0rc7
    Uninstalling paddlenlp-2.0.0rc7:
      Successfully uninstalled paddlenlp-2.0.0rc7
Successfully installed paddlenlp-2.0.0rc17
Downloading the data
Before running the baseline, download the DuReaderchecklist dataset.

Running the script below saves the dataset to the dataset/ folder. It also saves the parameters of a baseline model fine-tuned from ERNIE-1.0 to the finetuned_model/ folder, ready for direct prediction.

In [2]
!sh download.sh
Download DuReader-checklist dataset
--2021-04-13 21:10:16--  https://dataset-bj.cdn.bcebos.com/lic2021/dureader_checklist.dataset.tar.gz
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1636803 (1.6M) [application/x-gzip]
Saving to: ‘dureader_checklist.dataset.tar.gz’

dureader_checklist. 100%[===================>]   1.56M  --.-KB/s    in 0.02s   

2021-04-13 21:10:16 (66.6 MB/s) - ‘dureader_checklist.dataset.tar.gz’ saved [1636803/1636803]

dataset/
dataset/dev.json
dataset/train.json
dataset/License.docx
Download fine-tuned parameters
--2021-04-13 21:10:16--  https://dataset-bj.cdn.bcebos.com/lic2021/dureader_checklist.finetuned_model.tar.gz
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439315863 (419M) [application/x-gzip]
Saving to: ‘dureader_checklist.finetuned_model.tar.gz’

dureader_checklist. 100%[===================>] 418.96M   117MB/s    in 3.6s    

2021-04-13 21:10:20 (117 MB/s) - ‘dureader_checklist.finetuned_model.tar.gz’ saved [439315863/439315863]

finetuned_model/
finetuned_model/tokenizer_config.json
finetuned_model/model_state.pdparams
finetuned_model/model_config.json
finetuned_model/vocab.txt
Data preparation
The data preparation pipeline is as follows: (figure: data preparation workflow)

Building a custom dataset with load_dataset()
The DuReaderchecklist dataset uses the SQuAD data format, so we can write a function that reads the data file and pass it to load_dataset() to create the dataset.

In [3]
from paddlenlp.datasets import load_dataset
import json

# Define a data-reading generator matching the local file format
def read(filename):
    with open(filename, "r", encoding="utf8") as f:
        input_data = json.load(f)
        for entry in input_data['data']:
            title = entry.get("title", "").strip()
            for paragraph in entry["paragraphs"]:
                context = paragraph["context"].strip()
                for qa in paragraph["qas"]:
                    qas_id = qa["id"]
                    question = qa["question"].strip()
                    answer_starts = []
                    answers = []
                    is_impossible = False

                    if "is_impossible" in qa.keys():
                        is_impossible = qa["is_impossible"]

                    answer_starts = [
                        answer["answer_start"] for answer in qa.get("answers",[])
                    ]
                    answers = [
                        answer["text"].strip() for answer in qa.get("answers",[])
                    ]

                    yield {
                        'id': qas_id,
                        'title': title,
                        'context': context,
                        'question': question,
                        'answers': answers,
                        'answer_starts': answer_starts,
                        'is_impossible': is_impossible
                    }

# Pass the generator to load_dataset
train_ds = load_dataset(read, filename='dataset/train.json', lazy=False)
dev_ds = load_dataset(read, filename='dataset/dev.json', lazy=False)

for idx in range(2):
    print(train_ds[idx]['question'])
    print(train_ds[idx]['context'])
    print(train_ds[idx]['answers'])
    print(train_ds[idx]['answer_starts'])
    print(train_ds[idx]['is_impossible'])
    print()
属鼠小名多米的寓意
——使用叠字的小名是很好听的,对于刚刚出生的男孩来说,这样的名字是很方便记忆的,这也就拉近了孩子与家长之间的距离。糕,代表的是“用面粉所制成的食品”,那么此字在含义上是带有“多米”的含义,作为属鼠男孩的小名,寓意上是很不错的。
['']
[-1]
True

红烧肉煮鲍鱼的做法
将锅中倒入油,小火爆香蒜末,一片姜的姜末,倒入上一步骤中蒸鲍鱼的汤汁,倒入蚝油,鲍鱼汁,蒸鱼豉油,东古一品鲜,味极鲜,,鸡精,鸡粉调味。如果汤汁少,放入少许水。待汤汁快要沸腾时,放入鲍鱼,均匀的裹上汤汁。最后,将鲍鱼放入已经摆盘的西兰花上面,将汤汁淋到鲍鱼上面就大功告成啦🤓
['']
[-1]
True

For more on custom datasets, see How to Customize a Dataset.

Converting data to Features with paddlenlp.transformers.ErnieTokenizer
The DuReaderchecklist dataset uses the SQuAD data format, and InputFeatures are generated with a sliding-window method, so one example may produce multiple InputFeatures.

Because the combined length of the passage and the question may exceed max_seq_length, and the answer may appear near the end of the passage, the passage cannot simply be truncated.

Instead, for overlong passages, a sliding window splits the passage into multiple segments, each of which is paired with the question and then converted by the tokenizer into features the model accepts. The doc_stride parameter is the distance the window slides each time. (figure: generating InputFeatures with a sliding window)
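The sliding-window logic can be sketched in plain Python, with character positions standing in for tokens (the baseline does this inside prepare_train_features via the tokenizer; sliding_windows below is a hypothetical illustration):

```python
# An illustrative sketch of the sliding-window split described above, using
# positions instead of a real tokenizer. max_len and doc_stride play the same
# roles as max_seq_length and doc_stride in the baseline.

def sliding_windows(context_len, max_len, doc_stride):
    """Return (start, end) spans covering the context, stepping by doc_stride."""
    spans = []
    start = 0
    while True:
        end = min(start + max_len, context_len)
        spans.append((start, end))
        if end == context_len:
            break
        start += doc_stride
    return spans

# A 1000-token passage with max_len=512 and doc_stride=128 yields overlapping
# spans, so an answer near the end still falls entirely inside some window.
print(sliding_windows(1000, 512, 128))
# [(0, 512), (128, 640), (256, 768), (384, 896), (512, 1000)]
```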



In this baseline the pretrained model is ERNIE, which processes Chinese text at the character level. PaddleNLP ships a tokenizer for every supported pretrained model; specifying the model name loads the corresponding tokenizer.

The tokenizer converts raw input text into the input format the model accepts.

In [4]
import paddlenlp

# Name of the pretrained model to load
MODEL_NAME = 'ernie-1.0'
tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)
[2021-04-13 21:10:28,643] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 89/89 [00:00<00:00, 3553.28it/s]
Batch-processing the data with map()
Because we passed lazy=False, the dataset created by load_dataset() is a MapDataset object, a feature-enhanced version of paddle.io.Dataset. Its built-in map() method is well suited to batch data processing: it takes a data-processing function, which pairs naturally with the tokenizer.

Here is how the baseline uses it:

In [5]
from src.utils import prepare_train_features, prepare_validation_features
from functools import partial

max_seq_length = 512
doc_stride = 128

train_trans_func = partial(prepare_train_features, 
                           max_seq_length=max_seq_length, 
                           doc_stride=doc_stride,
                           tokenizer=tokenizer)

train_ds.map(train_trans_func, batched=True)

dev_trans_func = partial(prepare_validation_features, 
                           max_seq_length=max_seq_length, 
                           doc_stride=doc_stride,
                           tokenizer=tokenizer)
                           
dev_ds.map(dev_trans_func, batched=True)
<paddlenlp.datasets.dataset.MapDataset at 0x7ff7e7259050>
In [6]
for idx in range(2):
    print(train_ds[idx]['input_ids'])
    print(train_ds[idx]['token_type_ids'])
    print(train_ds[idx]['overflow_to_sample'])
    print(train_ds[idx]['offset_mapping'])
    print(train_ds[idx]['start_positions'])
    print(train_ds[idx]['end_positions'])
    print(train_ds[idx]['answerable_label'])
    print()
[1, 479, 1706, 96, 132, 65, 256, 5, 1804, 221, 2, 17963, 17963, 175, 29, 2053, 436, 5, 96, 132, 10, 321, 170, 818, 5, 30, 51, 37, 1082, 1082, 39, 21, 5, 654, 751, 61, 178, 30, 47, 314, 5, 132, 436, 10, 321, 58, 518, 374, 1347, 5, 30, 47, 105, 113, 630, 432, 15, 751, 85, 54, 50, 84, 46, 143, 5, 711, 417, 12043, 2578, 30, 140, 197, 5, 10, 23, 29, 76, 996, 110, 108, 33, 5, 494, 100, 24, 30, 312, 356, 198, 436, 11, 718, 393, 28, 10, 360, 9, 23, 65, 256, 24, 5, 718, 393, 30, 25, 13, 479, 1706, 654, 751, 5, 96, 132, 30, 1804, 221, 28, 10, 321, 16, 990, 5, 12043, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0
[(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (20, 21), (21, 22), (22, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 36), (36, 37), (37, 38), (38, 39), (39, 40), (40, 41), (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (46, 47), (47, 48), (48, 49), (49, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (0, 0)]
0
0
0

[1, 536, 1234, 805, 2086, 2838, 881, 5, 388, 72, 2, 174, 1603, 12, 1099, 109, 665, 30, 96, 610, 1380, 673, 2892, 989, 30, 7, 433, 1909, 5, 1909, 989, 30, 1099, 109, 28, 7, 439, 2416, 12, 1743, 2838, 881, 5, 1462, 1990, 30, 1099, 109, 5433, 665, 30, 2838, 881, 1990, 30, 1743, 881, 5340, 665, 30, 242, 422, 7, 100, 993, 30, 775, 456, 993, 30, 1237, 30, 1291, 326, 30, 1291, 996, 290, 775, 12043, 142, 228, 1462, 1990, 332, 30, 364, 109, 332, 576, 101, 12043, 849, 1462, 1990, 532, 41, 2603, 1782, 36, 30, 364, 109, 2838, 881, 30, 428, 1891, 5, 2741, 28, 1462, 1990, 12043, 134, 49, 30, 174, 2838, 881, 364, 109, 265, 60, 1698, 966, 5, 213, 784, 283, 28, 76, 30, 174, 1462, 1990, 1820, 45, 2838, 881, 28, 76, 113, 19, 369, 612, 33, 2340, 17963, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
1
[(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (20, 21), (21, 22), (22, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 36), (36, 37), (37, 38), (38, 39), (39, 40), (40, 41), (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (46, 47), (47, 48), (48, 49), (49, 50), (50, 51), (51, 52), (52, 53), (53, 54), (54, 55), (55, 56), (56, 57), (57, 58), (58, 59), (59, 60), (60, 61), (61, 62), (62, 63), (63, 64), (64, 65), (65, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 72), (72, 73), (73, 74), (74, 75), (75, 76), (76, 77), (77, 78), (78, 79), (79, 80), (80, 81), (81, 82), (82, 83), (83, 84), (84, 85), (85, 86), (86, 87), (87, 88), (88, 89), (89, 90), (90, 91), (91, 92), (92, 93), (93, 94), (94, 95), (95, 96), (96, 97), (97, 98), (98, 99), (99, 100), (100, 101), (101, 102), (102, 103), (103, 104), (104, 105), (105, 106), (106, 107), (107, 108), (108, 109), (109, 110), (110, 111), (111, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 117), (117, 118), (118, 119), (119, 120), (120, 121), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126), (126, 127), (127, 128), (128, 129), (129, 130), (130, 131), (131, 132), (132, 133), (133, 134), (134, 135), (135, 136), (136, 137), (137, 138), (0, 0)]
0
0
0

The output above shows that the examples in the dataset have been converted into features the model can accept, including input_ids, token_type_ids, answer start positions, and more. Specifically:

input_ids: the token IDs of the input text.
token_type_ids: whether each token belongs to the question or the passage (Transformer-style pretrained models accept single sentences as well as sentence pairs).
overflow_to_sample: the index of the example this feature came from.
offset_mapping: the start and end character index of each token in the original text (used to recover the answer text).
start_positions: the start position of the answer within this feature.
end_positions: the end position of the answer within this feature.
answerable_label: whether the answer is present in this feature, 1 if present and 0 otherwise.
For the details of the data processing in this baseline, see src/utils.py.

For more on data processing, see the data processing documentation.
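To see how offset_mapping is used at prediction time, here is a small hand-rolled sketch (the token indices and the one-character-per-token mapping are made up for illustration; the real mapping comes from the tokenizer):

```python
# Given predicted start/end token indices, the (char_start, char_end) pairs in
# offset_mapping index back into the original context to recover the answer
# string. Special tokens map to (0, 0).

context = "番石榴所含的脂肪约0.9克重或84卡路里"
# One (char_start, char_end) pair per token; [CLS] and [SEP] map to (0, 0).
offset_mapping = [(0, 0)] + [(i, i + 1) for i in range(len(context))] + [(0, 0)]

start_token, end_token = 1, 9   # hypothetical predicted token span
char_start = offset_mapping[start_token][0]
char_end = offset_mapping[end_token][1]
print(context[char_start:char_end])  # 番石榴所含的脂肪约
```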

Batchify and data loading
Use paddle.io.BatchSampler together with the methods provided in paddlenlp.data to assemble the data into batches.

Then use the paddle.io.DataLoader interface to load data asynchronously with multiple workers.

How batchify_fn works: (figure: batchify_fn diagram)
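The behavior of Pad and Stack can be sketched in plain Python (pad and stack below are simplified stand-ins for the paddlenlp.data versions, which return arrays rather than lists):

```python
# Pad right-pads variable-length id lists to the batch maximum with pad_val;
# Stack simply collects per-sample scalar fields into a batch.

def pad(batch, pad_val=0):
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in batch]

def stack(batch):
    return list(batch)

input_ids = [[1, 479, 1706, 2], [1, 536, 2]]   # two tokenized samples
start_positions = [0, 12]

print(pad(input_ids))          # [[1, 479, 1706, 2], [1, 536, 2, 0]]
print(stack(start_positions))  # [0, 12]
```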



In [7]
import paddle
from paddlenlp.data import Stack, Dict, Pad

batch_size = 2

# Initialize the BatchSamplers
train_batch_sampler = paddle.io.BatchSampler(
    train_ds, batch_size=batch_size, shuffle=True)

dev_batch_sampler = paddle.io.BatchSampler(
    dev_ds, batch_size=batch_size, shuffle=False)

# Define the batchify functions
train_batchify_fn = lambda samples, fn=Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), 
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
    "start_positions": Stack(dtype="int64"),  
    "end_positions": Stack(dtype="int64"),  
    "answerable_label": Stack(dtype="int64")  
}): fn(samples)

dev_batchify_fn = lambda samples, fn=Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), 
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
}): fn(samples)

# Initialize the DataLoaders
train_data_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_sampler=train_batch_sampler,
    collate_fn=train_batchify_fn,
    return_list=True)

dev_data_loader = paddle.io.DataLoader(
    dataset=dev_ds,
    batch_sampler=dev_batch_sampler,
    collate_fn=dev_batchify_fn,
    return_list=True)
Model setup and training
Loading a pretrained model with PaddleNLP in one line
This project uses ERNIE as an example to show how to fine-tune a pretrained model on the DuReaderchecklist reading comprehension task.

At its core, the DuReaderchecklist task combines answer extraction with sentence-pair classification: given the question and the passage, the model predicts the start and end positions of the answer in the passage, and whether an answer exists at all. (figure: model architecture)

In [8]
from src.models import ErnieForQuestionAnswering

model = ErnieForQuestionAnswering.from_pretrained(MODEL_NAME)
[2021-04-13 21:10:42,274] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-04-13 21:10:42,276] [    INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 390123/390123 [00:07<00:00, 51038.76it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier_cls.weight. classifier_cls.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1303: UserWarning: Skip loading for classifier_cls.bias. classifier_cls.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
Designing the loss function
The ErnieForQuestionAnswering model splits the underlying model's sequence_output into start_logits and end_logits, and outputs pooled_output as cls_logits. The DuReaderchecklist loss therefore has three parts: start_loss, end_loss, and cls_loss, and we need to define our own loss function.

Predicting the answer start position, the answer end position, and whether an answer exists can each be treated as a classification task, so the loss function is designed as follows:
In [9]
class CrossEntropyLossForChecklist(paddle.nn.Layer):
    def __init__(self):
        super(CrossEntropyLossForChecklist, self).__init__()
        
    def forward(self, y, label):
        start_logits, end_logits, cls_logits = y
        start_position, end_position, answerable_label = label
        start_position = paddle.unsqueeze(start_position, axis=-1)
        end_position = paddle.unsqueeze(end_position, axis=-1)
        answerable_label = paddle.unsqueeze(answerable_label, axis=-1)
        start_loss = paddle.nn.functional.softmax_with_cross_entropy(
            logits=start_logits, label=start_position, soft_label=False)
        start_loss = paddle.mean(start_loss)
        end_loss = paddle.nn.functional.softmax_with_cross_entropy(
            logits=end_logits, label=end_position, soft_label=False)
        end_loss = paddle.mean(end_loss)
        cls_loss = paddle.nn.functional.softmax_with_cross_entropy(
            logits=cls_logits, label=answerable_label, soft_label=False)
        cls_loss = paddle.mean(cls_loss)
        mrc_loss = (start_loss + end_loss) / 2
        loss = (mrc_loss + cls_loss) /2
        return loss
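The weighting implied by the loss above can be checked with a small numeric sketch, using a pure-Python softmax cross-entropy in place of paddle's (the logits are toy values, purely illustrative):

```python
# The span losses are averaged first (mrc_loss), then averaged again with the
# classification loss, so start/end/cls effectively contribute 1/4, 1/4, 1/2.
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample: log-sum-exp minus the true logit."""
    log_z = math.log(sum(math.exp(x) for x in logits))
    return log_z - logits[label]

start_loss = cross_entropy([2.0, 0.5, 0.1], 0)   # toy start logits
end_loss = cross_entropy([0.3, 1.8, 0.2], 1)     # toy end logits
cls_loss = cross_entropy([1.2, -0.4], 0)         # toy answerable logits

mrc_loss = (start_loss + end_loss) / 2
loss = (mrc_loss + cls_loss) / 2
# Same thing written as explicit weights:
assert abs(loss - (start_loss / 4 + end_loss / 4 + cls_loss / 2)) < 1e-12
print(round(loss, 4))
```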
Model configuration
Configuring the optimization strategy
For Transformer models such as ERNIE and BERT, a dynamic learning rate with warmup works well.

Figure 3: dynamic learning-rate schedule
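The shape of this schedule can be sketched in plain Python (lr_at_step is a hypothetical stand-in approximating what LinearDecayWithWarmup computes; it is not the paddle implementation):

```python
# The learning rate ramps up linearly over the first warmup_proportion of
# steps, then decays linearly to zero over the remaining steps.

def lr_at_step(step, base_lr, total_steps, warmup_proportion):
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

base_lr, total_steps = 3e-5, 1000
print(lr_at_step(0, base_lr, total_steps, 0.1))     # 0.0
print(lr_at_step(100, base_lr, total_steps, 0.1))   # 3e-05 (end of warmup)
print(lr_at_step(550, base_lr, total_steps, 0.1))   # half-way through decay
print(lr_at_step(1000, base_lr, total_steps, 0.1))  # 0.0
```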

In [10]
# Peak learning rate during training
learning_rate = 3e-5 

# Number of training epochs
epochs = 2

# Proportion of training steps used for learning-rate warmup
warmup_proportion = 0.1

# Weight-decay coefficient, a regularization strategy to mitigate overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs

# Learning-rate decay schedule
lr_scheduler = paddlenlp.transformers.LinearDecayWithWarmup(learning_rate, num_training_steps,
                                    warmup_proportion)

decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)
Model training
Training typically follows these steps:

Take a batch of data from the dataloader.
Feed the batch to the model for a forward pass.
Pass the forward results to the loss function to compute the loss.
Backpropagate the loss and update the parameters; repeat the steps above.
After each training epoch, the program calls compute_prediction_checklist() via evaluate() to produce a submittable answer file.

In [11]
from src.utils import evaluate

criterion = CrossEntropyLossForChecklist()
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        
        global_step += 1
        input_ids, token_type_ids, start_positions, end_positions, answerable_label = batch
        logits = model(input_ids=input_ids, token_type_ids=token_type_ids)
        loss = criterion(logits, (start_positions, end_positions,answerable_label))

        if global_step % 100 == 0 :
            print("global step %d, epoch: %d, batch: %d, loss: %.5f" % (global_step, epoch, step, loss))
            
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

evaluate(model, dev_data_loader)
global step 100, epoch: 1, batch: 100, loss: 3.13126
global step 200, epoch: 1, batch: 200, loss: 1.07519
global step 300, epoch: 1, batch: 300, loss: 2.92307
global step 400, epoch: 1, batch: 400, loss: 1.49061
global step 500, epoch: 1, batch: 500, loss: 0.80704
global step 600, epoch: 1, batch: 600, loss: 1.24086
global step 700, epoch: 1, batch: 700, loss: 1.70297
global step 800, epoch: 1, batch: 800, loss: 1.46156
global step 900, epoch: 1, batch: 900, loss: 1.65384
global step 1000, epoch: 1, batch: 1000, loss: 0.69806
global step 1100, epoch: 1, batch: 1100, loss: 0.95720
global step 1200, epoch: 1, batch: 1200, loss: 1.03567
global step 1300, epoch: 1, batch: 1300, loss: 0.94834
global step 1400, epoch: 1, batch: 1400, loss: 1.19020
global step 1500, epoch: 1, batch: 1500, loss: 1.16149
global step 1600, epoch: 1, batch: 1600, loss: 2.81661
global step 1700, epoch: 2, batch: 31, loss: 1.17132
global step 1800, epoch: 2, batch: 131, loss: 0.63365
global step 1900, epoch: 2, batch: 231, loss: 2.26916
global step 2000, epoch: 2, batch: 331, loss: 0.57249
global step 2100, epoch: 2, batch: 431, loss: 0.83320
global step 2200, epoch: 2, batch: 531, loss: 0.63347
global step 2300, epoch: 2, batch: 631, loss: 0.79612
global step 2400, epoch: 2, batch: 731, loss: 1.12112
global step 2500, epoch: 2, batch: 831, loss: 0.78194
global step 2600, epoch: 2, batch: 931, loss: 0.71913
global step 2700, epoch: 2, batch: 1031, loss: 0.38682
global step 2800, epoch: 2, batch: 1131, loss: 0.66487
global step 2900, epoch: 2, batch: 1231, loss: 0.61467
global step 3000, epoch: 2, batch: 1331, loss: 2.53447
global step 3100, epoch: 2, batch: 1431, loss: 1.17716
global step 3200, epoch: 2, batch: 1531, loss: 1.01474
global step 3300, epoch: 2, batch: 1631, loss: 0.79080
Processing example: 1000
time per 1000: 11.89399242401123

问题: 张家港汽车站在哪里
原文: 1.张家港北站:位于南丰镇辖区内,北至内河泗兴港,南至市铁路专用线,西至沪通铁路,东至规划经四路(双丰公路西侧),面积约2.6平方公里。 是货运站,沪通铁路沿线办理货运的中间站,设正线2,到发线3,有效长1050m。运输品类主要为集装箱、零担、笨重粗杂等。还设货场一处,牵出线1,有效长350m。货场初期占地185,其中围墙内装卸区100.7亩。货场内设货物线两条,具备笨重粗杂线装卸、仓库站台线装、粗杂货区装卸和仓库等功能。 站点最新进展:目前,张家港北站范围的路基施工作业已完成,货场暂时用作铺轨基地,站区配套用房计划与张家港站同步实施。图片来源:张家港新闻 2.张家港站:位于塘桥镇新204国道东侧,人民路北侧。 站前广场为两层结构,负二层是地铁车站,负一层是地下停车场。站前广场工程涉及地上空间广场、道路以及地下停车场与地铁站,其中,地下停车场可容纳800辆车停放。本项目建设内容分为站前核心区(主要为广场和道路等)和铁路站场桥下区(主要为停车场和地铁区间等),总用地面积约7.4公顷,总建筑面积约6.7万㎡。 站点最新进展:202071,站房和站前广场将与沪通铁路同步投运。 ▽效果图
答案: no answer

问题: 高铁站可以充电吗
原文: 高铁和动车上是可以充电的,充电插头就在座位下边或者是前边。高铁动车上的充电插座排布与车型新旧有关。有些车座位是每排座位两个电源插座,有些新型车比如说“复兴号”是每两个座位有一个电源。祝旅途愉快!
答案: 可以

问题: 鳄鱼宝宝吃什么
原文: 鳄鱼是高等的爬行动物,平时喜欢栖息在湖泊沼泽的滩地或丘陵山涧乱草蓬蒿中的潮湿地带,看起来它比较迟钝,但是发现食物后会特别聪明。当它在水中发现动物后,它会悄悄把自己的身体躲到水底,然后慢慢的向动物的方向游过去,当捕捉合适的时候会纵身一跃很快的捉住动物。当它把动物叼在嘴里后,会把动物拖进水中让其溺死,这样动物死后就可以成为它们的美餐了。
答案: no answer

问题: 股票怎么赚钱?
原文: 证券交易软件pc电脑版:交易—信用交易—担保品管理—担保品划入/划出;证券交易软件app手机版:交易—信用—担保品划转—划入/划出。划转证券的数量可为1股(或1份)的整数倍。
答案: no answer

问题: 梦见老婆在洗澡
原文: 梦见和别人一起洗澡,提醒你要慎重交友,避免交友不慎。梦见和别人洗澡,是在警告你应该避免结交品性不良的坏朋友,否则你很有可能跟坏朋友一起去做一些不良勾当,品德从此就染上了污点。
答案: no answer
Model evaluation
This competition uses F1-score and Exact Match (EM) as evaluation metrics, with F1 as the primary metric and EM as an auxiliary one.
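A simplified sketch of the two metrics follows (the official run_eval.sh script additionally handles answer normalization, multiple references, and the no-answer cases, so this is only illustrative; the F1 here is computed over character overlap):

```python
# EM checks exact string equality; F1 compares the character multisets of the
# prediction and the reference.
from collections import Counter

def exact_match(pred, ref):
    return float(pred == ref)

def char_f1(pred, ref):
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "一个番石榴所含的脂肪约0.9克重或84卡路里"
pred = "番石榴所含的脂肪约0.9克重或84卡路里"
print(exact_match(pred, ref))        # 0.0: not an exact match
print(round(char_f1(pred, ref), 3))  # high F1: large character overlap
```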

To run the evaluation script:

sh run_eval.sh dataset_file pred_file
where dataset_file is the dataset file and pred_file is the model's prediction file, for example

sh run_eval.sh dataset/dev.json prediction.json
In [12]
!sh run_eval.sh dataset/dev.json prediction.json
{"F1": "62.451", "EM": "53.186", "TOTAL": 1130, "SKIP": 0}
{"F1": "64.457", "EM": "55.200", "TOTAL": 1000, "SKIP": 0, "TAG": "in-domain"}
{"F1": "41.255", "EM": "40.000", "TOTAL": 35, "SKIP": 0, "TAG": "vocab"}
{"F1": "60.172", "EM": "57.143", "TOTAL": 35, "SKIP": 0, "TAG": "phrase"}
{"F1": "30.380", "EM": "10.000", "TOTAL": 20, "SKIP": 0, "TAG": "semantic-role"}
{"F1": "39.413", "EM": "25.000", "TOTAL": 20, "SKIP": 0, "TAG": "fault-tolerant"}
{"F1": "58.370", "EM": "40.000", "TOTAL": 20, "SKIP": 0, "TAG": "reasoning"}
How to improve the results
Data augmentation
The DuReader-robust dataset built into PaddleNLP has a format similar to this competition's dataset, so it can be loaded in one line with PaddleNLP's load_dataset() for data augmentation (some processing is needed to align the formats).
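One possible shape for that format alignment, sketched in plain Python (robust_to_checklist is a hypothetical adapter, not part of the baseline; field names follow the read() function defined earlier):

```python
# dureader_robust samples lack the title and is_impossible fields the
# checklist reader uses, so a small adapter can fill them in before mixing
# the two datasets.

def robust_to_checklist(sample):
    """Add the fields the checklist reader expects to a dureader_robust sample."""
    return {
        "id": sample.get("id", ""),
        "title": sample.get("title", ""),              # robust samples have no title
        "context": sample["context"],
        "question": sample["question"],
        "answers": sample["answers"],
        "answer_starts": sample["answer_starts"],
        "is_impossible": len(sample["answers"]) == 0,  # robust samples are answerable
    }

robust_sample = {
    "id": "r1",
    "context": "第35集雪见缓缓张开眼睛……",
    "question": "仙剑奇侠传3第几集上天界",
    "answers": ["第35集"],
    "answer_starts": [0],
}
print(robust_to_checklist(robust_sample)["is_impossible"])  # False
```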

In [13]
from paddlenlp.datasets import load_dataset

train_robust, dev_robust = load_dataset('dureader_robust',splits=('train','dev'))

for idx in range(2):
    print(train_robust[idx]['question'])
    print(train_robust[idx]['context'])
    print(train_robust[idx]['answers'])
    print(train_robust[idx]['answer_starts'])
    print()
2021-04-13 21:15:14,529 - INFO - unique_endpoints {''}
2021-04-13 21:15:14,530 - INFO - Downloading dureader_robust-data.tar.gz from https://dataset-bj.cdn.bcebos.com/qianyan/dureader_robust-data.tar.gz
100%|██████████| 20038/20038 [00:00<00:00, 65768.78it/s]
2021-04-13 21:15:14,959 - INFO - Decompressing /home/aistudio/.paddlenlp/datasets/DuReaderRobust/dureader_robust-data.tar.gz...
仙剑奇侠传3第几集上天界
第35集雪见缓缓张开眼睛,景天又惊又喜之际,长卿和紫萱的仙船驶至,见众人无恙,也十分高兴。众人登船,用尽合力把自身的真气和水分输给她。雪见终于醒过来了,但却一脸木然,全无反应。众人向常胤求助,却发现人世界竟没有雪见的身世纪录。长卿询问清微的身世,清微语带双关说一切上了天界便有答案。长卿驾驶仙船,众人决定立马动身,往天界而去。众人来到一荒山,长卿指出,魔界和天界相连。由魔界进入通过神魔之井,便可登天。众人至魔界入口,仿若一黑色的蝙蝠洞,但始终无法进入。后来花楹发现只要有翅膀便能飞入。于是景天等人打下许多乌鸦,模仿重楼的翅膀,制作数对翅膀状巨物。刚佩戴在身,便被吸入洞口。众人摔落在地,抬头发现魔界守卫。景天和众魔套交情,自称和魔尊重楼相熟,众魔不理,打了起来。
['第35集']
[0]

燃气热水器哪个牌子好
选择燃气热水器时,一定要关注这几个问题:1、出水稳定性要好,不能出现忽热忽冷的现象2、快速到达设定的需求水温3、操作要智能、方便4、安全性要好,要装有安全报警装置 市场上燃气热水器品牌众多,购买时还需多加对比和仔细鉴别。方太今年主打的磁化恒温热水器在使用体验方面做了全面升级:9秒速热,可快速进入洗浴模式;水温持久稳定,不会出现忽热忽冷的现象,并通过水量伺服技术将出水温度精确控制在±0.5℃,可满足家里宝贝敏感肌肤洗护需求;配备CO和CH4双气体报警装置更安全(市场上一般多为CO单气体报警)。另外,这款热水器还有智能WIFI互联功能,只需下载个手机APP即可用手机远程操作热水器,实现精准调节水温,满足家人多样化的洗浴需求。当然方太的磁化恒温系列主要的是增加磁化功能,可以有效吸附水中的铁锈、铁屑等微小杂质,防止细菌滋生,使沐浴水质更洁净,长期使用磁化水沐浴更利于身体健康。
['方太']
[110]

Using a larger model
Besides ERNIE, PaddleNLP supports pretrained models such as BERT and RoBERTa, all loaded the same way as ERNIE. Using a larger model (e.g. roberta-wwm-ext-large) usually yields better results.

More pretrained models: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/transformers.md

Adversarial training
Generate adversarial samples by adding perturbation vectors to the word embeddings.

Then retrain the model on them to make it more robust.
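One common way to implement this idea is an FGM-style perturbation; the sketch below is illustrative and not part of the baseline. It scales the embedding gradient to a fixed L2 norm:

```python
# The perturbation added to a word embedding is the gradient direction scaled
# to L2 norm epsilon: r = epsilon * g / ||g||. In practice g comes from a
# backward pass through the model; here it is a toy vector.
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2, the perturbation added to embeddings."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

grad = [3.0, 4.0]                        # toy embedding gradient
r = fgm_perturbation(grad, epsilon=1.0)
print(r)                                 # [0.6, 0.8], an L2-norm-1 step
```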

Knowledge distillation, model ensembling, and more
The baseline above is built on PaddleNLP. Open source takes effort; please support the project and give PaddleNLP a Star ⭐

GitHub: https://github.com/PaddlePaddle/PaddleNLP

For more usage, see the PaddleNLP tutorials:

Sentence sentiment classification with the seq2vec module
Improving sentiment analysis with the pretrained model ERNIE
Waybill information extraction with a BiGRU-CRF model
Improving waybill information extraction with the pretrained model ERNIE
Generating couplets automatically with a Seq2Seq model
Writing poetry automatically with the pretrained model ERNIE-GEN
Forecasting COVID-19 case counts with a TCN
Multi-class text classification on a custom dataset