


NLP - Classification Models - 2016 - Document Classification: HAN Attention [hierarchical attention applied to long, document-level data (an LSTM/GRU can handle documents of at most about 300 tokens); HAN can also be used in other domains]


Original paper: 《Hierarchical Attention Networks for Document Classification》

一、Overview

The HAN model is inspired by the way people read documents: different words and sentences contribute differently to a reader's understanding of a document. Moreover, the importance of words and sentences is closely tied to context; even the same word or sentence can carry different importance in different contexts. When reading an article, a person pays different degrees of attention to different parts of the document.

1、Background of 《Hierarchical Attention Networks for Document Classification》

  1. Text classification is one of the fundamental tasks in natural language processing, and recent work has increasingly turned to deep-learning-based text classification models.
  2. Although deep-learning-based text classifiers achieve very good results, they ignore the structure of documents and do not account for the fact that different parts of a document contribute differently to the classification.
  3. To address this, we propose a hierarchical attention network that learns the hierarchical structure of documents and uses two attention mechanisms to learn context-dependent importance.
  4. The main difference from previous work is that we use context to determine the importance of a sentence or word, rather than considering a single sentence or word in isolation.

2、Abstract of 《Hierarchical Attention Networks for Document Classification》

  1. This paper proposes a hierarchical attention network for document classification, which has two distinctive features.
  2. The first is that its hierarchical structure mirrors the hierarchical structure of documents.
  3. The second is that it applies attention at both the word level and the sentence level, allowing the network to single out the important parts of a document and thereby build a better document representation.
    • Word-level attention identifies the important words within a sentence;
    • Sentence-level attention identifies the important sentences within a document;
  4. Experiments on six large-scale datasets show that the model substantially improves document classification performance.
    [Figure: experimental results on the six datasets]
  5. Visualization confirms that the model indeed selects the important sentences and words in a document: the redder a sentence, the more important it is; the bluer a word, the more important it is.
    [Figure: visualization of sentence- and word-level attention weights]

3、Research Significance of 《Hierarchical Attention Networks for Document Classification》

  1. Attention-based text classification models received a great deal of attention after this work.
  2. Processing long documents in a hierarchical way became increasingly popular.
  3. It promoted the use of attention mechanisms in non-Seq2Seq models.

4、Key Insights from 《Hierarchical Attention Networks for Document Classification》

  • The intuition behind the model is that not all parts of a document are equally important for answering a query, and identifying the relevant parts requires modeling the interactions between words, not just treating each word in isolation.
    The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation. (Introduction, P2)
  • Moreover, the importance of words and sentences is highly context dependent: the same word or sentence may be more or less important in different contexts.
    Moreover, the importance of words and sentences are highly context dependent, i.e. the same word or sentence may be differentially important in different context. (Introduction, P3)

二、HAN Attention Model Structure

[Figure: HAN model architecture]
The model itself is fairly simple and easy to understand. Read from bottom to top, it is just a stack of layers (a minimal shape-flow sketch follows the list below):

  • word encoder: encode the words of each sentence as word vectors, then run a bidirectional GRU that aggregates information from both directions to obtain word annotations, so that contextual information flows into the sentence representation.
  • word attention: apply an attention mechanism over the word annotations to aggregate them into a sentence vector.
  • sentence encoder: analogously, run a bidirectional GRU over the sentence vectors to obtain sentence annotations.
  • sentence attention: apply an attention mechanism over the sentence annotations to form the document vector.
  • softmax: the standard output layer that produces the classification result.
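
A rough, shape-level view of that stack (dimensions as used in the implementation in section 五; B = batch size, S = sentences per document, T = words per sentence, H = GRU hidden size, C = number of classes):

documents [B, S, T] (word ids)
  → word encoder (embedding + bidirectional GRU per sentence) → word annotations [B·S, T, 2H]
  → word attention (learned query u_w) → sentence vectors [B, S, 2H]
  → sentence encoder (bidirectional GRU over sentence vectors) → sentence annotations [B, S, 2H]
  → sentence attention (learned query u_s) → document vector [B, 2H]
  → softmax classifier → class probabilities [B, C]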

1、Word Encoder


  • Given a sentence with words $w_{it}$, $t \in [1, T]$, we first embed the words to vectors through an embedding matrix $W_e$: $x_{ij} = W_e w_{ij}$.
  • The bidirectional GRU contains a forward $\overrightarrow{GRU}$ which reads the sentence $s_i$ from $w_{i1}$ to $w_{iT}$ and a backward $\overleftarrow{GRU}$ which reads from $w_{iT}$ to $w_{i1}$.
  • $h_{it} = [\overrightarrow{h_{it}}, \overleftarrow{h_{it}}]$, which summarizes the information of the whole sentence centered around $w_{it}$.

2、Word Attention

Not all words contribute equally to the representation of the sentence meaning.

Hence, we introduce attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.

  • A Query vector $u_w$ is introduced (randomly initialized and continually optimized during training).
    • The context vector $u_w$ can be seen as a high-level representation of a fixed query “what is the informative word” over the words.
    • The word context vector $u_w$ is randomly initialized and jointly learned during the training process.
    • In other words, $u_w$ is trained together with the rest of the model by gradient descent.
  • The Key can be $h_{it}$ directly, or $h_{it}$ can be passed through a fully connected layer to obtain $u_{it}$, which is then used as the Key.
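
Concretely, the word-attention equations in the paper are (with $u_w$ the word-level query/context vector):

$$
\begin{aligned}
u_{it} &= \tanh(W_w h_{it} + b_w) \\
\alpha_{it} &= \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \\
s_i &= \sum_t \alpha_{it} h_{it}
\end{aligned}
$$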

3、Sentence Encoder

$h_i$ summarizes the neighboring sentences around sentence $i$ but still focuses on sentence $i$.
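
The sentence-encoder equations in the paper mirror the word encoder, with the sentence vectors $s_i$, $i \in [1, L]$ ($L$ sentences per document), as inputs:

$$
\begin{aligned}
\overrightarrow{h_i} &= \overrightarrow{GRU}(s_i), \quad i \in [1, L] \\
\overleftarrow{h_i} &= \overleftarrow{GRU}(s_i), \quad i \in [L, 1] \\
h_i &= [\overrightarrow{h_i}, \overleftarrow{h_i}]
\end{aligned}
$$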

4、Sentence Attention


  • A Query vector $u_s$ is introduced (randomly initialized and continually optimized during training).
    • The context vector $u_s$ can be seen as a high-level representation of a fixed query “what is the informative sentence” over the sentences.
    • The sentence-level context vector $u_s$ is randomly initialized and jointly learned during the training process.
    • In other words, $u_s$ is trained together with the rest of the model by gradient descent.
  • The Key can be $h_i$ directly, or $h_i$ can be passed through a fully connected layer to obtain $u_i$, which is then used as the Key.
  • $v$ is the document vector that summarizes all the information of the sentences in a document.
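
The sentence-attention and classification equations in the paper are (with $u_s$ the sentence-level query/context vector):

$$
\begin{aligned}
u_i &= \tanh(W_s h_i + b_s) \\
\alpha_i &= \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)} \\
v &= \sum_i \alpha_i h_i \\
p &= \mathrm{softmax}(W_c v + b_c)
\end{aligned}
$$

Training minimizes the negative log-likelihood of the correct labels, $L = -\sum_d \log p_{dj}$, where $j$ is the true label of document $d$.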

三、Applications of the HAN Attention Model

  1. Attention mechanisms became widely used in text classification and in other classification and sentence-representation tasks.
  2. It promoted the use of attention in non-Seq2Seq settings. [There is no need, as in Seq2Seq, to use the decoder's previous-step output as the Query; a Query vector can simply be randomly initialized and trained jointly with the model (see the sketch after this list).]
  3. Hierarchical attention is applied to document-level data [document-level inputs are very long, and an LSTM/GRU can only handle documents of roughly 200-300 tokens].
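
To make point 2 concrete, here is a minimal, illustrative PyTorch sketch (not taken from the paper's code or from the implementation below) of attention pooling with a randomly initialized query vector that is trained jointly with the model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAttentionPooling(nn.Module):
    """Pool a sequence of hidden states with a learned query vector (illustrative sketch)."""
    def __init__(self, hidden_size):
        super(QueryAttentionPooling, self).__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)      # maps each state h_t to a key u_t
        self.query = nn.Parameter(torch.randn(hidden_size))   # randomly initialized query, learned jointly

    def forward(self, h):                     # h: [batch, seq_len, hidden_size]
        u = torch.tanh(self.dense(h))         # keys: [batch, seq_len, hidden_size]
        scores = torch.matmul(u, self.query)  # [batch, seq_len]
        alpha = F.softmax(scores, dim=1)      # attention weights over the sequence
        return torch.sum(alpha.unsqueeze(-1) * h, dim=1)      # weighted sum: [batch, hidden_size]

if __name__ == "__main__":
    pooled = QueryAttentionPooling(100)(torch.randn(4, 20, 100))
    print(pooled.shape)   # torch.Size([4, 100])

HAN applies this pattern twice: once with $u_w$ over the word annotations and once with $u_s$ over the sentence annotations.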

四、Related Models

1、Document-Level Sentiment Classification

《Document modeling with gated recurrent neural network for sentiment classification》

五、HAN Attention Model Code Implementation

1、Data Loading
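
The loader below expects the sentence-split IMDB files (imdb-train.txt.ss / imdb-test.txt.ss). Judging from the parsing code, each line holds tab-separated fields, with the rating label in the third field and the review text in the last field, and sentences inside a review are delimited by the special token <sssss>.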

get_word2vec.py

from gensim.models import KeyedVectors
from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = []
datas = open("./imdb/imdb-train.txt.ss", encoding="utf-8").read().splitlines()
# each line is tab-separated ("\t\t"); the review text is the last field
datas = [data.split("\t\t")[-1].split() for data in datas]
# train 200-dimensional vectors so they match embed_size=200 in IMDB_Data.get_word2vec
# (for gensim < 4.0 the keyword is `size` instead of `vector_size`)
model = word2vec.Word2Vec(datas, min_count=5, vector_size=200)
model.wv.save_word2vec_format('imdb.model', binary=True)
wvmodel = KeyedVectors.load_word2vec_format("imdb.model",binary=True)
print (wvmodel.get_vector("good"))

IMDB_Data_Loader.py

#coding:utf-8
from torch.utils import data
import os
import torch
import numpy as np
from gensim.models import KeyedVectors
class IMDB_Data(data.Dataset):  # a map-style Dataset; it is wrapped by torch.utils.data.DataLoader below
    def __init__(self,data_name,min_count,word2id = None,max_sentence_length = 100,batch_size=64,is_pretrain=False):
        self.path = os.path.abspath(".")
        if "data" not in self.path:
            self.path += "/data"
        self.data_name = "/imdb/"+data_name
        self.min_count = min_count
        self.word2id = word2id
        self.max_sentence_length = max_sentence_length
        self.batch_size = batch_size
        self.datas,self.labels= self.load_data()
        if is_pretrain:
            self.get_word2vec()
        else:
            self.weight=None
        for i in range(len(self.datas)):
            self.datas[i] = np.array(self.datas[i])

    def load_data(self):
        datas = open(self.path+self.data_name,encoding="utf-8").read().splitlines()
        # each line is tab-separated ("\t\t"): the third field is the label, the last field is the review text
        datas = [data.split("\t\t")[-1].split()+[data.split("\t\t")[2]] for data in datas]
        datas = sorted(datas,key = lambda x:len(x),reverse=True)
        labels  = [int(data[-1])-1 for data in datas]
        datas = [data[0:-1] for data in datas]
        if self.word2id is None:
            self.get_word2id(datas)
        for i,data in enumerate(datas):
            datas[i] = " ".join(data).split("<sssss>")
            for j,sentence in enumerate(datas[i]):
                datas[i][j] = sentence.split()
        datas = self.convert_data2id(datas)
        return datas,labels
    def get_word2id(self,datas):
        word_freq = {}
        for data in datas:
            for word in data:
                word_freq[word] = word_freq.get(word,0)+1
        word2id = {"<pad>":0,"<unk>":1}
        for word in word_freq:
            if word_freq[word]<self.min_count:
                continue
            else:
                word2id[word] = len(word2id)
        self.word2id = word2id
    def convert_data2id(self,datas):
        for i,document in enumerate(datas):
            if i%10000==0:
                print (i,len(datas))
            for j,sentence in enumerate(document):
                for k,word in enumerate(sentence):
                    datas[i][j][k] = self.word2id.get(word,self.word2id["<unk>"])
                datas[i][j] = datas[i][j][0:self.max_sentence_length] + \
                              [self.word2id["<pad>"]]*(self.max_sentence_length-len(datas[i][j]))
        for i in range(0,len(datas),self.batch_size):
            max_data_length = max([len(x) for x in datas[i:i+self.batch_size]])
            for j in range(i,min(i+self.batch_size,len(datas))):
                datas[j] = datas[j] + [[self.word2id["<pad>"]]*self.max_sentence_length]*(max_data_length-len(datas[j]))
        return datas

    def get_word2vec(self):
        '''
        Build the embedding matrix from the pretrained word2vec vectors.
        :return: None (the weights built for the vocabulary are stored in self.weight)
        '''
        print("Reading word2vec Embedding...")
        wvmodel = KeyedVectors.load_word2vec_format(self.path + "/imdb.model",binary=True)
        tmp = []
        for word, index in self.word2id.items():
            try:
                tmp.append(wvmodel.get_vector(word))
            except:
                pass
        mean = np.mean(np.array(tmp))
        std = np.std(np.array(tmp))
        print(mean, std)
        vocab_size = len(self.word2id)
        embed_size = 200  # must match the dimensionality of the pretrained word2vec vectors
        np.random.seed(2)
        embedding_weights = np.random.normal(mean, std, [vocab_size, embed_size])  # initialize from a normal distribution
        for word, index in self.word2id.items():
            try:
                embedding_weights[index, :] = wvmodel.get_vector(word)
            except:
                pass
        self.weight = torch.from_numpy(embedding_weights).float()

    def __getitem__(self, idx):
        return self.datas[idx], self.labels[idx]

    def __len__(self):
        return len(self.labels)
if __name__=="__main__":
    imdb_data = IMDB_Data(data_name="imdb-train.txt.ss",min_count=5,is_pretrain=True)
    training_iter = torch.utils.data.DataLoader(dataset=imdb_data,
                                                batch_size=64,
                                                shuffle=False,
                                                num_workers=0)
    for data, label in training_iter:
        print (np.array(data).shape)

2、Building the HAN Attention Model

HAN_Model.py

# -*- coding: utf-8 -*-
import torch
import torch.nn as nn
import numpy as np
from torch.nn import functional as F
class HAN_Model(nn.Module):
    def __init__(self,vocab_size,embedding_size,gru_size,class_num,is_pretrain=False,weights=None):
        super(HAN_Model, self).__init__()
        if is_pretrain:
            self.embedding = nn.Embedding.from_pretrained(weights, freeze=False)
        else:
            self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.word_gru = nn.GRU(input_size=embedding_size,hidden_size=gru_size,num_layers=1,
                               bidirectional=True,batch_first=True)
        self.word_context = nn.Parameter(torch.randn(2*gru_size, 1),requires_grad=True)  # word-level query u_w (randomly initialized)
        self.word_dense = nn.Linear(2*gru_size,2*gru_size)

        self.sentence_gru = nn.GRU(input_size=2*gru_size,hidden_size=gru_size,num_layers=1,
                               bidirectional=True,batch_first=True)
        self.sentence_context = nn.Parameter(torch.randn(2*gru_size, 1),requires_grad=True)  # sentence-level query u_s (randomly initialized)
        self.sentence_dense = nn.Linear(2*gru_size,2*gru_size)
        self.fc = nn.Linear(2*gru_size,class_num)
    def forward(self, x, gpu=False):
        # x: [batch, sentence_num, sentence_length] word ids
        sentence_num = x.shape[1]
        sentence_length = x.shape[2]
        x = x.view([-1,sentence_length])                         # treat every sentence as one sequence
        x_embedding = self.embedding(x)                          # [batch*sentence_num, sentence_length, embedding_size]
        word_outputs, word_hidden = self.word_gru(x_embedding)   # [batch*sentence_num, sentence_length, 2*gru_size]
        attention_word_outputs = torch.tanh(self.word_dense(word_outputs))  # keys u_it
        weights = torch.matmul(attention_word_outputs,self.word_context)    # scores u_it^T u_w
        weights = F.softmax(weights,dim=1)                       # word attention weights
        # mask out <pad> positions (word id 0) and renormalize the attention weights
        x = x.unsqueeze(2)
        if gpu:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float).cuda())
        else:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float))
        weights = weights / (torch.sum(weights, dim=1).unsqueeze(1) + 1e-4)

        # weighted sum over words -> sentence vectors: [batch, sentence_num, 2*gru_size]
        sentence_vector = torch.sum(word_outputs*weights,dim=1).view([-1,sentence_num,word_outputs.shape[-1]])
        sentence_outputs, sentence_hidden = self.sentence_gru(sentence_vector)
        attention_sentence_outputs = torch.tanh(self.sentence_dense(sentence_outputs))
        weights = torch.matmul(attention_sentence_outputs,self.sentence_context)
        weights = F.softmax(weights,dim=1)
        # rebuild a sentence-level mask: a sentence is entirely <pad> iff its word-id sum is 0
        x = x.view(-1, sentence_num, x.shape[1])
        x = torch.sum(x, dim=2).unsqueeze(2)
        if gpu:
            weights = torch.where(x!=0,weights,torch.full_like(x,0,dtype=torch.float).cuda())
        else:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float))
        weights = weights / (torch.sum(weights,dim=1).unsqueeze(1)+1e-4)
        document_vector = torch.sum(sentence_outputs*weights,dim=1)   # [batch, 2*gru_size]
        output = self.fc(document_vector)                             # [batch, class_num]
        return output

if __name__=="__main__":
    han_model = HAN_Model(vocab_size=30000,embedding_size=200,gru_size=50,class_num=4)
    x = torch.Tensor(np.zeros([64,50,100])).long()   # 64 documents, 50 sentences, 100 words each
    x[0][0][0:10] = 1
    output = han_model(x)
    print (output.shape)   # expected: torch.Size([64, 4])

3、Model Training and Evaluation

config.py

# -*- coding: utf-8 -*-

import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--embed_size', type=int, default=10, help="embedding size of word embedding")
    parser.add_argument("--epoch",type=int,default=200,help="epoch of training")
    parser.add_argument("--cuda",type=bool,default=True,help="whether use gpu")
    parser.add_argument("--gpu",type=int,default=2,help="gpu num")
    parser.add_argument("--learning_rate",type=float,default=0.001,help="learning rate during training")
    parser.add_argument("--batch_size",type=int,default=64,help="batch size during training")
    parser.add_argument("--seed",type=int,default=0,help="seed of random")
    parser.add_argument("--min_count",type=int,default=5,help="min count of words")
    parser.add_argument("--max_sentence_length",type=int,default=100,help="max sentence length")
    parser.add_argument("--embedding_size",type=int,default=200,help="word embedding size")
    parser.add_argument("--gru_size",type=int,default=50,help="gru size")
    parser.add_argument("--class_num",type=int,default=10,help="class num")
    return parser.parse_args()

main.py

# -*- coding: utf-8 -*-
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from HAN_Model import HAN_Model            # model defined above
from IMDB_Data_Loader import IMDB_Data     # dataset class defined above
import numpy as np
from tqdm import tqdm
import config as argumentparser
config = argumentparser.ArgumentParser()
torch.manual_seed(config.seed)

if config.cuda and torch.cuda.is_available():
    torch.cuda.set_device(config.gpu)
def get_test_result(data_iter,data_set):
    # evaluate classification accuracy on the given data set
    model.eval()
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        if config.cuda and torch.cuda.is_available():
            out = model(data, gpu=True)
        else:
            out = model(data)
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())
    acc = true_sample_num / data_set.__len__()
    return acc
training_set = IMDB_Data("imdb-train.txt.ss",min_count=config.min_count,
                         max_sentence_length = config.max_sentence_length,batch_size=config.batch_size,is_pretrain=True)
training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                            batch_size=config.batch_size,
                                            shuffle=False,
                                            num_workers=0)
test_set = IMDB_Data("imdb-test.txt.ss",min_count=config.min_count,word2id=training_set.word2id,
                         max_sentence_length = config.max_sentence_length,batch_size=config.batch_size)
test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                        batch_size=config.batch_size,
                                        shuffle=False,
                                        num_workers=0)
if config.cuda and torch.cuda.is_available():
    training_set.weight = training_set.weight.cuda()
model = HAN_Model(vocab_size=len(training_set.word2id),
                  embedding_size=config.embedding_size,
                  gru_size = config.gru_size,class_num=config.class_num,weights=training_set.weight,is_pretrain=True)
if config.cuda and torch.cuda.is_available():
    model.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
loss = -1
for epoch in range(config.epoch):
    model.train()
    process_bar = tqdm(training_iter)
    for data, label in process_bar:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        label = torch.autograd.Variable(label).squeeze()
        if config.cuda and torch.cuda.is_available():
            out = model(data,gpu=True)
        else:
            out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        if loss == -1:
            loss = loss_now.data.item()
        else:
            loss = 0.95*loss+0.05*loss_now.data.item()
        process_bar.set_postfix(loss=loss_now.data.item())
        process_bar.update()
        optimizer.zero_grad()
        loss_now.backward()
        optimizer.step()
    test_acc = get_test_result(test_iter, test_set)
    print("The test acc is: %.5f" % test_acc)



References:
《Document modeling with gated recurrent neural network for sentiment classification》
Keras-TextClassification