[Tensorflow] RNN - 02. Movie Review Sentiment Prediction with LSTM

Date: 2023-09-27 14:23:25

From: Predicting Movie Review Sentiment with TensorFlow and TensorBoard

Ref: http://www.cnblogs.com/libinggen/p/6939577.html

Ref: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

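One reason to use an LSTM: in a deep RNN, gradient errors accumulate across time steps until the gradient either shrinks to zero or grows towards infinity, at which point optimization can no longer make progress. LSTM was designed to address this problem.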
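To make the vanishing/exploding gradient problem concrete, here is a minimal sketch. It assumes, purely for illustration, that backpropagation through time multiplies the gradient by the same scalar factor at every step; the function name `backprop_factor` and the numbers are hypothetical, and a real RNN multiplies by Jacobian matrices instead.

```python
def backprop_factor(per_step_factor, n_steps):
    """Gradient scale after multiplying by the same factor at every time step."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= per_step_factor
    return scale


# A factor slightly below 1 shrinks the gradient towards zero over 100 steps,
# while a factor slightly above 1 makes it grow very large.
vanishing = backprop_factor(0.9, 100)
exploding = backprop_factor(1.1, 100)
print(vanishing, exploding)
```

The LSTM cell state is updated additively through gates, which is commonly credited with keeping this repeated product better behaved over long sequences.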

Thanks to Jürgen Schmidhuber

Using the data from an old Kaggle competition “Bag of Words Meets Bags of Popcorn”

import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re, time
from nltk.corpus import stopwords
from collections import defaultdict
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import namedtuple

Preprocessing

The data is formatted as .tsv

Remove stopwords
Convert words to lower case

def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''

    # Convert to lower case, then strip markup and non-letters first,
    # so that tokens such as "the," still match the stop-word list
    text = text.lower()
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)

    # Split on whitespace (this also collapses any extra spaces)
    words = text.split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    # Return the cleaned review as a single space-separated string
    return " ".join(words)

Data clean
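The post does not show how `train_clean` and `test_clean` are produced, so the following is only a sketch of one way to do it. The column names (`id`, `sentiment`, `review`), the in-memory sample rows, and the helpers `simple_clean` and `load_and_clean` are all assumptions for illustration; `simple_clean` is a stand-in for `clean_text` that skips stop-word removal so the example runs without NLTK.

```python
import csv
import io
import re

# Tiny in-memory stand-in for the real .tsv file (column names are assumed).
SAMPLE_TSV = (
    "id\tsentiment\treview\n"
    "1\t1\tA GREAT movie!<br />Loved it.\n"
    "2\t0\tBoring... and far too long.\n"
)


def simple_clean(text):
    """Simplified cleaner: lower-case, drop <br /> tags and non-letters."""
    text = text.lower()
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_and_clean(handle):
    """Read tab-separated rows and return (cleaned reviews, integer labels)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    reviews, labels = [], []
    for row in reader:
        reviews.append(simple_clean(row["review"]))
        labels.append(int(row["sentiment"]))
    return reviews, labels


train_clean, train_labels = load_and_clean(io.StringIO(SAMPLE_TSV))
print(train_clean)
```

With the real data, the same function could be given an open file handle instead of the `io.StringIO` sample.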

Tokenize

# Tokenize the reviews
all_reviews = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_reviews)
print("Fitting is complete.")

train_seq = tokenizer.texts_to_sequences(train_clean)
print("train_seq is complete.")

test_seq = tokenizer.texts_to_sequences(test_clean)
print("test_seq is complete")

word_index = tokenizer.word_index

NB: punctuation is useful!

[“The”, “cat”, “went”, “to”, “the”, “zoo”, “.”] --> [1, 2, 3, 4, 1, 5, 6]
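The mapping above can be sketched in a few lines of plain Python. This is an illustration only: the helper `encode_tokens` is hypothetical and assigns ids in order of first appearance, which reproduces the example. The Keras `Tokenizer` instead ranks words by frequency and, with its default `filters`, strips punctuation.

```python
def encode_tokens(tokens):
    """Map tokens to integer ids in order of first appearance (ids start at 1)."""
    word_index = {}
    ids = []
    for token in tokens:
        key = token.lower()  # "The" and "the" share one id
        if key not in word_index:
            word_index[key] = len(word_index) + 1
        ids.append(word_index[key])
    return ids, word_index


ids, word_index = encode_tokens(["The", "cat", "went", "to", "the", "zoo", "."])
print(ids)  # [1, 2, 3, 4, 1, 5, 6]
```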

Limiting your vocabulary

Your model should benefit from limiting your vocabulary to more common words, because it has seen each of those words in the text multiple times.
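As a rough sketch of limiting the vocabulary, the snippet below keeps only the `max_words` most frequent words and drops everything else when encoding. The helper names and the sample texts are made up for illustration; with Keras, the `num_words` argument of `Tokenizer` serves a similar purpose.

```python
from collections import Counter


def build_limited_vocab(texts, max_words):
    """Return {word: id} for the max_words most frequent words (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: i + 1 for i, word in enumerate(most_common)}


def encode(text, vocab):
    """Encode a text as ids, silently dropping out-of-vocabulary words."""
    return [vocab[word] for word in text.split() if word in vocab]


texts = ["good movie good plot", "bad movie", "good acting"]
vocab = build_limited_vocab(texts, max_words=2)
print(vocab)                                  # {'good': 1, 'movie': 2}
print(encode("good movie good plot", vocab))  # [1, 2, 1]
```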

Reviews with the same length

I limited mine to 200 to increase the training speed of my model.
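A minimal sketch of forcing every review to the same length is shown below. It mimics the default behavior of Keras `pad_sequences` (zeros added at the front, and overlong sequences cut from the front); whether the author used these defaults is an assumption, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(sequences, max_len=200, pad_value=0):
    """Left-pad short sequences and keep the last max_len ids of long ones."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    result = []
    for seq in sequences:
        seq = list(seq)[-max_len:]            # truncate from the front
        padding = [pad_value] * (max_len - len(seq))
        result.append(padding + seq)          # pad at the front
    return result


padded = pad_or_truncate([[1, 2, 3], [1, 2, 3, 4, 5, 6]], max_len=4)
print(padded)  # [[0, 1, 2, 3], [3, 4, 5, 6]]
```

In the post's setting `max_len` would be 200; the value 4 is used here only to keep the output readable.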

Build Graph with LSTM

def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
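        def make_cell():
            # Build a fresh cell per layer: reusing one cell object across
            # layers ([drop] * num_layers) is rejected by newer TF 1.x releases
            lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
            return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])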
    
    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(
                                        cell,         
                                        embed,
                                        initial_state=initial_state)    
    
    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        
        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases  = tf.zeros_initializer()
        
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                    num_outputs = fc_units,
                    activation_fn = tf.sigmoid,
                    weights_initializer = weights,
                    biases_initializer = biases)
        
        dense = tf.contrib.layers.dropout(dense, keep_prob)
        
        # Depending on the iteration, use a second fully connected layer
        if multiple_fc:
            dense = tf.contrib.layers.fully_connected(dense,
                        num_outputs = fc_units,
                        activation_fn = tf.sigmoid,
                        weights_initializer = weights,
                        biases_initializer = biases)
            
            dense = tf.contrib.layers.dropout(dense, keep_prob)
    
    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense, 
                          num_outputs = 1, 
                          activation_fn=tf.sigmoid,
                          weights_initializer = weights,
                          biases_initializer = biases)
        
        tf.summary.histogram('predictions', predictions)
    
    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)
    
    # Train the model
    with tf.name_scope('train'):    
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), 
                                        tf.int32), 
                                        labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
    
    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',        
                    'final_state','accuracy', 'predictions', 'cost', 
                    'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph

Ref: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

This reference covers several approaches:

Simple LSTM for Sequence Classification

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

Epoch 1/3

16750/16750 [==============================] - 107s - loss: 0.5570 - acc: 0.7149

Epoch 2/3

16750/16750 [==============================] - 107s - loss: 0.3530 - acc: 0.8577

Epoch 3/3

16750/16750 [==============================] - 107s - loss: 0.2559 - acc: 0.9019

Accuracy: 86.79%

LSTM For Sequence Classification With Dropout

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

Epoch 1/3

16750/16750 [==============================] - 108s - loss: 0.5802 - acc: 0.6898

Epoch 2/3

16750/16750 [==============================] - 108s - loss: 0.4112 - acc: 0.8232

Epoch 3/3

16750/16750 [==============================] - 108s - loss: 0.3825 - acc: 0.8365

Accuracy: 85.56%

LSTM and Convolutional Neural Network For Sequence Classification

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

Epoch 1/3

16750/16750 [==============================] - 58s - loss: 0.5186 - acc: 0.7263

Epoch 2/3

16750/16750 [==============================] - 58s - loss: 0.2946 - acc: 0.8825

Epoch 3/3

16750/16750 [==============================] - 58s - loss: 0.2291 - acc: 0.9126

Accuracy: 86.36%
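One plausible reason the epochs above take about 58s instead of about 107s is that the pooling layer halves the number of time steps the LSTM has to process. The sketch below works out the shape arithmetic for `Conv1D(padding='same')` followed by `MaxPooling1D`, assuming stride 1 for the convolution and the default pooling stride equal to `pool_size`. The sequence length of 500 is an assumed value, since `max_review_length` is not given here, and `conv_pool_output_shape` is a hypothetical helper.

```python
def conv_pool_output_shape(seq_len, filters, pool_size):
    """(time steps, features) after Conv1D('same', stride 1) + MaxPooling1D."""
    conv_len = seq_len                   # 'same' padding preserves the length
    pooled_len = conv_len // pool_size   # non-overlapping pooling windows
    return pooled_len, filters


# Assumed max_review_length of 500: the LSTM sees 250 steps of 32 features.
shape = conv_pool_output_shape(seq_len=500, filters=32, pool_size=2)
print(shape)  # (250, 32)
```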

1D convolution code reference: http://spaces.ac.cn/archives/4195/

Combining these models with feature engineering may further improve performance.

Ref: http://www.cnblogs.com/en-heng/p/6899820.html

Ref: http://blog.csdn.net/han_xiaoyang/article/details/50629608

Ref: https://zhuanlan.zhihu.com/p/26645088

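Feature Engineering

A lot of effort went into text mining: keyword extraction, sentiment analysis, clustering of word embeddings and so on were all tried, but none of them worked very well.

For text features, the advice is still to find high-frequency words other than stop words and to look for their concrete connection to this housing classification problem.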

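Now for the painful part: we have the data, and we need to find a way to extract discriminative features from it.

For example, the word2vec approach offered on the Kaggle introduction page for this problem is one way to extract features from text into a numeric domain.
The keyword extraction from user information mentioned in section 6 is another kind of feature extraction.
Here, we plan to use a feature that is very effective in text retrieval systems: the TF-IDF (term frequency-inverse document frequency) vector. Each movie review is ultimately converted into one TF-IDF vector.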

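A brief explanation: TF-IDF is a statistical method for evaluating how important a word (or n-gram) is to one document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. This is a very effective way to identify the words or phrases that most influence whether a review is positive or negative.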
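To illustrate the idea, here is a minimal sketch of the textbook formula, where the weight of a term is its relative frequency in the document multiplied by log(N / document frequency). The function `tf_idf` and the toy documents are made up for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["great", "movie"], ["boring", "movie"], ["great", "movie", "great"]]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus "movie" appears in every document, so its weight is zero, while "boring" appears in only one document and gets the highest weight there.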

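Well... the blogger is going to keep being lazy and use the TF-IDF vectorizer from scikit-learn directly; readers who want the details can look at the sklearn TF-IDF vectorizer class. A few more words on my processing details: I removed the stop words, and I extended the word-level features to a bigram (2-gram) language model. You could add trigrams and 4-grams as well, but my single machine ran out of memory, so bigrams will have to do for now.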
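As a sketch of what "stop words removed, extended to bigrams" means for the features, the snippet below builds unigram and bigram terms from a token list in plain Python. It is only an approximation of what `TfidfVectorizer` is asked to do with `stop_words` and `ngram_range=(1, 2)`; the helper `unigrams_and_bigrams` and the tiny stop-word set are hypothetical.

```python
def unigrams_and_bigrams(tokens, stop_words=frozenset()):
    """Drop stop words, then return all unigrams followed by all bigrams."""
    kept = [token for token in tokens if token not in stop_words]
    bigrams = [" ".join(pair) for pair in zip(kept, kept[1:])]
    return kept + bigrams


terms = unigrams_and_bigrams(
    ["the", "plot", "was", "not", "good"],
    stop_words={"the", "was"},
)
print(terms)  # ['plot', 'not', 'good', 'plot not', 'not good']
```

Each added n-gram order multiplies the number of distinct terms, which is consistent with the memory limit mentioned above.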

End.
