English Long-Text Classification with TF-IDF, word2vec, SVM, CNN, TextCNN, BiLSTM, CNN+BiLSTM, and BiLSTM+Attention
Author: 互联网
Project source: https://www.kaggle.com/c/word2vec-nlp-tutorial/
I have previously written a couple of related posts:
"Is That All? word2vec + BiLSTM, TextCNN, CNN+BiLSTM, and BiLSTM+Attention for Chinese and English Sentiment Classification, Explained in Code"
"Is That All? word2vec + SVM (Support Vector Machine) for Chinese and English Sentiment Classification, Explained in Code"
Those two posts dealt mainly with sentiment classification of Chinese text. In this post, I will use the Kaggle project above to show how to perform sentiment classification on long English texts.
1 Experimental Data
The dataset comes from the IMDB sentiment analysis data provided by the Kaggle competition "Bag of Words Meets Bags of Popcorn". It contains 25,000 movie reviews, 12,500 positive and 12,500 negative, as shown in the figure.
The main keywords found in positive reviews are shown in the figure below.
The specific top words and their counts are as follows.
The main keywords found in negative reviews are shown in the figure below.
The specific top words and their counts are as follows.
From the four figures above it is easy to see that the top words in positive and negative reviews largely overlap; only a few words differ, such as "well" and "love" among the positive top words and "bad" among the negative ones. This overlap makes it easy to confuse the sentiment of a text in the later classification stage, which indirectly shows that this classification task is fairly challenging.
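As an illustration of how such per-class keyword statistics can be produced, here is a minimal sketch using collections.Counter on toy reviews; the review lists here are stand-ins, as the real counts come from the tokenized IMDB training texts:

```python
from collections import Counter

# Toy stand-ins for the positive and negative review lists; in the
# experiment these come from the labeled IMDB training data.
pos_reviews = ["a movie you will love well acted", "well worth watching"]
neg_reviews = ["a bad movie badly acted", "bad script and worse pacing"]

def top_words(reviews, n=3):
    # Count every lowercase whitespace token across the class's reviews.
    counts = Counter(w for r in reviews for w in r.lower().split())
    return counts.most_common(n)

print(top_words(pos_reviews))  # "well" leads the positive reviews
print(top_words(neg_reviews))  # "bad" leads the negative reviews
```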
At the same time, the texts in this dataset are relatively long overall, as shown in the figure.
The minimum, maximum, median, and mean sentence lengths are shown in the figure below.
From these two figures we can see that the sentence lengths are mostly concentrated between 50 and 200 tokens, which provides data support for the later modeling choice of max_len.
In addition, the test set for this experiment is the 25,000 reviews provided by the Kaggle competition, while the entire dataset above is used for model training.
2 Data Preprocessing
2.1 Data Cleaning
The text above is one of the 25,000 reviews.
Two cleaning strategies are used in this experiment. The first uses the BeautifulSoup class from the Python package bs4 to strip the <br /><br /> tags from the text, removes every character that is not an English letter, splits the text on whitespace, and finally removes stopwords. The second likewise strips the <br /><br /> tags with BeautifulSoup and removes stopwords, but keeps punctuation marks, including special symbols.
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()  # strip HTML tags
        review_text = re.sub("[^a-zA-Z]", " ", review_text)  # remove everything but English letters
        words = review_text.lower().split()  # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]  # remove stopwords
        Words.append(words)
    return Words
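The second cleaning variant (keeping punctuation) is not listed in the original code. Here is a sketch of what it might look like, using a regex instead of BeautifulSoup to strip tags and a small stand-in stopword set in place of nltk's; the name tokenizer_keep_punct is hypothetical:

```python
import re

# Small stand-in stopword set; the experiment uses
# nltk.corpus.stopwords.words("english").
STOPS = {"the", "a", "an", "and", "it", "was"}

def tokenizer_keep_punct(reviews):
    tokenized = []
    for review in reviews:
        text = re.sub(r"<[^>]+>", " ", review).lower()  # strip HTML tags such as <br />
        # letter runs become word tokens; each punctuation mark is its own token
        words = re.findall(r"[a-z]+|[^\sa-z]", text)
        tokenized.append([w for w in words if w not in STOPS])
    return tokenized

print(tokenizer_keep_punct(["It was good!<br /><br />Really good."]))
# [['good', '!', 'really', 'good', '.']]
```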
2.2 Text Feature Extraction
This experiment uses two feature-extraction methods: the traditional TF-IDF approach, and the shallow two-layer neural network model word2vec.
2.2.1 TF-IDF
Due to machine-performance constraints, feature extraction with TF-IDF keeps only words whose frequency exceeds 500, yielding a final vector dimension of 1648. For a detailed introduction to the TF-IDF algorithm, see my earlier post on the topic.
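To make the frequency cut-off concrete, here is a minimal pure-Python TF-IDF sketch with a document-frequency threshold. The experiment itself uses scikit-learn's TfidfVectorizer with min_df=500 and a slightly different idf formula, so this is only an illustration:

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=2):
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    # Keep only terms reaching the cut-off (analogous to min_df=500).
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    n = len(docs)
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows

docs = [["good", "movie"], ["bad", "movie"], ["good", "acting"]]
vocab, X = tfidf_matrix(docs, min_df=2)
print(vocab)  # ['good', 'movie']: "bad" and "acting" fall below the cut-off
```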
2.2.2 Word2vec
The main feature-extraction method in this experiment is the word2vec model. We set the word-vector dimension to 100 and to 200 respectively, in order to find the better experimental result. For a detailed introduction to the word2vec algorithm, see my earlier post on the topic.
3 Evaluation Metric
The evaluation metric for this experiment is AUC, defined as the area enclosed between the ROC curve and the coordinate axes, where ROC stands for receiver operating characteristic curve.
The ROC curve is drawn from the samples' true labels and predicted probabilities. Concretely, its x-axis is the false positive rate and its y-axis is the true positive rate. So what are the true and false positive rates? In a binary classification problem each sample belongs to one of two classes, which we denote 0 and 1 (negative and positive, respectively). When a classifier makes probabilistic predictions, a sample whose true label is 0 may be predicted as 0 or 1, and likewise a sample whose true label is 1 may be predicted as 0 or 1, giving four possibilities, as shown in the table below.
              predicted 1   predicted 0
actual 1          TP            FN
actual 0          FP            TN
In this table, TP is the number of samples predicted positive that are actually positive; FN is the number predicted negative that are actually positive; FP is the number predicted positive that are actually negative; and TN is the number predicted negative that are actually negative.
These four numbers form a matrix known as the confusion matrix. How, then, do we use the confusion matrix to compute the ROC curve?
First we need to define the following two quantities:
FPR = FP / (FP + TN)
TPR = TP / (TP + FN)
FPR is the fraction of all negative samples that are predicted positive, and is called the false positive rate. It tells us, if we pick a negative sample at random, how likely it is to be predicted positive. Naturally we want FPR to be as small as possible.
TPR is the fraction of all positive samples that are predicted positive, and is called the true positive rate. It tells us, if we pick a positive sample at random, how likely it is to be predicted positive. Naturally we want TPR to be as large as possible.
If we plot FPR on the x-axis and TPR on the y-axis, we get the following coordinate system:
FPR = 0 means FP = 0, i.e. no false positives; TPR = 1 means FN = 0, i.e. no false negatives. As the figure shows, the closer a point is to the top-left corner, the better the model's predictions; reaching the top-left corner itself would be the perfect result.
Recall that in a binary (0/1) model the final output is usually a probability, namely the probability that the label is 1. How do we then decide whether an input x belongs to class 0 or 1? We need a threshold: above it we predict 1, below it we predict 0. Different thresholds therefore yield different classifications, hence different confusion matrices, hence different FPR and TPR values. As the threshold sweeps from 0 to 1, we obtain many (FPR, TPR) pairs; plotting them gives the ROC curve.
The strength of AUC is that its computation accounts for the classifier's ability on both positives and negatives, so it remains a sensible evaluation even when the classes are imbalanced. For example, in fraud detection, let fraud be the positive class with a tiny share of the data (say 0.1%). Under plain accuracy, predicting every sample as negative already scores 99.9%. Under AUC, however, predicting everything negative gives TPR = FPR = 0; connecting (0,0) and (1,1) yields an AUC of only 0.5, so the metric successfully sidesteps the class-imbalance problem.
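The threshold-sweeping picture above has an equivalent rank interpretation: AUC is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A small self-contained sketch of that interpretation (in practice sklearn.metrics.roc_auc_score computes the same quantity):

```python
def auc(y_true, y_score):
    # Probability that a random positive outranks a random negative;
    # ties count as half.
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
# A constant score (e.g. "predict everything negative") gives exactly 0.5:
print(auc([0, 1], [0.0, 0.0]))  # 0.5
```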
4 Experiment Settings
This experiment implements four classification algorithms: Bi-LSTM, TextCNN, CNN+Bi-LSTM, and the support vector machine.
4.1 Support Vector Machine
The penalty coefficient C, the kernel type kernel, and the kernel coefficient gamma are tuned with the GridSearchCV utility from the Python package scikit-learn. In addition, this experiment compares the traditional TF-IDF features against word2vec features, and removing punctuation against keeping it.
word2vec + SVM:
# -*- coding: utf-8 -*-
import codecs
import csv
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
import multiprocessing
from gensim.models.word2vec import Word2Vec
from sklearn import svm
from sklearn.model_selection import GridSearchCV
import joblib

cpu_count = multiprocessing.cpu_count()
vocab_dim = 100
n_iterations = 1
n_exposures = 10  # keep only words appearing at least 10 times
window_size = 7
def loadfile():
    train_data = pd.read_csv('../word2vec-nlp-tutorial/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    test_data = pd.read_csv('../word2vec-nlp-tutorial/testData.tsv', header=0, delimiter='\t', quoting=3)
    unlabeled = pd.read_csv('../word2vec-nlp-tutorial/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    combined = np.concatenate((train_data['review'], test_data['review'], unlabeled['review']))
    return combined

# tokenize the reviews
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()  # strip HTML tags
        review_text = re.sub("[^a-zA-Z]", " ", review_text)  # remove everything but English letters
        words = review_text.lower().split()  # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]  # remove stopwords
        Words.append(words)
    return Words
def word2vec_train(combined):
    # gensim 3.x API: in gensim 4 these parameters are vector_size/epochs
    model = Word2Vec(size=vocab_dim,
                     min_count=n_exposures,
                     window=window_size,
                     workers=cpu_count,
                     iter=n_iterations)
    model.build_vocab(combined)  # input: list of token lists
    model.train(combined, total_examples=model.corpus_count, epochs=model.iter)
    model.save('model/Word2vec_model_100_punc.pkl')

# average the word vectors of a sentence
def fea_sentence(list_w):
    n0 = np.zeros(vocab_dim, dtype=np.float32)
    for i in list_w:
        n0 += i
    fe = n0 / len(list_w)
    return fe.tolist()

def parse_dataset(x_data, word2vec):
    xVec = []
    for x in x_data:
        sentence = []
        for word in x:
            if word in word2vec.wv:
                sentence.append(word2vec.wv[word])
            else:
                # out-of-vocabulary words map to the zero vector
                sentence.append(np.zeros(vocab_dim, dtype=np.float32))
        xVec.append(fea_sentence(sentence))
    return np.array(xVec)
def get_data(word2vec):
    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_train = np.concatenate((neg_train[0], pos_train[0]))
    x_train = tokenizer(x_train)
    x_train = parse_dataset(x_train, word2vec)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))
    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    x_test = tokenizer(x_test[0])
    x_test = parse_dataset(x_test, word2vec)
    return x_train, y_train, x_test

def train_svm(x_train, y_train):
    svr = svm.SVC(verbose=True)
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 2, 4], 'gamma': [0.125, 0.25, 0.5, 1, 2, 4]}
    clf = GridSearchCV(svr, parameters, scoring='f1')
    clf.fit(x_train, y_train)
    print('Best parameters:')
    print(clf.best_params_)  # with punctuation removed: {'C': 4, 'gamma': 0.25, 'kernel': 'rbf'}
    # or train directly with the best parameters found:
    # clf = svm.SVC(kernel='rbf', C=4, gamma=0.125, verbose=True)
    # clf.fit(x_train, y_train)
    print('Saving model...')
    joblib.dump(clf, 'model/svm_100_punc.pkl')
if __name__ == '__main__':
    # train the word2vec model and save it
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing data...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    ################# if the word2vec model is already trained, the lines above can be commented out
    print('Loading word2vec model...')
    # word2vec = Word2Vec.load('model/Word2vec_model_200.pkl')  # the 200-dim variant
    word2vec = Word2Vec.load('model/Word2vec_model_100_punc.pkl')
    print('Converting data to model input format...')
    x_train, y_train, x_test = get_data(word2vec)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training SVM model...')
    train_svm(x_train, y_train)
    print('Loading SVM model...')
    model = joblib.load('model/svm_100_punc.pkl')
    y_pred = model.predict(x_test)
    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id))
    print(len(y_pred))
    f = codecs.open('data/Submission_svm.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], y_pred[i]])
    f.close()
TF-IDF + SVM:
# -*- coding: utf-8 -*-
import csv
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
import joblib
# tokenize the reviews
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()  # strip HTML tags
        review_text = re.sub("[^a-zA-Z]", " ", review_text)  # remove everything but English letters
        words = review_text.lower().split()  # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]  # remove stopwords
        Words.append(words)
    return Words

def parse_dataset(x_data):
    x_data = tokenizer(x_data)  # tokenize
    # min_df=100 gives (25000, 6110) features -- too large; min_df=200 gives (25000, 3614)
    tfidfVectorizer = TfidfVectorizer(min_df=500)  # (25000, 1648)
    # TfidfVectorizer expects strings, so re-join the token lists
    vectors = tfidfVectorizer.fit_transform([' '.join(words) for words in x_data])
    print(vectors.shape)
    return vectors
def get_data():
    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))
    x = np.concatenate((neg_train[0], pos_train[0], x_test[0]))
    x = parse_dataset(x)
    x_train = x[:-len(x_test[0])]
    x_test = x[-len(x_test[0]):]
    return x_train, y_train, x_test

def train_svm(x_train, y_train):
    svr = svm.SVC(verbose=True)
    parameters = {'C': [1, 2, 4], 'gamma': [0.5, 1, 2]}
    clf = GridSearchCV(svr, parameters, scoring='f1')
    clf.fit(x_train, y_train)
    print('Best parameters:')
    print(clf.best_params_)  # {'C': 4, 'gamma': 2}
    # or train directly with fixed parameters:
    # clf = svm.SVC(kernel='rbf', C=1, gamma=1, verbose=True)
    # clf.fit(x_train, y_train)
    print('Saving model...')
    joblib.dump(clf, 'model/svm_tfidf.pkl')
if __name__ == '__main__':
    print('Extracting features...')
    x_train, y_train, x_test = get_data()
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training SVM model...')
    train_svm(x_train, y_train)
    print('Loading SVM model...')
    model = joblib.load('model/svm_tfidf.pkl')
    y_pred = model.predict(x_test)
    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id))
    print(len(y_pred))
    f = open('data/Submission_svm_tfidf.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], y_pred[i]])
    f.close()
The generated Submission_svm_tfidf.csv is the file submitted to Kaggle for scoring.
4.2 Bi-LSTM
The hyperparameters tuned here are the maximum sentence length maxlen and the word-vector dimension vocab_dim.
# -*- coding: utf-8 -*-
import codecs
import csv
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
import multiprocessing
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Bidirectional
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout
from keras.callbacks import EarlyStopping
import random
cpu_count = multiprocessing.cpu_count()
vocab_dim = 200
n_iterations = 1
n_exposures = 10  # keep only words appearing at least 10 times
window_size = 7
n_epoch = 30
maxlen = 100
batch_size = 64
def loadfile():
    train_data = pd.read_csv('../word2vec-nlp-tutorial/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    test_data = pd.read_csv('../word2vec-nlp-tutorial/testData.tsv', header=0, delimiter='\t', quoting=3)
    unlabeled = pd.read_csv('../word2vec-nlp-tutorial/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    combined = np.concatenate((train_data['review'], test_data['review'], unlabeled['review']))
    return combined

# tokenize the reviews
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()  # strip HTML tags
        review_text = re.sub("[^a-zA-Z]", " ", review_text)  # remove everything but English letters
        words = review_text.lower().split()  # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]  # remove stopwords
        Words.append(words)
    return Words
def word2vec_train(combined):
    # gensim 3.x API: in gensim 4 these parameters are vector_size/epochs
    model = Word2Vec(size=vocab_dim,
                     min_count=n_exposures,
                     window=window_size,
                     workers=cpu_count,
                     iter=n_iterations)
    model.build_vocab(combined)  # input: list of token lists
    model.train(combined, total_examples=model.corpus_count, epochs=model.iter)
    model.save('model/Word2vec_model_200.pkl')

def create_dictionaries(model=None):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
    # index 0 is reserved for low-frequency words, hence k + 1
    w2indx = {v: k + 1 for k, v in gensim_dict.items()}  # index of every word with frequency >= 10
    f = open("word2index.txt", 'w', encoding='utf8')
    for key in w2indx:
        f.write('{} {}\n'.format(key, w2indx[key]))
    f.close()
    w2vec = {word: model.wv[word] for word in w2indx.keys()}  # word vector of every word with frequency >= 10
    return w2indx, w2vec
def parse_dataset(combined, w2indx):
    data = []
    for sentence in combined:
        new_txt = []
        for word in sentence:
            new_txt.append(w2indx.get(word, 0))  # words with frequency < 10 map to index 0
        data.append(new_txt)
    # each sentence becomes its word-index sequence, padded/truncated to maxlen
    data = sequence.pad_sequences(data, maxlen=maxlen)
    return data

def get_data(index_dict, word_vectors):
    n_symbols = len(index_dict) + 1  # add 1 because index 0 is reserved for low-frequency words
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # row 0 (the reserved index) stays all-zero
    for word, index in index_dict.items():  # fill in the vector of every indexed word, starting from index 1
        embedding_weights[index, :] = word_vectors[word]
    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_train = np.concatenate((neg_train[0], pos_train[0]))
    x_train = tokenizer(x_train)
    x_train = parse_dataset(x_train, index_dict)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))
    y_train = keras.utils.to_categorical(y_train, num_classes=2)  # one-hot labels of shape [len(y), 2]
    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    x_test = tokenizer(x_test[0])
    x_test = parse_dataset(x_test, index_dict)
    return n_symbols, embedding_weights, x_train, y_train, x_test
## define the network structure
def train_bilstm(n_symbols, embedding_weights, x_train, y_train):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        weights=[embedding_weights],
                        input_length=maxlen))
    model.add(Bidirectional(LSTM(units=50, dropout=0.5, activation='tanh')))
    model.add(Dense(2, activation='softmax'))  # fully connected output layer, 2 classes
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch, verbose=2)
    model.save('model/bilstm_100_200.h5')
if __name__ == '__main__':
    # train the word2vec model and save it
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing data...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    ################# if the word2vec model is already trained, the lines above can be commented out
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model_200.pkl')
    print('Building dictionaries...')
    index_dict, word_vectors = create_dictionaries(model=word2vec)
    print('Converting data to model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test = get_data(index_dict, word_vectors)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training BiLSTM model...')
    train_bilstm(n_symbols, embedding_weights, x_train, y_train)
    print('Loading BiLSTM model...')
    model = load_model('model/bilstm_100_200.h5')
    y_pred = model.predict(x_test)
    # take the argmax of the two softmax outputs as the predicted class
    test_result = np.argmax(y_pred, axis=1)
    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id))
    print(len(test_result))
    f = codecs.open('data/Submission_bilstm_100_200.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], test_result[i]])
    f.close()
4.3 TextCNN
Again, the hyperparameters tuned are the maximum sentence length maxlen and the word-vector dimension vocab_dim.
def train_textcnn(n_symbols, embedding_weights, x_train, y_train):
    # architecture: embedding -> three parallel conv+pool branches -> concatenate -> dropout -> dense
    main_input = Input(shape=(maxlen,), dtype='float64')
    # embedding layer initialized with the pretrained word vectors
    embedder = Embedding(output_dim=vocab_dim,
                         input_dim=n_symbols,
                         input_length=maxlen,
                         weights=[embedding_weights])
    embed = embedder(main_input)
    # kernel sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=38)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=37)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=36)(cnn3)
    # concatenate the outputs of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.5)(flat)
    main_output = Dense(2, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/textcnn.h5')
4.4 CNN+Bi-LSTM
Again, the hyperparameters tuned are the maximum sentence length maxlen and the word-vector dimension vocab_dim.
def train_cnn_bilstm(n_symbols, embedding_weights, x_train, y_train):
    # architecture: embedding -> Conv1D -> BiLSTM -> flatten -> dense
    main_input = Input(shape=(maxlen,), dtype='float64')
    # embedding layer initialized with the pretrained word vectors
    embedder = Embedding(output_dim=vocab_dim,
                         input_dim=n_symbols,
                         input_length=maxlen,
                         weights=[embedding_weights])
    embed = embedder(main_input)
    cnn = Conv1D(64, 3, padding='same', strides=1, activation='relu')(embed)
    bilstm = Bidirectional(LSTM(units=50, dropout=0.5, activation='tanh', return_sequences=True))(cnn)
    flat = Flatten()(bilstm)
    main_output = Dense(2, activation='softmax')(flat)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/cnnbilstm_100_100.h5')
(In addition, this post also provides reproduction code for CNN and BiLSTM+Attention.)
4.5 CNN
def train_cnn(n_symbols, embedding_weights, x_train, y_train):
    # architecture: embedding -> Conv1D -> flatten -> dropout -> dense
    main_input = Input(shape=(maxlen,), dtype='float64')
    # embedding layer initialized with the pretrained word vectors
    embedder = Embedding(output_dim=vocab_dim,
                         input_dim=n_symbols,
                         input_length=maxlen,
                         weights=[embedding_weights])
    embed = embedder(main_input)
    cnn = Conv1D(64, 3, padding='same', strides=1, activation='relu')(embed)
    flat = Flatten()(cnn)
    # Dense has no dropout argument, so apply an explicit Dropout layer
    drop = Dropout(0.2)(flat)
    main_output = Dense(2, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/cnn.h5')
4.6 Bi-LSTM + attention
# custom attention layer
class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        # use a local reshaped view of V instead of overwriting the layer weight
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]

## define the network structure
def train_bilstm_att(n_symbols, embedding_weights, x_train, y_train, ATT_SIZE):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        weights=[embedding_weights],
                        input_length=maxlen))
    model.add(Bidirectional(LSTM(units=50, dropout=0.5, return_sequences=True)))
    model.add(AttentionLayer(attention_size=ATT_SIZE))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/bilstmAtt.h5')
5 Experimental Results
We can see that the Bi-LSTM model performs best when the word-vector dimension is 200 and the maximum sentence length is 200; in the other settings the results differ little.
The TextCNN model performs best when the word-vector dimension is 100 and the maximum sentence length is 160; a dimension of 100 outperforms 200 while also reducing computation and running time.
The CNN+Bi-LSTM model performs best when the word-vector dimension is 100 and the maximum sentence length is 200.
When TF-IDF is used for feature extraction, the results are better than with word2vec. Likewise, removing punctuation improves model accuracy.
6 Discussion and Analysis
From this experiment we find that this task is best handled by the support vector machine, followed by the CNN+Bi-LSTM model, and then the TextCNN and Bi-LSTM models. We also find that training the SVM takes far less time than training the deep-learning models.
Through tuning we find that the word-vector dimension has little effect on the final classification result, whereas the maximum sentence length matters more. A likely reason is that with word2vec a dimension of 100 is already enough to represent the words, so increasing it changes little. The choice of maximum sentence length, on the other hand, determines how much of a short sentence's vector is zero-padding and how many words of a long sentence are cut off, which directly affects how completely a sentence is represented.
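This padding/truncation effect can be seen directly in a small helper that mirrors the defaults of keras's pad_sequences (pre-padding with zeros, keeping the last maxlen indices); the helper name is my own:

```python
def pad_or_truncate(seq, maxlen, pad=0):
    # Short sequences are pre-padded with zeros; long ones keep only
    # their last maxlen entries, so their leading words are cut off.
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [pad] * (maxlen - len(seq)) + seq

print(pad_or_truncate([5, 9, 2], 5))           # [0, 0, 5, 9, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```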
We can also see that the traditional feature-extraction method yields better final classification results than features extracted with the word2vec neural network. A likely reason is that TF-IDF sentence vectors have a much higher dimension; although this increases training time, it can express the meaning of a sentence more completely, so word2vec comes out somewhat worse by comparison.
Removing punctuation during preprocessing also brings a clear improvement in the final classification result compared with keeping it.
Closing Remarks
Source: https://blog.csdn.net/qq_44186838/article/details/118435765