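NLP Lab 1: Sentence Sentiment Analysis with Logistic Regression
Author: Internet
1 Preparation
1.1 Implementing two utility functions
Tweet-processing function: it first strips unneeded text from the input (stock tickers, the "RT" retweet marker, hyperlinks, and the # sign), then calls TweetTokenizer to tokenize the result. Every token that is neither a stop word nor a punctuation mark is stemmed and added to the output list.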
import re
import string

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)
    return tweets_clean
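The regex cleanup at the top of process_tweet can be tried on its own with only the standard library. The sketch below copies those four substitutions into a helper called strip_tweet_noise (a hypothetical name, not part of the original code) and runs it on made-up tweets. One behavior worth noticing: the hyperlink pattern ends in `.*`, so any text that follows a URL on the same line is removed along with it.

```python
import re


def strip_tweet_noise(tweet):
    """Apply the same four regex substitutions used in process_tweet."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading "RT" marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # hyperlink to end of line
    tweet = re.sub(r'#', '', tweet)                     # keep the word, drop '#'
    return tweet


# Made-up sample tweets, for illustration only.
cleaned = strip_tweet_noise(
    "RT @user: Loving the #NLP course $GE https://example.com/x")
print(repr(cleaned))

# Text after the URL is dropped too, because '.*' runs to the end of the line.
truncated = strip_tweet_noise("see https://example.com/x for details")
print(repr(truncated))
```

The @user handle survives here because handles are only stripped later, by TweetTokenizer(strip_handles=True).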
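Frequency-dictionary function: it builds a hash map (a Python dict) from each (word, label) pair to the number of times that pair occurs.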
import numpy as np


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()
    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
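To make the shape of the resulting dictionary concrete, here is a minimal sketch that follows the same counting logic but swaps process_tweet for plain whitespace splitting, so it runs without NLTK. The name build_freqs_simple, the tokenize parameter, and the sample tweets are assumptions made for illustration.

```python
def build_freqs_simple(tweets, labels, tokenize=str.split):
    """Count (word, label) pairs; tokenize stands in for process_tweet."""
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in tokenize(tweet):
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


# Two made-up tweets: one positive (1), one negative (0).
toy_freqs = build_freqs_simple(["happy happy day", "sad day"], [1, 0])
for pair, count in sorted(toy_freqs.items()):
    print(pair, count)
```

The same word gets a separate entry per label, which is why "day" appears under both 1 and 0.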
1.2 Implementing the sigmoid function for logistic regression
def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    # calculate the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    return h
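The formula 1 / (1 + exp(-z)) is mathematically fine, but for a very negative z the exponential overflows: np.exp then returns inf with a RuntimeWarning, and math.exp raises OverflowError. As a hedged aside, a scalar version that avoids this could branch on the sign of z. The name stable_sigmoid is hypothetical and this is not the implementation used in the rest of this post.

```python
import math


def stable_sigmoid(z):
    """Scalar sigmoid that never exponentiates a large positive number."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-z))
    # for negative z, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1
    e = math.exp(z)
    return e / (1.0 + e)


print(stable_sigmoid(0))
print(stable_sigmoid(-1000))
print(stable_sigmoid(1000))
```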
1.3 Implementing gradient descent from the update formula
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    '''
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    for i in range(0, num_iters):
        # get z, the dot product of x and theta
        z = np.dot(x, theta)
        # get the sigmoid of z
        h = sigmoid(z)
        # calculate the cost function
        J = -(np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h))) / m
        # update the weights theta
        theta = theta - alpha * (np.dot(x.T, (h - y))) / m
    # J is a (1,1) array; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta
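The post does not print the formula this section refers to. Read off the code, the loop computes the following in each iteration, where X is the feature matrix x, m is its number of rows, and σ is the sigmoid. This is a transcription of the code in matrix form, not a derivation.

```latex
\begin{aligned}
h &= \sigma(X\theta) \\
J(\theta) &= -\frac{1}{m}\left[\, y^{\top}\log h + (1-y)^{\top}\log(1-h) \,\right] \\
\theta &\leftarrow \theta - \frac{\alpha}{m}\, X^{\top}(h - y)
\end{aligned}
```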
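1.4 Implementing the feature extraction function
Given the frequency dictionary and a tweet, this function returns a 1 x 3 feature vector x: a bias term, the sum of the positive-label counts of the tweet's words, and the sum of their negative-label counts. It calls the utility function process_tweet.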
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))
    # bias term is set to 1
    x[0, 0] = 1
    # loop through each word in the list of words
    for word in word_l:
        # increment the word count for the positive label 1
        x[0, 1] += freqs.get((word, 1), 0)
        # increment the word count for the negative label 0
        x[0, 2] += freqs.get((word, 0), 0)
    assert(x.shape == (1, 3))
    return x
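A small worked example may help show what the three features contain. The sketch below uses plain lists and an already-tokenized tweet so that it runs without NumPy or NLTK. The function name extract_features_simple and the toy counts are assumptions made for illustration.

```python
def extract_features_simple(words, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    x = [1.0, 0.0, 0.0]  # bias term is fixed at 1
    for word in words:
        x[1] += freqs.get((word, 1), 0)  # count under the positive label
        x[2] += freqs.get((word, 0), 0)  # count under the negative label
    return x


# Toy frequency dictionary keyed by (stemmed word, label).
toy_freqs = {("happi", 1): 3, ("happi", 0): 1, ("sad", 0): 4}
features = extract_features_simple(["happi", "sad", "unseen"], toy_freqs)
print(features)  # [1.0, 3.0, 5.0]
```

A word missing from the dictionary, like "unseen" here, contributes 0 to both sums thanks to the default in freqs.get.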
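1.5 Preparing the dataset
Split the input data into two parts, one for training and one for validation. The code assumes all_positive_tweets and all_negative_tweets are already loaded as lists of tweet strings (the post does not show that step); the first 4000 of each go to the training set and the rest to the test set.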
# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
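The labels y are built in the same order: ones for the positive tweets, followed by zeros for the negative tweets.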
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
# Print the training labels and the shapes of the training and test label arrays
print(train_y)
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))
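The split and the labels only line up because both are built positives first, negatives second. As a sketch of that invariant with plain lists, the hypothetical helper split_and_label below does both steps together on a made-up corpus of five tweets per class.

```python
def split_and_label(positive, negative, n_train):
    """Split each class at n_train and build labels in matching order."""
    train_pos, test_pos = positive[:n_train], positive[n_train:]
    train_neg, test_neg = negative[:n_train], negative[n_train:]
    train_x = train_pos + train_neg
    test_x = test_pos + test_neg
    # labels follow the same positives-then-negatives order as the tweets
    train_y = [1.0] * len(train_pos) + [0.0] * len(train_neg)
    test_y = [1.0] * len(test_pos) + [0.0] * len(test_neg)
    return train_x, train_y, test_x, test_y


pos = [f"pos {i}" for i in range(5)]
neg = [f"neg {i}" for i in range(5)]
toy_train_x, toy_train_y, toy_test_x, toy_test_y = split_and_label(pos, neg, 4)
print(len(toy_train_x), toy_test_x, toy_test_y)
```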
2 Main workflow
2.1 Building the frequency dictionary from the training data
freqs = build_freqs(train_x, train_y)
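2.2 Training
The feature extraction function yields a three-element vector for each tweet; stacking these vectors gives the m x 3 matrix X. Gradient descent is then run on X and the labels to obtain theta.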
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
# training labels corresponding to X
Y = train_y
# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
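To watch the training step in isolation, the sketch below runs the same batch update on a four-point toy problem using only the standard library. Everything in it is an assumption made for illustration: the names sigmoid_scalar and gradient_descent_lists, the data, and the learning rate of 0.1, which suits these small features and is not the 1e-9 used above for count-based features.

```python
import math


def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_descent_lists(xs, ys, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression on plain lists."""
    m = len(xs)
    costs = []
    for _ in range(num_iters):
        # predictions h = sigmoid(x . theta) for every row
        hs = [sigmoid_scalar(sum(xj * tj for xj, tj in zip(x, theta)))
              for x in xs]
        # cross-entropy cost, computed before the update
        cost = -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                    for y, h in zip(ys, hs)) / m
        costs.append(cost)
        # gradient component j is the mean of (h - y) * x_j
        grad = [sum((h - y) * x[j] for x, y, h in zip(xs, ys, hs)) / m
                for j in range(len(theta))]
        theta = [tj - alpha * gj for tj, gj in zip(theta, grad)]
    return costs, theta


# Toy data: a bias column plus one feature; positive feature means label 1.
toy_xs = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
toy_ys = [1.0, 1.0, 0.0, 0.0]
costs, toy_theta = gradient_descent_lists(toy_xs, toy_ys, [0.0, 0.0], 0.1, 200)
print(f"first cost {costs[0]:.4f}, last cost {costs[-1]:.4f}")
```

With all-zero initial weights every prediction is 0.5, so the first cost equals ln 2 (about 0.6931); the cost then falls as the weight on the feature grows.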
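3 Prediction
3.1 Prediction function
For each tweet, extract its features, multiply them by the theta vector, and pass the result through the sigmoid activation function.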
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability that the tweet is positive, as a (1,1) array
    '''
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred
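3.2 Computing the accuracy
A threshold of 0.5 decides the class. To compute the accuracy, the predictions and the labels are first converted to one-dimensional arrays, and then the matching entries are counted.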
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0)
    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    accuracy = np.sum(np.array(y_hat) == test_y.flatten()) / len(y_hat)
    return accuracy
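The thresholding and counting can be checked by hand on a few numbers. The sketch below applies the same strict "greater than 0.5" rule to made-up probabilities; accuracy_at_threshold is a hypothetical helper, not part of the original code.

```python
def accuracy_at_threshold(probabilities, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1.0 if p > threshold else 0.0 for p in probabilities]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Made-up probabilities and labels for illustration.
toy_probs = [0.9, 0.4, 0.5, 0.2]
toy_labels = [1.0, 0.0, 1.0, 1.0]
toy_accuracy = accuracy_at_threshold(toy_probs, toy_labels)
print(toy_accuracy)  # 0.5
```

Because the comparison is strict, a probability of exactly 0.5 is assigned to class 0, which is why the third item counts as a miss.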
Source: https://blog.csdn.net/weixin_43911373/article/details/116327181