
Python Data Analysis: NLTK


Natural Language Toolkit
Corpora
Tokenization (tokenize)
Tokenizing special characters
Word-form issues
Word-form normalization
Part-of-speech tagging
Text preprocessing pipeline:

(Figure: the text preprocessing pipeline)
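
Several snippets below note that a corpus or model has to be downloaded first. A one-time setup sketch covering everything this post uses (the identifiers are standard NLTK resource names):

import nltk

# one-time downloads for the examples in this post
nltk.download('brown')                       # Brown corpus
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('wordnet')                     # WordNet, used by the lemmatizer
nltk.download('stopwords')                   # stop word lists
nltk.download('averaged_perceptron_tagger')  # POS tagger model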

import nltk
from nltk.corpus import brown # requires the brown corpus to be downloaded
# use the Brown University corpus

# List the categories the corpus contains
print(brown.categories())

Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

# Inspect the size of the brown corpus
print('There are {} sentences'.format(len(brown.sents())))
print('There are {} words'.format(len(brown.words())))

Output:
There are 57340 sentences
There are 1161192 words
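
The corpus readers also accept a categories argument, so the corpus can be sliced by genre; a small sketch (the 'news' category is one of those listed above):

# restrict the corpus to a single genre
news_words = brown.words(categories='news')
print('news category: {} words'.format(len(news_words)))
print(news_words[:10])  # the first ten tokens of the news genre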

sentence = "Python is a widely used high-level programming language for general-purpose programming."
tokens = nltk.word_tokenize(sentence) # requires the punkt tokenizer models
print(tokens)

Output:
['Python', 'is', 'a', 'widely', 'used', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.']
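
The overview above also mentions tokenizing special characters. word_tokenize keeps punctuation as separate tokens; if you instead want to drop special characters entirely, a regular-expression tokenizer is one option. A sketch using nltk.tokenize.RegexpTokenizer (the pattern and sample string are my own, not from the original post):

from nltk.tokenize import RegexpTokenizer

# keep only runs of word characters, discarding punctuation and symbols
tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize('Hello, world! High-level #NLP :-)'))
# expected: ['Hello', 'world', 'High', 'level', 'NLP']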

Stemming
# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

# SnowballStemmer
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))

# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))

Output (all three stemmers reduce both forms to the same stem):
look
look
look
look
look
look

Lemmatization
from nltk.stem import WordNetLemmatizer # requires the wordnet corpus

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('cats'))
print(wordnet_lemmatizer.lemmatize('boxes'))
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('went'))

Output:
cat
box
are
went

# Specifying the part of speech gives a more accurate lemma
# lemmatize treats words as nouns by default
print(wordnet_lemmatizer.lemmatize('are', pos='v'))
print(wordnet_lemmatizer.lemmatize('went', pos='v'))

Output:
be
go
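
Because lemmatize defaults to nouns, a common pattern is to feed it the part-of-speech tags produced in the next section. A minimal sketch; the helper get_wordnet_pos is my own name for the usual Treebank-to-WordNet mapping:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # map a Penn Treebank tag to a WordNet POS constant, defaulting to noun
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged = nltk.pos_tag(nltk.word_tokenize('The cats went away'))
print([wordnet_lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tagged])
# expected: ['The', 'cat', 'go', 'away']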

Part-of-speech tagging:
import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words)) # requires the averaged_perceptron_tagger model

Output:
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
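
If a tag abbreviation is unfamiliar, NLTK can describe the Penn Treebank tagset (this needs the 'tagsets' resource downloaded as well):

# nltk.download('tagsets')  # one-time download of the tagset documentation
nltk.help.upenn_tagset('NNP')  # prints the definition and examples for 'NNP'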

Removing stopwords:
from nltk.corpus import stopwords # requires the stopwords corpus

# `words` is the token list from the POS tagging example above
filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After stopword removal:', filtered_words)

Output:
Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
After stopword removal: ['Python', 'widely', 'used', 'programming', 'language', '.']
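
Note that stopwords.words('english') is all lowercase and the comparison is case-sensitive, so capitalized words such as 'Python' can never match. Lowercasing during the check is a common fix, and a set makes lookups faster (a sketch, not from the original post):

stop_set = set(stopwords.words('english'))  # set membership is much faster than a list scan
filtered_words = [word for word in words if word.lower() not in stop_set]
print(filtered_words)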

Text preprocessing pipeline:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# Tokenize
raw_words = nltk.word_tokenize(raw_text)

# Normalize word forms
wordnet_lemmatizer = WordNetLemmatizer()
words = [wordnet_lemmatizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stopwords
filtered_words = [word for word in words if word not in stopwords.words('english')]

print('Original text:', raw_text)
print('Preprocessed result:', filtered_words)

Output:
Original text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessed result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']
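
In practice these steps are usually wrapped in one reusable function. A minimal sketch under the same assumptions as above (the name preprocess is mine):

def preprocess(text):
    # tokenize, lemmatize, and strip English stopwords from raw text
    lemmatizer = WordNetLemmatizer()
    stop_set = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [lemma for lemma in lemmas if lemma.lower() not in stop_set]

print(preprocess('Life is like a box of chocolates.'))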

Source: https://blog.csdn.net/weixin_41792682/article/details/89705971