
CountVectorizer Word Frequency Counting


from sklearn.feature_extraction.text import CountVectorizer
import  jieba
# Instantiate a CountVectorizer object, con_vec
# con_vec = CountVectorizer(min_df=1)


# Prepare the sample text data
# text = ['This is the first document.', 'This is the second second document.', 'And the third one.',
#         'Is this the first document?', ]

# Count how many times each word appears
# X = con_vec.fit_transform(text)
# feature_names = con_vec.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
# print(feature_names)
# print(X)
"""
(0, 1)	1
第一个值 属于第几个句子
第二个值 哪个词
1 词频
"""
# Convert the word counts into a dense count matrix.
# print(X.toarray())
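# With the four sample sentences above, the learned vocabulary would be
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# and X.toarray() would produce the matrix below (a sketch of the expected
# result, assuming the default tokenizer and lowercasing):
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 2 1 0 1]
#  [1 0 0 0 1 0 0 1 0]
#  [0 1 1 1 0 0 1 0 1]]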
# stop_words removes words we do not want to count
con_vec = CountVectorizer(min_df=1, stop_words=['之后', '玩完'])
text = '今天天气真好,我要去北京天安门玩,要去景山攻牙之后,玩完大明劫'
# Segment the text with jieba in precise mode
text_list = jieba.cut(text, cut_all=False)
text_list = ",".join(text_list)
context = []
context.append(text_list)
print(context)

X = con_vec.fit_transform(context)
feature_names = con_vec.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
print(feature_names)
print(X.toarray())
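
To inspect the result more readably, the vocabulary and the counts can be zipped into a word-to-frequency mapping. The lines below are a minimal sketch that continues from the feature_names and X variables produced above; sorting by frequency is just one way to present the counts, not part of the original example.

# Pair each vocabulary word with its count in the single document.
counts = X.toarray()[0]
word_freq = dict(zip(feature_names, counts))
# Print the words, most frequent first.
for word, freq in sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True):
    print(word, freq)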

Source: https://blog.csdn.net/YPL_ZML/article/details/93906264