Chapter 2.2 High-Frequency Word and Keyword Extraction (Part 2, continued)

Author: Internet

Topic 2.2.5 TF-IDF keyword extraction with sklearn

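Characteristics of TF-IDF keyword extraction with sklearn:

  1. It can use the jieba library for word segmentation
  2. It can use custom dictionaries (new words, stop words)
  3. It is suited to keyword extraction across multiple documents (rather than a single document)
  4. It computes TF-IDF values from the corpus you load (the model must be fitted first)
  5. The results are hard to read (they are returned as a matrix rather than a list)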
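
As background for point 4, the weighting that TfidfVectorizer applies with its default settings and smooth_idf = True can be written out as below. This is a summary of the formula documented by scikit-learn, not something stated in the original post; the symbols n (number of documents), df(t) (number of documents containing term t) and tf(t, d) (count of term t in document d) are notation introduced here.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
```

With the default norm = 'l2', each document's row of tf-idf values is then scaled to unit Euclidean length, which is why the values in the final matrix fall between 0 and 1.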

scikit-learn website (https://scikit-learn.org.cn/)

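# Load the required modules
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
# txt_list (a list of document strings) and stopword (the custom stop-word module) are not defined in this section; they are assumed to have been prepared beforehand
# Convert the raw documents into a TF-IDF matrix (instantiate and fit the model)
# max_df = 50 is an integer, so terms that appear in more than 50 documents are ignored
vect = TfidfVectorizer(tokenizer = jieba.lcut, stop_words = list(stopword.stopword), max_df = 50, smooth_idf = True)
matrix = vect.fit_transform(txt_list)
# Print, in order: IDF values, term coordinates with TF-IDF values, the TF-IDF matrix, the TF-IDF feature terms, and the feature terms with their indices
print('IDF values:\n', vect.idf_)
print('Term coordinates and TF-IDF values:\n', matrix)
print('TF-IDF matrix:\n', matrix.todense())
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use vect.get_feature_names() instead
print('TF-IDF feature terms:\n', vect.get_feature_names_out())
print('TF-IDF feature terms and indices:\n', vect.vocabulary_)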


Source: https://blog.csdn.net/Yif18/article/details/122681919