Keyword Extraction: TF-IDF (Part 1)

Author: Internet

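Series articles

✓ Word vectors

✗ Adam, SGD

✗ Vanishing and exploding gradients

✗ Initialization methods

✗ Overfitting & underfitting

✗ Evaluation metrics & loss functions explained

✗ Deep learning models and common tasks explained

✗ Time complexity of RNNs

✗ Neo4j graph database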

 

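Word segmentation and word vectors

Keyword extraction: TF-IDF

TfidfVectorizer

Basic introduction

Algorithm details

Algorithm pros and cons

Application scenarios

Runnable example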

# python: 3.8
# sklearn: 0.23.1
# 1. CountVectorizer converts a collection of text documents into a sparse matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
   'This is the first document.',
   'This document is the second document.',
   'And this is the third one.',
   'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
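# Show the vocabulary; a term's index in this list is its column in the matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+)
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# Show the count matrix (one row per document)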
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
# [0 2 0 1 0 1 1 0 1]
# [1 0 0 1 1 0 1 1 1]
# [0 1 1 1 0 0 1 0 1]]

# 2. TfidfTransformer: computes tf-idf weights from the count matrix
from sklearn.feature_extraction.text import TfidfTransformer
transform = TfidfTransformer()
Y = transform.fit_transform(X)
print(Y.toarray())  # print the tf-idf values
# [[0.         0.46979139 0.58028582 0.38408524 0.         0. 0.38408524 0.         0.38408524]
# [0.         0.6876236 0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
# [0.51184851 0.         0.         0.26710379 0.51184851 0. 0.26710379 0.51184851 0.26710379]
# [0.         0.46979139 0.58028582 0.38408524 0.         0. 0.38408524 0.         0.38408524]]

# 3. TfidfVectorizer: equivalent to CountVectorizer followed by TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
   'This is the first document.',
   'This document is the second document.',
   'And this is the third one.',
   'Is this the first document?',
]
vectorizer = TfidfVectorizer()  # computes TF-IDF weights directly from raw text
X = vectorizer.fit_transform(corpus)
# On scikit-learn 1.0+, use get_feature_names_out() instead
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.shape)
# (4, 9)
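
To make the numbers printed by TfidfTransformer less of a black box, here is a minimal pure-Python sketch that recomputes them by hand and then picks the highest-weighted terms of each document as its keywords. It assumes scikit-learn's default settings (raw counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per row). The helper names `tokenize`, `tfidf_row` and `top_keywords` are hypothetical and exist only for this illustration, and the regular expression only approximates the default tokenizer.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Term counts per document (the rows of the CountVectorizer matrix)
docs = [Counter(tokenize(text)) for text in corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {term: sum(1 for doc in docs if term in doc) for term in vocab}

# Smoothed IDF, assuming the scikit-learn default smooth_idf=True
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(counts):
    # Raw count times IDF, then L2-normalize the row
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in docs]

def top_keywords(row, k=2):
    # Rank terms by weight and keep the k largest non-zero ones
    ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:k] if weight > 0]

print(vocab)
# First row: document ≈ 0.4698, first ≈ 0.5803, is/the/this ≈ 0.3841
print([round(v, 8) for v in tfidf[0]])
for i, row in enumerate(tfidf):
    print(i, top_keywords(row))
```

Terms that occur in every document ('is', 'the', 'this') get the minimum IDF of 1, so the rarer terms rise to the top of each row. That ranking is what makes TF-IDF usable for keyword extraction.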

Parameter descriptions
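
As a rough illustration of what a few TfidfVectorizer parameters do on the same four-sentence corpus, here is a minimal sketch. The parameters shown (`stop_words`, `ngram_range`, `min_df`, `max_df`, `use_idf`, `norm`), the custom stop-word list and the variable names are my own selection for illustration, not necessarily the ones this section originally covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# stop_words: drop the listed tokens before building the vocabulary
# (the list itself is an arbitrary choice for this example)
vec_stop = TfidfVectorizer(stop_words=['is', 'the', 'this', 'and'])
X_stop = vec_stop.fit_transform(corpus)
print(sorted(vec_stop.vocabulary_))

# ngram_range=(1, 2): use single words and two-word sequences as features
vec_ngram = TfidfVectorizer(ngram_range=(1, 2))
X_ngram = vec_ngram.fit_transform(corpus)
print(X_ngram.shape)

# min_df=2: keep only terms that appear in at least 2 documents
vec_min = TfidfVectorizer(min_df=2)
X_min = vec_min.fit_transform(corpus)
print(sorted(vec_min.vocabulary_))

# max_df=0.75: drop terms that appear in more than 75% of the documents
vec_max = TfidfVectorizer(max_df=0.75)
X_max = vec_max.fit_transform(corpus)
print(sorted(vec_max.vocabulary_))

# use_idf=False with norm=None: no IDF weighting and no normalization,
# so the matrix holds plain term counts as floats
vec_tf = TfidfVectorizer(use_idf=False, norm=None)
X_tf = vec_tf.fit_transform(corpus)
print(X_tf.toarray()[1])
```

The fitted `vocabulary_` attribute is read here rather than a feature-name method so the sketch does not depend on which scikit-learn version is installed.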


Source: https://www.cnblogs.com/nlper2wx/p/15200892.html