
How is TF-IDF implemented in Python with the gensim tool?


From the documentation I found online, I worked out the expression used to compute the term-frequency / inverse-document-frequency weight of a term in a corpus:

tf-idf(w_t) = tf * log(|N| / d), where |N| is the total number of documents in the corpus and d is the number of documents containing the term w_t.
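For concreteness, here is a minimal sketch of that standard weighting in plain Python (the toy corpus and the helper function tf_idf below are made up purely for illustration):

import math

# Toy corpus: four documents, already tokenized (illustrative data only)
docs = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system"],
    ["eps", "user", "interface", "system"],
    ["human", "system", "system"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # raw count of the term in the document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term (d above)
    return tf * math.log(len(docs) / df)    # tf * log(|N| / d)

print(tf_idf("system", docs[3], docs))      # 2 * log(4/3) ~= 0.575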

I was going through the implementation of tf-idf in gensim.
The example given in the documentation is:

>>> doc_bow = [(0, 1), (1, 1)]
>>> print tfidf[doc_bow] # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)] 
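The docs snippet above is only a fragment (and uses the Python 2 print statement); a self-contained Python 3 version might look roughly like the sketch below, where the two training documents are invented so that token ids 0 and 1 each occur in exactly one document and therefore share the same idf:

from gensim import corpora, models

# Illustrative training corpus: every token occurs in exactly one document,
# so all tokens receive the same idf weight.
texts = [["human", "computer"], ["system", "interface"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)   # step 1 -- initialize the model on the corpus
doc_bow = [(0, 1), (1, 1)]          # new document: token ids 0 and 1, once each
print(tfidf[doc_bow])               # step 2 -- transform it; both weights come out as ~0.70710678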

This clearly does not follow the standard formulation of TF-IDF.
What is the difference between the two models?

Note: 0.70710678 is the value 2^(-1/2), which usually shows up in eigenvalue computations.
So how do eigenvalues enter the TF-IDF model?

Solution:

From Wikipedia:

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document)

gensim source, lines 126-127:

if self.normalize:
    vector = matutils.unitvec(vector)
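In other words, by default gensim rescales each tf-idf vector to unit Euclidean length. For a document whose two tokens get equal raw weights w, each normalized component is w / sqrt(w^2 + w^2) = 1/sqrt(2) ≈ 0.70710678; the number has nothing to do with eigenvalues. A quick check of that arithmetic in plain Python (mirroring what matutils.unitvec does):

import math

raw = [1.0, 1.0]                            # equal raw tf-idf weights for the two tokens
norm = math.sqrt(sum(w * w for w in raw))   # Euclidean (L2) length of the vector
print([w / norm for w in raw])              # [0.7071067811865475, 0.7071067811865475]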

Tags: gensim, tf-idf, latent-semantic-indexing, python
Source: https://codeday.me/bug/20191101/1985142.html