
python – Does gensim.corpora.Dictionary store term frequencies?

Author: Internet

Does gensim.corpora.Dictionary store term frequencies?

From a gensim.corpora.Dictionary, you can get the document frequency of a word (i.e., the number of documents in which that word appears):

from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[OUT]:

The word "these" appears in 1213 documents

There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n)
Filter out the ‘remove_n’ most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there a way to access the term frequency of a word in the gensim.corpora.Dictionary object?

Answer:

No, gensim.corpora.Dictionary does not store term frequencies. You can see this in the source code. The class stores only the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix
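
To make these counters concrete, here is a minimal plain-Python sketch of what each one tracks. It is not gensim's implementation: the function name `collect_stats` and the id assignment order (alphabetical within each document) are assumptions made for illustration only.

```python
from collections import Counter

def collect_stats(documents):
    """Hypothetical helper mirroring the counters listed above (not gensim code)."""
    token2id = {}      # token -> integer id
    dfs = Counter()    # token id -> number of documents containing the token
    num_docs = 0       # number of documents processed
    num_pos = 0        # total number of tokens seen (corpus positions)
    num_nnz = 0        # total number of distinct tokens per document, summed
    for doc in documents:
        unique_tokens = set(doc)
        for token in sorted(unique_tokens):
            # Assign the next free id the first time a token is seen.
            token_id = token2id.setdefault(token, len(token2id))
            # One increment per document, however often the token repeats in it.
            dfs[token_id] += 1
        num_docs += 1
        num_pos += len(doc)
        num_nnz += len(unique_tokens)
    return token2id, dict(dfs), num_docs, num_pos, num_nnz

docs = [["a", "b", "a"], ["b", "c"]]
token2id, dfs, num_docs, num_pos, num_nnz = collect_stats(docs)
print(token2id, dfs, num_docs, num_pos, num_nnz)
```

Note that "a" occurs twice in the first document but contributes only 1 to its document frequency, which is why per-token term counts cannot be recovered from these fields.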

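This means that everything in the class defines frequency as document frequency, not term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.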
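
If term frequencies are needed, one option is to count them yourself from the tokenized documents. The sketch below uses only the standard library. The helper names `term_and_document_frequencies` and `drop_most_frequent_by_df` are hypothetical, and the tie-breaking and renumbering rules are assumptions rather than gensim's actual behavior. It also shows that ranking by document frequency can pick a different "most frequent" token than ranking by term frequency.

```python
from collections import Counter

def term_and_document_frequencies(documents):
    """Count total occurrences (tf) and containing documents (df) per token."""
    tf = Counter()
    df = Counter()
    for doc in documents:
        tf.update(doc)        # every occurrence counts
        df.update(set(doc))   # each token counts once per document
    return tf, df

def drop_most_frequent_by_df(df, remove_n):
    """Drop the remove_n tokens with the highest document frequency,
    then renumber the survivors so their ids have no gaps.
    Ties are broken alphabetically (an assumption of this sketch)."""
    ranked = sorted(df, key=lambda token: (-df[token], token))
    kept = sorted(set(df) - set(ranked[:remove_n]))
    return {token: new_id for new_id, token in enumerate(kept)}

docs = [
    ["the", "the", "the", "the", "cat"],
    ["a", "cat"],
    ["a", "dog"],
    ["a", "bird"],
]
tf, df = term_and_document_frequencies(docs)
top_by_tf = tf.most_common(1)[0][0]   # "the": 4 occurrences, but in 1 document
top_by_df = df.most_common(1)[0][0]   # "a": 3 occurrences, in 3 documents
remaining = drop_most_frequent_by_df(df, remove_n=1)
print(top_by_tf, top_by_df, remaining)
```

Here a document-frequency filter removes "a" and keeps "the", even though "the" has the highest term frequency. The member list quoted above reflects the source at the time of this answer, so it is worth checking the documentation of your installed gensim version for any additional statistics.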

Tags: python, dictionary, frequency, gensim, tf-idf
Source: https://codeday.me/bug/20190608/1195925.html