Fastest way to sort a corpus dictionary into an OrderedDict – python
Given a corpus/text like this:
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .
I can simply do this to get a dictionary of word frequencies:
>>> from collections import Counter
>>> word_freq = Counter()
>>> for line in text.split('\n'):
...     for word in line.split():
...         word_freq[word] += 1
...
But if the goal is an ordered dictionary that runs from the highest frequency to the lowest, I have to do this:
>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
...
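As a side note on the loop above, `OrderedDict` accepts an iterable of `(key, value)` pairs, so the explicit loop can be collapsed into one call. This is a minimal sketch on a toy corpus (the `text` string here is a made-up stand-in, not the Europarl sample); it does not avoid the sort that `most_common()` performs.

```python
from collections import Counter, OrderedDict

# Toy stand-in corpus (an assumption for illustration only).
text = "the cat saw the dog and the dog saw the cat"
word_freq = Counter(text.split())

# most_common() returns (word, count) pairs sorted by count, highest first;
# OrderedDict can be built directly from that list of pairs.
sorted_word_freq = OrderedDict(word_freq.most_common())

# On Python 3.7+ a plain dict also preserves insertion order,
# so it keeps the same highest-to-lowest ordering.
sorted_plain = dict(word_freq.most_common())
```

Either form still pays for the sort plus one pass over the vocabulary; it only removes the Python-level loop body.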
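Imagine I have 1 billion keys in the Counter object: building the ordered dictionary then costs one pass over the corpus (non-unique instances) to count, plus a pass over the vocabulary (unique keys) when iterating through most_common().
Note: Counter.most_common() calls an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472
Given this, I have seen the following code that uses numpy.argsort():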
>>> import numpy as np
>>> words = list(word_freq.keys())    # lists, so they can be indexed on Python 3
>>> freqs = list(word_freq.values())
>>> sorted_word_index = np.argsort(freqs) # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
...
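One way to compare the two approaches is to time them on the same Counter. The sketch below is only a rough harness under stated assumptions: the data is synthetic (integer "word" ids drawn from a Zipf distribution), the helper names `via_most_common` and `via_argsort` are made up for illustration, and the result will depend on vocabulary size, hardware, and library versions.

```python
import timeit
from collections import Counter, OrderedDict

import numpy as np


def via_most_common(counter):
    # Sort in pure Python through Counter.most_common().
    return OrderedDict(counter.most_common())


def via_argsort(counter):
    # Sort the counts with numpy, then rebuild the mapping in that order.
    words = list(counter.keys())
    freqs = np.fromiter(counter.values(), dtype=np.int64, count=len(counter))
    order = np.argsort(freqs)[::-1]  # highest count first
    return OrderedDict((words[i], int(freqs[i])) for i in order)


# Synthetic, skewed token stream (hypothetical data, not a real corpus).
rng = np.random.default_rng(0)
tokens = rng.zipf(1.5, size=200_000)
word_freq = Counter(tokens.tolist())

for fn in (via_most_common, via_argsort):
    seconds = timeit.timeit(lambda: fn(word_freq), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per run")
```

Both functions yield the same sequence of counts, but words with equal counts may come out in a different relative order, because `np.argsort` is not stable by default and the index array is reversed.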
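Which is faster?
Is there any other, faster way to get such an OrderedDict from a Counter?
Other than OrderedDict, are there other Python objects that provide the same sorted key-value pairs?
Assume memory is not an issue. Given 120 GB of RAM, keeping 1 billion key-value pairs shouldn't be much of a problem, right? Assume each key averages 20 characters and each value is a single integer.
Answer: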
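The Series object in pandas is an array of key-value pairs (which can have non-unique keys) and may be of interest. It has a method that sorts by value and is implemented in Cython (`sort()` in older pandas releases, `sort_values()` in current ones). Here is an example that sorts an array of length one million: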
In [39]:
import pandas as pd
import numpy as np
arr = np.arange(1e6)
np.random.shuffle(arr)
s = pd.Series(arr, index=np.arange(1e6))
%timeit s.sort_values()  # was s.sort() in the pandas release the timings below come from
%timeit sorted(arr)
1 loops, best of 3: 85.8 ms per loop
1 loops, best of 3: 1.15 s per loop
Given an ordinary Python dict, you can construct a Series by calling:
my_series = pd.Series(my_dict)
and then sort it by value:
my_series = my_series.sort_values()  # ascending; pass ascending=False for highest first
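Putting the steps together, a minimal end-to-end sketch might look like the following. It assumes a pandas release that provides `Series.sort_values()`, uses a tiny made-up `text` string as the corpus, and converts back to an `OrderedDict` only to show that the ordering carries over; the sorted Series can also be used directly.

```python
from collections import Counter, OrderedDict

import pandas as pd

# Tiny stand-in corpus (an assumption for illustration only).
text = "a b a c b a"
word_freq = Counter(text.split())

# Build a Series from the counts and sort by value, highest first.
my_series = pd.Series(dict(word_freq))
my_series = my_series.sort_values(ascending=False)

# Optional: convert back to an OrderedDict in the sorted order.
# The values are numpy integers, which compare equal to Python ints.
sorted_word_freq = OrderedDict(my_series.items())
print(my_series)
```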
Tags: python, dictionary, numpy, counter, ordereddictionary Source: https://codeday.me/bug/20190519/1136971.html