Memoizing expensive computations on DataFrames
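I have an expensive computation that runs on Pandas DataFrames, and I would like to memoize it. I'm trying to figure out what I can use as the cache key.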
In [16]: id(pd.DataFrame({1: [1,2,3]}))
Out[16]: 52015696
In [17]: id(pd.DataFrame({1: [1,2,3]}))
Out[17]: 52015504
In [18]: id(pd.DataFrame({1: [1,2,3]}))
Out[18]: 52015504
In [19]: id(pd.DataFrame({1: [1,2,3]})) # different results, won't work for my case
Out[19]: 52015440
In [20]: hash(pd.DataFrame({1: [1,2,3]})) # throws
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-3bddc0b20163> in <module>()
----> 1 hash(pd.DataFrame({1: [1,2,3]}))
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
52 def __hash__(self):
53 raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54 ' hashed'.format(self.__class__.__name__))
55
56 def __unicode__(self):
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Given that I'm sure I won't mutate the DataFrame being memoized, is there a way to do what I want?
Answer:
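If you don't need the index or column names to take part in the comparison, you can convert the DataFrame's values to a tuple of tuples, which is hashable: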
>>> import pandas as pd
>>> df1 = pd.DataFrame({1: [1,2,3]})
>>> df2 = pd.DataFrame({1: [1,2,3]})
>>> hash(tuple(tuple(x) for x in df1.values)) == hash(tuple(tuple(x) for x in df2.values))
True
>>> id(df1) == id(df2)
False
You can also use the map function instead of a generator expression:
tuple(map(tuple, df1.values))
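To tie this back to the question, here is a minimal sketch of how the tuple key could drive a cache. The names `frame_key`, `memoize_on_values`, `expensive_total`, and `calls` are made up for illustration, and it assumes the cached function takes a single DataFrame that is never mutated:

```python
import functools

import pandas as pd


def frame_key(df):
    # Hashable snapshot of the values only; index and column names are ignored.
    return tuple(map(tuple, df.values))


def memoize_on_values(func):
    # Hypothetical decorator: caches results keyed by the frame's values.
    cache = {}

    @functools.wraps(func)
    def wrapper(df):
        key = frame_key(df)
        if key not in cache:
            cache[key] = func(df)
        return cache[key]

    wrapper.cache = cache  # exposed only so the cache can be inspected
    return wrapper


calls = []  # records each time the real computation runs


@memoize_on_values
def expensive_total(df):
    calls.append(1)
    return int(df.values.sum())


df1 = pd.DataFrame({1: [1, 2, 3]})
df2 = pd.DataFrame({1: [1, 2, 3]})
first = expensive_total(df1)
second = expensive_total(df2)  # different object, equal values: served from the cache
```

Building the key walks every cell, so this only pays off when the computation is much more expensive than one pass over the data. Frames containing NaN may also miss the cache, since NaN does not compare equal to itself.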
If you also need to compare the index, you can add it as a column. You can also preserve the column names by creating a namedtuple:
>>> from collections import namedtuple
>>> from pprint import pprint
>>> df = pd.DataFrame({1: [1,2,3], 2:[3,4,5]})
>>> df['index'] = df.index
>>> df
   1  2  index
0  1  3      0
1  2  4      1
2  3  5      2
>>>
>>> dfr = namedtuple('row', map(lambda x: 'col_' + str(x), df.columns))
>>> res = tuple(map(lambda x: dfr(*x), df.values))
>>> pprint(res)
(row(col_1=1, col_2=3, col_index=0),
 row(col_1=2, col_2=4, col_index=1),
 row(col_1=3, col_2=5, col_index=2))
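If adding an 'index' column to the frame is undesirable, a variation on the same idea is to build the key from the column names, the index, and the values separately. This is only a sketch, not part of the approach above, and `full_frame_key` is a hypothetical helper name:

```python
import pandas as pd


def full_frame_key(df):
    # Combine column names, index labels, and row values into one hashable key.
    columns = tuple(df.columns)
    index = tuple(df.index)
    rows = tuple(map(tuple, df.values))
    return (columns, index, rows)


a = pd.DataFrame({1: [1, 2, 3], 2: [3, 4, 5]})
b = pd.DataFrame({1: [1, 2, 3], 2: [3, 4, 5]})
c = b.rename(columns={2: 'two'})         # same values, different column name
d = b.set_index(pd.Index([10, 11, 12]))  # same values, different index

same_content = full_frame_key(a) == full_frame_key(b)
renamed_differs = full_frame_key(a) != full_frame_key(c)
reindexed_differs = full_frame_key(a) != full_frame_key(d)
```

The frame itself is left untouched, and two frames produce the same key only when their columns, index, and values all match.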
Hope this helps.
Tags: pandas, memoization, python
Source: https://codeday.me/bug/20191030/1966331.html