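python - Does selecting/slicing a multi-index DataFrame time series by period lead to a bug?
Author: Internet
I have a multi-level index whose first level is a time series, very similar to the one below: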
In[168]: rng = pd.date_range('01-01-2000',periods=50,freq='M')
In[169]: long_df = pd.DataFrame(np.random.randn(50,4),index = rng, columns=['bar','baz','foo','zoo'])
In[170]: long_df = long_df.stack()
In[171]: long_df[:10]
Out[171]:
2000-01-31 bar 2.079474
baz -0.569920
foo 1.149012
zoo -0.228926
2000-02-29 bar 0.429502
baz -0.117166
foo 0.956546
zoo -1.483818
2000-03-31 bar -1.137998
baz 1.049849
Edit
I can slice it by a period (here the year 2001), and that works fine:
In[172]: long_df = long_df.sort_index()
In[173]: long_df.loc['2001']
Out[173]:
2001-01-31 bar -0.193987
baz 0.769297
foo 0.286880
zoo -1.431313
2001-02-28 bar -0.840502
baz 1.786758
foo 0.878356
zoo 0.433383
2001-03-31 bar 0.897548
baz 1.901540
foo 0.110606
zoo 0.571267
2001-04-30 bar -0.375377
baz 1.423742
foo -0.415006
zoo -0.141000
(...)
However, when I do the same with my multi-index DataFrame, the slice is not applied and rows from every year come back:
In[204]: dfmi
Out[204]:
Last Days to expiry
Date Ticker
1988-12-06 HGF89 1.46894 52
HGF90 1.17100 419
HGG89 1.42100 80
HGH89 1.37344 113
HGH90 1.17450 477
HGK89 1.28750 171
HGK90 1.15900 539
HGN89 1.24550 233
HGN90 1.15900 598
HGU89 1.21750 295
HGU90 1.15900 659
HGZ89 1.18500 386
1988-12-07 HGF89 1.51900 51
HGF90 1.18900 418
HGG89 1.46394 79
HGH89 1.41300 112
HGH90 1.19250 476
HGK89 1.31750 170
HGK90 1.17700 538
HGN89 1.27550 232
HGN90 1.17700 597
HGU89 1.24250 294
HGU90 1.17700 658
HGZ89 1.20300 385
1988-12-08 HGF89 1.58100 50
HGF90 1.18900 417
HGG89 1.50894 78
HGH89 1.43994 111
HGH90 1.19250 475
HGK89 1.32750 169
... ...
2016-07-05 HGK7 2.20500 325
HGM7 2.20900 358
HGN6 2.18150 22
HGN7 2.21000 387
HGQ6 2.18150 55
HGQ7 2.21450 420
HGU6 2.18350 85
HGU7 2.21550 449
HGV6 2.18700 114
HGV7 2.21850 479
HGX6 2.19100 146
HGX7 2.22000 511
HGZ6 2.19250 176
2016-07-06 HGF7 2.16700 205
HGG7 2.17100 233
HGH7 2.17100 266
HGJ7 2.17550 294
HGK7 2.17650 324
HGM7 2.18050 357
HGN6 2.15150 21
HGN7 2.18150 386
HGQ6 2.15150 54
HGQ7 2.18600 419
HGU6 2.15350 84
HGU7 2.18700 448
HGV6 2.15700 113
HGV7 2.19000 478
HGX6 2.16100 145
HGX7 2.19150 510
HGZ6 2.16300 175
[167701 rows x 2 columns]
In[204]: dfmi = dfmi.sort_index()
In[205]: dfmi.loc['2001']
Out[206]:
Last Days to expiry
Date Ticker
1988-12-06 HGF89 1.46894 52
HGF90 1.17100 419
HGG89 1.42100 80
HGH89 1.37344 113
HGH90 1.17450 477
HGK89 1.28750 171
HGK90 1.15900 539
HGN89 1.24550 233
HGN90 1.15900 598
HGU89 1.21750 295
HGU90 1.15900 659
1988-12-07 HGF89 1.51900 51
HGF90 1.18900 418
HGG89 1.46394 79
HGH89 1.41300 112
HGH90 1.19250 476
HGK89 1.31750 170
HGK90 1.17700 538
HGN89 1.27550 232
HGN90 1.17700 597
HGU89 1.24250 294
HGU90 1.17700 658
1988-12-08 HGF89 1.58100 50
HGF90 1.18900 417
HGG89 1.50894 78
HGH89 1.43994 111
HGH90 1.19250 475
HGK89 1.32750 169
HGK90 1.17700 537
HGN89 1.27750 231
... ...
2016-07-05 HGH7 2.19950 267
HGJ7 2.20400 295
HGK7 2.20500 325
HGM7 2.20900 358
HGN6 2.18150 22
HGN7 2.21000 387
HGQ6 2.18150 55
HGQ7 2.21450 420
HGU6 2.18350 85
HGU7 2.21550 449
HGV6 2.18700 114
HGV7 2.21850 479
HGX6 2.19100 146
HGX7 2.22000 511
2016-07-06 HGF7 2.16700 205
HGG7 2.17100 233
HGH7 2.17100 266
HGJ7 2.17550 294
HGK7 2.17650 324
HGM7 2.18050 357
HGN6 2.15150 21
HGN7 2.18150 386
HGQ6 2.15150 54
HGQ7 2.18600 419
HGU6 2.15350 84
HGU7 2.18700 448
HGV6 2.15700 113
HGV7 2.19000 478
HGX6 2.16100 145
HGX7 2.19150 510
[161017 rows x 2 columns]
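I noticed that long_df, given as the example, is a pandas.core.series.Series, whereas the df I actually use is a pandas.core.frame.DataFrame, so the two differ in type.
What is the correct way to do this?
Many thanks for any tips,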
Solution:
You need to use loc, which in turn requires a recent version of pandas (0.18.1, the latest at the time of writing):
print (long_df.loc['2001'])
2001-01-31 bar 1.684425
baz 1.215258
foo 0.158968
zoo 0.689477
2001-02-28 bar -0.123582
baz 0.312533
foo 0.609169
zoo -0.093985
2001-03-31 bar 0.372093
baz -0.281191
foo -0.400354
zoo 0.646965
2001-04-30 bar -0.287488
baz -0.928941
foo 1.365416
zoo 0.267282
2001-05-31 bar -1.021086
baz 0.317819
foo -0.393135
zoo -0.213589
2001-06-30 bar -2.594173
...
...
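The same selection applies to a DataFrame with a MultiIndex. The following is a minimal sketch, assuming a small made-up frame (`demo`, with hypothetical tickers `HGF` and `HGG`) in place of the real `dfmi`, and a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmi: month-start dates crossed with two made-up tickers.
dates = pd.date_range("2000-01-01", "2002-12-01", freq="MS")
index = pd.MultiIndex.from_product([dates, ["HGF", "HGG"]], names=["Date", "Ticker"])
demo = pd.DataFrame({"Last": np.arange(len(index), dtype=float)}, index=index)

# Partial-string selection on the first level relies on a sorted index.
demo = demo.sort_index()

# A year string passed to .loc selects every row whose Date falls in 2001.
year_2001 = demo.loc["2001"]
print(year_2001.head())
```

Both index levels are kept in the result, so the selected rows can still be grouped or filtered by ticker afterwards.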
Edit:
Another solution is to take get_level_values of the first level and call get_loc on it to find the integer positions:
import pandas as pd
long_df = pd.read_csv('test/testslice.csv', parse_dates=[0], index_col=[0,1])
dfmi = long_df.stack().sort_index()
print (dfmi.index.get_level_values(0))
DatetimeIndex(['1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
'1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
'1988-12-06', '1988-12-06',
...
'2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
'2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
'2016-07-06', '2016-07-06'],
dtype='datetime64[ns]', name='Date', length=335402, freq=None)
print (dfmi.index.get_level_values(0).get_loc('2001'))
slice(121844, 133684, None)
print (dfmi.iloc[dfmi.index.get_level_values(0).get_loc('2001')])
Date Ticker
2001-01-02 HGF01 Last 0.8180
Days to expiry 27.0000
HGF02 Last 0.8180
Days to expiry 392.0000
HGG01 Last 0.8165
Days to expiry 55.0000
HGG02 Last 0.8180
Days to expiry 420.0000
HGH01 Last 0.8115
Days to expiry 85.0000
HGH02 Last 0.8180
Days to expiry 448.0000
HGJ01 Last 0.8125
Days to expiry 114.0000
HGJ02 Last 0.8170
Days to expiry 479.0000
HGK01 Last 0.8135
Days to expiry 147.0000
HGK02 Last 0.8160
Days to expiry 512.0000
HGM01 Last 0.8145
Days to expiry 176.0000
HGM02 Last 0.8155
Days to expiry 540.0000
HGN01 Last 0.8155
Days to expiry 206.0000
HGN02 Last 0.8140
Days to expiry 573.0000
HGQ01 Last 0.8160
Days to expiry 239.0000
...
2001-12-31 HGK03 Last 0.6960
Days to expiry 513.0000
HGM02 Last 0.6680
Days to expiry 177.0000
HGM03 Last 0.6980
Days to expiry 542.0000
HGN02 Last 0.6710
Days to expiry 210.0000
HGN03 Last 0.7005
Days to expiry 575.0000
HGQ02 Last 0.6740
Days to expiry 240.0000
HGQ03 Last 0.7030
Days to expiry 604.0000
HGU02 Last 0.6770
Days to expiry 269.0000
HGU03 Last 0.7050
Days to expiry 634.0000
HGV02 Last 0.6795
Days to expiry 302.0000
HGV03 Last 0.7080
Days to expiry 667.0000
HGX02 Last 0.6820
Days to expiry 329.0000
HGX03 Last 0.7110
Days to expiry 694.0000
HGZ02 Last 0.6850
Days to expiry 361.0000
HGZ03 Last 0.7140
Days to expiry 728.0000
dtype: float64
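The positional lookup above can be reproduced on a small frame. This is only a sketch under assumptions: `demo` and its tickers are invented for illustration, and the index is sorted so that `get_loc` can resolve the year string to a contiguous range of positions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmi (names and values are made up).
dates = pd.date_range("2000-01-01", "2002-12-01", freq="MS")
index = pd.MultiIndex.from_product([dates, ["HGF", "HGG"]], names=["Date", "Ticker"])
demo = pd.DataFrame({"Last": np.arange(len(index), dtype=float)}, index=index).sort_index()

# Flatten the first level into a plain DatetimeIndex (one entry per row).
first_level = demo.index.get_level_values(0)

# Resolve the year string to integer positions within that level.
positions = first_level.get_loc("2001")

# Select the rows by position rather than by label.
by_position = demo.iloc[positions]
print(positions)
print(by_position.head())
```

Because the selection goes through iloc, it does not depend on how loc interprets a string key on a MultiIndex.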
EDIT 1, in response to a comment:
Unfortunately, if you need to select a range of years, my solution with a list comprehension and concat is slower:
print (list(range(1993, 2003)))
[1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002]
dfs = [dfmi.iloc[dfmi.index.get_level_values(0).get_loc(str(x))] for x in range(1993, 2003)]
print (pd.concat(dfs))
1993-01-01 00:00:00 bar 0.080676
baz 0.315925
foo -1.484132
zoo -0.977202
1993-01-01 01:00:00 bar 0.817846
baz -1.280649
foo 0.727975
zoo -0.062142
1993-01-01 02:00:00 bar 1.278623
baz 0.268865
foo -0.183612
zoo 0.194996
1993-01-01 03:00:00 bar -0.304734
baz -0.227468
foo -0.134305
zoo 0.887374
1993-01-01 04:00:00 bar -0.166669
baz -0.132718
foo -0.624932
zoo 1.959724
1993-01-01 05:00:00 bar -1.379774
baz -0.738452
foo 0.398924
zoo 0.005612
1993-01-01 06:00:00 bar -0.864205
baz -0.813321
foo 0.931858
zoo -1.005977
1993-01-01 07:00:00 bar 0.667380
baz -1.208457
...
2002-10-30 08:00:00 foo 0.311835
zoo 0.611802
2002-10-30 09:00:00 bar 2.615050
baz -0.291767
foo -0.508202
zoo 0.443429
2002-10-30 10:00:00 bar -1.724252
baz -0.126579
foo 1.108530
zoo -0.553025
2002-10-30 11:00:00 bar 1.208705
baz -1.561024
foo 0.722768
zoo 1.893419
2002-10-30 12:00:00 bar 0.239383
baz -0.543053
foo -0.687370
zoo 0.848929
2002-10-30 13:00:00 bar 0.897465
baz 0.631292
foo 0.068200
zoo -1.579010
2002-10-30 14:00:00 bar -0.996531
baz -1.208318
foo 0.174970
zoo -0.780913
2002-10-30 15:00:00 bar 0.237465
baz 0.380585
foo -1.646285
zoo -0.730744
dtype: float64
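One way to avoid the per-year loop is a single boolean mask built from the year of the first index level. This is not part of the original answer; it is a sketch on an invented frame (`demo`, with hypothetical tickers), and whether it is faster on a given dataset would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmi (names and values are made up).
dates = pd.date_range("2000-01-01", "2002-12-01", freq="MS")
index = pd.MultiIndex.from_product([dates, ["HGF", "HGG"]], names=["Date", "Ticker"])
demo = pd.DataFrame({"Last": np.arange(len(index), dtype=float)}, index=index).sort_index()

# Year of every row, taken from the Date level.
years = demo.index.get_level_values("Date").year

# One vectorized mask covering the whole range of years (2000 and 2001 here).
in_range = demo[(years >= 2000) & (years <= 2001)]

# The loop-and-concat approach, shown for comparison only.
first_level = demo.index.get_level_values(0)
looped = pd.concat([demo.iloc[first_level.get_loc(str(y))] for y in range(2000, 2002)])

print(len(in_range), len(looped))
```

The mask does not require a sorted index, since it compares the year of each row directly instead of looking up positions.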
Tags: multi-index, pandas, slice, time-series, python Source: https://codeday.me/bug/20191118/2026922.html