
How do I interpolate based on index values when using a pandas MultiIndex?

Author: Internet

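I have demographic panel data in which each data point is classified by country, sex, year, and age. For a given country, sex, and year, some values in the age pattern are missing, so I want to interpolate based on the age value. For example, if 5-year-olds have a value of 5 and 10-year-olds have a value of 10, then 6.3-year-olds should get a value of 6.3. I can't use pandas' default 'linear' interpolation method, because my age groups are not evenly spaced. My data looks like this: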

import numpy as np
import pandas as pd

iso3s = ['USA', 'CAN']
age_start_in_years = [0, 0.01, 0.1, 1]
years = [1990, 1991]
sexes = [1, 2]
multi_index = pd.MultiIndex.from_product([iso3s, sexes, years, age_start_in_years],
                                          names=['iso3', 'sex', 'year', 'age_start'])

frame_length = len(iso3s) * len(age_start_in_years) * len(years) * len(sexes)
# Use floats so the column can hold NaN without changing dtype
test_df = pd.DataFrame({'value': np.arange(frame_length, dtype=float)}, index=multi_index)
# sort_index() replaces the deprecated sortlevel()
test_df = test_df.sort_index()

# Insert missingness to practice interpolating
idx = pd.IndexSlice
test_df.loc[idx[:, :, :, [0.01, 0.1]], :] = np.nan
test_df

                                value
iso3    sex year    age_start   
CAN     1   1990    0.00        16.0
                    0.01        NaN
                    0.10        NaN
                    1.00        19.0
            1991    0.00        20.0
                    0.01        NaN
                    0.10        NaN
                    1.00        23.0
        2   1990    0.00        24.0
...

However, when I try test_df.interpolate(method='index'), I get this error:

ValueError: Only `method=linear` interpolation is supported on MultiIndexes.

Surely there must be some way to interpolate based on index values.
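To make the goal concrete, here is a minimal sketch of the difference between positional and index-based interpolation on a single-level index, where method='index' is supported. The names example, s, by_position and by_age are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

# The example from the text: value 5 at age 5, value 10 at age 10,
# so age 6.3 should receive roughly 6.3.
example = np.interp(6.3, [5, 10], [5, 10])
print(example)

# One (country, sex, year) slice with unevenly spaced ages.
s = pd.Series([0.0, np.nan, np.nan, 3.0], index=[0.0, 0.01, 0.1, 1.0])

# 'linear' ignores the index and treats the rows as equally spaced.
by_position = s.interpolate(method='linear')

# 'index' weights each gap by the numeric distance between index values.
by_age = s.interpolate(method='index')

print(pd.DataFrame({'linear': by_position, 'index': by_age}))
```

With 'linear', the two missing ages are filled as if they sat one third and two thirds of the way between the known points; with 'index', they stay close to the value at age 0, because ages 0.01 and 0.1 are much nearer to 0 than to 1.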

Solution:

I found this hacky workaround, which gets rid of the MultiIndex and uses a combination of groupby and apply:

def multiindex_interp(x, interp_col, step_col):
    # Work on a copy so the caller's group is not modified in place
    x = x.copy()
    valid = x[interp_col].notna()

    # Closest valid values before and after each row
    last_valid_value = x[interp_col].ffill()
    next_valid_value = x[interp_col].bfill()

    # Step values (here: ages) at the closest valid rows before and after
    last_valid_step = x[step_col].where(valid).ffill()
    next_valid_step = x[step_col].where(valid).bfill()

    # Simple linear interpolation = distance from last step / (range between closest valid steps) *
    #                               difference between closest values + last value
    interpolated = ((x[step_col] - last_valid_step) / (next_valid_step - last_valid_step)
                    * (next_valid_value - last_valid_value)
                    + last_valid_value)

    # Keep the original value where it is valid, otherwise use the interpolated one
    x[interp_col] = x[interp_col].where(valid, interpolated)
    return x

test_df = test_df.reset_index(drop=False)
grouped = test_df.groupby(['iso3', 'sex', 'year'], group_keys=False)
# apply passes each group as a DataFrame, so both columns are available to the function
interpolated = grouped[['age_start', 'value']].apply(multiindex_interp, 'value', 'age_start')
test_df['value'] = interpolated['value']
test_df
    iso3    sex year    age_start   value
0   CAN     1   1990    0.00        16.00
1   CAN     1   1990    0.01        16.03
2   CAN     1   1990    0.10        16.30
3   CAN     1   1990    1.00        19.00
4   CAN     1   1991    0.00        20.00
5   CAN     1   1991    0.01        20.03
6   CAN     1   1991    0.10        20.30
7   CAN     1   1991    1.00        23.00
8   CAN     2   1990    0.00        24.00
9   CAN     2   1990    0.01        24.03
10  CAN     2   1990    0.10        24.30
11  CAN     2   1990    1.00        27.00
...
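As a possible alternative, the same result can be sketched by letting pandas do the index-based interpolation on each group separately, since method='index' works once the step column is a plain single-level index. This is only a sketch under assumptions: interpolate_within_groups is a hypothetical helper, and demo is a small made-up frame that mirrors two groups of the data above, not the full data set.

```python
import numpy as np
import pandas as pd

def interpolate_within_groups(df, group_cols, step_col, value_col):
    """Hypothetical helper: fill NaNs in value_col per group, weighted by step_col."""
    result = df[value_col].astype(float).copy()
    for _, group in df.groupby(group_cols):
        # Sort by the step so the temporary index is monotonic
        group = group.sort_values(step_col)
        by_step = group.set_index(step_col)[value_col].astype(float)
        # method='index' is allowed here because the index has a single level
        filled = by_step.interpolate(method='index')
        # Write the filled values back using the original row labels
        result.loc[group.index] = filled.to_numpy()
    return result

# Small made-up frame in the same flat layout as test_df after reset_index()
demo = pd.DataFrame({
    'iso3': ['CAN'] * 4 + ['USA'] * 4,
    'sex': [1] * 8,
    'year': [1990] * 8,
    'age_start': [0.0, 0.01, 0.1, 1.0] * 2,
    'value': [16.0, np.nan, np.nan, 19.0, 0.0, np.nan, np.nan, 3.0],
})
demo['value'] = interpolate_within_groups(demo, ['iso3', 'sex', 'year'],
                                          'age_start', 'value')
print(demo)
```

The explicit loop is slower than a vectorized groupby call on large panels, but it avoids depending on how groupby hands groups to a user function, and it reuses the built-in interpolation rather than re-deriving the formula by hand.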

Tags: pandas, python, interpolation
Source: https://codeday.me/bug/20191119/2038553.html