编程语言
首页 > 编程语言> > python – Pandas-拆分多索引数据帧

python – Pandas-拆分多索引数据帧

作者:互联网

我有一个多索引的pandas数据框,我想用“||”分隔值字符并包含一个索引层,其中包含三个新列’Connection’,’Val1”Val2′.

如果有人可以给我一些提示来做这件事会有帮助.

当前示例数据框:

                              Experiment1                        Experiment2      
    Target      Analyze1_ab               Analyze2_zz            Analyze1_yy        
    XXX_1   Edge2||3.1E-07||-0.5    Edge2||2.1E-06||-0.9     Edge2||6.4E-02||-0.3   
    XXX_4   Edge1||6.4E-12||1.1     Edge1||2.4E-11||9.4      Edge1||1.4E-11||1.4    
    ABC_1   Edge1||3.9E-07||0.7     Edge1||2.9E-07||5.6      Edge1||6.8E-02||0.4    
    ABC_2   Edge2||1.1E-09||-0.5    Edge2||1.2E-09||1.2      Edge2||1.0E-03||-0.5   
    ABC_3   Edge2||4.6E-25||-0.8    Edge2||2.6E-10||1.9      Edge2||5.0E-17||-0.9   
    XXX_2   Edge2||1.7E-07||-0.5    Edge2||5.7E-08||-0.3     Edge2||4.1E-02||-0.3   
    ABC_4   Edge1||8.1E-02||0.5     Edge1||9.1E-02||1.5      Edge1||5.4E-02||0.6    
    ABC_5   Edge1||6.7E-02||0.3     Edge1||4.2E-02||1.9      Edge1||5.6E-03||0.4    
    XXX_3   Edge2||3.1E-03||-0.4    Edge1||2.4E-11||1.1      Edge2||2.4E-02||-0.3   

期望的数据框架:

                                   Experiment1                            Experiment2       
    Target          Analyze1_ab                  Analyze2_zz                Analyze1_yy     
           Connection   Val1    Val2    Connection  Val1    Val2    Connection  Val1    Val2
    XXX_1   Edge2   3.10E-07    -0.5    Edge2   2.10E-06    -0.9    Edge2   6.40E-02    -0.3
    XXX_4   Edge1   6.40E-12    1.1     Edge1   2.40E-11    9.4     Edge1   1.40E-11    1.4
    ABC_1   Edge1   3.90E-07    0.7     Edge1   2.90E-07    5.6     Edge1   6.80E-02    0.4
    ABC_2   Edge2   1.10E-09    -0.5    Edge2   1.20E-09    1.2     Edge2   1.00E-03    -0.5
    ABC_3   Edge2   4.60E-25    -0.8    Edge2   2.60E-10    1.9     Edge2   5.00E-17    -0.9
    XXX_2   Edge2   1.70E-07    -0.5    Edge2   5.70E-08    -0.3    Edge2   4.10E-02    -0.3
    ABC_4   Edge1   8.10E-02    0.5     Edge1   9.10E-02    1.5     Edge1   5.40E-02    0.6
    ABC_5   Edge1   6.70E-02    0.3     Edge1   4.20E-02    1.9     Edge1   5.60E-03    0.4
    XXX_3   Edge2   3.10E-03    -0.4    Edge1   2.40E-11    1.1     Edge2   2.40E-02    -0.3

解决方法:

import pandas as pd

# Initialize DataFrame
# -----------------------------------------------------------------------------

df = pd.DataFrame({
    'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
    'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
    'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
    'Target': ['XXX_1', 'XXX_4'],})

df.columns = pd.MultiIndex.from_tuples(
    [('Experiment1', 'Analyze1_ab'),
     ('Experiment2', 'Analyze1_yy'),
     ('Experiment1', 'Analyze2_zz'),
     ('Target', '')])

# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------

# Initialize final DataFrame
final_df = pd.DataFrame()

for col_name in df.columns:
    if (col_name[1].startswith('Analyze') and
            df[col_name].str.contains('||').all()):
        # Split 'Analysis' by || into new columns
        splitted_analysis = df[col_name].str.split('\|\|', expand=True)
        # The new column names are 0, 1, 2. Let's rename them.
        splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
        # Recreate MultiIndex
        splitted_analysis.columns = pd.MultiIndex.from_tuples(
            [(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
        # Concatenate the new columns to the final_df
        final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)

# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)

验证:print(final_df):

  Experiment1                Experiment2                Experiment1                Target
  Analyze1_ab                Analyze1_yy                Analyze2_zz                      
   Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2       
0       Edge2  3.1E-07  -0.5       Edge2  6.4E-02  -0.3       Edge2  2.1E-06  -0.9  XXX_1
1       Edge1  6.4E-12   1.1       Edge1  1.4E-11   1.4       Edge1  2.4E-11   9.4  XXX_4

验证:pprint.pprint([c for final_df.columns]):

[('Experiment1', 'Analyze1_ab', 'Connection'),
 ('Experiment1', 'Analyze1_ab', 'Val1'),
 ('Experiment1', 'Analyze1_ab', 'Val2'),
 ('Experiment2', 'Analyze1_yy', 'Connection'),
 ('Experiment2', 'Analyze1_yy', 'Val1'),
 ('Experiment2', 'Analyze1_yy', 'Val2'),
 ('Experiment1', 'Analyze2_zz', 'Connection'),
 ('Experiment1', 'Analyze2_zz', 'Val1'),
 ('Experiment1', 'Analyze2_zz', 'Val2'),
 ('Target', '', '')]

标签:python,pandas,multiple-columns
来源: https://codeday.me/bug/20190611/1217502.html