python – Computing pairwise column subtractions in a pandas DataFrame

Author: Internet

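I work with large DataFrames (48K rows, up to a few dozen columns). At some point while manipulating them, I need to compute pairwise subtractions of column values, and I would like to know whether there is a more efficient way to do this than my current approach (see below).

My current code: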

import itertools

import pandas

# matrix is the pandas DataFrame containing all the data;
# group1 and group2 are iterables of column names in matrix
comparison_df = pandas.DataFrame(index=matrix.index)
combinations = itertools.product(group1, group2)

for observed, reference in combinations:
    # Subtract the reference column from the observed column, row by row
    comparison = matrix[observed] - matrix[reference]
    # Name the result column after the pair, e.g. "A1_B1"
    name = observed + "_" + reference
    comparison_df[name] = comparison

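Since the data can be large (I also use this piece of code during permutation tests), I would like to know whether it can be optimized.

Edit: as requested, here is a sample of a typical dataset: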

ID                    A1      A2       A3       B1       B2       B3
Ku8QhfS0n_hIOABXuE    6.343   6.304    6.410    6.287    6.403    6.279
fqPEquJRRlSVSfL.8A    6.752   6.681    6.680    6.677    6.525    6.739
ckiehnugOno9d7vf1Q    6.297   6.248    6.524    6.382    6.316    6.453
x57Vw5B5Fbt5JUnQkI    6.268   6.451    6.379    6.371    6.458    6.333

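A typical result, if the "A" columns are group1 and the "B" columns are group2, has one column per generated pair (e.g. A1_B1, A2_B1, A3_B1, ...), and each column holds the subtraction for every row ID.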
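To make the expected layout concrete, here is a minimal sketch that builds the group1 x group2 columns for the first two sample rows above. It assumes `group1` and `group2` are lists of column names and that result columns are named observed_reference, as in the question's code. Unlike the loop in the question, it collects the differences in a dict and builds the DataFrame once, which is a stylistic choice here and not a measured speed-up.

```python
import itertools

import pandas as pd

# First two rows of the sample data from the question, indexed by ID.
matrix = pd.DataFrame(
    {
        "A1": [6.343, 6.752], "A2": [6.304, 6.681], "A3": [6.410, 6.680],
        "B1": [6.287, 6.677], "B2": [6.403, 6.525], "B3": [6.279, 6.739],
    },
    index=pd.Index(["Ku8QhfS0n_hIOABXuE", "fqPEquJRRlSVSfL.8A"], name="ID"),
)

# Assumption: each group is a list of column names.
group1 = ["A1", "A2", "A3"]
group2 = ["B1", "B2", "B3"]

# Collect every observed - reference difference, then build the frame once.
differences = {
    f"{observed}_{reference}": matrix[observed] - matrix[reference]
    for observed, reference in itertools.product(group1, group2)
}
comparison_df = pd.DataFrame(differences)

# One column per (group1, group2) pair, in itertools.product order.
print(list(comparison_df.columns))
```

With three columns in each group this yields nine result columns, one per cross-group pair, and no within-group pairs.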

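Solution:

Using itertools.combinations() on DataFrame columns

You can create the column combinations with itertools.combinations(), then compute the subtraction and the new column name from each pair: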

import itertools
from io import StringIO

import pandas as pd

matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')

print('Original DataFrame:')
print(matrix)
print()

# Create an empty DataFrame to fill with the combinations
comparison_df = pd.DataFrame(index=matrix.index)
# Iterate over every unordered pair of columns
for a, b in itertools.combinations(matrix.columns, 2):
    # Subtract the second column of the pair from the first
    comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]

print('Combination DataFrame:')
print(comparison_df)

Output:

Original DataFrame:
                       A1     A2     A3     B1     B2     B3
ID                                                          
Ku8QhfS0n_hIOABXuE  6.343  6.304  6.410  6.287  6.403  6.279
fqPEquJRRlSVSfL.8A  6.752  6.681  6.680  6.677  6.525  6.739
ckiehnugOno9d7vf1Q  6.297  6.248  6.524  6.382  6.316  6.453
x57Vw5B5Fbt5JUnQkI  6.268  6.451  6.379  6.371  6.458  6.333

Combination DataFrame:
                    A1_A2  A1_A3  A1_B1  A1_B2  A1_B3  A2_A3  A2_B1  A2_B2  A2_B3  A3_B1  A3_B2  A3_B3  B1_B2  B1_B3  B2_B3
ID                                                                                                                         
Ku8QhfS0n_hIOABXuE  0.039 -0.067  0.056 -0.060  0.064 -0.106  0.017 -0.099  0.025  0.123  0.007  0.131 -0.116  0.008  0.124
fqPEquJRRlSVSfL.8A  0.071  0.072  0.075  0.227  0.013  0.001  0.004  0.156 -0.058  0.003  0.155 -0.059  0.152 -0.062 -0.214
ckiehnugOno9d7vf1Q  0.049 -0.227 -0.085 -0.019 -0.156 -0.276 -0.134 -0.068 -0.205  0.142  0.208  0.071  0.066 -0.071 -0.137
x57Vw5B5Fbt5JUnQkI -0.183 -0.111 -0.103 -0.190 -0.065  0.072  0.080 -0.007  0.118  0.008 -0.079  0.046 -0.087  0.038  0.125
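
Note that combinations() over all columns also produces within-group pairs (A1_A2, B1_B2, ...), whereas the question only asks for group1 against group2. For larger frames, one option is to compute the cross-group differences in a single NumPy broadcasting step instead of a Python loop. The following is a sketch under stated assumptions, not a benchmarked solution: `pairwise_differences` is a hypothetical helper name, and it assumes the selected columns are all numeric.

```python
import numpy as np
import pandas as pd


def pairwise_differences(matrix, group1, group2):
    """Return matrix[a] - matrix[b] for every a in group1 and b in group2."""
    observed = matrix[group1].to_numpy()    # shape (rows, len(group1))
    reference = matrix[group2].to_numpy()   # shape (rows, len(group2))
    # Broadcasting yields shape (rows, len(group1), len(group2)).
    diff = observed[:, :, None] - reference[:, None, :]
    # Flatten the two group axes; the resulting column order matches
    # itertools.product(group1, group2).
    names = [f"{a}_{b}" for a in group1 for b in group2]
    flat = diff.reshape(len(matrix), len(group1) * len(group2))
    return pd.DataFrame(flat, index=matrix.index, columns=names)


# Small demonstration on two rows of the sample data.
matrix = pd.DataFrame(
    {
        "A1": [6.343, 6.752], "A2": [6.304, 6.681],
        "B1": [6.287, 6.677], "B2": [6.403, 6.525],
    },
    index=pd.Index(["Ku8QhfS0n_hIOABXuE", "fqPEquJRRlSVSfL.8A"], name="ID"),
)
result = pairwise_differences(matrix, ["A1", "A2"], ["B1", "B2"])
print(result.round(3))
```

The intermediate array holds rows x len(group1) x len(group2) values, so memory use grows with the product of the group sizes. Whether this is actually faster than the loop should be checked by timing it on the real data.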

Tags: python, pandas, data-analysis
Source: https://codeday.me/bug/20190704/1373369.html