python - Building complex subsets in a Pandas DataFrame
Author: Internet
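I'm working with GroupBy but still need some help. Suppose I have a DataFrame with a column Group that gives each object's group number, some parameter R, and the spherical coordinates RA and Dec. Here is a mock DataFrame: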
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
    'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
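I'd like to build a selection that contains, for each group, the "brightest" object, i.e. the one with the smallest R (or the largest absolute value, since R is negative), plus the 3 objects closest to it within that group. So I keep 4 objects per group; we can assume that no group has fewer than 4 objects.
Let's assume the following functions are already defined: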
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
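The separation between two objects is then given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), where RA1 is the RA of the first object, and so on.
I can't figure out how to achieve this with GroupBy...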
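Solution:
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
In practice, GroupBy is just a facility that creates something like an iterator of (group ID, DataFrame) pairs, and when you call .groupby().apply, a function is applied to each of those DataFrames. (A lot of detail is glossed over here; if you're interested, see here for some details on the internals.)
So, after defining your three NumPy-based functions, also define: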
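To make the "(group ID, DataFrame) pairs" idea concrete, here is a minimal sketch that loops over the GroupBy object by hand. It only uses the R and Group columns from the question's mock data; the names keys and brightest are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
})

keys = []        # group IDs in the order GroupBy yields them
brightest = {}   # group ID -> index label of the row with the smallest R

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for key, sub in df.groupby('Group'):
    keys.append(int(key))
    brightest[int(key)] = int(sub['R'].idxmin())
    print(key, sub.shape, brightest[int(key)])
```

Each sub keeps the index labels of the original frame, which is why the helper applied to a group has to look rows up by label rather than by position.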
def sep_df(df, keep=3):
    # idxmin returns the index label that .loc expects (argmin returns a position)
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep + 1).index
    return df.loc[idx]
Then simply apply it, and you get a MultiIndex DataFrame whose first index level is the group.
print(df.groupby('Group').apply(sep_df))
               Dec  Group     R         RA
Group
1     3   -0.61483      1 -23.7  154.47416
      2   -0.48166      1 -22.1  154.41919
      0   -0.49561      1 -21.0  154.36279
      4   -0.58424      1 -23.8  154.42484
2     8    8.72822      2 -22.5    8.72822
      10   8.79980      2 -19.9    8.79980
      6    8.35545      2 -21.8    8.35545
      9    8.75962      2 -24.7    8.75962
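If the two-level index is new to you, the following sketch shows one way to work with that kind of result. To keep it short it uses a hypothetical helper, two_brightest, instead of the separation logic, and it selects the R column explicitly before apply. That selection is my own assumption, made so the example does not depend on how a particular pandas version treats the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
})

def two_brightest(sub):
    # Hypothetical helper: keep the two rows with the smallest R in this group
    return sub.nsmallest(2, 'R')

out = df.groupby('Group')[['R']].apply(two_brightest)

# First index level is the group key, second is the original row label
group1 = out.loc[1]             # rows of group 1, indexed by original labels
flat = out.droplevel('Group')   # drop the group level, keep original labels

print(out)
```

Here group1 is an ordinary DataFrame again, and flat can be used to look rows up in the original df with df.loc[flat.index].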
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    # idxmin gives the index label of the minimum, which is what .loc needs
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))

    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep + 1).index

    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
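One thing to double-check: np.sin and np.cos work in radians, while the sample RA/Dec values look like degrees (that is my assumption, based on values such as 154.36 and on d2r being defined). Also, rounding can push the arccos argument slightly above 1 for the reference row itself, which would give NaN instead of 0. Below is a rough sketch of a variant that converts first and clips the cosine. The names calc_sep_deg and sep_df_deg are hypothetical, not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'RA': (154.362789, 154.409301, 154.419191, 154.474165, 154.424842, 162.568516,
           8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Dec': (-0.495605, -0.453085, -0.481657, -0.614827, -0.584243, 8.214719,
            8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
})

def calc_sep_deg(ra1, dec1, ra2, dec2):
    # Assumes all four inputs are in degrees; returns the separation in degrees
    phi1, theta1, phi2, theta2 = (np.radians(x) for x in (ra1, dec1, ra2, dec2))
    cos_sep = (np.sin(theta1) * np.sin(theta2)
               + np.cos(theta1) * np.cos(theta2) * np.cos(phi2 - phi1))
    # Clip so that rounding error cannot take the argument outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_sep, -1.0, 1.0)))

def sep_df_deg(sub, keep=3):
    # Reference row: smallest R in this group, looked up by index label
    ref = sub.loc[sub['R'].idxmin()]
    sep = calc_sep_deg(ref['RA'], ref['Dec'], sub['RA'], sub['Dec'])
    # keep + 1 rows, because the reference row itself has the smallest separation
    return sub.loc[sep.nsmallest(keep + 1).index]

# Called directly on each group's rows, without going through apply
g1 = sep_df_deg(df[df['Group'] == 1])
g2 = sep_df_deg(df[df['Group'] == 2])
print(g1)
print(g2)
```

With the clipping in place the reference row comes out first in each group, followed by its neighbours in order of increasing separation.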
For speed, consider passing sort=False to GroupBy, if the result still works for you.
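As a small illustration of what sort=False changes, here is a sketch on a toy frame (not the question's data) whose group labels deliberately appear out of order. Whether it gives a noticeable speedup depends on your data, so treat that part as something to measure rather than assume.

```python
import pandas as pd

# Toy data: group 2 appears before group 1
toy = pd.DataFrame({
    'Group': (2, 2, 1, 1),
    'R': (-20.4, -21.8, -21.0, -21.5),
})

# Default (sort=True): group keys come out sorted
default_order = toy.groupby('Group').size().index.tolist()

# sort=False: group keys come out in order of first appearance
unsorted_order = toy.groupby('Group', sort=False).size().index.tolist()

print(default_order, unsorted_order)
```

Only the order of the groups in the result changes; the rows inside each group are the same either way.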
Tags: pandas-groupby, pandas, dataframe, split-apply-combine, python Source: https://codeday.me/bug/20191025/1930495.html