[Repost] Cross-Column-Based Data Manipulation in Python
Suppose your manager hands you a random dataset and asks for a “basic” cleaning: “Keep only the records that have values in columns A, B, and C, or the records that have no value in any of those three columns.” How would you approach that?
Introduction
Feature engineering can be really basic, such as scaling and encoding, but it is sometimes mysterious. Recently I came across this type of cross-column requirement, and it got me thinking: how can cleaning logic that sounds so simple in words be inherently complicated in data cleaning and feature engineering?
Straightforward in words, not so intuitive in code
Most of the data manipulation skills we learn from existing courses focus on row-wise applications. These functions are well supported in data analysis packages such as Pandas, so analysts and data scientists don’t have to build them from scratch. However, things are not that easy for column-wise applications, especially ones involving multiple columns. To carry out the simple request from the introduction, we have to bring something more advanced, such as logical relationships, into the process.
In this post, I will demonstrate a basic use case and a more complicated one for cross-column manipulation. I will also include the complete manipulation functions to show how to apply them in a regular ETL process. The techniques covered in this article include:
Computing cross-column aggregated metrics for numerical values
Validating records with cross-column logical operations
Introducing the Dataset
Last spring, a friend and I competed in Datathon 2020, hosted by Brown University. The task was to predict where prospective homebuyers are, based on credit-bureau data and demographic data at the zip code level. The features include metrics such as bank card limit, number of open bank cards, and balance of the most recent mortgage.
The provided datasets cover April to September 2019, and the CSV files are split accordingly. In total there are over 36,000,000 rows and over 300 columns, about 10GB altogether. I will use this data to demonstrate the ideas.
A snippet of the data set
Problem 1: Reducing the cross-dataset metrics with an aggregation method
Simply put, the data is a mess. After combining the data sets from different months, the data is large enough to freeze a regular laptop whenever even a simple computation is applied. I therefore wanted a way to consolidate the same metric across months (e.g., there is a “number of open bank cards” metric for each month), reducing the total number of features while keeping that information.
Various machine learning techniques (e.g., PCA, clustering) could achieve a similar effect. However, I will transform the features in the most direct way: averaging.
Solution 1: Utilizing Pandas built-in calculation cautiously
In other words, we need to average each metric over time. We can easily do that with pd.DataFrame.mean(). Pandas also provides other aggregation functions such as sum() and std().
The expected result
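As a minimal sketch of the averaging step, the snippet below uses a made-up toy frame; the column name bankcard_limit and the zip codes are hypothetical and only illustrate the "metric_month" naming pattern.

```python
import pandas as pd

# Hypothetical toy data: one metric ("bankcard_limit") stored once per month.
df_demo = pd.DataFrame({
    "zip5": ["02912", "02906"],
    "bankcard_limit_4": [1000.0, 2000.0],
    "bankcard_limit_5": [1200.0, None],
    "bankcard_limit_6": [1400.0, 2400.0],
})

month_cols = [f"bankcard_limit_{m}" for m in (4, 5, 6)]

# axis="columns" averages across the listed columns within each row;
# missing values are skipped by default (skipna=True).
df_demo["bankcard_limit_avg"] = df_demo[month_cols].mean(axis="columns")
print(df_demo[["zip5", "bankcard_limit_avg"]])
```

Because missing months are skipped, a row with one empty month is averaged over the months that are present rather than becoming NaN.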
This is a relatively easy task. The trick, however, is setting up a pipeline that organizes the new columns and builds a new data frame. Here are a few tips worth mentioning:
pd.DataFrame() doesn’t let you specify a data type per column the way pd.read_csv() does (it is weird, I know); its dtype argument applies one type to the whole frame. To work around this limitation, call astype() after building the new data frame.
Also, if you assign different types of numerical data at the same time, watch out for possible loss of information when using lists of values to build a pd.DataFrame(). Automatic conversion between integer and floating-point types can mangle values without any warning; for example, integers above 2^53 cannot be represented exactly as float64. (Not to mention that numeric data converted to strings can also lose a few digits in the process.)
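The two tips above can be illustrated with a small sketch. The values and the column names id and ratio are invented for the demonstration; the point is only that mixing integers and floats in one NumPy array silently promotes everything to float64, while assigning types per column with astype() afterwards avoids that.

```python
import numpy as np
import pandas as pd

big_int = 2**53 + 1  # too large to be represented exactly as float64

# Mixing an integer row with a float row promotes the whole array to float64.
mixed = np.array([[big_int, 5], [0.5, 1.5]])
print(mixed.dtype)                    # float64
print(int(mixed[0, 0]) == big_int)    # False: the value was silently rounded

# Building the frame column by column keeps integers as integers,
# and astype() with a dict sets the type of each column separately.
df_typed = pd.DataFrame({"id": [big_int, 5], "ratio": [0.5, 1.5]})
df_typed = df_typed.astype({"id": "int64", "ratio": "float32"})
print(df_typed.dtypes)
```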
The whole function is shown below. Note that “df” is the data to be processed, and “df_col” is just for building a list of column names.
import numpy as np
import pandas as pd

#### aggregate column function ####
def agg_col_calculation(df):
    ### df is the data frame containing all data from different months
    exclude_col_list = ['zip5', 'age', 'household_count', 'person_count',
                        'homebuyers', 'first_homebuyers']
    month_list = [4, 5, 6, 7, 8, 9]

    # read two rows of the original file just to get the column-name template
    df_col = pd.read_csv('/content/gdrive/My Drive/Brown_Datathon_Jeff/zip9_coded_201904_pv.csv',
                         nrows=2)
    col_list = df_col.columns[2:]
    avg_series_list = []

    ## compute the aggregated metrics: one row-wise mean per metric
    for c in col_list:
        working_col_list = [c + f'_{m}' for m in month_list]
        avg_series = df[working_col_list].mean(axis='columns')
        avg_series_list.append(avg_series)
    print('complete appending!')

    # set the new names of columns
    new_col_list = [c + '_avg' for c in col_list]

    # preprocess the data before building data frame
    exclude_col_array = np.array(df[exclude_col_list]).T

    ## make new df (dtype applies to every column at once)
    df_avg = pd.DataFrame(np.array([*exclude_col_array, *np.array(avg_series_list)]).T,
                          index=df.index, columns=[*exclude_col_list, *new_col_list],
                          dtype='float')

    ## handle data type issue: float -> int -> str for the zip code
    df_avg = df_avg.astype({'zip5': int})
    df_avg = df_avg.astype({'zip5': str})
    print(df_avg.shape)
    return df_avg

df = agg_col_calculation(df)
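For comparison, here is a sketch of a different way to assemble the averaged frame, not the approach used in the function above. It concatenates columns with pd.concat so each column keeps its own dtype; this matters if, for instance, zip codes are stored as strings with leading zeros, which a round trip through float would drop. The helper name average_monthly_columns and the toy column open_cards are hypothetical.

```python
import pandas as pd

def average_monthly_columns(df, metric_names, months, keep_cols):
    """Return keep_cols plus one '<metric>_avg' column per metric (illustrative helper)."""
    averaged = {
        f"{name}_avg": df[[f"{name}_{m}" for m in months]].mean(axis="columns")
        for name in metric_names
    }
    # Concatenating along columns preserves the dtype of every input column.
    return pd.concat([df[keep_cols], pd.DataFrame(averaged, index=df.index)],
                     axis="columns")

toy = pd.DataFrame({
    "zip5": ["02912", "10001"],
    "open_cards_4": [2, 4],
    "open_cards_5": [4, 6],
})
result = average_monthly_columns(toy, ["open_cards"], [4, 5], ["zip5"])
print(result)
```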
Problem 2: Cleaning data with cross-column logical relation
Up to this point, it was not too much trouble. But here comes the tough one. Although we narrowed down the data, something was off.
Common sense says that if a record (an area) has a first mortgage, there should be a value in each of the columns describing it (average loan amount, average balance, and proportion). If a record has no mortgage at all, all three columns should be empty, since there is no mortgage in that area. However, I found discrepancies just by eyeballing a few rows. Some rows have the average loan amount but no average balance, and some rows only have the proportion for the second loan without its two counterparts.
Description of three coexisting columns
This type of error can seriously damage a model if it keeps the model from identifying the correct patterns in high-dimensional data. To make the model or the analysis more robust, the questionable records should be excluded from the data set.
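Rather than eyeballing rows, one way to quantify such discrepancies is to count how many columns of a group are populated in each row. This is only a sketch on invented data; it treats NaN as "no value", whereas the cleaning function later in this post also treats 0 as empty.

```python
import numpy as np
import pandas as pd

# Column names follow the mortgage group used later in this post.
group_cols = ["mortgage1_limit", "mortgage1_balance", "mortgage1_open"]

toy = pd.DataFrame({
    "mortgage1_limit":   [250000.0, np.nan, 180000.0, np.nan],
    "mortgage1_balance": [200000.0, np.nan, np.nan,   np.nan],
    "mortgage1_open":    [0.4,      np.nan, 0.2,      0.1],
})

# Populated columns per row: 3 = complete, 0 = empty, 1 or 2 = partially filled.
filled_count = toy[group_cols].notna().sum(axis="columns")
partially_filled = filled_count.between(1, len(group_cols) - 1)

print(filled_count.tolist())       # [3, 0, 2, 1]
print(partially_filled.tolist())   # [False, False, True, True]
```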
Thinking with the Venn diagram
A Venn diagram helps in understanding the problem. For what I am trying to do here, and for most similar ad-hoc situations, I want to keep the completely non-intersecting part and the totally intersecting part, which are 000 (All - (A∪B∪C)) and 111 (A∩B∩C) respectively.
Source: https://en.wikipedia.org/wiki/Venn_diagram
With that idea in mind, and because Pandas supports the logical operators (&, |, ~), I could express the logic like this:
## 111
(df[col_1].astype(bool) & df[col_2].astype(bool) & df[col_3].astype(bool))
## 000
(~df[col_1].astype(bool) & ~df[col_2].astype(bool) & ~df[col_3].astype(bool))
p.s. In case you are wondering: no, this is not a case where you need to turn AND into OR for the negative condition. De Morgan’s law applies when a whole expression is negated; here each column is negated individually, so the 000 condition is “not A and not B and not C”, which is the same as “not (A or B or C)”. See this thread for another example.
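The point about negation can be checked by brute force over all eight combinations of three booleans. This is a plain-Python sketch, independent of Pandas, with variable names chosen only for illustration.

```python
from itertools import product

rows = []
for a, b, c in product([False, True], repeat=3):
    negate_each = (not a) and (not b) and (not c)   # what the 000 filter uses
    negate_union = not (a or b or c)                # De Morgan equivalent
    negate_all_of = not (a and b and c)             # a different condition
    rows.append((a, b, c, negate_each, negate_union, negate_all_of))

# Negating each operand matches "not (A or B or C)" in every case ...
same_everywhere = all(r[3] == r[4] for r in rows)
# ... but it is not the same as "not (A and B and C)".
differs_somewhere = any(r[3] != r[5] for r in rows)

print(same_everywhere, differs_somewhere)   # True True
```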
Then I combined those logical expressions into one statement to make a series:
## 111 or 000
s_bool = ((df[col_1].astype(bool) & df[col_2].astype(bool) & df[col_3].astype(bool))
          | (~df[col_1].astype(bool) & ~df[col_2].astype(bool) & ~df[col_3].astype(bool)))
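The same 111-or-000 rule can be written for any number of columns. The sketch below is one possible generalization rather than the code used in this post; the helper name all_or_none and the toy columns a, b, c are hypothetical, and missing values are treated as 0.

```python
import numpy as np
import pandas as pd

def all_or_none(df, cols):
    """True where every column in cols is non-zero, or every one is zero/missing."""
    flags = df[cols].fillna(0).astype(bool)
    return flags.all(axis="columns") | ~flags.any(axis="columns")

toy = pd.DataFrame({
    "a": [1.0, 0.0, 3.0, np.nan],
    "b": [2.0, 0.0, 0.0, np.nan],
    "c": [5.0, np.nan, 1.0, 4.0],
})
s_bool_demo = all_or_none(toy, ["a", "b", "c"])
print(s_bool_demo.tolist())   # [True, True, False, False]
```

Rows 0 and 1 pass (all set, none set); rows 2 and 3 are only partially filled and fail.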
After concatenating all the boolean series into one data frame, I used all() with apply() and a lambda along the columns axis. More specifically, all(x) performs another logical operation to make sure that every section (group of columns; e.g., the group of columns for the 3rd loan in June, or the group for the 5th mortgage in August) in a row obeys the rule I set in the first place (the 111 or 000 relationship in the Venn diagram). This comes down to a final series of booleans:
s_agg_bool = df_bool.apply(lambda x: all(x), axis = 'columns')
We can then use this series of booleans to subset the right records out of the raw data.
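As a small sketch of this last step, the example below builds a made-up boolean frame and shows that the built-in DataFrame.all() gives the same answer as apply() with a lambda, which may be worth considering on large data because it avoids calling a Python function per row. The frame contents and column names are invented.

```python
import pandas as pd

# Hypothetical per-group check results: one column per group of columns.
df_bool_demo = pd.DataFrame({
    "mortgage1_ok":   [True, True, False],
    "homeequity1_ok": [True, False, True],
})
raw = pd.DataFrame({"zip5": ["02912", "10001", "60601"]})

row_wise = df_bool_demo.apply(lambda x: all(x), axis="columns")
vectorized = df_bool_demo.all(axis="columns")

# Keep only the rows where every group passed its check.
kept = raw.loc[vectorized]
print(kept)
```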
Here is the final ETL function that I put the dataset through for cleaning. Note that it is designed to apply to multiple groups of columns, such as metrics from different months or from different categories:
import pandas as pd

def cross_col_correct(df):
    s_bool_list = []

    # convert all null values into 0 (fillna replaces applymap, which newer pandas deprecates)
    df.iloc[:, 2:] = df.iloc[:, 2:].fillna(0)

    # monthly individual columns check: mortgage1..5 and homeequity1..5
    for prefix in ['mortgage', 'homeequity']:
        for n in range(1, 6):
            col_list = [f'{prefix}{n}_limit', f'{prefix}{n}_balance', f'{prefix}{n}_open']
            # logical operators: keep 111 (all set) or 000 (none set)
            s_bool = ((df[col_list[0]].astype(bool) & df[col_list[1]].astype(bool) & df[col_list[2]].astype(bool))
                      | (~df[col_list[0]].astype(bool) & ~df[col_list[1]].astype(bool) & ~df[col_list[2]].astype(bool)))
            s_bool_list.append(s_bool)

    # aggregated columns check
    total_mtg_check_list = ['total_mortgage_balance', 'total_mortgage_limit', 'mortgage1_balance']
    total_heq_check_list = ['total_homeequity_balance', 'total_homeequity_limit', 'homeequity1_balance']
    for col_list in [total_mtg_check_list, total_heq_check_list]:
        s_bool2 = ((df[col_list[0]].astype(bool) & df[col_list[1]].astype(bool) & df[col_list[2]].astype(bool))
                   | (~df[col_list[0]].astype(bool) & ~df[col_list[1]].astype(bool) & ~df[col_list[2]].astype(bool)))
        s_bool_list.append(s_bool2)

    # make boolean df containing t/f for each category for each number (e.g. mortgage1, homeequity2...)
    df_bool = pd.concat(s_bool_list, axis=1, sort=False)

    # make final boolean series (a row that is not True in every column of df_bool is dropped)
    s_agg_bool = df_bool.apply(lambda x: all(x), axis='columns')
    df = df.loc[s_agg_bool]
    print('Complete cross columns correct!')
    return df

df = cross_col_correct(df)
p.s. Before doing the logical work, we have to convert all NaN values to 0 and then cast all numerical values to the boolean data type, so that we can rely on how Pandas casts numbers to booleans (every zero becomes False and every non-zero value becomes True). The NaN conversion has to come first because NaN itself casts to True.
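A tiny sketch with invented values shows why the order matters: casting a float series to bool without filling NaN first marks the missing entry as True.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0.0, 12.5])

naive = s.astype(bool)               # NaN is truthy, so it becomes True
cleaned = s.fillna(0).astype(bool)   # fill first, then cast

print(naive.tolist())     # [True, False, True]
print(cleaned.tolist())   # [False, False, True]
```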
Conclusion
I quickly ran through a few tips and examples of cross-column data manipulation. However, I know this is hardly a thorough guide for similar use cases.
Data manipulation is surprisingly difficult because the problems keep changing; in real-world scenarios, a single function that imputes missing values or identifies biased data is far from enough. After looking for related resources, I realized there is no systematic guide or discussion that thoroughly covers the practical side, since this kind of work is non-traditional and hard to classify. This post is therefore an attempt to share possible solutions, and I hope it sparks new ideas and helps uncover more unconventional methods.
Photo by Cristian Escobar on Unsplash
Congrats, and thanks for reading! Feel free to reply with any thoughts and opinions, or just drop a message on my LinkedIn page!
p.s. If you are interested in some COVID-19 state-wise info, be sure to visit this neat personal dashboard to track the curves.
Jeff
Translated from: https://towardsdatascience.com/cross-column-based-data-manipulation-in-python-dfa5d8ffdd64
Tags: Python, list, cross-column, astype, python, bool, df, data, col Source: https://blog.csdn.net/u013946150/article/details/112697399