[pandas official documentation - beginner tutorial]: 2. Ten getting-started questions
1. What kind of data does pandas handle?
Key points:
- Import the package: import pandas as pd
- A table of data is stored as a DataFrame
- Each column of a DataFrame is a Series
- You do things by applying a method to a DataFrame or Series
pandas is good at working with tabular data. In pandas, a table is called a DataFrame.
1.1. Importing the pandas module
- pd is the industry-standard abbreviation for pandas.
# First example
import pandas as pd
1.2. Working with tables via DataFrame
- When constructing a DataFrame from a dictionary, the keys become the column names and the values become the data in each column.
- Different columns can hold values of different types.
df = pd.DataFrame(
{
"Name": [
"Braund, Mr. Owen Harris",
"Allen, Mr. William Henry",
"Bonnell, Miss. Elizabeth",
],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"],
}
)
df
1.3. Each column of a DataFrame is a Series
df['Age']
0 22
1 35
2 58
Name: Age, dtype: int64
- Selecting a single column of a DataFrame returns a Series.
- Use DataFrame['column name'] to select a column.
A Series can also be created directly:
ages = pd.Series([22, 35, 58], name="Age")
ages
0 22
1 35
2 58
Name: Age, dtype: int64
A Series has no column label, but it does have row labels.
1.4. DataFrame and Series provide callable methods
A few examples:
df["Age"].max()
58
ages.max()
58
df.describe()
 | Age |
---|---|
count | 3.000000 |
mean | 38.333333 |
std | 18.230012 |
min | 22.000000 |
25% | 28.500000 |
50% | 35.000000 |
75% | 46.500000 |
max | 58.000000 |
2. How do I read and write tabular data?
- pandas supports many data formats: csv, excel, sql, json, parquet, ...
- Reading: read_*
- Writing: to_*
import pandas as pd
titanic_data_path = "../data/titanic.csv"
titanic_df = pd.read_csv(titanic_data_path)
# df.head() / df.tail() usage
# titanic_df.head(8)
titanic_df.tail(8)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.500 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.050 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.125 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.450 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.750 | NaN | Q |
A DataFrame has many methods, as well as many attributes that can be accessed directly.
titanic_df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
titanic_df.to_excel('ddd.xlsx', sheet_name = 'sheet1', index=False)
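Every to_* writer pairs with a read_* reader. A minimal sketch of reading the file just written back in (assumes an Excel engine such as openpyxl is installed; 'ddd.xlsx' is the file written above):
# Read the sheet back into a DataFrame; index=False above means
# no extra index column was written, so none is read back.
check_df = pd.read_excel('ddd.xlsx', sheet_name='sheet1')
check_df.head()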
titanic_df.info  # note: without parentheses this returns the bound method itself; call titanic_df.info() for the summary
<bound method DataFrame.info of PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[891 rows x 12 columns]>
titanic_df.__class__  # type(titanic_df) is the more idiomatic spelling
pandas.core.frame.DataFrame
3. How do I select a subset of a table?
pandas can:
- select or filter specific rows/columns
- filter data by condition
- slice, select, extract ...
Key points:
- Use square brackets [] to select a subset. Inside the brackets you can put:
  - a single row/column label
  - a list of row/column labels
  - a slice
  - a condition
- With loc, use row/column label names; with iloc, use row/column positions (integer indices, end-exclusive).
- loc/iloc can also be used to assign new values to the selected part of a DataFrame (see the loc/iloc sketch at the end of this section).
3.1 Using square brackets []
import pandas as pd
titanic = pd.read_csv('data/titanic.csv')
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# a single column
ages = titanic['Age']
ages.head(),ages.shape, ages.dtype,type(ages)
(0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64,
(891,),
dtype('float64'),
pandas.core.series.Series)
Each column of a DataFrame is a Series.
# multiple columns
age_sex = titanic[['Age','Sex']] # inside the outer brackets is a list of column names
age_sex.head(), age_sex.shape, age_sex.columns, type(age_sex)
( Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male,
(891, 2),
Index(['Age', 'Sex'], dtype='object'),
pandas.core.frame.DataFrame)
Multiple columns give you a DataFrame.
# filtering rows
above_35 = titanic[titanic['Age'] > 35 ]
above_35.head(), above_35.shape
( PassengerId Survived Pclass \
1 2 1 1
6 7 0 1
11 12 1 1
13 14 0 3
15 16 1 2
Name Sex Age SibSp \
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
6 McCarthy, Mr. Timothy J male 54.0 0
11 Bonnell, Miss. Elizabeth female 58.0 0
13 Andersson, Mr. Anders Johan male 39.0 1
15 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0
Parch Ticket Fare Cabin Embarked
1 0 PC 17599 71.2833 C85 C
6 0 17463 51.8625 E46 S
11 0 113783 26.5500 C103 S
13 5 347082 31.2750 NaN S
15 0 248706 16.0000 NaN S ,
(217, 12))
type(titanic['Age'] > 35) # the expression inside the brackets is itself a Series: a boolean Series aligned on the index
pandas.core.series.Series
condition = titanic['Age'] > 35
condition
0 False
1 True
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
# membership test
titanic[titanic['Pclass'].isin([2,3])].head() # keep rows where Pclass is 2 or 3
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
titanic[(titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)].head()
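The key points above mentioned loc/iloc; here is a minimal sketch of label-based and position-based selection, plus assignment (the column names follow the titanic table above; the copy is only to keep the original data intact):
# loc is label-based: rows where Age > 35, columns Name and Sex
adult_names = titanic.loc[titanic['Age'] > 35, ['Name', 'Sex']]
# iloc is position-based: rows 9..24 and the first three columns (end-exclusive)
block = titanic.iloc[9:25, 0:3]
# loc/iloc selections can also be assigned to
t = titanic.copy()
t.iloc[0:3, 3] = "anonymous" # overwrite the first three values of the Name column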
4. How do I create plots?
pandas uses Matplotlib to draw all kinds of charts:
- scatter plots
- bar charts
- box plots
- and more
Key points:
- The plot.* family of methods works for both Series and DataFrame
- By default, each column is drawn as a separate element of the chart
- Every pandas plot is a Matplotlib object
Data
import pandas as pd
import matplotlib.pyplot as plt
data_source = "./data/air_quality_no2.csv"
air_quality = pd.read_csv(data_source, index_col=0, parse_dates=True)
air_quality.head()
- the index_col parameter makes the first column the index;
- the parse_dates parameter controls whether date strings are parsed into timestamps.
4.1. Plotting the whole table with defaults
# plot every column as a separate series
air_quality.plot()
<AxesSubplot:xlabel='datetime'>
4.2. Plotting a specific column
# a single series
air_quality['station_paris'].plot()
<AxesSubplot:xlabel='datetime'>
4.3. Scatter plot
# scatter plot of one station against another
air_quality.plot.scatter(x='station_london', y= 'station_paris',alpha=0.5)
<AxesSubplot:xlabel='station_london', ylabel='station_paris'>
4.4. All plot kinds
for i in dir(air_quality.plot):
if not i.startswith("_"):
print(i)
# the list-comprehension version below reads more nicely
# ----------------------------------------------------------
[
method_name
for method_name in dir(air_quality.plot)
if not method_name.startswith("_")
]
area
bar
barh
box
density
hexbin
hist
kde
line
pie
scatter
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
# box plot
air_quality.plot.box()
<AxesSubplot:>
4.5. Subplots
# separate subplots, one per column
axs = air_quality.plot.area(figsize=(12,4), subplots=True)
type(axs)
numpy.ndarray
# 1. Create an empty matplotlib Figure and Axes
fig, axs = plt.subplots(figsize= (12,4))
# 2. Use pandas to put the area plot on the prepared Figure/Axes
air_quality.plot.area(ax=axs)
# 3. Do any matplotlib customization you like
axs.set_ylabel("NO$_2$ concentration")
# 4. Save the Figure/Axes using the existing matplotlib method.
fig.savefig('4.1-no2_concentrations.png')
5. How do I derive new columns from existing ones?
- No explicit loops are needed
- Calculations operate on whole columns at once
For example, adding a new column on top of the original table.
Key points:
- Create a new column by assigning to DataFrame['new column name']
- Adding a column requires no loop over the rows
- Rows and columns can be renamed via a dictionary or a function
Data
import pandas as pd
data_source = "data/air_quality_no2.csv"
air_quality = pd.read_csv(data_source, index_col=0, parse_dates=True)
air_quality.head()
datetime | station_antwerp | station_paris | station_london
---|---|---|---
2019-05-07 02:00:00 | NaN | NaN | 23.0
2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0
2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0
2019-05-07 05:00:00 | NaN | 50.4 | 16.0
2019-05-07 06:00:00 | NaN | 61.9 | NaN
5.1 Adding a column
- No need to loop over every row
air_quality['london_mg_per_cubic'] = air_quality['station_london'] * 1.882
air_quality.head()
datetime | station_antwerp | station_paris | station_london | london_mg_per_cubic
---|---|---|---|---
2019-05-07 02:00:00 | NaN | NaN | 23.0 | 43.286
2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 | 35.758
2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 | 35.758
2019-05-07 05:00:00 | NaN | 50.4 | 16.0 | 30.112
2019-05-07 06:00:00 | NaN | 61.9 | NaN | NaN
air_quality["ratio_paris_antwerp"] = (
air_quality["station_paris"] / air_quality["station_antwerp"]) #非法计算直接空值。
air_quality.head()
datetime | station_antwerp | station_paris | station_london | london_mg_per_cubic | ratio_paris_antwerp
---|---|---|---|---|---
2019-05-07 02:00:00 | NaN | NaN | 23.0 | 43.286 | NaN
2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 | 35.758 | 0.495050
2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 | 35.758 | 0.615556
2019-05-07 05:00:00 | NaN | 50.4 | 16.0 | 30.112 | NaN
2019-05-07 06:00:00 | NaN | 61.9 | NaN | NaN | NaN
Mathematical and logical operations all work element-wise.
For more complex cases, the apply() function can be used.
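For instance, a row-wise computation via apply() (equivalent here to .mean(axis=1); shown only as a sketch of the pattern, and station_cols is just a local name):
station_cols = ["station_antwerp", "station_paris", "station_london"]
# apply() evaluates a Python function along an axis; axis=1 means once per row
air_quality[station_cols].apply(lambda row: row.mean(), axis=1)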
5.2 Renaming
- with a dictionary
- with a function
# dictionary mapping old names to new names
air_quality_renamed = air_quality.rename(
columns={
"station_antwerp": "BETR801",
"station_paris": "FR04014",
"station_london": "London Westminster",
}
)
air_quality_renamed.head()
datetime | BETR801 | FR04014 | London Westminster | london_mg_per_cubic | ratio_paris_antwerp
---|---|---|---|---|---
2019-05-07 02:00:00 | NaN | NaN | 23.0 | 43.286 | NaN
2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 | 35.758 | 0.495050
2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 | 35.758 | 0.615556
2019-05-07 05:00:00 | NaN | 50.4 | 16.0 | 30.112 | NaN
2019-05-07 06:00:00 | NaN | 61.9 | NaN | NaN | NaN
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
air_quality_renamed.head()
datetime | betr801 | fr04014 | london westminster | london_mg_per_cubic | ratio_paris_antwerp
---|---|---|---|---|---
2019-05-07 02:00:00 | NaN | NaN | 23.0 | 43.286 | NaN
2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 | 35.758 | 0.495050
2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 | 35.758 | 0.615556
2019-05-07 05:00:00 | NaN | 50.4 | 16.0 | 30.112 | NaN
2019-05-07 06:00:00 | NaN | 61.9 | NaN | NaN | NaN
6. How do I calculate summary statistics?
- Basic statistics are easy to compute and can be applied to:
  - the whole dataset
  - rolling windows
  - grouped categories
- There is also the "split-apply-combine" pattern (divide and conquer)
Key points:
- Aggregations can be computed over entire rows or columns
- groupby implements the split-apply-combine pattern
- value_counts is a shortcut for counting the values of a categorical variable
6.1 Aggregate statistics
import pandas as pd
data_source = "data/titanic.csv"
titanic = pd.read_csv(data_source)
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic['Age'].mean()
29.69911764705882
titanic[['Age' ,'Fare']].median()
Age 28.0000
Fare 14.4542
dtype: float64
titanic[['Age' ,'Fare']].describe()
 | Age | Fare |
---|---|---|
count | 714.000000 | 891.000000 |
mean | 29.699118 | 32.204208 |
std | 14.526497 | 49.693429 |
min | 0.420000 | 0.000000 |
25% | 20.125000 | 7.910400 |
50% | 28.000000 | 14.454200 |
75% | 38.000000 | 31.000000 |
max | 80.000000 | 512.329200 |
# if the defaults don't fit, use agg()
titanic.agg({
"Age": ["min", "max","median","skew"],
"Fare": ["min", "max", "median", "mean"],
})
 | Age | Fare |
---|---|---|
min | 0.420000 | 0.000000 |
max | 80.000000 | 512.329200 |
median | 28.000000 | 14.454200 |
skew | 0.389108 | NaN |
mean | NaN | 32.204208 |
6.2. Grouped statistics by category
titanic[['Age' ,'Sex']].groupby('Sex').mean()
Sex | Age
---|---
female | 27.915709
male | 30.726645
Computing a statistic for each category in a column is a very common pattern, and the groupby method is made for exactly this.
This is the so-called split-apply-combine pattern:
- Split: partition the data into groups
- Apply: compute a function for each group independently
- Combine: merge the results back into one data structure
In pandas, the apply and combine steps are executed together.
If you do not restrict the columns as in the example above, the statistic is computed for every numeric column.
titanic.groupby("Sex").mean() # note: recent pandas versions require numeric_only=True here
Sex | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare
---|---|---|---|---|---|---|---
female | 431.028662 | 0.742038 | 2.159236 | 27.915709 | 0.694268 | 0.649682 | 44.479818
male | 454.147314 | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893
Grouped data also supports selection with square brackets.
titanic.groupby("Sex")["Age"].mean()
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
For columns where numbers merely encode categories, such as Pclass here, averaging the class values 1, 2 and 3 is meaningless. For such cases, pandas provides the Categorical data type.
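A minimal sketch of such a conversion (the cast is illustrative; pclass_cat is just a local name):
# treat Pclass as a category rather than a number
pclass_cat = titanic["Pclass"].astype("category")
pclass_cat.dtype # -> CategoricalDtype(categories=[1, 2, 3], ordered=False)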
titanic.groupby(["Sex","Pclass"])["Fare"].mean()
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
Name: Fare, dtype: float64
Statistics can be computed for combinations of several grouping columns in one go.
6.3. Counting by category
When different values in a column stand for different categories, use value_counts() to count them.
titanic["Pclass"].value_counts() # passenger count for each of the 3 classes
3 491
1 216
2 184
Name: Pclass, dtype: int64
titanic.groupby('Pclass')["Pclass"].count() # group by Pclass, then count() within each group
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
titanic.groupby('Pclass')["Pclass"].size() # how does this differ from the line above?
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
How the methods differ:
- size() counts rows, including missing values.
- count() ignores missing values.
- value_counts() takes a dropna parameter to state explicitly whether NaN should be counted.
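For example (Embarked is a column with a couple of missing values):
# dropna=False makes NaN show up as its own row in the counts
titanic["Embarked"].value_counts(dropna=False)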
7. How do I reshape a table?
- pivot() converts long to wide
- melt() converts wide to long
Key points:
- sort_values sorts by one or more columns
- pivot reshapes the data; when the reshaped cells need aggregation, use pivot_table
- pivot (long to wide format) is the inverse operation of melt (wide to long format)
import pandas as pd
data1 = "data/titanic.csv"
data2 = "data/air_quality_long.csv"
titanic = pd.read_csv(data1)
air_quality = pd.read_csv(data2, index_col= "date.utc", parse_dates=True)
titanic.head(), air_quality.head()
( PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S ,
city country location parameter value unit
date.utc
2019-06-18 06:00:00+00:00 Antwerpen BE BETR801 pm25 18.0 µg/m³
2019-06-17 08:00:00+00:00 Antwerpen BE BETR801 pm25 6.5 µg/m³
2019-06-17 07:00:00+00:00 Antwerpen BE BETR801 pm25 18.5 µg/m³
2019-06-17 06:00:00+00:00 Antwerpen BE BETR801 pm25 16.0 µg/m³
2019-06-17 05:00:00+00:00 Antwerpen BE BETR801 pm25 7.5 µg/m³)
7.1. Sorting rows
- sort_values()
- sort_index()
titanic.sort_values(by='Age').head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | S |
644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
titanic.sort_values(by=['Pclass','Age'],ascending=False).head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
280 | 281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.7500 | NaN | Q |
483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
326 | 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.0 | 0 | 0 | 345364 | 6.2375 | NaN | S |
7.2. Long to wide
- Long data: the variables are not fully separated out; at least one variable's values repeat heavily and can be grouped into a few categories. The table is tall overall: few variables, many observations.
- Wide data: every variable is split out into its own column, and the values neither repeat cyclically nor group into categories. The table is wide overall: many variables, few observations.
pivot()
- Parameters:
  - columns: the column whose values become the new column headers
  - values: the column that supplies the cell values
no2 = air_quality[air_quality['parameter']=='no2']
no2_subset = no2.sort_index().groupby(['location']).head(2)
no2_subset
no2_subset.pivot(columns='location',values="value")
date.utc | BETR801 | FR04014 | London Westminster
---|---|---|---
2019-04-09 01:00:00+00:00 | 22.5 | 24.4 | NaN
2019-04-09 02:00:00+00:00 | 53.5 | 27.4 | 67.0
2019-04-09 03:00:00+00:00 | NaN | NaN | 67.0
no2.pivot(columns='location',values='value').plot()
<AxesSubplot:xlabel='date.utc'>
pivot_table()
- Use it when multiple values need to be aggregated
air_quality.pivot_table(
values='value', index='location', columns='parameter',aggfunc="mean"
)
location | no2 | pm25
---|---|---
BETR801 | 26.950920 | 23.169492
FR04014 | 29.374284 | NaN
London Westminster | 29.740050 | 13.443568
air_quality.pivot_table(
values='value',
index='location',
columns='parameter',
aggfunc="mean",
margins=True,
margins_name="Total"
)
location | no2 | pm25 | Total
---|---|---|---
BETR801 | 26.950920 | 23.169492 | 24.982353
FR04014 | 29.374284 | NaN | 29.374284
London Westminster | 29.740050 | 13.443568 | 21.491708
Total | 29.430316 | 14.386849 | 24.222743
air_quality.groupby(['parameter','location']).mean() # recent pandas versions need numeric_only=True here
parameter | location | value
---|---|---
no2 | BETR801 | 26.950920
no2 | FR04014 | 29.374284
no2 | London Westminster | 29.740050
pm25 | BETR801 | 23.169492
pm25 | London Westminster | 13.443568
7.3. Wide to long
no2_pivoted = no2.pivot(columns='location',values='value').reset_index()
no2_pivoted
location | date.utc | BETR801 | FR04014 | London Westminster |
---|---|---|---|---|
0 | 2019-04-09 01:00:00+00:00 | 22.5 | 24.4 | NaN |
1 | 2019-04-09 02:00:00+00:00 | 53.5 | 27.4 | 67.0 |
2 | 2019-04-09 03:00:00+00:00 | 54.5 | 34.2 | 67.0 |
3 | 2019-04-09 04:00:00+00:00 | 34.5 | 48.5 | 41.0 |
4 | 2019-04-09 05:00:00+00:00 | 46.5 | 59.5 | 41.0 |
... | ... | ... | ... | ... |
1700 | 2019-06-20 20:00:00+00:00 | NaN | 21.4 | NaN |
1701 | 2019-06-20 21:00:00+00:00 | NaN | 24.9 | NaN |
1702 | 2019-06-20 22:00:00+00:00 | NaN | 26.5 | NaN |
1703 | 2019-06-20 23:00:00+00:00 | NaN | 21.8 | NaN |
1704 | 2019-06-21 00:00:00+00:00 | NaN | 20.0 | NaN |
1705 rows × 4 columns
no_2 = no2_pivoted.melt(id_vars='date.utc')
no_2.head()
 | date.utc | location | value |
---|---|---|---|
0 | 2019-04-09 01:00:00+00:00 | BETR801 | 22.5 |
1 | 2019-04-09 02:00:00+00:00 | BETR801 | 53.5 |
2 | 2019-04-09 03:00:00+00:00 | BETR801 | 54.5 |
3 | 2019-04-09 04:00:00+00:00 | BETR801 | 34.5 |
4 | 2019-04-09 05:00:00+00:00 | BETR801 | 46.5 |
no_2 = no2_pivoted.melt(
id_vars='date.utc',
value_vars=["BETR801", "FR04014", "London Westminster"],
value_name="NO_2",
var_name="id_location",
)
no_2.head()
8. How do I combine multiple tables?
- concat(): stack tables along an axis
- merge()/join(): database-style joins (join() joins on the index)
Key points:
- concat(): concatenates multiple tables row-wise or column-wise
- merge(): use this for database-style join operations
8.1 concat()
import pandas as pd
data1 = "data/air_quality_no2_long.csv"
data2 = "data/air_quality_pm25_long.csv"
air_quality_no2 = pd.read_csv(data1)
air_quality_pm25 = pd.read_csv(data2)
air_quality = pd.concat([air_quality_no2,air_quality_pm25], axis=0)
air_quality.head()
 | city | country | date.utc | location | parameter | value | unit |
---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ |
print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)
Shape of the ``air_quality_pm25`` table: (1110, 7)
print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)
Shape of the ``air_quality_no2`` table: (2068, 7)
print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)
Shape of the resulting ``air_quality`` table: (3178, 7)
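The row counts add up, since axis=0 stacks the tables vertically:
# 1110 + 2068 == 3178 rows; the 7 columns are shared by both inputs
air_quality_pm25.shape[0] + air_quality_no2.shape[0] == air_quality.shape[0]
True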
air_quality = air_quality.sort_values('date.utc')
air_quality.head()
 | city | country | date.utc | location | parameter | value | unit |
---|---|---|---|---|---|---|---|
1109 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | pm25 | 8.0 | µg/m³ |
2067 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | no2 | 23.0 | µg/m³ |
1098 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | no2 | 50.5 | µg/m³ |
1003 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ |
100 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | pm25 | 12.5 | µg/m³ |
air_quality_ = pd.concat(
[air_quality_pm25, air_quality_no2],
keys=["PM25", "NO2"]
)
air_quality_.head()
 | | city | country | date.utc | location | parameter | value | unit |
---|---|---|---|---|---|---|---|---|
PM25 | 0 | Antwerpen | BE | 2019-06-18 06:00:00+00:00 | BETR801 | pm25 | 18.0 | µg/m³ |
1 | Antwerpen | BE | 2019-06-17 08:00:00+00:00 | BETR801 | pm25 | 6.5 | µg/m³ | |
2 | Antwerpen | BE | 2019-06-17 07:00:00+00:00 | BETR801 | pm25 | 18.5 | µg/m³ | |
3 | Antwerpen | BE | 2019-06-17 06:00:00+00:00 | BETR801 | pm25 | 16.0 | µg/m³ | |
4 | Antwerpen | BE | 2019-06-17 05:00:00+00:00 | BETR801 | pm25 | 7.5 | µg/m³ |
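The keys argument produces a hierarchical (MultiIndex) row index, so each source table can be recovered by its key; a small sketch:
# select all rows that came from the PM2.5 table
air_quality_.loc["PM25"].head()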
8.2 Joining tables on a common key
stations_coord = pd.read_csv("data/air_quality_stations.csv")
stations_coord.head()
 | location | coordinates.latitude | coordinates.longitude |
---|---|---|---|
0 | BELAL01 | 51.23619 | 4.38522 |
1 | BELHB23 | 51.17030 | 4.34100 |
2 | BELLD01 | 51.10998 | 5.00486 |
3 | BELLD02 | 51.12038 | 5.02155 |
4 | BELR833 | 51.32766 | 4.36226 |
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head()
 | city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude |
---|---|---|---|---|---|---|---|---|---|
0 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | pm25 | 8.0 | µg/m³ | 51.49467 | -0.13193 |
1 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | no2 | 23.0 | µg/m³ | 51.49467 | -0.13193 |
2 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | no2 | 50.5 | µg/m³ | 51.20966 | 4.43182 |
3 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ | 48.83724 | 2.39390 |
4 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ | 48.83722 | 2.39390 |
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality_parameters.head()
 | id | description | name |
---|---|---|---|
0 | bc | Black Carbon | BC |
1 | co | Carbon Monoxide | CO |
2 | no2 | Nitrogen Dioxide | NO2 |
3 | o3 | Ozone | O3 |
4 | pm10 | Particulate matter less than 10 micrometers in... | PM10 |
air_quality = pd.merge(air_quality, air_quality_parameters,
how='left', left_on='parameter', right_on='id')
air_quality.head()
 | city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude | id | description | name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | pm25 | 8.0 | µg/m³ | 51.49467 | -0.13193 | pm25 | Particulate matter less than 2.5 micrometers i... | PM2.5 |
1 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | no2 | 23.0 | µg/m³ | 51.49467 | -0.13193 | no2 | Nitrogen Dioxide | NO2 |
2 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | no2 | 50.5 | µg/m³ | 51.20966 | 4.43182 | no2 | Nitrogen Dioxide | NO2 |
3 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ | 48.83724 | 2.39390 | no2 | Nitrogen Dioxide | NO2 |
4 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ | 48.83722 | 2.39390 | no2 | Nitrogen Dioxide | NO2 |
9. How do I handle time-series data?
pandas handles time series very well.
Key points:
- Valid date strings can be converted to datetime objects with the to_datetime function, or as part of the read_* functions.
- Datetime objects in pandas support calculations, logical operations, and convenient date-related properties through the dt accessor.
- A DatetimeIndex carries these date-related properties and supports convenient slicing.
- Resampling is a powerful way to change the frequency of a time series.
Data:
import pandas as pd
import matplotlib.pyplot as plt
air_quality = pd.read_csv("data/air_quality_no2_long.csv")
air_quality = air_quality.rename(columns={'date.utc': "datetime"})
air_quality.head()
 | city | country | datetime | location | parameter | value | unit |
---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ |
air_quality.city.unique()
array(['Paris', 'Antwerpen', 'London'], dtype=object)
9.1. Using pandas datetime properties
The datetime column is plain text; convert it to datetime objects.
air_quality['datetime'] = pd.to_datetime(air_quality['datetime'])
air_quality['datetime'].dtype
datetime64[ns, UTC]
When reading the file, you can instead pass the parse_dates= parameter to convert the text dates to timestamps up front, so the column can be worked with as datetime objects.
aq = pd.read_csv("data/air_quality_no2_long.csv", parse_dates=["date.utc"])
aq['date.utc'].dtypes
aq = aq.rename(columns={'date.utc': "datetime"}) # rename returns a new DataFrame; assign it back
aq['datetime']
0 2019-06-21 00:00:00+00:00
1 2019-06-20 23:00:00+00:00
2 2019-06-20 22:00:00+00:00
3 2019-06-20 21:00:00+00:00
4 2019-06-20 20:00:00+00:00
...
2063 2019-05-07 06:00:00+00:00
2064 2019-05-07 04:00:00+00:00
2065 2019-05-07 03:00:00+00:00
2066 2019-05-07 02:00:00+00:00
2067 2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]
Once converted to timestamps, time-related operations become available.
For example, the start and end of the series:
air_quality["datetime"].min(), air_quality["datetime"].max()
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
air_quality["datetime"].max() - air_quality["datetime"].min()
Timedelta('44 days 23:00:00')
The result is a pandas.Timedelta object, similar to datetime.timedelta from the Python standard library, and it defines a time span.
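Timedelta values also take part in arithmetic with datetime columns; for example, shifting every timestamp (illustrative only):
# shift all measurements forward by one week
(air_quality["datetime"] + pd.Timedelta("7 days")).head()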
Remember the split-apply-combine pattern that groupby provided in the statistics tutorial? Here we want a given statistic (e.g. the mean NO_2) for each weekday and each measurement location. To group on weekdays, we use the weekday property of the pandas Timestamp (Monday=0, Sunday=6), which is also reachable through the dt accessor. Grouping on both location and weekday splits the mean calculation over each of those combinations.
air_quality.groupby(
[air_quality['datetime'].dt.weekday, 'location']
)['value'].mean()
datetime location
0 BETR801 27.875000
FR04014 24.856250
London Westminster 23.969697
1 BETR801 22.214286
FR04014 30.999359
London Westminster 24.885714
2 BETR801 21.125000
FR04014 29.165753
London Westminster 23.460432
3 BETR801 27.500000
FR04014 28.600690
London Westminster 24.780142
4 BETR801 28.400000
FR04014 31.617986
London Westminster 26.446809
5 BETR801 33.500000
FR04014 25.266154
London Westminster 24.977612
6 BETR801 21.896552
FR04014 23.274306
London Westminster 24.859155
Name: value, dtype: float64
Similar to the previous case, we may want a given statistic (e.g. the mean NO_2) for each hour of the day, and we can again use split-apply-combine. This time we use the hour property of the pandas Timestamp, also reachable through the dt accessor.
# plot the hourly means
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
kind='bar', rot=0, ax=axs
)
plt.xlabel("Hour of the day")
plt.ylabel("$NO_2 (µg/m^3)$")
Text(0, 0.5, '$NO_2 (µg/m^3)$')
9.2. Datetime as the index
In the reshaping tutorial (section 7), pivot() was used to reshape the table with each measurement location as a separate column:
no_2 = air_quality.pivot(index='datetime',columns='location',values='value')
no_2.head()
datetime | BETR801 | FR04014 | London Westminster
---|---|---|---
2019-05-07 01:00:00+00:00 | 50.5 | 25.0 | 23.0
2019-05-07 02:00:00+00:00 | 45.0 | 27.7 | 19.0
2019-05-07 03:00:00+00:00 | NaN | 50.4 | 19.0
2019-05-07 04:00:00+00:00 | NaN | 61.9 | 16.0
2019-05-07 05:00:00+00:00 | NaN | 72.4 | NaN
Pivoting the data made the datetime information the index of the table. In general, a column can be set as the index with the set_index function.
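A minimal set_index sketch on the long-format table (only to show the call; the pivot above already produced such an index):
# make the datetime column the index
air_quality.set_index("datetime").head()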
Working with a datetime index (i.e. a DatetimeIndex) provides powerful functionality. For example, we do not need the dt accessor to get the time-series properties; they are available on the index directly:
no_2.index.year, no_2.index.weekday
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
...
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
dtype='int64', name='datetime', length=1033),
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
dtype='int64', name='datetime', length=1033))
Other advantages include convenient slicing of time periods and adapting the time scale. For example:
no_2["2019-05-20":"2019-05-21"].plot();
Providing a string that parses to a datetime selects the corresponding subset of data on a DatetimeIndex.
9.3. Changing the time frequency
monthly_max = no_2.resample("M").max()
monthly_max
datetime | BETR801 | FR04014 | London Westminster
---|---|---|---
2019-05-31 00:00:00+00:00 | 74.5 | 97.0 | 97.0
2019-06-30 00:00:00+00:00 | 52.5 | 84.7 | 52.0
The resample() method is similar to a groupby operation:
- it groups by time, using a string that defines the target frequency (e.g. M, 5H, ...)
- it requires an aggregation function, such as mean, max, ...
Once defined, the frequency of the time series is available through the freq attribute:
monthly_max.index.freq
<MonthEnd>
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));
10. How do I manipulate textual data?
pandas is just as capable with string data.
Key points:
- String methods are available through the str accessor.
- String methods work element-wise and can be used for conditional indexing.
- The replace method is a convenient way to convert values according to a given dictionary.
Data:
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Worked examples
Q1: Make all name characters lowercase.
titanic["Name"].str.lower()
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss. laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss. margaret edith
888 johnston, miss. catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object
To lowercase every string in the Name column, select the Name column (see the tutorial on subsetting), attach the str accessor, and apply the lower method. Each string is then converted element-wise.
Just as datetime objects have a dt accessor (see the time-series tutorial), a number of specialized string methods become available through the str accessor. Their names generally match the equivalent built-in string methods for single elements, but they are applied element-wise to every value in the column (remember element-wise computations?).
Q2: Create a new column Surname that contains the passenger's surname, by extracting the part before the comma.
titanic["Name"].str.split(",")
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
With the Series.str.split() method, each value is returned as a list of two elements: the part before the comma and the part after it.
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic['Surname']
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object
Since we are only interested in the first part, the surname (element 0), we use the str accessor again and apply Series.str.get() to extract it. In fact, these string functions can be chained to combine multiple steps in a single expression!
Q3: Extract the passenger data for the countess aboard the Titanic.
titanic["Name"].str.contains("Countess")
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Name, Length: 891, dtype: bool
titanic[titanic['Name'].str.contains('Countess')]
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Surname |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
759 | 760 | 1 | 1 | Rothes, the Countess. of (Lucy Noel Martha Dye... | female | 33.0 | 0 | 0 | 110152 | 86.5 | B77 | S | Rothes |
The string method Series.str.contains() checks each value in the Name column and returns True when the string contains the word Countess and False otherwise. This output can be used to sub-select rows with the conditional (boolean) indexing introduced in the subsetting tutorial. Since there was only one countess on the Titanic, we get exactly one row back.
The Series.str.extract() method accepts regular expressions, as sketched below.
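For example, extracting the honorific between the comma and the first period (the pattern is illustrative, not from the original):
# capture the title, e.g. "Mr", "Mrs", "Miss"
titanic["Name"].str.extract(r",\s*([^.]+)\.").head()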
Q4: Which passenger on the Titanic has the longest name?
titanic['Name'].str.len()
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
To find the longest name, we first need the length of each entry in the Name column. Using pandas string methods, the Series.str.len() function is applied to each name element-wise.
titanic['Name'].str.len().idxmax()
307
Next, we need the position in the table, ideally the index label, where the name length is largest. The idxmax() method does exactly that. It is not a string method; it works on the integer lengths, so str is not used.
titanic.loc[titanic['Name'].str.len().idxmax(),'Name']
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
Knowing the row index label (307) and the column name (Name), we can select the value with the loc operator, introduced in the subsetting tutorial.
Q5: In the Sex column, replace the value "male" with "M" and "female" with "F".
titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
titanic['Sex_short']
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex_short, Length: 891, dtype: object
Although replace() is not a string method, it provides a convenient way to convert values using a mapping or vocabulary. It expects a dictionary that defines the {from: to} mapping.