
Python for Data Analysis (2)


Chapter 7: Data Cleaning and Preparation

7.1 Handling Missing Data

string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
#When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data-collection problems or potential biases caused by missing data.
#The built-in Python None value is also treated as NA in object arrays
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
Method     Description
dropna     Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna     Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull     Return boolean values indicating which values are missing
notnull    Negation of isnull

7.1.1 Filtering Out Missing Data

#Using dropna on a Series returns the Series with only the non-null data and index values
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
#By default, dropna drops any row containing a missing value
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
#Passing how='all' drops only the rows that are all NA
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
#To drop columns in the same way, pass axis=1
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1,how = 'all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
df = pd.DataFrame(np.random.randn(7,3))
df
          0         1         2
0 -0.100288  0.117081  0.629897
1  0.145224  0.827820 -0.197561
2 -1.372610 -0.521075  0.783224
3 -0.679339  0.355698 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
          0         1         2
0 -0.100288       NaN       NaN
1  0.145224       NaN       NaN
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.dropna()
          0         1         2
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#Suppose you want to keep only rows containing at least a certain number of observations; you can indicate this with the thresh argument
#thresh=n keeps every row that has at least n non-NA values; thresh=2 keeps the rows with at least 2 non-NaN entries
df.dropna(thresh = 2)
          0         1         2
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943

7.1.2 Filling In Missing Data

Argument   Description
value      Scalar value or dict-like object to use to fill missing values
method     Interpolation method such as 'ffill' or 'bfill'; 'ffill' by default if no other arguments are given
axis       Axis to fill on; default axis=0
inplace    Modify the calling object without producing a copy
limit      For forward and backward filling, maximum number of consecutive periods to fill
df.fillna(0)
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#Calling fillna with a dict lets you use a different fill value for each column
df.fillna({1:0.5,2:0})
          0         1         2
0 -0.100288  0.500000  0.000000
1  0.145224  0.500000  0.000000
2 -1.372610  0.500000  0.783224
3 -0.679339  0.500000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#fillna returns a new object, but you can also modify the existing object in place
_ = df.fillna(0,inplace = True)
df
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#The same interpolation methods available for reindexing can also be used with fillna
df = pd.DataFrame(np.random.randn(6,3))
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.390681 -2.386976
3 -0.076475 -0.034995  1.635065
4  0.528814  0.711717  0.696243
5 -0.193577  0.162206 -0.520191
df.iloc[2:,1] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN  0.696243
5 -0.193577       NaN -0.520191
df.iloc[4:,2] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814  0.921389  1.635065
5 -0.193577  0.921389  1.635065
df.fillna(method='backfill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill',limit=2)
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814       NaN  1.635065
5 -0.193577       NaN  1.635065
data = pd.Series([5,NA,3,NA,7])
data.fillna(data.mean())
0    5.0
1    5.0
2    3.0
3    5.0
4    7.0
dtype: float64
data.mean()
5.0

7.2 Data Transformation

7.2.1 Removing Duplicates

data = pd.DataFrame({'k1':['one','two']*3+['two'],
                    'k2':[1,1,2,3,3,4,4,]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
#The DataFrame method duplicated returns a boolean Series
#indicating whether each row is a duplicate (its values match those of a row observed earlier)
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
#drop_duplicates returns a DataFrame containing the rows where the array returned by duplicated is False,
#i.e., duplicate rows are dropped and only the unique ones are kept
#By default these methods consider all of the columns
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
#duplicated and drop_duplicates keep the first observed value by default; passing keep='last' returns the last one
data.drop_duplicates(['k1','k2'],keep = 'last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6

7.2.2 Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food':['bacon','pulled pork','bacon','pastrami','corned beef',
                             'bacon','pastrami','honey ham','nova lox'],
                     'ounces':[4.0,3.0,12.0,6.0,7.5,8.0,3.0,5.0,6.0]})

data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
'bacon':'pig',
'pulled pork':'pig',  
'pastrami':'cow',
'corned beef':'cow',  
'honey ham':'pig', 
'nova lox':'salmon',
}
#The map method on a Series accepts a function or a dict-like object containing a mapping
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
#Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations
data['food'].map(lambda x :meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

7.2.3 Replacing Values

data = pd.Series([1,-999,2,-999,-1000,3])
data
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
#If you want to replace multiple values at once, pass a list of values and a single substitute value
data.replace([-999,-1000],np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
#To use a different replacement for each value, pass a list of substitutes
data.replace([-999,-1000],[np.nan,0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
#The argument can also be passed as a dict
data.replace({-999:np.nan,-1000:0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

7.2.4 Renaming Axis Indexes

data = pd.DataFrame(np.arange(12).reshape(3,4),
                   index = ['Ohio','Colorado','New York'],
                   columns = ['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x :x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index = str.title,columns = str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
#rename can be used with a dict-like object, providing new values for a subset of the axis labels
data.rename(index = {'OHIO':'INDIANA'},
           columns = {'three':'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

7.2.5 Discretization and Binning

ages = [20,22,24,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
#pd.value_counts(cats) gives the bin counts for the result of pandas.cut
pd.value_counts(cats)
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
#Consistent with mathematical interval notation, a parenthesis means the side is open (exclusive), while a square bracket means it is closed (inclusive).
#You can change which side is closed by passing right=False
pd.cut(ages,[18,26,36,61,100],right = False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
#You can also pass your own bin names by passing a list or array to the labels option
group_names = ['youth','youngadult','middleaged','senior']
a = pd.cut(ages,bins,labels = group_names)
pd.value_counts(a)
youth         5
youngadult    3
middleaged    3
senior        1
dtype: int64
#If you pass cut an integer number of bins instead of explicit bin edges, pandas computes equal-length bins based on the minimum and maximum values in the data
#The precision=2 option limits the decimal precision to two digits
data = np.random.rand(20)
pd.cut(data,4,precision=2)
[(0.51, 0.74], (0.29, 0.51], (0.74, 0.97], (0.29, 0.51], (0.06, 0.29], ..., (0.06, 0.29], (0.29, 0.51], (0.74, 0.97], (0.51, 0.74], (0.74, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.06, 0.29] < (0.29, 0.51] < (0.51, 0.74] < (0.74, 0.97]]
#qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points.
#Since qcut uses the sample quantiles, you get roughly equal-size bins
data = np.random.randn(1000)
cats = pd.qcut(data,4)#cut into quartiles
cats
[(-0.00707, 0.65], (-0.00707, 0.65], (-2.936, -0.626], (-0.626, -0.00707], (-2.936, -0.626], ..., (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.00707, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.626] < (-0.626, -0.00707] < (-0.00707, 0.65] < (0.65, 3.139]]
pd.value_counts(cats)
(-2.936, -0.626]      250
(-0.626, -0.00707]    250
(-0.00707, 0.65]      250
(0.65, 3.139]         250
dtype: int64
#Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive); that requires qcut.
#Note: the call below actually uses pd.cut, which treats these numbers as absolute bin edges, so values outside [0,1] become NaN (visible in the output)
pd.cut(data,[0,0.1,0.5,0.9,1])
[(0.5, 0.9], (0.1, 0.5], NaN, NaN, NaN, ..., NaN, NaN, NaN, NaN, (0.1, 0.5]]
Length: 1000
Categories (4, interval[float64]): [(0.0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1.0]]
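The quantile-based call the comment describes would use qcut (my correction sketch, not in the original notes); with quantile boundaries the four bins cover the whole data range, so no NaN appear:

pd.qcut(data,[0,0.1,0.5,0.9,1])
#value counts would be roughly 100/400/400/100 for the 1000 samples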

7.2.6 Detecting and Filtering Outliers

data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013343     0.030142     0.020312     0.042330
std       1.012528     0.984443     0.999869     0.982124
min      -2.942920    -3.799121    -3.412855    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.274496     3.998493     3.264216     2.907744
#Suppose you wanted to find values in one of the columns exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]
91    -3.044972
711    3.264216
858   -3.412855
Name: 2, dtype: float64
data[(np.abs(data)>3).any(1)]
            0         1         2         3
91  -0.341046 -0.555910 -3.044972  0.474512
325  2.233400 -3.027404  0.845704  1.441757
332 -0.460361 -3.799121 -0.312931  0.478548
457  0.011004  3.998493  0.977419  0.577620
711 -0.603762 -1.650901  3.264216 -0.803395
746  1.455624 -3.178085 -0.387140  0.859193
858 -2.127923  0.163924 -3.412855 -0.073186
946  3.274496 -0.699596 -1.016879  0.358252
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013069     0.030148     0.020506     0.042330
std       1.011680     0.977459     0.997573     0.982124
min      -2.942920    -3.000000    -3.000000    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.000000     3.000000     3.000000     2.907744
data
            0         1         2         3
0    0.997285  0.352539 -0.158277 -0.069519
1   -1.144523 -0.173312 -0.651227  0.686972
2    0.650131  0.271325 -0.304344 -0.281217
3    0.527442 -2.023765  0.827982 -1.855424
4   -0.578451 -0.949705 -0.582701 -1.725697
..        ...       ...       ...       ...
995  0.494311  0.528862 -0.191097  0.118121
996 -0.582154  1.251247 -1.622055 -0.436563
997  0.687732 -1.670059 -0.272708 -0.369290
998 -0.443230  0.984728 -0.283506 -1.473420
999 -0.276277 -0.597256  1.269391 -0.704337

1000 rows × 4 columns

#The expression np.sign(data) produces 1 and -1 values depending on whether the values in data are positive or negative
np.sign(data).head()
     0    1    2    3
0  1.0  1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0  1.0
2  1.0  1.0 -1.0 -1.0
3  1.0 -1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0

7.2.7 Permutation and Random Sampling

sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
df = pd.DataFrame(np.arange(5*4).reshape(5,4))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
#The integer array can be used in iloc-based indexing or the equivalent take function (an equivalence check follows the take example below)
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
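As a quick cross-check (my addition, not in the original notes), iloc indexing with the same permutation gives an identical result:

df.iloc[sampler]                            #same rows in the same order as df.take(sampler)
df.take(sampler).equals(df.iloc[sampler])   #True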
#To select a random subset without replacement, you can use the sample method on Series and DataFrame
df.sample(n=3)
    0   1   2   3
0   0   1   2   3
4  16  17  18  19
3  12  13  14  15
#To generate a sample with replacement (allowing repeated choices), pass replace=True to sample
choices = pd.Series([5,6,-1,6,4])
draws = choices.sample(n=10,replace = True)
draws
2   -1
0    5
2   -1
3    6
0    5
1    6
1    6
4    4
3    6
1    6
dtype: int64

7.2.8 Computing Indicator/Dummy Variables

df = pd.DataFrame({'key':['b','b','a','c','a','b'],
                  'data1':range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
#In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data.
#get_dummies has a prefix argument for doing this
dummies = pd.get_dummies(df['key'],prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames = ['movie_id','title','genres']
movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
                       ,sep='::',header=None,names = mnames)
<ipython-input-188-960ac40c2eea>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
movies[::10]
      movie_id                                              title                       genres
0            1                                   Toy Story (1995)  Animation|Children's|Comedy
10          11                     American President, The (1995)         Comedy|Drama|Romance
20          21                                  Get Shorty (1995)          Action|Comedy|Drama
30          31                             Dangerous Minds (1995)                        Drama
40          41                                 Richard III (1995)                    Drama|War
...        ...                                                ...                          ...
3840      3910                          Dancer in the Dark (2000)                Drama|Musical
3850      3920  Faraway, So Close (In Weiter Ferne, So Nah!) (...                Drama|Fantasy
3860      3930         Creature From the Black Lagoon, The (1954)                       Horror
3870      3940             Slumber Party Massacre III, The (1990)                       Horror
3880      3950                                   Tigerland (2000)                        Drama

389 rows × 3 columns

#Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies),len(genres)))
dummies = pd.DataFrame(zero_matrix,columns=genres)
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
dummies
      Animation  Children's  Comedy  Adventure  Fantasy  Romance  Drama  Action  Crime  Thriller  Horror  Sci-Fi  Documentary  War  Musical  Mystery  Film-Noir  Western
0           0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
1           0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
2           0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
3           0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
4           0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
...         ...         ...     ...        ...      ...      ...    ...     ...    ...       ...     ...     ...          ...  ...      ...      ...        ...      ...
3878        0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
3879        0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
3880        0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
3881        0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0
3882        0.0         0.0     0.0        0.0      0.0      0.0    0.0     0.0    0.0       0.0     0.0     0.0          0.0  0.0      0.0      0.0        0.0      0.0

3883 rows × 18 columns

gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split("|"))
array([0, 1, 2], dtype=int64)
for i,gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i,indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                              1.0
Genre_Children's                             1.0
Genre_Comedy                                 1.0
Genre_Adventure                              0.0
Genre_Fantasy                                0.0
Genre_Romance                                0.0
Genre_Drama                                  0.0
Genre_Action                                 0.0
Genre_Crime                                  0.0
Genre_Thriller                               0.0
Genre_Horror                                 0.0
Genre_Sci-Fi                                 0.0
Genre_Documentary                            0.0
Genre_War                                    0.0
Genre_Musical                                0.0
Genre_Mystery                                0.0
Genre_Film-Noir                              0.0
Genre_Western                                0.0
Name: 0, dtype: object
#A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0

7.3 String Manipulation

7.3.1 String Object Methods

#A comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
#split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
#These substrings could be concatenated together with a two-colon delimiter using addition
first,second,third = pieces
first+"::"+second+"::"+third
'a::b::guido'
#A faster and more Pythonic way is to pass a list or tuple to the join method of the string '::'
"::".join(pieces)
'a::b::guido'
#Using Python's in keyword is the best way to detect a substring, though index and find can also be used
'guido' in val
True
#Note that the difference between find and index is that index raises an exception if the string isn't found, whereas find returns -1 (see the example after the two calls below)
val.index('guido')
5
val.find('guido')
5
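For illustration (my addition, not in the original notes), searching for a substring that is absent shows the difference:

val.find('zzz')    #returns -1
val.index('zzz')   #raises ValueError: substring not found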
#count returns the number of non-overlapping occurrences of a particular substring in the string

val.count('guido')
1
#replace substitutes occurrences of one pattern for another. It is also commonly used to delete a pattern by passing an empty string
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'
Method                 Description
count                  Return the number of non-overlapping occurrences of the substring in the string
endswith               Return True if the string ends with the suffix
startswith             Return True if the string starts with the prefix
join                   Use the string as a delimiter for concatenating a sequence of other strings
index                  Return the position of the first character of the first occurrence of the substring; raises ValueError if not found
find                   Return the position of the first character of the first occurrence of the substring; like index, but returns -1 if not found
rfind                  Return the position of the first character of the last occurrence of the substring; returns -1 if not found
replace                Replace occurrences of one string with another
strip, rstrip, lstrip  Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip) for each element
split                  Break the string into a list of substrings using the passed delimiter
lower                  Convert alphabetic characters to lowercase
upper                  Convert alphabetic characters to uppercase
casefold               Convert characters to lowercase, converting any region-specific variable character combinations to a common comparable form
ljust, rjust           Left-justify or right-justify; pad the opposite side of the string with spaces (or some other fill character) to return a string with a minimum width
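casefold and ljust/rjust from the table above are not demonstrated later; a brief sketch (my addition):

'Straße'.casefold()   #'strasse' (the German sharp s folds to 'ss')
'abc'.ljust(8,'*')    #'abc*****'
'abc'.rjust(8)        #'     abc'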

7.3.2 Regular Expressions

Method     Description
findall    Return all non-overlapping matching patterns in a string as a list
finditer   Like findall, but returns an iterator
match      Match the pattern at the start of the string, optionally segmenting the pattern components into groups; returns a match object if the pattern matches, otherwise None
search     Scan the string for a match to the pattern, returning a match object if so; unlike match, the match can be anywhere in the string rather than only at the beginning
split      Break the string into pieces at each occurrence of the pattern
sub, subn  Replace all (sub) or the first n (subn) occurrences of the pattern in the string with a replacement expression; use the symbols \1, \2, ... to refer to match group elements in the replacement string
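finditer and subn from the table above likewise get no example later; a brief sketch (my addition):

import re
for m in re.finditer(r'\d+','a1 b22 c333'):
    print(m.group(),m.span())        #1 (1, 2) / 22 (4, 6) / 333 (8, 11)
re.subn(r'\d+','#','a1 b22 c333')    #('a# b# c#', 3)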
import re
text = 'foo bar\t baz \tqux'
re.split(r'\s+',text)
['foo', 'bar', 'baz', 'qux']
#You can compile the regex yourself with re.compile, forming a reusable regex object
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
#If you want to get a list of all patterns matching the regex, you can use the findall method
regex.findall(text)
[' ', '\t ', ' \t']
text = """
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern,flags = re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
#search returns only the first email address matched in the text
m = regex.search(text)
m
<re.Match object; span=(6, 21), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
#regex.match matches only if the pattern occurs at the start of the string; it returns None otherwise
print(regex.match(text))
None
#sub returns a new string with occurrences of the pattern replaced by a new string
print(regex.sub('ABC',text))
Dave ABC
Steve ABC
Rob ABC
Ryan ABC

#Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix.
#To do this, put parentheses around the parts of the pattern to segment
pattern = r'([A-Z0-9.%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern,flags = re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
#findall returns a list of tuples when the pattern has groups
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com

7.3.3 Vectorized String Functions in pandas

Method     Description
cat        Concatenate strings element-wise with an optional delimiter
contains   Return a boolean array indicating whether each string contains a pattern/regex
count      Count occurrences of the pattern
extract    Use a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group
endswith   Equivalent to x.endswith(pattern) for each element
data = {'Dave':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Ryan':np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Ryan                 NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Ryan      True
dtype: bool
#You can apply string and regular expression methods to each value (passing a lambda or other function) using data.map,
#but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values; these are accessed through Series's str attribute
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Ryan       NaN
dtype: object
#Regular expressions can be used, too, along with any re module options such as IGNORECASE
data.str.findall(pattern,flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Ryan                       NaN
dtype: object
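extract, listed in the table above, returns the match groups as DataFrame columns; a sketch with the same data and pattern (my addition, not in the original notes):

data.str.extract(pattern,flags=re.IGNORECASE)
#one column per group (username, domain, suffix); the Ryan row is all NaN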
#There are a couple of ways to do vectorized element retrieval: use str.get or index into the str attribute
matches = data.str.match(pattern,flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Ryan      NaN
dtype: object
#To access elements in the embedded lists, we can pass an index to either of these functions
matches.str.get(1)
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-245-8d76f9329d2a> in <module>
      1 #To access elements in the embedded lists, we can pass an index to either of these functions
----> 2 matches.str.get(1)


D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):


D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 


AttributeError: Can only use .str accessor with string values!
matches.str[0]
(This raises the same AttributeError as above: Can only use .str accessor with string values!)
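Both calls fail because str.match in modern pandas returns booleans rather than match groups, so matches is not string-valued. A workaround sketch (my addition, not from the book): retrieve the groups with findall, then index into the embedded tuples:

tuples = data.str.findall(pattern,flags=re.IGNORECASE).str.get(0)   #first match tuple per row; NaN for Ryan
tuples.str.get(1)   #second group (the domain): google, gmail, gmail, NaN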
#You can similarly slice strings in a vectorized way using syntax analogous to string slicing
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Ryan       NaN
dtype: object

Chapter 8: Data Wrangling: Join, Combine, and Reshape

8.1 Hierarchical Indexing

#What you see below is a prettified view of a Series with a MultiIndex as its index. The "gaps" in the index display mean "use the label directly above"
data = pd.Series(np.random.randn(9),
                index = [['a','a','a','b','b','c','c','d','d'],
                        [1,2,3,1,3,1,2,2,3]])
data
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
data['b']
1    0.228913
3    1.352917
dtype: float64
data['b':'c']
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
data.loc[['b','c']]
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
#Selection is even possible from an "inner" level
data.loc[:,3]
a    0.274992
b    1.352917
d    1.669025
dtype: float64
#Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table.
#You can rearrange the data into a DataFrame using its unstack method
data.unstack()
          1         2         3
a  1.007189 -1.296221  0.274992
b  0.228913       NaN  1.352917
c  0.886429 -2.001637       NaN
d       NaN -0.371843  1.669025
#The inverse operation of unstack is stack
data.unstack().stack()
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
#With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape(4,3),
                    index = [['a','a','b','b'],[1,2,1,2]],
                    columns = [['ohio','ohio','colorado'],['green','red','green']])
frame
      ohio     colorado
     green red    green
a 1      0   1        2
  2      3   4        5
b 1      6   7        8
  2      9  10       11
frame.index.names = ['key1','key2']
frame
           ohio     colorado
          green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.columns.names = ['state','color']
frame
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['ohio']
color     green  red
key1 key2
a    1        0    1
     2        3    4
b    1        6    7
     2        9   10
#A MultiIndex can be created by itself with its own constructor and then reused
pd.MultiIndex.from_arrays([['ohio','ohio','colorado'],['green','red','green']],
                      names = ['state','color'])
MultiIndex([(    'ohio', 'green'),
            (    'ohio',   'red'),
            ('colorado', 'green')],
           names=['state', 'color'])

8.1.1 Reordering and Sorting Levels

#swaplevel takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered)
frame.swaplevel('key1','key2')
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
#sort_index sorts the data using only the values in a single level.
#When swapping levels, it's common to also use sort_index so that the result is lexicographically sorted by the indicated level
frame.sort_index(level=1)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.sort_index(level=0)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.swaplevel(0,1).sort_index(level=0)
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11

8.1.2 Summary Statistics by Level

frame.sum(level='key2')
state  ohio     colorado
color green red    green
key2
1         6   8       10
2        12  14       16
frame.sum(level='color',axis=1)
color      green  red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10

8.1.3 Indexing with a DataFrame's Columns

Option    Behavior
'inner'   Use only the key combinations observed in both tables
'left'    Use all key combinations found in the left table
'right'   Use all key combinations found in the right table
'outer'   Use all key combinations observed in both tables together
(These are the how options for merge, covered in 8.2.1.)
frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
                     'c':['one','one','one','two','two','two','two'],
                     'd':[0,1,2,0,1,2,3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
#DataFrame's set_index function will create a new DataFrame using one or more of its columns as the index
frame2 = frame.set_index(['c','d'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
#By default those columns are removed from the DataFrame, though you can leave them in
frame.set_index(['c','d'],drop = False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
#reset_index does the opposite of set_index; the hierarchical index levels are moved into the columns
frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

8.2 Combining and Merging Datasets

8.2.1 Database-Style DataFrame Joins

Argument     Description
left         DataFrame to be merged on the left side
right        DataFrame to be merged on the right side
how          One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'
on           Column names to join on; must be found in both DataFrame objects. If not specified and no other join keys are given, the intersection of the column names in left and right will be used as the join keys
left_on      Columns in the left DataFrame to use as join keys
right_on     Columns in the right DataFrame to use as join keys
left_index   Use the row index in left as its join key (or keys, if a MultiIndex)
right_index  Use the row index in right as its join key (or keys, if a MultiIndex)
sort         Sort merged data lexicographically by join keys; True by default (disabling this can give better performance in some cases on large datasets)
suffixes     Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if 'data' is in both DataFrame objects, it would appear as 'data_x' and 'data_y' in the result)
copy         If False, avoid copying data into the resulting data structure in some exceptional cases; by default always copies
indicator    Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row (see the sketch following the first merge examples below)
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'key':['a','b','d'],
                   'data2':range(3)})
df2
  key  data2
0   a      0
1   b      1
2   d      2
pd.merge(df1,df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
#Note that I didn't specify which column to join on. If that information is not specified, merge uses the overlapping column names as the keys.
#It's a good practice to specify the join key explicitly, though
pd.merge(df1,df2,on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
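The indicator option from the table above is never demonstrated in these notes; a quick sketch using the same df1 and df2 (my addition):

pd.merge(df1,df2,on='key',how='outer',indicator=True)
#adds a _merge column whose value is 'both' for the a/b rows,
#'left_only' for key 'c', and 'right_only' for key 'd'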
#If the column names are different in each object, you can specify them separately
df3 = pd.DataFrame({'Lkey':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df3
  Lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
df4 = pd.DataFrame({'Rkey':['a','b','d'],
                   'data2':range(3)})
df4
  Rkey  data2
0    a      0
1    b      1
2    d      2
pd.merge(df3,df4,left_on='Lkey',right_on='Rkey')
  Lkey  data1 Rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
#Other possible how options are 'left', 'right', and 'outer'.
#The outer join takes the union of the keys, combining the effect of applying both left and right joins
pd.merge(df1,df2,how='outer')
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],
                   'data1':range(6)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
                   'data2':range(5)})
df2
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4
pd.merge(df1,df2,on='key',how='left')
   key  data1  data2
0    b      0    1.0
1    b      0    3.0
2    b      1    1.0
3    b      1    3.0
4    a      2    0.0
5    a      2    2.0
6    c      3    NaN
7    a      4    0.0
8    a      4    2.0
9    b      5    1.0
10   b      5    3.0
pd.merge(df1,df2,on='key',how='inner')
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
#To merge with multiple keys, pass a list of column names
left = pd.DataFrame({'key1':['foo','foo','bar'],
                    'key2':['one','two','one'],
                    'lval':[1,2,3]})
left
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
                    'key2':['one','one','one','two'],
                    'rval':[4,5,6,7]})
right
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
pd.merge(left,right,on=['key1','key2'],how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
#merge has a suffixes option for specifying strings to append
#to overlapping column names in the left and right DataFrame objects
pd.merge(left,right,on=['key1'])
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left,right,on=['key1'],suffixes=('_left','_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

8.2.2 Merging on Index

left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
                     'value':range(6)})
left1
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
right1 = pd.DataFrame({'group_val':[3.5,7]},index = ['a','b'])
right1
   group_val
a        3.5
b        7.0
pd.merge(left1,right1,left_on = 'key',right_index=True)
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
pd.merge(left1,right1,left_on = 'key',right_index=True,how='outer')
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
#With hierarchically indexed data, things are more complicated, as joining on index is implicitly a multiple-key merge
lefth = pd.DataFrame({'key1':['ohio','ohio','ohio','Nevada','Nevada'],
                     'key2':[2000,2001,2002,2001,2002],
                     'data':np.arange(5.)})
lefth
     key1  key2  data
0    ohio  2000   0.0
1    ohio  2001   1.0
2    ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
righth = pd.DataFrame(np.arange(12).reshape(6,2),
                     index=[['nevada','nevada','ohio','ohio','ohio','ohio'],[2001,2000,2000,2000,2001,2002]],
                     columns = ['event1','event2'])
righth
             event1  event2
nevada 2001       0       1
       2000       2       3
ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
     key1  key2  data  event1  event2
0    ohio  2000   0.0     4.0     5.0
0    ohio  2000   0.0     6.0     7.0
1    ohio  2001   1.0     8.0     9.0
2    ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     NaN     NaN
4  Nevada  2002   4.0     NaN     NaN
4  nevada  2001   NaN     0.0     1.0
4  nevada  2000   NaN     2.0     3.0
left2 = pd.DataFrame([[1,2],[3,4],[5,6]],
                    index = ['a','c','e'],
                    columns = ['ohio','nevada'])
left2
   ohio  nevada
a     1       2
c     3       4
e     5       6
right2 = pd.DataFrame([[7,8],[9,10],[11,12],[13,14]],
                    index = ['b','c','d','e'],
                    columns = ['missouri','alabama'])
right2
   missouri  alabama
b         7        8
c         9       10
d        11       12
e        13       14
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
#DataFrame has a join instance method for merging by index
left2.join(right2,how='outer')
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left1.join(right1,on='key')
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
another = pd.DataFrame([[7,8],[9,10],[11,12],[16,17]],
                      index = ['a','c','e','f'],
                      columns = ['new york','oregon'])
another
   new york  oregon
a         7       8
c         9      10
e        11      12
f        16      17
left2.join([right2,another])
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
left2.join([right2,another],how='outer')
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
b   NaN     NaN       7.0      8.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
f   NaN     NaN       NaN      NaN      16.0    17.0

8.2.3 Concatenating Along an Axis

Argument   Description
objs       List or dict of pandas objects to be concatenated; this is the only required argument
axis       Axis to concatenate along; defaults to 0 (along rows)
join       Either 'inner' or 'outer' ('outer' by default); whether to intersect (inner) or union (outer) the indexes along the other axes
join_axes  Specific indexes to use for the other n-1 axes instead of performing inner/outer set logic (removed in pandas 1.0; use reindex instead)
keys       Values to associate with the objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays are passed in levels)
levels     Specific indexes to use as the hierarchical index level or levels if keys are passed
arr = np.arange(12).reshape(3,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
s1 = pd.Series([0,1],index=['a','b'])
s1
a    0
b    1
dtype: int64
s2 = pd.Series([2,3,4],index=['c','d','e'])
s2
c    2
d    3
e    4
dtype: int64
s3 = pd.Series([5,6],index=['f','g'])
s3
f    5
g    6
dtype: int64
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
#By default concat works along axis=0, producing another Series.
#If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns)
pd.concat([s1,s2,s3],axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
s4 = pd.concat([s1,s3])
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1,s4],axis=1)
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6
pd.concat([s1,s4],axis=1,join='inner')
   0  1
a  0  0
b  1  1
#You can specify the indexes to be used on the other axes with join_axes (removed in pandas 1.0; a modern equivalent is sketched after this example)
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN
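join_axes was removed in pandas 1.0; the modern equivalent (my addition, not in the original notes) reindexes after concatenation:

pd.concat([s1,s4],axis=1).reindex(['a','c','b','e'])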
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
pd.concat([s1,s2,s3],axis = 1,keys=['one','two','three'])
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
df1 = pd.DataFrame(np.arange(6).reshape(3,2),
                  index = ['a','b','c'],
                  columns = ['one','two'])
df1
   one  two
a    0    1
b    2    3
c    4    5
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,
                  index = ['a','c'],
                  columns = ['three','four'])
df2
   three  four
a      5     6
c      7     8
pd.concat([df1,df2],axis=1,keys=['lever1','level2'])
  lever1     level2
     one two  three four
a      0   1    5.0   6.0
b      2   3    NaN   NaN
c      4   5    7.0   8.0
#If you pass a dict of objects instead of a list, the dict's keys will be used for the keys option
pd.concat({'level1':df1,'level2':df2},axis=1)
  level1     level2
     one two  three four
a      0   1    5.0   6.0
b      2   3    NaN   NaN
c      4   5    7.0   8.0
pd.concat([df1,df2],axis=1,keys=['lever1','level2'],names=['upper','lower'])
upper lever1     level2
lower    one two  three four
a          0   1    5.0   6.0
b          2   3    NaN   NaN
c          4   5    7.0   8.0
df1 = pd.DataFrame(np.random.randn(3,4),columns = ['a','b','c','d'])
df1
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
df2
          b         d         a
0  1.638046 -0.850112  1.895532
1 -1.175952  1.370474 -0.992356
pd.concat([df1,df2],ignore_index=True)
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
3  1.895532  1.638046       NaN -0.850112
4 -0.992356 -1.175952       NaN  1.370474

8.2.4 Combining Data with Overlap

a = pd.Series([np.nan,2.5,0,3.5,4.5,np.nan],
             index=['f','e','d','c','b','a'])
a
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64
b = pd.Series([0,np.nan,2,np.nan,np.nan,5],
             index=['a','b','c','d','e','f'])
b
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64
#Consider NumPy's where function, which performs the array-oriented equivalent of an if-else expression
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
#Series has a combine_first method, which performs the equivalent of this operation along with pandas's usual data alignment logic
b.combine_first(a)
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64
df1 = pd.DataFrame({'a':[1,np.nan,5,np.nan],
                   'b':[np.nan,2,np.nan,6],
                   'c':range(2,18,4)})
df1
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2 = pd.DataFrame({'a':[5,4,np.nan,3,7],
                   'b':[np.nan,3,4,6,8]})
df2
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

8.3 Reshaping and Pivoting

8.3.1 Reshaping with Hierarchical Indexing

data = pd.DataFrame(np.arange(6).reshape(2,3),
                    index = pd.Index(['ohio','colorado'],name='state'),
                    columns = pd.Index(['one','two','three'],name='number'))
data
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result = data.stack()
result
state     number
ohio      one       0
          two       1
          three     2
colorado  one       3
          two       4
          three     5
dtype: int32
result.unstack()
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result.unstack(0)
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
result.unstack('state')
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
#Unstacking might introduce missing data if all of the values in the level aren't found in each subgroup
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s1
a    0
b    1
c    2
d    3
dtype: int64
s2 = pd.Series([4,5,6],index=['c','d','e'])
s2
c    4
d    5
e    6
dtype: int64
data = pd.concat([s1,s2],keys=['one','two'])
data
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
#Stacking filters out missing data by default, so the stack/unstack operations are invertible
data.unstack().stack()
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
data.unstack().stack(dropna = False)
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
#When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result
df = pd.DataFrame({'left':result,'right':result+5},
                 columns=pd.Index(['left','right'],name='side'))
df
side             left  right
state    number
ohio     one        0      5
         two        1      6
         three      2      7
colorado one        3      8
         two        4      9
         three      5     10
df.unstack('state')
side    left          right
state   ohio colorado  ohio colorado
number
one        0        3     5        8
two        1        4     6        9
three      2        5     7       10
#When calling stack, we can indicate the name of the axis to stack
df.unstack('state').stack('side')
state         colorado  ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
df.unstack('state').stack()
side             left  right
number state
one    ohio         0      5
       colorado     3      8
two    ohio         1      6
       colorado     4      9
three  ohio         2      7
       colorado     5     10

8.3.2 Pivoting "Long" to "Wide" Format

data = pd.read_csv('examples/macrodata.csv')
data.head()
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
item   realgdp  infl  unemp
0     2710.349   NaN    5.8
1     2778.801   NaN    5.1
2     2775.488   NaN    5.3
3     2785.204   NaN    5.6
4     2847.699   NaN    5.2
data.index = periods.to_timestamp('D','end')
data.head()
item                            realgdp  infl  unemp
date
1959-03-31 23:59:59.999999999  2710.349   NaN    5.8
1959-06-30 23:59:59.999999999  2778.801   NaN    5.1
1959-09-30 23:59:59.999999999  2775.488   NaN    5.3
1959-12-31 23:59:59.999999999  2785.204   NaN    5.6
1960-03-31 23:59:59.999999999  2847.699   NaN    5.2
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
                           date     item     value
0 1959-03-31 23:59:59.999999999  realgdp  2710.349
1 1959-03-31 23:59:59.999999999    unemp     5.800
2 1959-06-30 23:59:59.999999999  realgdp  2778.801
3 1959-06-30 23:59:59.999999999    unemp     5.100
4 1959-09-30 23:59:59.999999999  realgdp  2775.488
5 1959-09-30 23:59:59.999999999    unemp     5.300
6 1959-12-31 23:59:59.999999999  realgdp  2785.204
7 1959-12-31 23:59:59.999999999    unemp     5.600
8 1960-03-31 23:59:59.999999999  realgdp  2847.699
9 1960-03-31 23:59:59.999999999    unemp     5.200
pivoted = ldata.pivot('date','item','value')
pivoted
item                             realgdp  unemp
date
1959-03-31 23:59:59.999999999   2710.349    5.8
1959-06-30 23:59:59.999999999   2778.801    5.1
1959-09-30 23:59:59.999999999   2775.488    5.3
1959-12-31 23:59:59.999999999   2785.204    5.6
1960-03-31 23:59:59.999999999   2847.699    5.2
...                                  ...    ...
2008-09-30 23:59:59.999999999  13324.600    6.0
2008-12-31 23:59:59.999999999  13141.920    6.9
2009-03-31 23:59:59.999999999  12925.410    8.1
2009-06-30 23:59:59.999999999  12901.504    9.2
2009-09-30 23:59:59.999999999  12990.341    9.6

203 rows × 2 columns

8.3.3 Pivoting "Wide" to "Long" Format

df = pd.DataFrame({'key':['foo','bar','baz'],
                  'A':[1,2,3],
                  'B':[4,5,6],
                  'C':[7,8,9]})
df
   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9
#When using pandas.melt, we must indicate which columns (if any) are group indicators
melted = pd.melt(df)
melted
   variable value
0       key   foo
1       key   bar
2       key   baz
3         A     1
4         A     2
5         A     3
6         B     4
7         B     5
8         B     6
9         C     7
10        C     8
11        C     9
melted = pd.melt(df,['key'])
melted
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
reshaped = melted.pivot('key','variable','value')
reshaped
variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7
#Use reset_index to move the data back into a column
reshaped.reset_index()
variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
pd.melt(df,id_vars=['key'],value_vars=['A','B'])
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
pd.melt(df,value_vars=['A','B','C'])
  variable value
0        A     1
1        A     2
2        A     3
3        B     4
4        B     5
5        B     6
6        C     7
7        C     8
8        C     9
pd.melt(df,value_vars=['A','B','key'])
  variable value
0        A     1
1        A     2
2        A     3
3        B     4
4        B     5
5        B     6
6      key   foo
7      key   bar
8      key   baz

Chapter 9: Plotting and Visualization

%matplotlib notebook

9.1 A Brief matplotlib API Primer

import matplotlib.pyplot as plt
import numpy as np
data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)

9.1.1 Figures and Subplots

Argument    Description
nrows       Number of rows of subplots
ncols       Number of columns of subplots
sharex      All subplots should use the same x-axis ticks (adjusting xlim will affect all subplots)
sharey      All subplots should use the same y-axis ticks (adjusting ylim will affect all subplots)
subplot_kw  Dict of keyword arguments passed to add_subplot used to create each subplot
**fig_kw    Additional keyword arguments used when creating the figure, such as plt.subplots(2,2,figsize=(8,6))
#You can create a new figure with plt.figure
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
<IPython.core.display.Javascript object>
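plt.subplots from the table above is the usual shortcut that creates a figure and a grid of subplots in one call; a minimal sketch (my addition, not in the original notes):

fig,axes = plt.subplots(2,2,sharex=True,sharey=True)
axes[0,1].hist(np.random.randn(100),bins=20)   #draw into the top-right subplot
plt.subplots_adjust(wspace=0,hspace=0)         #remove the gaps between subplots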

Source: https://blog.csdn.net/weixin_43816759/article/details/122214004