
pandas

Author: 互联网 (Internet)

\(matplotlib\) is all figures, so I did not bother keeping those notes in .md format.

pandas

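The \(pandas\) library provides two important data types: \(Series\) and \(DataFrame\). The former is one-dimensional; the latter is two-dimensional (a table of rows and columns).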

The \(Series\) data type

An index plus one-dimensional data

Creation

When no index is specified, the index values start from 0

>>> import pandas as pd
>>> a=[1,2,3]
>>> m=pd.Series(a)
>>> m
0    1
1    2
2    3
dtype: int64
#index on the left, data on the right

Specifying the index

>>> a=["Google","Runoob","Wiki"]
>>> m=pd.Series(a,index=[1,2,3])
>>> m
1    Google
2    Runoob
3      Wiki
dtype: object

Looking up a value by its index label works much like a \(map\) in C++

>>> m[1]
'Google'
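
The map analogy holds for lookups by label, but two differences are worth seeing in code. The sketch below is an illustration only, using the same \(Series\) as above; the fallback string "n/a" is an arbitrary choice.

```python
import pandas as pd

# Integer labels 1, 2, 3 instead of the default 0, 1, 2.
m = pd.Series(["Google", "Runoob", "Wiki"], index=[1, 2, 3])

by_label = m.loc[1]      # explicit label lookup
by_position = m.iloc[0]  # explicit positional lookup: the first element
print(by_label, by_position)

# Unlike operator[] on a C++ map, reading a missing label does not insert it;
# it raises KeyError.
try:
    m.loc[0]
except KeyError:
    print("label 0 does not exist")

# Series.get returns a default instead of raising.
fallback = m.get(0, "n/a")
print(fallback)
```

Using \(.loc\) and \(.iloc\) avoids any doubt about whether an integer means a label or a position.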

A \(Series\) can also be created from a dict of key-value pairs; the keys become the index

>>> mp={1:"Google",2:"Runoob",3:"Wiki"}
>>> m=pd.Series(mp)
>>> m
1    Google
2    Runoob
3      Wiki
dtype: object
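
When a dict and an explicit \(index\) are passed together, the index selects and orders the keys. A short sketch under that assumption; the key 4 is deliberately absent from the dict to show what happens.

```python
import pandas as pd

mp = {1: "Google", 2: "Runoob", 3: "Wiki"}

# Only the listed keys are kept, in the listed order.
# Key 4 is not in the dict, so its value is NaN.
m = pd.Series(mp, index=[3, 1, 4])
print(m)
```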

A \(Series\) (and its index) can also be given a name

>>> a=["Google","Runoob","Wiki"]
>>> m=pd.Series(a,index=[1,2,3],name="misasteria")
>>> m.index.name="me"
>>> m
me
1    Google
2    Runoob
3      Wiki
Name: misasteria, dtype: object
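
One place where the names matter: when a \(Series\) is turned into a \(DataFrame\), the \(Series\) name becomes the column label and the index name is carried over. A minimal sketch reusing the names from the example above:

```python
import pandas as pd

m = pd.Series(["Google", "Runoob", "Wiki"], index=[1, 2, 3], name="misasteria")
m.index.name = "me"

# to_frame() builds a one-column DataFrame from the Series.
df = m.to_frame()
print(df)
```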

The \(DataFrame\) data type

Creation

\(pandas.DataFrame(data, index, columns, dtype, copy)\)

Creating from a list

>>> data=[["Google",10],["Runoob",12],["Wiki",13]]
>>> df=pd.DataFrame(data,index=[1,2,3],columns=["site","age"])
>>> df
     site  age
1  Google   10
2  Runoob   12
3    Wiki   13

Creating from a \(numpy.ndarray\)

>>> import numpy as np
>>> df=pd.DataFrame(np.arange(10).reshape(2,5))
>>> df
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

Creating from dicts (here a list of dicts, one dict per row)

>>> data=[{'a':1,'b':2},{'a':5,'b':10,'c':20}]
>>> df=pd.DataFrame(data)
>>> df
   a   b     c
0  1   2   NaN
1  5  10  20.0
#positions with no data become NaN
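
The example above passes one dict per row. A dict can also describe the table column by column, which is a common alternative. The sketch below reuses the site/age values from the list example earlier; treat it as an illustration rather than the only way to build the frame.

```python
import pandas as pd

# One key per column; every list must have the same length.
data = {
    "site": ["Google", "Runoob", "Wiki"],
    "age": [10, 12, 13],
}

# index= labels the rows, exactly as in the list-based example.
df = pd.DataFrame(data, index=[1, 2, 3])
print(df)
```

Here the dict keys become the column names, so \(columns\) does not need to be passed separately.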

Handling CSV files with \(pandas\)

CSV to DataFrame

>>> df=pd.read_csv("D:\\nba.csv")
>>> print(df) #read_csv already returns a DataFrame; a long frame prints truncated
              Name            Team  Number Position   Age Height  Weight            College     Salary
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0
..             ...             ...     ...      ...   ...    ...     ...                ...        ...
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0             Butler  2433333.0
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0                NaN   900000.0
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0                NaN  2900000.0
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0             Kansas   947276.0
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN                NaN        NaN

[458 rows x 9 columns]
#df.to_string() (with the parentheses) returns every row as one string; print(df.to_string) without them only prints the bound method
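
Since nba.csv is not included with these notes, the following sketch reads a tiny CSV from an in-memory string instead. The two rows and the names in them are made up for illustration; only \(read\_csv\) and \(to\_string\) themselves come from the text above.

```python
import io

import pandas as pd

# Hypothetical stand-in for nba.csv: two rows, one empty Number field.
csv_text = "Name,Team,Number\nPlayer A,Team X,7\nPlayer B,Team Y,\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df).__name__)

# to_string() is a method call: it returns a str with every row, never truncated.
full_text = df.to_string()
print(full_text)
```

The empty field is read as NaN, which is also why the Number column ends up as a float column.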

DataFrame to CSV

>>> df=pd.DataFrame(np.arange(10).reshape(2,5))
>>> df.to_csv("D:\\site.csv")
#the file is created automatically if it does not exist
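
One detail worth knowing when writing CSV: by default the row index is written as an extra, unnamed first column. The sketch below shows the difference with \(index=False\); it returns the CSV text instead of writing to D:\\site.csv so that it runs anywhere.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5))

# With no path argument, to_csv returns the CSV text as a str.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty field for the index
print(without_index.splitlines()[0])  # header holds only the column labels
```

Dropping the index is usually what you want when the index is just the default 0, 1, 2, ... and carries no information.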

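Data processing

\(DataFrame.head(n)\) returns the first n rows; n defaults to 5
\(DataFrame.tail(n)\) returns the last n rows; n defaults to 5
\(DataFrame.info()\) prints a summary: index range, columns, non-null counts and dtypes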

>>> df=pd.read_csv("D:\\nba.csv")
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457 #number of rows
Data columns (total 9 columns): #number of columns
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Name      457 non-null    object
 1   Team      457 non-null    object
 2   Number    457 non-null    float64
 3   Position  457 non-null    object
 4   Age       457 non-null    float64
 5   Height    457 non-null    object
 6   Weight    457 non-null    float64
 7   College   373 non-null    object
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
#non-null: the number of entries in the column that are not missing
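
A small sketch of \(head\), \(tail\) and the non-null counts on a frame that can be built in place. The 8-row frame and its column names are made up for illustration; \(DataFrame.count()\) is used here because it returns the same per-column non-null counts that \(info()\) prints.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column y has two missing values.
df = pd.DataFrame({
    "x": np.arange(8),
    "y": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0],
})

first = df.head()       # n omitted, so the first 5 rows
last_two = df.tail(2)   # the last 2 rows
non_null = df.count()   # non-null entries per column

print(first)
print(last_two)
print(non_null)
```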

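Data operations

Arithmetic operations

The operands are aligned on their row and column labels automatically; any position missing from either operand becomes NaN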

>>> a=pd.DataFrame(np.arange(12).reshape(3,4))
>>> a
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
>>> b=pd.DataFrame(np.arange(20).reshape(4,5))
>>> b
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
>>> a + b
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
>>> a * b
      0     1      2      3   4
0   0.0   1.0    4.0    9.0 NaN
1  20.0  30.0   42.0   56.0 NaN
2  80.0  99.0  120.0  143.0 NaN
3   NaN   NaN    NaN    NaN NaN
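
With the default integer labels it is easy to mistake this for position-wise arithmetic. The sketch below uses string labels (chosen for illustration) to make it visible that values are matched by label, not by position.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Labels are matched first: b pairs with b, c with c.
# a and d exist on one side only, so they become NaN.
s = x + y
print(s)
```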

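\(fill\_value\) supplies a substitute for a value that is missing from one of the operands; it is filled in before the operation is applied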

>>> b.add(a,fill_value=100)
       0      1      2      3      4
0    0.0    2.0    4.0    6.0  104.0
1    9.0   11.0   13.0   15.0  109.0
2   18.0   20.0   22.0   24.0  114.0
3  115.0  116.0  117.0  118.0  119.0
>>> a.add(b,fill_value=100)
       0      1      2      3      4
0    0.0    2.0    4.0    6.0  104.0
1    9.0   11.0   13.0   15.0  109.0
2   18.0   20.0   22.0   24.0  114.0
3  115.0  116.0  117.0  118.0  119.0
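
A caveat the example above does not show: \(fill\_value\) only helps where exactly one operand is missing a value. Where both are missing, the result stays NaN. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan]})
b = pd.DataFrame({"x": [10.0, np.nan], "y": [20.0, 30.0]})

total = a.add(b, fill_value=0)
print(total)

# x, row 0: both present                -> 1 + 10
# x, row 1: missing in both             -> stays NaN
# y: column absent from a, filled by 0  -> 20 and 30
```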

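Two DataFrames are aligned by label, not broadcast by shape; broadcasting only happens between operands of different dimensions (for example a DataFrame and a scalar)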

>>> b=pd.DataFrame(np.arange(3))
>>> a+b
    0   1   2   3
0   0 NaN NaN NaN
1   5 NaN NaN NaN
2  10 NaN NaN NaN
>>> b-10
    0
0 -10
1  -9
2  -8
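
To contrast the two behaviours side by side, the sketch below repeats the frames from the example and then drops to plain NumPy arrays. That is one possible way to get shape-based broadcasting, at the cost of losing the labels.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(3))

# DataFrame + DataFrame: aligned by label, so only column 0 has a partner.
aligned = a + b

# DataFrame - scalar: the scalar is broadcast to every element.
shifted = b - 10

# ndarray + ndarray: shapes (3, 4) and (3, 1) broadcast, labels play no role.
by_shape = a.to_numpy() + b.to_numpy()

print(aligned)
print(by_shape)
```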

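By default a one-dimensional \(Series\) is matched against the column labels (axis 1) and applied to every row; passing \(axis=0\) matches it against the row index instead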

>>> b=pd.DataFrame(np.arange(20).reshape(4,5))
>>> c=pd.Series(np.arange(4))
>>> b
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
>>> c
0    0
1    1
2    2
3    3
>>> b.sub(c)
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN
>>> b.sub(c,axis=0)
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16
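
A typical use of \(axis=0\) is subtracting a per-row statistic. The sketch below centers each row on its mean; the row and column labels are made up, and they are deliberately different from each other so that the default axis visibly fails to match anything.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r0", "r1"],
    columns=["a", "b", "c"],
)

row_means = df.mean(axis=1)  # Series indexed by the row labels r0, r1

# Default axis: the Series labels are compared with the column labels.
# Nothing matches, so every entry is NaN.
wrong = df - row_means

# axis=0: the Series labels are compared with the row index.
centered = df.sub(row_means, axis=0)

print(wrong)
print(centered)
```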

Source: https://www.cnblogs.com/misasteria/p/16596439.html