Common EDA Operations 1

2020-12-31 04:01:17 Author: Internet

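A record of commonly used commands:

These may involve pandas, numpy, matplotlib, seaborn, and scipy.
The data is assumed to be loaded as df = pd.read_csv() (with import pandas as pd).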
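To make the snippets below easy to try, here is a minimal sketch of loading a small DataFrame. The CSV content and the column names (city, temp, date) are made up for illustration; a real session would pass a file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """city,temp,date
Berlin,3.5,2020-12-01
Paris,,2020-12-02
Berlin,3.5,2020-12-01
Rome,12.0,
"""

# read_csv accepts a path or any file-like object; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```

This toy frame deliberately contains a duplicated row and a few missing values, so most of the commands below have something to act on.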

1. Check the data type of each column

df.info()

2. View summary statistics of the data: count, mean, std, min, quartiles (25%, 50%, 75%), max

df.describe()

3. List the columns

df.columns

4. View the unique values of a column

df["xxx"].unique()
df[["xxx"]].drop_duplicates()
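Besides listing the distinct values, it is often useful to know how many there are and how often each occurs. The sketch below uses a made-up column named xxx; nunique and value_counts are standard pandas Series methods.

```python
import pandas as pd

# Toy column with repeats and one missing value (illustrative data only).
df = pd.DataFrame({"xxx": ["a", "b", "a", None, "b", "a"]})

# Distinct values; the missing value is included in the result.
unique_values = df["xxx"].unique()

# Number of distinct values; missing values are excluded by default.
n_unique = df["xxx"].nunique()

# Frequency of each value; dropna=False also counts the missing entries.
counts = df["xxx"].value_counts(dropna=False)

print(unique_values)
print(n_unique)
print(counts)
```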

5. Count the NaN values in a column

df['xxx'].isna().sum()
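The same count can be taken for every column at once. A minimal sketch on made-up data, showing both the absolute number of NaN values and their share per column:

```python
import numpy as np
import pandas as pd

# Illustrative frame; column names are arbitrary.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 1.0],
    "c": [1, 2, 3],
})

# isna() gives a boolean frame; summing counts the True values per column.
nan_counts = df.isna().sum()

# The mean of a boolean column is the fraction of NaN values in it.
nan_share = df.isna().mean()

print(nan_counts)
print(nan_share)
```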

6. Find which columns contain NaN

# Return the names of the columns that contain at least one NaN
def check_nan(df):
  col_with_nan = []
  for col in df.columns:
    if df[col].isna().any():
      col_with_nan.append(col)
  return col_with_nan
check_nan(df)
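The column-by-column check can also be written without an explicit loop. This is an alternative sketch, not a replacement for the function above; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 1.0],
    "c": [1, 2, 3],
})

# isna().any() yields one boolean per column, which is then used
# as a mask on the column index.
cols_with_nan = df.columns[df.isna().any()].tolist()

print(cols_with_nan)
```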

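7. Convert to datetime (the result must be assigned back; the format is inferred, so a custom format string is not required)

# infer_datetime_format is deprecated since pandas 2.0, where format inference is the default
df['xxx'] = pd.to_datetime(df['xxx'])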
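When the format is known, passing it explicitly avoids ambiguous parsing, and errors="coerce" turns unparseable entries into NaT instead of raising. A small sketch on invented values, assuming ISO-style dates:

```python
import pandas as pd

# Illustrative strings; the last one is intentionally not a date.
df = pd.DataFrame({"xxx": ["2020-12-31", "2021-01-15", "not a date"]})

# Explicit format plus coercion: invalid entries become NaT.
df["xxx"] = pd.to_datetime(df["xxx"], format="%Y-%m-%d", errors="coerce")

print(df["xxx"])
print(df["xxx"].dt.year)
```

After conversion the .dt accessor exposes components such as year, month, and weekday.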

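8. Common matplotlib setup at the top of a notebook

# Jupyter/IPython magic; not valid in a plain .py script
%matplotlib inline
import warnings
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Silence all warnings (this also hides deprecation notices)
warnings.filterwarnings('ignore')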

9. Remove duplicate rows

def drop_duplicates(df):
  len1 = len(df)
  df = df.drop_duplicates()
  len2 = len(df)
  # len1 - len2 is the number of rows removed
  print(f'there are {len1-len2} duplicated rows')
  return df
df = drop_duplicates(df)
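Duplicates can also be inspected before they are removed, or defined on a subset of columns. The sketch below uses made-up id and value columns; subset and keep are standard parameters of drop_duplicates.

```python
import pandas as pd

# Illustrative data: one exact duplicate row, plus two rows sharing id 3.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "value": [10, 10, 20, 30, 31],
})

# Boolean mask marking rows that repeat an earlier row in full.
dup_mask = df.duplicated()
n_full_dups = int(dup_mask.sum())

# Treat rows as duplicates when only "id" matches, keeping the last one.
deduped_by_id = df.drop_duplicates(subset=["id"], keep="last")

print(n_full_dups)
print(deduped_by_id)
```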

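10. Drop NaN values

# Drop rows, how='all': a row is dropped only if every value in it is NaN
df.dropna(axis=0,how='all')
# Drop columns, how='any': a column is dropped if it contains any NaN
df.dropna(axis=1,how='any')
# thresh: keep only columns with at least thresh non-NaN values (cannot be combined with how in recent pandas)
df.dropna(axis=1,thresh=2)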
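Since thresh counts non-NaN values rather than NaN values, a small worked example may help. The frame below is invented so that each column has a different number of valid entries.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],        # 2 valid values
    "b": [np.nan, np.nan, np.nan, np.nan],  # 0 valid values
    "c": [1.0, np.nan, np.nan, np.nan],     # 1 valid value
    "d": [1.0, np.nan, 3.0, 4.0],           # 3 valid values
})

# Only the second row is entirely NaN, so only that row is dropped.
rows_all = df.dropna(axis=0, how="all")

# Columns need at least 2 non-NaN values to survive: "a" and "d" remain.
cols_thresh = df.dropna(axis=1, thresh=2)

print(rows_all)
print(cols_thresh)
```

Note that dropna returns a new DataFrame; assign the result if the change should persist.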

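11. Fill NaN values

# Fill with a constant
df.fillna(0)
# Fill with a different constant per column
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)
# Propagate the previous valid value forward (fillna(method='ffill') is deprecated since pandas 2.1)
df.ffill()
# Propagate the next valid value backward
df.bfill()
# Fill with the mean (numeric columns only)
df.fillna(df.mean(numeric_only=True))
# Fill with the median (numeric columns only)
df.fillna(df.median(numeric_only=True))
# Fill with the maximum / minimum
df.fillna(df.max())
df.fillna(df.min())
# Fill with the most frequent value (mode() returns a DataFrame of modes; its first row holds the most frequent value of each column)
df.fillna(df.mode().iloc[0])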
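In mixed-type data, one common pattern is to fill numeric columns with the median and non-numeric columns with the mode. This is only one possible strategy, sketched on made-up age and city columns.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["Rome", None, "Rome", "Paris"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

filled = df.copy()
# Numeric columns: fill with the per-column median.
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
# Other columns: fill with the first mode of each column.
filled[cat_cols] = filled[cat_cols].fillna(filled[cat_cols].mode().iloc[0])

print(filled)
```

Working on a copy keeps the original frame available for comparing the effect of different fill strategies.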

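12. One-hot encoding

# Add sparse=True if needed
pd.get_dummies(df['xxx'])
# Encode a column and append the result to the original DataFrame
def one_hot_encoding(col,df):
  encoding = pd.get_dummies(df[col],sparse=True)
  # drop returns a new DataFrame, so assign it back
  df = df.drop(col,axis=1)
  # axis=1 appends the dummy columns side by side
  df = pd.concat([df,encoding],axis=1)
  return df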
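pd.get_dummies can also take the whole DataFrame together with a columns argument, which drops the original column and appends the encoded ones in a single call. A sketch with invented color and size columns; dtype=int is chosen here only to get 0/1 output regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": [1, 2, 3],
})

# Encode only "color"; new columns are named <prefix>_<value>.
encoded = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)

print(encoded)
```

The prefix helps avoid name clashes when several columns are encoded and share category labels.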

13. Frequency encoding
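No code accompanies this heading. One common way to do frequency encoding, shown here only as a sketch on a made-up column, is to replace each category with its relative frequency in the data. The new column name xxx_freq is hypothetical.

```python
import pandas as pd

# Illustrative categorical column.
df = pd.DataFrame({"xxx": ["a", "b", "a", "c", "a", "b"]})

# Relative frequency of each category (the values sum to 1).
freq = df["xxx"].value_counts(normalize=True)

# Map every row's category to its frequency.
df["xxx_freq"] = df["xxx"].map(freq)

print(df)
```

Using value_counts() without normalize=True would encode raw counts instead. Categories that share the same frequency become indistinguishable after this encoding.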

Source: https://www.cnblogs.com/niemand-01/p/14214352.html