Learn to Solve Real-World Data Science Tasks with pandas in 0.5 Hours

Author: Internet

Github source code & data: https://github.com/KeithGalli/Pandas-Data-Science-Tasks

1. Merge 12 months of sales records into a single file

import pandas as pd
import os

# Read a single file
df = pd.read_csv('./Sales_Data/Sales_September_2019.csv')
print(df.head())

# Read all files (CSV only, so stray files in the folder are skipped)
files = [file for file in os.listdir('./Sales_Data') if file.endswith('.csv')]
# Read each file, then concatenate once; the rows are stacked vertically
frames = [pd.read_csv(os.path.join('./Sales_Data', file)) for file in files]
all_months_data = pd.concat(frames)
# print(all_months_data.head())
all_months_data.to_csv('all_data.csv', index=False)

# Read the merged file
all_data = pd.read_csv('all_data.csv')
print(all_data.head())
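The merge step can also be tried without the real data set. The sketch below is only an illustration: the helper name merge_csv_files and the two tiny monthly files are made up, and the files are written to a temporary folder so nothing on disk is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_files(folder):
    """Concatenate every .csv file in `folder` into one DataFrame (hypothetical helper)."""
    paths = sorted(Path(folder).glob("*.csv"))  # sorted for a stable row order
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)


# Build two tiny made-up monthly files in a temporary folder to try the helper.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Order ID": [1, 2], "Product": ["A", "B"]}).to_csv(
        Path(tmp) / "Sales_January_2019.csv", index=False)
    pd.DataFrame({"Order ID": [3], "Product": ["C"]}).to_csv(
        Path(tmp) / "Sales_February_2019.csv", index=False)
    merged = merge_csv_files(tmp)

print(merged.shape)
```

Collecting the frames in a list and calling pd.concat once avoids re-copying the growing DataFrame on every loop iteration.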

 

2. Data cleaning

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
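To see what the how and subset parameters in this signature do, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the sales data).

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is entirely NaN, row 2 is partly NaN.
toy = pd.DataFrame({
    "Order ID": ["1", np.nan, "3"],
    "Product": ["Cable", np.nan, np.nan],
})

drop_any = toy.dropna(how="any")               # drops rows with at least one NaN
drop_all = toy.dropna(how="all")               # drops only rows where every value is NaN
drop_subset = toy.dropna(subset=["Order ID"])  # only looks at the listed columns

print(len(drop_any), len(drop_all), len(drop_subset))
```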

 

import pandas as pd

all_data = pd.read_csv('all_data.csv')

# Rows that contain NaN (kept only for inspection)
nan_df = all_data[all_data.isna().any(axis=1)]
# print(nan_df.head())
# Drop rows in which every value is NaN
all_data = all_data.dropna(how='all')
# print(all_data.head())

# Remove the repeated header rows that were merged in with the data
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or'].copy()
# print(all_data.head())

# Convert string columns to numeric types
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

all_data.to_csv('all_data_clean.csv', index=False)
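The three cleaning steps can be checked on a small made-up frame that has the same two problems as the merged file: an all-NaN row and a header row repeated in the middle of the data. The sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Order ID": ["1", np.nan, "Order ID", "2"],
    "Quantity Ordered": ["2", np.nan, "Quantity Ordered", "1"],
    "Price Each": ["11.95", np.nan, "Price Each", "600"],
    "Order Date": ["04/19/19 08:46", np.nan, "Order Date", "04/07/19 22:30"],
})

clean = raw.dropna(how="all")                              # 1. drop the empty row
clean = clean[clean["Order Date"] != "Order Date"].copy()  # 2. drop the repeated header
for col in ["Quantity Ordered", "Price Each"]:             # 3. strings -> numbers
    clean[col] = pd.to_numeric(clean[col])

print(clean.dtypes)
```

The header row has to go before the conversion, because pd.to_numeric raises an error on text such as "Price Each" by default.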

3. Add new columns

import pandas as pd


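def get_city(address):
    # The city is the second comma-separated part; strip the space after the comma
    return address.split(',')[1].strip()


def get_state(address):
    # The third part starts with a space, so the state is at index 1
    return address.split(',')[2].split(' ')[1]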


all_data = pd.read_csv('all_data_clean.csv')

# Add a month column
all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')

# Add a sales column
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
# print(all_data.head())

# Add a city column
all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
# print(all_data.head())

# Add time columns
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
# print(all_data.head())

all_data.to_csv('all_data_new.csv', index=False)
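The same columns can be derived with vectorized string and datetime accessors instead of apply. This is only a sketch: the two sample rows are made up, and both the "street, city, state zip" address layout and the "%m/%d/%y %H:%M" date format are assumptions about the data.

```python
import pandas as pd

# Hypothetical sample rows; layout and date format are assumed.
sample = pd.DataFrame({
    "Purchase Address": ["917 1st St, Dallas, TX 75001",
                         "682 Chestnut St, Boston, MA 02215"],
    "Order Date": ["04/19/19 08:46", "04/07/19 22:30"],
})

# Split once on commas: column 1 holds the city, column 2 holds "state zip".
parts = sample["Purchase Address"].str.split(",", expand=True)
city = parts[1].str.strip()
state = parts[2].str.split().str[0]  # whitespace split, first token is the state
sample["City"] = city + " (" + state + ")"

# Parse the date once, then read month and hour from the datetime values.
when = pd.to_datetime(sample["Order Date"], format="%m/%d/%y %H:%M")
sample["Month"] = when.dt.month
sample["Hour"] = when.dt.hour

print(sample[["City", "Month", "Hour"]])
```

Keeping the state next to the city name matters because different states can have cities with the same name.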

 

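Question 1: Which month had the best sales? How much was earned?

Question 2: Which city had the highest sales?

Question 3: What time is best for running advertisements?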

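groupby essentials:

  1. The grouping key can be an array, list, dict, Series, or function. Arrays, lists, and Series must match the length of the axis being grouped; dicts and functions are applied to the axis labels.
  2. By default axis=0 groups rows; axis=1 groups columns (the axis argument is deprecated in recent pandas versions).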
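The different kinds of grouping key listed above can be compared on a toy frame. This is a minimal sketch with made-up numbers; every variant is built so that it produces the same two groups.

```python
import pandas as pd

sales = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Sales": [10.0, 20.0, 5.0, 7.0],
})

# Column name
by_column = sales.groupby("Month")["Sales"].sum()
# Series with the same length as the data
by_series = sales["Sales"].groupby(sales["Month"]).sum()
# Plain list with the same length as the data
by_list = sales["Sales"].groupby([1, 1, 2, 2]).sum()
# Function: called on each index label (0, 1, 2, 3 here)
by_function = sales["Sales"].groupby(lambda idx: idx // 2).sum()
# Dict: maps each index label to its group
by_dict = sales["Sales"].groupby({0: "Q1", 1: "Q1", 2: "Q2", 3: "Q2"}).sum()

print(by_column)
```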

groupby('Month').sum(): groups the rows by month, uses the month as the index, and sums the other columns within each month
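A small made-up example shows the shape of the result. Passing numeric_only=True is an addition here, used so that the text column is left out of the sum.

```python
import pandas as pd

orders = pd.DataFrame({
    "Month": [12, 1, 12, 1],
    "Product": ["A", "B", "C", "D"],
    "Quantity Ordered": [1, 2, 3, 4],
    "Sales": [100.0, 20.0, 300.0, 40.0],
})

# Month becomes the (sorted) index; each numeric column is summed per month.
monthly = orders.groupby("Month").sum(numeric_only=True)
print(monthly)
```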

 

 

import pandas as pd
import matplotlib.pyplot as plt

all_data = pd.read_csv('all_data_new.csv')

# Total sales per month (select the column first so text columns are not summed)
result_month = all_data.groupby('Month')['Sales'].sum()
# print(result_month)

# Total sales per city
result_city = all_data.groupby('City')['Sales'].sum()

# Number of order rows per hour
orders_per_hour = all_data.groupby('Hour')['Order ID'].count()

plt.figure()

# Plot sales by month
plt.subplot(221)
months = result_month.index
plt.bar(months, result_month)
plt.xticks(months)
plt.ylabel('Sales in USD($)')
plt.xlabel('Month number')

# Plot sales by city
plt.subplot(222)
cities = result_city.index
plt.bar(cities, result_city)
plt.xticks(cities, rotation='vertical', size=8)
plt.ylabel('Sales in USD($)')
plt.xlabel('City name')

# Plot orders by hour: when is the best time to run advertisements?
plt.subplot(223)
hours = orders_per_hour.index
plt.plot(hours, orders_per_hour)
plt.xticks(hours)
plt.xlabel('Hour')
plt.ylabel('Number of orders')
plt.grid()  # show grid lines

plt.show()
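Questions 1 to 3 can also be answered as numbers rather than read off the charts. A minimal sketch on made-up data, assuming the same Month, City, Hour, and Sales columns as above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Month": [1, 1, 12, 12],
    "City": ["Boston (MA)", "Dallas (TX)", "Boston (MA)", "Dallas (TX)"],
    "Hour": [9, 19, 19, 19],
    "Sales": [10.0, 20.0, 30.0, 45.0],
})

monthly_sales = toy.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()      # index label of the largest total
best_month_total = monthly_sales.max()   # the total itself

best_city = toy.groupby("City")["Sales"].sum().idxmax()
busiest_hour = toy.groupby("Hour").size().idxmax()  # size() counts rows per group

print(best_month, best_month_total, best_city, busiest_hour)
```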

Question 4: Which products are most often bought together?

Analysis: products bought together share the same Order ID.

 

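duplicated marks whether the values in a Series, or the rows in a DataFrame, are duplicates: True for a duplicate, False otherwise

pandas.DataFrame.duplicated(self, subset=None, keep='first')

pandas.Series.duplicated(self, keep='first')

The parameters are as follows:

  1. subset: column label or list of labels to compare; all columns are used by default (DataFrame only)
  2. keep: 'first' marks every occurrence except the first as a duplicate, 'last' marks every occurrence except the last, and False marks all occurrences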
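The effect of keep is easiest to see on a short made-up Series. keep=False is the variant needed for this question, since every row of a multi-item order has to be kept.

```python
import pandas as pd

order_ids = pd.Series(["A", "B", "A", "C", "A"])

first = order_ids.duplicated(keep="first").tolist()  # first "A" is not flagged
last = order_ids.duplicated(keep="last").tolist()    # last "A" is not flagged
every = order_ids.duplicated(keep=False).tolist()    # all "A" rows are flagged

print(first)
print(last)
print(every)
```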

 

import pandas as pd
from itertools import combinations
from collections import Counter

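all_data = pd.read_csv('all_data_clean.csv')
# Keep only rows whose Order ID appears more than once; copy so a new column can be added safely
df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()
# print(df.head(20))
# Join the products of each order into one comma-separated string
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df = df[['Order ID', 'Grouped']].drop_duplicates()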
# print(df.head())

count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))

# print(count)
# print(count.most_common(10))

for key, value in count.most_common(10):
    print(key, value)
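One thing to watch in the pair counting: combinations keeps the order in which products appear, so ('phone', 'cable') and ('cable', 'phone') are counted as different pairs. A possible variation, sketched here with made-up orders, sorts each order first so that both spellings land on the same key.

```python
from collections import Counter
from itertools import combinations

# Made-up orders; the product names are placeholders.
orders = [
    ["phone", "cable"],
    ["cable", "phone"],
    ["phone", "cable", "headphones"],
]

pair_counts = Counter()
for products in orders:
    # Sorting (and de-duplicating) makes the pair key independent of row order.
    unique_sorted = sorted(set(products))
    pair_counts.update(combinations(unique_sorted, 2))

for pair, n in pair_counts.most_common(3):
    print(pair, n)
```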

Question 5: Which product sold the most? Why?

 

import pandas as pd
import matplotlib.pyplot as plt

all_data = pd.read_csv('all_data_new.csv')

product_group = all_data.groupby('Product')
# Select the column before aggregating so text columns are not summed
quantity_ordered = product_group['Quantity Ordered'].sum()
products = [product for product, df in product_group]
plt.bar(products, quantity_ordered)
plt.xticks(products, rotation='vertical', size=8)
plt.ylabel('Quantity ordered')
plt.xlabel('Product')
# plt.show()

# Overlay the average price on the same chart
prices = product_group['Price Each'].mean()

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()  # second y-axis that shares ax1's x-axis
ax1.bar(products, quantity_ordered, color='g')
ax2.plot(products, prices, 'b-')

ax1.set_xlabel('Product name')
ax1.set_ylabel('Quantity ordered', color='g')
ax2.set_ylabel('Price($)', color='b')
ax1.tick_params(axis='x', labelrotation=90, labelsize=8)
plt.show()
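The chart puts quantity and price side by side to suggest an answer to the "why". One rough way to put a number on that relationship is the correlation between the two per-product series. The sketch below uses made-up products and prices, so its result says nothing about the real data.

```python
import pandas as pd

# Made-up rows; only the column names match the sales data.
toy = pd.DataFrame({
    "Product": ["Batteries", "Batteries", "Laptop", "Monitor"],
    "Quantity Ordered": [10, 12, 1, 2],
    "Price Each": [3.0, 3.0, 1000.0, 400.0],
})

# One row per product: total quantity and average price.
summary = toy.groupby("Product").agg(
    quantity=("Quantity Ordered", "sum"),
    price=("Price Each", "mean"),
)

# Pearson correlation; a negative value means cheaper products sold in larger quantities.
correlation = summary["quantity"].corr(summary["price"])
print(summary)
print(correlation)
```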

 

Source: https://blog.csdn.net/SimbaG/article/details/117451709