Learn to solve real-world data science tasks with pandas in half an hour
Author: Internet
Github source code & data: https://github.com/KeithGalli/Pandas-Data-Science-Tasks
1. Merge 12 months of sales records into a single file
import pandas as pd
import os

# Read a single file
df = pd.read_csv('./Sales_Data/Sales_September_2019.csv')
print(df.head())

# Read all files
files = [file for file in os.listdir('./Sales_Data')]
all_months_data = pd.DataFrame()
for file in files:
    # print(file)
    df = pd.read_csv('./Sales_Data/' + file)
    all_months_data = pd.concat([all_months_data, df])  # stack them vertically

# print(all_months_data.head())
all_months_data.to_csv('all_data.csv', index=False)

# Read the merged file back
all_data = pd.read_csv('all_data.csv')
print(all_data.head())
2. Data cleaning
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- Purpose: drop rows or columns that contain missing values
- axis: the dimension to drop along; axis=0 drops rows (index), axis=1 drops columns; defaults to 0
- how: 'all' drops a row/column only if every element in it is missing (NaN); 'any' drops it as soon as any element is missing
- thresh: keep only rows/columns that have at least thresh non-NaN values; the rest are dropped
- subset: only look for missing values within the given subset of labels; rows/columns whose missing values fall outside the subset are kept (axis decides whether rows or columns are dropped)
- inplace: whether the filtered data is returned as a new copy or the original data is modified in place
import pandas as pd

all_data = pd.read_csv('all_data.csv')

# Inspect the rows that contain NaN
nan_df = all_data[all_data.isna().any(axis=1)]
# print(nan_df.head())

# Drop rows that are entirely NaN
all_data = all_data.dropna(how='all')
# print(all_data.head())

# Drop the repeated header rows that were concatenated into the data
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']
# print(all_data.head())

# Convert string columns to numeric types
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

all_data.to_csv('all_data_clean.csv', index=False)
3. Add new columns to the data
import pandas as pd

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]  # the state part starts with a space, so take index 1

all_data = pd.read_csv('all_data_clean.csv')

# Add a month column
all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')

# Add a sales column
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
# print(all_data.head())

# Add a city column
all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
# print(all_data.head())

# Add time columns
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
# print(all_data.head())

all_data.to_csv('all_data_new.csv', index=False)
Question 1: Which month had the best sales? How much money was made?
Question 2: Which city had the highest sales?
Question 3: What time is best to display advertisements?
Key points about groupby:
- The grouping key can be an array, list, dict, Series, or function, as long as its length matches the axis of the object being grouped
- By default axis=0 groups by rows; axis=1 can be specified to group by columns
groupby('Month').sum(): groups by month, uses the month as the index, and sums the remaining values within each month
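A minimal sketch of this groupby-then-sum pattern, using made-up numbers rather than the real sales data:

```python
import pandas as pd

# Tiny stand-in for the sales data (hypothetical values)
df = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'Sales': [10.0, 20.0, 5.0, 5.0, 15.0],
})

result = df.groupby('Month').sum()  # index becomes Month; Sales is summed per month
print(result)
#        Sales
# Month
# 1       30.0
# 2       25.0

best = result['Sales'].idxmax()  # the month with the highest total sales
print(best)  # -> 1
```

`idxmax()` returns the index label of the maximum, which answers "which month" directly instead of eyeballing the table.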
import pandas as pd
import matplotlib.pyplot as plt

all_data = pd.read_csv('all_data_new.csv')

# Which month had the best sales
result_month = all_data.groupby('Month').sum(numeric_only=True)  # numeric_only avoids errors on string columns in newer pandas
# print(result_month)

# Which city had the highest sales
result_city = all_data.groupby('City').sum(numeric_only=True)

plt.figure()

# Visualize by month
plt.subplot(221)
months = range(1, 13)
plt.bar(months, result_month['Sales'])
plt.xticks(months)
plt.ylabel('Sales in USD($)')
plt.xlabel('Month number')

# Visualize by city
plt.subplot(222)
cities = [city for city, df in all_data.groupby('City')]
plt.bar(cities, result_city['Sales'])
plt.xticks(cities, rotation='vertical', size=8)
plt.ylabel('Sales in USD($)')
plt.xlabel('City name')

# Visualize by hour: when is the best time to advertise
plt.subplot(223)
hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours, all_data.groupby('Hour').size())  # number of orders per hour
plt.xticks(hours)
plt.xlabel('Hour')
plt.ylabel('Number of orders')
plt.grid()  # show grid lines
plt.show()
Question 4: Which products are often bought together?
Analysis: if products were bought together, they share the same Order ID.
The duplicated function marks whether each value in a Series, or each row in a DataFrame, is a duplicate: True if it is, False if not.
pandas.DataFrame.duplicated(self, subset=None, keep='first')
pandas.Series.duplicated(self, keep='first')
Parameters:
- subset: column label or sequence of labels used to identify duplicates; defaults to all columns
- keep='first': all occurrences except the first are marked as duplicates
- keep='last': all occurrences except the last are marked as duplicates
- keep=False: every occurrence of a duplicated value is marked as a duplicate
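The three keep modes are easiest to compare side by side on a small Series (hypothetical order IDs):

```python
import pandas as pd

s = pd.Series([176558, 176558, 176559, 176560, 176560])

print(s.duplicated().tolist())             # keep='first' -> [False, True, False, False, True]
print(s.duplicated(keep='last').tolist())  # -> [True, False, False, True, False]
print(s.duplicated(keep=False).tolist())   # mark every member of a duplicate group
# -> [True, True, False, True, True]
```

The code below uses keep=False precisely because we want to keep all rows of every multi-item order, not filter any of them out.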
import pandas as pd
from itertools import combinations
from collections import Counter

all_data = pd.read_csv('all_data_clean.csv')

# Keep only orders whose Order ID appears more than once
df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()  # .copy() avoids a SettingWithCopyWarning below
# print(df.head(20))

# Join all products of the same order into one comma-separated string
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df = df[['Order ID', 'Grouped']].drop_duplicates()
# print(df.head())

# Count every pair of products that appears in the same order
count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))
# print(count)
# print(count.most_common(10))

for key, value in count.most_common(10):
    print(key, value)
Question 5: Which product sells best? Why?
- plt.subplot(111) is just another way of writing plt.subplot(1, 1, 1): a 1*1 grid, first subplot
- fig, ax = plt.subplots() is equivalent to fig, ax = plt.subplots(1, 1)
- fig, axes = plt.subplots(2, 3) creates a 2*3 grid of axes on the figure in a single call, whereas plt.subplot() can only add them one at a time
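A quick sketch of the subplots grid form (the Agg backend is set here only so the snippet runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; skip this line when running interactively
import matplotlib.pyplot as plt

# One call creates the figure and the whole 2x3 array of axes
fig, axes = plt.subplots(2, 3)
print(axes.shape)  # -> (2, 3)

# Individual axes are addressed by [row, column]
axes[0, 0].plot([1, 2, 3])
axes[1, 2].bar(['a', 'b'], [3, 1])
fig.tight_layout()  # prevent labels of neighbouring axes from overlapping
```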
import pandas as pd
import matplotlib.pyplot as plt

all_data = pd.read_csv('all_data_new.csv')

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum(numeric_only=True)['Quantity Ordered']
products = [product for product, df in product_group]

plt.bar(products, quantity_ordered)
plt.xticks(products, rotation='vertical', size=8)
plt.ylabel('Quantity ordered')
plt.xlabel('Product')
# plt.show()

# Overlay the average price on the same chart
prices = all_data.groupby('Product')['Price Each'].mean()
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # a second y-axis sharing the same x-axis as ax1
ax1.bar(products, quantity_ordered, color='g')
ax2.plot(products, prices, 'b-')
ax1.set_xlabel('Product name')
ax1.set_ylabel('Quantity ordered', color='g')
ax2.set_ylabel('Price($)', color='b')
ax1.set_xticks(range(len(products)))  # fix tick positions before setting labels to avoid a warning
ax1.set_xticklabels(products, rotation='vertical', size=8)
plt.show()
Source: https://blog.csdn.net/SimbaG/article/details/117451709