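Kaggle Competition Practice: Studying the M5 Baseline
Author: Internet
The model used is LightGBM.
Data preparation and training
Loading the calendar.csv dataset.
This dataset describes the dates on which the products are sold, including any events on each date and whether SNAP purchases are allowed in each state.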
- date: The date in a “y-m-d” format.
- wm_yr_wk: The id of the week the date belongs to.
- weekday: The type of the day (Saturday, Sunday, …, Friday).
- wday: The id of the weekday, starting from Saturday.
- month: The month of the date.
- year: The year of the date.
- event_name_1: If the date includes an event, the name of this event.
- event_type_1: If the date includes an event, the type of this event.
- event_name_2: If the date includes a second event, the name of this event.
- event_type_2: If the date includes a second event, the type of this event.
- snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
import pandas as pd

# Correct data types for "calendar.csv"
calendarDTypes = {"event_name_1": "category",
                  "event_name_2": "category",
                  "event_type_1": "category",
                  "event_type_2": "category",
                  "weekday": "category",
                  "wm_yr_wk": "int16",
                  "wday": "int16",
                  "month": "int16",
                  "year": "int16",
                  "snap_CA": "float32",
                  "snap_TX": "float32",
                  "snap_WI": "float32"}

# Read csv file
calendar = pd.read_csv("./calendar.csv", dtype=calendarDTypes)
calendar["date"] = pd.to_datetime(calendar["date"])
calendar.head(10)
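The reason for passing explicit dtypes is memory: the merged table further down has tens of millions of rows, so compact column types matter. The following is a minimal sketch on synthetic data (the 700-row CSV text is made up for illustration and is not the real calendar.csv) comparing the default dtypes with the compact ones.

```python
import io
import pandas as pd

# Synthetic stand-in for a few calendar.csv columns: 100 repeated weeks
weekdays = ["Saturday", "Sunday", "Monday", "Tuesday",
            "Wednesday", "Thursday", "Friday"]
rows = ["weekday,wday,snap_CA"]
for i in range(700):
    rows.append(f"{weekdays[i % 7]},{i % 7 + 1},{i % 2}")
csv_text = "\n".join(rows)

# Let pandas infer the types (strings and 64-bit integers)
default_df = pd.read_csv(io.StringIO(csv_text))

# Ask for compact types, as the baseline does
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"weekday": "category", "wday": "int16", "snap_CA": "float32"},
)

default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print("default:", default_bytes, "bytes; typed:", typed_bytes, "bytes")
```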
# Transform categorical features into integers
for col, colDType in calendarDTypes.items():
    if colDType == "category":
        calendar[col] = calendar[col].cat.codes.astype("int16")
        # Missing values get code -1; shift so that the codes start at 0
        calendar[col] -= calendar[col].min()
calendar.head(10)
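- calendar[col].cat.codes.astype("int16") is plain label encoding: each category is replaced by an integer code. Later we will try switching to one-hot encoding to see whether it helps.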
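To make the difference between the two encodings concrete, here is a small sketch on a toy event column. The event values are placeholders, and using pd.get_dummies with dummy_na=True is one possible way to one-hot encode, not something the baseline itself does.

```python
import pandas as pd

# Toy event column with missing values (most days have no event)
events = pd.Series(
    ["SuperBowl", None, "Easter", None, "SuperBowl"],
    dtype="category",
    name="event_name_1",
)

# Label encoding as in the baseline: missing -> -1, then shift to start at 0
codes = events.cat.codes.astype("int16")
codes -= codes.min()
print(codes.tolist())

# One-hot encoding: one indicator column per category, plus one for "missing"
one_hot = pd.get_dummies(events, prefix="event_name_1", dummy_na=True)
print(one_hot)
```

Label encoding keeps a single column, which tree models such as LightGBM can split on directly. One-hot encoding avoids implying an order between categories but adds one column per category.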
sell_prices.csv
File 2: “sell_prices.csv”
This dataset contains the per-unit selling price of each item, by store and by week.
- store_id: The id of the store where the product is sold.
- item_id: The id of the product.
- wm_yr_wk: The id of the week.
- sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant at weekly basis, they may change through time (both training and test set).
# Correct data types for "sell_prices.csv"
priceDTypes = {"store_id": "category",
               "item_id": "category",
               "wm_yr_wk": "int16",
               "sell_price": "float32"}

# Read csv file
prices = pd.read_csv("./sell_prices.csv", dtype=priceDTypes)
prices.head()
# Transform categorical features into integers
for col, colDType in priceDTypes.items():
    if colDType == "category":
        prices[col] = prices[col].cat.codes.astype("int16")
        prices[col] -= prices[col].min()
prices.head()
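The encoded store_id and item_id are used further down as join keys against the sales table, which is encoded separately. The integer codes agree between the two tables only when both contain the same set of labels. The sketch below uses made-up item labels to show how the codes drift apart when one table is missing a label, and how a shared CategoricalDtype (a suggested safeguard, not part of the baseline) keeps them aligned.

```python
import pandas as pd

# Toy labels; the second table lacks one item
sales_items = pd.Series(["FOODS_1", "FOODS_2", "HOBBIES_1"])
price_items = pd.Series(["FOODS_2", "HOBBIES_1"])

# Independent encoding: the same label can get different codes
sales_codes = sales_items.astype("category").cat.codes
price_codes = price_items.astype("category").cat.codes
print(sales_codes.tolist(), price_codes.tolist())

# Shared categories: every label maps to the same code in both tables
all_labels = sorted(set(sales_items) | set(price_items))
shared = pd.CategoricalDtype(categories=all_labels)
sales_shared = sales_items.astype(shared).cat.codes
price_shared = price_items.astype(shared).cat.codes
print(sales_shared.tolist(), price_shared.tolist())
```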
sales_train_validation.csv
File 3: “sales_train.csv” (loaded here as sales_train_validation.csv, which covers d_1 to d_1913)
Contains the historical daily unit sales data per product and store.
- item_id: The id of the product.
- dept_id: The id of the department the product belongs to.
- cat_id: The id of the category the product belongs to.
- store_id: The id of the store where the product is sold.
- state_id: The State where the store is located.
- d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
firstDay = 250
lastDay = 1913

# Use x sales days (columns) for training
numCols = [f"d_{day}" for day in range(firstDay, lastDay + 1)]

# Define all categorical columns
catCols = ["id", "item_id", "dept_id", "store_id", "cat_id", "state_id"]

# Define the correct data types for "sales_train_validation.csv"
dtype = {numCol: "float32" for numCol in numCols}
dtype.update({catCol: "category" for catCol in catCols if catCol != "id"})
[(k, v) for k, v in dtype.items()][:10]
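As a quick sanity check on the window chosen by firstDay and lastDay, the sketch below counts the selected day columns and converts the two endpoints to calendar dates. It assumes only what the column description states, namely that d_1 is 2011-01-29; the snake_case variable names are local to this example.

```python
from datetime import date, timedelta

first_day = 250
last_day = 1913
d1_date = date(2011, 1, 29)  # date of d_1, per the column description

# Same column selection as in the baseline
num_cols = [f"d_{day}" for day in range(first_day, last_day + 1)]

# d_i lies (i - 1) days after d_1
first_date = d1_date + timedelta(days=first_day - 1)
last_date = d1_date + timedelta(days=last_day - 1)

print(len(num_cols), "day columns, from", first_date, "to", last_date)
```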
# Read csv file
ds = pd.read_csv("./sales_train_validation.csv", usecols=catCols + numCols, dtype=dtype)
ds.head()
# Transform categorical features into integers
for col in catCols:
    if col != "id":
        ds[col] = ds[col].cat.codes.astype("int16")
        ds[col] -= ds[col].min()

# Reshape from wide (one column per day) to long (one row per id and day)
ds = pd.melt(ds, id_vars=catCols,
             value_vars=[col for col in ds.columns if col.startswith("d_")],
             var_name="d", value_name="sales")

# Merge "ds" with "calendar" and "prices" dataframe.
# The integer store_id/item_id keys line up only if both files
# contain the same set of labels. The default inner join drops
# rows that have no matching weekly price.
ds = ds.merge(calendar, on="d", copy=False)
ds = ds.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], copy=False)
ds.head()
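The reshape-and-merge step is easier to follow on a tiny table. The sketch below uses made-up frames with two items, two days and one weekly price (all values are illustrative) to show how melt turns day columns into rows, how the weekly price is repeated on every day of that week, and how the default inner join drops rows without a price.

```python
import pandas as pd

# Wide sales table: one row per item, one column per day
wide = pd.DataFrame({
    "id": ["item0_store0", "item1_store0"],
    "item_id": [0, 1],
    "store_id": [0, 0],
    "d_1": [3.0, 0.0],
    "d_2": [1.0, 2.0],
})

# Long format: one row per (id, day)
long_df = pd.melt(wide, id_vars=["id", "item_id", "store_id"],
                  value_vars=["d_1", "d_2"],
                  var_name="d", value_name="sales")

# Both days fall in the same week; only item 0 has a price for that week
calendar_toy = pd.DataFrame({"d": ["d_1", "d_2"], "wm_yr_wk": [11101, 11101]})
prices_toy = pd.DataFrame({"store_id": [0], "item_id": [0],
                           "wm_yr_wk": [11101], "sell_price": [2.5]})

merged = long_df.merge(calendar_toy, on="d")
merged = merged.merge(prices_toy, on=["store_id", "item_id", "wm_yr_wk"])
print(merged)
```

Item 1 disappears from the result because it has no price row, which mirrors how the baseline ends up excluding days on which a product was not yet on sale.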
Tags: event, baseline, kaggle, ds, col, M5, calendar, csv, id
Source: https://www.cnblogs.com/wqbin/p/12785680.html