Data Science and Big Data Analytics Project Exercise 3: Applying the Apriori Algorithm to the "Groceries" Dataset Provided in R
Author: Internet
The Apriori algorithm in R
We need to install the arules and arulesViz packages.
Project requirements:
Generate frequent itemsets that satisfy the following conditions:
– Minimum support threshold: 0.02
– Minimum itemset length: 1
– Maximum itemset length: 10
Generate association rules that satisfy the following conditions:
– Minimum support threshold: 0.001
– Minimum confidence threshold: 0.6
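To make the two thresholds concrete, here is a minimal base-R sketch that computes support and confidence by hand. The five toy transactions and the helper names `support` and `confidence` are assumptions made up for illustration; they are not part of the Groceries data or the arules API.

```r
# Toy transactions (made up for illustration; not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "eggs")
)

# Support of an itemset: fraction of transactions that contain every item in it
support <- function(itemset, trans) {
  mean(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

s <- support(c("milk", "bread"), transactions)   # 3 of 5 transactions: 0.6
cf <- confidence("milk", "bread", transactions)  # 0.6 / 0.8 = 0.75
print(s)
print(cf)
```

With the project's thresholds, an itemset like {milk, bread} here would easily pass the 0.02 support cut, and the rule milk => bread would pass the 0.6 confidence cut.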
Project Start
# Install and load the required packages (a VPN or proxy may be needed to reach the download mirror)
install.packages('arules')
install.packages('arulesViz')
library('arules')
library('arulesViz')
# Load and display the "Groceries" dataset
data(Groceries)
Groceries
summary(Groceries)
class(Groceries)
The output shows that the "Groceries" dataset has 9835 rows (transactions) and 169 columns (items).
# Show the first 20 grocery labels
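As a quick cross-check of those dimensions, the sketch below queries them directly and lists the most frequent single items. It assumes the arules package is already installed; the variable names `item_freq` and `top_items` are illustrative.

```r
library(arules)
data(Groceries)

# Number of transactions (rows) and distinct items (columns)
dim(Groceries)

# Relative frequency (support) of each single item, highest first
item_freq <- itemFrequency(Groceries)
top_items <- head(sort(item_freq, decreasing = TRUE), 10)
print(top_items)
```

These single-item frequencies are the same quantities that the 0.02 support threshold is compared against when the frequent 1-itemsets are mined.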
Groceries@itemInfo[1:20,]
# Show transactions 10 through 20
# (Groceries@data stores items in rows and transactions in columns)
apply(Groceries@data[,10:20], 2,
      function(r) paste(Groceries@itemInfo[r,"labels"], collapse=", ")
)
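An alternative that avoids reaching into the object's slots is to subset the transactions and print them with `inspect()`. This is a sketch assuming arules is installed; `subset_tx` is an illustrative name.

```r
library(arules)
data(Groceries)

# Select transactions 10 through 20 by index
subset_tx <- Groceries[10:20]

# Print the items contained in each selected transaction
inspect(subset_tx)

# Number of items in each selected transaction
size(subset_tx)
```

Both approaches should list the same baskets; the subsetting form is simply less dependent on the internal layout of the transactions object.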
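Next, we generate the frequent itemsets.
The parameters are set according to the requirements, mining one itemset length at a time.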
# frequent 1-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=1, maxlen=1, support=0.02, target="frequent itemsets"))
summary(itemsets)
inspect(head(sort(itemsets, by = "support"), 10))
# frequent 2-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=2, maxlen=2, support=0.02, target="frequent itemsets"))
summary(itemsets)
inspect(head(sort(itemsets, by ="support"),10))
# frequent 3-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=3, maxlen=3, support=0.02, target="frequent itemsets"))
inspect(sort(itemsets, by ="support"))
# frequent 4-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=4, maxlen=4, support=0.02, target="frequent itemsets"))
inspect(sort(itemsets, by ="support"))
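The four calls above mine one length at a time. The stated requirements (minimum length 1, maximum length 10, support 0.02) can also be passed in a single call, as in this sketch; `all_itemsets` is an illustrative name, and arules is assumed to be installed.

```r
library(arules)
data(Groceries)

# Mine all frequent itemsets of length 1 to 10 in one pass
all_itemsets <- apriori(Groceries,
                        parameter = list(minlen = 1, maxlen = 10,
                                         support = 0.02,
                                         target = "frequent itemsets"))

# Count how many frequent itemsets were found for each length
table(size(all_itemsets))

# Show the 10 itemsets with the highest support
inspect(head(sort(all_itemsets, by = "support"), 10))
```

The per-length table makes it easy to see at which length no itemset reaches the support threshold any more.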
The resulting frequent itemsets are as follows:
Rule generation and visualization
rules <- apriori(Groceries, parameter=list(support=0.001,
confidence=0.6, target = "rules"))
summary(rules)
plot(rules)
plot(rules@quality)
# Show the rules with the highest lift scores
inspect(head(sort(rules, by="lift"), 10))
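Lift compares how often the two sides of a rule occur together with how often they would if they were independent: lift(X => Y) = support(X and Y) / (support(X) * support(Y)). The base-R sketch below works this out on made-up transactions; the data and the helper names `support` and `lift` are illustrative assumptions, not arules functions.

```r
# Toy transactions (made up for illustration; not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "eggs")
)

# Fraction of transactions that contain every item in the itemset
support <- function(itemset, trans) {
  mean(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

# Lift of lhs => rhs: joint support divided by the product of the two supports
lift <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / (support(lhs, trans) * support(rhs, trans))
}

lift_butter_milk <- lift("butter", "milk", transactions)  # 0.4 / (0.4 * 0.8) = 1.25
lift_milk_bread  <- lift("milk", "bread", transactions)   # 0.6 / (0.8 * 0.8) = 0.9375
print(lift_butter_milk)
print(lift_milk_bread)
```

A lift above 1 (butter => milk here) indicates the items occur together more often than independence would predict, while a lift below 1 indicates the opposite.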
Select the rules with confidence greater than 0.9
confidentRules <- rules[quality(rules)$confidence > 0.9]
confidentRules
# Note: recent arulesViz versions may expect a string such as reorder="measure" instead of TRUE
plot(confidentRules, method="matrix", measure=c("lift", "confidence"), control=list(reorder=TRUE))
This yields 127 rules in total, which the matrix plot visualizes.
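The same filter can also be written with arules' `subset()` method, which reads a little closer to the stated condition. This is a sketch assuming arules is installed; `confident_rules` is an illustrative name, and the rules are re-mined so the block runs on its own.

```r
library(arules)
data(Groceries)

# Re-mine the rules with the project's thresholds
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.6,
                                  target = "rules"))

# Keep only the rules whose confidence exceeds 0.9
confident_rules <- subset(rules, subset = confidence > 0.9)

# Number of rules that pass the filter
length(confident_rules)

# Show the five highest-lift rules among them
inspect(head(sort(confident_rules, by = "lift"), 5))
```

Conditions on other quality measures can be combined in the same expression, for example `confidence > 0.9 & lift > 3`.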
# Select the 5 rules with the highest lift
highLiftRules <- head(sort(rules, by="lift"), 5)
plot(highLiftRules, method="graph", control=list(type="items"))
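The resulting graph is shown below.
The graph suggests that customers who buy ham almost always also buy processed cheese, and that customers who buy popcorn or soda are very likely to buy salty snacks.
References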
- Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015
Source: https://blog.csdn.net/Jifu_M/article/details/111996893