6 Association Analysis: Advanced Concepts
This chapter discusses a few advanced concepts of association analysis. First, we look at how categorical and continuous attributes are converted into items. Then we look at integrating item hierarchies into the analysis. Finally, sequential pattern mining is introduced.
Packages Used in this Chapter
pkgs <- c("arules", "arulesSequences", "tidyverse")
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are:
- arules (Hahsler et al. 2024)
- arulesSequences (Buchta and Hahsler 2024)
- tidyverse (Wickham 2023c)
6.1 Handling Categorical Attributes
Categorical attributes are nominal or ordinal variables. In R, they are stored as factors (factor) or ordered factors (ordered). They are translated into a series of binary items, one for each level, with labels constructed as variable name = level. Items cannot represent order, and thus ordered factors lose the order information. Note that nominal variables need to be encoded as factors (and not as characters or numbers) before converting them into transactions.
For the special case of Boolean variables (logical), the TRUE value is converted into an item with the name of the variable, and no item is created for FALSE values.
We will give an example in the next section.
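Ahead of the fuller example, the following minimal sketch shows the conversion rules on a tiny made-up data.frame. The column names color, size and organic are hypothetical, and the code assumes a version of arules that provides transactions().

```r
library(arules)

# Hypothetical toy data: a nominal factor, an ordered factor, and a logical column
toy <- data.frame(
  color   = factor(c("red", "blue", "red")),
  size    = factor(c("S", "L", "M"), levels = c("S", "M", "L"), ordered = TRUE),
  organic = c(TRUE, FALSE, TRUE)
)

# Each factor level becomes one item; the logical column becomes a single item
toy_trans <- transactions(toy)
itemLabels(toy_trans)
inspect(toy_trans)
```

The ordered factor size produces the items size=S, size=M and size=L, but nothing in the resulting transactions records that S < M < L.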
6.2 Handling Continuous Attributes
Continuous variables cannot be represented directly as items and need to be discretized first (see Discretization in Chapter 2). An item resulting from discretization might be age>18, in which case the column contains only TRUE or FALSE. Alternatively, it can be a factor with the levels age<=18, 50>=age>18 and age>50. These will be automatically converted into 3 items, one for each level. Discretization is implemented in the function discretize(), and discretizeDF() discretizes all columns in a data.frame.
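The two encodings of age described above can be sketched as follows. The age values are made up for illustration, and the cut points are passed to discretize() with method = "fixed"; right = TRUE is set so that the intervals are closed on the right and 18 falls into the first group.

```r
library(arules)

# Hypothetical ages used only for illustration
age <- c(12, 18, 35, 50, 72)

# Logical encoding: a single item that is present when age > 18
adult <- age > 18

# Factor encoding with fixed cut points; right = TRUE closes intervals on the right
age_group <- discretize(age,
  method = "fixed",
  breaks = c(-Inf, 18, 50, Inf),
  right = TRUE,
  labels = c("age<=18", "50>=age>18", "age>50"))

data.frame(age, adult, age_group)
```

The logical column yields at most one item per row, while the factor always yields exactly one of three items.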
We give a short example using the iris dataset. We add an extra logical column to show how Boolean attributes are converted into items.
data(iris)
## add a Boolean attribute
iris$Versicolor <- iris$Species == "versicolor"
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Versicolor
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
The first step is to discretize continuous attributes (the numeric columns in the table above). We discretize the two Petal features.
library(tidyverse)
library(arules)
iris_disc <- iris %>%
  mutate(Petal.Length = discretize(Petal.Length,
                                   method = "frequency",
                                   breaks = 3,
                                   labels = c("short", "medium", "long")),
         Petal.Width = discretize(Petal.Width,
                                  method = "frequency",
                                  breaks = 2,
                                  labels = c("narrow", "wide"))
  )
head(iris_disc)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 short narrow setosa
## 2 4.9 3.0 short narrow setosa
## 3 4.7 3.2 short narrow setosa
## 4 4.6 3.1 short narrow setosa
## 5 5.0 3.6 short narrow setosa
## 6 5.4 3.9 short narrow setosa
## Versicolor
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
Next, we convert the dataset into transactions.
trans <- transactions(iris_disc)
## Warning: Column(s) 1, 2 not logical or factor. Applying
## default discretization (see '? discretizeDF').
trans
## transactions in sparse format with
## 150 transactions (rows) and
## 15 items (columns)
The conversion creates a warning because there are still two undiscretized columns in the data. The warning indicates that the default discretization is used automatically.
itemLabels(trans)
## [1] "Sepal.Length=[4.3,5.4)" "Sepal.Length=[5.4,6.3)"
## [3] "Sepal.Length=[6.3,7.9]" "Sepal.Width=[2,2.9)"
## [5] "Sepal.Width=[2.9,3.2)" "Sepal.Width=[3.2,4.4]"
## [7] "Petal.Length=short" "Petal.Length=medium"
## [9] "Petal.Length=long" "Petal.Width=narrow"
## [11] "Petal.Width=wide" "Species=setosa"
## [13] "Species=versicolor" "Species=virginica"
## [15] "Versicolor"
We see that all continuous variables are discretized and each range creates an item. For example, Petal.Width has the two items Petal.Width=narrow and Petal.Width=wide. The automatically discretized variables show intervals. Sepal.Length=[4.3,5.4) means that this item is used for flowers with a sepal length of at least 4.3 cm and less than 5.4 cm.
The species is converted into three items, one for each class. The logical variable Versicolor created only a single item that is used when the variable is TRUE.
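To avoid the default-discretization warning, all numeric columns can be discretized explicitly before the conversion. The following sketch uses discretizeDF() with a default setting applied to every numeric column; the labels low, medium and high are chosen only for illustration.

```r
library(arules)
data(iris)

# Discretize every numeric column explicitly; factor columns such as
# Species are left unchanged
iris_all <- discretizeDF(iris,
  default = list(method = "frequency", breaks = 3,
                 labels = c("low", "medium", "high")))

trans_all <- transactions(iris_all)
itemLabels(trans_all)
```

Each of the four measurements now contributes three items, and Species contributes one item per class.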
6.3 Handling Concept Hierarchies
Often an item hierarchy is available for transactions used for association rule mining. For example, in a supermarket dataset, items like “bread” and “bagel” might belong to the item group (category) “baked goods.” Transactions can store item hierarchies as additional columns in the itemInfo data.frame.
6.3.1 Aggregation
To perform analysis at a group level of the item hierarchy, aggregate() produces a new object with items aggregated to a given group level. A group-level item is present if one or more of the items in the group are present in the original object. If rules are aggregated and the aggregation would lead to the same group item in both the lhs and the rhs, then that group item is removed from the lhs. Rules or itemsets that are no longer unique after the aggregation are also removed. Note also that the quality measures are not applicable to the new rules and thus are removed. If these measures are required, then aggregate the transactions before mining rules.
We use the Groceries data set in this example. It contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The items are 169 product categories.
data("Groceries")
Groceries
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
The dataset also contains two aggregation levels.
head(itemInfo(Groceries))
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
## 4 ham sausage meat and sausage
## 5 meat sausage meat and sausage
## 6 finished products sausage meat and sausage
We aggregate to level2 stored in Groceries. All items with the same level2 label will become a single item with that name. This reduces the number of items to the 55 level2 categories.
Groceries_level2 <- aggregate(Groceries, by = "level2")
Groceries_level2
## transactions in sparse format with
## 9835 transactions (rows) and
## 55 items (columns)
head(itemInfo(Groceries_level2)) ## labels are alphabetically sorted!
## labels level2 level1
## 1 baby food baby food canned food
## 2 bags bags non-food
## 3 bakery improver bakery improver processed food
## 4 bathroom cleaner bathroom cleaner detergent
## 5 beef beef meat and sausage
## 6 beer beer drinks
We can now compare an original transaction with the aggregated transaction.
inspect(head(Groceries, 3))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
inspect(head(Groceries_level2, 3))
## items
## [1] {bread and backed goods,
## fruit,
## soups/sauces,
## vinegar/oils}
## [2] {coffee,
## dairy produce,
## fruit}
## [3] {dairy produce}
For example, citrus fruit in the first transaction was translated to the category fruit. Note that the order of items in a transaction is not important, so it might change during aggregation.
It is now easy to mine rules on the aggregated data.
rules <- apriori(Groceries_level2, support = 0.005)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime
## 0.8 0.1 1 none FALSE TRUE 5
## support minlen maxlen target ext
## 0.005 1 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[55 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [243 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules |> head(3, by = "support") |> inspect()
## lhs rhs support confidence coverage lift count
## [1] {bread and backed goods,
## cheese,
## fruit} => {dairy produce} 0.02481 0.8385 0.02959 1.893 244
## [2] {bread and backed goods,
## cheese,
## vegetables} => {dairy produce} 0.02379 0.8239 0.02888 1.860 234
## [3] {cheese,
## fruit,
## vegetables} => {dairy produce} 0.02267 0.8479 0.02674 1.914 223
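For comparison, rules can also be mined on the original items and aggregated afterwards, as described at the start of this section. The sketch below does this for level2; the support and confidence thresholds are chosen only for illustration.

```r
library(arules)
data("Groceries")

# Mine rules on the original items (thresholds chosen only for illustration)
rules_items <- apriori(Groceries,
  parameter = list(support = 0.005, confidence = 0.5),
  control = list(verbose = FALSE))

# Aggregate the mined rules to level2; rules that are no longer unique
# are removed and the quality measures are dropped
rules_items_level2 <- aggregate(rules_items, by = "level2")

length(rules_items)
length(rules_items_level2)
head(labels(rules_items_level2), 3)
```

Because the aggregated rules carry no quality measures, this route is mainly useful for exploring which group-level relationships exist.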
You can add your own aggregation to an existing dataset by constructing the itemInfo data.frame and adding it to the transactions. See ? hierarchy for details.
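A minimal sketch of adding a custom hierarchy level is shown below. The baskets, the column name category and the item-to-category mapping are all hypothetical; the only requirement assumed here is that the new itemInfo column has one entry per item.

```r
library(arules)

# Hypothetical toy baskets
baskets <- list(
  c("bread", "milk"),
  c("bagel", "butter"),
  c("bread", "butter", "milk")
)
toy_baskets <- transactions(baskets)

# Hypothetical item-to-category mapping, looked up by item label
category <- c(bread = "baked goods", bagel = "baked goods",
              milk = "dairy", butter = "dairy")
itemInfo(toy_baskets)$category <-
  factor(unname(category[itemLabels(toy_baskets)]))

itemInfo(toy_baskets)
toy_baskets |> aggregate(by = "category") |> inspect()
```

Looking up the category by item label avoids relying on the order in which items are stored.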
6.3.2 Multi-level Analysis
To analyze relationships between individual items and item groups at the same time, addAggregate() can be used to create a new transactions object that contains both the original items and the group-level items.
Groceries_multilevel <- addAggregate(Groceries, "level2")
Groceries_multilevel |> head(n=3) |> inspect()
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups,
## bread and backed goods*,
## fruit*,
## soups/sauces*,
## vinegar/oils*}
## [2] {tropical fruit,
## yogurt,
## coffee,
## coffee*,
## dairy produce*,
## fruit*}
## [3] {whole milk,
## dairy produce*}
The added group-level items are marked with an * after the name. Now we can mine rules including items from multiple levels.
rules <- apriori(Groceries_multilevel,
parameter = list(support = 0.005))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime
## 0.8 0.1 1 none FALSE TRUE 5
## support minlen maxlen target ext
## 0.005 1 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[224 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [167 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.05s].
## writing ... [21200 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 21200 rules
Mining rules with group-level items added will create many spurious rules of the type item A => group of item A with a confidence of 1. This will also happen if you mine itemsets. filterAggregate() can be used to filter these spurious rules or itemsets.
rules <- filterAggregate(rules)
rules
## set of 838 rules
rules |> head(n = 3, by = "lift") |> inspect()
## lhs rhs support confidence coverage lift count
## [1] {whole milk,
## whipped/sour cream,
## bread and backed goods*,
## cheese*} => {vegetables*} 0.005186 0.8095 0.006406 2.965 51
## [2] {sausage,
## poultry*} => {vegetables*} 0.005084 0.8065 0.006304 2.954 50
## [3] {other vegetables,
## soda,
## fruit*,
## sausage*} => {bread and backed goods*} 0.005287 0.8525 0.006202 2.467 52
Using multi-level mining can reduce the number of rules and help to analyze if customers differentiate between products in a group.
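The spurious rules mentioned above can be made visible with a small sketch. It restricts mining to rules with at most two items (maxlen = 2, chosen only to keep the example small; apriori may warn that maxlen was reached) and picks out the rule from whole milk to its own group.

```r
library(arules)
data("Groceries")

multi <- addAggregate(Groceries, "level2")

# Restrict to rules with at most two items to keep the example small
rules_pairs <- apriori(multi,
  parameter = list(support = 0.005, confidence = 0.8, maxlen = 2),
  control = list(verbose = FALSE))

# An item always implies its own group, so this rule has confidence 1
spurious <- subset(rules_pairs,
  lhs %in% "whole milk" & rhs %in% "dairy produce*")
inspect(spurious)

# filterAggregate() removes rules of this kind
length(rules_pairs)
length(filterAggregate(rules_pairs))
```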
6.4 Sequential Patterns
The frequent sequential pattern mining algorithm cSPADE (Zaki 2000) is implemented in the arules extension package arulesSequences.
Sequential pattern mining starts with sequences of events. Each sequence is identified by a sequence ID, and each event is a set of items that happen together. The order of events is specified using event IDs. The goal is to find subsequences of items in events that follow each other frequently. These are called frequent sequential patterns.
We will look at a small example dataset that comes with the package arulesSequences.
library(arulesSequences)
##
## Attaching package: 'arulesSequences'
## The following object is masked from 'package:arules':
##
## itemsets
data(zaki)
inspect(zaki)
## items sequenceID eventID SIZE
## [1] {C, D} 1 10 2
## [2] {A, B, C} 1 15 3
## [3] {A, B, F} 1 20 3
## [4] {A, C, D, F} 1 25 4
## [5] {A, B, F} 2 15 3
## [6] {E} 2 20 1
## [7] {A, B, F} 3 10 3
## [8] {D, G, H} 4 10 3
## [9] {B, F} 4 20 2
## [10] {A, G, H} 4 25 3
The dataset contains four sequences (see sequenceID), and the event IDs are integers that specify the order of events in a sequence. In arulesSequences, this set of sequences is implemented as a regular transaction set, where each transaction is an event. The temporal information is added as extra columns to the transaction’s transactionInfo() data.frame.
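The same layout can be built by hand. The following sketch creates a transactions object from a list of made-up events and attaches sequenceID and eventID columns; the data are hypothetical, and sorting the rows by sequenceID and then eventID is an assumption about what the mining function expects.

```r
library(arules)

# Hypothetical events: each list element is one event (a set of items)
events <- list(c("A", "B"), c("C"), c("A"), c("B", "C"))
toy_seq <- transactions(events)

# Temporal information: two sequences with two events each,
# sorted by sequenceID and then eventID
transactionInfo(toy_seq) <- data.frame(
  sequenceID = c(1L, 1L, 2L, 2L),
  eventID    = c(10L, 20L, 10L, 20L)
)

inspect(toy_seq)
```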
Mining frequent sequential patterns using cspade() is very similar to using apriori(). Here we set the support threshold so that we find patterns that occur in at least 50% of the sequences.
fsp <- cspade(zaki, parameter = list(support = .5))
fsp |> inspect()
## items support
## 1 <{A}> 1.00
## 2 <{B}> 1.00
## 3 <{D}> 0.50
## 4 <{F}> 1.00
## 5 <{A,
## F}> 0.75
## 6 <{B,
## F}> 1.00
## 7 <{D},
## {F}> 0.50
## 8 <{D},
## {B,
## F}> 0.50
## 9 <{A,
## B,
## F}> 0.75
## 10 <{A,
## B}> 0.75
## 11 <{D},
## {B}> 0.50
## 12 <{B},
## {A}> 0.50
## 13 <{D},
## {A}> 0.50
## 14 <{F},
## {A}> 0.50
## 15 <{D},
## {F},
## {A}> 0.50
## 16 <{B,
## F},
## {A}> 0.50
## 17 <{D},
## {B,
## F},
## {A}> 0.50
## 18 <{D},
## {B},
## {A}> 0.50
##
For example, pattern 17 shows that an event containing D is often followed by an event containing B and F, which in turn is followed by an event containing A.
The cspade algorithm supports many additional parameters to control gaps and windows. Details can be found in the manual page for cspade.
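As a hedged illustration of a gap constraint, the sketch below adds maxgap to the parameter list. The value 5 is arbitrary; with the event IDs in zaki it is assumed to allow only events that are at most 5 ID units apart to follow each other in a pattern.

```r
library(arulesSequences)
data(zaki)

# Unconstrained mining, as in the example above
fsp_all <- cspade(zaki, parameter = list(support = 0.5))

# maxgap = 5 (illustrative value): consecutive events of a pattern
# may be at most 5 eventID units apart
fsp_gap <- cspade(zaki, parameter = list(support = 0.5, maxgap = 5))

length(fsp_all)
length(fsp_gap)
```

A gap constraint can only lower the support of a pattern, so the constrained result should never contain more patterns than the unconstrained one.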
Rules, similar to regular association rules, can be generated from frequent sequence patterns using ruleInduction().
rules <- ruleInduction(fsp, confidence = .8)
rules |> inspect()
## lhs rhs support confidence lift
## 1 <{D}> => <{F}> 0.5 1 1
## 2 <{D}> => <{B, 0.5 1 1
## F}>
## 3 <{D}> => <{B}> 0.5 1 1
## 4 <{D}> => <{A}> 0.5 1 1
## 5 <{D},
## {F}> => <{A}> 0.5 1 1
## 6 <{D},
## {B,
## F}> => <{A}> 0.5 1 1
## 7 <{D},
## {B}> => <{A}> 0.5 1 1
##
The usual measures of confidence and lift are used.
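The values for rule 1 can be checked by hand from the supports listed in the cspade output. The sketch below assumes the definitions used for regular association rules: confidence is the support of the whole sequence divided by the support of the lhs, and lift is the confidence divided by the support of the rhs.

```r
# Support values copied from the cspade output above
supp_D   <- 0.50  # <{D}>
supp_F   <- 1.00  # <{F}>
supp_D_F <- 0.50  # <{D},{F}>

# Assumed definitions, analogous to regular association rules
confidence <- supp_D_F / supp_D
lift       <- confidence / supp_F

c(confidence = confidence, lift = lift)
```

Both values are 1, matching the first row of the rule table: every sequence that contains D also contains a later F, but F occurs in every sequence anyway, so D does not make F any more likely.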