6 Association Analysis: Advanced Concepts

This chapter discusses a few advanced concepts of association analysis. First, we look at how categorical and continuous attributes are converted into items. Then we look at integrating item hierarchies into the analysis. Finally, sequential pattern mining is introduced.

Packages Used in this Chapter

pkgs <- c("arules", "arulesSequences", "tidyverse")

pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are arules, arulesSequences, and tidyverse.

6.1 Handling Categorical Attributes

Categorical attributes are nominal or ordinal variables. In R they are represented as factors or ordered factors. They are translated into a series of binary items (one for each level, constructed as ⁠variable name = level⁠). Items cannot represent order, so ordered factors lose their order information. Note that nominal variables need to be encoded as factors (and not as characters or numbers) before converting them into transactions.
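As a minimal sketch of how factor levels become items (using hypothetical data; the data.frame and its column names are made up for illustration):

```r
library(arules)

## a small data.frame with a nominal and an ordered factor
df <- data.frame(
  color = factor(c("red", "green", "red")),
  size  = factor(c("S", "L", "M"), levels = c("S", "M", "L"), ordered = TRUE)
)

trans <- transactions(df)

## one binary item per level, e.g., "color=red" and "size=M";
## the ordering of the size levels is lost in the items
itemLabels(trans)
```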

For the special case of Boolean variables (logical), the TRUE value is converted into an item with the name of the variable; for FALSE values, no item is created.

We will give an example in the next section.

6.2 Handling Continuous Attributes

Continuous variables cannot be directly represented as items and need to be discretized first (see Discretization in Chapter 2). An item resulting from discretization might be age>18, where the column contains only TRUE or FALSE. Alternatively, it can be a factor with levels age<=18, ⁠18<age<=50⁠ and age>50. These will be automatically converted into 3 items, one for each level. Discretization is performed by the functions discretize() and discretizeDF(); the latter discretizes all columns in a data.frame.
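The age example above can be sketched with discretize() using fixed interval boundaries (the age vector is hypothetical data for illustration):

```r
library(arules)

age <- c(15, 22, 47, 60, 35)

## fixed boundaries at 18 and 50 create three levels,
## one for each interval
discretize(age, method = "fixed", breaks = c(-Inf, 18, 50, Inf),
  labels = c("age<=18", "18<age<=50", "age>50"))
```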

We give a short example using the iris dataset. We add an extra logical column to show how Boolean attributes are converted into items.

data(iris)

## add a Boolean attribute
iris$Versicolor <- iris$Species == "versicolor"
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##   Versicolor
## 1      FALSE
## 2      FALSE
## 3      FALSE
## 4      FALSE
## 5      FALSE
## 6      FALSE

The first step is to discretize the continuous attributes (the numeric columns in the table above). We discretize the two Petal features.

library(tidyverse)
library(arules)

iris_disc <- iris %>% 
  mutate(Petal.Length = discretize(Petal.Length, 
                          method = "frequency", 
                          breaks = 3, 
                          labels = c("short", "medium", "long")),
         Petal.Width = discretize(Petal.Width,
                          method = "frequency", 
                          breaks = 2, 
                          labels = c("narrow", "wide"))
         )
  

head(iris_disc)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5        short      narrow  setosa
## 2          4.9         3.0        short      narrow  setosa
## 3          4.7         3.2        short      narrow  setosa
## 4          4.6         3.1        short      narrow  setosa
## 5          5.0         3.6        short      narrow  setosa
## 6          5.4         3.9        short      narrow  setosa
##   Versicolor
## 1      FALSE
## 2      FALSE
## 3      FALSE
## 4      FALSE
## 5      FALSE
## 6      FALSE

Next, we convert the dataset into transactions.

trans <- transactions(iris_disc)
## Warning: Column(s) 1, 2 not logical or factor. Applying
## default discretization (see '? discretizeDF').
trans
## transactions in sparse format with
##  150 transactions (rows) and
##  15 items (columns)

The conversion creates a warning because there are still two undiscretized columns in the data. The warning indicates that the default discretization is used automatically.

itemLabels(trans)
##  [1] "Sepal.Length=[4.3,5.4)" "Sepal.Length=[5.4,6.3)"
##  [3] "Sepal.Length=[6.3,7.9]" "Sepal.Width=[2,2.9)"   
##  [5] "Sepal.Width=[2.9,3.2)"  "Sepal.Width=[3.2,4.4]" 
##  [7] "Petal.Length=short"     "Petal.Length=medium"   
##  [9] "Petal.Length=long"      "Petal.Width=narrow"    
## [11] "Petal.Width=wide"       "Species=setosa"        
## [13] "Species=versicolor"     "Species=virginica"     
## [15] "Versicolor"

We see that all continuous variables have been discretized and that each range creates an item. For example, Petal.Width has the two items Petal.Width=narrow and Petal.Width=wide. The automatically discretized variables show intervals. Sepal.Length=[4.3,5.4) means that this item is used for flowers with a sepal length between 4.3 and 5.4 cm.

The species is converted into three items, one for each class. The logical variable Versicolor creates only a single item that is used when the variable is TRUE.
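To verify the conversion, we can inspect how a single flower is represented as a set of items (a quick check, continuing the code above):

```r
## each flower becomes one transaction with one item per attribute;
## there is no item for Versicolor since it is FALSE for setosa flowers
inspect(head(trans, 1))
```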

6.3 Handling Concept Hierarchies

Often an item hierarchy is available for transactions used for association rule mining. For example, in a supermarket dataset, items like “bread” and “bagel” might belong to the item group (category) “baked goods.” Transactions can store item hierarchies as additional columns in the itemInfo data.frame.

6.3.1 Aggregation

To perform analysis at a group level of the item hierarchy, aggregate() produces a new object with items aggregated to a given group level. A group-level item is present if one or more of the items in the group are present in the original object. If rules are aggregated, and the aggregation would lead to the same aggregated group item in the lhs and in the rhs, then that group item is removed from the lhs. Rules or itemsets that are not unique after the aggregation are also removed. Note also that the quality measures are not applicable to the new rules and are therefore removed. If these measures are required, aggregate the transactions before mining rules.

We use the Groceries data set in this example. It contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The items are 169 product categories.

data("Groceries")
Groceries
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

The dataset also contains two aggregation levels.

head(itemInfo(Groceries))
##              labels  level2           level1
## 1       frankfurter sausage meat and sausage
## 2           sausage sausage meat and sausage
## 3        liver loaf sausage meat and sausage
## 4               ham sausage meat and sausage
## 5              meat sausage meat and sausage
## 6 finished products sausage meat and sausage

We aggregate to the level2 groups stored in Groceries. All items with the same level2 label will become a single item with that name. This reduces the number of items to the 55 level2 categories.

Groceries_level2 <- aggregate(Groceries, by = "level2")
Groceries_level2
## transactions in sparse format with
##  9835 transactions (rows) and
##  55 items (columns)
head(itemInfo(Groceries_level2)) ## labels are alphabetically sorted!
##             labels           level2           level1
## 1        baby food        baby food      canned food
## 2             bags             bags         non-food
## 3  bakery improver  bakery improver   processed food
## 4 bathroom cleaner bathroom cleaner        detergent
## 5             beef             beef meat and sausage
## 6             beer             beer           drinks

We can now compare an original transaction with the aggregated transaction.

inspect(head(Groceries, 3))
##     items                 
## [1] {citrus fruit,        
##      semi-finished bread, 
##      margarine,           
##      ready soups}         
## [2] {tropical fruit,      
##      yogurt,              
##      coffee}              
## [3] {whole milk}
inspect(head(Groceries_level2, 3))
##     items                    
## [1] {bread and backed goods, 
##      fruit,                  
##      soups/sauces,           
##      vinegar/oils}           
## [2] {coffee,                 
##      dairy produce,          
##      fruit}                  
## [3] {dairy produce}

For example, citrus fruit in the first transaction was translated to the category fruit. Note that the order of items in a transaction is not important, so it might change during aggregation.
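To see why citrus fruit was translated to fruit, we can look up the mapping in the item hierarchy stored in the original transactions (a quick check using the itemInfo data.frame shown earlier):

```r
## original items whose level2 group is "fruit"
itemInfo(Groceries) |> subset(level2 == "fruit")
```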

It is now easy to mine rules on the aggregated data.

rules <- apriori(Groceries_level2, support = 0.005)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime
##         0.8    0.1    1 none FALSE            TRUE       5
##  support minlen maxlen target  ext
##    0.005      1     10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[55 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [243 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules |> head(3, by = "support") |> inspect()
##     lhs                          rhs             support confidence coverage  lift count
## [1] {bread and backed goods,                                                            
##      cheese,                                                                            
##      fruit}                   => {dairy produce} 0.02481     0.8385  0.02959 1.893   244
## [2] {bread and backed goods,                                                            
##      cheese,                                                                            
##      vegetables}              => {dairy produce} 0.02379     0.8239  0.02888 1.860   234
## [3] {cheese,                                                                            
##      fruit,                                                                             
##      vegetables}              => {dairy produce} 0.02267     0.8479  0.02674 1.914   223

You can add your own aggregation to an existing dataset by constructing the itemInfo data.frame and adding it to the transactions. See ? hierarchy for details.

6.3.2 Multi-level Analysis

To analyze relationships between individual items and item groups at the same time, addAggregate() can be used to create a new transactions object which contains both the original items and the group-level items.

Groceries_multilevel <- addAggregate(Groceries, "level2")
Groceries_multilevel |> head(n=3) |> inspect()
##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups,             
##      bread and backed goods*, 
##      fruit*,                  
##      soups/sauces*,           
##      vinegar/oils*}           
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee,                  
##      coffee*,                 
##      dairy produce*,          
##      fruit*}                  
## [3] {whole milk,              
##      dairy produce*}

The added group-level items are marked with an * after the name. Now we can mine rules including items from multiple levels.

rules <- apriori(Groceries_multilevel,
  parameter = list(support = 0.005))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime
##         0.8    0.1    1 none FALSE            TRUE       5
##  support minlen maxlen target  ext
##    0.005      1     10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[224 item(s), 9835 transaction(s)] done [0.02s].
## sorting and recoding items ... [167 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.08s].
## writing ... [21200 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
rules
## set of 21200 rules

Mining rules with group-level items added will create many spurious rules of the type

⁠item A => group of item A⁠

with a confidence of 1. This will also happen if you mine itemsets. filterAggregate() can be used to filter these spurious rules or itemsets.

rules <- filterAggregate(rules)
rules
## set of 838 rules
rules |> head(n = 3, by = "lift") |> inspect()
##     lhs                           rhs                        support confidence coverage  lift count
## [1] {whole milk,                                                                                    
##      whipped/sour cream,                                                                            
##      bread and backed goods*,                                                                       
##      cheese*}                  => {vegetables*}             0.005186     0.8095 0.006406 2.965    51
## [2] {sausage,                                                                                       
##      poultry*}                 => {vegetables*}             0.005084     0.8065 0.006304 2.954    50
## [3] {other vegetables,                                                                              
##      soda,                                                                                          
##      fruit*,                                                                                        
##      sausage*}                 => {bread and backed goods*} 0.005287     0.8525 0.006202 2.467    52

Using multi-level mining can reduce the number of rules and help to analyze whether customers differentiate between products in a group.

6.4 Sequential Patterns

The frequent sequential pattern mining algorithm cSPADE (Zaki 2000) is implemented in the arules extension package arulesSequences.

Sequential pattern mining starts with sequences of events. Each sequence is identified by a sequence ID, and each event is a set of items that happen together. The order of events is specified using event IDs. The goal is to find subsequences of items in events that follow each other frequently. These are called frequent sequential patterns.

We will look at a small example dataset that comes with the package arulesSequences.

library(arulesSequences)
## 
## Attaching package: 'arulesSequences'
## The following object is masked from 'package:arules':
## 
##     itemsets
data(zaki)

inspect(zaki)
##      items        sequenceID eventID SIZE
## [1]  {C, D}       1          10      2   
## [2]  {A, B, C}    1          15      3   
## [3]  {A, B, F}    1          20      3   
## [4]  {A, C, D, F} 1          25      4   
## [5]  {A, B, F}    2          15      3   
## [6]  {E}          2          20      1   
## [7]  {A, B, F}    3          10      3   
## [8]  {D, G, H}    4          10      3   
## [9]  {B, F}       4          20      2   
## [10] {A, G, H}    4          25      3

The dataset contains four sequences (see sequenceID), and the event IDs are integer numbers that provide the order of events in a sequence. In arulesSequences, this set of sequences is implemented as a regular transaction set, where each transaction is an event. The temporal information is added as extra columns to the transaction’s transactionInfo() data.frame.
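These extra columns can be inspected directly:

```r
library(arulesSequences)
data(zaki)

## sequence and event IDs are stored as columns of transactionInfo
head(transactionInfo(zaki))
```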

Mining frequent sequential patterns using cspade() is very similar to using apriori(). Here we set the support threshold so that we find patterns that occur in at least 50% of the sequences.

fsp <- cspade(zaki, parameter = list(support = .5))
fsp |> inspect()
##     items support 
##   1 <{A}>    1.00 
##   2 <{B}>    1.00 
##   3 <{D}>    0.50 
##   4 <{F}>    1.00 
##   5 <{A,   
##       F}>    0.75 
##   6 <{B,   
##       F}>    1.00 
##   7 <{D},  
##      {F}>    0.50 
##   8 <{D},  
##      {B,   
##       F}>    0.50 
##   9 <{A,   
##       B,   
##       F}>    0.75 
##  10 <{A,   
##       B}>    0.75 
##  11 <{D},  
##      {B}>    0.50 
##  12 <{B},  
##      {A}>    0.50 
##  13 <{D},  
##      {A}>    0.50 
##  14 <{F},  
##      {A}>    0.50 
##  15 <{D},  
##      {F},  
##      {A}>    0.50 
##  16 <{B,   
##       F},  
##      {A}>    0.50 
##  17 <{D},  
##      {B,   
##       F},  
##      {A}>    0.50 
##  18 <{D},  
##      {B},  
##      {A}>    0.50 
## 

For example, pattern 17 shows that if D appears in an event, it is often followed by an event containing B and F, which in turn is followed by an event containing A.

The cspade algorithm supports many additional parameters to control gaps and windows. Details can be found in the manual page for cspade().
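For example, a maximum gap constraint restricts patterns to events that follow each other closely in time. A sketch (maxgap is one of the cSPADE timing constraints; see the SPparameter class documentation in arulesSequences for the full list):

```r
## only count a pattern if consecutive events in a sequence
## are at most 10 time units (event IDs) apart
fsp_gap <- cspade(zaki, parameter = list(support = .5, maxgap = 10))
fsp_gap |> inspect()
```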

Rules, similar to regular association rules, can be generated from frequent sequential patterns using ruleInduction().

rules <- ruleInduction(fsp, confidence = .8)
rules |> inspect()
##    lhs      rhs   support confidence lift 
##  1 <{D}> => <{F}>     0.5          1    1 
##  2 <{D}> => <{B,      0.5          1    1 
##               F}>    
##  3 <{D}> => <{B}>     0.5          1    1 
##  4 <{D}> => <{A}>     0.5          1    1 
##  5 <{D},             
##     {F}> => <{A}>     0.5          1    1 
##  6 <{D},             
##     {B,              
##      F}> => <{A}>     0.5          1    1 
##  7 <{D},             
##     {B}> => <{A}>     0.5          1    1 
## 

The usual measures of confidence and lift are used.