How to use the R package arules from Python

The data need to be prepared as a pandas DataFrame. Here we have 10 transactions with three items called A, B and C. True means that a transaction contains the item.

In [1]:
import pandas as pd

df = pd.DataFrame(
    [
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, True],
        [True, True, True],
        [False, False, True],
        [False, True, True],
        [True, False, True],
    ],
    columns=list('ABC'))

df
Out[1]:
       A      B      C
0   True   True   True
1   True  False  False
2   True   True   True
3   True  False  False
4   True   True   True
5   True  False   True
6   True   True   True
7  False  False   True
8  False   True   True
9   True  False   True
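
To make the encoding concrete, the following sketch converts the same boolean rows into explicit item lists using plain Python. The names `items`, `rows` and `transactions` are chosen here for illustration only and are not part of pandas or arules.

```python
# Column labels and boolean rows copied from the data frame above.
items = ["A", "B", "C"]
rows = [
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [False, False, True],
    [False, True, True],
    [True, False, True],
]

# A transaction contains exactly those items whose flag is True.
transactions = [
    [item for item, present in zip(items, row) if present]
    for row in rows
]

for tid, transaction in enumerate(transactions):
    print(tid, transaction)
```

Each printed line pairs a row index with the items that the transaction contains.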

Next, we set up the R package arules and the Python package rpy2, which connects Python to R. To install arules, start R and run install.packages("arules"). To install rpy2, run pip install rpy2.

In [2]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

import rpy2.robjects as ro
from rpy2.robjects.packages import importr

arules = importr("arules")

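# Helper functions that convert arules objects into Python structures.
# `what` names the arules accessor to apply: "items", "lhs" or "rhs".
def arules_as_matrix(x, what = "items"):
    # Coerce the selected part to a binary matrix (rows = sets, columns = items).
    return ro.r('function(x) as(' + what + '(x), "matrix")')(x)

def arules_as_dict(x, what = "items"):
    # Coerce to an R list of item labels, then key each entry by a 0-based index.
    l = ro.r('function(x) as(' + what + '(x), "list")')(x)
    l.names = [*range(0, len(l))]
    return dict(zip(l.names, map(list,list(l))))

def arules_quality(x):
    # The "quality" slot holds the measures (support, count, confidence, ...).
    return x.slots["quality"]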

Mine frequent itemsets

In [3]:
itsets = arules.apriori(df, 
   parameter = ro.ListVector({"supp": 0.1, "target": "frequent itemsets"}))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
         NA    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen            target  ext
     10 frequent itemsets TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[3 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
sorting transactions ... done [0.00s].
writing ... [7 set(s)] done [0.00s].
creating S4 object  ... done [0.00s].
In [4]:
print(arules.DATAFRAME(itsets))
     items  support  transIdenticalToItemsets  count
1      {B}      0.5                       0.0      5
2      {A}      0.8                       0.2      8
3      {C}      0.8                       0.1      8
4    {A,B}      0.4                       0.0      4
5    {B,C}      0.5                       0.1      5
6    {A,C}      0.6                       0.2      6
7  {A,B,C}      0.4                       0.4      4
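
The support and count columns can be cross-checked without R. The sketch below, in plain Python, counts for every non-empty combination of items how many of the 10 transactions contain all of its items; support is that count divided by the number of transactions. The helper name `itemset_counts` is hypothetical, and this brute-force enumeration is only a sanity check for small data, not the Apriori algorithm that arules runs.

```python
from itertools import combinations

# Column labels and boolean rows copied from the data frame above.
items = ["A", "B", "C"]
rows = [
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [False, False, True],
    [False, True, True],
    [True, False, True],
]
transactions = [
    {item for item, present in zip(items, row) if present} for row in rows
]


def itemset_counts(transactions, items):
    """Count, for each non-empty itemset, the transactions that contain it."""
    counts = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            counts[itemset] = sum(1 for t in transactions if set(itemset) <= t)
    return counts


counts = itemset_counts(transactions, items)
n = len(transactions)
min_support = 0.1  # same threshold as "supp" in the apriori call

for itemset, count in counts.items():
    support = count / n
    if support >= min_support:
        print("{" + ",".join(itemset) + "}", support, count)
```

The itemsets are printed in a different order than in the arules output, but the support and count values should agree with the table.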

The frequent itemsets can be accessed as a binary matrix with one row per itemset and one column per item (A, B, C).

In [5]:
its = arules_as_matrix(itsets)
print(its)
[[0 1 0]
 [1 0 0]
 [0 0 1]
 [1 1 0]
 [0 1 1]
 [1 0 1]
 [1 1 1]]

Access the itemsets as a dictionary

In [6]:
its = arules_as_dict(itsets)
print(its)
{'0': ['B'], '1': ['A'], '2': ['C'], '3': ['A', 'B'], '4': ['B', 'C'], '5': ['A', 'C'], '6': ['A', 'B', 'C']}

Access the quality measures

In [7]:
arules_quality(itsets)
Out[7]:
   support  transIdenticalToItemsets  count
1      0.5                       0.0      5
2      0.8                       0.2      8
3      0.8                       0.1      8
4      0.4                       0.0      4
5      0.5                       0.1      5
6      0.6                       0.2      6
7      0.4                       0.4      4

Mine association rules

In [8]:
rules = arules.apriori(df, 
   parameter = ro.ListVector({"supp": 0.1, "conf": 0.8}))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[3 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [6 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
In [9]:
print(arules.DATAFRAME(rules))
     LHS  RHS  support  confidence  coverage  lift  count
1     {}  {A}      0.8         0.8       1.0  1.00      8
2     {}  {C}      0.8         0.8       1.0  1.00      8
3    {B}  {A}      0.4         0.8       0.5  1.00      4
4    {B}  {C}      0.5         1.0       0.5  1.25      5
5  {A,B}  {C}      0.4         1.0       0.4  1.25      4
6  {B,C}  {A}      0.4         0.8       0.5  1.00      4
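
The quality columns follow from simple counts. For a rule X => Y, support is the fraction of transactions that contain X and Y together, coverage is the support of X, confidence is support(X and Y) / support(X), and lift is confidence / support(Y). The sketch below recomputes these values in plain Python for two of the rules; `count` and `rule_measures` are hypothetical helpers written for this check, not arules functions.

```python
# Column labels and boolean rows copied from the data frame above.
items = ["A", "B", "C"]
rows = [
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, False],
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [False, False, True],
    [False, True, True],
    [True, False, True],
]
transactions = [
    {item for item, present in zip(items, row) if present} for row in rows
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rule_measures(lhs, rhs):
    """Return the measures for lhs => rhs (assumes lhs and rhs each occur at least once)."""
    n = len(transactions)
    both = count(set(lhs) | set(rhs))
    confidence = both / count(lhs)
    return {
        "support": both / n,
        "confidence": confidence,
        "coverage": count(lhs) / n,
        "lift": confidence / (count(rhs) / n),
        "count": both,
    }


print(rule_measures(["B"], ["C"]))
print(rule_measures(["B", "C"], ["A"]))
```

An empty left-hand side is contained in every transaction, so `rule_measures([], ["A"])` gives a coverage of 1.0, as in the first row of the table.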

Get the left-hand side (LHS), the right-hand side (RHS) and the quality measures of the rules.

In [10]:
lhs = arules_as_matrix(rules, what = "lhs")
print(lhs)
[[0 0 0]
 [0 0 0]
 [0 1 0]
 [0 1 0]
 [1 1 0]
 [0 1 1]]
In [11]:
rhs = arules_as_matrix(rules, what = "rhs")
print(rhs)
[[1 0 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]
 [0 0 1]
 [1 0 0]]
In [12]:
lhs = arules_as_dict(rules, what = "lhs")
print(lhs)
{'0': [], '1': [], '2': ['B'], '3': ['B'], '4': ['A', 'B'], '5': ['B', 'C']}
In [13]:
rhs = arules_as_dict(rules, what = "rhs")
print(rhs)
{'0': ['A'], '1': ['C'], '2': ['A'], '3': ['C'], '4': ['C'], '5': ['A']}
In [14]:
arules_quality(rules)
Out[14]:
   support  confidence  coverage  lift  count
1      0.8         0.8       1.0  1.00      8
2      0.8         0.8       1.0  1.00      8
3      0.4         0.8       0.5  1.00      4
4      0.5         1.0       0.5  1.25      5
5      0.4         1.0       0.4  1.25      4
6      0.4         0.8       0.5  1.00      4
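
The LHS and RHS dictionaries share the same keys, so they can be joined into readable rules. The sketch below uses dictionaries copied from the printed output so that it runs without R; `format_rule` is a hypothetical helper, not part of arules or rpy2.

```python
# Dictionaries copied from the output of arules_as_dict for "lhs" and "rhs".
lhs = {'0': [], '1': [], '2': ['B'], '3': ['B'], '4': ['A', 'B'], '5': ['B', 'C']}
rhs = {'0': ['A'], '1': ['C'], '2': ['A'], '3': ['C'], '4': ['C'], '5': ['A']}


def format_rule(lhs_items, rhs_items):
    """Render one rule in the {LHS} => {RHS} notation."""
    return "{" + ",".join(lhs_items) + "} => {" + ",".join(rhs_items) + "}"


# Both dictionaries are keyed by the same rule index, so one key looks up both sides.
rule_strings = [format_rule(lhs[key], rhs[key]) for key in lhs]

for text in rule_strings:
    print(text)
```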