How to use the R package arules from Python using arulespy
This document is also available as an IPython notebook or you can open and run it directly in Google Colab.
The package can be installed using pip via the terminal
pip install arulespy
Or use the following magic command (note: use %conda if you use conda)
%pip install arulespy
The code below may be needed for Windows users.
## Windows users: These environment variables may be necessary till rpy2 does this automatically
#from rpy2 import situation
#import os
#r_home = situation.r_home_from_registry()
#r_bin = r_home + '\\bin\\x64\\'
#os.environ['R_HOME'] = r_home
#os.environ['PATH'] = r_bin + ";" + os.environ['PATH']
Basic Usage¶
Import the arules module from the package arulespy. This will take a while the first time you run it, since it needs to install all the required R packages.
from arulespy.arules import Transactions, apriori, parameters, concat
Creating transaction data¶
The data need to be prepared as a Pandas dataframe. Here we have 10 transactions with three items called A, B and C. True means that a transaction contains the item.
import pandas as pd
df = pd.DataFrame(
    [
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, True],
        [True, True, True],
        [False, False, True],
        [False, True, True],
        [True, False, True],
    ],
    columns=list('ABC'))
A | B | C | |
0 | True | True | True |
1 | True | False | False |
2 | True | True | True |
3 | True | False | False |
4 | True | True | True |
5 | True | False | True |
6 | True | True | True |
7 | False | False | True |
8 | False | True | True |
9 | True | False | True |
Convert the pandas dataframe into a sparse transactions object.
trans = Transactions.from_df(df)
transactions in sparse format with 10 transactions (rows) and 3 items (columns)
items | transactionID | |
1 | {A,B,C} | 0 |
2 | {A} | 1 |
3 | {A,B,C} | 2 |
4 | {A} | 3 |
5 | {A,B,C} | 4 |
6 | {A,C} | 5 |
7 | {A,B,C} | 6 |
8 | {C} | 7 |
9 | {B,C} | 8 |
10 | {A,C} | 9 |
['A', 'B', 'C']
Working with transactions¶
We can calculate item frequencies, sample transactions or remove duplicate transactions. All available functions can be found at the end of this document.
trans.itemFrequency(type = 'relative')
[0.8, 0.5, 0.8]
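To make the numbers above concrete, the relative item frequency can be recomputed by hand. This is a minimal pure-Python sketch that does not use arulespy; the helper name relative_item_frequency is made up for illustration.

```python
# The ten transactions from the dataframe above, one tuple per transaction.
rows = [
    (True, True, True),
    (True, False, False),
    (True, True, True),
    (True, False, False),
    (True, True, True),
    (True, False, True),
    (True, True, True),
    (False, False, True),
    (False, True, True),
    (True, False, True),
]
items = ["A", "B", "C"]


def relative_item_frequency(rows):
    """Fraction of transactions that contain each item (one value per column)."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


freq = relative_item_frequency(rows)
print(dict(zip(items, freq)))  # {'A': 0.8, 'B': 0.5, 'C': 0.8}
```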
items | transactionID | |
8 | {C} | 7 |
10 | {A,C} | 9 |
9 | {B,C} | 8 |
items | transactionID | |
1 | {A,B,C} | 0 |
2 | {A} | 1 |
6 | {A,C} | 5 |
8 | {C} | 7 |
9 | {B,C} | 8 |
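The five-row table above is consistent with keeping only the first occurrence of each distinct itemset. The following pure-Python sketch illustrates that idea under this assumption; it is not the arules implementation, and unique_transactions is a hypothetical helper.

```python
transactions = [
    ["A", "B", "C"], ["A"], ["A", "B", "C"], ["A"], ["A", "B", "C"],
    ["A", "C"], ["A", "B", "C"], ["C"], ["B", "C"], ["A", "C"],
]


def unique_transactions(transactions):
    """Keep the first occurrence of every distinct itemset with its position."""
    seen = set()
    kept = []
    for tid, items in enumerate(transactions):
        key = frozenset(items)  # item order does not matter for equality
        if key not in seen:
            seen.add(key)
            kept.append((tid, sorted(items)))
    return kept


unique = unique_transactions(transactions)
for tid, items in unique:
    print(tid, items)
```

The positions that survive (0, 1, 5, 7, 8) match the transactionID column of the table.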
Create new data from a pandas dataframe that uses the same encoding as an existing transaction set. Note that the following dataframe has the columns (items) in reverse order; this is corrected when the item encoding of trans is used.
trans2 = Transactions.from_df(pd.DataFrame(
    [
        [True, True, False],
        [False, False, True],
    ],
    columns=list('CBA')), trans)
items | transactionID | |
1 | {B,C} | 0 |
2 | {A} | 1 |
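What reusing an item encoding means can be sketched without arulespy: every row is reinterpreted by column name and then written out in the order of the existing item labels. The function encode_rows below is a hypothetical helper for illustration only.

```python
def encode_rows(rows, columns, item_labels):
    """Re-express boolean rows given in `columns` order as itemsets
    ordered by the existing `item_labels`."""
    encoded = []
    for row in rows:
        present = {col for col, flag in zip(columns, row) if flag}
        unknown = present - set(item_labels)
        if unknown:
            raise ValueError(f"items not in the encoding: {sorted(unknown)}")
        encoded.append([label for label in item_labels if label in present])
    return encoded


rows = [[True, True, False], [False, False, True]]
encoded = encode_rows(rows, columns=list("CBA"), item_labels=["A", "B", "C"])
print(encoded)  # [['B', 'C'], ['A']]
```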
Create transactions from a list of lists. Note that the order of the items is fixed to match trans.
trans3 = Transactions.from_list([['B', 'A'], ['C']], trans)
items | |
1 | {A,B} |
2 | {C} |
Add the new transaction to the existing transactions.
concat([trans, trans2]).as_df()
items | transactionID | |
1 | {A,B,C} | 0 |
2 | {A} | 1 |
3 | {A,B,C} | 2 |
4 | {A} | 3 |
5 | {A,B,C} | 4 |
6 | {A,C} | 5 |
7 | {A,B,C} | 6 |
8 | {C} | 7 |
9 | {B,C} | 8 |
10 | {A,C} | 9 |
11 | {B,C} | 0 |
12 | {A} | 1 |
Converting transactions into Python data structures¶
Transactions can be converted into several Python formats, including 0-1 matrices, lists of item labels, lists of item indices, or a sparse matrix.
array([[1, 1, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1], [1, 0, 1]], dtype=int32)
[['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C']]
[[1, 2, 3], [1], [1, 2, 3], [1], [1, 2, 3], [1, 3], [1, 2, 3], [3], [2, 3], [1, 3]]
<3x10 sparse matrix of type '<class 'numpy.int64'>' with 21 stored elements in Compressed Sparse Column format>
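The relationship between these representations can be reproduced in plain Python. This is only a sketch of the data layouts shown above, not of the arulespy conversion methods; the 1-based item indices follow the index lists in the output.

```python
item_labels = ["A", "B", "C"]
transactions = [
    ["A", "B", "C"], ["A"], ["A", "B", "C"], ["A"], ["A", "B", "C"],
    ["A", "C"], ["A", "B", "C"], ["C"], ["B", "C"], ["A", "C"],
]

index_of = {label: i for i, label in enumerate(item_labels)}

# 0-1 matrix: one row per transaction, one column per item.
matrix = [[int(label in t) for label in item_labels] for t in transactions]

# Item indices, 1-based like the index lists shown above.
index_lists = [[index_of[label] + 1 for label in t] for t in transactions]

# Number of non-zero cells, i.e. the entries a sparse matrix has to store.
stored = sum(map(sum, matrix))

print(matrix[5], index_lists[5], stored)
```

The count of non-zero cells agrees with the 21 stored elements reported for the sparse matrix.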
Mixing nominal and numeric variables¶
A dataframe with nominal and numeric variables can also be converted. Nominal variables are converted into items of the form variable=value; numeric variables are first discretized (see arules.discretizeDF()).
df2 = pd.DataFrame(
    [
        ['red', 12, True],
        ['blue', 10, False],
        ['red', 18, True],
        ['green', 18, False],
        ['red', 16, True],
        ['blue', 9, False],
    ],
    columns=list(['color', 'size', 'class']))
trans2 = Transactions.from_df(df2)
items | transactionID | |
1 | {color=red,size=[11.3,16.7),class} | 0 |
2 | {color=blue,size=[9,11.3)} | 1 |
3 | {color=red,size=[16.7,18],class} | 2 |
4 | {color=green,size=[16.7,18]} | 3 |
5 | {color=red,size=[11.3,16.7),class} | 4 |
6 | {color=blue,size=[9,11.3)} | 5 |
Details on item label creation can be retrieved using arules.itemInfo()
R[write to console]: In addition: R[write to console]: Warning message: R[write to console]: Column(s) 1, 2 not logical or factor. Applying default discretization (see '? discretizeDF').
labels | variables | levels | |
1 | color=blue | color | blue |
2 | color=green | color | green |
3 | color=red | color | red |
4 | size=[9,11.3) | size | [9,11.3) |
5 | size=[11.3,16.7) | size | [11.3,16.7) |
6 | size=[16.7,18] | size | [16.7,18] |
7 | class | class | TRUE |
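The item labels above can be approximated by hand. The sketch below assumes that the default discretization is equal-frequency binning into three intervals with linearly interpolated quantiles; this assumption reproduces the break points 11.3 and 16.7 shown above, but it is not confirmed by this document. All function names are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def equal_frequency_breaks(values, bins=3):
    """Break points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    return [quantile(ordered, k / bins) for k in range(bins + 1)]


def interval_label(value, breaks):
    """Label such as '[9,11.3)'; only the last interval is closed on the right."""
    last = len(breaks) - 2
    for k in range(last + 1):
        lo, hi = breaks[k], breaks[k + 1]
        if value < hi or k == last:
            closing = "]" if k == last else ")"
            return f"[{lo:.3g},{hi:.3g}{closing}"


rows = [("red", 12, True), ("blue", 10, False), ("red", 18, True),
        ("green", 18, False), ("red", 16, True), ("blue", 9, False)]
breaks = equal_frequency_breaks([size for _, size, _ in rows])


def to_items(color, size, cls):
    """Encode one row as items of the form variable=value."""
    items = [f"color={color}", f"size={interval_label(size, breaks)}"]
    if cls:  # a logical column becomes an item only when it is True
        items.append("class")
    return items


for row in rows:
    print(to_items(*row))
```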
Mine association rules¶
apriori() calls the Apriori algorithm and converts the result into a Python arulespy.arules.Rules object. Parameters for the algorithm are specified as a dict inside parameters().
rules = apriori(trans,
parameter = parameters({"supp": 0.1, "conf": 0.8}),
control = parameters({"verbose": False}))
LHS | RHS | support | confidence | coverage | lift | count | |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
2 | {} | {C} | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
3 | {B} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
4 | {B} | {C} | 0.5 | 1.0 | 0.5 | 1.25 | 5 |
5 | {A,B} | {C} | 0.4 | 1.0 | 0.4 | 1.25 | 4 |
6 | {B,C} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
support | confidence | coverage | lift | count | |
1 | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
2 | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
3 | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
4 | 0.5 | 1.0 | 0.5 | 1.25 | 5 |
5 | 0.4 | 1.0 | 0.4 | 1.25 | 4 |
6 | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
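The quality columns follow from simple counting. As a sketch using the standard definitions (support, confidence, coverage and lift), the values for the rule {B} => {C} can be checked in plain Python; the helpers count and rule_quality are made up for this example.

```python
# The ten example transactions, written compactly as strings of item labels.
transactions = [set(t) for t in (
    "ABC", "A", "ABC", "A", "ABC", "AC", "ABC", "C", "BC", "AC")]
n = len(transactions)


def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def rule_quality(lhs, rhs):
    """Standard interest measures for the rule lhs => rhs."""
    both = set(lhs) | set(rhs)
    confidence = count(both) / count(lhs)
    return {
        "support": count(both) / n,
        "confidence": confidence,
        "coverage": count(lhs) / n,              # support of the LHS alone
        "lift": confidence / (count(rhs) / n),   # confidence relative to P(RHS)
        "count": count(both),
    }


quality = rule_quality("B", "C")
print(quality)
```

An empty LHS is contained in every transaction, so rules such as {} => {A} have a coverage of 1 and a confidence equal to their support.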
Python-style len() and slicing are available.
LHS | RHS | support | confidence | coverage | lift | count | |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.0 | 8 |
2 | {} | {C} | 0.8 | 0.8 | 1.0 | 1.0 | 8 |
3 | {B} | {A} | 0.4 | 0.8 | 0.5 | 1.0 | 4 |
rules[[True, False, True, False, True, False]].as_df()
LHS | RHS | support | confidence | coverage | lift | count | |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
3 | {B} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
5 | {A,B} | {C} | 0.4 | 1.0 | 0.4 | 1.25 | 4 |
Accessing Rules¶
Rules can be converted into various Python data structures.
['{} => {A}', '{} => {C}', '{B} => {A}', '{B} => {C}', '{A,B} => {C}', '{B,C} => {A}']
items | |
1 | {A} |
2 | {C} |
3 | {A,B} |
4 | {B,C} |
5 | {A,B,C} |
6 | {A,B,C} |
items | |
1 | {} |
2 | {} |
3 | {B} |
4 | {B} |
5 | {A,B} |
6 | {B,C} |
[[], [], ['B'], ['B'], ['A', 'B'], ['B', 'C']]
items | |
1 | {A} |
2 | {C} |
3 | {A} |
4 | {C} |
5 | {C} |
6 | {A} |
The LHS and RHS of rules are of type itemMatrix, just as transactions are. Therefore, all conversions (to lists, sparse matrices, etc.) are also available.
rules.sort(by = 'lift').as_df()
LHS | RHS | support | confidence | coverage | lift | count | |
4 | {B} | {C} | 0.5 | 1.0 | 0.5 | 1.25 | 5 |
5 | {A,B} | {C} | 0.4 | 1.0 | 0.4 | 1.25 | 4 |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
2 | {} | {C} | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
3 | {B} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
6 | {B,C} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
Work With Interest Measures¶
Interest measures are stored as the quality attribute in rules and itemsets.
support | confidence | coverage | lift | count | |
1 | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
2 | 0.8 | 0.8 | 1.0 | 1.00 | 8 |
3 | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
4 | 0.5 | 1.0 | 0.5 | 1.25 | 5 |
5 | 0.4 | 1.0 | 0.4 | 1.25 | 4 |
6 | 0.4 | 0.8 | 0.5 | 1.00 | 4 |
Additional interest measures can be calculated with interestMeasure() and added to rules or itemsets using addQuality(). See all available measures. To calculate some measures, transactions need to be specified.
im = rules.interestMeasure(["phi", 'support'])
phi | support | |
1 | NaN | 0.8 |
2 | NaN | 0.8 |
3 | 0.000000 | 0.4 |
4 | 0.500000 | 0.5 |
5 | 0.408248 | 0.4 |
6 | 0.000000 | 0.4 |
LHS | RHS | support | confidence | coverage | lift | count | phi | |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.00 | 8 | NaN |
2 | {} | {C} | 0.8 | 0.8 | 1.0 | 1.00 | 8 | NaN |
3 | {B} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 | 0.000000 |
4 | {B} | {C} | 0.5 | 1.0 | 0.5 | 1.25 | 5 | 0.500000 |
5 | {A,B} | {C} | 0.4 | 1.0 | 0.4 | 1.25 | 4 | 0.408248 |
6 | {B,C} | {A} | 0.4 | 0.8 | 0.5 | 1.00 | 4 | 0.000000 |
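The phi values above agree with the usual phi correlation coefficient of a 2x2 contingency table. The sketch below assumes that definition (the document itself does not state the formula) and shows why rules with an empty LHS yield NaN.

```python
import math


def phi(p_xy, p_x, p_y):
    """Phi coefficient from the joint support p_xy and the marginal
    supports p_x (LHS) and p_y (RHS)."""
    denom = math.sqrt(p_x * p_y * (1 - p_x) * (1 - p_y))
    if denom == 0:
        # Happens when a marginal is 0 or 1, e.g. an empty LHS with P(X) = 1.
        return math.nan
    return (p_xy - p_x * p_y) / denom


print(phi(0.8, 1.0, 0.8))  # {} => {A}: nan
print(phi(0.5, 0.5, 0.8))  # {B} => {C}
print(phi(0.4, 0.4, 0.8))  # {A,B} => {C}
```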
Filter Redundant Rules¶
rules[[not x for x in rules.is_redundant()]].as_df()
LHS | RHS | support | confidence | coverage | lift | count | phi | |
1 | {} | {A} | 0.8 | 0.8 | 1.0 | 1.00 | 8 | NaN |
2 | {} | {C} | 0.8 | 0.8 | 1.0 | 1.00 | 8 | NaN |
4 | {B} | {C} | 0.5 | 1.0 | 0.5 | 1.25 | 5 | 0.5 |
[False, False, True, False, True, True]
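The flags above can be reproduced with a simple definition of redundancy: a rule is redundant if a more general rule (same RHS, LHS a proper subset) has the same or higher confidence. This definition is an assumption that matches the output here; the sketch is not the arules implementation, and redundant_flags is a hypothetical name.

```python
# (LHS, RHS item, confidence) for the six mined rules.
rules = [
    (frozenset(), "A", 0.8),
    (frozenset(), "C", 0.8),
    (frozenset("B"), "A", 0.8),
    (frozenset("B"), "C", 1.0),
    (frozenset("AB"), "C", 1.0),
    (frozenset("BC"), "A", 0.8),
]


def redundant_flags(rules):
    """True where a more general rule is at least as confident."""
    flags = []
    for lhs, rhs, conf in rules:
        flags.append(any(
            other_rhs == rhs and other_lhs < lhs and other_conf >= conf
            for other_lhs, other_rhs, other_conf in rules))
    return flags


flags = redundant_flags(rules)
print(flags)  # [False, False, True, False, True, True]
```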
Find maximal rules.
[False, False, False, False, True, True]
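The output above matches the following reading of maximality: a rule is maximal if no other rule in the set uses a proper superset of its items (LHS and RHS combined). A minimal sketch under that assumption, with the hypothetical helper maximal_flags:

```python
# Items (LHS and RHS combined) of the six mined rules.
itemsets = [frozenset(s) for s in ("A", "C", "AB", "BC", "ABC", "ABC")]


def maximal_flags(itemsets):
    """True where no other itemset is a proper superset."""
    return [not any(s < other for other in itemsets) for s in itemsets]


maximal = maximal_flags(itemsets)
print(maximal)  # [False, False, False, False, True, True]
```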
Create Rules Objects¶
To import rules from other tools or to create rules manually, rules for arules can be created from lists of sets of items. The item labels (i.e., the sparse representation) are taken from the transactions trans. The LHS and RHS of rules are of type itemMatrix and can be created by conversion from pandas data frames or lists of lists.
import rpy2.robjects as ro
from arulespy.arules import Rules, ItemMatrix
trans = Transactions.from_df(pd.read_csv(""))
lhs = [
    ['hair', 'milk', 'predator'],
    ['hair', 'tail', 'predator'],
    ['fins'],
]
rhs = [
    ['type=mammal'],
    ['type=mammal'],
    ['type=fish'],
]
r = Rules.new(ItemMatrix.from_list(lhs, itemLabels = trans),
              ItemMatrix.from_list(rhs, itemLabels = trans))
LHS | RHS | |
1 | {hair,milk,predator} | {type=mammal} |
2 | {hair,predator,tail} | {type=mammal} |
3 | {fins} | {type=fish} |
Next, we add interest measures calculated on the transactions.
r.addQuality(r.interestMeasure(['support', 'confidence', 'lift'], trans))
R[write to console]: In addition: R[write to console]: Warning message: R[write to console]: Column(s) 13, 17 not logical or factor. Applying default discretization (see '? discretizeDF').
LHS | RHS | support | confidence | lift | |
1 | {hair,milk,predator} | {type=mammal} | 0.20 | 1.00 | 2.46 |
2 | {hair,predator,tail} | {type=mammal} | 0.16 | 1.00 | 2.46 |
3 | {fins} | {type=fish} | 0.13 | 0.76 | 5.94 |
Find Super and Subsets¶
Subset calculation returns a large binary matrix. Since this matrix is usually sparse, it is represented as a sparse matrix. For example, is_superset() can be used to check which transactions contain the items in the LHS of the rules. The result is a sparse matrix with one row per transaction and one column per rule.
superset = trans.is_superset(r.lhs(), sparse = True)
<101x3 sparse matrix of type '<class 'numpy.int64'>' with 53 stored elements in Compressed Sparse Column format>
superset[0:1, ].toarray()
array([[1, 0, 0]])
Show first row as a dense vector. Transaction 1 is a superset of the LHS of the first rule. That is, transaction 1 contains the items in the LHS of Rule 1.
print("Transaction 1:", trans[0:1].as_list(), "\n")
print("Rule 1:\n", r[0:1].as_df())
Transaction 1: [['hair', 'milk', 'predator', 'toothed', 'backbone', 'breathes', 'legs=[4,8]', 'catsize', 'type=mammal']]

Rule 1:
                    LHS            RHS  support  confidence      lift
1  {hair,milk,predator}  {type=mammal}  0.19802         1.0  2.463415
This information can be used to find the LHS support count for the three rules by summing along the columns.
superset.sum(axis = 0)
matrix([[20, 16, 17]])
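The same idea works on a small scale without any library. The sketch below uses four made-up transactions (not the dataset above) to build a dense transactions-by-rules superset matrix and sum each column to get the LHS support counts.

```python
# Toy data invented for illustration; not the transactions used above.
transactions = [
    {"hair", "milk", "predator", "tail"},
    {"fins", "tail"},
    {"hair", "milk"},
    {"fins", "predator"},
]
lhs_list = [
    {"hair", "milk", "predator"},
    {"hair", "tail", "predator"},
    {"fins"},
]

# superset[i][j] is 1 if transaction i contains every item of LHS j.
superset = [[int(lhs <= t) for lhs in lhs_list] for t in transactions]

# Summing each column gives the number of transactions covering that LHS.
lhs_counts = [sum(column) for column in zip(*superset)]

print(superset)
print(lhs_counts)
```

Dividing each count by the number of transactions gives the coverage of the corresponding rule.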
Online Help for Functions Available via arulespy¶
Help on function wrapper in module arulespy.arules: wrapper(*args, **kwargs) Wrapper around an R function. The docstring below is built from the R documentation. description ----------- Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. apriori( data, parameter = rinterface.NULL, appearance = rinterface.NULL, control = rinterface.NULL, ___ = (was "..."). R ellipsis (any number of parameters), ) Args: data : object of class transactions. Any data structure which can be coerced into transactions (e.g., a binary matrix, a data.frame or a tibble) can also be specified and will be internally coerced to transactions. parameter : object of class APparameter or named list. The default behavior is to mine rules with minimum support of 0.1, minimum confidence of 0.8, maximum of 10 items (maxlen), and a maximal time for subset checking of 5 seconds (‘maxtime’). appearance : object of class APappearance or named list. With this argument item appearance can be restricted (implements rule templates). By default all items can appear unrestricted. control : object of class APcontrol or named list. Controls the algorithmic performance of the mining algorithm (item sorting, report progress (verbose), etc.) ... : Additional arguments are for convenience added to the parameter list. details ------- The Apriori algorithm (Agrawal et al, 1993) employs level-wise search for frequent itemsets. The used C implementation of Apriori by Christian Borgelt (2003) includes some improvements (e.g., a prefix tree and item sorting). Warning about automatic conversion of matrices or data.frames to transactions. It is preferred to create transactions manually before calling apriori() to have control over item coding. This is especially important when you are working with multiple datasets or several subsets of the same dataset. To read about item coding, see itemCoding . 
If a data.frame is specified as x , then the data is automatically converted into transactions by discretizing numeric data using discretizeDF() and then coercion to transactions. The discretization may fail if the data is not well behaved. Apriori only creates rules with one item in the RHS (Consequent). The default value in APparameter for minlen is 1. This meains that rules with only one item (i.e., an empty antecedent/LHS) like \{\} => \{beer\} {} => {beer} will be created. These rules mean that no matter what other items are involved, the item in the RHS will appear with the probability given by the rule's confidence (which equals the support). If you want to avoid these rules then use the argument parameter = list(minlen = 2) . Notes on run time and memory usage: If the minimum support is chosen too low for the dataset, then the algorithm will try to create an extremely large set of itemsets/rules. This will result in very long run time and eventually the process will run out of memory. To prevent this, the default maximal length of itemsets/rules is restricted to 10 items (via the parameter element maxlen = 10 ) and the time for checking subsets is limited to 5 seconds (via maxtime = 5 ). The output will show if you hit these limits in the "checking subsets" line of the output. The time limit is only checked when the subset size increases, so it may run significantly longer than what you specify in maxtime. Setting maxtime = 0 disables the time limit. Interrupting execution with Control-C/Esc is not recommended. Memory cleanup will be prevented resulting in a memory leak. Also, interrupts are only checked when the subset size increases, so it may take some time till the execution actually stops.
Low-level R arules interface¶
arules functions can also be called directly using R_arules.<arules R function>() and R_arulesViz.<arulesViz R function>(). The result will be an rpy2 data type. Transactions, itemsets and rules can be converted manually to Python classes using arules2py().
from arulespy.arules import R_arules, Itemsets, arules2py
Help on DocumentedSTFunction in module rpy2.robjects.functions: <rpy2.robjects.functions.DocumentedSTFunction ob...ebe43c0> [RTYPES.CLOSXP] R classes: ('function',) Wrapper around an R function. The docstring below is built from the R documentation. description ----------- Simulate random transactions using different methods. random.patterns( nItems, nPats = 2000.0, method = rinterface.NULL, lPats = 4.0, corr = 0.5, cmean = 0.5, cvar = 0.1, iWeight = rinterface.NULL, verbose = False, ) Args: nItems : an integer. Number of items to simulate nTrans : an integer. Number of transactions to simulate method : name of the simulation method used (see Details Section). ... : further arguments used for the specific simulation method (see details). verbose : report progress? nPats : number of patterns (potential maximal frequent itemsets) used. lPats : average length of patterns. corr : correlation between consecutive patterns. cmean : mean of the corruption level (normal distribution). cvar : variance of the corruption level. iWeight : item selection weights to build patterns. details ------- Currently two simulation methods are implemented: "independent" (Hahsler et al, 2006): All items are treated as independent. The transaction size is determined by rpois(lambda - 1) + 1 , where lambda can be specified (defaults to 3). Note that one subtracted from lambda and added to the size to avoid empty transactions. The items in the transactions are randomly chosen using the numeric probability vector iProb of length nItems (default: 0.01 for each item). "agrawal" (see Agrawal and Srikant, 1994): This method creates transactions with correlated items using random.patters() . The simulation is a two-stage process. First, a set of nPats patterns (potential maximal frequent itemsets) is generated. The length of the patterns is Poisson distributed with mean lPats and consecutive patterns share some items controlled by the correlation parameter corr . 
For later use, for each pattern a pattern weight is generated by drawing from an exponential distribution with a mean of 1 and a corruption level is chosen from a normal distribution with mean cmean and variance cvar . The function returns the patterns as an itemsets objects which can be supplied to random.transactions() as the argument patterns . If no argument patterns is supplied, the default values given above are used. In the second step, the transactions are generated using the patterns. The length the transactions follows a Poisson distribution with mean lPats . For each transaction, patterns are randomly chosen using the pattern weights till the transaction length is reached. For each chosen pattern, the associated corruption level is used to drop some items before adding the pattern to the transaction.
its_r = R_arules.random_patterns(100, 10)
<rpy2.robjects.methods.RS4 object at 0x7f441886f600> [RTYPES.S4SXP] R classes: ('itemsets',)
Since we called an R function directly, we need to manually wrap the R object as a Python object before we use it in Python.
its_p = Itemsets(its_r)
items | pWeights | pCorrupts | |
1 | {item51,item53,item55,item59} | 0.016862 | 0.000000 |
2 | {item7,item10,item51,item62,item78} | 0.094877 | 0.479575 |
3 | {item62,item91} | 0.136921 | 0.030957 |
4 | {item53,item62,item76,item98} | 0.116791 | 0.770604 |
5 | {item53,item61,item74,item78,item93} | 0.184119 | 0.689259 |
6 | {item53,item61,item74,item93} | 0.261557 | 0.808408 |
7 | {item61,item93} | 0.019522 | 0.000000 |
8 | {item23,item79,item92} | 0.007628 | 0.860331 |
9 | {item23,item32,item62,item75,item82,item92} | 0.114453 | 0.892963 |
10 | {item62,item82} | 0.047270 | 0.856183 |
trans = arules2py(R_arules.random_transactions(10, 1000))
transactions in sparse format with 1000 transactions (rows) and 10 items (columns)
Access the sparse representation directly.
from scipy.sparse import csc_matrix
<10x1000 sparse matrix of type '<class 'numpy.int64'>' with 2976 stored elements in Compressed Sparse Column format>
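The shapes reported in this document (3x10 and 10x1000) suggest that the sparse representation stores items as rows and transactions as columns. Under that assumption, a compressed sparse column layout can be sketched in plain Python; to_csc is a hypothetical helper and the three transactions are toy data.

```python
def to_csc(transactions, item_labels):
    """Column-compressed layout with one column per transaction.

    `indices` holds the row (item) index of every stored entry and
    `indptr[j]:indptr[j + 1]` delimits the entries of transaction j.
    """
    row_of = {label: i for i, label in enumerate(item_labels)}
    indices, indptr = [], [0]
    for t in transactions:
        indices.extend(sorted(row_of[label] for label in t))
        indptr.append(len(indices))
    return indices, indptr


indices, indptr = to_csc([["A", "B", "C"], ["A"], ["B", "C"]], ["A", "B", "C"])
print(indices)  # [0, 1, 2, 0, 1, 2]
print(indptr)   # [0, 3, 4, 6]
```

Because all stored values are 1, the two index arrays describe the transactions completely.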