1 Introduction

Data mining has the goal of finding patterns in large data sets. The popular data mining textbook Introduction to Data Mining (Tan et al. 2017) covers many important aspects of data mining. This companion contains annotated R code examples to complement the textbook. To make following along easier, we follow the chapters in data mining textbook which is organized by the main data mining tasks:

  1. Data covers types of data and also includes data preparation and exploratory data analysis in the chapter.

  2. Classification: Basic Concepts introduces the purpose of classification, basic classifiers using decision trees, and model training and evaluation.

  3. Classification: Alternative Techniques introduces and compares methods including rule-based classifiers, nearest neighbor classifiers, naive Bayes classifier, logistic regression and artificial neural networks.

  4. Association Analysis: Basic Concepts covers algorithms for frequent itemset and association rule generation and analysis including visualization.

  5. Association Analysis: Advanced Concepts covers categorical and continuous attributes, concept hierarchies, and frequent sequence pattern mining.

  6. Cluster Analysis discusses clustering approaches including k-means, hierarchical clustering, DBSCAN and how to evaluate clustering results.

For completeness, we have added chapters on Regression* and on Logistic Regression*. Sections in this book followed by an asterisk contain code examples for methods that are not described in the data mining textbook.

This book assumes that you are familiar with the basics of R, how to run R code, and install packages. The rest of this chapter will provide an overview and point you to where you can learn more about R and the used packages.

1.1 Used Software

To use this book you need to have R and RStudio Desktop installed.

Each book chapter will use a set of packages that must be installed. The installation code can be found at the beginning of each chapter. Here is the code to install the packages used in this chapter:

pkgs <- c('tidyverse')

pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The code examples in this book use the R package collection tidyverse (Wickham 2023c) to manipulate data. Tidyverse also includes the package ggplot2 (Wickham et al. 2024) for visualization. Tidyverse packages make working with data in R very convenient. Data analysis and data mining reports are typically done by creating R Markdown documents. Everything in R is built on top of the core R programming language and the packages that are automatically installed with R. This is referred to as Base-R.

1.2 Base-R

Base-R is covered in detail in An Introduction to R. The most important differences between R and many other programming languages like Python are:

  • R is a functional programming language (with some extensions).
  • R only uses vectors and operations are vectorized. You rarely will see loops.
  • R starts indexing into vectors with 1 not 0.
  • R uses <- for assignment. Do not use = for assignment.

1.2.1 Vectors

The basic data structure in R is a vector of real numbers called numeric. Scalars do not exist, they are just vectors of length 1.

We can combine values into a vector using the combine function c(). Special values for infinity (Inf) and missing values (NA) can be used.

x <- c(10.4, 5.6, Inf, NA, 21.7)
x
## [1] 10.4  5.6  Inf   NA 21.7

We often use sequences. A simple sequence of integers can be produced using the colon operator in the form from:to.

3:10
## [1]  3  4  5  6  7  8  9 10

More complicated sequences can be created using seq().

y <- seq(from = 0, to = 1, length.out = 5)
y
## [1] 0.00 0.25 0.50 0.75 1.00

1.2.1.1 Vectorized Operations

Operations are vectorized and applied to each element, so loops are typically not necessary in R.

x + 1
## [1] 11.4  6.6  Inf   NA 22.7

Comparisons are also vectorized and are performed element-wise. They return a logical vector (R’s name for the datatype Boolean).

x > y
## [1] TRUE TRUE TRUE   NA TRUE

1.2.1.2 Subsetting Vectors

We can select vector elements using the [ operator like in other programming languages but the index always starts at 1.

We can select elements 1 through 3 using an index sequence.

x[1:3]
## [1] 10.4  5.6  Inf

Negative indices remove elements. We can select all but the first element.

x[-1]
## [1]  5.6  Inf   NA 21.7

We can use any function that creates a logical vector for subsetting. Here we select all non-missing values using the function is.na().

is.na(x)
## [1] FALSE FALSE FALSE  TRUE FALSE
x[!is.na(x)] # select all non-missing values
## [1] 10.4  5.6  Inf 21.7

We can also assign values to a selection. For example, this code gets rid of infinite values.

x[!is.finite(x)] <- NA
x
## [1] 10.4  5.6   NA   NA 21.7

Another useful thing is that vectors and many other data structures in R support names. For example, we can store the count of cars with different colors in the following way.

counts <- c(12, 5, 18, 13)
names(counts) <- c("black", "red", "white", "silver")

counts
##  black    red  white silver 
##     12      5     18     13

Names can then be used for subsetting and we do not have to remember what entry is associated with what color.

counts[c("red", "silver")]
##    red silver 
##      5     13

1.2.2 Lists

Lists can store a sequence of elements of different datatypes.

lst <- list(name = "Fred", wife = "Mary", no.children = 3,
  child.ages = c(4, 7, 9))
lst
## $name
## [1] "Fred"
## 
## $wife
## [1] "Mary"
## 
## $no.children
## [1] 3
## 
## $child.ages
## [1] 4 7 9

Elements can be accessed by index or name.

lst[[2]]
## [1] "Mary"
lst$child.ages
## [1] 4 7 9

Many functions in R return a list.

1.2.3 Data Frames

A data frame looks like a spread sheet and represents a matrix-like structure with rows and columns where different columns can contain different data types.

df <-  data.frame(name = c("Michael", "Mark", "Maggie"), children = c(2, 0, 2))
df
##      name children
## 1 Michael        2
## 2    Mark        0
## 3  Maggie        2

Data frames are stored as a list of columns. We can access elements using matrix subsetting (missing indices mean the whole row or column) or list subsetting. Here are some examples

df[1,1]
## [1] "Michael"
df[, 2]
## [1] 2 0 2
df$name
## [1] "Michael" "Mark"    "Maggie"

Data frames are the most important way to represent data for data mining.

1.2.4 Functions

Functions work like in many other languages. However, they also operate vectorized and arguments can be specified positional or by named. An increment function can be defined as:

inc <- function(x, by = 1) { 
    x + by 
  }

The value of the last evaluated expression in the function body is automatically returned by the function. The function can also explicitly use return(return_value).

Calling the increment function on vector with the numbers 1 through 10.

v <- 1:10
inc(v, by = 2)
##  [1]  3  4  5  6  7  8  9 10 11 12

Before you implement your own function, check if R does not already provide an implementation. R has many built-in functions like min(), max(), mean(), and sum().

1.2.5 Strings

R uses character vectors where the datatype character represents a string and not like in other programming language just a single character. R accepts double or single quotation marks to delimit strings.

string <- c("Hello", "Goodbye")
string
## [1] "Hello"   "Goodbye"

Strings can be combined using paste().

paste(string, "World!")
## [1] "Hello World!"   "Goodbye World!"

Note that paste is vectorized and the string "World!" is used twice to match the length of string. This behavior is called in R recycling and works if one vector’s length is an exact multiple of the other vector’s length. The special case where one vector has one element is particularly often used. We have used it already above in the expression x + 1 where one is recycled for each element in x.

1.2.6 Plotting

Basic plotting in Base-R is done by calling plot().

plot(x, y)

Other plot functions are pairs(), hist(), barplot(). In this book, we will focus on plots created with the ggplot2 package introduced below.

1.2.7 Objects

Other often used data structures include list, data.frame, matrix, and factor. In R, everything is an object. Objects are printed by either just using the object’s name or put it explicitly into the print() function.

x
## [1] 10.4  5.6   NA   NA 21.7

For many objects, a summary can be created.

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     5.6     8.0    10.4    12.6    16.1    21.7       2

Objects have a class.

class(x)
## [1] "numeric"

Some functions are generic, meaning that they do something else depending on the class of the first argument. For example, plot() has many methods implemented to visualize data of different classes. The specific function is specified with a period after the method name. There also exists a default method. For plot, this method is plot.default(). Different methods may have their own manual page, so specifying the class with the dot can help finding the documentation.

It is often useful to know what information it stored in an object. str() returns a humanly readable string representation of the object.

str(x)
##  num [1:5] 10.4 5.6 NA NA 21.7

There is much to learn about R. It is highly recommended to go through the official An Introduction to R manual. There is also a good Base R Cheat Sheet available.

1.3 R Markdown

R Markdown is a simple method to include R code inside a text document written using markdown syntax. Using R markdown is especially convention to analyze data and compose a data mining report. RStudio makes creating and translating R Markdown document easy. Just choose File -> New File -> R Markdown... RStudio will create a small demo markdown document that already includes examples for code and how to include plots. You can switch to visual mode if you prefer a What You See Is What You Get editor.

To convert the R markdown document to HTML, PDF or Word, just click the Knit button. All the code will be executed and combined with the text into a complete document.

Examples and detailed documentation can be found on RStudio’s R Markdown website and in the R Markdown Cheatsheet.

1.4 Tidyverse

tidyverse (Wickham 2023c) is a collection of many useful packages that work well together by sharing design principles and data structures. tidyverse also includes ggplot2 (Wickham et al. 2024) for visualization.

In this book, we will typically use

A good introduction can be found in the Section on Data Transformation (Wickham, Çetinkaya-Rundel, and Grolemund 2023).

To use tidyverse, we first have to load the tidyverse meta package.

library(tidyverse)
## ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Conflicts may indicate that base R functionality is changed by tidyverse packages.

1.4.1 Tibbles

Tibbles are tidyverse’s replacement for R’s data.frame. Here is a short example that analyses the vitamin C content of different fruits that will get you familiar with the basic syntax. We create a tibble with the price in dollars per pound and the vitamin C content in milligrams (mg) per pound for five different fruits.

fruit <- tibble(
  name = c("apple", "banana", "mango", "orange", "lime"), 
  price = c(2.5, 2.0, 4.0, 3.5, 2.5), 
  vitamin_c = c(20, 45, 130, 250, 132),
  type = c("pome", "tropical", "tropical", "citrus", "citrus"))
fruit
## # A tibble: 5 × 4
##   name   price vitamin_c type    
##   <chr>  <dbl>     <dbl> <chr>   
## 1 apple    2.5        20 pome    
## 2 banana   2          45 tropical
## 3 mango    4         130 tropical
## 4 orange   3.5       250 citrus  
## 5 lime     2.5       132 citrus

Next, we will transform the data to find affordable fruits with a high vitamin C content and then visualize the data.

1.4.2 Transformations

We can modify the table by adding a column with the vitamin C (in mg) that a dollar buys using mutate(). Then we filter only rows with fruit that provides more than 20 mg per dollar, and finally we arrange the data rows by the vitamin C per dollar from largest to smallest.

affordable_vitamin_c_sources <- fruit |>
  mutate(vitamin_c_per_dollar = vitamin_c / price) |> 
  filter(vitamin_c_per_dollar > 20) |>
  arrange(desc(vitamin_c_per_dollar))

affordable_vitamin_c_sources 
## # A tibble: 4 × 5
##   name   price vitamin_c type     vitamin_c_per_dollar
##   <chr>  <dbl>     <dbl> <chr>                   <dbl>
## 1 orange   3.5       250 citrus                   71.4
## 2 lime     2.5       132 citrus                   52.8
## 3 mango    4         130 tropical                 32.5
## 4 banana   2          45 tropical                 22.5

The pipes operator |> lets you pass the value to the left (often the result of a function) as the first argument to the function on the right. This makes composing a sequence of function calls to transform data much easier to write and read. You will often see %>% as the pipe operator, especially in examples using tidyverse. Both operators work similarly where |> is a native R operator while %>% is provided in the extension package magrittr.

The code above starts with the fruit data and pipes it through three transformation functions. The final result is assigned with <- to a variable.

We can create summary statistics of price using summarize().

affordable_vitamin_c_sources |> 
  summarize(min = min(price), 
            mean = mean(price), 
            max = max(price))
## # A tibble: 1 × 3
##     min  mean   max
##   <dbl> <dbl> <dbl>
## 1     2     3     4

We can also calculate statistics for groups by first grouping the data with group_by(). Here we produce statistics for fruit types.

affordable_vitamin_c_sources |> 
  group_by(type) |>
  summarize(min = min(price), 
            mean = mean(price), 
            max = max(price))
## # A tibble: 2 × 4
##   type       min  mean   max
##   <chr>    <dbl> <dbl> <dbl>
## 1 citrus     2.5     3   3.5
## 2 tropical   2       3   4

Often, we want to apply the same function to multiple columns. This can be achieved using across(columns, function).

affordable_vitamin_c_sources |> 
  summarize(across(c(price, vitamin_c), mean))
## # A tibble: 1 × 2
##   price vitamin_c
##   <dbl>     <dbl>
## 1     3      139.

dplyr syntax and evaluation is slightly different from standard R which may lead to confusion. One example is that column names can be references without quotation marks. A very useful reference resource when working with dplyr is the RStudio Data Transformation Cheatsheet which covers on two pages almost everything you will need and also contains contains simple example code that you can take and modify for your use case.

1.4.3 ggplot2

For visualization, we will use mainly ggplot2. The gg in ggplot2 stands for Grammar of Graphics introduced by Wilkinson (2005). The main idea is that every graph is built from the same basic components:

  • the data,
  • a coordinate system, and
  • visual marks representing the data (geoms).

In ggplot2, the components are combined using the + operator.

ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
  geom_point()

Since we typically use a Cartesian coordinate system, ggplot uses that by default. Each geom_ function uses a stat_ function to calculate what is visualizes. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ? geom_bar). geom_point just uses the stat "identity" to display the points using the coordinates as they are. A great introduction can be found in the Section on Data Visualization (Wickham, Çetinkaya-Rundel, and Grolemund 2023), and very useful is RStudio’s Data Visualization Cheatsheet.

We can visualize our fruit data as a scatter plot.

ggplot(fruit, aes(x = price, y = vitamin_c)) + 
  geom_point()

It is easy to add more geoms. For example, we can add a regression line using geom_smooth with the method "lm" (linear model). We suppress the confidence interval since we only have 3 data points.

ggplot(fruit, aes(x = price, y = vitamin_c)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Alternatively, we can visualize each fruit’s vitamin C content per dollar using a bar chart.

ggplot(fruit, aes(x = name, y = vitamin_c)) + 
  geom_bar(stat = "identity")

Note that geom_bar by default uses the stat_count function to aggregate data by counting, but we just want to visualize the value in the tibble, so we specify the identity function instead.