1 Introduction

Data mining aims to find patterns in large data sets. In this chapter, we will talk about data, its characteristics, and how it is prepared for data mining.

This book is organized following the main data mining tasks:

  1. Data preparation and exploratory data analysis (Chapter 2)
  2. Classification (Chapters 3 and 4)
  3. Association analysis (Chapters 5 and 6)
  4. Clustering (Chapter 7)

First, we describe the software you will need.

1.1 Used Software

This companion book assumes that you have R and RStudio Desktop installed and that you are familiar with the basics of R, including how to run R code and how to install packages.

If you are new to R, working through the official R manual An Introduction to R (Venables, Smith, and the R Core Team 2021) will get you started. Many introductory videos for RStudio are available; a basic one that shows how to run code and how to install packages will suffice.

Each book chapter will use a set of packages that must be installed. The installation code can be found at the beginning of each chapter. Here is the code to install the packages used in this chapter:

pkgs <- sort(c('tidyverse', 'ggplot2'))

pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are ggplot2 (Wickham, Chang, et al. 2023) and tidyverse (Wickham 2023b).
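After running the installation code, it can be helpful to confirm that the packages can actually be loaded. The following is a minimal sketch using base R's requireNamespace(); the variable name available is only illustrative and does not come from the book.

```r
# Packages installed by the chapter's setup code
pkgs <- c("ggplot2", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded,
# without attaching it to the search path
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
available

# Names of any packages that could not be loaded
names(available)[!available]
```

If the last line returns an empty character vector, all packages for the chapter are ready to use.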

The code in this book uses tidyverse to manipulate data and ggplot2 for visualization. A great introduction to these useful tools can be found in the freely available web book R for Data Science by Wickham and Grolemund (2017).

1.2 Tidyverse

library(tidyverse)
## ── Attaching core tidyverse packages ────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidyverse (Wickham 2023b) is a collection of many useful packages that work well together by sharing design principles and data structures. tidyverse also includes ggplot2 (Wickham, Chang, et al. 2023) for visualization.

In this book, we will use

  • tidyverse tibbles, which will often replace R’s built-in data.frames,
  • the pipe operator |> to chain functions together, and
  • data transformation functions like filter(), arrange(), select(), group_by(), and mutate() provided by the tidyverse package dplyr.
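One practical difference between tibbles and data.frames is how single-column subsetting behaves. The sketch below uses a small made-up data set (df and tb are illustrative names) and assumes the tibble package is installed.

```r
library(tibble)

df <- data.frame(name = c("apple", "banana"), price = c(2.5, 2.0))
tb <- as_tibble(df)

# Selecting a single column with [ simplifies a data.frame to a vector ...
df_col <- df[, "price"]
is.data.frame(df_col)

# ... while the same operation on a tibble returns a tibble
tb_col <- tb[, "price"]
is_tibble(tb_col)
```

This consistency means that code written for tibbles does not need special cases for results with a single column.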

A good introduction can be found in the section on data wrangling in R for Data Science (Wickham and Grolemund 2017), and a useful reference is the RStudio Data Transformation Cheat Sheet.

Here is a short example to familiarize you with the basic syntax. We create a tibble with the price in dollars per pound and the vitamin C content in milligrams (mg) per pound for three fruits.

fruit <- tibble(
  name = c("apple", "banana", "orange"), 
  price = c(2.5, 2.0, 3.5), 
  vitamin_c = c(20, 45, 250))
fruit
## # A tibble: 3 × 3
##   name   price vitamin_c
##   <chr>  <dbl>     <dbl>
## 1 apple    2.5        20
## 2 banana   2          45
## 3 orange   3.5       250

Now we add a column with the vitamin C (in mg) that a dollar buys you, keep only the fruits that provide more than 20 mg per dollar, and then order (arrange) the data by vitamin C per dollar from largest to smallest.

affordable_vitamin_c_sources <- fruit |>
  mutate(vitamin_c_per_dollar = vitamin_c / price) |> 
  filter(vitamin_c_per_dollar > 20) |>
  arrange(desc(vitamin_c_per_dollar))

affordable_vitamin_c_sources 
## # A tibble: 2 × 4
##   name   price vitamin_c vitamin_c_per_dollar
##   <chr>  <dbl>     <dbl>                <dbl>
## 1 orange   3.5       250                 71.4
## 2 banana   2          45                 22.5

The pipe operator |> lets you compose a sequence of function calls more readably by passing the value on its left as the first argument to the function on its right.
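To make the equivalence concrete, the following sketch computes the same value once with nested function calls and once with the pipe. It uses only base R functions and assumes R 4.1 or later, where the native pipe |> is available; the vector x is made up for illustration.

```r
x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# The same computation with the pipe reads from left to right
piped <- x |> sqrt() |> mean() |> round(digits = 1)

# Both forms produce the same result
identical(nested, piped)
```

Additional arguments, such as digits = 1, are written in the call as usual; the piped value is inserted in front of them.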

1.3 ggplot2

For visualization, we will mainly use ggplot2. The gg in ggplot2 stands for The Grammar of Graphics introduced by Wilkinson (2005). The main idea is that every graph is built from the same basic components:

  • the data,
  • a coordinate system, and
  • visual marks representing the data (geoms).

In ggplot2, the components are combined using the + operator.

ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) + geom_point()

Since we typically use a Cartesian coordinate system, ggplot uses that by default. Each geom_ function uses a stat_ function to calculate what it visualizes. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ?geom_bar). geom_point just uses the stat "identity" to display the points using the coordinates as they are. A great introduction can be found in the chapter on data visualization in R for Data Science (Wickham and Grolemund 2017), and RStudio’s Data Visualization Cheat Sheet is a very useful reference.
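To see what stat_count computes, one can inspect the data that ggplot2 prepares for a layer. The following sketch uses a small made-up data frame (marbles is a hypothetical example, not data from the book) and ggplot2's layer_data() helper.

```r
library(ggplot2)

# Four observations of a categorical variable
marbles <- data.frame(color = c("red", "blue", "red", "red"))

# geom_bar() only needs x; stat_count supplies the bar heights
p <- ggplot(marbles, aes(x = color)) +
  geom_bar()

# Data computed for the first (and only) layer, one row per bar
bars <- layer_data(p)
bars$count
```

The count column holds the number of rows for each distinct value of color, which is what geom_bar draws as the bar height.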

We can visualize our fruit data as a scatter plot.

ggplot(fruit, aes(x = price, y = vitamin_c)) + 
  geom_point()

It is easy to add more geoms. For example, we can add a regression line using geom_smooth with the method "lm" (linear model). We suppress the confidence interval since we only have 3 data points.

ggplot(fruit, aes(x = price, y = vitamin_c)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Alternatively, we can visualize each fruit’s vitamin C content using a bar chart.

ggplot(fruit, aes(x = name, y = vitamin_c)) + 
  geom_bar(stat = "identity")

Note that geom_bar by default uses the stat_count statistic to aggregate data by counting, but we just want to visualize the values already available in the tibble, so we specify the identity statistic instead.
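ggplot2 also provides geom_col(), which uses the identity statistic by default and can serve as a shorthand for this case. The sketch below recreates the fruit tibble from above so that it runs on its own; the variable name p is illustrative.

```r
library(ggplot2)
library(tibble)

fruit <- tibble(
  name = c("apple", "banana", "orange"),
  price = c(2.5, 2.0, 3.5),
  vitamin_c = c(20, 45, 250))

# geom_col() draws bars whose heights are the y values as given
p <- ggplot(fruit, aes(x = name, y = vitamin_c)) +
  geom_col()
p
```

Both versions should produce the same chart; which one to use is largely a matter of preference.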