1 Introduction
Data mining aims to find patterns in large data sets. In this chapter, we will talk about data, its characteristics, and how it is prepared for data mining.
This book is organized following the main data mining tasks:
- Data preparation and exploratory data analysis (Chapter 2)
- Classification (Chapters 3 and 4)
- Association analysis (Chapters 5 and 6)
- Clustering (Chapter 7)
First, we discuss the required software.
1.1 Used Software
This companion book assumes that you have R and RStudio Desktop installed and are familiar with the basics of R, including how to run R code and how to install packages.
If you are new to R, working through the official R manual An Introduction to R (Venables, Smith, and the R Core Team 2021) will get you started. There are many introduction videos for RStudio available, and a basic video that shows how to run code and how to install packages will suffice.
Each book chapter will use a set of packages that must be installed. The installation code can be found at the beginning of each chapter. Here is the code to install the packages used in this chapter:
pkgs <- sort(c('tidyverse', 'ggplot2'))
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are: ggplot2 (Wickham, Chang, et al. 2023) and tidyverse (Wickham 2023b).
The code in this book uses tidyverse to manipulate data and ggplot2 for visualization. A great introduction to these useful tools can be found in the freely available web book R for Data Science by Wickham and Grolemund (2017).
1.2 Tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tidyverse (Wickham 2023b) is a collection of many useful packages that work well together by sharing design principles and data structures. tidyverse also includes ggplot2 (Wickham, Chang, et al. 2023) for visualization.
In this book, we will use
- tidyverse tibbles, often in place of R’s built-in data.frames,
- the pipe operator |> to chain functions together, and
- data transformation functions like filter(), arrange(), select(), group_by(), and mutate() provided by the tidyverse package dplyr.
A good introduction can be found in the Section on Data Wrangling (Wickham and Grolemund 2017), and a useful reference resource is the RStudio Data Transformation Cheat Sheet.
Here is a short example that will get you familiar with the basic syntax. We create a tibble with the price in dollars per pound and the vitamin C content in milligrams (mg) per pound for three fruits.
fruit <- tibble(
  name = c("apple", "banana", "orange"),
  price = c(2.5, 2.0, 3.5),
  vitamin_c = c(20, 45, 250))
fruit
## # A tibble: 3 × 3
## name price vitamin_c
## <chr> <dbl> <dbl>
## 1 apple 2.5 20
## 2 banana 2 45
## 3 orange 3.5 250
Now we add a column with the vitamin C (in mg) that a dollar buys you, keep only the fruits that provide more than 20 mg per dollar, and then order (arrange) the data by the vitamin C per dollar from largest to smallest.
affordable_vitamin_c_sources <- fruit |>
  mutate(vitamin_c_per_dollar = vitamin_c / price) |>
  filter(vitamin_c_per_dollar > 20) |>
  arrange(desc(vitamin_c_per_dollar))
affordable_vitamin_c_sources
## # A tibble: 2 × 4
## name price vitamin_c vitamin_c_per_dollar
## <chr> <dbl> <dbl> <dbl>
## 1 orange 3.5 250 71.4
## 2 banana 2 45 22.5
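The example above does not use select() or group_by(). The following is a minimal sketch of both, assuming the same fruit tibble; the citrus column and the use of summarize() are made up for illustration and are not part of the original example.

```r
library(tidyverse)

fruit <- tibble(
  name = c("apple", "banana", "orange"),
  price = c(2.5, 2.0, 3.5),
  vitamin_c = c(20, 45, 250))

# select() keeps only the listed columns
fruit |> select(name, vitamin_c)

# Hypothetical grouping variable (an assumption for this sketch):
# mark each fruit as citrus or not, then average vitamin C per group
vitamin_c_by_group <- fruit |>
  mutate(citrus = name %in% c("orange")) |>
  group_by(citrus) |>
  summarize(mean_vitamin_c = mean(vitamin_c))
vitamin_c_by_group
```

group_by() on its own only marks the groups; a following function such as summarize() then produces one row per group.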
The pipe operator |> lets you compose a sequence of function calls more readably by passing the value on its left as the first argument to the function on its right.
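To make this concrete, the sketch below writes the same two steps once with the pipe and once as nested function calls. It assumes the fruit tibble from above; the price threshold of 2 is chosen only for illustration.

```r
library(tidyverse)

fruit <- tibble(
  name = c("apple", "banana", "orange"),
  price = c(2.5, 2.0, 3.5),
  vitamin_c = c(20, 45, 250))

# With the pipe: the steps read from top to bottom
piped <- fruit |>
  filter(price > 2) |>
  arrange(desc(price))

# Without the pipe: the same calls, nested from the inside out
nested <- arrange(filter(fruit, price > 2), desc(price))

# Both versions should produce the same tibble
all.equal(piped, nested)
```

The nested form becomes hard to read as more steps are added, which is why the piped form is used throughout.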
1.3 ggplot2
For visualization, we will mainly use ggplot2. The gg in ggplot2 stands for The Grammar of Graphics introduced by Wilkinson (2005). The main idea is that every graph is built from the same basic components:
- the data,
- a coordinate system, and
- visual marks representing the data (geoms).
In ggplot2, the components are combined using the + operator.
ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
  geom_point()
Since we typically use a Cartesian coordinate system, ggplot2 uses that by default. Each geom_ function uses a stat_ function to calculate what it visualizes. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ? geom_bar). geom_point just uses the stat "identity" to display the points using the coordinates as they are. A great introduction can be found in the Chapter on Data Visualization (Wickham and Grolemund 2017), and RStudio’s Data Visualization Cheat Sheet is a very useful reference.
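The counting behavior of geom_bar can be seen in a small sketch. The sales data frame below is hypothetical and only serves to show repeated values; layer_data() is a ggplot2 helper that returns the values computed for a plot layer.

```r
library(ggplot2)

# Hypothetical data (not from the book): one row per fruit sold
sales <- data.frame(fruit = c("apple", "banana", "apple", "orange", "apple"))

# geom_bar() uses stat_count by default, so the bar height is
# the number of rows for each value of fruit
p <- ggplot(sales, aes(x = fruit)) +
  geom_bar()
p

# Inspect the counts that the stat computed for the first layer
layer_data(p)$count
```

Only the x aesthetic is mapped here because stat_count supplies the y values itself.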
We can visualize our fruit data as a scatter plot.
ggplot(fruit, aes(x = price, y = vitamin_c)) +
  geom_point()
It is easy to add more geoms. For example, we can add a regression line using geom_smooth with the method "lm" (linear model). We suppress the confidence interval since we only have 3 data points.
ggplot(fruit, aes(x = price, y = vitamin_c)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Alternatively, we can visualize each fruit’s vitamin C content per dollar using a bar chart. Note that geom_bar by default uses the stat_count statistic to aggregate data by counting, but we just want to visualize the value already available in the tibble, so we specify the identity statistic instead.
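One way this could look is sketched below, assuming the fruit tibble from above with the vitamin_c_per_dollar column recomputed so that the block is self-contained. The variable name fruit_per_dollar is chosen here for illustration, and geom_col() is shown as ggplot2's shorthand for a bar chart with the identity statistic.

```r
library(tidyverse)

fruit <- tibble(
  name = c("apple", "banana", "orange"),
  price = c(2.5, 2.0, 3.5),
  vitamin_c = c(20, 45, 250))

# Recompute the per-dollar column for all three fruits
fruit_per_dollar <- fruit |>
  mutate(vitamin_c_per_dollar = vitamin_c / price)

# stat = "identity" draws the y values as they are instead of counting rows
p <- ggplot(fruit_per_dollar, aes(x = name, y = vitamin_c_per_dollar)) +
  geom_bar(stat = "identity")
p

# geom_col() is a shorthand that uses the identity statistic by default
ggplot(fruit_per_dollar, aes(x = name, y = vitamin_c_per_dollar)) +
  geom_col()
```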