Chapter 3 Data Preprocessing
While exploring data, we need to preprocess it properly for our purpose. And these work can be repeat until we construct a dataset that can make our probabilistic model well.
3.1 ifelse
ifelse
is Conditional Statments function
that is very simple and useful in R. We can use this with just 3 parameters - condition, return value when it’s true, return value when it’s false.
let’s see example code.
first, print the head of dataset
head(mpg$hwy)
## [1] 29 29 31 30 26 26
And let’s apply ifelse to this head. if the hwy(highway fuel efficiency) is higher than 30, returns ‘Good.’
ifelse(head(mpg$hwy) > 30, 'Good', 'Bad')
## [1] "Bad" "Bad" "Good" "Bad" "Bad" "Bad"
We can use this data by assign method(<-
). Let’s Control Entire variable.
<- ifelse(mpg$hwy > 30, 'Good', 'Bad')
evaluate_hwy print(evaluate_hwy)
## [1] "Bad" "Bad" "Good" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [11] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [21] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [31] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [41] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [51] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [61] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [71] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [81] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [91] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Good"
## [101] "Good" "Good" "Bad" "Good" "Good" "Good" "Good" "Bad" "Bad" "Bad"
## [111] "Bad" "Good" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [121] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [131] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [141] "Bad" "Bad" "Bad" "Good" "Good" "Bad" "Bad" "Bad" "Bad" "Bad"
## [151] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [161] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [171] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [181] "Bad" "Good" "Good" "Bad" "Bad" "Bad" "Bad" "Bad" "Good" "Good"
## [191] "Bad" "Bad" "Bad" "Bad" "Good" "Good" "Good" "Good" "Bad" "Bad"
## [201] "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [211] "Bad" "Bad" "Good" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [221] "Bad" "Good" "Good" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad" "Bad"
## [231] "Bad" "Bad" "Bad" "Bad"
Then, Why is this necessary?
Sometimes we need categorical data(nominal data or ordinal data) for not only visualization but also data handling. Especially, We do not always get gaussian distribution dataset, so since the data is not uniform, it is necessary to divide it into appropriate intervals.
We can use this function to not only create new variable but also add another column.
<- mpg
df_mpg 'eval_hwy'] <- ifelse(df_mpg$hwy>30, 'High', 'Low')
df_mpg[$eval_hwy df_mpg
## [1] "Low" "Low" "High" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [11] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [21] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [31] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [41] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [51] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [61] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [71] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [81] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [91] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "High"
## [101] "High" "High" "Low" "High" "High" "High" "High" "Low" "Low" "Low"
## [111] "Low" "High" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [121] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [131] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [141] "Low" "Low" "Low" "High" "High" "Low" "Low" "Low" "Low" "Low"
## [151] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [161] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [171] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [181] "Low" "High" "High" "Low" "Low" "Low" "Low" "Low" "High" "High"
## [191] "Low" "Low" "Low" "Low" "High" "High" "High" "High" "Low" "Low"
## [201] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [211] "Low" "Low" "High" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [221] "Low" "High" "High" "Low" "Low" "Low" "Low" "Low" "Low" "Low"
## [231] "Low" "Low" "Low" "Low"
3.2 dplyr :: mutate
In R, dplyr
package is the most popular library for data preprocessing. Especially it provides pipe(‘%>%’), and it’s very helpful for intuitive coding. And mutate
in dplyr
package do exactly same work as the one above, However it is simpler and easier to use. Let’s see this.
(before load, you have to install “dplyr” package)
# install.packages("dplyr")
library(dplyr)
head(df_mpg)
## # A tibble: 6 x 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## # … with 1 more variable: eval_hwy <chr>
We’ll use mutate
function and pipe
syntex to handle cty
(city fuel economy) column.
%>%
df_mpg mutate(mean_fuel = (cty + hwy)/2) %>%
head
## # A tibble: 6 x 13
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## # … with 2 more variables: eval_hwy <chr>, mean_fuel <dbl>
We can see the last column is added and named with mean_fuel
. We didn’t have to use the name of dataframa(df_mpg) repeatly to use columns. It’s very nice.
And of course we can use ifelse
function with the pipe
and mutate
.
%>%
df_mpg mutate(mean_fuel = (cty + hwy)/2,
eval_fuel = ifelse(mean_fuel > 20, 'high', 'low')) %>%
head
## # A tibble: 6 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>
If you want to use this data table continuously, you have to assign this table to the new variable like this.
<- df_mpg %>%
new_mpg mutate(mean_fuel = (cty + hwy)/2,
eval_fuel = ifelse(mean_fuel > 20, 'high', 'low'))
str(new_mpg)
## tibble [234 × 14] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
## $ eval_hwy : chr [1:234] "Low" "Low" "High" "Low" ...
## $ mean_fuel : num [1:234] 23.5 25 25.5 25.5 21 22 22.5 22 20.5 24 ...
## $ eval_fuel : chr [1:234] "high" "high" "high" "high" ...
qplot(new_mpg$mean_fuel)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(new_mpg$eval_fuel)
We can use more functions from dplyr
package. The commonly used functions that except mutate we have seen already are filter()
, select()
, arrange()
, summarise()
. Let’s look at these one by one.
3.3 dplyr :: filter
First, filter()
function. filter
extracts rows by some conditions.
head(new_mpg)
## # A tibble: 6 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>
If you want to extract only audi
, you can use filter
.
%>%
new_mpg filter(manufacturer=='audi')
## # A tibble: 18 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## 11 audi a4 quat… 2 2008 4 auto(s… 4 19 27 p comp…
## 12 audi a4 quat… 2.8 1999 6 auto(l… 4 15 25 p comp…
## 13 audi a4 quat… 2.8 1999 6 manual… 4 17 25 p comp…
## 14 audi a4 quat… 3.1 2008 6 auto(s… 4 17 25 p comp…
## 15 audi a4 quat… 3.1 2008 6 manual… 4 15 25 p comp…
## 16 audi a6 quat… 2.8 1999 6 auto(l… 4 15 24 p mids…
## 17 audi a6 quat… 3.1 2008 6 auto(s… 4 17 25 p mids…
## 18 audi a6 quat… 4.2 2008 8 auto(s… 4 16 23 p mids…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>
If you want to extract audi a4
model,
%>%
new_mpg filter(manufacturer=='audi' & model=='a4')
## # A tibble: 7 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>
If you want to extract audi a4
that have Good for highway fuel economy
,
%>%
new_mpg filter(manufacturer=='audi' & model=='a4' & eval_hwy == 'High')
## # A tibble: 1 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>
Yes, now you can choose the car that you want.
3.4 dplyr :: select
Next, select()
function.
We have used filter()
to extract rows, now we’ll use select()
to extract columns.
%>%
new_mpg select(manufacturer) %>%
head
## # A tibble: 6 x 1
## manufacturer
## <chr>
## 1 audi
## 2 audi
## 3 audi
## 4 audi
## 5 audi
## 6 audi
Two or more columns can be extracted at the same time.
%>%
new_mpg select(manufacturer, class, hwy) %>%
head
## # A tibble: 6 x 3
## manufacturer class hwy
## <chr> <chr> <int>
## 1 audi compact 29
## 2 audi compact 29
## 3 audi compact 31
## 4 audi compact 30
## 5 audi compact 26
## 6 audi compact 26
And with pipe, we can extract some rows and some columns at the same time too.
%>%
new_mpg select(manufacturer, model, class, hwy) %>%
filter(hwy > 23) %>%
head(10)
## # A tibble: 10 x 4
## manufacturer model class hwy
## <chr> <chr> <chr> <int>
## 1 audi a4 compact 29
## 2 audi a4 compact 29
## 3 audi a4 compact 31
## 4 audi a4 compact 30
## 5 audi a4 compact 26
## 6 audi a4 compact 26
## 7 audi a4 compact 27
## 8 audi a4 quattro compact 26
## 9 audi a4 quattro compact 25
## 10 audi a4 quattro compact 28
3.5 dplyr :: arrange
In case of numerical data(variable), we can order it. If data is categorical variable, It’ll return error.
%>%
new_mpg select(manufacturer, model, class, hwy) %>%
arrange(hwy) %>%
head(10)
## # A tibble: 10 x 4
## manufacturer model class hwy
## <chr> <chr> <chr> <int>
## 1 dodge dakota pickup 4wd pickup 12
## 2 dodge durango 4wd suv 12
## 3 dodge ram 1500 pickup 4wd pickup 12
## 4 dodge ram 1500 pickup 4wd pickup 12
## 5 jeep grand cherokee 4wd suv 12
## 6 chevrolet k1500 tahoe 4wd suv 14
## 7 jeep grand cherokee 4wd suv 14
## 8 chevrolet c1500 suburban 2wd suv 15
## 9 chevrolet k1500 tahoe 4wd suv 15
## 10 dodge dakota pickup 4wd pickup 15
If you want to sort dataset from highest to lowest highway fuel economy, wrapping arrange variable by desc()
%>%
new_mpg select(manufacturer, model, class, hwy) %>%
arrange(desc(hwy)) %>%
head(10)
## # A tibble: 10 x 4
## manufacturer model class hwy
## <chr> <chr> <chr> <int>
## 1 volkswagen jetta compact 44
## 2 volkswagen new beetle subcompact 44
## 3 volkswagen new beetle subcompact 41
## 4 toyota corolla compact 37
## 5 honda civic subcompact 36
## 6 honda civic subcompact 36
## 7 toyota corolla compact 35
## 8 toyota corolla compact 35
## 9 honda civic subcompact 34
## 10 honda civic subcompact 33
3.6 dplyr :: summarise
The function summarise()
can be used to get basic statistics by specifying functions such as mean(), sd(), median(), etc.
%>%
new_mpg filter(manufacturer=='audi') %>%
summarise(mean_hwy = mean(hwy))
## # A tibble: 1 x 1
## mean_hwy
## <dbl>
## 1 26.4
head(10)
## [1] 10
%>%
new_mpg filter(manufacturer=='audi') %>%
summarise(mean_hwy = mean(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy))
## # A tibble: 1 x 3
## mean_hwy max_hwy min_hwy
## <dbl> <int> <int>
## 1 26.4 31 23
head(10)
## [1] 10
But, this function can be more useful if used together with groupby.
%>%
new_mpg group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy))
## # A tibble: 15 x 4
## manufacturer mean_hwy max_hwy min_hwy
## <chr> <dbl> <int> <int>
## 1 audi 26.4 31 23
## 2 chevrolet 21.9 30 14
## 3 dodge 17.9 24 12
## 4 ford 19.4 26 15
## 5 honda 32.6 36 29
## 6 hyundai 26.9 31 24
## 7 jeep 17.6 22 12
## 8 land rover 16.5 18 15
## 9 lincoln 17 18 16
## 10 mercury 18 19 17
## 11 nissan 24.6 32 17
## 12 pontiac 26.4 28 25
## 13 subaru 25.6 27 23
## 14 toyota 24.9 37 15
## 15 volkswagen 29.2 44 23
we can apply arrange()
too.
%>%
new_mpg group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy)) %>%
arrange(desc(mean_hwy))
## # A tibble: 15 x 4
## manufacturer mean_hwy max_hwy min_hwy
## <chr> <dbl> <int> <int>
## 1 honda 32.6 36 29
## 2 volkswagen 29.2 44 23
## 3 hyundai 26.9 31 24
## 4 audi 26.4 31 23
## 5 pontiac 26.4 28 25
## 6 subaru 25.6 27 23
## 7 toyota 24.9 37 15
## 8 nissan 24.6 32 17
## 9 chevrolet 21.9 30 14
## 10 ford 19.4 26 15
## 11 mercury 18 19 17
## 12 dodge 17.9 24 12
## 13 jeep 17.6 22 12
## 14 lincoln 17 18 16
## 15 land rover 16.5 18 15
We select 2 or more columns to group_by
function like this.
%>%
new_mpg group_by(manufacturer, model) %>%
summarise(mean_hwy = mean(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## # A tibble: 38 x 5
## # Groups: manufacturer [15]
## manufacturer model mean_hwy max_hwy min_hwy
## <chr> <chr> <dbl> <int> <int>
## 1 audi a4 28.3 31 26
## 2 audi a4 quattro 25.8 28 25
## 3 audi a6 quattro 24 25 23
## 4 chevrolet c1500 suburban 2wd 17.8 20 15
## 5 chevrolet corvette 24.8 26 23
## 6 chevrolet k1500 tahoe 4wd 16.2 19 14
## 7 chevrolet malibu 27.6 30 26
## 8 dodge caravan 2wd 22.4 24 17
## 9 dodge dakota pickup 4wd 17 19 12
## 10 dodge durango 4wd 16 18 12
## # … with 28 more rows
and apply filter()
too.
%>%
new_mpg group_by(manufacturer, model) %>%
summarise(mean_hwy = mean(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy)) %>%
filter(manufacturer=='audi' | manufacturer=='hyundai')
## `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## # A tibble: 5 x 5
## # Groups: manufacturer [2]
## manufacturer model mean_hwy max_hwy min_hwy
## <chr> <chr> <dbl> <int> <int>
## 1 audi a4 28.3 31 26
## 2 audi a4 quattro 25.8 28 25
## 3 audi a6 quattro 24 25 23
## 4 hyundai sonata 27.7 31 26
## 5 hyundai tiburon 26 29 24
3.7 dplyr :: left_join
dplyr
package contain join
functions. First, we try left_join()
for our custom data.
<- c('Jennie', 'Tom', 'Minsu', 'Jay', 'Bob')
students <- c('A','B','C','D','E')
classes <- c('JN', 'T', 'M', 'J', 'B')
keys <- c(155, 190, 165, 177, 180)
heights <- c(57, 101, 64, 80, 88)
weights
<- data.frame(students, classes)
class_info <- data.frame(students, keys, heights, weights)
students_info
head(class_info)
## students classes
## 1 Jennie A
## 2 Tom B
## 3 Minsu C
## 4 Jay D
## 5 Bob E
head(students_info)
## students keys heights weights
## 1 Jennie JN 155 57
## 2 Tom T 190 101
## 3 Minsu M 165 64
## 4 Jay J 177 80
## 5 Bob B 180 88
If we have to combine this two table, We need some “point”s. This means some “column”s that two table have in common. In this example, students
column is that.
Let’s combine datasets by Setting students
column as a point.
<- left_join(class_info, students_info, by='students')
all_info all_info
## students classes keys heights weights
## 1 Jennie A JN 155 57
## 2 Tom B T 190 101
## 3 Minsu C M 165 64
## 4 Jay D J 177 80
## 5 Bob E B 180 88
3.8 dplyr :: right_join
What if we use right_join()
?
In this table, it’s same.
<- right_join(class_info, students_info, by='students')
all_info2 all_info2
## students classes keys heights weights
## 1 Jennie A JN 155 57
## 2 Tom B T 190 101
## 3 Minsu C M 165 64
## 4 Jay D J 177 80
## 5 Bob E B 180 88
Because left table(class_info) and right rable(students_info) have same variable(students).
If one side has Na in students
column, the result will be different.
<- students_info[-1,]
students_info2 students_info2
## students keys heights weights
## 2 Tom T 190 101
## 3 Minsu M 165 64
## 4 Jay J 177 80
## 5 Bob B 180 88
<- left_join(class_info, students_info2, by='students')
all_info3 all_info3
## students classes keys heights weights
## 1 Jennie A <NA> NA NA
## 2 Tom B T 190 101
## 3 Minsu C M 165 64
## 4 Jay D J 177 80
## 5 Bob E B 180 88
let’s see this. We have founded 3 NA values. It’s bacause students_info2 didn’t have the Jennie’s informatino(row) But,
<- right_join(class_info, students_info2, by='students')
all_info4 all_info4
## students classes keys heights weights
## 1 Tom B T 190 101
## 2 Minsu C M 165 64
## 3 Jay D J 177 80
## 4 Bob E B 180 88
If we have chosen right_join()
at this moment, we can get different result like this.
It’s because students_info2 didn’t have the Jennie’s informatino(row). Same.
right_join()
receives two inputs like left_join()
, and joins tables based on the right (second) input.
3.9 dplyr :: inner_join
How can we join tables if we met all of tables are have NA
at base column of reference?
Like this case,
<- class_info[-2,]
class_info2 class_info2
## students classes
## 1 Jennie A
## 3 Minsu C
## 4 Jay D
## 5 Bob E
<- left_join(class_info2, students_info2, by='students')
all_info4 all_info4
## students classes keys heights weights
## 1 Jennie A <NA> NA NA
## 2 Minsu C M 165 64
## 3 Jay D J 177 80
## 4 Bob E B 180 88
<- right_join(class_info2, students_info2, by='students')
all_info5 all_info5
## students classes keys heights weights
## 1 Minsu C M 165 64
## 2 Jay D J 177 80
## 3 Bob E B 180 88
## 4 Tom <NA> T 190 101
Yes, returns have NA
all as expected.
But we want to get result that without missing value. In this case, we can use inner_join()
<- inner_join(class_info2, students_info2, by='students')
all_info6 all_info6
## students classes keys heights weights
## 1 Minsu C M 165 64
## 2 Jay D J 177 80
## 3 Bob E B 180 88
3.10 dplyr :: full_join
Conversely, if we want to display all values together even if there are missing values in left or right, we can use full_join()
<- full_join(class_info2, students_info2, by='students')
all_info7 all_info7
## students classes keys heights weights
## 1 Jennie A <NA> NA NA
## 2 Minsu C M 165 64
## 3 Jay D J 177 80
## 4 Bob E B 180 88
## 5 Tom <NA> T 190 101
3.11 dplyr :: bind_cols
If two data tables have the same number of rows, you can use bind_cols()
. It is simple and return different result from join
.
bind_cols(class_info2, students_info2)
## New names:
## * students -> students...1
## * students -> students...3
## students...1 classes students...3 keys heights weights
## 1 Jennie A Tom T 190 101
## 2 Minsu C Minsu M 165 64
## 3 Jay D Jay J 177 80
## 4 Bob E Bob B 180 88
Yes, it just concatenate two tables. There is no reference column. So, same number of rows should be noted. If each number of rows are different, It returns error.
class_info
## students classes
## 1 Jennie A
## 2 Tom B
## 3 Minsu C
## 4 Jay D
## 5 Bob E
students_info2
## students keys heights weights
## 2 Tom T 190 101
## 3 Minsu M 165 64
## 4 Jay J 177 80
## 5 Bob B 180 88
bind_cols(class_info, students_info2)
## Error: Can't recycle `..1` (size 5) to match `..2` (size 4). Run `rlang::last_error()` to see where the error occurred.
3.12 dplyr :: bind_rows
Unlike bind_cols()
, it can join tables with different columns.
<- c('apple', 'banana', 'watermelon', 'mango')
fruit <- c('strawberry', 'raspberry', 'mulberry', 'wildberry')
berry
print(fruit)
## [1] "apple" "banana" "watermelon" "mango"
print(berry)
## [1] "strawberry" "raspberry" "mulberry" "wildberry"
<- data.frame(fruit)
fruit_df fruit_df
## fruit
## 1 apple
## 2 banana
## 3 watermelon
## 4 mango
<- data.frame(berry)
berry_df berry_df
## berry
## 1 strawberry
## 2 raspberry
## 3 mulberry
## 4 wildberry
$sugar <- c(3,8,2,9)
fruit_df$sugar <- c(2,1,1,1)
berry_df$color <- c('red', 'red', 'black', 'red')
berry_df
print(fruit_df)
## fruit sugar
## 1 apple 3
## 2 banana 8
## 3 watermelon 2
## 4 mango 9
print(berry_df)
## berry sugar color
## 1 strawberry 2 red
## 2 raspberry 1 red
## 3 mulberry 1 black
## 4 wildberry 1 red
<- rename(berry_df, fruit = berry)
berry_df print(berry_df)
## fruit sugar color
## 1 strawberry 2 red
## 2 raspberry 1 red
## 3 mulberry 1 black
## 4 wildberry 1 red
bind_rows(fruit_df, berry_df)
## fruit sugar color
## 1 apple 3 <NA>
## 2 banana 8 <NA>
## 3 watermelon 2 <NA>
## 4 mango 9 <NA>
## 5 strawberry 2 red
## 6 raspberry 1 red
## 7 mulberry 1 black
## 8 wildberry 1 red
Yes, It returns data table without error even if the structure of column is different. And of course, the number of rows doesn’t matter, since we’re joining rows. But let’s check them too.
bind_rows(fruit_df, berry_df[-1,])
## fruit sugar color
## 1 apple 3 <NA>
## 2 banana 8 <NA>
## 3 watermelon 2 <NA>
## 4 mango 9 <NA>
## 5 raspberry 1 red
## 6 mulberry 1 black
## 7 wildberry 1 red
Okay, Just one row decreased.
3.13 reshape :: melt
reshape
is the most representative r package for data restructuring.
We’ll cover two functions that make the reshape package useful. One melt()
that melts (deconstructs) the data, the other cast()
that casts (reconstructs) the data. As you can see from the word, the name is compared to the casting process of iron.
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
<- melt(airquality, id=c('Month','Day'))
sample_decon_df head(sample_decon_df)
## Month Day variable value
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28
summary(sample_decon_df$variable)
## Ozone Solar.R Wind Temp
## 153 153 153 153
We can see that all columns except Month and Day, specified as id values in the melt function, are expressed as variable
.
3.14 reshape :: cast
The cast()
function can cast the data melted above. Here, R’s fomula
should be handled flexibly, and you can learn it by implementing it in code.
Fomula means an expression that expresses a function like a regular expression in text terms, and it is good to learn ‘~’ and ‘+’ at the most basic level.
head(cast(sample_decon_df, Month+Day~variable))
## Month Day Ozone Solar.R Wind Temp
## 1 5 1 41 190 7.4 67
## 2 5 2 36 118 8.0 72
## 3 5 3 12 149 12.6 74
## 4 5 4 18 313 11.5 62
## 5 5 5 NA NA 14.3 56
## 6 5 6 28 NA 14.9 66
You can see that the melted data is cast. There is a difference from the original, in that an index is added and the position of the column is changed. It’s not that important, so let’s move on.
In formula, we defined to use both Month and Day as dependent variables through the ‘+’ sign, and defined variable as the independent variable through the ‘~’ sign.
More precisely, what comes before the ‘~’ sign is the dependent variable, and what comes after the sign is the independent variable. This can be thought of as f(x)~x.
Let’s try more.
cast(sample_decon_df, Month~variable)
## Aggregation requires fun.aggregate: length used as default
## Month Ozone Solar.R Wind Temp
## 1 5 31 31 31 31
## 2 6 30 30 30 30
## 3 7 31 31 31 31
## 4 8 31 31 31 31
## 5 9 30 30 30 30
Only month was used as the dependent variable. The explanation of the fomula
above will make more sense.
cast(sample_decon_df, Month~Day+variable)[1:4,1:5]
## Month 1_Ozone 1_Solar.R 1_Wind 1_Temp
## 1 5 41 190 7.4 67
## 2 6 NA 286 8.6 78
## 3 7 135 269 4.1 84
## 4 8 39 83 6.9 81
This time, Day was added to the independent variable, not the dependent variable. It’s a little complicated, but if you understand it, you can see that it is possible to further subdivide each independent variable into days, such as ozone on day 1 and solar on day 1.
3.15 reshape :: cast with aggregation
Cast also has data aggregation capabilities. It is very simple and powerful. let’s try.
head(cast(sample_decon_df, Month+Day~variable, mean))
## Month Day Ozone Solar.R Wind Temp
## 1 5 1 41 190 7.4 67
## 2 5 2 36 118 8.0 72
## 3 5 3 12 149 12.6 74
## 4 5 4 18 313 11.5 62
## 5 5 5 NA NA 14.3 56
## 6 5 6 28 NA 14.9 66
It doesn’t look any different from the original. This is because there is only one data value for each date, so it cannot be aggregated. To be precise, the aggregation for a day is the same as the data for that day.
Except for Day, you can look at monthly data like this.
cast(sample_decon_df, Month~variable, mean)
## Month Ozone Solar.R Wind Temp
## 1 5 NA NA 11.622581 65.54839
## 2 6 NA 190.1667 10.266667 79.10000
## 3 7 NA 216.4839 8.941935 83.90323
## 4 8 NA NA 8.793548 83.96774
## 5 9 NA 167.4333 10.180000 76.90000
Wait, if there is NA in some data like this, it is displayed as NA because aggregation is impossible. To prevent this we can set na.rm
to True.
cast(sample_decon_df, Month~variable, mean, na.rm=T)
## Month Ozone Solar.R Wind Temp
## 1 5 23.61538 181.2963 11.622581 65.54839
## 2 6 29.44444 190.1667 10.266667 79.10000
## 3 7 59.11538 216.4839 8.941935 83.90323
## 4 8 59.96154 171.8571 8.793548 83.96774
## 5 9 31.44828 167.4333 10.180000 76.90000
Of course, it is possible to count other than average.
cast(sample_decon_df, Month~variable, max, na.rm=T)
## Month Ozone Solar.R Wind Temp
## 1 5 115 334 20.1 81
## 2 6 71 332 20.7 93
## 3 7 135 314 14.9 92
## 4 8 168 273 15.5 97
## 5 9 96 259 16.6 93
cast(sample_decon_df, Month~variable, min, na.rm=T)
## Month Ozone Solar.R Wind Temp
## 1 5 1 8 5.7 56
## 2 6 12 31 1.7 65
## 3 7 7 7 4.1 73
## 4 8 9 24 2.3 72
## 5 9 7 14 2.8 63