Chapter 3 Data Preprocessing

While exploring data, we need to preprocess it properly for our purpose. And these work can be repeat until we construct a dataset that can make our probabilistic model well.

3.1 ifelse

ifelse is Conditional Statments function that is very simple and useful in R. We can use this with just 3 parameters - condition, return value when it’s true, return value when it’s false.

let’s see example code.

first, print the head of dataset

head(mpg$hwy)
## [1] 29 29 31 30 26 26

And let’s apply ifelse to this head. if the hwy(highway fuel efficiency) is higher than 30, returns ‘Good.’

ifelse(head(mpg$hwy) > 30, 'Good', 'Bad')
## [1] "Bad"  "Bad"  "Good" "Bad"  "Bad"  "Bad"

We can use this data by assign method(<-). Let’s Control Entire variable.

evaluate_hwy <- ifelse(mpg$hwy > 30, 'Good', 'Bad')
print(evaluate_hwy)
##   [1] "Bad"  "Bad"  "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [11] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [21] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [31] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [41] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [51] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [61] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [71] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [81] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
##  [91] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Good"
## [101] "Good" "Good" "Bad"  "Good" "Good" "Good" "Good" "Bad"  "Bad"  "Bad" 
## [111] "Bad"  "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [121] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [131] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [141] "Bad"  "Bad"  "Bad"  "Good" "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [151] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [161] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [171] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [181] "Bad"  "Good" "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Good" "Good"
## [191] "Bad"  "Bad"  "Bad"  "Bad"  "Good" "Good" "Good" "Good" "Bad"  "Bad" 
## [201] "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [211] "Bad"  "Bad"  "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [221] "Bad"  "Good" "Good" "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad"  "Bad" 
## [231] "Bad"  "Bad"  "Bad"  "Bad"

Then, Why is this necessary?

Sometimes we need categorical data(nominal data or ordinal data) for not only visualization but also data handling. Especially, We do not always get gaussian distribution dataset, so since the data is not uniform, it is necessary to divide it into appropriate intervals.

We can use this function to not only create new variable but also add another column.

df_mpg <- mpg
df_mpg['eval_hwy'] <- ifelse(df_mpg$hwy>30, 'High', 'Low')
df_mpg$eval_hwy
##   [1] "Low"  "Low"  "High" "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [11] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [21] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [31] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [41] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [51] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [61] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [71] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [81] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
##  [91] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "High"
## [101] "High" "High" "Low"  "High" "High" "High" "High" "Low"  "Low"  "Low" 
## [111] "Low"  "High" "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [121] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [131] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [141] "Low"  "Low"  "Low"  "High" "High" "Low"  "Low"  "Low"  "Low"  "Low" 
## [151] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [161] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [171] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [181] "Low"  "High" "High" "Low"  "Low"  "Low"  "Low"  "Low"  "High" "High"
## [191] "Low"  "Low"  "Low"  "Low"  "High" "High" "High" "High" "Low"  "Low" 
## [201] "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [211] "Low"  "Low"  "High" "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [221] "Low"  "High" "High" "Low"  "Low"  "Low"  "Low"  "Low"  "Low"  "Low" 
## [231] "Low"  "Low"  "Low"  "Low"

3.2 dplyr :: mutate

In R, dplyr package is the most popular library for data preprocessing. Especially it provides pipe(‘%>%’), and it’s very helpful for intuitive coding. And mutate in dplyr package do exactly same work as the one above, However it is simpler and easier to use. Let’s see this.

(before load, you have to install “dplyr” package)

# install.packages("dplyr") 
library(dplyr)
head(df_mpg)
## # A tibble: 6 x 12
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # … with 1 more variable: eval_hwy <chr>

We’ll use mutate function and pipe syntex to handle cty(city fuel economy) column.

df_mpg %>%
  mutate(mean_fuel = (cty + hwy)/2) %>%
  head
## # A tibble: 6 x 13
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # … with 2 more variables: eval_hwy <chr>, mean_fuel <dbl>

We can see the last column is added and named with mean_fuel. We didn’t have to use the name of dataframa(df_mpg) repeatly to use columns. It’s very nice.

And of course we can use ifelse function with the pipe and mutate.

df_mpg %>%
  mutate(mean_fuel = (cty + hwy)/2, 
         eval_fuel = ifelse(mean_fuel > 20, 'high', 'low')) %>%
  head
## # A tibble: 6 x 14
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>

If you want to use this data table continuously, you have to assign this table to the new variable like this.

new_mpg <- df_mpg %>%
            mutate(mean_fuel = (cty + hwy)/2, 
            eval_fuel = ifelse(mean_fuel > 20, 'high', 'low'))
str(new_mpg)
## tibble [234 × 14] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
##  $ eval_hwy    : chr [1:234] "Low" "Low" "High" "Low" ...
##  $ mean_fuel   : num [1:234] 23.5 25 25.5 25.5 21 22 22.5 22 20.5 24 ...
##  $ eval_fuel   : chr [1:234] "high" "high" "high" "high" ...
qplot(new_mpg$mean_fuel)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(new_mpg$eval_fuel)

We can use more functions from dplyr package. The commonly used functions that except mutate we have seen already are filter(), select(), arrange(), summarise(). Let’s look at these one by one.

3.3 dplyr :: filter

First, filter() function. filter extracts rows by some conditions.

head(new_mpg)
## # A tibble: 6 x 14
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>

If you want to extract only audi, you can use filter.

new_mpg %>%
  filter(manufacturer=='audi')
## # A tibble: 18 x 14
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## 11 audi         a4 quat…   2    2008     4 auto(s… 4        19    27 p     comp…
## 12 audi         a4 quat…   2.8  1999     6 auto(l… 4        15    25 p     comp…
## 13 audi         a4 quat…   2.8  1999     6 manual… 4        17    25 p     comp…
## 14 audi         a4 quat…   3.1  2008     6 auto(s… 4        17    25 p     comp…
## 15 audi         a4 quat…   3.1  2008     6 manual… 4        15    25 p     comp…
## 16 audi         a6 quat…   2.8  1999     6 auto(l… 4        15    24 p     mids…
## 17 audi         a6 quat…   3.1  2008     6 auto(s… 4        17    25 p     mids…
## 18 audi         a6 quat…   4.2  2008     8 auto(s… 4        16    23 p     mids…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>

If you want to extract audi a4 model,

new_mpg %>%
  filter(manufacturer=='audi' & model=='a4')
## # A tibble: 7 x 14
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## 7 audi         a4      3.1  2008     6 auto(av)   f        18    27 p     compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>

If you want to extract audi a4 that have Good for highway fuel economy,

new_mpg %>%
  filter(manufacturer=='audi' & model=='a4' & eval_hwy == 'High')
## # A tibble: 1 x 14
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4        2  2008     4 manual(m6) f        20    31 p     compa…
## # … with 3 more variables: eval_hwy <chr>, mean_fuel <dbl>, eval_fuel <chr>

Yes, now you can choose the car that you want.

3.4 dplyr :: select

Next, select() function. We have used filter() to extract rows, now we’ll use select() to extract columns.

new_mpg %>%
  select(manufacturer) %>%
  head
## # A tibble: 6 x 1
##   manufacturer
##   <chr>       
## 1 audi        
## 2 audi        
## 3 audi        
## 4 audi        
## 5 audi        
## 6 audi

Two or more columns can be extracted at the same time.

new_mpg %>%
  select(manufacturer, class, hwy) %>%
  head
## # A tibble: 6 x 3
##   manufacturer class     hwy
##   <chr>        <chr>   <int>
## 1 audi         compact    29
## 2 audi         compact    29
## 3 audi         compact    31
## 4 audi         compact    30
## 5 audi         compact    26
## 6 audi         compact    26

And with pipe, we can extract some rows and some columns at the same time too.

new_mpg %>%
  select(manufacturer, model, class, hwy) %>%
  filter(hwy > 23) %>%
  head(10)
## # A tibble: 10 x 4
##    manufacturer model      class     hwy
##    <chr>        <chr>      <chr>   <int>
##  1 audi         a4         compact    29
##  2 audi         a4         compact    29
##  3 audi         a4         compact    31
##  4 audi         a4         compact    30
##  5 audi         a4         compact    26
##  6 audi         a4         compact    26
##  7 audi         a4         compact    27
##  8 audi         a4 quattro compact    26
##  9 audi         a4 quattro compact    25
## 10 audi         a4 quattro compact    28

3.5 dplyr :: arrange

In case of numerical data(variable), we can order it. If data is categorical variable, It’ll return error.

new_mpg %>%
  select(manufacturer, model, class, hwy) %>%
  arrange(hwy) %>%
  head(10)
## # A tibble: 10 x 4
##    manufacturer model               class    hwy
##    <chr>        <chr>               <chr>  <int>
##  1 dodge        dakota pickup 4wd   pickup    12
##  2 dodge        durango 4wd         suv       12
##  3 dodge        ram 1500 pickup 4wd pickup    12
##  4 dodge        ram 1500 pickup 4wd pickup    12
##  5 jeep         grand cherokee 4wd  suv       12
##  6 chevrolet    k1500 tahoe 4wd     suv       14
##  7 jeep         grand cherokee 4wd  suv       14
##  8 chevrolet    c1500 suburban 2wd  suv       15
##  9 chevrolet    k1500 tahoe 4wd     suv       15
## 10 dodge        dakota pickup 4wd   pickup    15

If you want to sort dataset from highest to lowest highway fuel economy, wrapping arrange variable by desc()

new_mpg %>%
  select(manufacturer, model, class, hwy) %>%
  arrange(desc(hwy)) %>%
  head(10)
## # A tibble: 10 x 4
##    manufacturer model      class        hwy
##    <chr>        <chr>      <chr>      <int>
##  1 volkswagen   jetta      compact       44
##  2 volkswagen   new beetle subcompact    44
##  3 volkswagen   new beetle subcompact    41
##  4 toyota       corolla    compact       37
##  5 honda        civic      subcompact    36
##  6 honda        civic      subcompact    36
##  7 toyota       corolla    compact       35
##  8 toyota       corolla    compact       35
##  9 honda        civic      subcompact    34
## 10 honda        civic      subcompact    33

3.6 dplyr :: summarise

The function summarise() can be used to get basic statistics by specifying functions such as mean(), sd(), median(), etc.

new_mpg %>%
  filter(manufacturer=='audi') %>%
  summarise(mean_hwy = mean(hwy))
## # A tibble: 1 x 1
##   mean_hwy
##      <dbl>
## 1     26.4
  head(10)
## [1] 10
new_mpg %>%
  filter(manufacturer=='audi') %>%
  summarise(mean_hwy = mean(hwy),
            max_hwy = max(hwy),
            min_hwy = min(hwy))
## # A tibble: 1 x 3
##   mean_hwy max_hwy min_hwy
##      <dbl>   <int>   <int>
## 1     26.4      31      23
  head(10)
## [1] 10

But, this function can be more useful if used together with groupby.

new_mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy),
            max_hwy = max(hwy),
            min_hwy = min(hwy))
## # A tibble: 15 x 4
##    manufacturer mean_hwy max_hwy min_hwy
##    <chr>           <dbl>   <int>   <int>
##  1 audi             26.4      31      23
##  2 chevrolet        21.9      30      14
##  3 dodge            17.9      24      12
##  4 ford             19.4      26      15
##  5 honda            32.6      36      29
##  6 hyundai          26.9      31      24
##  7 jeep             17.6      22      12
##  8 land rover       16.5      18      15
##  9 lincoln          17        18      16
## 10 mercury          18        19      17
## 11 nissan           24.6      32      17
## 12 pontiac          26.4      28      25
## 13 subaru           25.6      27      23
## 14 toyota           24.9      37      15
## 15 volkswagen       29.2      44      23

we can apply arrange() too.

new_mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy),
            max_hwy = max(hwy),
            min_hwy = min(hwy)) %>%
  arrange(desc(mean_hwy))
## # A tibble: 15 x 4
##    manufacturer mean_hwy max_hwy min_hwy
##    <chr>           <dbl>   <int>   <int>
##  1 honda            32.6      36      29
##  2 volkswagen       29.2      44      23
##  3 hyundai          26.9      31      24
##  4 audi             26.4      31      23
##  5 pontiac          26.4      28      25
##  6 subaru           25.6      27      23
##  7 toyota           24.9      37      15
##  8 nissan           24.6      32      17
##  9 chevrolet        21.9      30      14
## 10 ford             19.4      26      15
## 11 mercury          18        19      17
## 12 dodge            17.9      24      12
## 13 jeep             17.6      22      12
## 14 lincoln          17        18      16
## 15 land rover       16.5      18      15

We select 2 or more columns to group_by function like this.

new_mpg %>%
  group_by(manufacturer, model) %>%
  summarise(mean_hwy = mean(hwy),
            max_hwy = max(hwy),
            min_hwy = min(hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## # A tibble: 38 x 5
## # Groups:   manufacturer [15]
##    manufacturer model              mean_hwy max_hwy min_hwy
##    <chr>        <chr>                 <dbl>   <int>   <int>
##  1 audi         a4                     28.3      31      26
##  2 audi         a4 quattro             25.8      28      25
##  3 audi         a6 quattro             24        25      23
##  4 chevrolet    c1500 suburban 2wd     17.8      20      15
##  5 chevrolet    corvette               24.8      26      23
##  6 chevrolet    k1500 tahoe 4wd        16.2      19      14
##  7 chevrolet    malibu                 27.6      30      26
##  8 dodge        caravan 2wd            22.4      24      17
##  9 dodge        dakota pickup 4wd      17        19      12
## 10 dodge        durango 4wd            16        18      12
## # … with 28 more rows

and apply filter() too.

new_mpg %>%
  group_by(manufacturer, model) %>%
  summarise(mean_hwy = mean(hwy),
            max_hwy = max(hwy),
            min_hwy = min(hwy)) %>%
  filter(manufacturer=='audi' | manufacturer=='hyundai')
## `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   manufacturer [2]
##   manufacturer model      mean_hwy max_hwy min_hwy
##   <chr>        <chr>         <dbl>   <int>   <int>
## 1 audi         a4             28.3      31      26
## 2 audi         a4 quattro     25.8      28      25
## 3 audi         a6 quattro     24        25      23
## 4 hyundai      sonata         27.7      31      26
## 5 hyundai      tiburon        26        29      24

3.7 dplyr :: left_join

dplyr package contain join functions. First, we try left_join() for our custom data.

students <- c('Jennie', 'Tom', 'Minsu', 'Jay', 'Bob')
classes <- c('A','B','C','D','E')
keys <- c('JN', 'T', 'M', 'J', 'B')
heights <- c(155, 190, 165, 177, 180)
weights <- c(57, 101, 64, 80, 88)

class_info <- data.frame(students, classes)
students_info <- data.frame(students, keys, heights, weights)

head(class_info)
##   students classes
## 1   Jennie       A
## 2      Tom       B
## 3    Minsu       C
## 4      Jay       D
## 5      Bob       E
head(students_info)
##   students keys heights weights
## 1   Jennie   JN     155      57
## 2      Tom    T     190     101
## 3    Minsu    M     165      64
## 4      Jay    J     177      80
## 5      Bob    B     180      88

If we have to combine this two table, We need some “point”s. This means some “column”s that two table have in common. In this example, students column is that.

Let’s combine datasets by Setting students column as a point.

all_info <- left_join(class_info, students_info, by='students')
all_info
##   students classes keys heights weights
## 1   Jennie       A   JN     155      57
## 2      Tom       B    T     190     101
## 3    Minsu       C    M     165      64
## 4      Jay       D    J     177      80
## 5      Bob       E    B     180      88

3.8 dplyr :: right_join

What if we use right_join()? In this table, it’s same.

all_info2 <- right_join(class_info, students_info, by='students')
all_info2
##   students classes keys heights weights
## 1   Jennie       A   JN     155      57
## 2      Tom       B    T     190     101
## 3    Minsu       C    M     165      64
## 4      Jay       D    J     177      80
## 5      Bob       E    B     180      88

Because left table(class_info) and right rable(students_info) have same variable(students).

If one side has Na in students column, the result will be different.

students_info2 <- students_info[-1,]
students_info2
##   students keys heights weights
## 2      Tom    T     190     101
## 3    Minsu    M     165      64
## 4      Jay    J     177      80
## 5      Bob    B     180      88
all_info3 <- left_join(class_info, students_info2, by='students')
all_info3
##   students classes keys heights weights
## 1   Jennie       A <NA>      NA      NA
## 2      Tom       B    T     190     101
## 3    Minsu       C    M     165      64
## 4      Jay       D    J     177      80
## 5      Bob       E    B     180      88

let’s see this. We have founded 3 NA values. It’s bacause students_info2 didn’t have the Jennie’s informatino(row) But,

all_info4 <- right_join(class_info, students_info2, by='students')
all_info4
##   students classes keys heights weights
## 1      Tom       B    T     190     101
## 2    Minsu       C    M     165      64
## 3      Jay       D    J     177      80
## 4      Bob       E    B     180      88

If we have chosen right_join() at this moment, we can get different result like this. It’s because students_info2 didn’t have the Jennie’s informatino(row). Same.

right_join() receives two inputs like left_join(), and joins tables based on the right (second) input.

3.9 dplyr :: inner_join

How can we join tables if we met all of tables are have NA at base column of reference? Like this case,

class_info2 <- class_info[-2,]
class_info2
##   students classes
## 1   Jennie       A
## 3    Minsu       C
## 4      Jay       D
## 5      Bob       E
all_info4 <- left_join(class_info2, students_info2, by='students')
all_info4
##   students classes keys heights weights
## 1   Jennie       A <NA>      NA      NA
## 2    Minsu       C    M     165      64
## 3      Jay       D    J     177      80
## 4      Bob       E    B     180      88
all_info5 <- right_join(class_info2, students_info2, by='students')
all_info5
##   students classes keys heights weights
## 1    Minsu       C    M     165      64
## 2      Jay       D    J     177      80
## 3      Bob       E    B     180      88
## 4      Tom    <NA>    T     190     101

Yes, returns have NA all as expected. But we want to get result that without missing value. In this case, we can use inner_join()

all_info6 <- inner_join(class_info2, students_info2, by='students')
all_info6
##   students classes keys heights weights
## 1    Minsu       C    M     165      64
## 2      Jay       D    J     177      80
## 3      Bob       E    B     180      88

3.10 dplyr :: full_join

Conversely, if we want to display all values together even if there are missing values in left or right, we can use full_join()

all_info7 <- full_join(class_info2, students_info2, by='students')
all_info7
##   students classes keys heights weights
## 1   Jennie       A <NA>      NA      NA
## 2    Minsu       C    M     165      64
## 3      Jay       D    J     177      80
## 4      Bob       E    B     180      88
## 5      Tom    <NA>    T     190     101

3.11 dplyr :: bind_cols

If two data tables have the same number of rows, you can use bind_cols(). It is simple and return different result from join.

bind_cols(class_info2, students_info2)
## New names:
## * students -> students...1
## * students -> students...3
##   students...1 classes students...3 keys heights weights
## 1       Jennie       A          Tom    T     190     101
## 2        Minsu       C        Minsu    M     165      64
## 3          Jay       D          Jay    J     177      80
## 4          Bob       E          Bob    B     180      88

Yes, it just concatenate two tables. There is no reference column. So, same number of rows should be noted. If each number of rows are different, It returns error.

class_info
##   students classes
## 1   Jennie       A
## 2      Tom       B
## 3    Minsu       C
## 4      Jay       D
## 5      Bob       E
students_info2
##   students keys heights weights
## 2      Tom    T     190     101
## 3    Minsu    M     165      64
## 4      Jay    J     177      80
## 5      Bob    B     180      88
bind_cols(class_info, students_info2)
## Error: Can't recycle `..1` (size 5) to match `..2` (size 4). Run `rlang::last_error()` to see where the error occurred.

3.12 dplyr :: bind_rows

Unlike bind_cols(), it can join tables with different columns.

fruit <- c('apple', 'banana', 'watermelon', 'mango')
berry <- c('strawberry', 'raspberry', 'mulberry', 'wildberry')

print(fruit)
## [1] "apple"      "banana"     "watermelon" "mango"
print(berry)
## [1] "strawberry" "raspberry"  "mulberry"   "wildberry"
fruit_df <- data.frame(fruit)
fruit_df
##        fruit
## 1      apple
## 2     banana
## 3 watermelon
## 4      mango
berry_df <- data.frame(berry)
berry_df
##        berry
## 1 strawberry
## 2  raspberry
## 3   mulberry
## 4  wildberry
fruit_df$sugar <- c(3,8,2,9)
berry_df$sugar <- c(2,1,1,1)
berry_df$color <- c('red', 'red', 'black', 'red')

print(fruit_df)
##        fruit sugar
## 1      apple     3
## 2     banana     8
## 3 watermelon     2
## 4      mango     9
print(berry_df)
##        berry sugar color
## 1 strawberry     2   red
## 2  raspberry     1   red
## 3   mulberry     1 black
## 4  wildberry     1   red
berry_df <- rename(berry_df, fruit = berry)
print(berry_df)
##        fruit sugar color
## 1 strawberry     2   red
## 2  raspberry     1   red
## 3   mulberry     1 black
## 4  wildberry     1   red
bind_rows(fruit_df, berry_df)
##        fruit sugar color
## 1      apple     3  <NA>
## 2     banana     8  <NA>
## 3 watermelon     2  <NA>
## 4      mango     9  <NA>
## 5 strawberry     2   red
## 6  raspberry     1   red
## 7   mulberry     1 black
## 8  wildberry     1   red

Yes, It returns data table without error even if the structure of column is different. And of course, the number of rows doesn’t matter, since we’re joining rows. But let’s check them too.

bind_rows(fruit_df, berry_df[-1,])
##        fruit sugar color
## 1      apple     3  <NA>
## 2     banana     8  <NA>
## 3 watermelon     2  <NA>
## 4      mango     9  <NA>
## 5  raspberry     1   red
## 6   mulberry     1 black
## 7  wildberry     1   red

Okay, Just one row decreased.

3.13 reshape :: melt

reshape is the most representative r package for data restructuring.

We’ll cover two functions that make the reshape package useful. One melt() that melts (deconstructs) the data, the other cast() that casts (reconstructs) the data. As you can see from the word, the name is compared to the casting process of iron.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
library(reshape)
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
sample_decon_df <- melt(airquality, id=c('Month','Day'))
head(sample_decon_df)
##   Month Day variable value
## 1     5   1    Ozone    41
## 2     5   2    Ozone    36
## 3     5   3    Ozone    12
## 4     5   4    Ozone    18
## 5     5   5    Ozone    NA
## 6     5   6    Ozone    28
summary(sample_decon_df$variable)
##   Ozone Solar.R    Wind    Temp 
##     153     153     153     153

We can see that all columns except Month and Day, specified as id values in the melt function, are expressed as variable.

3.14 reshape :: cast

The cast() function can cast the data melted above. Here, R’s fomula should be handled flexibly, and you can learn it by implementing it in code.

Fomula means an expression that expresses a function like a regular expression in text terms, and it is good to learn ‘~’ and ‘+’ at the most basic level.

head(cast(sample_decon_df, Month+Day~variable))
##   Month Day Ozone Solar.R Wind Temp
## 1     5   1    41     190  7.4   67
## 2     5   2    36     118  8.0   72
## 3     5   3    12     149 12.6   74
## 4     5   4    18     313 11.5   62
## 5     5   5    NA      NA 14.3   56
## 6     5   6    28      NA 14.9   66

You can see that the melted data is cast. There is a difference from the original, in that an index is added and the position of the column is changed. It’s not that important, so let’s move on.

In formula, we defined to use both Month and Day as dependent variables through the ‘+’ sign, and defined variable as the independent variable through the ‘~’ sign.

More precisely, what comes before the ‘~’ sign is the dependent variable, and what comes after the sign is the independent variable. This can be thought of as f(x)~x.

Let’s try more.

cast(sample_decon_df, Month~variable)
## Aggregation requires fun.aggregate: length used as default
##   Month Ozone Solar.R Wind Temp
## 1     5    31      31   31   31
## 2     6    30      30   30   30
## 3     7    31      31   31   31
## 4     8    31      31   31   31
## 5     9    30      30   30   30

Only month was used as the dependent variable. The explanation of the fomula above will make more sense.

cast(sample_decon_df, Month~Day+variable)[1:4,1:5]
##   Month 1_Ozone 1_Solar.R 1_Wind 1_Temp
## 1     5      41       190    7.4     67
## 2     6      NA       286    8.6     78
## 3     7     135       269    4.1     84
## 4     8      39        83    6.9     81

This time, Day was added to the independent variable, not the dependent variable. It’s a little complicated, but if you understand it, you can see that it is possible to further subdivide each independent variable into days, such as ozone on day 1 and solar on day 1.

3.15 reshape :: cast with aggregation

Cast also has data aggregation capabilities. It is very simple and powerful. let’s try.

head(cast(sample_decon_df, Month+Day~variable, mean))
##   Month Day Ozone Solar.R Wind Temp
## 1     5   1    41     190  7.4   67
## 2     5   2    36     118  8.0   72
## 3     5   3    12     149 12.6   74
## 4     5   4    18     313 11.5   62
## 5     5   5    NA      NA 14.3   56
## 6     5   6    28      NA 14.9   66

It doesn’t look any different from the original. This is because there is only one data value for each date, so it cannot be aggregated. To be precise, the aggregation for a day is the same as the data for that day.

Except for Day, you can look at monthly data like this.

cast(sample_decon_df, Month~variable, mean)
##   Month Ozone  Solar.R      Wind     Temp
## 1     5    NA       NA 11.622581 65.54839
## 2     6    NA 190.1667 10.266667 79.10000
## 3     7    NA 216.4839  8.941935 83.90323
## 4     8    NA       NA  8.793548 83.96774
## 5     9    NA 167.4333 10.180000 76.90000

Wait, if there is NA in some data like this, it is displayed as NA because aggregation is impossible. To prevent this we can set na.rm to True.

cast(sample_decon_df, Month~variable, mean, na.rm=T)
##   Month    Ozone  Solar.R      Wind     Temp
## 1     5 23.61538 181.2963 11.622581 65.54839
## 2     6 29.44444 190.1667 10.266667 79.10000
## 3     7 59.11538 216.4839  8.941935 83.90323
## 4     8 59.96154 171.8571  8.793548 83.96774
## 5     9 31.44828 167.4333 10.180000 76.90000

Of course, it is possible to count other than average.

cast(sample_decon_df, Month~variable, max, na.rm=T)
##   Month Ozone Solar.R Wind Temp
## 1     5   115     334 20.1   81
## 2     6    71     332 20.7   93
## 3     7   135     314 14.9   92
## 4     8   168     273 15.5   97
## 5     9    96     259 16.6   93
cast(sample_decon_df, Month~variable, min, na.rm=T)
##   Month Ozone Solar.R Wind Temp
## 1     5     1       8  5.7   56
## 2     6    12      31  1.7   65
## 3     7     7       7  4.1   73
## 4     8     9      24  2.3   72
## 5     9     7      14  2.8   63