Chapter 2 Exploratory Data Analysis
Exploratory data analysis (EDA) is the step in which a data analyst inspects a dataset before statistical modeling or machine learning.
2.1 head and tail
df_mpg <- data.frame(ggplot2::mpg)
head(df_mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
We have loaded the mpg dataset from the ggplot2 package, which contains several useful datasets as well as visualization tools.
The head function returns the first 6 rows of its input by default. In contrast, the tail function returns the last 6 rows.
tail(df_mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 229 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
## 230 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
## 231 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
## 232 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
## 233 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
## 234 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize
Checking rows this way is important because most datasets we meet are too large to display on a single page.
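Both functions accept an n argument when six rows is not the right amount. A minimal sketch, assuming ggplot2 is installed so that ggplot2::mpg is available:

```r
# Load the same data frame used above
df_mpg <- data.frame(ggplot2::mpg)

# First 3 rows instead of the default 6
first_rows <- head(df_mpg, n = 3)

# Last 10 rows
last_rows <- tail(df_mpg, n = 10)

# A negative n drops rows: everything except the last 230 rows
all_but_last <- head(df_mpg, n = -230)

nrow(first_rows)
nrow(last_rows)
nrow(all_but_last)
```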
2.2 dim and str
Using the dim function, we can check the size of a dataset, reported as rows x columns.
dim(df_mpg)
## [1] 234 11
The mpg dataset has 234 rows and 11 columns (features).
dim is not limited to the entire dataset; it works on any subset that is still a data frame.
dim(head(df_mpg))
## [1] 6 11
dim(df_mpg[,2:4])
## [1] 234 3
dim(df_mpg['model'])
## [1] 234 1
However, dim does not work on a one-dimensional vector; it returns NULL. Note that df_mpg['model'] is not a vector; it is still a data frame.
str(df_mpg['model'])
## 'data.frame': 234 obs. of 1 variable:
## $ model: chr "a4" "a4" "a4" "a4" ...
This is different from df_mpg$model:
str(df_mpg$model)
## chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" ...
This is a vector. It is the same as df_mpg[['model']]:
str(df_mpg[['model']])
## chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" ...
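The distinction matters in practice, because dim returns NULL for a plain vector. The sketch below, which again assumes ggplot2 is installed, shows that behavior along with length, nrow, and ncol as alternatives:

```r
df_mpg <- data.frame(ggplot2::mpg)

model_df  <- df_mpg['model']   # single-column data frame
model_vec <- df_mpg$model      # character vector

dim(model_df)      # 234 rows, 1 column
dim(model_vec)     # NULL: a vector has no dim attribute
length(model_vec)  # use length() for vectors

# nrow() and ncol() return one dimension at a time
nrow(df_mpg)
ncol(df_mpg)

# $ and [[ ]] return the same vector
identical(df_mpg$model, df_mpg[['model']])
```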
2.3 summary
We can check summary statistics using the summary function.
summary(df_mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
If a column is a categorical (character) variable, such as model, trans, drv, manufacturer, fl, or class, summary shows only its Length, Class, and Mode. For a numeric column, it shows the minimum, first quartile, median, mean, third quartile, and maximum.
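If counts per category would be more useful than Length, Class, and Mode, one option is to convert a character column to a factor before calling summary. A small sketch, assuming ggplot2 is installed:

```r
df_mpg <- data.frame(ggplot2::mpg)

# summary() on a character vector only reports length, class, and mode
summary(df_mpg$drv)

# Converting to a factor makes summary() report a count per level
drv_counts <- summary(factor(df_mpg$drv))
drv_counts

# The same idea for the whole data frame: convert every character column
is_chr <- sapply(df_mpg, is.character)
df_fct <- df_mpg
df_fct[is_chr] <- lapply(df_fct[is_chr], factor)
summary(df_fct[c('drv', 'fl', 'class')])
```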
2.4 table
R also provides frequency tables. We can build one by calling the table function.
table(df_mpg$model)
##
## 4runner 4wd a4 a4 quattro
## 6 7 8
## a6 quattro altima c1500 suburban 2wd
## 3 6 5
## camry camry solara caravan 2wd
## 7 7 11
## civic corolla corvette
## 9 5 5
## dakota pickup 4wd durango 4wd expedition 2wd
## 9 7 3
## explorer 4wd f150 pickup 4wd forester awd
## 6 7 6
## grand cherokee 4wd grand prix gti
## 8 5 5
## impreza awd jetta k1500 tahoe 4wd
## 8 9 4
## land cruiser wagon 4wd malibu maxima
## 2 5 3
## mountaineer 4wd mustang navigator 2wd
## 4 9 3
## new beetle passat pathfinder 4wd
## 6 7 4
## ram 1500 pickup 4wd range rover sonata
## 10 4 7
## tiburon toyota tacoma 4wd
## 7 7
table(df_mpg$class)
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
This is helpful for checking the count of each value in a variable. The table is even more useful alongside a plot of the original variable.
First, let’s load the ggplot2 library:
library(ggplot2)
qplot(df_mpg$class)
Be careful not to pass the table object itself to qplot. If you do, qplot treats the counts as a continuous variable and draws a histogram of the counts, which is a different plot from the one above.
qplot(table(df_mpg$class))
## Don't know how to automatically pick scale for object of type table. Defaulting to continuous.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
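To plot or report the counts themselves, it usually helps to turn the table into a more convenient form first. A sketch of a few common follow-ups, assuming ggplot2 is installed for the data; the column names class and n are chosen here for illustration:

```r
df_mpg <- data.frame(ggplot2::mpg)

class_counts <- table(df_mpg$class)

# Sort from most to least frequent
sort(class_counts, decreasing = TRUE)

# Relative frequencies instead of raw counts
class_share <- prop.table(class_counts)
round(class_share, 3)

# Convert to a data frame with one row per class
class_df <- as.data.frame(class_counts, stringsAsFactors = FALSE)
names(class_df) <- c('class', 'n')
class_df

# Base R draws a bar chart directly from a table
barplot(class_counts)
```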
2.5 qplot
Visualization is always an important part of EDA. Let’s look more closely at qplot.
We use the mpg dataset. Here are its first 6 rows:
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
qplot means “quick plot.” It is useful for producing a plot quickly.
Let’s look at the frequency of some features using qplot:
qplot(data=mpg, x=manufacturer)
qplot(data=mpg, x=fl)
qplot(data=mpg, x=year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This looks odd: the year variable takes only two values.
Let’s check that.
table(mpg$year)
##
## 1999 2008
## 117 117
Yes. The dataset contains only cars manufactured in 1999 and 2008. We have to keep this in mind when analyzing it.
Let’s go ahead and check one more feature.
qplot(data=mpg, x=class)
In this way, qplot shows the frequency of whatever feature we pass as x.
The class variable describes the type of car, such as its size or body shape.
table(mpg$class)
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
What happens if we also pass y to the qplot function?
qplot(data=mpg, x=class, y=hwy)
This is a scatter plot. From it, we can compare the highway fuel economy of each car class. Pickup trucks and SUVs tend to have poor fuel economy.
However, there is a more useful plot than a scatter plot for comparing the typical value of each group. Let’s use that.
qplot(data=mpg, x=class, y=hwy, geom='boxplot', color=class)
This is a box plot. It makes the median of each class, and even outliers, easy to see. How we visualize our data and which tools we use matter a great deal.
If we use a line plot in this situation, it is difficult to show the data accurately.
qplot(data=mpg, x=class, y=hwy, geom='line', color=class)
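The center line of each box is the median, and the same numbers can be computed directly. A minimal sketch using tapply, assuming ggplot2 is installed:

```r
mpg <- ggplot2::mpg

# Median highway mileage per class, the value drawn as the line inside each box
hwy_median <- tapply(mpg$hwy, mpg$class, median)
sort(hwy_median)

# Mean per class, for comparison with the median
hwy_mean <- tapply(mpg$hwy, mpg$class, mean)
round(sort(hwy_mean), 1)
```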
2.6 hist
Base R has a built-in histogram function, hist(). It shows the distribution of a continuous (numeric) variable.
hist(mpg$hwy)
If you pass a categorical variable (e.g. mpg$class) to the hist function, it fails with the following error:
Error in hist.default(mpg$class) : 'x' must be numeric
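hist also lets us control the bins and read back the counts. The sketch below assumes ggplot2 is installed; the bin edges and labels are illustrative choices. For a categorical variable, it plots the frequency table instead:

```r
mpg <- ggplot2::mpg

# Control the bins explicitly and label the plot
hist(mpg$hwy, breaks = seq(10, 45, by = 5),
     main = 'Highway mileage', xlab = 'hwy (miles per gallon)')

# plot = FALSE returns the bin edges and counts without drawing
h <- hist(mpg$hwy, breaks = seq(10, 45, by = 5), plot = FALSE)
h$breaks
h$counts

# For a categorical variable, plot its frequency table instead
barplot(table(mpg$class))
```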
2.7 is.na
is.na() is a function that checks, element by element, whether its input is a missing value.
The input can be a vector, a list, a data frame, and so on.
names <- c('Kim', 'Jay', 'toreto', NA)
is.na(names)
## [1] FALSE FALSE FALSE TRUE
We can count the NA values using the table function.
table(is.na(names))
##
## FALSE TRUE
## 3 1
How about a dataframe?
id <- c('0000', '0010', '0022', NA)
birth <- c('880201', '000106', '930128', NA)
customer_df <- data.frame(names, id, birth)
customer_df
## names id birth
## 1 Kim 0000 880201
## 2 Jay 0010 000106
## 3 toreto 0022 930128
## 4 <NA> <NA> <NA>
table(is.na(customer_df))
##
## FALSE TRUE
## 9 3
We can visualize the table using the plot() function:
plot(table(is.na(customer_df)))
If we only need the number of NA values, we can use sum().
sum(is.na(customer_df))
## [1] 3
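A single total does not say where the missing values are. The sketch below rebuilds the same small data frame and uses anyNA, colSums, and complete.cases to locate them by column and by row:

```r
names <- c('Kim', 'Jay', 'toreto', NA)
id    <- c('0000', '0010', '0022', NA)
birth <- c('880201', '000106', '930128', NA)
customer_df <- data.frame(names, id, birth)

# Any missing value at all?
anyNA(customer_df)

# Missing values per column
na_per_column <- colSums(is.na(customer_df))
na_per_column

# Rows that contain at least one missing value
rows_with_na <- which(!complete.cases(customer_df))
rows_with_na
```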
There are many ways to fill in NA values.
First, we can simply replace an NA with a specific value.
customer_df[4, 'names'] <- 'haword'
customer_df
## names id birth
## 1 Kim 0000 880201
## 2 Jay 0010 000106
## 3 toreto 0022 930128
## 4 haword <NA> <NA>
Alternatively, we can use the ifelse() function like this:
ifelse(is.na(customer_df$id), 'Missing', customer_df$id)
## [1] "0000" "0010" "0022" "Missing"
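ifelse returns a new vector, so the result has to be assigned to something to keep it. Another common base R idiom is to index with is.na and assign in place. A sketch using a standalone copy of the id values and a made-up numeric vector named score:

```r
id <- c('0000', '0010', '0022', NA)

# Assign the ifelse() result to keep it
id_filled <- ifelse(is.na(id), 'Missing', id)

# Equivalent: overwrite only the missing positions
id_filled2 <- id
id_filled2[is.na(id_filled2)] <- 'Missing'

identical(id_filled, id_filled2)

# For a numeric variable, one simple choice is the mean of the observed values
score <- c(10, 20, NA, 30)   # hypothetical data
score[is.na(score)] <- mean(score, na.rm = TRUE)
score
```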
2.8 na.omit
Alternatively, we can drop every row that contains a missing value.
customer_df <- na.omit(customer_df)
customer_df
## names id birth
## 1 Kim 0000 880201
## 2 Jay 0010 000106
## 3 toreto 0022 930128
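Note that na.omit drops a row if any of its columns is missing, which is why the haword row disappears even though its name was filled in. To drop rows based on selected columns only, complete.cases can be applied to a subset of columns. A sketch that rebuilds the data frame as it stood before na.omit:

```r
customer_df <- data.frame(
  names = c('Kim', 'Jay', 'toreto', 'haword'),
  id    = c('0000', '0010', '0022', NA),
  birth = c('880201', '000106', '930128', NA)
)

# Drops row 4 because id and birth are still missing
strict <- na.omit(customer_df)
nrow(strict)

# Keep every row whose names column is present, whatever the other columns hold
by_name <- customer_df[complete.cases(customer_df['names']), ]
nrow(by_name)

# na.omit() records which rows it removed
attr(strict, 'na.action')
```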
2.9 barplot
When a data frame contains outliers, we have to decide how to handle them.
If you have good domain knowledge, you can simply visualize the data and judge whether a value is an outlier.
height <- c(180, 177, 17)
customer_df$height <- height
barplot(customer_df$height, names.arg=customer_df$names)
Is that right? Is toreto’s height really 17 cm? No. It is probably a typing error; perhaps the real value is 170 or 177.
A dataset often holds too many records to check one by one in a table, which is why visualization is so useful.
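Domain knowledge can also be written down as an explicit rule. The sketch below flags heights outside a plausible adult range; the 100 to 250 cm limits are an illustrative assumption, not values from the text:

```r
customer_df <- data.frame(
  names  = c('Kim', 'Jay', 'toreto'),
  height = c(180, 177, 17)
)

# Hypothetical plausibility limits in centimeters
min_height <- 100
max_height <- 250

suspicious <- customer_df$height < min_height | customer_df$height > max_height
customer_df[suspicious, ]

# Mark the suspicious value as missing until it can be verified
customer_df$height[suspicious] <- NA
customer_df
```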
2.10 boxplot
A boxplot is a great help, especially when you lack deep domain knowledge or want to look for statistical outliers.
boxplot(mpg$cty)
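We can see four outlier points above the upper whisker in the boxplot. Two cars share the value 28, so five observations are involved. Let’s check them.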
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mpg %>%
  select(manufacturer, model, year, cty) %>%
  arrange(desc(cty)) %>%
  head(10)
## # A tibble: 10 x 4
## manufacturer model year cty
## <chr> <chr> <int> <int>
## 1 volkswagen new beetle 1999 35
## 2 volkswagen jetta 1999 33
## 3 volkswagen new beetle 1999 29
## 4 honda civic 1999 28
## 5 toyota corolla 2008 28
## 6 honda civic 2008 26
## 7 toyota corolla 1999 26
## 8 toyota corolla 2008 26
## 9 honda civic 1999 25
## 10 honda civic 2008 25
summary(mpg$cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 14.00 17.00 16.86 19.00 35.00
We can also extract the outliers that boxplot identifies as numeric values.
boxplot(mpg$cty)$out
## [1] 28 28 33 35 29
Let’s look up these values in our data frame.
outputs <- boxplot(mpg$cty)$out
mpg[mpg$cty %in% outputs, ]
## # A tibble: 5 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual… f 28 33 r subcom…
## 2 toyota corolla 1.8 2008 4 manual… f 28 37 r compact
## 3 volkswagen jetta 1.9 1999 4 manual… f 33 44 d compact
## 4 volkswagen new be… 1.9 1999 4 manual… f 35 44 d subcom…
## 5 volkswagen new be… 1.9 1999 4 auto(l… f 29 41 d subcom…
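By default, boxplot marks a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below reproduces that rule by hand with quantile, assuming ggplot2 is installed. boxplot itself uses hinges from fivenum, which can differ slightly from quantile on other data, so treat this as an approximation of what boxplot does rather than its exact implementation.

```r
mpg <- ggplot2::mpg

q1  <- quantile(mpg$cty, 0.25)
q3  <- quantile(mpg$cty, 0.75)
iqr <- q3 - q1

# Fences: values outside this range are flagged
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_outlier <- mpg$cty < lower_fence | mpg$cty > upper_fence
manual_out <- mpg$cty[is_outlier]
sort(manual_out)

# boxplot.stats() returns the flagged values without drawing a plot
sort(boxplot.stats(mpg$cty)$out)
```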
2.11 ggplot2 :: ggplot
ggplot2 is an R package for building visualizations in a structured, layered way.
All the chart types we have discussed (scatter plot, box plot, line plot, histogram, and so on) can be drawn with ggplot. Let’s try it.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
library(ggplot2)
ggplot(data=mpg, aes(x=year, y=cty)) + geom_point()
A scatter plot is useful for checking the relationship between two continuous variables. In this case, however, the year variable is not continuous. Let’s try another variable.
ggplot(data=mpg, aes(x=displ, y=cty)) + geom_point()
Here both variables are numeric, so a scatter plot works well. If one of the two variables is not continuous, you should consider another chart.
As another example:
ggplot(data=mpg, aes(x=class, y=cty)) + geom_point()
It is better to use a box plot to compare the fuel efficiency of each class.
ggplot(data=mpg, aes(x=class, y=cty)) + geom_boxplot()
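The other chart types from this chapter follow the same pattern: keep ggplot(data, aes(...)) and swap the geom. A brief sketch; the bin width of 2 is an arbitrary choice, and wrapping year in factor() is one way to tell ggplot2 to treat it as discrete:

```r
library(ggplot2)

# Bar chart of counts per class (the ggplot counterpart of qplot(x = class))
p_bar <- ggplot(data = mpg, aes(x = class)) + geom_bar()

# Histogram of a numeric variable with an explicit bin width
p_hist <- ggplot(data = mpg, aes(x = hwy)) + geom_histogram(binwidth = 2)

# Treat year as discrete and compare city mileage between the two years
p_year <- ggplot(data = mpg, aes(x = factor(year), y = cty)) + geom_boxplot()

p_bar
p_hist
p_year
```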