Chapter 1 R Basic
This chapter covers the data types and data structures in R. Along the way, we get ready to work with R by looking at some of the built-in functions it provides by default.
1.1 Data Types I
R has 5 basic data types.
- character
- numeric
- integer
- complex
- logical
1.2 Data Types - character
We can inspect each of these data types with a few basic functions.
character(3) # show the collection of 3 character elements
## [1] "" "" ""
typeof('Hello') # show the type of 'Hello'(character)
## [1] "character"
length('Hello') # show the length of 'Hello' (a vector with 1 element, not the number of characters)
## [1] 1
str('Hello') # show the structure of 'Hello'
## chr "Hello"
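Note that length() counts the elements of a vector, not the characters of a string. As a minimal sketch (the variable name greeting is only for illustration), base R's nchar() returns the character count instead:

```r
greeting <- c('Hello', 'R')   # a character vector with 2 elements

length(greeting)   # number of elements in the vector
## [1] 2
nchar(greeting)    # number of characters in each element
## [1] 5 1
```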
1.3 Data Types - numeric, integer, complex
There are 3 types for representing numbers: numeric, integer, and complex. Note that typeof() reports a numeric vector as "double".
numeric(3)
## [1] 0 0 0
integer(3)
## [1] 0 0 0
complex(3)
## [1] 0+0i 0+0i 0+0i
typeof(numeric(3))
## [1] "double"
typeof(integer(3))
## [1] "integer"
typeof(complex(3))
## [1] "complex"
str(numeric(3))
## num [1:3] 0 0 0
str(integer(3))
## int [1:3] 0 0 0
str(complex(3))
## cplx [1:3] 0+0i 0+0i 0+0i
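The same distinction shows up when typing number literals directly. The following short sketch uses only base R and shows that a plain number is stored as a double, while the L suffix requests an integer:

```r
typeof(1)        # a plain number literal is stored as a double
## [1] "double"
typeof(1L)       # the L suffix creates an integer
## [1] "integer"
typeof(1 + 2i)   # an imaginary part makes the value complex
## [1] "complex"
is.numeric(1L)   # integers also count as numeric
## [1] TRUE
```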
1.4 Data Types - logical
The last basic type is logical, which holds TRUE or FALSE values.
logical(5) # show the collection of 5 logical elements
## [1] FALSE FALSE FALSE FALSE FALSE
typeof(logical(5)) # show the type of the logical collection
## [1] "logical"
length(logical(5)) # show the length of the logical collection, or the number of the logical elements(5)
## [1] 5
str(logical(5)) # show the structure of the logical collection
## logi [1:5] FALSE FALSE FALSE FALSE FALSE
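In practice, logical vectors usually come from comparisons rather than from logical(). A small sketch, assuming some made-up scores (the names scores and passed are hypothetical):

```r
scores <- c(55, 80, 92, 67)   # hypothetical example values

passed <- scores >= 70        # a comparison returns a logical vector
passed
## [1] FALSE  TRUE  TRUE FALSE
sum(passed)                   # TRUE counts as 1, FALSE as 0
## [1] 2
```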
1.5 Data Types II
Here are some other useful data types.
- raw
- factor
- Date
1.6 Data Types - raw
The raw type stores bytes, which R prints as two-digit hexadecimal values. We can use it to get the ASCII codes of a string with charToRaw(). Conversely, rawToChar() converts ASCII codes back to ASCII characters.
raw(5)
## [1] 00 00 00 00 00
typeof(raw(5))
## [1] "raw"
length(raw(5))
## [1] 5
ctr <- charToRaw('hi this is ASCII')
print(ctr)
## [1] 68 69 20 74 68 69 73 20 69 73 20 41 53 43 49 49
rawToChar(ctr)
## [1] "hi this is ASCII"
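The printed raw values are hexadecimal. If decimal ASCII codes are more convenient, one possible approach is to convert with as.integer() and as.raw(), both from base R (the variable name code is only for illustration):

```r
code <- as.integer(charToRaw('A'))   # raw byte 41 (hex) as a decimal integer
code
## [1] 65
rawToChar(as.raw(c(72, 105)))        # decimal codes 72 and 105 back to text
## [1] "Hi"
```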
1.7 Data Types - factor
In statistics, variables are divided into categorical variables (discrete variables) and continuous variables. Going deeper, categorical variables can be divided into “nominal variables,” whose values have no order, and “ordinal variables,” whose values have an order. Continuous variables can be divided into “interval variables,” which have no absolute zero, and “ratio variables,” which have an absolute zero.
- Categorical variable
  - nominal variable
  - ordinal variable
- Continuous variable
  - interval variable
  - ratio variable
In this classification, categorical variables can be represented with the factor type in R. A factor can be created with factor() or ordered().
In particular, when data is composed of numbers, it is important to understand whether their order is meaningful. factor() treats the values as categories and drops any meaning of order.
c(1,2,3,1,2,3,4)
## [1] 1 2 3 1 2 3 4
factor(c(1,2,3,1,2,3,4))
## [1] 1 2 3 1 2 3 4
## Levels: 1 2 3 4
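A factor stores its categories once as levels and keeps an integer code for each element. The sketch below uses made-up values (the name f is hypothetical) to show both parts with base R functions:

```r
f <- factor(c('b', 'a', 'c', 'a'))   # hypothetical example values

levels(f)       # the distinct categories, sorted
## [1] "a" "b" "c"
as.integer(f)   # the internal integer code of each element
## [1] 2 1 3 1
```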
1.8 Data Types - ordered factor
An order can also be added to a factor. If you use factor(), you create a nominal variable, and if you use ordered(), you create an ordinal variable.
ordered(c(1,2,3,1,2,3,1,2))
## [1] 1 2 3 1 2 3 1 2
## Levels: 1 < 2 < 3
ordered() is especially useful when working with categorical data whose levels are ordered text labels.
ordered(c('Short','Tall','Grande','Tall','Short','Tall'),
levels=c('Short','Tall','Grande'))
## [1] Short Tall Grande Tall Short Tall
## Levels: Short < Tall < Grande
Of course, it is also possible to put the levels in descending order or to declare in advance a level that does not appear in the current data.
ordered(c(1,2,3,1,2,1,2,3,1),
levels=c(4,3,2,1))
## [1] 1 2 3 1 2 1 2 3 1
## Levels: 4 < 3 < 2 < 1
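Because the levels of an ordered factor have a defined order, comparisons and max() become meaningful. A minimal sketch reusing the size labels from above (the variable name sizes is an assumption for illustration):

```r
sizes <- ordered(c('Short', 'Tall', 'Grande', 'Tall'),
                 levels = c('Short', 'Tall', 'Grande'))

sizes < 'Grande'           # comparison follows the level order
## [1]  TRUE  TRUE FALSE  TRUE
as.character(max(sizes))   # the highest level present in the data
## [1] "Grande"
```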
1.9 Data Types - date
Next is the date type. A date can be passed to a function as a string or a number, so here we use the approach of converting a string in a specific format to a date.
day_info <- c('2018-12-24','2018-12-25')
day_info
## [1] "2018-12-24" "2018-12-25"
typeof(day_info)
## [1] "character"
date_info <- as.Date(day_info)
date_info
## [1] "2018-12-24" "2018-12-25"
typeof(date_info)
## [1] "double"
The results don’t look much different. Why do we need this type?
The date type is typically used for time series analysis, because it enables date arithmetic.
date_info[2] - date_info[1]
## Time difference of 1 days
If we had tried the same subtraction on day_info, which holds character strings, we would have gotten an error.
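A few more date calculations can be sketched as follows, using only base R. The variable names dates and result are assumptions for illustration, and tryCatch() is used here only so that the failing subtraction on strings does not stop the script:

```r
dates <- as.Date(c('2018-12-24', '2018-12-25'))

as.numeric(dates[2] - dates[1])   # the difference as a plain number of days
## [1] 1
dates[1] + 7                      # adding a number shifts the date by days
## [1] "2018-12-31"

# Subtracting the original strings fails; tryCatch() captures the error here
result <- tryCatch('2018-12-25' - '2018-12-24',
                   error = function(e) 'error')
result
## [1] "error"
```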
1.10 Data Structure
A data structure is a container that holds data elements. There are 5 basic data structures in R.
- vector
- matrix
- dataframe
- array
- list
1.11 Data Structure - vector
A vector is a data structure composed of one or more elements, and all elements must have the same data type.
In fact, we already created and printed vectors while checking each type earlier.
sample <- 'A'
sample2 <- c(1,2)
sample3 <- c(1,2,'A')
str(sample)
## chr "A"
str(sample2)
## num [1:2] 1 2
str(sample3)
## chr [1:3] "1" "2" "A"
All of these are vectors. The peculiar thing is that sample3 is of type chr: the numbers were converted to strings. To repeat, all elements in a vector must have the same data type.
We also see here that str() does not report a class called “vector.” In R, a vector is the smallest data structure unit, so the class of a vector is simply its data type.
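When types are mixed inside c(), R converts every element to the most general type present. The following sketch shows this conversion order with base R only:

```r
typeof(c(TRUE, 2L))    # logical + integer   -> integer
## [1] "integer"
typeof(c(2L, 3.5))     # integer + double    -> double
## [1] "double"
typeof(c(3.5, 'A'))    # double + character  -> character
## [1] "character"
class(c(1, 2))         # the class of a plain vector is its data type
## [1] "numeric"
```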
1.12 Data Structure - matrix
A matrix is a 2-dimensional data structure, and it requires all of its elements to be of the same data type.
vector1 <- c(1,2,3,4,5,6,7,8,9,10)
- Create a matrix by filling the columns first :
matrix(vector1, nrow=2, ncol=5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
- Create a matrix by filling the rows first :
matrix(vector1, nrow=2, ncol=5, byrow=T)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
- Check the class of the data (matrix) :
mtr <- matrix(vector1, nrow=2, ncol=5, byrow=T)
class(mtr)
## [1] "matrix" "array"
Unlike a vector, a matrix reports its class as “matrix.”
If you want to combine multiple vectors, you can use cbind or rbind.
vector2 <- c(10,20,30,40,50,60,70,80,90,100)
cbind(vector1, vector2)
## vector1 vector2
## [1,] 1 10
## [2,] 2 20
## [3,] 3 30
## [4,] 4 40
## [5,] 5 50
## [6,] 6 60
## [7,] 7 70
## [8,] 8 80
## [9,] 9 90
## [10,] 10 100
While cbind combines vectors as columns, rbind combines them as rows.
rbind(vector1[1:5], vector2[1:5])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 10 20 30 40 50
We used indexing here. A specific section of a vector can be extracted with ‘[a:b].’
Let’s take a quick look at indexing.
vector2[c(1,3,5,7)]
## [1] 10 30 50 70
vector2[seq(1,8,2)]
## [1] 10 30 50 70
vector2[-1]
## [1] 20 30 40 50 60 70 80 90 100
In this way, you can extract the values at specific indexes with c(), select indexes at a fixed interval with seq(), or exclude the value at a specific index with a minus sign (-).
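Besides positions, a vector can also be indexed with a logical condition. This is a short sketch with made-up values (the name values is hypothetical), using only base R:

```r
values <- c(10, 20, 30, 40, 50)   # hypothetical example values

values[values > 25]               # keep elements where the condition is TRUE
## [1] 30 40 50
values[-c(1, 2)]                  # drop the first two elements
## [1] 30 40 50
values[c(TRUE, FALSE)]            # a shorter logical index is recycled
## [1] 10 30 50
```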
Next, matrix indexing.
mtr
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
- row - 1, column - 1~3 :
mtr[1,1:3]
## [1] 1 2 3
- row - all, column - 4~5 :
mtr[,4:5]
## [,1] [,2]
## [1,] 4 5
## [2,] 9 10
- row - 2, column - all :
mtr[2,]
## [1] 6 7 8 9 10
- row - all, column - 1,3,5 :
mtr[,c(1,3,5)]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 6 8 10
- row - all, column - all except 1 :
mtr[,-1]
## [,1] [,2] [,3] [,4]
## [1,] 2 3 4 5
## [2,] 7 8 9 10
Indexing is very important. Handling vectors and matrices freely is a key skill to acquire before moving on to data preprocessing.
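One detail worth knowing: when a single row or column is selected, R simplifies the result to a plain vector. As a sketch, the drop argument of the bracket operator keeps the matrix shape instead (the name m is only for illustration):

```r
m <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)

dim(m[1, 1:3])                  # a single row is simplified to a vector
## NULL
dim(m[1, 1:3, drop = FALSE])    # drop = FALSE keeps the matrix shape
## [1] 1 3
```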
1.13 Data Structure - dataframe
A data frame is a 2-dimensional data structure whose columns do not need to share the same data type. That is, unlike a matrix, vectors of various types can be combined.
column1 <- c(1,2,3,4,5)
column2 <- c(10,20,30,40,50)
column3 <- c('hi', 'this', 'is', 'vector', '!')
sample_df <- data.frame(column1, column2, column3)
str(sample_df)
## 'data.frame': 5 obs. of 3 variables:
## $ column1: num 1 2 3 4 5
## $ column2: num 10 20 30 40 50
## $ column3: chr "hi" "this" "is" "vector" ...
class(sample_df)
## [1] "data.frame"
The indexing used for matrices can be applied to data frames as well.
sample_df
## column1 column2 column3
## 1 1 10 hi
## 2 2 20 this
## 3 3 30 is
## 4 4 40 vector
## 5 5 50 !
- row - 1, column - all :
sample_df[1, ]
## column1 column2 column3
## 1 1 10 hi
- row - all, column - 1~2 :
sample_df[,1:2]
## column1 column2
## 1 1 10
## 2 2 20
## 3 3 30
## 4 4 40
## 5 5 50
- row - all, column - all except 1 :
sample_df[,-1]
## column2 column3
## 1 10 hi
## 2 20 this
## 3 30 is
## 4 40 vector
## 5 50 !
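Data frame columns can also be selected by name, and rows by a condition. The following is a minimal sketch with a small made-up data frame (df, id, and label are hypothetical names, not part of the example above):

```r
df <- data.frame(id = c(1, 2, 3),
                 label = c('a', 'b', 'c'),
                 stringsAsFactors = FALSE)   # hypothetical columns

df$label                 # a column as a plain vector
## [1] "a" "b" "c"
df[df$id > 1, 'label']   # rows chosen by a condition, column chosen by name
## [1] "b" "c"
```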
1.14 Data Structure - array
An array is a multi-dimensional data structure that extends a matrix to any number of dimensions, and all elements must be of the same data type, just like a matrix.
- 1 dimension :
array(1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
- 2 dimension :
array(1:10, dim=c(5,2)) # c(row, column)
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
- 3 dimension :
array(1:30, dim=c(5,2,3)) # c(row, column, N)
## , , 1
##
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
##
## , , 2
##
## [,1] [,2]
## [1,] 11 16
## [2,] 12 17
## [3,] 13 18
## [4,] 14 19
## [5,] 15 20
##
## , , 3
##
## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30
- 4 dimension :
array(1:200, dim=c(5, 10, 2, 2)) # c(row, column, N, N)
## , , 1, 1
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 6 11 16 21 26 31 36 41 46
## [2,] 2 7 12 17 22 27 32 37 42 47
## [3,] 3 8 13 18 23 28 33 38 43 48
## [4,] 4 9 14 19 24 29 34 39 44 49
## [5,] 5 10 15 20 25 30 35 40 45 50
##
## , , 2, 1
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 51 56 61 66 71 76 81 86 91 96
## [2,] 52 57 62 67 72 77 82 87 92 97
## [3,] 53 58 63 68 73 78 83 88 93 98
## [4,] 54 59 64 69 74 79 84 89 94 99
## [5,] 55 60 65 70 75 80 85 90 95 100
##
## , , 1, 2
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 101 106 111 116 121 126 131 136 141 146
## [2,] 102 107 112 117 122 127 132 137 142 147
## [3,] 103 108 113 118 123 128 133 138 143 148
## [4,] 104 109 114 119 124 129 134 139 144 149
## [5,] 105 110 115 120 125 130 135 140 145 150
##
## , , 2, 2
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 151 156 161 166 171 176 181 186 191 196
## [2,] 152 157 162 167 172 177 182 187 192 197
## [3,] 153 158 163 168 173 178 183 188 193 198
## [4,] 154 159 164 169 174 179 184 189 194 199
## [5,] 155 160 165 170 175 180 185 190 195 200
We can use indexing for arrays too, although it can be more complex than for a matrix or vector.
test <- array(1:200, dim=c(5, 10, 2, 2))
test[3:5,,,] # row indexing
## , , 1, 1
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3 8 13 18 23 28 33 38 43 48
## [2,] 4 9 14 19 24 29 34 39 44 49
## [3,] 5 10 15 20 25 30 35 40 45 50
##
## , , 2, 1
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 53 58 63 68 73 78 83 88 93 98
## [2,] 54 59 64 69 74 79 84 89 94 99
## [3,] 55 60 65 70 75 80 85 90 95 100
##
## , , 1, 2
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 103 108 113 118 123 128 133 138 143 148
## [2,] 104 109 114 119 124 129 134 139 144 149
## [3,] 105 110 115 120 125 130 135 140 145 150
##
## , , 2, 2
##
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 153 158 163 168 173 178 183 188 193 198
## [2,] 154 159 164 169 174 179 184 189 194 199
## [3,] 155 160 165 170 175 180 185 190 195 200
test[3:5,5:8,,] # row and column indexing
## , , 1, 1
##
## [,1] [,2] [,3] [,4]
## [1,] 23 28 33 38
## [2,] 24 29 34 39
## [3,] 25 30 35 40
##
## , , 2, 1
##
## [,1] [,2] [,3] [,4]
## [1,] 73 78 83 88
## [2,] 74 79 84 89
## [3,] 75 80 85 90
##
## , , 1, 2
##
## [,1] [,2] [,3] [,4]
## [1,] 123 128 133 138
## [2,] 124 129 134 139
## [3,] 125 130 135 140
##
## , , 2, 2
##
## [,1] [,2] [,3] [,4]
## [1,] 173 178 183 188
## [2,] 174 179 184 189
## [3,] 175 180 185 190
test[3:5,5:8,2,2] # row and column + N(2,2) indexing ==> extract part (some rows and some columns) of the last matrix in the array
## [,1] [,2] [,3] [,4]
## [1,] 173 178 183 188
## [2,] 174 179 184 189
## [3,] 175 180 185 190
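To keep track of what an array index returns, it can help to check dim() of the result. The sketch below also shows how a position maps to a value for this particular array, which is filled along the first dimension fastest (the name arr is only for illustration):

```r
arr <- array(1:200, dim = c(5, 10, 2, 2))

dim(arr[3:5, 5:8, , ])     # empty positions keep every index of that dimension
## [1] 3 4 2 2
dim(arr[3:5, 5:8, 2, 2])   # single indices drop their dimension
## [1] 3 4

# Values fill the first dimension fastest, so the element at [i, j, k, l]
# of this array equals i + 5*(j-1) + 50*(k-1) + 100*(l-1)
arr[3, 5, 2, 2]
## [1] 173
3 + 5*(5-1) + 50*(2-1) + 100*(2-1)
## [1] 173
```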
1.15 Data Structure - list
A list is the most flexible data structure in R. It can hold elements of any dimension and of different data types together.
li <- list(c(1,2,3,4), c('hi','im','in','list'), sample_df)
li
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] "hi" "im" "in" "list"
##
## [[3]]
## column1 column2 column3
## 1 1 10 hi
## 2 2 20 this
## 3 3 30 is
## 4 4 40 vector
## 5 5 50 !
Each element in the printed list is labeled with double square brackets. So if we want to extract the values themselves from a list (rather than a sub-list), we have to use double brackets (‘[[’, ‘]]’).
First, if we only need one element of a list, the single-bracket indexing shown above can be used. For example,
sample_df[1] # extract 1 column (returned as a data frame)
## column1
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
li[1]
## [[1]]
## [1] 1 2 3 4
Here comes the other part: if you want to go deeper, i.e. extract only a few values from inside an element, you have to use double brackets. Like this,
li[[1]][1:3]
## [1] 1 2 3
What makes the difference?
With a list, a single bracket returns a list.
li[2]
## [[1]]
## [1] "hi" "im" "in" "list"
typeof(li[2])
## [1] "list"
And if we use double brackets, it returns the data type of the element itself.
typeof(li[[2]])
## [1] "character"
That is why you need double brackets to access specific values inside an element of a list.
li[[2]][2:4]
## [1] "im" "in" "list"
li[[3]][2:1,2:3] # the first index selects the list element, the second indexes the data frame
## column2 column3
## 2 20 this
## 1 10 hi
In a list, you can name the elements, just like the columns of a data frame.
li2 <- list(nu=c(1,2,3,4,5),
            ch=c('hi','hello','hey'),
            df=data.frame(c('a','b','c'),c('any','baby','can'),c(10,11,12)))
str(li2)
## List of 3
## $ nu: num [1:5] 1 2 3 4 5
## $ ch: chr [1:3] "hi" "hello" "hey"
## $ df:'data.frame': 3 obs. of 3 variables:
## ..$ c..a....b....c.. : chr [1:3] "a" "b" "c"
## ..$ c..any....baby....can..: chr [1:3] "any" "baby" "can"
## ..$ c.10..11..12. : num [1:3] 10 11 12
And if you’ve named each element, you can access it using ‘$.’
li2$nu
## [1] 1 2 3 4 5
li2$ch
## [1] "hi" "hello" "hey"
li2$df
## c..a....b....c.. c..any....baby....can.. c.10..11..12.
## 1 a any 10
## 2 b baby 11
## 3 c can 12
As you may have noticed, it does the same thing as double brackets.
str(li2[[3]])
## 'data.frame': 3 obs. of 3 variables:
## $ c..a....b....c.. : chr "a" "b" "c"
## $ c..any....baby....can..: chr "any" "baby" "can"
## $ c.10..11..12. : num 10 11 12
str(li2$df)
## 'data.frame': 3 obs. of 3 variables:
## $ c..a....b....c.. : chr "a" "b" "c"
## $ c..any....baby....can..: chr "any" "baby" "can"
## $ c.10..11..12. : num 10 11 12
Finally, let’s look at how the result differs when using only one bracket.
str(li2[3])
## List of 1
## $ df:'data.frame': 3 obs. of 3 variables:
## ..$ c..a....b....c.. : chr [1:3] "a" "b" "c"
## ..$ c..any....baby....can..: chr [1:3] "any" "baby" "can"
## ..$ c.10..11..12. : num [1:3] 10 11 12
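The three access styles can be compared side by side. This is a minimal sketch with a made-up list (items, letter, and number are hypothetical names). It also shows that naming the arguments of data.frame() gives readable column names instead of automatically generated ones like those above:

```r
items <- list(nu = c(1, 2, 3),
              df = data.frame(letter = c('a', 'b'),
                              number = c(10, 11)))   # hypothetical names

identical(items$nu, items[['nu']])   # $ and [[ ]] return the same element
## [1] TRUE
class(items['nu'])                   # single brackets return a list
## [1] "list"
names(items$df)                      # named arguments give readable columns
## [1] "letter" "number"
```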