Chapter 1 R Basics

Let’s look at the data types and data structures in R. Along the way, we will get comfortable with R by trying some of its built-in functions.

1.1 Data Types I

R has 5 basic data types.

  • character
  • numeric
  • integer
  • complex
  • logical

1.2 Data Types - character

We can create and inspect each of these data types with a few basic functions.

character(3) # create a character vector of 3 empty strings
## [1] "" "" ""
typeof('Hello') # show the type of 'Hello'(character)
## [1] "character"
length('Hello') # show the length of 'Hello' (a vector of 1 element, not the number of characters)
## [1] 1
str('Hello') # show the structure of 'Hello'
##  chr "Hello"
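
Because length() counts vector elements rather than characters, a short sketch comparing it with nchar() may help. The variable names word and words are made up for this illustration.

```r
# A single string is a character vector with one element.
word <- 'Hello'
length(word)   # number of elements in the vector
## [1] 1
nchar(word)    # number of characters in the string
## [1] 5

# With several strings, length() counts elements and nchar() works per element.
words <- c('Hello', 'R')
length(words)
## [1] 2
nchar(words)
## [1] 5 1
```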

1.3 Data Types - numeric, integer, complex

There are 3 ways to represent numbers: numeric, integer, and complex. Note that typeof() reports a numeric vector as "double".

numeric(3) 
## [1] 0 0 0
integer(3)
## [1] 0 0 0
complex(3)
## [1] 0+0i 0+0i 0+0i
typeof(numeric(3))
## [1] "double"
typeof(integer(3))
## [1] "integer"
typeof(complex(3))
## [1] "complex"
str(numeric(3))
##  num [1:3] 0 0 0
str(integer(3))
##  int [1:3] 0 0 0
str(complex(3))
##  cplx [1:3] 0+0i 0+0i 0+0i
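
As a small sketch of how the three number types come up in practice, the example below creates one value of each. The names x_double, x_integer, and x_complex are only illustrative.

```r
x_double  <- 3     # a plain number is stored as a double
x_integer <- 3L    # the L suffix requests an integer
typeof(x_double)
## [1] "double"
typeof(x_integer)
## [1] "integer"
is.numeric(x_integer)   # both double and integer count as numeric
## [1] TRUE

x_complex <- complex(real = 1, imaginary = 2)
x_complex
## [1] 1+2i
```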

1.4 Data Types - logical

The last basic type is logical, which holds TRUE and FALSE values.

logical(5) # show the collection of 5 logical elements
## [1] FALSE FALSE FALSE FALSE FALSE
typeof(logical(5)) # show the type of the logical collection
## [1] "logical"
length(logical(5)) # show the length of the logical collection, or the number of the logical elements(5)
## [1] 5
str(logical(5)) # show the structure of the logical collection
##  logi [1:5] FALSE FALSE FALSE FALSE FALSE
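
Logical vectors usually come from comparisons rather than from logical(). The sketch below assumes a made-up numeric vector called values.

```r
values <- c(3, 8, 1, 10)
is_large <- values > 5   # a comparison returns a logical vector
is_large
## [1] FALSE  TRUE FALSE  TRUE
sum(is_large)            # TRUE counts as 1 and FALSE as 0
## [1] 2
```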

1.5 Data Types II

Here are some other useful data types.

  • raw
  • factor
  • Date

1.6 Data Types - raw

The raw type stores bytes, which R prints as two-digit hexadecimal values. We can use it to get the ASCII codes of characters and, conversely, to convert ASCII codes back to characters.

raw(5)
## [1] 00 00 00 00 00
typeof(raw(5))
## [1] "raw"
length(raw(5)) 
## [1] 5
ctr <- charToRaw('hi this is ASCII')
print(ctr)
##  [1] 68 69 20 74 68 69 73 20 69 73 20 41 53 43 49 49
rawToChar(ctr)
## [1] "hi this is ASCII"
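
The hexadecimal output can be hard to read, so here is a minimal sketch that shows the same codes in decimal and builds text from numeric codes. The names code_A and back are only illustrative.

```r
code_A <- as.integer(charToRaw('A'))   # decimal ASCII code of 'A'
code_A
## [1] 65
back <- rawToChar(as.raw(c(72, 105)))  # bytes 72 and 105 converted back to text
back
## [1] "Hi"
```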

1.7 Data Types - factor

In statistics, variables are divided into categorical variables (discrete variables) and continuous variables. Going deeper, categorical variables can be divided into “nominal variables,” whose values have no order, and “ordinal variables,” whose values have an order. Continuous variables can be divided into “interval variables,” which have no absolute zero, and “ratio variables,” which have an absolute zero.

  1. Categorical variable
  • nominal variable
  • ordinal variable
  2. Continuous variable
  • interval variable
  • ratio variable

In this classification, categorical variables can be represented by the factor type in R. Factors are created with factor() and ordered().

In particular, when data consists of numbers, it is important to know whether their order carries meaning. factor() treats each distinct value as a category and so discards any meaning in the order.

c(1,2,3,1,2,3,4)
## [1] 1 2 3 1 2 3 4
factor(c(1,2,3,1,2,3,4))
## [1] 1 2 3 1 2 3 4
## Levels: 1 2 3 4
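
To see what "expressing numbers as categories" means internally, the sketch below inspects a factor built from a made-up vector. The name f is only illustrative.

```r
f <- factor(c(10, 20, 30, 10, 20))
levels(f)                     # the categories are stored as text labels
## [1] "10" "20" "30"
as.integer(f)                 # internal category codes, not the original values
## [1] 1 2 3 1 2
as.numeric(as.character(f))   # go through the labels to recover the numbers
## [1] 10 20 30 10 20
```

Converting a factor straight to a number returns the codes, so the detour through as.character() is needed whenever the original values matter.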

1.8 Data Types - ordered factor

An order can also be added to a factor. factor() creates a nominal variable, and ordered() creates an ordinal variable.

ordered(c(1,2,3,1,2,3,1,2))
## [1] 1 2 3 1 2 3 1 2
## Levels: 1 < 2 < 3

ordered() is especially useful for categorical data whose labels are text with a natural order. The levels argument states that order explicitly.

ordered(c('Short','Tall','Grande','Tall','Short','Tall'),
        levels=c('Short','Tall','Grande'))
## [1] Short  Tall   Grande Tall   Short  Tall  
## Levels: Short < Tall < Grande

Of course, it is also possible to list the levels in descending order, or to declare in advance a level that does not appear in the current data.

ordered(c(1,2,3,1,2,1,2,3,1),
        levels=c(4,3,2,1))
## [1] 1 2 3 1 2 1 2 3 1
## Levels: 4 < 3 < 2 < 1
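
One practical consequence of an ordered factor is that comparisons follow the level order. The following is a small sketch using the cup sizes from above; the variable name sizes is made up.

```r
sizes <- ordered(c('Short', 'Tall', 'Grande', 'Tall'),
                 levels = c('Short', 'Tall', 'Grande'))
sizes < 'Grande'   # compared by level order, not alphabetically
## [1]  TRUE  TRUE FALSE  TRUE
max(sizes)         # the highest level present in the data
## [1] Grande
## Levels: Short < Tall < Grande
```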

1.9 Data Types - date

Next is the Date type. A date can be created from either a string or a number; here we use the method that converts strings to dates with as.Date().

day_info <- c('2018-12-24','2018-12-25')
day_info
## [1] "2018-12-24" "2018-12-25"
typeof(day_info)
## [1] "character"
date_info <- as.Date(day_info)
date_info
## [1] "2018-12-24" "2018-12-25"
typeof(date_info)
## [1] "double"

The results don’t look much different. Why do we need this type?

We usually use the Date type for time series analysis, because it makes date arithmetic possible.

date_info[2] - date_info[1]
## Time difference of 1 days

If we had tried the same subtraction on day_info, which holds character strings, we would have gotten an error.
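
The sketch below puts these points together: class() reveals the Date class that typeof() hides, date arithmetic works, and the same subtraction on strings fails. The names result and parsed are made up, and the slash-separated input format is only an assumed example.

```r
day_info  <- c('2018-12-24', '2018-12-25')
date_info <- as.Date(day_info)

class(date_info)                          # class() reports "Date"
## [1] "Date"
as.numeric(date_info[2] - date_info[1])   # difference in days as a plain number
## [1] 1
date_info[1] + 7                          # one week later
## [1] "2018-12-31"

# Subtracting the raw strings raises an error, caught here for illustration.
result <- tryCatch(day_info[2] - day_info[1],
                   error = function(e) 'error')
result
## [1] "error"

# Strings in another layout need an explicit format.
parsed <- as.Date('24/12/2018', format = '%d/%m/%Y')
parsed
## [1] "2018-12-24"
```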

1.10 Data Structure

A data structure is a container that holds data elements. There are 5 basic data structures in R.

  • vector
  • matrix
  • dataframe
  • array
  • list

1.11 Data Structure - vector

A vector is a data structure composed of one or more elements, and all elements must have the same data type.

In fact, we already created and printed vectors while checking each type earlier.

sample <- 'A'
sample2 <- c(1,2)
sample3 <- c(1,2,'A')

str(sample)
##  chr "A"
str(sample2)
##  num [1:2] 1 2
str(sample3)
##  chr [1:3] "1" "2" "A"

All of these are vectors. The peculiar thing is that sample3 is of type chr: because all elements of a vector must have the same data type, R converted the numbers 1 and 2 to the strings "1" and "2".

Notice that str() does not label these objects as "vector"; it shows only the data type. In R, a vector is the smallest data structure unit, so the class of a vector is simply its data type.
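
When types are mixed inside c(), R converts everything to the most general type present. A minimal sketch of that conversion order, using throwaway values:

```r
typeof(c(TRUE, 2L))    # logical + integer   -> integer
## [1] "integer"
typeof(c(2L, 3.5))     # integer + double    -> double
## [1] "double"
typeof(c(3.5, 'A'))    # anything + character -> character
## [1] "character"

class(c(1, 2))         # the class of a plain vector is its data type
## [1] "numeric"
is.vector(c(1, 2))     # there is no "vector" class, but this check works
## [1] TRUE
```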

1.12 Data Structure - matrix

A matrix is a 2-dimensional data structure, and all of its elements must be of the same data type.

vector1 <- c(1,2,3,4,5,6,7,8,9,10)
  1. Create a matrix by filling the columns first :
matrix(vector1, nrow=2, ncol=5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
  2. Create a matrix by filling the rows first :
matrix(vector1, nrow=2, ncol=5, byrow=TRUE)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
  3. Check the class of the data (matrix) :
mtr <- matrix(vector1, nrow=2, ncol=5, byrow=TRUE)
class(mtr)
## [1] "matrix" "array"

Unlike vectors, a matrix reports its structure in its class: “matrix” (and, since R 4.0, also “array”).
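
Besides class(), the shape of a matrix can be checked directly. The sketch below rebuilds the same 2 x 5 matrix from 1:10 so that it runs on its own.

```r
mtr <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
dim(mtr)        # rows, then columns
## [1] 2 5
nrow(mtr)
## [1] 2
ncol(mtr)
## [1] 5
dim(t(mtr))     # t() transposes, swapping rows and columns
## [1] 5 2
```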

If you want to combine multiple vectors, you can use cbind or rbind.

vector2 <- c(10,20,30,40,50,60,70,80,90,100)

cbind(vector1, vector2)
##       vector1 vector2
##  [1,]       1      10
##  [2,]       2      20
##  [3,]       3      30
##  [4,]       4      40
##  [5,]       5      50
##  [6,]       6      60
##  [7,]       7      70
##  [8,]       8      80
##  [9,]       9      90
## [10,]      10     100

While cbind combines vectors as columns, rbind combines them as rows.

rbind(vector1[1:5], vector2[1:5])
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]   10   20   30   40   50

We used indexing here. A specific section of a vector can be extracted with ‘[a:b].’

Let’s take a quick look at indexing.

vector2[c(1,3,5,7)]
## [1] 10 30 50 70
vector2[seq(1,8,2)]
## [1] 10 30 50 70
vector2[-1]
## [1]  20  30  40  50  60  70  80  90 100

In this way, you can extract the values at specific indexes with c(), pick indexes at a regular interval with seq(), or exclude the value at a specific index with a negative index (-).
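
Indexes can also be chosen by a condition instead of by position. This is a minimal sketch with the same values as vector2; the threshold 70 is arbitrary.

```r
vector2 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
vector2[vector2 > 70]   # keep the elements where the condition is TRUE
## [1]  80  90 100
which(vector2 > 70)     # positions of those elements
## [1]  8  9 10
vector2[-c(1, 2)]       # several indexes can be excluded at once
## [1]  30  40  50  60  70  80  90 100
```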

Next, matrix indexing.

mtr
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
  1. row - 1, column - 1~3 :
mtr[1,1:3]
## [1] 1 2 3
  2. row - all, column - 4~5 :
mtr[,4:5]
##      [,1] [,2]
## [1,]    4    5
## [2,]    9   10
  3. row - 2, column - all :
mtr[2,]
## [1]  6  7  8  9 10
  4. row - all, column - 1,3,5 :
mtr[,c(1,3,5)]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    6    8   10
  5. row - all, column - all except 1 :
mtr[,-1]
##      [,1] [,2] [,3] [,4]
## [1,]    2    3    4    5
## [2,]    7    8    9   10
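
Notice that selecting a single row above returned a plain vector. As a sketch of how to keep the matrix shape, the drop argument of the bracket operator can be set to FALSE; the matrix is rebuilt here so the example runs on its own.

```r
mtr <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
dim(mtr[2, ])                 # a single row is simplified to a vector
## NULL
dim(mtr[2, , drop = FALSE])   # drop = FALSE keeps a 1 x 5 matrix
## [1] 1 5
mtr[2, 3]                     # one element: row 2, column 3
## [1] 8
```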

Indexing is very important. Handling vectors and matrices freely is a key skill that must come before data preprocessing.

1.13 Data Structure - dataframe

A data frame is a 2-dimensional data structure whose columns do not need to share one data type. That is, unlike a matrix, it can combine vectors of various types.

column1 <- c(1,2,3,4,5)
column2 <- c(10,20,30,40,50)
column3 <- c('hi', 'this', 'is', 'vector', '!')

sample_df <- data.frame(column1, column2, column3)

str(sample_df)
## 'data.frame':    5 obs. of  3 variables:
##  $ column1: num  1 2 3 4 5
##  $ column2: num  10 20 30 40 50
##  $ column3: chr  "hi" "this" "is" "vector" ...
class(sample_df)
## [1] "data.frame"

The indexing used for matrices can be applied to data frames as well.

sample_df
##   column1 column2 column3
## 1       1      10      hi
## 2       2      20    this
## 3       3      30      is
## 4       4      40  vector
## 5       5      50       !
  1. row - 1, column - all :
sample_df[1, ]
##   column1 column2 column3
## 1       1      10      hi
  2. row - all, column - 1~2 :
sample_df[,1:2]
##   column1 column2
## 1       1      10
## 2       2      20
## 3       3      30
## 4       4      40
## 5       5      50
  3. row - all, column - all except 1 :
sample_df[,-1]
##   column2 column3
## 1      10      hi
## 2      20    this
## 3      30      is
## 4      40  vector
## 5      50       !
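
Data frames additionally allow access by column name and row selection by condition. The sketch below rebuilds sample_df with the same values; the condition column1 > 3 is only an example.

```r
sample_df <- data.frame(column1 = c(1, 2, 3, 4, 5),
                        column2 = c(10, 20, 30, 40, 50),
                        column3 = c('hi', 'this', 'is', 'vector', '!'))

sample_df$column2                    # one column as a vector, by name
## [1] 10 20 30 40 50
sample_df[sample_df$column1 > 3, ]   # rows where the condition is TRUE
##   column1 column2 column3
## 4       4      40  vector
## 5       5      50       !
```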

1.14 Data Structure - array

An array is a multi-dimensional data structure that extends a matrix to any number of dimensions, and all elements must be of the same data type, just like a matrix.

  1. 1 dimension :
array(1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
  2. 2 dimensions :
array(1:10, dim=c(5,2)) # c(row, column)
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
  3. 3 dimensions :
array(1:30, dim=c(5,2,3)) # c(row, column, N)
## , , 1
## 
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   11   16
## [2,]   12   17
## [3,]   13   18
## [4,]   14   19
## [5,]   15   20
## 
## , , 3
## 
##      [,1] [,2]
## [1,]   21   26
## [2,]   22   27
## [3,]   23   28
## [4,]   24   29
## [5,]   25   30
  4. 4 dimensions :
array(1:200, dim=c(5, 10, 2, 2)) # c(row, column, N, N)
## , , 1, 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    6   11   16   21   26   31   36   41    46
## [2,]    2    7   12   17   22   27   32   37   42    47
## [3,]    3    8   13   18   23   28   33   38   43    48
## [4,]    4    9   14   19   24   29   34   39   44    49
## [5,]    5   10   15   20   25   30   35   40   45    50
## 
## , , 2, 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]   51   56   61   66   71   76   81   86   91    96
## [2,]   52   57   62   67   72   77   82   87   92    97
## [3,]   53   58   63   68   73   78   83   88   93    98
## [4,]   54   59   64   69   74   79   84   89   94    99
## [5,]   55   60   65   70   75   80   85   90   95   100
## 
## , , 1, 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  101  106  111  116  121  126  131  136  141   146
## [2,]  102  107  112  117  122  127  132  137  142   147
## [3,]  103  108  113  118  123  128  133  138  143   148
## [4,]  104  109  114  119  124  129  134  139  144   149
## [5,]  105  110  115  120  125  130  135  140  145   150
## 
## , , 2, 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  151  156  161  166  171  176  181  186  191   196
## [2,]  152  157  162  167  172  177  182  187  192   197
## [3,]  153  158  163  168  173  178  183  188  193   198
## [4,]  154  159  164  169  174  179  184  189  194   199
## [5,]  155  160  165  170  175  180  185  190  195   200

We can use indexing for arrays too, but it can be more complex than for a matrix or a vector, because every dimension needs its own index.

test <- array(1:200, dim=c(5, 10, 2, 2))
test[3:5,,,] # row indexing
## , , 1, 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    3    8   13   18   23   28   33   38   43    48
## [2,]    4    9   14   19   24   29   34   39   44    49
## [3,]    5   10   15   20   25   30   35   40   45    50
## 
## , , 2, 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]   53   58   63   68   73   78   83   88   93    98
## [2,]   54   59   64   69   74   79   84   89   94    99
## [3,]   55   60   65   70   75   80   85   90   95   100
## 
## , , 1, 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  103  108  113  118  123  128  133  138  143   148
## [2,]  104  109  114  119  124  129  134  139  144   149
## [3,]  105  110  115  120  125  130  135  140  145   150
## 
## , , 2, 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  153  158  163  168  173  178  183  188  193   198
## [2,]  154  159  164  169  174  179  184  189  194   199
## [3,]  155  160  165  170  175  180  185  190  195   200
test[3:5,5:8,,] # row and column indexing
## , , 1, 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]   23   28   33   38
## [2,]   24   29   34   39
## [3,]   25   30   35   40
## 
## , , 2, 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]   73   78   83   88
## [2,]   74   79   84   89
## [3,]   75   80   85   90
## 
## , , 1, 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]  123  128  133  138
## [2,]  124  129  134  139
## [3,]  125  130  135  140
## 
## , , 2, 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]  173  178  183  188
## [2,]  174  179  184  189
## [3,]  175  180  185  190
test[3:5,5:8,2,2] # rows, columns, and both N indexes fixed at 2 ==> part (some rows and some columns) of the last matrix in the array
##      [,1] [,2] [,3] [,4]
## [1,]  173  178  183  188
## [2,]  174  179  184  189
## [3,]  175  180  185  190
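
With long printed output it is easy to lose track of the shape of each result, so a quick sketch using dim() on the same subsets may help. The array is rebuilt here so the example runs on its own.

```r
test <- array(1:200, dim = c(5, 10, 2, 2))
dim(test[3:5, , , ])         # only the first dimension shrinks
## [1]  3 10  2  2
dim(test[3:5, 5:8, 2, 2])    # fixed indexes are dropped, leaving a matrix
## [1] 3 4
test[5, 10, 2, 2]            # the very last element of the array
## [1] 200
```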

1.15 Data Structure - list

A list is the most flexible data structure in R. Its elements can be of any data type and any structure, including vectors, data frames, and other lists.

li <- list(c(1,2,3,4), c('hi','im','in','list'), sample_df)
li
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "hi"   "im"   "in"   "list"
## 
## [[3]]
##   column1 column2 column3
## 1       1      10      hi
## 2       2      20    this
## 3       3      30      is
## 4       4      40  vector
## 5       5      50       !

Notice the double square brackets around each index. To extract the values stored in a list element (rather than a one-element list), we have to use double brackets (‘[[’, ‘]]’).

First, if we need one element of a list as a whole, the single-bracket indexing shown above can be used. For example,

sample_df[1] # extract the first column (the result is still a data frame)
##   column1
## 1       1
## 2       2
## 3       3
## 4       4
## 5       5
li[1]
## [[1]]
## [1] 1 2 3 4

Here comes the other part: if you want to go deeper, i.e. extract only a few elements from inside a list element, you have to use double brackets. Like this,

li[[1]][1:3]
## [1] 1 2 3

What makes the difference?

With a list, one bracket returns a list.

li[2]
## [[1]]
## [1] "hi"   "im"   "in"   "list"
typeof(li[2])
## [1] "list"

With double brackets, it returns the element itself, with the element’s own data type.

typeof(li[[2]])
## [1] "character"

That is why you need double brackets to access specific values inside an element of a list.

li[[2]][2:4]
## [1] "im"   "in"   "list"
li[[3]][2:1,2:3] # indexing : first the list element, then the data frame
##   column2 column3
## 2      20    this
## 1      10      hi

You can name the elements of a list, just like the columns of a data frame.

li2 <- list(nu=c(1,2,3,4,5),
            ch=c('hi','hello','hey'),
            df=data.frame(c('a','b','c'),c('any','baby','can'),c(10,11,12)))
str(li2)
## List of 3
##  $ nu: num [1:5] 1 2 3 4 5
##  $ ch: chr [1:3] "hi" "hello" "hey"
##  $ df:'data.frame':  3 obs. of  3 variables:
##   ..$ c..a....b....c..       : chr [1:3] "a" "b" "c"
##   ..$ c..any....baby....can..: chr [1:3] "any" "baby" "can"
##   ..$ c.10..11..12.          : num [1:3] 10 11 12

And if you have named each element, you can access it using ‘$.’

li2$nu
## [1] 1 2 3 4 5
li2$ch
## [1] "hi"    "hello" "hey"
li2$df
##   c..a....b....c.. c..any....baby....can.. c.10..11..12.
## 1                a                     any            10
## 2                b                    baby            11
## 3                c                     can            12

As you may have noticed, it does the same thing as double brackets.

str(li2[[3]])
## 'data.frame':    3 obs. of  3 variables:
##  $ c..a....b....c..       : chr  "a" "b" "c"
##  $ c..any....baby....can..: chr  "any" "baby" "can"
##  $ c.10..11..12.          : num  10 11 12
str(li2$df)
## 'data.frame':    3 obs. of  3 variables:
##  $ c..a....b....c..       : chr  "a" "b" "c"
##  $ c..any....baby....can..: chr  "any" "baby" "can"
##  $ c.10..11..12.          : num  10 11 12

Finally, let’s look at the difference when using only one bracket.

str(li2[3])
## List of 1
##  $ df:'data.frame':  3 obs. of  3 variables:
##   ..$ c..a....b....c..       : chr [1:3] "a" "b" "c"
##   ..$ c..any....baby....can..: chr [1:3] "any" "baby" "can"
##   ..$ c.10..11..12.          : num [1:3] 10 11 12
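
The automatically generated column names above (such as c..a....b....c..) appear because the data frame columns were not named. As a sketch, naming them inside data.frame() gives cleaner access; the names li3, id, and score are made up for this illustration.

```r
li3 <- list(nu = c(1, 2, 3),
            df = data.frame(id = c('a', 'b', 'c'),
                            score = c(10, 11, 12)))
names(li3)
## [1] "nu" "df"
li3[['nu']]         # double brackets also accept a name
## [1] 1 2 3
names(li3$df)       # readable column names instead of generated ones
## [1] "id"    "score"
li3$df$score        # $ can be chained: list element, then column
## [1] 10 11 12
class(li3['df'])    # one bracket: still a list
## [1] "list"
class(li3[['df']])  # double brackets: the data frame itself
## [1] "data.frame"
```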