3 Data structures
In R we have objects which are functions and objects which are data.
- Function examples:
sin()
integrate()
plot()
paste()
- Data examples:
42
1:5
"R"
matrix(1:12, nrow=4, ncol=3)
data.frame(a=1:5, tmt=c("a","b","a","b","a"))
list(x=2, y="abc", z=1:10)
3.1 Vector
> # Vector of numbers, e.g:
> c(1, 1.2, pi, exp(1))
## [1] 1.000 1.200 3.142 2.718
>
> # We can have vectors of other things too, e.g:
> c(TRUE, 1 == 2)
## [1] TRUE FALSE
> c("a", "ab", "abc")
## [1] "a" "ab" "abc"
>
> # But not combinations, e.g:
> c("a", 5, 1 == 2)
## [1] "a" "5" "FALSE"
> # Notice that R just turned everything into characters!
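The coercion seen above follows a fixed order: logical values become numbers, and numbers become character strings when needed. A minimal sketch of that rule; the variable names are chosen here only for illustration:

```r
# Mixing types in c() coerces every element to the most general type present
num_lgl <- c(1.5, TRUE, FALSE)   # logical -> numeric
chr_all <- c("a", 5, 1 == 2)     # everything -> character

class(num_lgl)
class(chr_all)

# Explicit conversion back to numbers is possible when the strings allow it
as.numeric(c("1", "2.5"))
```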
3.1.1 Constructing vectors
> # Integers from 9 to 17
> x <- 9:17
> x
## [1] 9 10 11 12 13 14 15 16 17
>
> # A sequence of 11 numbers from 0 to 1
> y <- seq(0, 1, length.out = 11)
> y
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
>
> # The same number or the same vector several times
> z <- rep(1:2, 5)
> z
## [1] 1 2 1 2 1 2 1 2 1 2
>
> # Combine numbers, vectors or both into a new vector
> xz10 <- c(x, z, 10)
> xz10
## [1] 9 10 11 12 13 14 15 16 17 1 2 1 2 1 2 1 2 1 2 10
3.1.2 Index and logical index
> # Define a vector with integers from (-5) to 5 and extract the numbers with
> # absolute value less than 3:
> x <- (-5):5
> x
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
>
> # by their index in the vector:
> x[4:8]
## [1] -2 -1 0 1 2
>
> # or, by negative selection (set a minus in front of the indices we don't
> # want):
> x[-c(1:3, 9:11)]
## [1] -2 -1 0 1 2
>
> # A logical vector can be defined by:
> index <- abs(x) < 3
> index
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
>
> # Now this vector can be used to extract the wanted numbers:
> x[index]
## [1] -2 -1 0 1 2
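Logical indices can also be converted to positions or combined into compound conditions. A short sketch along the same lines as the example above; the names pos, outer_vals and inner_pos are illustrative only:

```r
x <- (-5):5

# which() turns a logical vector into the positions that are TRUE
pos <- which(abs(x) < 3)

# Conditions can be combined with & (and) and | (or)
outer_vals <- x[x < -3 | x > 3]
inner_pos  <- x[x > 0 & abs(x) < 3]

pos
outer_vals
inner_pos
```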
3.2 Factor
- A special kind of vector is a factor. It has a known finite set of levels (options), e.g:
> # gl = generate levels
> gl(2, 10, labels = c("male", "female"))
## [1] male male male male male male male male male male
## [11] female female female female female female female female female female
## Levels: male female
>
> # One could also do:
> as.factor(c(rep("male", 10), rep("female", 10)))
## [1] male male male male male male male male male male
## [11] female female female female female female female female female female
## Levels: female male
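Note that the two calls above give the levels in a different order: as.factor() sorts them alphabetically. If the order matters (for example for plots or model contrasts), it can be fixed explicitly. A small sketch, with the variable names being assumptions made for illustration:

```r
sex <- c(rep("male", 10), rep("female", 10))

# as.factor() sorts the levels alphabetically
f_default <- as.factor(sex)
levels(f_default)

# factor() with an explicit 'levels' argument fixes the order
f_ordered <- factor(sex, levels = c("male", "female"))
levels(f_ordered)

# table() counts the observations per level
table(f_ordered)
```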
3.3 Matrix and array
- Similar to vectors we can have matrices of objects of the same type, e.g:
> matrix(c(1, 2, 3, 4, 5, 6) + pi, nrow = 2)
## [,1] [,2] [,3]
## [1,] 4.142 6.142 8.142
## [2,] 5.142 7.142 9.142
>
> matrix(c(1, 2, 3, 4, 5, 6) + pi, nrow = 2) < 6
## [,1] [,2] [,3]
## [1,] TRUE FALSE FALSE
## [2,] TRUE FALSE FALSE
>
> # We can create higher order arrays, e.g:
> array(c(1:24), dim = c(4, 3, 2))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
3.3.1 Constructing matrices
>
> # Combine rows into a matrix
> A <- rbind(1:3, c(1, 1, 2))
> A
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 1 2
>
> # Or columns
> B <- cbind(1:3, c(1, 1, 2))
> B
## [,1] [,2]
## [1,] 1 1
## [2,] 2 1
## [3,] 3 2
>
> # Define a matrix from one long vector
> C <- matrix(c(1, 0, 0, 1, 1, 0, 1, 1, 1), nrow = 3, ncol = 3)
> C
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 0 1 1
## [3,] 0 0 1
>
> # Can also be done by rows by adding 'byrow=TRUE' before the last parenthesis.
> # Try!
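One possible way to carry out the suggested experiment is sketched here. The names v, C_col and C_row are illustrative; the values are the ones used for C above:

```r
v <- c(1, 0, 0, 1, 1, 0, 1, 1, 1)

# Default: the vector fills the matrix column by column
C_col <- matrix(v, nrow = 3, ncol = 3)

# With byrow = TRUE the vector fills the matrix row by row
C_row <- matrix(v, nrow = 3, ncol = 3, byrow = TRUE)

C_col
C_row

# For a square matrix the two results are transposes of each other
identical(C_row, t(C_col))
```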
3.3.2 Index and logical index
> A <- matrix((-4):5, nrow = 2, ncol = 5)
> A
## [,1] [,2] [,3] [,4] [,5]
## [1,] -4 -2 0 2 4
## [2,] -3 -1 1 3 5
>
>
> # Negative values
> A[A < 0]
## [1] -4 -3 -2 -1
>
> # Assignments
> A[A < 0] <- 0
> A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 2 4
## [2,] 0 0 1 3 5
>
> # Matrix rows can be selected by
> A[2, ]
## [1] 0 0 1 3 5
>
> # and similarly for columns
> A[, c(2, 4)]
## [,1] [,2]
## [1,] 0 2
## [2,] 0 3
3.3.3 Properties of vectors and matrices
- The R function mode(), applied to a vector or a matrix, returns the type of the elements it stores:
> A <- matrix(rep(c(TRUE, FALSE), 2), nrow = 2)
>
> B <- rnorm(4)
>
> C <- matrix(LETTERS[1:9], nrow = 3)
>
> A
## [,1] [,2]
## [1,] TRUE TRUE
## [2,] FALSE FALSE
> B
## [1] -0.1343 0.1892 -1.2469 -1.0376
> C
## [,1] [,2] [,3]
## [1,] "A" "D" "G"
## [2,] "B" "E" "H"
## [3,] "C" "F" "I"
>
> mode(A)
## [1] "logical"
> mode(B)
## [1] "numeric"
> mode(C)
## [1] "character"
- Vectors and matrices have lengths: the length is the number of elements:
> x <- matrix(c(NA, 2:12), ncol = 3)
> x
## [,1] [,2] [,3]
## [1,] NA 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
>
> length(x[1, ])
## [1] 3
>
> length(x)
## [1] 12
>
> # The dimension of a matrix is the number of rows and columns: The number of
> # columns is the second element:
> dim(x)
## [1] 4 3
> dim(x)[2]
## [1] 3
3.3.4 Naming rows and columns in a matrix
- We can add names to a matrix with the colnames() and rownames() functions:
> x <- matrix(rnorm(12), nrow = 4)
> x
## [,1] [,2] [,3]
## [1,] -0.17693 -0.6982 -0.7909
## [2,] -1.82361 1.7129 0.7004
## [3,] -0.03294 1.4788 -1.6718
## [4,] -0.80677 0.4982 -0.8348
>
> colnames(x) <- paste("data", 1:3, sep = "")
>
> rownames(x) <- paste("obs", 1:4, sep = "")
>
> x
## data1 data2 data3
## obs1 -0.17693 -0.6982 -0.7909
## obs2 -1.82361 1.7129 0.7004
## obs3 -0.03294 1.4788 -1.6718
## obs4 -0.80677 0.4982 -0.8348
>
> y <- matrix(rnorm(15), nrow = 5)
> y
## [,1] [,2] [,3]
## [1,] -0.9181 1.3811 0.9727
## [2,] -0.2014 -0.2144 1.7128
## [3,] 0.3535 -0.3591 0.4331
## [4,] -0.1364 0.2015 0.8959
## [5,] -1.2797 -1.2802 0.9468
>
> colnames(y) <- LETTERS[1:ncol(y)]
>
> rownames(y) <- letters[1:nrow(y)]
>
> y
## A B C
## a -0.9181 1.3811 0.9727
## b -0.2014 -0.2144 1.7128
## c 0.3535 -0.3591 0.4331
## d -0.1364 0.2015 0.8959
## e -1.2797 -1.2802 0.9468
3.3.5 Matrix multiplication
> M <- matrix(rnorm(20), nrow = 4, ncol = 5)
> N <- matrix(rnorm(15), nrow = 5, ncol = 3)
>
> M %*% N
## [,1] [,2] [,3]
## [1,] -1.3324 -0.3467 0.1803
## [2,] 1.7218 2.0924 -1.2959
## [3,] 1.7792 0.5250 1.9424
## [4,] 0.1583 1.9016 -1.0523
>
> # Can we compute N %*% M? No! N (5 x 3) and M (4 x 5) are not conformable.
> # Try to run: N %*% M
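Whether a product is defined can be checked from the dimensions before multiplying: the number of columns of the left matrix must equal the number of rows of the right one. A sketch under the assumption of an arbitrary seed, used only to make the example reproducible:

```r
set.seed(1)  # arbitrary seed, not part of the original example
M <- matrix(rnorm(20), nrow = 4, ncol = 5)
N <- matrix(rnorm(15), nrow = 5, ncol = 3)

ncol(M) == nrow(N)   # TRUE: M %*% N is defined
ncol(N) == nrow(M)   # FALSE: N %*% M is not

dim(M %*% N)

# try() catches the error so that a script keeps running
res <- try(N %*% M, silent = TRUE)
inherits(res, "try-error")

# Transposing both factors makes the dimensions match: (3 x 5) %*% (5 x 4)
dim(t(N) %*% t(M))
```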
3.3.6 Additional functions
> M <- matrix(rnorm(16), nrow = 4, ncol = 4)
>
> dim(M)
## [1] 4 4
>
> t(M)
## [,1] [,2] [,3] [,4]
## [1,] 0.8836 0.7216 1.1255 -0.3660
## [2,] -1.4741 0.6280 -0.5720 -0.1200
## [3,] -0.5487 1.0463 -1.2647 0.2757
## [4,] 0.4768 -1.4371 -0.1576 -0.3024
>
> det(M)
## [1] 1.64
>
> (invM <- solve(M))
## [,1] [,2] [,3] [,4]
## [1,] 0.3360 0.4212 -0.10608 -1.417
## [2,] -0.5751 0.0519 0.03683 -1.173
## [3,] 0.5220 0.3749 -0.82252 -0.530
## [4,] 0.2975 -0.1887 -0.63600 -1.610
>
> eigen(M)
## eigen() decomposition
## $values
## [1] 0.7635+1.456i 0.7635-1.456i -0.9312+0.000i -0.6514+0.000i
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] -0.68683+0.00000i -0.68683+0.00000i 0.2619+0i 0.36373+0i
## [2,] 0.08143+0.62184i 0.08143-0.62184i 0.6659+0i 0.63353+0i
## [3,] -0.34794+0.08086i -0.34794-0.08086i -0.4924+0i -0.09702+0i
## [4,] 0.02428-0.08228i 0.02428+0.08228i 0.4954+0i 0.67597+0i
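A quick sanity check on these functions is to verify that the inverse really inverts the matrix. The sketch below uses a small fixed matrix (an assumption, chosen instead of random numbers so that the result is reproducible):

```r
# A fixed, well-conditioned 4 x 4 matrix (illustrative values)
M <- matrix(c(2, 1, 0, 0,
              1, 3, 1, 0,
              0, 1, 4, 1,
              0, 0, 1, 5), nrow = 4)
invM <- solve(M)

# M %*% invM equals the identity matrix up to rounding error
all.equal(M %*% invM, diag(4))

# solve(M, b) solves the linear system M x = b directly
b <- c(1, 2, 3, 4)
x <- solve(M, b)
all.equal(as.vector(M %*% x), b)

# The determinant is unchanged by transposition
all.equal(det(t(M)), det(M))
```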
3.4 Data frame
- A special data object is called a data frame (data.frame). We can create data frames by reading data in from files or by using the function as.data.frame() on a set of vectors. A data frame is a set of parallel vectors, where the vectors can be of different types, e.g:
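The calls that produced the two outputs that follow are not shown. The sketch below is one way to obtain output of that shape; the values are taken from the printed output, but the calls themselves are an assumption:

```r
course <- c("CTA", "PSP", "RM")
hours  <- c(39, 65, 52)

# data.frame() keeps each column's own type
df <- data.frame(course = course, hours = hours)
df
sapply(df, class)

# cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(course = course, hours = hours)
m
mode(m)

# as.data.frame() on a list of equal-length vectors also gives a data frame
df_alt <- as.data.frame(list(course = course, hours = hours))
```

The contrast is the point: in the data frame the hours stay numeric, whereas in the matrix every entry is a string.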
## course hours
## 1 CTA 39
## 2 PSP 65
## 3 RM 52
## course hours
## [1,] "CTA" "39"
## [2,] "PSP" "65"
## [3,] "RM" "52"
3.4.1 Data frames: adding and removing columns
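The definition of dat is not shown with the output that follows. A hypothetical reconstruction, with the values read off the printed output and the calls assumed, could look like this:

```r
# Hypothetical reconstruction of 'dat'; the values come from the printed output
dat <- data.frame(x = c("A", "B", "C"), y = 1:3, stringsAsFactors = FALSE)
dat

# Two equivalent ways to extract one column as a vector
dat$x
dat[["x"]]
```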
## x y
## 1 A 1
## 2 B 2
## 3 C 3
## [1] "A" "B" "C"
## [1] "A" "B" "C"
> # It is simple to add or remove a column:
>
> dat$z <- dat$y^2
> dat$name <- c("A1", "A2", "A3")
> dat$y <- NULL
> dat
## x z name
## 1 A 1 A1
## 2 B 4 A2
## 3 C 9 A3
3.4.2 Data frames: merging data frames
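The two data frames printed next are not defined in the text. Assuming the values shown in the output, they could be created as follows (the calls are a reconstruction, not the original code):

```r
# Hypothetical definitions; the values are read off the printed output
df1 <- data.frame(course = c("CTA", "PSP", "RM"), hours = c(39, 65, 52))
df2 <- data.frame(course = c("RM", "CTA", "PSP"), credits = c(6, 4, 8))

df1
df2
```

Note that the rows of df2 are in a different order than those of df1; merge() matches rows on the key column, not on position.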
## course hours
## 1 CTA 39
## 2 PSP 65
## 3 RM 52
## course credits
## 1 RM 6
## 2 CTA 4
## 3 PSP 8
> # We can merge that information into one data set by:
>
> df12 <- merge(df1, df2, by = "course")
> df12
## course hours credits
## 1 CTA 39 4
## 2 PSP 65 8
## 3 RM 52 6
3.4.3 Data frames: getting dimension, column info and others
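The output in this section comes from the built-in airquality data set, but the calls are not shown. The following sketch lists functions that return this kind of information; the exact calls and arguments used originally are an assumption:

```r
names(airquality)          # column names
class(airquality$Ozone)    # type of a single column
class(airquality$Wind)
dim(airquality)            # number of rows and columns
nrow(airquality)
ncol(airquality)
str(airquality)            # compact overview of the structure

head(airquality, n = 3)    # first rows; 'n' controls how many
tail(airquality, n = 3)    # last rows
```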
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
## [1] "integer"
## [1] "numeric"
## [1] 153 6
## [1] 153
## [1] 6
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## Ozone Solar.R Wind Temp Month Day
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
## Ozone Solar.R Wind Temp Month Day
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
3.4.4 Data frames: the subset() function
- Let’s look at the airquality data again:
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
- Logical indexing applies to data frames:
- … but a neat function is built in for making subsets of data:
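The calls behind the three outputs that follow are not shown. Judging from the columns and rows printed, they are consistent with calls like the ones sketched here; treat the exact arguments as an assumption:

```r
# Logical indexing: rows where Temp exceeds 80, two named columns
hot1 <- airquality[airquality$Temp > 80, c("Ozone", "Temp")]

# The same selection with subset(); column names are used unquoted
hot2 <- subset(airquality, Temp > 80, select = c(Ozone, Temp))

# Drop a column with a minus sign ...
day1 <- subset(airquality, Day == 1, select = -Temp)

# ... or pick a range of adjacent columns with ':'
first <- subset(airquality, select = Ozone:Wind)
```

One difference worth knowing: when the condition itself evaluates to NA, subset() drops that row, whereas logical indexing returns a row filled with NA.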
## Ozone Temp
## 29 45 81
## 35 NA 84
## 36 NA 85
## 38 29 82
## 39 NA 87
## 40 71 90
## 41 39 87
## 42 NA 93
## 43 NA 92
## 44 23 82
## 61 NA 83
## 62 135 84
## 63 49 85
## 64 32 81
## 65 NA 84
## 66 64 83
## 67 40 83
## 68 77 88
## 69 97 92
## 70 97 92
## 71 85 89
## 72 NA 82
## 74 27 81
## 75 NA 91
## 77 48 81
## 78 35 82
## 79 61 84
## 80 79 87
## 81 63 85
## 83 NA 81
## 84 NA 82
## 85 80 86
## 86 108 85
## 87 20 82
## 88 52 86
## 89 82 88
## 90 50 86
## 91 64 83
## 92 59 81
## 93 39 81
## 94 9 81
## 95 16 82
## 96 78 86
## 97 35 85
## 98 66 87
## 99 122 89
## 100 89 90
## 101 110 90
## 102 NA 92
## 103 NA 86
## 104 44 86
## 105 28 82
## 117 168 81
## 118 73 86
## 119 NA 88
## 120 76 97
## 121 118 94
## 122 84 96
## 123 85 94
## 124 96 91
## 125 78 92
## 126 73 93
## 127 91 93
## 128 47 87
## 129 32 84
## 134 44 81
## 143 16 82
## 146 36 81
## Ozone Solar.R Wind Month Day
## 1 41 190 7.4 5 1
## 32 NA 286 8.6 6 1
## 62 135 269 4.1 7 1
## 93 39 83 6.9 8 1
## 124 96 167 6.9 9 1
## Ozone Solar.R Wind
## 1 41 190 7.4
## 2 36 118 8.0
## 3 12 149 12.6
## 4 18 313 11.5
## 5 NA NA 14.3
## 6 28 NA 14.9
## 7 23 299 8.6
## 8 19 99 13.8
## 9 8 19 20.1
## 10 NA 194 8.6
## 11 7 NA 6.9
## 12 16 256 9.7
## 13 11 290 9.2
## 14 14 274 10.9
## 15 18 65 13.2
## 16 14 334 11.5
## 17 34 307 12.0
## 18 6 78 18.4
## 19 30 322 11.5
## 20 11 44 9.7
## 21 1 8 9.7
## 22 11 320 16.6
## 23 4 25 9.7
## 24 32 92 12.0
## 25 NA 66 16.6
## 26 NA 266 14.9
## 27 NA NA 8.0
## 28 23 13 12.0
## 29 45 252 14.9
## 30 115 223 5.7
## 31 37 279 7.4
## 32 NA 286 8.6
## 33 NA 287 9.7
## 34 NA 242 16.1
## 35 NA 186 9.2
## 36 NA 220 8.6
## 37 NA 264 14.3
## 38 29 127 9.7
## 39 NA 273 6.9
## 40 71 291 13.8
## 41 39 323 11.5
## 42 NA 259 10.9
## 43 NA 250 9.2
## 44 23 148 8.0
## 45 NA 332 13.8
## 46 NA 322 11.5
## 47 21 191 14.9
## 48 37 284 20.7
## 49 20 37 9.2
## 50 12 120 11.5
## 51 13 137 10.3
## 52 NA 150 6.3
## 53 NA 59 1.7
## 54 NA 91 4.6
## 55 NA 250 6.3
## 56 NA 135 8.0
## 57 NA 127 8.0
## 58 NA 47 10.3
## 59 NA 98 11.5
## 60 NA 31 14.9
## 61 NA 138 8.0
## 62 135 269 4.1
## 63 49 248 9.2
## 64 32 236 9.2
## 65 NA 101 10.9
## 66 64 175 4.6
## 67 40 314 10.9
## 68 77 276 5.1
## 69 97 267 6.3
## 70 97 272 5.7
## 71 85 175 7.4
## 72 NA 139 8.6
## 73 10 264 14.3
## 74 27 175 14.9
## 75 NA 291 14.9
## 76 7 48 14.3
## 77 48 260 6.9
## 78 35 274 10.3
## 79 61 285 6.3
## 80 79 187 5.1
## 81 63 220 11.5
## 82 16 7 6.9
## 83 NA 258 9.7
## 84 NA 295 11.5
## 85 80 294 8.6
## 86 108 223 8.0
## 87 20 81 8.6
## 88 52 82 12.0
## 89 82 213 7.4
## 90 50 275 7.4
## 91 64 253 7.4
## 92 59 254 9.2
## 93 39 83 6.9
## 94 9 24 13.8
## 95 16 77 7.4
## 96 78 NA 6.9
## 97 35 NA 7.4
## 98 66 NA 4.6
## 99 122 255 4.0
## 100 89 229 10.3
## 101 110 207 8.0
## 102 NA 222 8.6
## 103 NA 137 11.5
## 104 44 192 11.5
## 105 28 273 11.5
## 106 65 157 9.7
## 107 NA 64 11.5
## 108 22 71 10.3
## 109 59 51 6.3
## 110 23 115 7.4
## 111 31 244 10.9
## 112 44 190 10.3
## 113 21 259 15.5
## 114 9 36 14.3
## 115 NA 255 12.6
## 116 45 212 9.7
## 117 168 238 3.4
## 118 73 215 8.0
## 119 NA 153 5.7
## 120 76 203 9.7
## 121 118 225 2.3
## 122 84 237 6.3
## 123 85 188 6.3
## 124 96 167 6.9
## 125 78 197 5.1
## 126 73 183 2.8
## 127 91 189 4.6
## 128 47 95 7.4
## 129 32 92 15.5
## 130 20 252 10.9
## 131 23 220 10.3
## 132 21 230 10.9
## 133 24 259 9.7
## 134 44 236 14.9
## 135 21 259 15.5
## 136 28 238 6.3
## 137 9 24 10.9
## 138 13 112 11.5
## 139 46 237 6.9
## 140 18 224 13.8
## 141 13 27 10.3
## 142 24 238 10.3
## 143 16 201 8.0
## 144 13 238 12.6
## 145 23 14 9.2
## 146 36 139 10.3
## 147 7 49 10.3
## 148 14 20 16.6
## 149 30 193 6.9
## 150 NA 145 13.2
## 151 14 191 14.3
## 152 18 131 8.0
## 153 20 223 11.5
3.4.5 Data frames: the summary() function
- The summary() function gives you a range of statistics…
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.70 7.40 9.70 9.96 11.50 20.70
… that you could alternatively obtain using the R functions min(), max(), mean(), median(), quantile().
- The summary of a data frame gives the summary of each column:
> summary(airquality)
## Ozone Solar.R Wind Temp Month
## Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0 Min. :5.00
## 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0 1st Qu.:6.00
## Median : 31.5 Median :205 Median : 9.70 Median :79.0 Median :7.00
## Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9 Mean :6.99
## 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0 3rd Qu.:8.00
## Max. :168.0 Max. :334 Max. :20.70 Max. :97.0 Max. :9.00
## NA's :37 NA's :7
## Day
## Min. : 1.0
## 1st Qu.: 8.0
## Median :16.0
## Mean :15.8
## 3rd Qu.:23.0
## Max. :31.0
##
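The single-column summary at the start of this section matches the Wind column of the table above. As a sketch, the same numbers can be obtained one function at a time (the variable name w is illustrative):

```r
w <- airquality$Wind

summary(w)

# The same statistics, computed individually
min(w)
max(w)
mean(w)
median(w)
quantile(w, probs = c(0.25, 0.75))   # 1st and 3rd quartile
```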
3.4.6 Data frames: missing values
- R uses the special value NA to code missing values. The result of arithmetic involving NAs becomes NA as well:
- We need a special function is.na() to filter out NAs:
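A minimal sketch of both points, using a small made-up vector v (not from the original text) and the Ozone column of airquality:

```r
v <- c(1, NA, 3)

v + 1          # the NA propagates through arithmetic
sum(v)         # NA, because one term is unknown
v == NA        # always NA, so '==' cannot be used to find NAs

is.na(v)       # FALSE TRUE FALSE
v[!is.na(v)]   # keep only the observed values

# Number of missing Ozone measurements
sum(is.na(airquality$Ozone))
```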
- To get rid of NAs in a column we can use:
> s <- subset(airquality, !is.na(Ozone))
>
> colMeans(s)
## Ozone Solar.R Wind Temp Month Day
## 42.129 NA 9.862 77.871 7.198 15.534
- Note that the argument na.rm=TRUE can be passed to most summary functions, e.g. sum(), mean(), sd():
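The calls for the output that follows are not shown; it is consistent with calls of this kind (the exact calls are an assumption):

```r
mean(airquality$Ozone)                 # NA, because Ozone contains missing values
mean(airquality$Ozone, na.rm = TRUE)   # mean of the observed values only

# Column means, with NAs removed separately within each column
colMeans(airquality, na.rm = TRUE)

sd(airquality$Ozone, na.rm = TRUE)
```

Unlike the subset() approach above, na.rm = TRUE removes missing values column by column, so no complete rows are discarded.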
## [1] 42.13
## Ozone Solar.R Wind Temp Month Day
## 42.129 185.932 9.958 77.882 6.993 15.804
3.5 Lists
- A list is the most general object type. Its elements can be of different types and lengths, e.g:
> list(a = 1, b = "Lisbon", c = c(1, 2, 3), d = list(e = matrix(1:4, 2), f = function(x) x^2))
## $a
## [1] 1
##
## $b
## [1] "Lisbon"
##
## $c
## [1] 1 2 3
##
## $d
## $d$e
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $d$f
## function(x) x^2
- The objects returned from many of the built-in functions in R are fairly complicated lists!
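Elements of a list are extracted by name or position. The sketch below reuses the list from above under the illustrative name lst, and uses lm() as one example of a built-in function that returns a list:

```r
lst <- list(a = 1, b = "Lisbon", c = c(1, 2, 3),
            d = list(e = matrix(1:4, 2), f = function(x) x^2))

lst$b            # extract by name
lst[["c"]][2]    # [[ ]] returns the element itself
lst["c"]         # [ ] returns a list of length one
lst$d$f(3)       # nested access: call the stored function
names(lst)

# A fitted linear model is a list under the hood
fit <- lm(Ozone ~ Temp, data = airquality)
is.list(fit)
names(fit)
fit$coefficients
```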