
R Vocabulary - Part 1

To be a proficient R user, you need to read and understand the material in the book Advanced R by Hadley Wickham. The second chapter in this book is on vocabulary - a list of functions from the base, stats and utils packages which all R users should be familiar with. In a series of posts, we will attempt to learn most of the functions mentioned in the chapter using some examples.

We will skip the function ? and start with str. According to its documentation, str can be used to display the internal structure of an R object. Let us look at a few simple examples first.

x <- c(1, 2, 3)
str(x)
##  num [1:3] 1 2 3
x <- c(1L, 2L)
str(x)
##  int [1:2] 1 2
x <- c(TRUE, FALSE, TRUE, TRUE)
str(x)
##  logi [1:4] TRUE FALSE TRUE TRUE
x <- c("a", "b", "c")
str(x)
##  chr [1:3] "a" "b" "c"
x <- c(1 + 2i, 3 + 0i, 1i)
str(x)
##  cplx [1:3] 1+2i 3+0i 0+1i
str(charToRaw("radmuzom"))
##  raw [1:8] 72 61 64 6d ...

From the above examples, we see that for atomic vectors, str displays the type, the number of elements in the vector and the first few elements. What happens if we apply str to functions?

str(c)
## function (...)
str(ls)
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
##     pattern, sorted = TRUE)
str(print)
## function (x, ...)

It’s interesting to see that str shows only the argument list of each function, even though the three functions are of different kinds: c is a primitive function, ls is a regular R function, while print is an S3 generic function. This can be verified by typing the function name in the console without any parentheses. An explanation of primitive functions or S3 generics is beyond the scope of this post.

Let us now look at lists.

l <- list(x = 1, a = "A")
str(l)
## List of 2
##  $ x: num 1
##  $ a: chr "A"
l2 <- list(m = matrix(1:4, nrow = 2), l = l)
str(l2)
## List of 2
##  $ m: int [1:2, 1:2] 1 2 3 4
##  $ l:List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
l3 <- list(l = l, l2 = l2, w = rnorm(10))
str(l3)
## List of 3
##  $ l :List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
##  $ l2:List of 2
##   ..$ m: int [1:2, 1:2] 1 2 3 4
##   ..$ l:List of 2
##   .. ..$ x: num 1
##   .. ..$ a: chr "A"
##  $ w : num [1:10] 0.4469 -2.0953 0.0889 1.4325 -0.6085 ...

From the output, we notice that str displays the names of the list elements, their types and the same basic structure that we saw for vectors. Use the max.level argument to restrict the level of nesting in the output.

str(l3, max.level = 2)
## List of 3
##  $ l :List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
##  $ l2:List of 2
##   ..$ m: int [1:2, 1:2] 1 2 3 4
##   ..$ l:List of 2
##  $ w : num [1:10] 0.4469 -2.0953 0.0889 1.4325 -0.6085 ...

A common use of str is to compactly look at the structure of a dataset.

str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

The above example shows that this dataset is a data.frame object comprising 72 observations and 2 variables. The first variable is count, which is a numeric vector, while the second variable is spray, which is a factor with 6 levels.

From the above examples, it is hopefully clear that if you are unsure of what an R object is, str provides information useful to understand its structure. For datasets, it also shows the number of rows and columns.

%in% and match are most useful for finding the elements of one vector in another vector.

x <- c(6, 37)
y <- sample(1:100, 1000, replace = TRUE)
x %in% y
## [1] TRUE TRUE
match(x, y)
## [1]  90 128
which(y == 6)
## [1]  90 557 613 708 787 908 965 970

Note that the length of the result returned by %in% is the same as the length of the first argument. match only returns the index of the first occurrence in y of each value in x.
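
As a small sketch of how the two relate (the vectors below are made up for illustration), match returns NA for a value that is not found, while %in% returns FALSE. The third line shows that %in% behaves like match with nomatch = 0 followed by a comparison with zero.

```r
x <- c(6, 37, 500)
y <- c(10, 37, 6, 37)
match(x, y)                     # index of the first match, NA when there is none
## [1]  3  2 NA
x %in% y                        # one logical value per element of x
## [1]  TRUE  TRUE FALSE
match(x, y, nomatch = 0) > 0    # gives the same answer as x %in% y
## [1]  TRUE  TRUE FALSE
```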

We won’t spend too much time on =, <- and <<- in this article. However, do remember that these are functions and we can use backticks to call them in the “usual” way for functions. The -> and ->> operators are rarely used.

`<-`(x, 3)
x
## [1] 3
1 -> x
x
## [1] 1
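
The <<- operator is not shown above, so here is a minimal sketch of its typical use. Unlike <-, it does not create a local variable; it assigns to a variable found in an enclosing environment. The function make_counter is a hypothetical helper written only for this illustration.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in the enclosing function, not a local copy
    count
  }
}
counter <- make_counter()
counter()
## [1] 1
counter()
## [1] 2
```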

$, [ and [[ are operators which act on vectors, matrices, arrays or lists to extract or replace parts. They are described in great detail in the Subsetting chapter of Advanced R.
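
A brief sketch of the three operators on a small list and a matrix (both objects are made up for illustration) may still be useful here.

```r
l <- list(x = 1, a = "A")
l$x                  # extract one element by name
## [1] 1
l[["a"]]             # the same, with the name given as a string
## [1] "A"
is.list(l["a"])      # single brackets keep the container: a list of length 1
## [1] TRUE
l$x <- 10            # the same operators also replace parts
l[["x"]]
## [1] 10
m <- matrix(1:4, nrow = 2)
m[2, 1]              # row 2, column 1
## [1] 2
```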

head returns the first parts of a variety of different objects, but is most useful for vectors or data frames. tail works similarly but returns the last parts of the object.

y <- sample(1:100, 1000, replace = TRUE)
head(y)
## [1] 14 93 65 38 26 37
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
head(cars, n = 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
head(ls)
##                                                                              
## 1 function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
## 2     pattern, sorted = TRUE)                                                
## 3 {                                                                          
## 4     if (!missing(name)) {                                                  
## 5         pos <- tryCatch(name, error = function(e) e)                       
## 6         if (inherits(pos, "error")) {
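
Since only head was shown above, the following sketch illustrates tail, along with negative values of n, which drop elements instead of keeping them.

```r
tail(1:10, n = 3)      # last three elements
## [1]  8  9 10
head(1:10, n = -3)     # negative n: everything except the last three
## [1] 1 2 3 4 5 6 7
tail(1:10, n = -3)     # everything except the first three
## [1]  4  5  6  7  8  9 10
nrow(tail(cars))       # the default is n = 6, as for head
## [1] 6
```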

subset is used to return the parts of a vector, matrix or data frame which meet the conditions provided as an argument to the function. It is most useful for data frames.

subset(cars, speed < 10 & dist > 10)
##   speed dist
## 4     7   22
## 5     8   16
subset(cars, speed < 10 & dist > 10, select = speed)
##   speed
## 4     7
## 5     8
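
For comparison, the same rows can be selected with ordinary bracket indexing. This is only a sketch of the equivalence for this dataset; the variable keep is introduced here for illustration. One difference worth knowing is that subset treats a missing value in the condition as FALSE, while bracket indexing does not.

```r
keep <- cars$speed < 10 & cars$dist > 10    # logical vector, one value per row
cars[keep, ]                                # same rows as the first subset call
##   speed dist
## 4     7   22
## 5     8   16
cars[keep, "speed", drop = FALSE]           # same as adding select = speed
##   speed
## 4     7
## 5     8
```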

with is used to evaluate an R expression in an environment constructed from data. For interactive use, it usually saves some typing and is nicer to read.

For example, instead of

plot(cars$speed, cars$dist)

one can use

with(cars, plot(speed, dist))

assign is used to assign a value to a name in an environment.

rm(x)
f <- function() { assign("x", 1, pos = 1) }
f()
x
## [1] 1

In the above example, the pos argument is a positive integer which denotes a position in the search list; position 1 is the global environment. This causes x to have the value 1 in the global environment, even though assign is called inside f.

get is in some sense the opposite of assign: it returns the value of an object, given its name as a character string.

x <- 3
g <- function() { get("x") }
g()
## [1] 3
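
Both functions also accept an envir argument. The sketch below uses a separate environment e (a name chosen for this example) so that nothing is created in the global environment, and builds variable names from strings, which is where assign and get are most often seen.

```r
e <- new.env()
assign("x", 42, envir = e)                  # create x inside e
get("x", envir = e)
## [1] 42
exists("x", envir = e, inherits = FALSE)    # look only in e, not in its parents
## [1] TRUE
for (i in 1:3) {
  assign(paste0("v", i), i^2, envir = e)    # creates v1, v2 and v3
}
get("v3", envir = e)
## [1] 9
```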

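all.equal is used to compare objects and report differences. For numeric values it tests for near equality, within a small tolerance, rather than exact equality.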

x <- c(2, 3)
y <- c(2, 3)
all.equal(x, y)
## [1] TRUE
x <- 1
all.equal(x, y)
## [1] "Numeric: lengths (1, 2) differ"
x <- c(1, 2)
all.equal(x, y)
## [1] "Mean relative difference: 0.6666667"
l1 <- list(x = c(1, 2), y = c("A", "B"))
l2 <- list(x = c(1, 2))
all.equal(l1, l2)
## [1] "Length mismatch: comparison on first 1 components"
l2 <- list(x = c(1, 2), y = c("C", "D"))
all.equal(l1, l2)
## [1] "Component \"y\": 2 string mismatches"
l2 <- list(x = c(1, 2), y = c("A", "B"))
all.equal(l1, l2)
## [1] TRUE
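
A common use of all.equal is comparing floating point numbers, where == can fail because of rounding error. Since all.equal returns a character string instead of FALSE when the objects differ, the sketch below wraps it in isTRUE to get a single logical value.

```r
0.1 + 0.2 == 0.3             # exact comparison fails due to rounding error
## [1] FALSE
all.equal(0.1 + 0.2, 0.3)    # equal within the default tolerance
## [1] TRUE
isTRUE(all.equal(1, 1.1))    # a single TRUE or FALSE, safe to use in if
## [1] FALSE
```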

identical is used to safely and reliably test whether two objects are exactly equal. It always returns a single TRUE or FALSE, which makes it suitable for if or while statements and for logical expressions which use && or ||.

2 == c(1, 2)
## [1] FALSE  TRUE
identical(2, c(1, 2))
## [1] FALSE
1 == NULL
## logical(0)
identical(1, NULL)
## [1] FALSE
identical(1, 1.0)
## [1] TRUE
identical(1, 1L)
## [1] FALSE

We will not look at the relational operators !=, ==, >, >=, < and <= in detail here. However, it is worth remembering that these operators are vectorized and follow the recycling rule (if one of the vectors is shorter than the other, then the elements of the shorter vector are recycled).

x <- c(1, 2)
y <- c(1, 2)
x < y
## [1] FALSE FALSE
x <- 1
x < y
## [1] FALSE  TRUE

is.na should be used to test whether elements are missing. Note that one should not use the == relational operator, since a comparison with NA returns NA rather than TRUE or FALSE.

x <- NA
is.na(x)
## [1] TRUE
x == NA
## [1] NA
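
In practice is.na is mostly applied to whole vectors. The following sketch, on a made-up vector, counts the missing values, removes them, and shows the na.rm argument which many summary functions provide.

```r
x <- c(1, NA, 3, NA)
is.na(x)
## [1] FALSE  TRUE FALSE  TRUE
sum(is.na(x))            # number of missing values
## [1] 2
x[!is.na(x)]             # keep only the observed values
## [1] 1 3
mean(x)                  # NA propagates through most computations
## [1] NA
mean(x, na.rm = TRUE)    # drop the missing values first
## [1] 2
```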

Also, note that there are separate constants for missing values of the atomic vector types.

x <- c(NA, NA)
class(x)
## [1] "logical"
x <- c(NA, 1.0)
class(x)
## [1] "numeric"
x <- c(NA_character_, NA_character_)
class(x)
## [1] "character"

complete.cases is used to check which cases have no missing values and is most useful with data frames. For data frames, it returns a logical vector specifying which rows have no missing values in any column.

d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)
complete.cases(d)
## [1]  TRUE FALSE FALSE
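
The logical vector is typically used as a row index. As a short sketch with the same data frame:

```r
d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)
d[complete.cases(d), ]     # keep only the rows without any NA
##   x y
## 1 1 A
sum(!complete.cases(d))    # number of incomplete rows
## [1] 2
```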

is.finite returns a logical vector specifying which elements are finite. Note that is.finite returns FALSE not only for Inf and -Inf, but also for NaN (“not a number”) and NA.

x <- c(1, 3.0, Inf, NaN, 7)
is.finite(x)
## [1]  TRUE  TRUE FALSE FALSE  TRUE
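
The related functions is.infinite, is.nan and is.na distinguish between the different kinds of non-finite values. A short sketch on a made-up vector:

```r
x <- c(1, Inf, -Inf, NaN, NA)
is.finite(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE  TRUE FALSE
is.na(x)       # TRUE for both NA and NaN
## [1] FALSE FALSE FALSE  TRUE  TRUE
```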

The basic math functions are explained via the examples below. While the examples use “scalar” values in most cases, all the operations are vectorized. Examples using the trigonometric functions are not provided.

5 * 3
## [1] 15
`*`(5, 3)
## [1] 15
5.1 * 2L
## [1] 10.2
5 * (2 + 3i)
## [1] 10+15i
(2 + 3i) + 7
## [1] 9+3i
(2 + 3i) - 7
## [1] -5+3i
3 / 5
## [1] 0.6
3L / 5L
## [1] 0.6
(3 + 7i) / 6
## [1] 0.5+1.166667i
2 ^ 3
## [1] 8
2.2 ^ 7.5
## [1] 369.9731
(2 + 3i) ^ 3
## [1] -46+9i
(2 + 3i) ^ (3 + 4i)
## [1] -0.2045529+0.8966233i
7 %% 5 # remainder
## [1] 2
7 %/% 5 # integer division
## [1] 1
abs(5)
## [1] 5
abs(5 + 3i)
## [1] 5.830952
abs(-5)
## [1] 5
sign(2)
## [1] 1
sign(-2)
## [1] -1
sign(0)
## [1] 0
sign(2 + 3i)
## Error in sign(2 + (0+3i)): unimplemented complex function
ceiling(c(3.2, 3.8))
## [1] 4 4
floor(c(3.2, 3.8))
## [1] 3 3
trunc(c(3.2, 3.8))
## [1] 3 3
round(c(3.2, 3.8))
## [1] 3 4
round(c(3.275, 3.811), digits = 2)
## [1] 3.27 3.81
signif(c(3.2, 3.8))
## [1] 3.2 3.8
signif(c(3.275, 3.811), digits = 2)
## [1] 3.3 3.8
round(-2.3)
## [1] -2
round(33, digits = -1) # nearest 10
## [1] 30
round(75, digits = -2) # nearest 100
## [1] 100
exp(1)
## [1] 2.718282
exp(-1)
## [1] 0.3678794
log(3)
## [1] 1.098612
log(-2)
## Warning in log(-2): NaNs produced
## [1] NaN
log(exp(3))
## [1] 3
exp(log(3))
## [1] 3
log10(100)
## [1] 2
log2(1024)
## [1] 10
sqrt(25)
## [1] 5
sqrt(-25)
## Warning in sqrt(-25): NaNs produced
## [1] NaN
sqrt(3 + 4i)
## [1] 2+1i
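
To illustrate the point about vectorization, here is a short sketch applying a few of the same operations to whole vectors. It also shows two details which are easy to miss: the behaviour of %% and %/% with negative numbers, and the rounding of halves to the even digit. Two simple trigonometric calls are included for completeness.

```r
x <- c(1, 4, 9)
x * 2
## [1]  2  8 18
sqrt(x)
## [1] 1 2 3
x + c(10, 20, 30)          # element by element
## [1] 11 24 39
c(7, -7) %% 5              # the remainder takes the sign of the divisor
## [1] 2 3
c(7, -7) %/% 5             # integer division rounds towards minus infinity
## [1]  1 -2
round(c(0.5, 1.5, 2.5))    # halves are rounded to the even digit
## [1] 0 2 2
sin(pi / 2)
## [1] 1
cos(pi)
## [1] -1
```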

max will find the maximum element from numeric or character vectors. pmax will do an element by element comparison, and return the largest among the first elements, the largest among the second elements and so on. If the vectors are not of equal length, then the elements of the shorter vectors are recycled.

max(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))
## [1] 4
pmax(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))
## [1] 4.0 2.3
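
Recycling makes pmax and pmin convenient for bounding every element of a vector. The sketch below, on a made-up vector, also shows how missing values are handled.

```r
x <- c(-3, 0.5, 2, 7)
pmax(x, 0)                 # 0 is recycled: negative values are raised to 0
## [1] 0.0 0.5 2.0 7.0
pmin(pmax(x, 0), 5)        # clamp every element to the range [0, 5]
## [1] 0.0 0.5 2.0 5.0
max(x, 0)                  # max collapses everything to a single value
## [1] 7
pmax(c(1, NA), c(2, 3))    # NA propagates unless na.rm = TRUE
## [1]  2 NA
pmax(c(1, NA), c(2, 3), na.rm = TRUE)
## [1] 2 3
```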

min and pmin work in the same way as above. prod and sum calculate the product and sum of the values present in their arguments. diff is used to calculate lagged differences between successive values (the default lag is 1).

prod(rnorm(10) + 1)
## [1] -0.9327871
sum(rnorm(10) + 1)
## [1] 6.650378
prod(c(1i, 1 + 2i))
## [1] -2+1i
diff(1:10)
## [1] 1 1 1 1 1 1 1 1 1
diff(1:10, lag = 3)
## [1] 3 3 3 3 3 3 3
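
Because the differences of 1:10 are all equal, the effect of the arguments is easier to see on a vector of squares. This sketch also shows the differences argument, which applies diff repeatedly, and what diff computes when written out by hand.

```r
x <- c(1, 4, 9, 16, 25)
diff(x)                     # x[i + 1] - x[i]
## [1] 3 5 7 9
diff(x, lag = 2)            # x[i + 2] - x[i]
## [1]  8 12 16
diff(x, differences = 2)    # diff applied twice
## [1] 2 2 2
x[-1] - x[-length(x)]       # the same as diff(x)
## [1] 3 5 7 9
```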

The cumulative versions of max, min, prod and sum (cummax, cummin, cumprod and cumsum) return the cumulative results as a vector. The nth element of the result is obtained by applying the function to the nth element of the input and the cumulative result up to the (n-1)th element.

x <- 1:10
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cumprod(x)
##  [1]       1       2       6      24     120     720    5040   40320
##  [9]  362880 3628800
cummax(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
cummin(x)
##  [1] 1 1 1 1 1 1 1 1 1 1
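
With an input that is not sorted, the running nature of these functions is easier to see. The last line is a sketch of the same idea using Reduce with accumulate = TRUE, which reproduces cumsum here.

```r
x <- c(3, 1, 4, 1, 5)
cumsum(x)
## [1]  3  4  8  9 14
cummax(x)                            # largest value seen so far
## [1] 3 3 4 4 5
cummin(x)                            # smallest value seen so far
## [1] 3 1 1 1 1
Reduce(`+`, x, accumulate = TRUE)    # the same running sum, built step by step
## [1]  3  4  8  9 14
```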

Next we will look at some of the basic descriptive statistical functions. The mean, median, standard deviation and variance of a variable are calculated as follows.

x <- rnorm(10)
mean(x)
## [1] -0.07098529
median(x)
## [1] -0.09831628
sd(x)
## [1] 0.7663107
var(x)
## [1] 0.587232
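
Since the output above depends on random numbers, here is a sketch with a small fixed vector which can be checked by hand. It shows that var uses the divisor n - 1 and that sd is the square root of var.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
mean(x)
## [1] 5
median(x)                          # average of the two middle values, 4 and 5
## [1] 4.5
sum((x - mean(x))^2) / (n - 1)     # sample variance computed by hand
## [1] 4.571429
var(x)
## [1] 4.571429
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE
```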

cor is used to calculate the correlation between a pair of variables. The method argument is used to specify which method to use - the Pearson correlation coefficient, Kendall’s rank correlation or Spearman’s rank correlation. The default method is the Pearson coefficient.

x <- rnorm(100)
y <- rnorm(100)
cor(x, y)
## [1] 0.0936907
cor(x, y, method = "kendall")
## [1] 0.07353535
cor(x, y, method = "spearman")
## [1] 0.1112991
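
The difference between the methods is easier to see on data with a known relationship. In the sketch below (made-up data), y increases with x but not linearly, so the two rank-based coefficients are 1 while the Pearson coefficient is slightly lower.

```r
x <- 1:10
y <- x^2                          # monotone in x, but not linear
cor(x, y) < 1                     # Pearson measures linear association
## [1] TRUE
cor(x, y, method = "spearman")    # computed on ranks, so a monotone relation gives 1
## [1] 1
cor(x, y, method = "kendall")
## [1] 1
```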