---
title: "Data frames, matrices, & subsetting"
subtitle: "Lecture 05"
author: "Dr. Colin Rundel"
footer: "Sta 523 - Fall 2022"
format:
  revealjs:
    theme: slides.scss
    transition: fade
    slide-number: true
    self-contained: true
execute: 
  echo: true
---

```{r setup, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width=80
)
```

# Matrices and Arrays


## Matrices

R supports the creation of 2d data structures (rows and columns) of atomic vector types. 

Generally these are formed via a call to `matrix()`.

:::: {.columns}
::: {.column width='50%'}
```{r}
matrix(1:4, nrow=2, ncol=2)
matrix(c(TRUE, FALSE), 2, 2)
```
:::
::: {.column width='50%'}
```{r}
matrix(LETTERS[1:6], 2)
matrix(6:1 / 2, ncol = 2)
```
:::
::::


## Data ordering

Matrices in R use column major ordering (data is sorted in column order not row order).

:::: {.columns .small}
::: {.column width='50%'}
```{r}
(m = matrix(1:6, nrow=2, ncol=3))
c(m)
```
:::
::: {.column width='50%'}
```{r}
(n = matrix(1:6, nrow=2, ncol=3))
c(n)
```
:::
::::

. . .

When creating the matrix we can populate the matrix via row, but that data will still be stored in column major order.

:::: {.columns .small}
::: {.column width='50%'}
```{r}
(x = matrix(1:6, nrow=2, ncol=3, byrow = TRUE))
c(x)
```
:::
::: {.column width='50%'}
```{r}
(y = matrix(1:6, nrow=3, ncol=2, byrow=TRUE))
c(y)
```
:::
::::


## Matrix structure

::: {.small}
```{r}
m = matrix(1:4, ncol=2, nrow=2)
```
:::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
typeof(m)
mode(m)
```
:::
::: {.column width='50%'}
```{r}
class(m)
attributes(m)
```
:::
::::

. . .

Matrices (and arrays) are just atomic vectors with a `dim` attribute attached (they do not have a class attribute, but they do have a class).

:::: {.columns .small}
::: {.column width='50%'}
```{r}
n = letters[1:6]
dim(n) = c(2L, 3L)

n
class(n)
``` 
:::
::: {.column width='50%'}
```{r}
o = letters[1:6]
attr(o,"dim") = c(2L, 3L)

o
class(o)
``` 
:::
::::


## Arrays

Arrays are just an $n$-dimensional extension of matrices and are defined by adding the appropriate dimension sizes.

:::: {.columns .small}
::: {.column width='50%'}
```{r}
array(1:8, dim = c(2,2,2))
```
:::
::: {.column width='50%'}
```{r}
array(letters[1:6], dim = c(2,1,3))
```
:::
::::


# Data Frames

## Data Frames

A data frame is how R handles heterogeneous tabular data (i.e. a table of rows and columns) and is one of the most commonly used data structure in R.


::: {.medium}
```{r}
(df = data.frame(
  x = 1:3, 
  y = c("a", "b", "c"),
  z = c(TRUE)
))
```
:::

. . .

R represents data frames using a *list* of equal length *vectors*.

::: {.medium}
```{r}
str(df)
```
:::

## Data Frame Structure

:::: {.columns}
::: {.column width='50%'}
```{r}
typeof(df)
class(df)
attributes(df)
```
:::
::: {.column width='50%' .fragment}
```{r}
str(unclass(df))
```
:::
::::


## Build your own data.frame

```{r}
df = list(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, TRUE, TRUE))
```

. . .

:::: {.columns}
::: {.column width='50%'}
```{r}
attr(df,"class") = "data.frame"
df
```
:::
::: {.column width='50%'}
```{r}
attr(df,"row.names") = 1:3
df
```
:::
::::

. . .

```{r}
str(df)
is.data.frame(df)
```


## Strings (Characters) vs Factors

Previous to R v4.0, the default behavior of data frames was to convert character data into factors. Sometimes this was useful, but mostly it wasn't.

It is important to know what type/class you are working with. This behavior can be changed using the `stringsAsFactors` argument to `data.frame` and related functions (e.g. `read.csv`, `read.table`, etc.).

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"), 
                stringsAsFactors = TRUE)
df
str(df)
```
:::

::: {.column width='50%'}
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"), 
                stringsAsFactors = FALSE)
df
str(df)
```
:::
::::


## Length Coercion

For data frames on creation the lengths of the component vectors will be coerced to match, however if they not multiples then there will be an error (previously this produced a warning).

```{r error=TRUE}
data.frame(x = 1:3, y = c("a"))
```

. . .

```{r error=TRUE}
data.frame(x = 1:3, y = c("a","b"))
```

. . .

```{r error=TRUE}
data.frame(x = 1:3, y = character())
```


# Subsetting

## Subsetting in General

R has three subsetting operators (`[`, `[[`, and `$`). The behavior of these operators will depend on the object (class) they are being used with.

In general there are 6 different types of subsetting that can be performed:

:::: {.columns}
::: {.column width='50%'}
* Positive integer

* Negative integer

* Logical value
:::
::: {.column width='50%'}
* Empty / NULL

* Zero valued

* Character value (names)
:::
::::

## Positive Integer subsetting

Returns elements at the given location(s)

:::: {.medium}
```{r}
x = c(1,4,7)
y = list(1,4,7)
```

:::: {.columns}
::: {.column width='50%'}
```{r}
x[1]
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
:::
::: {.column width='50%' .small}
```{r}
str( y[1] )
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
:::
::::
::::

::: {.aside}
Note - R uses a 1-based indexing scheme
:::

## Negative Integer subsetting

Excludes elements at the given location(s)

:::: {.columns}
::: {.column width='50%'}
```{r, error=TRUE}
x = c(1,4,7)
```

```{r, error=TRUE}
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
:::
::: {.column width='50%'}
```{r, error=TRUE}
y = list(1,4,7)
```

```{r, error=TRUE}
str( y[-1] )
str( y[-c(1,3)] )
```
:::
::::

```{r error=TRUE}
x[c(-1,2)]
y[c(-1,2)]
```

## Logical Value Subsetting

Returns elements that correspond to `TRUE` in the logical vector. Length of the logical vector is coerced to be the same as the vector being subsetted.


:::: {.columns .medium}
::: {.column width='50%'}
```{r}
x = c(1,4,7,12)
```
```{r, error=TRUE}
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
```
:::
::: {.column width='50%'}
```{r, error=TRUE}
y = list(1,4,7,12)
```
```{r, error=TRUE}
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
:::
::::

. . .

:::: {.columns .medium}
::: {.column width='50%'}
```{r error=TRUE}
x[x %% 2 == 0]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str( y[y %% 2 == 0] )
```
:::
::::


## Empty Subsetting

Returns the original vector, this is not the same as subsetting with `NULL`

:::: {.columns}
::: {.column width='50%'}
```{r}
x = c(1,4,7)
```
```{r}
x[]
x[NULL]
```
:::
::: {.column width='50%'}
```{r}
y = list(1,4,7)
```
```{r}
str(y[])
str(y[NULL])
```
:::
::::


## Zero subsetting

Returns an empty vector (of the same type), this is the same as subsetting with `NULL`

:::: {.columns .medium}
::: {.column width='50%'}
```{r}
x = c(1,4,7)
```

```{r}
x[0]
```
:::
::: {.column width='50%'}
```{r}
y = list(1,4,7)
str(y[0])
```
:::
::::

. . .

0s can be mixed with either positive or negative integers for subsetting (they are ignored)


:::: {.columns .medium}
::: {.column width='50%'}
```{r}
x[c(0,1)]
y[c(0,1)]
```
:::
::: {.column width='50%'}
```{r}
x[c(0,-1)]
y[c(0,-1)]
```
:::
::::


## Character subsetting

If the vector has names, selects elements whose names correspond to the values in the name vector.

:::: {.columns}
::: {.column width='50%'}
```{r}
x = c(a=1, b=4, c=7)
```

```{r}
x["a"]
x[c("a","a")]
x[c("b","c")]
```
:::
::: {.column width='50%'}
```{r}
y = list(a=1,b=4,c=7)
```

```{r}
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
:::
::::

## Out of bounds

:::: {.columns .medium}
::: {.column width='50%'}
```{r}
x = c(1,4,7)
```

```{r}
x[4]
x[-4]
x["a"]
x[c(1,4)]
```
:::
::: {.column width='50%'}
```{r}
y = list(1,4,7)
```
```{r}
str(y[4])
str(y[-4])
str(y["a"])
str(y[c(1,4)])
```
:::
::::

## Missing values

:::: {.columns}
::: {.column width='50%'}
```{r}
x = c(1,4,7)
```
```{r}
x[NA]
x[c(1,NA)]
```
:::
::: {.column width='50%'}
```{r}
y = list(1,4,7)
```
```{r}
str(y[NA])
str(y[c(1,NA)])
```
:::
::::

## NULL and empty vectors (length 0)

This final type of subsetting follows the rules for length coercion with a 0-length vector (i.e. the vector being subset gets coerced to having length 0 if the subsetting vector has length 0)

:::: {.columns}
::: {.column width='50%'}
```{r}
x = c(1,4,7)
```
```{r}
x[NULL]
x[integer()]
x[character()]
```
:::
::: {.column width='50%'}
```{r}
y = list(1,4,7)
```
```{r}
y[NULL]
y[integer()]
y[character()]
```
:::
::::


# The other subset operators <br/> (`[[` and `$`)


## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset for a *single* value

```{r, error=TRUE}
x = c(a=1,b=4,c=7)
```

```{r}
x[1]
```

. . .

```{r, error=TRUE}
x[[1]]
x[["a"]]
x[[1:2]]
x[[TRUE]]
```

## Generic Vectors (lists) - [ vs. [[

Subsets a single value, but returns the value - not a list containing that value. Vectors are interpreted as nested subsetting.

```{r, error=TRUE}
y = list(a=1, b=4, c=7:9)
```

:::: {.columns}
::: {.column width='50%'}
```{r, error=TRUE}
y[2]
```
:::
::: {.column width='50%'}
```{r, error=TRUE}
str( y[2] )
```
:::
::::

. . .


```{r, error=TRUE}
y[[2]]
y[["b"]]
y[[1:2]]
y[[2:1]]
```

## Hadley's Analogy (1)

```{r echo=FALSE, fig.align="center", out.width="37%"}
knitr::include_graphics("imgs/list_train1.png")
```
. . .
```{r echo=FALSE, fig.align="center", out.width="37%"}
knitr::include_graphics("imgs/list_train2.png")
```
. . .
```{r echo=FALSE, fig.align="center", out.width="37%"}
knitr::include_graphics("imgs/list_train3.png")
```


::: {.aside}
From Advanced R - [Chapter 4.3](https://adv-r.hadley.nz/subsetting.html#subset-single)
:::

## Hadley's Analogy (2)

```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```

## [[ vs. $

`$` is equivalent to `[[` but it only works for *named lists* and it uses partial matching for names

```{r, error=TRUE}
x = c("abc"=1, "def"=5)
```

```{r, error=TRUE}
x$abc
```

. . .

```{r}
y = list("abc"=1, "def"=5)
```

```{r}
y[["abc"]]
y$abc
y$d
```

## A common error

Why does the following code not work?

```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"
```

:::: {.columns}
::: {.column width='50%'}
```{r}
x[[y]]
```
:::
::: {.column width='50%'}
```{r}
x$y
```
:::
::::

. . .

The expression `x$y` gets interpreted as `x[["y"]]` by R, note the inclusion of the `"`s, this is not the same as the expression `x[[y]]`.


## Exercise 1

Below are 100 values,

```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.

* Remove all values with an odd index (e.g. 1, 3, etc.)

* Remove every 4th value, but only if it is odd.


# Subsetting Data Frames

## Subsetting rows

As data frames have 2 dimensions, we can subset on either the rows or the columns - the subsetting values are separated by a comma.

::: {.small}
```{r}
(df = data.frame(x = 1:3, y = c("A","B","C"), z = TRUE))
```
:::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[1, ]
```
:::
::: {.column width='50%'}
```{r}
str( df[1, ] )
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[c(1,3), ]
```
:::
::: {.column width='50%'}
```{r}
str( df[c(1,3), ] )
```
:::
::::


## Subsetting Columns

::: {.small}
```{r}
df
```
:::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[, 1]
```
:::
::: {.column width='50%'}
```{r}
str( df[, 1] )
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[, 1:2]
```
:::
::: {.column width='50%'}
```{r}
str( df[, 1:2] )
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[, -3]
```
:::
::: {.column width='50%'}
```{r}
str( df[, -3] )
```
:::
::::


## Subsetting both

::: {.small}
```{r}
df
```
:::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[1, 1]
```
:::
::: {.column width='50%'}
```{r}
str( df[1, 1] )
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
df[1:2, 1:2]
```
:::
::: {.column width='50%'}
```{r}
str( df[1:2, 1:2] )
```
:::
::::


. . .

:::: {.columns}
::: {.column width='50%'}
```{r}
df[-1, 2:3]
```
:::
::: {.column width='50%'}
```{r}
str( df[-1, 2:3] )
```
:::
::::


## Preserving vs Simplifying

Most of the time, R's `[` subset operator is a *preserving* operator, in that the returned object will always have the same type/class as the object being subset. 

Confusingly, when used with some classes (e.g. data frame, matrix or array) `[` becomes a *simplifying* operator (does not preserve type) - this behavior is instead controlled by the `drop` argument.


## Drop

:::: {.columns .small}
::: {.column width='50%'}
```{r error=TRUE}
df[1, ]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[1, ])
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r error=TRUE}
df[1, , drop=TRUE]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[1, , drop=TRUE])
```
:::
::::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r error=TRUE}
df[, 1]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[, 1])
```
:::
::::


. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r error=TRUE}
df[, 1, drop=FALSE]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[, 1, drop=FALSE])
```
:::
::::


## Exceptions

`drop` only works when the resulting value can be represented as a 1d vector (either a list or atomic).

:::: {.columns}
::: {.column width='50%'}
```{r error=TRUE}
df[1:2, 1:2]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[1:2, 1:2])
```
:::
::::


. . .

:::: {.columns}
::: {.column width='50%'}
```{r error=TRUE}
df[1:2, 1:2, drop=TRUE]
```
:::
::: {.column width='50%'}
```{r error=TRUE}
str(df[1:2, 1:2, drop=TRUE])
```
:::
::::


## Preserving vs Simplifying Subsets

<br/>

Type             |  Simplifying             |  Preserving
:----------------|:-------------------------|:--------------------------------------------------------
Atomic Vector    |  `x[[1]]`                |  `x[1]`
List             |  `x[[1]]`                |  `x[1]`
Matrix / Array   |  `x[[1]]` <br/> `x[1, ]` <br/> `x[, 1]` |  `x[1, , drop=FALSE]` <br/> `x[, 1, drop=FALSE]`
Factor           |  `x[1:4, drop=TRUE]`     |  `x[1:4]` <br/> `x[[1]]`
Data frame       |  `x[, 1]` <br/> `x[[1]]` |  `x[, 1, drop=FALSE]` <br/> `x[1]`


```{r echo=FALSE}
knitr::knit_exit()
```


# Subsetting and assignment

## Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object (in-place).

```{r}
x = c(1, 4, 7, 9, 10, 15)
```

. . .

```{r}
x[2] = 2
x
```
. . .
```{r}
x %% 2 != 0
```
. . .
```{r}
x[x %% 2 != 0] = (x[x %% 2 != 0] + 1) / 2
x
```
. . .
```{r}
x[c(1,1)] = c(2,3)
x
```

```{r}
x = 1:6
```

```{r}
x[c(2,NA)] = 1
x
```

. . .

```{r}
x = 1:6
```
```{r}
x[c(-1,-2)] = 3
x
```

. . .

```{r}
x = 1:6
```
```{r}
x[c(TRUE,NA)] = 1
x
```

. . .

```{r}
x = 1:6
```
```{r}
x[] = 1:3
x
```


## Subsets of Subsets

```{r}
( df = data.frame(a = c(5,1,NA,3)) )
```

```{r}
df$a[df$a == 5] = 0
df
```

. . .

```{r}
df[1][df[1] != 3] = -1
df
```

## Exercise 2

Some data providers choose to encode missing values using values like `-999`. Below is a sample data frame with missing values encoded in this way. 

```{r}
d = data.frame(
  patient_id = c(1, 2, 3, 4, 5),
  age = c(32, 27, 56, 19, 65),
  bp = c(110, 100, 125, -999, -999),
  o2 = c(97, 95, -999, -999, 99)
)
```

* *Task 1* - using the subsetting tools we've discussed come up with code that will replace the `-999` values in the `bp` and `o2` column with actual `NA` values. Save this as `d_na`.

* *Task 2* - Once you have created `d_na` come up with code that translate it back into the original data frame `d`, i.e. replace the `NA`s with `-999`.