---
title: "Tidy data & dplyr"
subtitle: "Lecture 06"
author: "Dr. Colin Rundel"
footer: "Sta 523 - Fall 2022"
format:
  revealjs:
    theme: slides.scss
    transition: fade
    slide-number: true
    self-contained: true
execute: 
  echo: true
---


```{r setup, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width=75
)

library(magrittr)
```

#

![](imgs/hex-tidyverse.png){fig-align="center" width="66%"}

## Tidy data

:::: {.columns}
::: {.column width='50%'}
```{r echo=FALSE, out.width="70%", fig.align="center"}
knitr::include_graphics('imgs/tidy1.png')
```
:::
::: {.column width='50%'}
```{r echo=FALSE, out.width="70%", fig.align="center"}
knitr::include_graphics('imgs/tidy2.png')
```
:::
::::

. . .

```{r echo=FALSE, out.width="30%", fig.align="center"}
knitr::include_graphics('imgs/tidy3.png')
```

::: {.aside}
From R4DS - [tidy data](http://r4ds.had.co.nz/tidy-data.html)
:::

## Tidy vs Untidy

> Happy families are all alike; every unhappy family is unhappy in its own way 
>
> — Leo Tolstoy, Anna Karenina

. . .

<br/>

```{r echo=FALSE}
tidyr::billboard[,1:7]
```

<br/><br/>

::: {.center}
Is the above data set tidy?
:::

## More tidy vs untidy

Is the following data tidy?

```{r include=FALSE}
sw_people = purrr::map(
  repurrrsive::sw_people,
  ~ .[1:8]
)
```

:::: {.columns}
::: {.column width='50%'}
```{r, echo=FALSE}
str(sw_people[1:3])
```
:::
::: {.column width='50%'}
```{r, echo=FALSE}
str(sw_people[4:6])
```
:::
::::


#

![](imgs/hex-tibble.png){fig-align="center" width="40%"}


## Modern data frames

The tidyverse includes the tibble package that extends data frames to be a bit more modern. The core features of tibbles is to have a nicer printing method as well as being "surly" and "lazy".

:::: {.small}
```{r}
library(tibble)
```

```{r include = FALSE}
options(width = 50)
```

:::: {.columns}
::: {.column width='50%'}
```{r}
iris
```
:::
::: {.column width='50%'}
```{r}
(tbl_iris = as_tibble(iris))
```
:::
::::
::::

```{r include = FALSE}
options(width = 75)
```

## Tibbles are lazy

By default, subsetting tibbles always results in another tibble (`$` or `[[` can still be used to subset for a specific column). I.e. tibble subsets are always preserving and therefore type consistent.

::: {.small}
```{r}
tbl_iris[1,]
```
:::

. . .

:::: {.columns .small}
::: {.column width='50%'}
```{r}
tbl_iris[,1]
```
:::
::: {.column width='50%'}
```{r}
head(tbl_iris[[1]])
head(tbl_iris$Species)
```
:::
::::


## More laziness - partial matching

Tibbles do not use partial matching when the `$` operator is used.

:::: {.columns}
::: {.column width='50%'}
```{r}
head( iris$Species )
```
:::
::: {.column width='50%'}
```{r}
head( tbl_iris$Species )
```
:::
::::

. . .

:::: {.columns}
::: {.column width='50%'}
```{r}
head( iris$Sp )
```
:::
::: {.column width='50%'}
```{r}
head( tbl_iris$Sp )
```
:::
::::

## More laziness - stringsAsFactors

Tibbles also have always had `stringsAsFactors = FALSE` as default behavior.

```{r}
(t = tibble(
  x = 1:3, 
  y = c("A","B","C"),
  z = factor(c("X","Y","Z"))
))
```


## Tibbles and length coercion

Only vectors with length 1 will undergo length coercion - everything else will throw an error.

:::: {.columns .medium}
::: {.column width='50%'}
```{r error=TRUE}
data.frame(x = 1:4, y = 1)
```
:::
::: {.column width='50%'}
```{r error=TRUE}
tibble(x = 1:4, y = 1)
```
:::
::::

. . .

:::: {.columns .medium}
::: {.column width='50%'}
```{r error=TRUE}
data.frame(x = 1:4, y = 1:2)
```
:::
::: {.column width='50%'}
```{r error=TRUE}
tibble(x = 1:4, y = 1:2)
```
:::
::::


## Tibbles and S3

:::: {.columns .medium}
::: {.column width='50%'}
```{r}
t = tibble(
  x = 1:3, 
  y = c("A","B","C")
)

class(t)
```
:::
::: {.column width='50%'}
```{r}
d = data.frame(
  x = 1:3, 
  y = c("A","B","C")
)

class(d)
```
:::
::::

. . .


::: {.medium}
```{r}
methods(class="tbl_df")
methods(class="tbl")
```
:::

## Supporting tibbles?

```{r}
d = tibble(
  x = rnorm(100),
  y = 3 + x + rnorm(100, sd = 0.1) 
)
```

```{r}
lm(y~x, data = d)
```


<br/> <br/>

::: {.center .large}
Why did this work?
:::


#

::: {.center .xlarge}
magrittr
:::

![](imgs/hex-magrittr.png){fig-align="center" width="35%"}


## What is a pipe

> In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each 
element is the input of the next;
> - [Wikipedia - Pipeline (software)](https://en.wikipedia.org/wiki/Pipeline_%28software%29)

. . .

Magrittr's pipe is a new infix operator that allows us to link two functions together in a way that is readable from left to right.


The two code examples below are equivalent, 

:::: {.columns}
::: {.column width='50%'}
```r
f(g(x=1, y=2), n=2)
```
:::
::: {.column width='50%'}
```r
g(x=1, y=2) %>% f(n=2)
```
:::
::::


## Readability

Consider the following sequence of actions that describe the process of getting to campus in the morning:

I need to find my key, then unlock my car, then start my car, then drive to school, then park.

. . .

<br/>

Expressed as a set of nested functions in R pseudocode this would look like:

```{r eval=FALSE}
park(drive(start_car(find("keys")), to="campus"))
```

. . .

Writing it out using pipes give it a more natural (and easier to read) structure:

```{r eval=FALSE}
find("keys") %>%
    start_car() %>%
    drive(to="campus") %>%
    park()
```

## Approaches

All of the following are fine, it comes down to personal preference:

Nested:
```{r, eval=FALSE}
h( g( f(x), y=1), z=1 )
```

Piped:
```{r, eval=FALSE}
f(x) %>% 
  g(y=1) %>% 
  h(z=1)
```

Intermediate:
```{r, eval=FALSE}
res = f(x)
res = g(res, y=1)
res = h(res, z=1)
```

## What about other arguments?

Sometimes we want to send our results to an function argument other than first one or we want to use the previous result for multiple arguments. In these cases we can refer to the previous result using `.`.

. . .

```{r}
data.frame(a = 1:3, b = 3:1) %>% lm(a~b, data=.)
```

. . .

```{r}
data.frame(a = 1:3, b = 3:1) %>% .[[1]]
```

. . .

```{r}
data.frame(a = 1:3, b = 3:1) %>% .[[length(.)]]
```

## The base R pipe

As of R v4.1.0 a pipe operator has been added to the base language in R, it is implemented as `|>`.

```{r}
1:10 |> cumsum()
1:10 |> cumsum() |> mean()
```

. . .

The current version of RStudio on the departmental servers is v4.1 so you are welcome to try it out.

## Base R pipe considerations:

* Depending an R version >= 4.1 is a harder dependency than depending on the magrittr package

* `|>` does not support using `.` to pass returned values to other argument positions

* `|>` will likely have less overhead than `%>%` but the difference is unlikely to matter in practice

* `|>` supports an equivalent to `.` using `_` as of R v4.2


#

![](imgs/hex-dplyr.png){fig-align="center" width="40%"}


## A Grammar of Data Manipulation

::: {.medium}
dplyr is based on the concepts of functions as verbs that manipulate data frames.

Core single data frame functions / verbs:

* `filter()` / `slice()`: pick rows based on criteria
* `select()` / `rename()`: select columns by name
* `pull()`: grab a column as a vector
* `arrange()`: reorder rows
* `mutate()` / `transmute()`: create or modify columns
* `distinct()`: filter for unique rows
* `summarise()` / `count()`: reduce variables to values
* `group_by()` / `ungroup()`: modify other verbs to act on subsets
* `relocate()`: change column order
* ... (many more)
:::

## dplyr heuristics

1. First argument is *always* a data frame

2. Subsequent arguments say what to do with that data frame

3. *Always* return a data frame

4. Don't modify in place

5. Magic via lazy evaluation


## Example Data

We will demonstrate dplyr's functionality using the nycflights13 data.

::: {.medium}
```{r message=FALSE}
library(dplyr)
library(nycflights13)
```

```{r}
flights
```
:::

## filter() - March flights

```{r}
flights %>% filter(month == 3)
```

## filter() - Flights in the first 7 days of March

```{r}
flights %>% filter(month == 3, day <= 7)
```

## filter() - Flights to LAX *or* JFK in March

```{r}
flights %>% filter(dest == "LAX" | dest == "JFK", month==3)
```

## slice() - First 10 flights

```{r}
flights %>% slice(1:10)
```

## slice() - Last 5 flights

```{r}
flights %>% slice((n()-4):n())
```

## slice_tail() - Last 5 flights

```{r}
flights %>% slice_tail(n = 5)
```

## select() - Individual Columns

```{r}
flights %>% select(year, month, day)
```

## select() - Exclude Columns

```{r}
flights %>% select(-year, -month, -day)
```

## select() - Ranges

```{r}
flights %>% select(year:day)
```

## select() - Exclusion Ranges

```{r}
flights %>% select(-(year:day))
```

## select() - Matching contains()

```{r}
flights %>% select(contains("dep"), 
                   contains("arr"))
```

## select() - Matching starts_with()

```{r}
flights %>% select(starts_with("dep"), 
                   starts_with("arr"))
```

::: {.aside}
Other helpers provide by [tidyselect](https://tidyselect.r-lib.org/): 

`starts_with`, `ends_with`, `everything`, `matches`, `num_range`, `one_of`, `everything`, `last_col`.
:::

## select() + where() - Get numeric columns

::: {.medium}
```{r}
flights %>% select(where(is.numeric))
```
:::

## select() + where() - Get non-numeric columns

::: {.medium}
```{r}
flights %>% select(where(function(x) !is.numeric(x)))
```
:::

## relocate - to the front

```{r}
flights %>% relocate(carrier, origin, dest)
```

## relocate - to the end

```{r}
flights %>%
  relocate(year, month, day, .after = last_col())
```


## rename() - Change column names

```{r}
flights %>% rename(tail_number = tailnum)
```
 
## select() vs. rename()

:::: {.columns .small}
::: {.column width='50%'}
```{r}
flights %>% select(tail_number = tailnum)
```
:::

::: {.column width='50%'}
```{r}
#| include: false
options(width=50)
```
```{r}
flights %>% rename(tail_number = tailnum)
```
```{r}
#| include: false
options(width=75)
```
:::
::::

## pull()

```{r}
names(flights)
```

. . .

```{r}
flights %>% pull("year") %>% head()
```

. . .

```{r}
flights %>% pull(1) %>% head()
```

. . .

```{r}
flights %>% pull(-1) %>% head()
```


## arrange() - Sort data

```{r}
flights %>% filter(month==3,day==2) %>% arrange(origin, dest)
```

## arrange() & desc() - Descending order

```{r}
flights %>% 
  filter(month==3, day==2) %>% 
  arrange(desc(origin), dest) %>% 
  select(origin, dest, tailnum)
```


## distinct() - Find unique rows

```{r}
flights %>% 
  select(origin, dest) %>% 
  distinct() %>% 
  arrange(origin,dest)
```

## mutate() - Modify / create columns

```{r message=FALSE}
flights %>% 
  select(year:day) %>% 
  mutate(date = paste(year, month, day, sep="/"))
```

## summarise() - Arregate rows

```{r}
flights %>% 
  summarize(n(), min(dep_delay), max(dep_delay))
```

. . .

```{r}
flights %>% 
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm = TRUE), 
    max_dep_delay = max(dep_delay, na.rm = TRUE)
  )
```

## group_by()

```{r}
flights %>% group_by(origin)
```

## summarise() with group_by()

```{r}
flights %>% 
  group_by(origin) %>%
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm = TRUE), 
    max_dep_delay = max(dep_delay, na.rm = TRUE)
  )
```

## Groups after summarise

```{r}
#| include: false
options(width=50)
```

:::: {.columns .small}
::: {.column width='50%'}
```{r}
flights %>% 
  group_by(origin) %>%
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm=TRUE), 
    max_dep_delay = max(dep_delay, na.rm=TRUE),
    .groups = "drop_last"
  )
```
:::
::: {.column width='50%'}
```{r}
flights %>% 
  group_by(origin) %>%
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm=TRUE), 
    max_dep_delay = max(dep_delay, na.rm=TRUE),
    .groups = "keep"
  )
```
:::
::::

```{r}
#| include: false
options(width=75)
```


## count()

:::: {.columns}
::: {.column width='50%'}
```{r}
flights %>% 
  group_by(origin, carrier) %>%
  summarize(
    n = n(), 
    .groups = "drop"
  )
```
:::
::: {.column width='50%'}
```{r}
flights %>% 
  count(origin, carrier)
```
:::
::::

## mutate() with group_by()

```{r}
flights %>% group_by(origin) %>%
  mutate(
    n = n(), 
  ) %>%
  select(origin, n)
```

## Exercises / Examples

1. How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?

1. What was the shortest flight out of each airport in terms of distance? In terms of duration?

1. Which plane (check the tail number) flew out of each New York airport the most?

1. Which date should you fly on if you want to have the lowest possible average departure delay? What about arrival delay?