Concepts: Data Manipulation

class: center, middle, inverse, title-slide

.title[
# Concepts: Data Manipulation
]

---

##  Data Manipulation

- The manipulation of `data.frames` almost always comes back to SQL queries

- R and Python abstract the SQL logic and provide function-ized methods for executing these core tasks.

- `dplyr` is the primary "tidyverse" package for doing this in R, while `pandas` is the "go to" in python.

- `dplyr` describes is functionality as a 'grammar of data manipulation' which is useful for discussion in that the functions we call, are things we "do" to the `data.frame`

---

## Verbs

We will cover two "pure" verbs:

- `select()`: picks variables based on their names.
- `filter()`: picks rows/observations based on their values.

And three "manipulation" verbs
- `mutate()`: adds new variables that are functions of existing variables
- `summarise()`: reduces multiple values down to a single summary.
- `arrange()`: changes the ordering of the rows.

These all combine naturally with `group_by()` which allows you to perform any operation "by group".

---

## The %>% (pipe) operator
 
The pipe operator (`%>%`) will change your data workflow in R.

This syntax leads to code that is much easier to write and to read.

The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac).

The pipe passes the object on the left hand side of the pipe into the first argument of the right hand function

To be %>% compatible, the `data.frame` is ALWAYS the fist argument to dplyr verbs

---

# Data

For demonstration we will get some streamflow data from the USGS. The details of this function will come latter in the course so focus on the data

```r
flows = readNWISdv(siteNumbers = '14187200',
                   parameterCd = "00060") %>% 
  renameNWISColumns()

glimpse(flows)
```

```
Rows: 17,785
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,…
$ Flow      <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, …
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `filter()` to subset on a condition

- `filter()` takes logical (binary) expressions and returns the _rows_ in which all conditions are TRUE.

- `filter()` does NOT impact columns

- Lets find all rows in `flows` data that in have Flow code A:

```r
filter(flows, Flow_cd == "A") %>% 
  glimpse()
```

```
Rows: 17,464
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,…
$ Flow      <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, …
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `filter()` to subset data on multiple conditions

- Lets find all observations in `flows` where the Date is after `2000-01-01`, _and_ the Flow Code is A:

```r
filter(flows, Flow_cd == "A", Date > as.Date("2010-01-01")) %>% 
  glimpse()
```

```
Rows: 4,179
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06,…
$ Flow      <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040…
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `select()` to subset by variables or columns.

- Use `select()` to subset the variables or columns you want by name (think of the `$` syntax).

```r
select(flows, Date, Flow) %>%
  glimpse()
```

```
Rows: 17,785
Columns: 2
$ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, 1973…
$ Flow <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 945, …
```

---

## Use `select()` to subset and rename

`select()` can also be used to rename existing columns

```r
select(flows, Date, flow_cfs = Flow) %>%
  glimpse()
```

```
Rows: 17,785
Columns: 2
$ Date     <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, …
$ flow_cfs <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 9…
```

---

# Use mutate() to add new variables

- `mutate()` defines and inserts new variables into a existing `data.frame`

- `mutate()` builds new variables sequentially so you can reference earlier ones when defining later ones

- In the `flows` dataset we have a Date value that will allow us extract a Year and month

```r
flows %>%
  select(Date, Flow) %>% 
  filter(Date > as.Date('2010-01-01')) %>%
  mutate(Year  = format(Date, "%Y"),
         Month = format(Date, "%m")) %>% 
  glimpse()
```

```
Rows: 4,482
Columns: 4
$ Date  <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06, 201…
$ Flow  <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040, 65…
$ Year  <chr> "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2010", …
$ Month <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01"…
```

---

# Arrange

- orders the rows of a `data.frame` rows by the values of selected columns.

```r
tmp  = flows %>%
  select(Date, Flow) %>% 
  filter(Date > as.Date('2020-01-01')) %>%
  mutate(Year  = format(Date, "%Y"),
         Month = format(Date, "%m")) 
  
head(arrange(tmp, as.numeric(Year)))
```

```
        Date Flow Year Month
1 2020-01-02 3200 2020    01
2 2020-01-03 3100 2020    01
3 2020-01-04 2420 2020    01
4 2020-01-05 2260 2020    01
5 2020-01-06 2550 2020    01
6 2020-01-07 3210 2020    01
```

```r
head(arrange(tmp, -as.numeric(Year)))
```

```
        Date  Flow Year Month
1 2022-01-01  5790 2022    01
2 2022-01-02  5650 2022    01
3 2022-01-03  6550 2022    01
4 2022-01-04  6640 2022    01
5 2022-01-05  8450 2022    01
6 2022-01-06 10500 2022    01
```

---

## `summarize()`

Have you ever had questions like "_what is the mean streamflow?_", but only have daily data?

- `summarize()` takes a dataset with n observations, computes requested values, and returns a dataset with 1 observation.

```r
flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow))
```

```
     meanQ  maxQ
1 2802.225 25700
```

---

## `group_by()`

Have you ever had questions like "_what is the mean annual streamflow?_", but only have daily data?
- `group_by()` adds extra structure to your dataset by grouping information (think of atomic diminsion)

- `mutate()` and `summarize()` honor groupings.

Combined with the verbs like `select`, `filter`, and `arrange` these new tools allow you to solve an extremely diverse set of problems with relative ease.

```r
tt = flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 group_by(Year)

paint::paint(tt)
```

```
tibble [17785, 3] 
grouped by: Year [50] 
Date date 1973-08-01 1973-08-02 1973-08-03 1973-08-04 1973-~
Flow dbl  809 828 829 930 939 939
Year chr  1973 1973 1973 1973 1973 1973 
```

---

```r
tt2 = tt %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow))

glimpse(tt2)
```

```
Rows: 50
Columns: 3
$ Year  <chr> "1973", "1974", "1975", "1976", "1977", "1978", "1979", "1980", …
$ meanQ <dbl> 4668.549, 3658.748, 3610.696, 2339.740, 2859.581, 2206.038, 2378…
$ maxQ  <dbl> 13200, 14400, 14100, 15000, 16200, 11300, 12000, 14700, 17000, 1…
```

---

## Dont forget!

- `group_by` adds _structure_ to your `data.frame` that impacts how data is manipulated, when done with your groupings dont forget to `ungroup()`

```r
flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 group_by(Year) %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow)) %>%
 ungroup()
```

```
# A tibble: 50 × 3
   Year  meanQ  maxQ
   <chr> <dbl> <dbl>
 1 1973  4669. 13200
 2 1974  3659. 14400
 3 1975  3611. 14100
 4 1976  2340. 15000
 5 1977  2860. 16200
 6 1978  2206. 11300
 7 1979  2378. 12000
 8 1980  2548. 14700
 9 1981  2976. 17000
10 1982  3424. 15100
# … with 40 more rows
```

---
class: inverse, center, middle
# When One Table is not enough
---

# Relational Data

- There will come a time when you need data from different sources.

- When this happens we must **join** -- or merge -- multiple `data.frames`

- To merge data, we have to find a point of _commonality_

- The attribute of _commonality_ is the _relation_ (e.g. realational data!)

---

# Relational Data

To work with relational data we need "verbs" that work with _pairs_ (**2**) of tables.

- **Mutating joins**: add *new* variables to one table from matching observations in another.

- **Filtering joins**:  filter observations from one table if they match an observation in the other table.

The most common place to find relational data is in a relational database management system (or RDBMS)

---

# Keys

- The _variables_ used to connect a pair tables are called **keys**.

- A key is a variable (or set of variables) that uniquely identifies an observation or "unit".

---

# Keys

There are two types of keys:

- **Primary keys**: *uniquely* identify observations in its own table.

- **Foreign keys**: *uniquely* identify observations in another table.

- A _primary_ key and a corresponding _foreign_ key form a _relation._

- Relations are typically one-to-many but can be one-to-one

---

# Todays Data:

.pull-left[

```r
band_members
```

```
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
```
]

.pull-right[

```r
band_instruments
```

```
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar
```
]

---
class: center,middle
# The basics
<img src="img/sql-table-joins.png"> 
---

---
class: inverse, center, middle
# Mutating Joins
add *new* variables to one table from matching observations in another.
---

---
# Inner Join

If there are multiple matches between x and y, all combination of the matches are returned.

<table class="table" style="font-size: 18px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 18px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

### inner_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

---

# The other mutating joins

.pull-left[

]

.pull-right[
**left_join(x, y)**: Return all rows from _x_, and all columns from _x_ and _y_.

<table class="table" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Stones</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;">NA</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">guitar</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">bass</span> </td>
  </tr>
</tbody>
</table>

**right_join(x, y)**: Return all rows from _x_ where there are matching values in _y_, and all columns from _x_ and _y_.

<table class="table" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">guitar</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">bass</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> <span style="     color: red !important;">NA</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">guitar</span> </td>
  </tr>
</tbody>
</table>

**full_join(x, y)**: Return all rows and columns from both _x_ and _y_.

]
---
class: middle, center, inverse 
# Filtering Joins

"Filtering" joins keep cases from the LHS
---

# Semi Join

**semi_join(x, y)**:  Return all rows from _x_ where there are matching values in _ y_, keeping just columns from _x_.

****

### semi_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

<table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
</tbody>
</table>
---

# Anti Join

**anti_join(x, y)**: Return all rows from _x_ where there are **not** matching values in _y_, keeping just columns from _x_.

****

### anti_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
</tbody>
</table>

---

# When keys dont share a name

.pull-left[

```r
(band_members2 = band_members %>% 
  select(first_name = name, band))
```

```
# A tibble: 3 × 2
  first_name band   
  <chr>      <chr>  
1 Mick       Stones 
2 John       Beatles
3 Paul       Beatles
```
]

.pull-right[

```r
inner_join(band_members2, 
           band_instruments, 
           by = c('first_name' = 'name'))
```

```
# A tibble: 2 × 3
  first_name band    plays 
  <chr>      <chr>   <chr> 
1 John       Beatles guitar
2 Paul       Beatles bass  
```
]

---

# Summary

- To a computer all data is bytes
- Software like R/Python know how to interpret bytes as values
- Multiple values can be stored with structure
- Like type values can be stored in atomic vectors and dislike types can be stored as lists

- Both R and python make extensive use of data.frames which are lists with enforced equal length elements (columns)

- SQL-ish and other data manipulations can be executed on data.frames for custom analysis, data creation, and question answering