Concepts: Data Type, Strucuture and Manipulation

---

## Road Map of Intro

~ 20 mintues
 - Data Types
 - Data Structures
 - Data Manipulation

~ 20 minutes
Vector Data (sf, CRS)
  - Primatives
  - PCS vs GCS 
  - measures
  - Predicates Operations

---

# Learning Objectives

- Understand how you, software, and computers understand values

- Understand how valus can be structured
  - vectors, dimensions/attributes, data.frames
  
- Understand how data.frames can be operated on
  - filter
  - select
  - mutate
  - summarize
  - group_by

- Understand how two tables can be joined

---

# Data

Computers convert bytes --> hex --> value

- Humans read values
- Software reads Hex bytes
- Hardware reads Binary bytes

---

## Example

---

## What's the difference between 3 and '3'

**To us**: _meaning_

**To software**: _hows its handled_

**To a computer**: _nothing_

Thus, there is a need for a computer "guessable" / human definable data `types`.

---
class: center, middle, inverse
# Data Types (main 6)
---

### 1. Double/float (e.g. `2`, `2.15`)

- Values with decimals
- Generally the default computational data type
- There are three special values unique to doubles: `Inf`, `-Inf`, and `NaN` (not a number).

---

### 2. integer (e.g. `2L`)

- Values _without_ decimals 
- To create an integer you must follow the a number with an uppercase L.
- Take less memory then doubles but this is rarely an issue on modern computers and "normal" size data

```r
object.size(1:1000)
```

```
4048 bytes
```

```r
object.size(as.numeric(1:1000))
```

```
8048 bytes
```

---

### 3. character (e.g. `"x"`, `"Welcome!"`)

- Stores text ranging in size from a single letter to a novel.
- surrounded by `"` ("here") or `'` ('there'). 
- Special characters are escaped with `\`; see `?Quotes` for full details.

---

### 4. logical (e.g. `TRUE`, `FALSE`)

- Logical values store boolean values (`TRUE` and `FALSE`). 
- Useful for checking conditions and controlling the flow of a program.
- Or, for checking binary conditions (like on,off;  open/closed; >100)
- Logicals can be written in full (TRUE or FALSE), or abbreviated (T or F).
- Logicals are effectively 0 (F) and (1)

```r
as.numeric(TRUE)
```

```
[1] 1
```

```r
as.integer(FALSE)
```

```
[1] 0
```

---

### 5. raw (e.g. holds `bytes`)
### 6. complex (e.g. `1+4i`)

---

```r
# Numeric
*typeof(1.9)
```
]
 
.panel2-type-auto[

```
[1] "double"
```
]

---
count: false

```r
# Numeric
typeof(1.9)

# Integer
*typeof(1L)
```
]
 
.panel2-type-auto[

```
[1] "double"
```

```
[1] "integer"
```
]

---
count: false

```r
# Numeric
typeof(1.9)

# Integer
typeof(1L)

# Boolean
*typeof(TRUE)
```
]
 
.panel2-type-auto[

```
[1] "double"
```

```
[1] "integer"
```

```
[1] "logical"
```
]

---
count: false

```r
# Numeric
typeof(1.9)

# Integer
typeof(1L)

# Boolean
typeof(TRUE)

# Character
*typeof("Welcome")
```
]
 
.panel2-type-auto[

```
[1] "double"
```

```
[1] "integer"
```

```
[1] "logical"
```

```
[1] "character"
```
]

---
class: center, middle, inverse
# Data Structure
### Storing more then one value requires **structure**!
---

## Vectors

- **Vectors** come in two types: _atomic_ and _list_

-  _atomic_ vectors elements must have the same **type**

- _lists_ elements can have different **types**

- `NULL` serves as a generic zero length vector.

- This diagram - taken from [here](https://adv-r.hadley.nz/vectors-chap.html) illustrates the basic relationships:

---

### Atomic Vectors

A vector containing one **type** of data is called an "atom"
  - Can created using the `c()` (combine) function in R and with brackets `[` in Python numpy
  - The length can be checked with `length()` (R) or `len()` (python)

---

```r
# Numeric
*dbl_vec = c(1.9, 2, 3.5)
```
]
 
.panel2-atom-auto[

]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
*typeof(dbl_vec)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```
]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
typeof(dbl_vec)
*length(dbl_vec)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```

```
[1] 3
```
]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
typeof(dbl_vec)
length(dbl_vec)

*lg_vec = c(TRUE, FALSE, F, T)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```

```
[1] 3
```
]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
typeof(dbl_vec)
length(dbl_vec)

lg_vec = c(TRUE, FALSE, F, T)
*typeof(lg_vec)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```

```
[1] 3
```

```
[1] "logical"
```
]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
typeof(dbl_vec)
length(dbl_vec)

lg_vec = c(TRUE, FALSE, F, T)
typeof(lg_vec)
*length(lg_vec)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```

```
[1] 3
```

```
[1] "logical"
```

```
[1] 4
```
]

---
count: false

```r
# Numeric
dbl_vec = c(1.9, 2, 3.5)
typeof(dbl_vec)
length(dbl_vec)

lg_vec = c(TRUE, FALSE, F, T)
typeof(lg_vec)
length(lg_vec)
```
]
 
.panel2-atom-auto[

```
[1] "double"
```

```
[1] 3
```

```
[1] "logical"
```

```
[1] 4
```
]

---

# Missing Values!

- Missing values still need a place holder
- Missing values are denoted with `NA` (short for not applicable). 
- Missing values are 'infectious': most computations involving a missing value will return another missing value.

```r
x = c(2,4)
y = c(2, NA)

x + y
```

```
[1]  4 NA
```

---

## Cohersion

- As seen above, type is a property of the vector

- When you try and combine different types they will be coerced in a fixed order:

- character → double → integer → logical

- Coercion often happens automatically.

- Failed coercion generates a warning and a missing value

---
count: false

```r
*c("a", 1)
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```
]

---
count: false

```r
c("a", 1)

*c("a", TRUE)
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```

```
[1] "a"    "TRUE"
```
]

---
count: false

```r
c("a", 1)

c("a", TRUE)

*c(4.5, 1L)
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```

```
[1] "a"    "TRUE"
```

```
[1] 4.5 1.0
```
]

---
count: false

```r
c("a", 1)

c("a", TRUE)

c(4.5, 1L)

*c("1", 18, "GIS")
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```

```
[1] "a"    "TRUE"
```

```
[1] 4.5 1.0
```

```
[1] "1"   "18"  "GIS"
```
]

---
count: false

```r
c("a", 1)

c("a", TRUE)

c(4.5, 1L)

c("1", 18, "GIS")

*as.numeric(c("1", 18, "GIS"))
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```

```
[1] "a"    "TRUE"
```

```
[1] 4.5 1.0
```

```
[1] "1"   "18"  "GIS"
```

```
[1]  1 18 NA
```
]

---
count: false

```r
c("a", 1)

c("a", TRUE)

c(4.5, 1L)

c("1", 18, "GIS")

as.numeric(c("1", 18, "GIS"))

*as.logical(c("1", 18, "GIS"))
```
]
 
.panel2-atoms-auto[

```
[1] "a" "1"
```

```
[1] "a"    "TRUE"
```

```
[1] 4.5 1.0
```

```
[1] "1"   "18"  "GIS"
```

```
[1]  1 18 NA
```

```
[1] NA NA NA
```
]

---

# Subsetting Atomics

Atomics can be subset using the position(s) (explicit or logical) of each value

```r
# Atomic numeric vector
(x = c(3.4, 7, 18, 9.6))
```

```
[1]  3.4  7.0 18.0  9.6
```

```r
# Third Value
x[3]
```

```
[1] 18
```

```r
# Third and Fourth value
x[c(3,4)]
```

```
[1] 18.0  9.6
```

```r
# Drop the third value
x[-3]
```

```
[1] 3.4 7.0 9.6
```

```r
# Keep the 1 and 2 value, but drop 3 and 4
x[c(T,T,F,F)]
```

```
[1] 3.4 7.0
```

---

## Diminsions

- Atomics do **not** include a number of important structures like matrices (2D) or arrays (3D), factors, or date-times.

- These structures are built on top of atomic vectors by adding **attributes**.

- For example, adding a `dim` attribute to a vector allows it to behave like a 2D matrix or a XD array.

---

### Matrix

- A matrix is 2D atom (row, column)
 - Same data types
 - Same column length

- For spatial data, this corresponds to the structure of gridded/raster data.

```r
# Use matrix
(mat = matrix(1:9, nrow = 3))
```

```
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
```

---

# Subsetting Matrices

Now that a structure (row, column) has been assigned,
Subsetting requires a row, column (i,j) syntax

```r
(x = matrix(1:9, nrow = 3))
```

```
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
```

```r
x[3,]
```

```
[1] 3 6 9
```

```r
x[,3]
```

```
[1] 7 8 9
```

```r
x[3,3]
```

```
[1] 9
```

---

### Arrays

- An array is a 3D atom [row, column, slice]

- For spatial data, this is corresponds to the structure of gridded/raster data with a time diminsion

```r
*(array(c(1:12), dim = c(3,2,2)))
```
]
 
.panel2-arrays-auto[

```
, , 1

[,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

[,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12
```
]

---

# Subsetting Arrays

- Now that a structure (row, column, slice) has been assigned,

- Subsetting requires a row, column (i,j,z) syntax

```r
(x = array(1:12, dim = c(2,2,3))) 
```

```
, , 1

[,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

[,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

[,1] [,2]
[1,]    9   11
[2,]   10   12
```
]

```r
x[1,,]
```

```
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    3    7   11
```

```r
x[,1,]
```

```
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
```

```r
x[,,1]
```

```
     [,1] [,2]
[1,]    1    3
[2,]    2    4
```

---

## Lists

Lists allow each list element to be any type/strucutre.

```r
(my_list <- list(
  matrix(1:4, nrow = 2), 
  "GIS is great!", 
  c(TRUE, FALSE, TRUE), 
  c(2.3, 5.9)
))
```

```
[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[2]]
[1] "GIS is great!"

[[3]]
[1]  TRUE FALSE  TRUE

[[4]]
[1] 2.3 5.9
```

```r
typeof(my_list)
```

```
[1] "list"
```

---

# Subsetting Lists

- With fleaxabilty of list objects comes new subsetting operartors.

- The `[[`  allows an element of list to be accessed:

```r
my_list[[1]]
```

```
     [,1] [,2]
[1,]    1    3
[2,]    2    4
```

```r
my_list[[1]][1,2]
```

```
[1] 3
```

---

### Data Frames

A `data.frame` is a `structure` built on top of **lists**

- However, the length of each vector (list or atomic) **must** be the same.

- This gives data frames their rectangular structure and explains why they share the properties of both matrices and lists

---

```r
*(df1 <- data.frame(name = c("George", "Stan", "Carly"),
*                 age  = c(75,15,31),
*                 retired = c(T,F,F)))
```
]
 
.panel2-smalldf-auto[

```
    name age retired
1 George  75    TRUE
2   Stan  15   FALSE
3  Carly  31   FALSE
```
]

---
count: false
 
# A small data.frame
.panel1-smalldf-auto[

```r
(df1 <- data.frame(name = c("George", "Stan", "Carly"),
                  age  = c(75,15,31),
                  retired = c(T,F,F)))

*typeof(df1)
```
]
 
.panel2-smalldf-auto[

```
    name age retired
1 George  75    TRUE
2   Stan  15   FALSE
3  Carly  31   FALSE
```

```
[1] "list"
```
]

---

## Subsetting data.frames

data.frames can be subset like a matrix:

```r
df1[1,2]
```

```
[1] 75
```

data.frames can be subset like a list:

```r
df1[[2]]
```

```
[1] 75 15 31
```

And, with a new operation* `$` that allows access by column name:

```r
df1$name
```

```
[1] "George" "Stan"   "Carly" 
```

---
class: center, middle, inverse
# Data Manipulation
---

The manipulation of data.frames almost always comes back to SQL queries

R and Python abstract the SQL logic and provide function-ized methods for executing these core tasks.

`dplyr` is the principle package for doing this in R, while pandas is the "go to" in python.

dplyr describes is functionality as a 'grammar of data manipulation` which is useful for discussion in that the functions we call, are things we "do" to the data.frame.

---

## Verbs

We will cover two "pure" verbs:

- `select()`: picks variables based on their names.
- `filter()`: picks rows/observations based on their values.

And three "manipulation" verbs
- `mutate()`: adds new variables that are functions of existing variables
- `summarise()`: reduces multiple values down to a single summary.
- `arrange()`: changes the ordering of the rows.

These all combine naturally with `group_by()` which allows you to perform any operation “by group”.

---

## The %>% (pipe) operator
 
The pipe operator (`%>%`) will change your data workflow in R.

This syntax leads to code that is much easier to write and to read.

The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac).

The pipe passes the object on the left hand side of the pipe into the first argument of the right hand function

To be %>% compatible, the `data.frame` is ALWAYS the fist argument to dplyr verbs

---

# Data

For demonstration we will get some streamflow data from the USGS. The details of this function will come latter in the course so focus on the data

```r
flows = readNWISdv(siteNumbers = '14187200',
                   parameterCd = "00060") %>% 
  renameNWISColumns()

glimpse(flows)
```

```
Rows: 17,785
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,…
$ Flow      <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, …
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `filter()` to subset on a condition

- `filter()` takes logical (binary) expressions and returns the _rows_ in which all conditions are TRUE.

- `filter()` does NOT impact columns

- Lets find all rows in `flows` data that in have Flow code A:

```r
filter(flows, Flow_cd == "A") %>% 
  glimpse()
```

```
Rows: 17,464
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,…
$ Flow      <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, …
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `filter()` to subset data on multiple conditions

- Lets find all observations in `flows` where the Date is after `2000-01-01`, _and_ the Flow Code is A:

```r
filter(flows, Flow_cd == "A", Date > as.Date("2010-01-01")) %>% 
  glimpse()
```

```
Rows: 4,179
Columns: 5
$ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG…
$ site_no   <chr> "14187200", "14187200", "14187200", "14187200", "14187200", …
$ Date      <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06,…
$ Flow      <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040…
$ Flow_cd   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
```

---

## Use `select()` to subset by variables or columns.

- Use `select()` to subset the variables or columns you want by name (think of the `$` syntax).

```r
select(flows, Date, Flow) %>%
  glimpse()
```

```
Rows: 17,785
Columns: 2
$ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, 1973…
$ Flow <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 945, …
```

---

## Use `select()` to subset and rename

`select()` can also be used to rename existing columns

```r
select(flows, Date, flow_cfs = Flow) %>%
  glimpse()
```

```
Rows: 17,785
Columns: 2
$ Date     <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, …
$ flow_cfs <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 9…
```

---

# Use mutate() to add new variables

- `mutate()` defines and inserts new variables into a existing `data.frame`

- `mutate()` builds new variables sequentially so you can reference earlier ones when defining later ones

- In the `flows` dataset we have a Date value that will allow us extract a Year and month

```r
flows %>%
  select(Date, Flow) %>% 
  filter(Date > as.Date('2010-01-01')) %>%
  mutate(Year  = format(Date, "%Y"),
         Month = format(Date, "%m")) %>% 
  glimpse()
```

```
Rows: 4,482
Columns: 4
$ Date  <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06, 201…
$ Flow  <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040, 65…
$ Year  <chr> "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2010", …
$ Month <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01"…
```

---

# Arrange

- orders the rows of a `data.frame` rows by the values of selected columns.

```r
tmp  = flows %>%
  select(Date, Flow) %>% 
  filter(Date > as.Date('2020-01-01')) %>%
  mutate(Year  = format(Date, "%Y"),
         Month = format(Date, "%m")) 
  
head(arrange(tmp, as.numeric(Year)))
```

```
        Date Flow Year Month
1 2020-01-02 3200 2020    01
2 2020-01-03 3100 2020    01
3 2020-01-04 2420 2020    01
4 2020-01-05 2260 2020    01
5 2020-01-06 2550 2020    01
6 2020-01-07 3210 2020    01
```

```r
head(arrange(tmp, -as.numeric(Year)))
```

```
        Date  Flow Year Month
1 2022-01-01  5790 2022    01
2 2022-01-02  5650 2022    01
3 2022-01-03  6550 2022    01
4 2022-01-04  6640 2022    01
5 2022-01-05  8450 2022    01
6 2022-01-06 10500 2022    01
```

---

## `summarize()`

Have you ever had questions like "_what is the mean streamflow?_", but only have daily data?

- `summarize()` takes a dataset with n observations, computes requested values, and returns a dataset with 1 observation.

```r
flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow))
```

```
     meanQ  maxQ
1 2802.225 25700
```

---

## `group_by()`

Have you ever had questions like "_what is the mean annual streamflow?_", but only have daily data?
- `group_by()` adds extra structure to your dataset by grouping information (think of atomic diminsion)

- `mutate()` and `summarize()` honor groupings.

Combined with the verbs like `select`, `filter`, and `arrange` these new tools allow you to solve an extremely diverse set of problems with relative ease.

```r
tt = flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 group_by(Year)

paint::paint(tt)
```

```
tibble [17785, 3] 
grouped by: Year [50] 
Date date 1973-08-01 1973-08-02 1973-08-03 1973-08-04 1973-~
Flow dbl  809 828 829 930 939 939
Year chr  1973 1973 1973 1973 1973 1973 
```

---

```r
tt2 = tt %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow))

glimpse(tt2)
```

```
Rows: 50
Columns: 3
$ Year  <chr> "1973", "1974", "1975", "1976", "1977", "1978", "1979", "1980", …
$ meanQ <dbl> 4668.549, 3658.748, 3610.696, 2339.740, 2859.581, 2206.038, 2378…
$ maxQ  <dbl> 13200, 14400, 14100, 15000, 16200, 11300, 12000, 14700, 17000, 1…
```

---

## Dont forget!

- `group_by` adds _structure_ to your `data.frame` that impacts how data is manipulated, when done with your groupings dont forget to `ungroup()`

```r
flows %>% 
 select(Date, Flow) %>% 
 mutate(Year  = format(Date, "%Y")) %>%
 group_by(Year) %>%
 summarize(meanQ = mean(Flow),
           maxQ = max(Flow)) %>%
 ungroup()
```

```
# A tibble: 50 × 3
   Year  meanQ  maxQ
   <chr> <dbl> <dbl>
 1 1973  4669. 13200
 2 1974  3659. 14400
 3 1975  3611. 14100
 4 1976  2340. 15000
 5 1977  2860. 16200
 6 1978  2206. 11300
 7 1979  2378. 12000
 8 1980  2548. 14700
 9 1981  2976. 17000
10 1982  3424. 15100
# … with 40 more rows
```

---
class: inverse, center, middle
# When One Table is not enough
---

# Relational Data

- There will come a time when you need data from different sources.

- When this happens we must **join** -- or merge -- multiple `data.frames`

- To merge data, we have to find a point of _commonality_

- The attribute of _commonality_ is the _relation_ (e.g. realational data!)

---

# Relational Data

To work with relational data we need "verbs" that work with _pairs_ (**2**) of tables.

- **Mutating joins**: add *new* variables to one table from matching observations in another.

- **Filtering joins**:  filter observations from one table if they match an observation in the other table.

The most common place to find relational data is in a relational database management system (or RDBMS)

---

# Keys

- The _variables_ used to connect a pair tables are called **keys**.

- A key is a variable (or set of variables) that uniquely identifies an observation or "unit".

---

# Keys

There are two types of keys:

- **Primary keys**: *uniquely* identify observations in its own table.

- **Foreign keys**: *uniquely* identify observations in another table.

- A _primary_ key and a corresponding _foreign_ key form a _relation._

- Relations are typically one-to-many but can be one-to-one

---

# Todays Data:

```r
band_members
```

```
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
```
]

```r
band_instruments
```

```
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar
```
]

---
class: center,middle
# The basics
<img src="img/sql-table-joins.png"> 
---

---
class: inverse, center, middle
# Mutating Joins
add *new* variables to one table from matching observations in another.
---

**inner_join(x, y)**: Return all rows from _x_ where there are matching values in _y_, and all columns from x and y.

---
# Inner Join

If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

<table class="table" style="font-size: 18px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 18px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

### inner_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

<table class="table" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">guitar</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> <span style="     color: black !important;">Beatles</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">bass</span> </td>
  </tr>
</tbody>
</table>

---

# The other mutating joins

**left_join(x, y)**: Return all rows from _x_, and all columns from _x_ and _y_.

**right_join(x, y)**: Return all rows from _x_ where there are matching values in _y_, and all columns from _x_ and _y_.

**full_join(x, y)**: Return all rows and columns from both _x_ and _y_.

---
class: middle, center, inverse 
# Filtering Joins

"Filtering" joins keep cases from the LHS
---

# Semi Join

**semi_join(x, y)**:  Return all rows from _x_ where there are matching values in _ y_, keeping just columns from _x_.

****

### semi_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td>
  </tr>
</tbody>
</table>

<table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> plays </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> John </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> guitar </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> Paul </td>
   <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> bass </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Keith </td>
   <td style="text-align:left;"> guitar </td>
  </tr>
</tbody>
</table>

****

<table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paul </td>
   <td style="text-align:left;"> Beatles </td>
  </tr>
</tbody>
</table>
---

# Anti Join

**anti_join(x, y)**: Return all rows from _x_ where there are **not** matching values in _y_, keeping just columns from _x_.

****

### anti_join(band_members, band_instruments, by = "name")

****

<table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> band </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Mick </td>
   <td style="text-align:left;"> Stones </td>
  </tr>
</tbody>
</table>

---

# When keys dont share a name

```r
(band_members2 = band_members %>% 
  select(first_name = name, band))
```

```
# A tibble: 3 × 2
  first_name band   
  <chr>      <chr>  
1 Mick       Stones 
2 John       Beatles
3 Paul       Beatles
```
]

```r
inner_join(band_members2, 
           band_instruments, 
           by = c('first_name' = 'name'))
```

```
# A tibble: 2 × 3
  first_name band    plays 
  <chr>      <chr>   <chr> 
1 John       Beatles guitar
2 Paul       Beatles bass  
```
]

---

# Summary

- To a computer all data is bytes
- Software like R/Python know how to interpret bytes as values
- Multiple values can be stored with structure
- Like type values can be stored in atomic vectors and dislike types can be stored as lists

- Both R and python make extensive use of data.frames which are lists with enforced equal length elements (columns)

- SQL-ish and other data manipulations can be executed on data.frames for custom analysis, data creation, and question answering