class: center, middle, inverse, title-slide .title[ # Concepts: Data Type, Strucuture and Manipulation ] --- <style type="text/css"> .remark-code{font-size: 100%} </style> ## Road Map of Intro ~ 20 mintues - Data Types - Data Structures - Data Manipulation ~ 20 minutes Vector Data (sf, CRS) - Primatives - PCS vs GCS - measures - Predicates Operations --- # Learning Objectives - Understand how you, software, and computers understand values - Understand how valus can be structured - vectors, dimensions/attributes, data.frames - Understand how data.frames can be operated on - filter - select - mutate - summarize - group_by - Understand how two tables can be joined --- # Data Computers convert bytes --> hex --> value - Humans read values - Software reads Hex bytes - Hardware reads Binary bytes --- ## Example <img src="img/04-ascii-hex-binary.png"> --- ## What's the difference between 3 and '3' -- **To us**: _meaning_ -- **To software**: _hows its handled_ -- **To a computer**: _nothing_ -- Thus, there is a need for a computer "guessable" / human definable data `types`. --- class: center, middle, inverse # Data Types (main 6) --- ### 1. Double/float (e.g. `2`, `2.15`) - Values with decimals - Generally the default computational data type - There are three special values unique to doubles: `Inf`, `-Inf`, and `NaN` (not a number). --- ### 2. integer (e.g. `2L`) - Values _without_ decimals - To create an integer you must follow the a number with an uppercase L. - Take less memory then doubles but this is rarely an issue on modern computers and "normal" size data ```r object.size(1:1000) ``` ``` 4048 bytes ``` ```r object.size(as.numeric(1:1000)) ``` ``` 8048 bytes ``` --- ### 3. character (e.g. `"x"`, `"Welcome!"`) - Stores text ranging in size from a single letter to a novel. - surrounded by `"` ("here") or `'` ('there'). - Special characters are escaped with `\`; see `?Quotes` for full details. --- ### 4. logical (e.g. `TRUE`, `FALSE`) - Logical values store boolean values (`TRUE` and `FALSE`). - Useful for checking conditions and controlling the flow of a program. - Or, for checking binary conditions (like on,off; open/closed; >100) - Logicals can be written in full (TRUE or FALSE), or abbreviated (T or F). - Logicals are effectively 0 (F) and (1) ```r as.numeric(TRUE) ``` ``` [1] 1 ``` ```r as.integer(FALSE) ``` ``` [1] 0 ``` --- ### 5. raw (e.g. holds `bytes`) ### 6. complex (e.g. `1+4i`) --- count: false .panel1-type-auto[ ```r # Numeric *typeof(1.9) ``` ] .panel2-type-auto[ ``` [1] "double" ``` ] --- count: false .panel1-type-auto[ ```r # Numeric typeof(1.9) # Integer *typeof(1L) ``` ] .panel2-type-auto[ ``` [1] "double" ``` ``` [1] "integer" ``` ] --- count: false .panel1-type-auto[ ```r # Numeric typeof(1.9) # Integer typeof(1L) # Boolean *typeof(TRUE) ``` ] .panel2-type-auto[ ``` [1] "double" ``` ``` [1] "integer" ``` ``` [1] "logical" ``` ] --- count: false .panel1-type-auto[ ```r # Numeric typeof(1.9) # Integer typeof(1L) # Boolean typeof(TRUE) # Character *typeof("Welcome") ``` ] .panel2-type-auto[ ``` [1] "double" ``` ``` [1] "integer" ``` ``` [1] "logical" ``` ``` [1] "character" ``` ] <style> .panel1-type-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-type-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-type-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- class: center, middle, inverse # Data Structure ### Storing more then one value requires **structure**! --- ## Vectors - **Vectors** come in two types: _atomic_ and _list_ -- - _atomic_ vectors elements must have the same **type** -- - _lists_ elements can have different **types** -- - `NULL` serves as a generic zero length vector. -- - This diagram - taken from [here](https://adv-r.hadley.nz/vectors-chap.html) illustrates the basic relationships: <img src="img/04-vec-list-01.png"> --- ### Atomic Vectors A vector containing one **type** of data is called an "atom" - Can created using the `c()` (combine) function in R and with brackets `[` in Python numpy - The length can be checked with `length()` (R) or `len()` (python) <img src="img/04-vec-list-02.png" width="60%" style="display: block; margin: auto;" /> --- count: false .panel1-atom-auto[ ```r # Numeric *dbl_vec = c(1.9, 2, 3.5) ``` ] .panel2-atom-auto[ ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) *typeof(dbl_vec) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) typeof(dbl_vec) *length(dbl_vec) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ``` [1] 3 ``` ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) typeof(dbl_vec) length(dbl_vec) *lg_vec = c(TRUE, FALSE, F, T) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ``` [1] 3 ``` ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) typeof(dbl_vec) length(dbl_vec) lg_vec = c(TRUE, FALSE, F, T) *typeof(lg_vec) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ``` [1] 3 ``` ``` [1] "logical" ``` ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) typeof(dbl_vec) length(dbl_vec) lg_vec = c(TRUE, FALSE, F, T) typeof(lg_vec) *length(lg_vec) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ``` [1] 3 ``` ``` [1] "logical" ``` ``` [1] 4 ``` ] --- count: false .panel1-atom-auto[ ```r # Numeric dbl_vec = c(1.9, 2, 3.5) typeof(dbl_vec) length(dbl_vec) lg_vec = c(TRUE, FALSE, F, T) typeof(lg_vec) length(lg_vec) ``` ] .panel2-atom-auto[ ``` [1] "double" ``` ``` [1] 3 ``` ``` [1] "logical" ``` ``` [1] 4 ``` ] <style> .panel1-atom-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-atom-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-atom-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Missing Values! - Missing values still need a place holder - Missing values are denoted with `NA` (short for not applicable). - Missing values are 'infectious': most computations involving a missing value will return another missing value. ```r x = c(2,4) y = c(2, NA) x + y ``` ``` [1] 4 NA ``` --- ## Cohersion - As seen above, type is a property of the vector - When you try and combine different types they will be coerced in a fixed order: - character → double → integer → logical - Coercion often happens automatically. - Failed coercion generates a warning and a missing value --- count: false .panel1-atoms-auto[ ```r *c("a", 1) ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ] --- count: false .panel1-atoms-auto[ ```r c("a", 1) *c("a", TRUE) ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ``` [1] "a" "TRUE" ``` ] --- count: false .panel1-atoms-auto[ ```r c("a", 1) c("a", TRUE) *c(4.5, 1L) ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ``` [1] "a" "TRUE" ``` ``` [1] 4.5 1.0 ``` ] --- count: false .panel1-atoms-auto[ ```r c("a", 1) c("a", TRUE) c(4.5, 1L) *c("1", 18, "GIS") ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ``` [1] "a" "TRUE" ``` ``` [1] 4.5 1.0 ``` ``` [1] "1" "18" "GIS" ``` ] --- count: false .panel1-atoms-auto[ ```r c("a", 1) c("a", TRUE) c(4.5, 1L) c("1", 18, "GIS") *as.numeric(c("1", 18, "GIS")) ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ``` [1] "a" "TRUE" ``` ``` [1] 4.5 1.0 ``` ``` [1] "1" "18" "GIS" ``` ``` [1] 1 18 NA ``` ] --- count: false .panel1-atoms-auto[ ```r c("a", 1) c("a", TRUE) c(4.5, 1L) c("1", 18, "GIS") as.numeric(c("1", 18, "GIS")) *as.logical(c("1", 18, "GIS")) ``` ] .panel2-atoms-auto[ ``` [1] "a" "1" ``` ``` [1] "a" "TRUE" ``` ``` [1] 4.5 1.0 ``` ``` [1] "1" "18" "GIS" ``` ``` [1] 1 18 NA ``` ``` [1] NA NA NA ``` ] <style> .panel1-atoms-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-atoms-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-atoms-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Subsetting Atomics Atomics can be subset using the position(s) (explicit or logical) of each value ```r # Atomic numeric vector (x = c(3.4, 7, 18, 9.6)) ``` ``` [1] 3.4 7.0 18.0 9.6 ``` ```r # Third Value x[3] ``` ``` [1] 18 ``` ```r # Third and Fourth value x[c(3,4)] ``` ``` [1] 18.0 9.6 ``` ```r # Drop the third value x[-3] ``` ``` [1] 3.4 7.0 9.6 ``` ```r # Keep the 1 and 2 value, but drop 3 and 4 x[c(T,T,F,F)] ``` ``` [1] 3.4 7.0 ``` --- ## Diminsions - Atomics do **not** include a number of important structures like matrices (2D) or arrays (3D), factors, or date-times. - These structures are built on top of atomic vectors by adding **attributes**. - For example, adding a `dim` attribute to a vector allows it to behave like a 2D matrix or a XD array. --- ### Matrix - A matrix is 2D atom (row, column) - Same data types - Same column length - For spatial data, this corresponds to the structure of gridded/raster data. ```r # Use matrix (mat = matrix(1:9, nrow = 3)) ``` ``` [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 ``` --- # Subsetting Matrices Now that a structure (row, column) has been assigned, Subsetting requires a row, column (i,j) syntax -- ```r (x = matrix(1:9, nrow = 3)) ``` ``` [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 ``` ```r x[3,] ``` ``` [1] 3 6 9 ``` ```r x[,3] ``` ``` [1] 7 8 9 ``` ```r x[3,3] ``` ``` [1] 9 ``` --- ### Arrays - An array is a 3D atom [row, column, slice] - For spatial data, this is corresponds to the structure of gridded/raster data with a time diminsion count: false .panel1-arrays-auto[ ```r *(array(c(1:12), dim = c(3,2,2))) ``` ] .panel2-arrays-auto[ ``` , , 1 [,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6 , , 2 [,1] [,2] [1,] 7 10 [2,] 8 11 [3,] 9 12 ``` ] <style> .panel1-arrays-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-arrays-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-arrays-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- # Subsetting Arrays - Now that a structure (row, column, slice) has been assigned, - Subsetting requires a row, column (i,j,z) syntax .pull-left[ ```r (x = array(1:12, dim = c(2,2,3))) ``` ``` , , 1 [,1] [,2] [1,] 1 3 [2,] 2 4 , , 2 [,1] [,2] [1,] 5 7 [2,] 6 8 , , 3 [,1] [,2] [1,] 9 11 [2,] 10 12 ``` ] .pull-right[ ```r x[1,,] ``` ``` [,1] [,2] [,3] [1,] 1 5 9 [2,] 3 7 11 ``` ```r x[,1,] ``` ``` [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 ``` ```r x[,,1] ``` ``` [,1] [,2] [1,] 1 3 [2,] 2 4 ``` --- ## Lists Lists allow each list element to be any type/strucutre. ```r (my_list <- list( matrix(1:4, nrow = 2), "GIS is great!", c(TRUE, FALSE, TRUE), c(2.3, 5.9) )) ``` ``` [[1]] [,1] [,2] [1,] 1 3 [2,] 2 4 [[2]] [1] "GIS is great!" [[3]] [1] TRUE FALSE TRUE [[4]] [1] 2.3 5.9 ``` ```r typeof(my_list) ``` ``` [1] "list" ``` --- # Subsetting Lists - With fleaxabilty of list objects comes new subsetting operartors. - The `[[` allows an element of list to be accessed: ```r my_list[[1]] ``` ``` [,1] [,2] [1,] 1 3 [2,] 2 4 ``` ```r my_list[[1]][1,2] ``` ``` [1] 3 ``` --- ### Data Frames A `data.frame` is a `structure` built on top of **lists** -- - However, the length of each vector (list or atomic) **must** be the same. -- - This gives data frames their rectangular structure and explains why they share the properties of both matrices and lists --- count: false # A small data.frame .panel1-smalldf-auto[ ```r *(df1 <- data.frame(name = c("George", "Stan", "Carly"), * age = c(75,15,31), * retired = c(T,F,F))) ``` ] .panel2-smalldf-auto[ ``` name age retired 1 George 75 TRUE 2 Stan 15 FALSE 3 Carly 31 FALSE ``` ] --- count: false # A small data.frame .panel1-smalldf-auto[ ```r (df1 <- data.frame(name = c("George", "Stan", "Carly"), age = c(75,15,31), retired = c(T,F,F))) *typeof(df1) ``` ] .panel2-smalldf-auto[ ``` name age retired 1 George 75 TRUE 2 Stan 15 FALSE 3 Carly 31 FALSE ``` ``` [1] "list" ``` ] <style> .panel1-smalldf-auto { color: black; width: 58.2%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-smalldf-auto { color: black; width: 38.8%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-smalldf-auto { color: black; width: 0%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Subsetting data.frames data.frames can be subset like a matrix: ```r df1[1,2] ``` ``` [1] 75 ``` -- data.frames can be subset like a list: ```r df1[[2]] ``` ``` [1] 75 15 31 ``` -- And, with a new operation* `$` that allows access by column name: ```r df1$name ``` ``` [1] "George" "Stan" "Carly" ``` --- class: center, middle, inverse # Data Manipulation --- The manipulation of data.frames almost always comes back to SQL queries R and Python abstract the SQL logic and provide function-ized methods for executing these core tasks. `dplyr` is the principle package for doing this in R, while pandas is the "go to" in python. dplyr describes is functionality as a 'grammar of data manipulation` which is useful for discussion in that the functions we call, are things we "do" to the data.frame. --- ## Verbs We will cover two "pure" verbs: - `select()`: picks variables based on their names. - `filter()`: picks rows/observations based on their values. And three "manipulation" verbs - `mutate()`: adds new variables that are functions of existing variables - `summarise()`: reduces multiple values down to a single summary. - `arrange()`: changes the ordering of the rows. These all combine naturally with `group_by()` which allows you to perform any operation “by group”. --- ## The %>% (pipe) operator The pipe operator (`%>%`) will change your data workflow in R. This syntax leads to code that is much easier to write and to read. The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac). The pipe passes the object on the left hand side of the pipe into the first argument of the right hand function To be %>% compatible, the `data.frame` is ALWAYS the fist argument to dplyr verbs --- # Data For demonstration we will get some streamflow data from the USGS. The details of this function will come latter in the course so focus on the data ```r flows = readNWISdv(siteNumbers = '14187200', parameterCd = "00060") %>% renameNWISColumns() glimpse(flows) ``` ``` Rows: 17,785 Columns: 5 $ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG… $ site_no <chr> "14187200", "14187200", "14187200", "14187200", "14187200", … $ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,… $ Flow <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, … $ Flow_cd <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", … ``` --- ## Use `filter()` to subset on a condition - `filter()` takes logical (binary) expressions and returns the _rows_ in which all conditions are TRUE. - `filter()` does NOT impact columns -- - Lets find all rows in `flows` data that in have Flow code A: ```r filter(flows, Flow_cd == "A") %>% glimpse() ``` ``` Rows: 17,464 Columns: 5 $ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG… $ site_no <chr> "14187200", "14187200", "14187200", "14187200", "14187200", … $ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05,… $ Flow <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, … $ Flow_cd <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", … ``` --- ## Use `filter()` to subset data on multiple conditions - Lets find all observations in `flows` where the Date is after `2000-01-01`, _and_ the Flow Code is A: -- ```r filter(flows, Flow_cd == "A", Date > as.Date("2010-01-01")) %>% glimpse() ``` ``` Rows: 4,179 Columns: 5 $ agency_cd <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USGS", "USG… $ site_no <chr> "14187200", "14187200", "14187200", "14187200", "14187200", … $ Date <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06,… $ Flow <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040… $ Flow_cd <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", … ``` --- ## Use `select()` to subset by variables or columns. - Use `select()` to subset the variables or columns you want by name (think of the `$` syntax). -- ```r select(flows, Date, Flow) %>% glimpse() ``` ``` Rows: 17,785 Columns: 2 $ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, 1973… $ Flow <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 945, … ``` --- ## Use `select()` to subset and rename `select()` can also be used to rename existing columns ```r select(flows, Date, flow_cfs = Flow) %>% glimpse() ``` ``` Rows: 17,785 Columns: 2 $ Date <date> 1973-08-01, 1973-08-02, 1973-08-03, 1973-08-04, 1973-08-05, … $ flow_cfs <dbl> 809, 828, 829, 930, 939, 939, 944, 932, 927, 925, 927, 928, 9… ``` --- # Use mutate() to add new variables -- - `mutate()` defines and inserts new variables into a existing `data.frame` -- - `mutate()` builds new variables sequentially so you can reference earlier ones when defining later ones -- - In the `flows` dataset we have a Date value that will allow us extract a Year and month -- ```r flows %>% select(Date, Flow) %>% filter(Date > as.Date('2010-01-01')) %>% mutate(Year = format(Date, "%Y"), Month = format(Date, "%m")) %>% glimpse() ``` ``` Rows: 4,482 Columns: 4 $ Date <date> 2010-01-02, 2010-01-03, 2010-01-04, 2010-01-05, 2010-01-06, 201… $ Flow <dbl> 7870, 6920, 5860, 7860, 10000, 10100, 9760, 9130, 8600, 7040, 65… $ Year <chr> "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2010", … $ Month <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01"… ``` --- # Arrange - orders the rows of a `data.frame` rows by the values of selected columns. ```r tmp = flows %>% select(Date, Flow) %>% filter(Date > as.Date('2020-01-01')) %>% mutate(Year = format(Date, "%Y"), Month = format(Date, "%m")) head(arrange(tmp, as.numeric(Year))) ``` ``` Date Flow Year Month 1 2020-01-02 3200 2020 01 2 2020-01-03 3100 2020 01 3 2020-01-04 2420 2020 01 4 2020-01-05 2260 2020 01 5 2020-01-06 2550 2020 01 6 2020-01-07 3210 2020 01 ``` ```r head(arrange(tmp, -as.numeric(Year))) ``` ``` Date Flow Year Month 1 2022-01-01 5790 2022 01 2 2022-01-02 5650 2022 01 3 2022-01-03 6550 2022 01 4 2022-01-04 6640 2022 01 5 2022-01-05 8450 2022 01 6 2022-01-06 10500 2022 01 ``` --- ## `summarize()` Have you ever had questions like "_what is the mean streamflow?_", but only have daily data? - `summarize()` takes a dataset with n observations, computes requested values, and returns a dataset with 1 observation. ```r flows %>% select(Date, Flow) %>% mutate(Year = format(Date, "%Y")) %>% summarize(meanQ = mean(Flow), maxQ = max(Flow)) ``` ``` meanQ maxQ 1 2802.225 25700 ``` --- ## `group_by()` Have you ever had questions like "_what is the mean annual streamflow?_", but only have daily data? - `group_by()` adds extra structure to your dataset by grouping information (think of atomic diminsion) - `mutate()` and `summarize()` honor groupings. -- Combined with the verbs like `select`, `filter`, and `arrange` these new tools allow you to solve an extremely diverse set of problems with relative ease. ```r tt = flows %>% select(Date, Flow) %>% mutate(Year = format(Date, "%Y")) %>% group_by(Year) paint::paint(tt) ``` ``` tibble [17785, 3] grouped by: Year [50] Date date 1973-08-01 1973-08-02 1973-08-03 1973-08-04 1973-~ Flow dbl 809 828 829 930 939 939 Year chr 1973 1973 1973 1973 1973 1973 ``` --- ```r tt2 = tt %>% summarize(meanQ = mean(Flow), maxQ = max(Flow)) glimpse(tt2) ``` ``` Rows: 50 Columns: 3 $ Year <chr> "1973", "1974", "1975", "1976", "1977", "1978", "1979", "1980", … $ meanQ <dbl> 4668.549, 3658.748, 3610.696, 2339.740, 2859.581, 2206.038, 2378… $ maxQ <dbl> 13200, 14400, 14100, 15000, 16200, 11300, 12000, 14700, 17000, 1… ``` --- ## Dont forget! - `group_by` adds _structure_ to your `data.frame` that impacts how data is manipulated, when done with your groupings dont forget to `ungroup()` ```r flows %>% select(Date, Flow) %>% mutate(Year = format(Date, "%Y")) %>% group_by(Year) %>% summarize(meanQ = mean(Flow), maxQ = max(Flow)) %>% ungroup() ``` ``` # A tibble: 50 × 3 Year meanQ maxQ <chr> <dbl> <dbl> 1 1973 4669. 13200 2 1974 3659. 14400 3 1975 3611. 14100 4 1976 2340. 15000 5 1977 2860. 16200 6 1978 2206. 11300 7 1979 2378. 12000 8 1980 2548. 14700 9 1981 2976. 17000 10 1982 3424. 15100 # … with 40 more rows ``` --- class: inverse, center, middle # When One Table is not enough --- # Relational Data - There will come a time when you need data from different sources. -- - When this happens we must **join** -- or merge -- multiple `data.frames` -- - To merge data, we have to find a point of _commonality_ -- - The attribute of _commonality_ is the _relation_ (e.g. realational data!) --- # Relational Data To work with relational data we need "verbs" that work with _pairs_ (**2**) of tables. -- - **Mutating joins**: add *new* variables to one table from matching observations in another. -- - **Filtering joins**: filter observations from one table if they match an observation in the other table. -- The most common place to find relational data is in a relational database management system (or RDBMS) --- # Keys - The _variables_ used to connect a pair tables are called **keys**. -- - A key is a variable (or set of variables) that uniquely identifies an observation or "unit". --- # Keys There are two types of keys: -- - **Primary keys**: *uniquely* identify observations in its own table. -- - **Foreign keys**: *uniquely* identify observations in another table. -- - A _primary_ key and a corresponding _foreign_ key form a _relation._ -- - Relations are typically one-to-many but can be one-to-one --- # Todays Data: .pull-left[ ```r band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .pull-right[ ```r band_instruments ``` ``` # A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ] --- class: center,middle # The basics <img src="img/sql-table-joins.png"> --- --- class: inverse, center, middle # Mutating Joins add *new* variables to one table from matching observations in another. --- **inner_join(x, y)**: Return all rows from _x_ where there are matching values in _y_, and all columns from x and y. --- # Inner Join If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join. -- <table class="table" style="font-size: 18px; width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> Beatles </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> Beatles </td> </tr> </tbody> </table> <table class="table" style="font-size: 18px; width: auto !important; margin-right: 0; margin-left: auto"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> guitar </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> bass </td> </tr> <tr> <td style="text-align:left;"> Keith </td> <td style="text-align:left;"> guitar </td> </tr> </tbody> </table> -- **** ### inner_join(band_members, band_instruments, by = "name") -- **** <table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Beatles </td> </tr> </tbody> </table> <table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> John </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> guitar </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> Paul </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: yellow !important;"> bass </td> </tr> <tr> <td style="text-align:left;"> Keith </td> <td style="text-align:left;"> guitar </td> </tr> </tbody> </table> -- **** <table class="table" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> <span style=" color: black !important;">Beatles</span> </td> <td style="text-align:left;"> <span style=" color: black !important;">guitar</span> </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> <span style=" color: black !important;">Beatles</span> </td> <td style="text-align:left;"> <span style=" color: black !important;">bass</span> </td> </tr> </tbody> </table> --- # The other mutating joins **left_join(x, y)**: Return all rows from _x_, and all columns from _x_ and _y_. **right_join(x, y)**: Return all rows from _x_ where there are matching values in _y_, and all columns from _x_ and _y_. **full_join(x, y)**: Return all rows and columns from both _x_ and _y_. --- class: middle, center, inverse # Filtering Joins "Filtering" joins keep cases from the LHS --- # Semi Join **semi_join(x, y)**: Return all rows from _x_ where there are matching values in _ y_, keeping just columns from _x_. -- <table class="table" style="font-size: 18px; width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> Beatles </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> Beatles </td> </tr> </tbody> </table> <table class="table" style="font-size: 18px; width: auto !important; margin-right: 0; margin-left: auto"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> guitar </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> bass </td> </tr> <tr> <td style="text-align:left;"> Keith </td> <td style="text-align:left;"> guitar </td> </tr> </tbody> </table> -- **** ### semi_join(band_members, band_instruments, by = "name") -- **** <table class="table" style="font-size: 24px; width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> John </td> <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Paul </td> <td style="text-align:left;font-weight: bold;color: darkred !important;background-color: white !important;"> Beatles </td> </tr> </tbody> </table> <table class="table" style="font-size: 24px; width: auto !important; margin-right: 0; margin-left: auto"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> John </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> guitar </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> Paul </td> <td style="text-align:left;font-weight: bold;color: navy !important;background-color: gray50 !important;"> bass </td> </tr> <tr> <td style="text-align:left;"> Keith </td> <td style="text-align:left;"> guitar </td> </tr> </tbody> </table> -- **** <table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> Beatles </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> Beatles </td> </tr> </tbody> </table> --- # Anti Join **anti_join(x, y)**: Return all rows from _x_ where there are **not** matching values in _y_, keeping just columns from _x_. -- <table class="table" style="font-size: 18px; width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> Beatles </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> Beatles </td> </tr> </tbody> </table> <table class="table" style="font-size: 18px; width: auto !important; margin-right: 0; margin-left: auto"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> plays </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:left;"> guitar </td> </tr> <tr> <td style="text-align:left;"> Paul </td> <td style="text-align:left;"> bass </td> </tr> <tr> <td style="text-align:left;"> Keith </td> <td style="text-align:left;"> guitar </td> </tr> </tbody> </table> -- **** ### anti_join(band_members, band_instruments, by = "name") -- **** <table class="table" style="font-size: 24px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> band </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mick </td> <td style="text-align:left;"> Stones </td> </tr> </tbody> </table> --- # When keys dont share a name .pull-left[ ```r (band_members2 = band_members %>% select(first_name = name, band)) ``` ``` # A tibble: 3 × 2 first_name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .pull-right[ ```r inner_join(band_members2, band_instruments, by = c('first_name' = 'name')) ``` ``` # A tibble: 2 × 3 first_name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass ``` ] --- # Summary - To a computer all data is bytes - Software like R/Python know how to interpret bytes as values - Multiple values can be stored with structure - Like type values can be stored in atomic vectors and dislike types can be stored as lists - Both R and python make extensive use of data.frames which are lists with enforced equal length elements (columns) - SQL-ish and other data manipulations can be executed on data.frames for custom analysis, data creation, and question answering