Terminology: data structures
R is an interpreted language (accessed through a command-line interpreter) with a number of data structures (vectors, matrices, arrays, data frames, lists) and extensible objects (regression models, time series, geospatial coordinates). It supports procedural programming with functions.
To learn about objects, become friends with the built-in class and str functions. Let’s explore a new dataset, palmerpenguins, recently developed by Allison Horst as an alternative to the old R standby, the iris dataset:
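A minimal sketch of the calls behind the output below, assuming the palmerpenguins package is installed and provides the penguins dataset:

```r
# Assumes the package is available: install.packages("palmerpenguins")
library(palmerpenguins)

class(penguins)  # the class vector of the object
str(penguins)    # compact summary: dimensions, column types, first few values
```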
## [1] "tbl_df" "tbl" "data.frame"
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
penguins is a tibble, which is a new tidyverse spin on data frames, and was created as an alternative to the iris data set that has been used extensively in beginner tutorials on learning R. Data frames consist of rows of observations on columns of values for variables of interest; they are one of the fundamental and most important data structures in R. Tibbles make a few improvements and changes to data frames, such as:
- Never converting strings to factors
- Never creating row names
- Updated print method that shows only the first 10 rows, just the columns that fit on the screen, and the type of each column (just as we get using str)
We can easily convert objects from tibble to data frame and vice versa:
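One way to do the conversion, sketched here with as.data.frame from base R and as_tibble from the tibble package (both packages are assumed to be installed; the object names penguins_df and penguins_tbl are ours):

```r
library(palmerpenguins)
library(tibble)

penguins_df  <- as.data.frame(penguins)  # tibble -> plain data frame
penguins_tbl <- as_tibble(penguins_df)   # data frame -> tibble

class(penguins_df)   # "data.frame"
class(penguins_tbl)  # "tbl_df" "tbl" "data.frame"
```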
But as we see in the result of str(penguins) above, following the information that penguins is a tibble with 344 observations of 8 variables, we get information on each of the variables: in this case, 2 are numeric, 3 are integers, and 3 are factors. Factors encode categorical variables, and str gives us the number of levels in each factor.
First off, R has several main data types:
- logical
- integer
- double
- complex
- character
- raw
- list
- NULL
- closure (function)
- special
- builtin (basic functions and operators)
- environment
- S4 (some S4 objects)
- others you won’t run into at user level
We can ask what data type something is using typeof:
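Calls of the following form would give the three results below (a sketch, assuming palmerpenguins is installed):

```r
library(palmerpenguins)

typeof(penguins)                 # the whole tibble
typeof(penguins$bill_length_mm)  # a numeric column
typeof(penguins$species)         # a factor column
```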
## [1] "list"
## [1] "double"
## [1] "integer"
We see a couple of interesting things here: penguins, which we just said is a tibble, is a data type of list. bill_length_mm is data type double, and in str(penguins) we saw it was numeric, which makes sense. But we see that species is data type integer, and in str(penguins) we were told this variable was a factor with three levels. What’s going on here?
First off, class refers to the abstract type of an object in R, whereas typeof and mode refer to how an object is stored in memory. So penguins is an object of class tibble, but it is stored in memory as a list (i.e. each column is an item in a list). Note that this allows tibbles and data frames to have columns of different classes, whereas a matrix needs to be all of the same mode.
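To make the list-versus-matrix point concrete, here is a small sketch. The matrix m is a hypothetical toy example of ours, and palmerpenguins is assumed to be installed:

```r
library(palmerpenguins)

is.list(penguins)        # TRUE: a tibble is stored as a list
length(penguins)         # 8: one list element per column
sapply(penguins, class)  # the columns have different classes

# A matrix has a single mode, so mixing integers and strings
# coerces every element to character.
m <- cbind(id = 1:2, name = c("a", "b"))
typeof(m)  # "character"
```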
For our species column, we see its mode is numeric, its typeof is integer, and its class is factor. Nominal variables in R are stored as a vector of integers 1:k, where k is the number of unique values of that nominal variable, together with a mapping of the character strings to these integer values.
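The integer-codes-plus-labels idea can be seen on a tiny factor. The vector f below is a hypothetical toy example, not taken from the penguins data:

```r
# Hypothetical toy vector of species names
f <- factor(c("Gentoo", "Adelie", "Gentoo", "Chinstrap"))

levels(f)                 # the k = 3 labels, sorted alphabetically by default
as.integer(f)             # the underlying integer codes: 3 1 3 2
levels(f)[as.integer(f)]  # looking the codes up in the labels recovers the strings

mode(f)    # "numeric"
typeof(f)  # "integer"
class(f)   # "factor"
```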
This allows us to quickly see all the unique values of a particular nominal variable or quickly re-assign a level of a nominal variable to a new value. Remember, everything in R is in memory, so tweaking the data changes only the copy in your session; don’t worry about it!
## [1] "Adelie" "Chinstrap" "Gentoo"
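The levels printed above come from levels(). A re-assignment consistent with the "adeliae" label that appears in the later output might look like this (a sketch; the exact call used originally is an assumption on our part):

```r
library(palmerpenguins)

levels(penguins$species)                  # list the unique values

# Replace the first level's label; the change applies to a copy of
# penguins in the current session, not to the packaged dataset.
levels(penguins$species)[1] <- "adeliae"

levels(penguins$species)
```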
See if you can explain how that re-assignment we just did worked.
To access particular columns in a tibble or data frame, as we saw above, we use the $ operator. We can see the value of species for each observation in penguins, as well as a listing of all levels of the variable, by running:
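A call of the following form prints the column. Run after a level rename like the one earlier in this section, it would show "adeliae" as in the output below; on the untouched dataset the first level prints as "Adelie". This is a sketch that assumes palmerpenguins is installed:

```r
library(palmerpenguins)

penguins$species  # every value of the column, followed by its levels

# $ is shorthand for extracting a list element by name
identical(penguins$species, penguins[["species"]])  # TRUE
```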
## [1] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [8] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [15] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [22] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [29] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [36] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [43] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [50] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [57] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [64] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [71] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [78] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [85] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [92] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [99] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [106] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [113] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [120] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [127] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [134] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [141] adeliae adeliae adeliae adeliae adeliae adeliae adeliae
## [148] adeliae adeliae adeliae adeliae adeliae Gentoo Gentoo
## [155] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [162] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [169] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [176] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [183] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [190] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [197] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [204] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [211] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [218] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [225] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [232] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [239] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [246] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [253] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [260] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [267] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [274] Gentoo Gentoo Gentoo Chinstrap Chinstrap Chinstrap Chinstrap
## [281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [344] Chinstrap
## Levels: adeliae Chinstrap Gentoo
To access particular columns or rows of a data frame, we use indexing:
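Indexing uses square brackets with a row index and a column index. The particular calls below are our guess at what produced the two one-cell tibbles that follow. They assume palmerpenguins and tibble are installed; tibble is loaded so that [ keeps the result as a tibble rather than dropping it to a vector:

```r
library(palmerpenguins)
library(tibble)

penguins[1, 3]          # row 1, column 3, selected by position
penguins[1, "species"]  # row 1, column "species", selected by name

penguins[1:3, ]                    # first three rows, all columns
penguins[, c("species", "island")] # all rows, two columns
```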
## # A tibble: 1 x 1
## bill_length_mm
## <dbl>
## 1 39.1
## # A tibble: 1 x 1
## species
## <fct>
## 1 adeliae
A handy function is names, which you can use to get or to set data frame variable names:
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
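The names printed above come from using names() as a getter. The setter form can be sketched as follows; the copy peng is a hypothetical name of ours, and "Bill_Length" is chosen only because it matches the name that appears in the later output:

```r
library(palmerpenguins)

peng <- penguins  # work on a copy (hypothetical object name)

names(peng)                      # get the variable names
names(peng)[3] <- "Bill_Length"  # set the third name only
names(peng)
```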
Explain what this last line did.
A little example of tidy evaluation and piping to do the same thing (we’ll go into more detail on this later):
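A sketch of what such a pipeline could look like, assuming the dplyr package is installed and using its rename() verb with the %>% pipe (the variable new_names is ours):

```r
library(palmerpenguins)
library(dplyr)

new_names <- penguins %>%
  rename(Bill_Length = bill_length_mm) %>%  # syntax is new_name = old_name
  names()

new_names
```

Because the result is not assigned back to penguins, the original column name is left unchanged.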
## [1] "species" "island" "Bill_Length"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"