Day 2 - AM


Good morning, Jedis!


Schedule

- Review of Day 1
- Dates continued...
- Summarize your data
- Pipe together functions
- Join two tables by a shared column
- Calculate new columns
- Summarize for each group
- Save data


Porg review


The poggle of porgs has returned to help review the dplyr functions. Follow along by downloading the porg data from the URL below.

library(readr)
porgs <- read_csv("https://itep-r.netlify.com/data/porg_data.csv")























1 New mission!


The Wookies want our help. BB8 just received new data from the wookies suggesting there was a large magnetic storm just before Site 1 burned down on Endor. Sounds like we’re going to be STORM CHASERS! Mission accepted.


Get the dataset

While we’re relaxing and flying to Endor let’s get set up with the new data.

Endor setup

  1. Download the data.
  2. Create a new folder called data in your project folder.
  3. Move the downloaded Endor file to there.
  4. Create a new R script called endor.R


Now we can read in the data from the data folder with our favorite read_csv function.

library(readr)
library(dplyr)
library(janitor)

air_endor <- read_csv("data/air_endor.csv")

names(air_endor)

# Clean the column names
air_endor <- clean_names(air_endor)

names(air_endor)
## [1] "analyte"        "lab_code"       "units"          "start_run_date"
## [5] "site_name"      "long_site_name" "planet"         "result"


Note

When your project is open, RStudio sets the working directory to your project folder. When reading files from a folder outside of your project, you’ll need to use the full file path location.

For example, for a file on the X-drive: X://Programs/Air_Quality/Ozone_data.csv

Welcome to Endor!



Let’s get to know the Endor data.

Remember the ways?

Hint: summary(), glimpse(), nrow(), distinct()


glimpse(air_endor)
## Observations: 1,190
## Variables: 8
## $ analyte        <chr> "1-Methylnaphthalene", "1-Methylnaphthalene", "...
## $ lab_code       <chr> "HV-403", "HV-408", "HV-411", "HV-415", "HV-418...
## $ units          <chr> "ng/m3", "ng/m3", "ng/m3", "ng/m3", "ng/m3", "n...
## $ start_run_date <chr> "7/11/2016", "7/23/2016", "8/5/2016", "8/16/201...
## $ site_name      <chr> "battlesite1", "battlesite1", "battlesite1", "b...
## $ long_site_name <chr> "Endor_Tana_forestmoon_battlesite1", "Endor_Tan...
## $ planet         <chr> "Endor", "Endor", "Endor", "Endor", "Endor", "E...
## $ result         <dbl> 0.00291, 0.00000, 0.00000, 0.00608, 0.00000, 0....


distinct(air_endor, analyte)
## # A tibble: 27 x 1
##    analyte             
##    <chr>               
##  1 1-Methylnaphthalene 
##  2 2-Methylnaphthalene 
##  3 5-Methylchrysene    
##  4 Acenaphthene        
##  5 Anthanthrene        
##  6 Benz[a]anthracene   
##  7 Benzo[a]pyrene      
##  8 Benzo[c]fluorene    
##  9 Benzo[e]pyrene      
## 10 Benzo[g,h,i]perylene
## # ... with 17 more rows


Woah! That’s definitely more analytes than we need.

We only want to know about magnetic_field data. Let’s filter down to that analyte.


mag_endor <- filter(air_endor, analyte == "magnetic_field")

Boom! How many rows are left now?


Ok. According to the request we received, the Resistance is only interested in observations from year 2017. So let’s filter the data down to only dates for that year.

How do we filter on the date column?


With time series data it’s helpful for R to know which columns are dates. Dates come in many formats and the standard format in R is 2019-01-15, also referred to as Year-Month-Day format. We use the lubridate package to help format our dates.

2 Dates


Enter the lubridate package.


It’s about time! Lubridate makes working with dates much easier.

We can find how much time has elapsed, add or subtract days, and find seasonal and day of the week averages.



Convert text to a DATE

Function Order of date elements
mdy() Month-Day-Year :: 05-18-2019 or 05/18/2019
dmy() Day-Month-Year (Euro dates) :: 18-05-2019 or 18/05/2019
ymd() Year-Month-Day (science dates) :: 2019-05-18 or 2019/05/18
ymd_hm() Year-Month-Day Hour:Minutes :: 2019-05-18 8:35 AM
ymd_hms() Year-Month-Day Hour:Minutes:Seconds :: 2019-05-18 8:35:22 AM


Get date parts

Function Date element
year() Year
month() Month as 1,2,3; For Jan, Feb, Mar use label=TRUE
week() Week of the year
day() Day of the month
wday() Day of the week as 1,2,3; For Sun, Mon, Tue use label=TRUE
- Time -
hour() Hour of the day (24hr)
minute() Minutes
second() Seconds
tz() Time zone


Install lubridate

Run: install.packages("lubridate")

Clean the dates

Real world examples

Does your date column look like one of these? Here’s the lubridate function to tell R that the column is a date.

Format Function to use
“05/18/2019” mdy(date_column)
“May 18, 2019” mdy(date_column)
“05/18/2019 8:00 CDT” mdy_hm(date_column, tz = "US/Central")
“05/18/2019 11:05:32 PDT” mdy_hms(date_column, tz = "US/Pacific")


European dates

Format Function to use
“30.05.2019” dmy(date_column)
“30-05-2019” dmy(date_column)
“30/05/2019” dmy(date_column)


AQS formatted dates

Format Function to use
“20190518” ymd(sample_date)

Small problem


Oh no! BB8 got locked out of the system! You need the passcode before you can unlock the data.

There’s a cryptic note on the pad next to you… “Marty’s birthday”. Could that be the hint to the passcode???

We need to determine when Marty’s birthday is…time to get out our detective hats.

You ask the lead engineer:

So when is Marty’s birthday?

“I think it’s on a Tuesday this year”

Any more helpful clues? Anything?

“Hmmm… Let’s see, I think it was around the 22nd week of the year. Yep. That’s it.”


Explore!

After some more interviews you now have a list of dates that could be Marty’s birthday. Find the date that occurred on the Tuesday of the 22nd week of the year.

library(lubridate)

date1 <- "7/2/2019"

date2 <- "28/05/2019"

date3 <- "June 4th, 2019"

Step 1: Use mdy(date) or dmy(date) to convert the dates to the universal date format.


Are you sure you want to click this button? There’s no going back.

Show code

# Use mdy() or dmy() to format dates

#1
date1 <- mdy("7/2/2019")

#2
date2 <- dmy("28/05/2019")

#3
date3 <- mdy("June 4th, 2019")

Step 2: Find the day of the week using wday(date), and week of the year using week(date).

Show code

# Use wday() and week() to get date info

#1
date1 <- mdy("7/2/2019")
wday(date1, label = TRUE, abbr = FALSE)
week(date1)

#2
date2 <- dmy("28/05/2019")
wday(date2, label = TRUE, abbr = FALSE)
week(date2)

#3
date3 <- mdy("June 4th, 2019")
wday(date3, label = TRUE, abbr = FALSE)
week(date3)


Show answer

"28/05/2019" or May 28th, 2019

Congrats! BB8 successfully accessed the backup system. The data is back online!


Back on Endor!


Let’s use the mdy() function to turn the start_run_date column into a nicely formatted date.

library(lubridate)

mag_endor <- mutate(mag_endor, date = mdy(start_run_date),
                               year = year(date))


Ok, back to work.

Let’s filter to only the rows where the year equals 2017.


mag_endor <- filter(mag_endor, year == 2017)

Time series

library(ggplot2)

ggplot(mag_endor, aes(x = date, y = result)) +
  geom_line(size = 2, color = "tomato") +
  geom_point(size = 5, alpha = 0.5)  # alpha changes transparency

Looks like the measurements definitely picked up a signal towards the end of the year.


Mystery solved!

Aha! There really was a spike in the magnetic field in November. Hopefully, the Resistance will reward us handsomely for this information.


3 Calendar plots


Calendar plots are good for zooming in to see specific days when elevated observations may have occurred. Maybe we can pinpoint what day that magnetic storm may have occurred.

openair package

There is a calendar plot function in the package openair. Let’s install openair, load it, and use calendarplot() to make a calendar of the observations.


install.packages("openair")


library(openair)
calendarPlot(mag_endor,
             pollutant    = "result",
             statistic    = "mean",
             year         = 2017,
             annotate     = "date",
             digits       = 0,
             key.footer   = "nT",
             par.settings = list(fontsize       = list(text = 14),
                                 layout.heights = list(top.padding = 1)),
             main = "Magnetic Field Strength at Endor Battle Site 1")


Let’s zoom in and filter the calendar to only 3 months: October, November, December.

mag_endor <- mutate(mag_endor, month = month(date))
                        
mag_endor <- filter(mag_endor, month %in% c(10,11,12))

calendarPlot(mag_endor,
             pollutant    = "result",
             statistic    = "mean",
             year         = 2017,
             annotate     = "date",
             digits       = 0,
             key.footer   = "nT",
             par.settings = list(fontsize       = list(text = 14),
                                 layout.heights = list(top.padding = 1)),
             main = "Magnetic Field Strength at Endor Battle Site 1")


4 Pipe it together: A %>% B


Let the %>% guide you.

I wish I could type the name of the data less often.

You can with the pipe!

Use the %>% to chain functions together and make our scripts more streamlined.

Here are 2 ways the %>% is helpful:

#1: It eliminates the need for nested parentheses.

Say you wanted to take the sum of 3 numbers and then take the log and then round the final result.

round(log(sum(c(10, 20, 30, 50))))

The code doesn’t look much like the words we used to describe it. Let’s pipe it so we can read the code from left to right.

c(10, 20, 30, 50) %>% sum() %>% log() %>% round()


#2: We can combine many processing steps into one cohesive chunk.

Here are some of the functions we’ve applied to the scrap data so far:

scrap <- arrange(scrap, desc(price_per_pound))

scrap <- filter(scrap, origin != "All")

scrap <- mutate(scrap, 
                scrap_finder    = "BB8",
                measure_method  = "REM-24")

We can use the %>% operator to pipe or chain them together.

scrap <- scrap %>%
           arrange(desc(price_per_pound)) %>%
           filter(origin != "All") %>%
           mutate(scrap_finder    = "BB8",
                  measure_method  = "REM-24")

Explore!

Similar to above, use the %>% to combine the two Endor lines below.

# Filter to only magnetic_field measurements
mag_endor <- filter(air_endor, analyte == "magnetic_field")

# Add a date column to the filtered data
mag_endor <- mutate(mag_endor, date = mdy(start_run_date))


So where were we?

Right! We we’re enjoying our time on beautiful lush Endor. But aren’t we missing somebody…?

Finn needs us!

That’s enough scuttlebutting on Endor, Finn needs us back on Jakku. It turns out we forgot to pick-up Finn when we left. Now he’s being held ransom by Junk Boss Plutt. We’ll need to act fast to get to him before the Empire does. Blast off!



Update from BB8!

BB8 was busy on our flight back to Jakku, and recovered a full set of scrap records from the notorious Unkar Plutt. Let’s take a look.

library(readr)
library(dplyr)

# Read in the full scrap database
scrap <- read_csv("https://itep-r.netlify.com/data/starwars_scrap_jakku_full.csv")

5 Jakku re-visited


Okay, so we’re back on this ol’ dust bucket. Let’s try not to forget anyone this time. We’re quickly running out of friends on this planet.

A scrappy ransom

Mr. Baddy Plutt is demanding 10,000 items of scrap for Finn. Sounds expensive, but lucky for us he didn’t clarify the exact items. Let’s find the scrap that weighs the least per shipment and try to make this transaction as light as possible.

Take a look at our NEW scrap data and see if we have the weight of all the items.

# What unit types are in the data?
unique(scrap$units)
## [1] "Cubic Yards" "Items"       "Tons"


Or return results as a data frame

distinct(scrap, units)
## # A tibble: 3 x 1
##   units      
##   <chr>      
## 1 Cubic Yards
## 2 Items      
## 3 Tons


Hmmm…. So how much does a cubic yard of Hull Panel weigh?

A lot? 32? Maybe…

I think we’re going to need some more data.


“Hey BB8!”

“Do your magic data finding thing.”

Weight conversion

It took a while, but with a few droid bribes BB8 was able to track down a Weight conversion table from his droid buddies. Our current data shows the total cubic yards for some scrap shipments, but not how much the shipment weighs.

Read the weight conversion table

# The data's URL
convert_url <- "https://rtrain.netlify.com/data/conversion_table.csv"

# Read the conversion data
convert <- read_csv(convert_url)

head(convert, 3)
## # A tibble: 3 x 3
##   item               units       pounds_per_unit
##   <chr>              <chr>                 <dbl>
## 1 Bulkhead           Cubic Yards             321
## 2 Hull Panel         Cubic Yards             641
## 3 Repulsorlift array Cubic Yards             467


Stars! A cubic yard of Hull Panel weighs 641 lbs. That’s what I thought!

Let’s join this new conversion table to the scrap data to make our calculations easier. To do that we need to make a new friend.

Say Hello to left_join()!


6 Join tables with left_join()


Join 2 tables

left_join(scrap, convert, by = c("columns to join by"))

left_join() works like a zipper to combine 2 tables based on one or more variables. It’s called “left”-join because the entire table on the left side is retained.

Anything that matches from the right table gets to join the party, but any rows that don’t have a matching ID will be ignored.


Adding porg names

Remember our porg friends? How rude of us not to share their names. Wups!

Here’s a table of their names.

Hey now! Who’s who? Let’s join their names to the rest of the data.


The joined result:


BONUS More joining

Star Wars characters

left_join() is used to add columns to your table by pulling them from another table. Imagine you have the 2 Star Wars tables below. One table includes character names and heights; the second has names and homeworlds. Since both tables share a common name column, we can join the tables together using the name columns as the joining key.


starwars_heights

starwars_name height
Luke Skywalker 172
C-3PO 167
Darth Vader 202
Leia Organa 150
Greedo 246

starwars_homeworlds

starwars_name homeworld
Luke Skywalker Tatooine
C-3PO Tatooine
Darth Vader Tatooine
Leia Organa Alderaan
Ayla Secura Ryloth


Uh oh! There’s no “Ayla Secura” in the height table and there’s no “Greedo” in the homeworld table. Can we still join the tables? Run the code below to see what happens.


library(dplyr)
   
# Create new tables 
starwars_heights <- data_frame(starwars_name = c("Luke Skywalker", "C-3PO", "Darth Vader", "Leia Organa", "Greedo"), 
                               height = c(172, 167, 202, 150, 246))      

starwars_homeworlds <- data_frame(starwars_name = c("Luke Skywalker", "C-3PO", "Darth Vader", "Leia Organa", "Ayla Secura"), 
                                  homeworld = c("Tatooine", "Tatooine", "Tatooine", "Alderaan", "Ryloth"))

# Join the tables together by starwars_name
## Tell left_join which columns to use as the key with:  
##     by = c("key_left" = "key_right")
height_and_homeworld <- left_join(starwars_heights, starwars_homeworlds, 
                                  by = c("starwars_name" = "starwars_name"))

height_and_homeworld


starwars_name height homeworld
Luke Skywalker 172 Tatooine
C-3PO 167 Tatooine
Darth Vader 202 Tatooine
Leia Organa 150 Alderaan
Greedo 246 NA


Did it work?

When left_join adds the homeworlds column to the starwars_heights table it only adds a value for the characters when the tables have a matching character name. When R couldn’t find “Greedo” in the homeworld table, the Star Wars character’s homeworld was recorded as missing or NA.


Multiple records


Now imagine the table starwars_homeworld has two records for C-3PO, each with a different homeworld listed.

What will happen when you join the two tables?


starwars_heights

starwars_name height
Luke Skywalker 172
C-3PO 167
Darth Vader 202
Leia Organa 150
Greedo 246

starwars_homeworlds

starwars_name homeworld
Luke Skywalker Tatooine
C-3PO Tatooine
C-3PO Tantive IV
Darth Vader Tatooine
Leia Organa Alderaan
Ayla Secura Ryloth


When you run the code below you’ll see that left_join is very thorough and adds each additional homeworld it finds for C-3PO as a new row in the joined table.

# Create new tables 
starwars_heights <- data_frame(starwars_name = c("Luke Skywalker", "C-3PO", "Darth Vader", "Leia Organa", "Greedo"), 
                               height  = c(172, 167, 202, 150, 173))


starwars_homeworlds <- data_frame(starwars_name = c("Luke Skywalker", "C-3PO", "C-3PO", "Darth Vader", "Leia Organa", "Ayla Secura"), 
                                  homeworld = c("Tatooine", "Tatooine", "Tantive IV", "Tatooine", "Alderaan", "Ryloth"))

# Join the tables together by Star Wars character name
height_and_homeworld <- left_join(starwars_heights, starwars_homeworlds)

# Check number of rows
nrow(height_and_homeworld)

height_and_homeworld


starwars_name height homeworld
Luke Skywalker 172 Tatooine
C-3PO 167 Tatooine
C-3PO 167 Tantive IV
Darth Vader 202 Tatooine
Leia Organa 150 Alderaan
Greedo 173 NA


This results in a table with one extra row than we started with in our heights table. So growing table sizes are a sign of duplicate records when using left_join().

In practice, when you see this you’ll want to investigate why one of your tables has duplicate entries, especially if the observation for the two rows is different, as it was for C-3PO’s homeword.

Are there really two different Star Wars characters named C-3PO, or maybe someone made two different guesses about the droid’s homeworld, or maybe the data simply has a mistake.


Back to the scrap yard


Let’s apply our new left_join() skills to the scrap data.

First, let’s re-read the full scrap data.

library(readr)
library(dplyr)

# Read in the larger scrap database
scrap <- read_csv("https://itep-r.netlify.com/data/starwars_scrap_jakku_full.csv")

# what units types are in the data? 
distinct(scrap, units)
## # A tibble: 3 x 1
##   units      
##   <chr>      
## 1 Cubic Yards
## 2 Items      
## 3 Tons

Join the conversion table to scrap

Look at the tables. What columns in both tables do we want to join by?

names(scrap)

Let’s join by item and units.

# Join the scrap to the conversion table
scrap <- left_join(scrap, convert, 
                   by = c("item"  = "item", 
                          "units" = "units"))


Want to save on typing?

left_join() will automatically search for matching columns if you don’t use the by= argument. So if you know 2 tables share a column name you don’t have to specify how to join them. The code below does the same as above.

scrap <- left_join(scrap, convert)

head(scrap, 4)
## # A tibble: 4 x 8
##   receipt_date item  origin destination amount units price_per_pound
##   <chr>        <chr> <chr>  <chr>        <dbl> <chr>           <dbl>
## 1 4/1/2013     Bulk~ Crate~ Raiders       4017 Cubi~           1005.
## 2 4/2/2013     Star~ Outsk~ Trade cara~   1249 Cubi~           1129.
## 3 4/3/2013     Star~ Outsk~ Niima Outp~   4434 Cubi~           1129.
## 4 4/4/2013     Hull~ Crate~ Raiders        286 Cubi~            843.
## # ... with 1 more variable: pounds_per_unit <dbl>

Help!

For more details, you can type ?left_join to see all the arguments and options.

Total pounds per shipment


Let’s mutate()!

Now that we have pounds per unit we can use mutate to calculate the total pounds for each shipment.

Fill in the blank

scrap <- scrap %>% 
         mutate(total_pounds = amount *  _____________ )
Show code

scrap <- scrap %>% 
         mutate(total_pounds = amount * pounds_per_unit)

Total price per shipment

Explore!

Total price

We need to do some serious multiplication. We now have the total amount shipped in pounds, and the price per pound, but we want to know the total price for each transaction.

How do we calculate that?

# Calculate the total price for each shipment
scrap <- scrap %>% 
         mutate(credits = ________ * ________ )
         

Total price

We need to do some serious multiplication. We now have the total amount shipped in pounds, and the price per pound, but we want to know the total price for each transaction.

How do we calculate that?

# Calculate the total price for each shipment
scrap <- scrap %>% 
         mutate(credits = total_pounds  * ________ )

Total price

We need to do some serious multiplication. We now have the total amount shipped in pounds, and the price per pound, but we want to know the total price for each transaction.

How do we calculate that?

# Calculate the total price for each shipment
scrap <- scrap %>% 
         mutate(credits = total_pounds * price_per_pound)
         


Price per item

Great! Let’s add one last column. We can divide the shipment’s credits by the amount of items to get the price_per_unit.

# Calculate the price per unit
scrap <- scrap %>% 
         mutate(price_per_unit = credits / amount)


Note

Analysts often get asked questions:

  • What’s the highest number?
  • What’s the lowest number?
  • What was the average tons of scrap from Cratertown last year?
  • Which town’s scrap was worth the most credits?


Return to Homebase