Day 1 - PM


Data exploration

1 dplyr



You’ve unlocked a new package!


dplyr is the go-to tool for exploring, re-arranging, and summarizing data.

Use install.packages("dplyr") to add dplyr to your library.

Then try loading the package from your library to test if it was installed.

library(dplyr)

Get to know your data frame

Use these functions to describe your data.

Data frame info

Function Information
names(scrap) column names
nrow(...) number of rows
ncol(...) number of columns
summary(...) summary of all column values (ex. max, mean, median)
glimpse(...) column names + a glimpse of first values (requires dplyr package)

2 glimpse() & summary()


glimpse() tells you what type of data and how much data you have.

Let’s read the data into R and give these a whirl.

library(readr)
library(dplyr)

scrap <- read_csv("https://itep-r.netlify.com/data/starwars_scrap_jakku.csv")

# View your whole dataframe and the types of data it contains
glimpse(scrap)
## Observations: 1,132
## Variables: 7
## $ receipt_date    <chr> "4/1/2013", "4/2/2013", "4/3/2013", "4/4/2013"...
## $ item            <chr> "Flight recorder", "Proximity sensor", "Vitus-...
## $ origin          <chr> "Outskirts", "Outskirts", "Reestki", "Tuanul",...
## $ destination     <chr> "Niima Outpost", "Raiders", "Raiders", "Raider...
## $ amount          <dbl> 887, 7081, 4901, 707, 107, 32109, 862, 13944, ...
## $ units           <chr> "Tons", "Tons", "Tons", "Tons", "Tons", "Tons"...
## $ price_per_pound <dbl> 590.93, 1229.03, 225.54, 145.27, 188.28, 1229....


summary() gives you a quick report of your numeric data.

# Summary gives us the means, mins, and maxima 
summary(scrap)
##  receipt_date           item              origin         
##  Length:1132        Length:1132        Length:1132       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  destination            amount           units          
##  Length:1132        Min.   :     65   Length:1132       
##  Class :character   1st Qu.:   1544   Class :character  
##  Mode  :character   Median :   4099   Mode  :character  
##                     Mean   :  18751                     
##                     3rd Qu.:   7475                     
##                     Max.   :2971601                     
##                     NA's   :910                         
##  price_per_pound   
##  Min.   :   145.3  
##  1st Qu.:   259.6  
##  Median :   790.7  
##  Mean   :  3811.7  
##  3rd Qu.:  1496.7  
##  Max.   :579215.3  
##  NA's   :910


Explore!

# Try some more on your own
nrow()

ncol()

names()

Your analysis toolbox


dplyr is the hero for most analysis tasks. With these six functions you can accomplish just about anything you want with your data.


Function Job
select() Select individual columns to drop or keep
arrange() Sort a table top-to-bottom based on the values of a column
filter() Keep only a subset of rows depending on the values of a column
mutate() Add new columns or update existing columns
summarize() Calculate a single summary for an entire table
group_by() Sort data into groups based on the values of a column

3 Porg tables


A poggle of porgs volunteered to help us demo the dplyr functions.





















4 select()


Use the select() function to:

  • Drop columns you no longer need
  • Pull-out a few columns to create a new table
  • Rearrange or change the order of columns

Drop columns with a minus sign: -col_name

library(dplyr)
library(readr)

scrap <- read_csv("https://itep-r.netlify.com/data/starwars_scrap_jakku.csv")

# Drop the destination column
select(scrap, -destination)
## # A tibble: 1,132 x 6
##    receipt_date item                  origin   amount units price_per_pound
##    <chr>        <chr>                 <chr>     <dbl> <chr>           <dbl>
##  1 4/1/2013     Flight recorder       Outskir~    887 Tons             591.
##  2 4/2/2013     Proximity sensor      Outskir~   7081 Tons            1229.
##  3 4/3/2013     Vitus-Series Attitud~ Reestki    4901 Tons             226.
##  4 4/4/2013     Aural sensor          Tuanul      707 Tons             145.
##  5 4/5/2013     Electromagnetic disc~ Tuanul      107 Tons             188.
##  6 4/6/2013     Proximity sensor      Tuanul    32109 Tons            1229.
##  7 4/7/2013     Hyperdrive motivator  Tuanul      862 Tons            1485.
##  8 4/8/2013     Landing jet           Reestki   13944 Tons            1497.
##  9 4/9/2013     Electromagnetic disc~ Cratert~   7788 Tons             188.
## 10 4/10/2013    Sublight engine       Outskir~  10642 Tons            7211.
## # ... with 1,122 more rows

Drop multiple columns

Use -c(col_name1, col_name2) or -col_name1, -col_name2 to drop multiple columns.

# Drop the destination and units columns
select(scrap, -c(destination, units, amount))
## # A tibble: 1,132 x 4
##    receipt_date item                             origin     price_per_pound
##    <chr>        <chr>                            <chr>                <dbl>
##  1 4/1/2013     Flight recorder                  Outskirts             591.
##  2 4/2/2013     Proximity sensor                 Outskirts            1229.
##  3 4/3/2013     Vitus-Series Attitude Thrusters  Reestki               226.
##  4 4/4/2013     Aural sensor                     Tuanul                145.
##  5 4/5/2013     Electromagnetic discharge filter Tuanul                188.
##  6 4/6/2013     Proximity sensor                 Tuanul               1229.
##  7 4/7/2013     Hyperdrive motivator             Tuanul               1485.
##  8 4/8/2013     Landing jet                      Reestki              1497.
##  9 4/9/2013     Electromagnetic discharge filter Cratertown            188.
## 10 4/10/2013    Sublight engine                  Outskirts            7211.
## # ... with 1,122 more rows

Keep only 3 columns

# Keep item, amount and price_per_pound columns
select(scrap, item, amount, price_per_pound)
## # A tibble: 1,132 x 3
##    item                             amount price_per_pound
##    <chr>                             <dbl>           <dbl>
##  1 Flight recorder                     887            591.
##  2 Proximity sensor                   7081           1229.
##  3 Vitus-Series Attitude Thrusters    4901            226.
##  4 Aural sensor                        707            145.
##  5 Electromagnetic discharge filter    107            188.
##  6 Proximity sensor                  32109           1229.
##  7 Hyperdrive motivator                862           1485.
##  8 Landing jet                       13944           1497.
##  9 Electromagnetic discharge filter   7788            188.
## 10 Sublight engine                   10642           7211.
## # ... with 1,122 more rows

everything() else the same

select() also works to change the order of columns. The code below puts the item column first and then moves the units and amount columns directly after item.

We then keep evertyhing() else the same.

# Make the `item`, `units`, and `amount` columns the first three columns
# Leave `everything()` else in the same order
select(scrap, item, units, amount, everything())
## # A tibble: 1,132 x 7
##    item       units amount receipt_date origin  destination price_per_pound
##    <chr>      <chr>  <dbl> <chr>        <chr>   <chr>                 <dbl>
##  1 Flight re~ Tons     887 4/1/2013     Outski~ Niima Outp~            591.
##  2 Proximity~ Tons    7081 4/2/2013     Outski~ Raiders               1229.
##  3 Vitus-Ser~ Tons    4901 4/3/2013     Reestki Raiders                226.
##  4 Aural sen~ Tons     707 4/4/2013     Tuanul  Raiders                145.
##  5 Electroma~ Tons     107 4/5/2013     Tuanul  Niima Outp~            188.
##  6 Proximity~ Tons   32109 4/6/2013     Tuanul  Trade cara~           1229.
##  7 Hyperdriv~ Tons     862 4/7/2013     Tuanul  Trade cara~           1485.
##  8 Landing j~ Tons   13944 4/8/2013     Reestki Niima Outp~           1497.
##  9 Electroma~ Tons    7788 4/9/2013     Crater~ Raiders                188.
## 10 Sublight ~ Tons   10642 4/10/2013    Outski~ Niima Outp~           7211.
## # ... with 1,122 more rows

5 Sort with arrange()


Rey wants to know what the highest priced items are. Use arrange() to find the highest priced scrap item and see which origins might have a lot of them.

# Arrange scrap items by price
scrap <- arrange(scrap, price_per_pound)

# View the top 6 rows
head(scrap)
## # A tibble: 6 x 7
##   receipt_date item      origin    destination amount units price_per_pound
##   <chr>        <chr>     <chr>     <chr>        <dbl> <chr>           <dbl>
## 1 4/4/2013     Aural se~ Tuanul    Raiders        707 Tons             145.
## 2 5/22/2013    Aural se~ Outskirts Niima Outp~   3005 Tons             145.
## 3 5/23/2013    Aural se~ Tuanul    Raiders       6204 Tons             145.
## 4 6/4/2013     Aural se~ Tuanul    Raiders       3120 Tons             145.
## 5 6/10/2013    Aural se~ Blowback~ Niima Outp~   2312 Tons             145.
## 6 6/20/2013    Aural se~ Outskirts Trade cara~   6272 Tons             145.


What, only 145 credits!? That’s not much at all. Oh wait…that’s the smallest one on top.

Big things first: desc()

To arrange a column in descending order with the biggest numbers on top, we use: desc(price_per_pound)

# Put most expensive items on top
scrap <- arrange(scrap, desc(price_per_pound))

# View the top 8 rows
head(scrap, 8)
## # A tibble: 8 x 7
##   receipt_date item      origin    destination amount units price_per_pound
##   <chr>        <chr>     <chr>     <chr>        <dbl> <chr>           <dbl>
## 1 12/31/2016   Total     All       All         2.97e6 Tons          579215.
## 2 4/10/2013    Sublight~ Outskirts Niima Outp~ 1.06e4 Tons            7211.
## 3 4/14/2013    Sublight~ Outskirts Raiders     2.38e3 Tons            7211.
## 4 4/15/2013    Sublight~ Craterto~ Raiders     6.31e3 Tons            7211.
## 5 4/16/2013    Sublight~ Tuanul    Trade cara~ 3.98e3 Tons            7211.
## 6 5/14/2013    Sublight~ Craterto~ Raiders     2.99e2 Tons            7211.
## 7 6/14/2013    Sublight~ Blowback~ Raiders     8.58e3 Tons            7211.
## 8 8/6/2013     Sublight~ Craterto~ Raiders     1.77e3 Tons            7211.


Explore!

Try arranging by more than one column, such as price_per_pound and amount. What happens?

Hint: You can view the entire table by clicking on it in the upper-right Environment tab.

Spock-tip!

When you save an arranged data table it maintains its order. This is perfect for sending people a quick Top 10 list of pollutants or sites.

R BINGO!


We don’t have free spaces. Sorry, we are very mean people.

  1. We’re going to call-out the bingo words using an R function.
# Get pretty colors
install.packages("viridis")
library(viridis)
library(ggplot2)

# List of all the words
bingo_words <- c("median()", "nrow()", "glimpse()", "sum()", "head()", "tail()", "arrange()", "write_csv()", "geom_col()", "filter()", "ncol()", "sd()", "summarise()", "quantile()", "install.packages()", "geom_point()", "unique()", "select()", "mean()", "min()", "left_join()", "read_csv()", "nth()", "ggplot()", "library()", "n()")

# Shuffle the words randomly
bingo_words <- sample(bingo_words)

# Set the draw number
n <- 1

# Loop thru each word
for(n in 1:length(bingo_words)) {

  print(n)
  
  # Plot the word
  call_fun <- ggplot(data.frame(x = 1, y = 1), aes(x = 1, y = 1)) + 
        geom_point(color = sample(viridis_pal()(30), 1), size = 177) + 
        geom_label(label = bingo_words[n], size = 12) +
        labs(x = NULL, y = NULL) +
        theme_void()
  
  print(call_fun)
  
  readline(prompt="Press Enter...")
}

6 filter()


The filter() function creates a subset of the data based on the value of one or more columns. Let’s take a look at the records with the origin "All".

filter(scrap, origin == "All")
## # A tibble: 1 x 7
##   receipt_date item  origin destination  amount units price_per_pound
##   <chr>        <chr> <chr>  <chr>         <dbl> <chr>           <dbl>
## 1 12/31/2016   Total All    All         2971601 Tons          579215.


Spock-tip!

We use a == (double equals sign) for comparing values. A == makes the comparison “is it equal to?” and returns a True or False answer. So the code above will returns only the rows where the condition origin == "All" is TRUE.

A single equals sign = is used within functions to set options, for example read_csv(file = "starwars_scrap_jakku.csv"). It’s not a big deal if you mix them up the first time. R is often helpful and will let you know which one is needed.

Comparing values


We use a variety of comparisons when processing data. Sometimes we only want concentrations above a certain level, or days below a given temperature, or sites that have missing observations.

We use the Menu of comparisons below to find the data we want.

Menu of comparisons

Symbol Comparison
> greater than
>= greater than or equal to
< less than
<= less than or equal to
== equal to
!= NOT equal to
%in% value is in a list: X %in% c(1,3,7)
is.na(...) is the value missing?
str_detect(col_name, "word") “word” appears in text?


Explore!

Try comparing some things in the console to see if you get what you’d expect. R doesn’t always think like we do.

4 != 5

4 == 4

4 < 3

4 > c(1, 3, 5)

5 == c(1, 3, 5)

5 %in% c(1, 3, 5)

2 %in% c(1, 3, 5)

2 == NA

6.1 Dropping rows


Let’s look at the data without the All category. Look in the comparison table above to find the NOT operator. We’re going to filter our data to keep only the origins that are NOT equal to “All”.

scrap <- filter(scrap, origin != "All")

We can arrange the data in ascending order by item to confirm the “All” category is gone.

# Arrange data
scrap <- arrange(scrap, item)

head(scrap)
## # A tibble: 6 x 7
##   receipt_date item      origin    destination amount units price_per_pound
##   <chr>        <chr>     <chr>     <chr>        <dbl> <chr>           <dbl>
## 1 4/4/2013     Aural se~ Tuanul    Raiders        707 Tons             145.
## 2 5/22/2013    Aural se~ Outskirts Niima Outp~   3005 Tons             145.
## 3 5/23/2013    Aural se~ Tuanul    Raiders       6204 Tons             145.
## 4 6/4/2013     Aural se~ Tuanul    Raiders       3120 Tons             145.
## 5 6/10/2013    Aural se~ Blowback~ Niima Outp~   2312 Tons             145.
## 6 6/20/2013    Aural se~ Outskirts Trade cara~   6272 Tons             145.


Now let’s take another look at that bar chart. Is there anything else that is less than perfect with our data?

library(ggplot2)

ggplot(scrap, aes(x = origin, y = amount)) + geom_col()

Multiple filters


We can add multiple comparisons to filter() to further restrict the data we pull from a larger data set. Only the records that pass the conditions of all the comparisons will be pulled into the new data frame.

The code below filters the data to only scrap records with an origin of Outskirts AND a destination of Niima Outpost.

outskirts_to_niima <- filter(scrap, 
                             origin        == "Outskirts", 
                             destination   == "Niima Outpost")


Let’s calculate some new columns to help focus Rey’s scavenging work.

7 mutate()


mutate() can edit existing columns in a data frame or add new values that are calculated from existing columns.

Add a column

First, let’s add a column with our names. That way Rey can thank us personally when her ship is finally up and running.

# Add your name as a column
scrap <- mutate(scrap, scrap_finder = "BB8")

Add several columns

Let’s also add a new column to document the data measurement method.

#  Add your name as a column and
#  some information about the method
scrap <- mutate(scrap, 
                scrap_finder    = "BB8",
                measure_method  = "REM-24")

## REM = Republic Equivalent Method

Change a column


Remember how the units of Tons was written two ways: “TONS” and “Tons”? We can use mutate() together with tolower() to make sure all of the scrap units are written in lower case. This will help prevent them from getting grouped separately in our plots and summaries.

# Set units to all lower-case
scrap <- mutate(scrap, units = tolower(units))

# toupper() changes all of the letters to upper-case.

Add calculated columns

In our work we often use mutate to convert units for measurements. In this case, let’s estimate the total pounds for the scrap items that are reported in tons.

Tons to Pounds conversion

We can use mutate() to convert the amount column to pounds. Multiply the amount column by 2000 to get new values for a column named amount_lbs.

scrap_pounds <- mutate(scrap, amount_lbs = amount * 2000)

Final stretch!


We now have all the tools we need.

To get Rey’s ship working we need to track down more of the scrap item: Ion engine.

Step 1:

  • filter the data to only Ion engine


Show code

# Grab only the items named "Ion engine"
scrap_pounds <- filter(scrap_pounds, item == "Ion engine")




Next arrange the data in descending order of pounds so Rey knows where the highest amount of Ion engine scrap comes from.

Step 2:

  • arrange in descending order by amount_lbs


Show code

# Arrange data
scrap_pounds <- arrange(scrap_pounds, desc(amount_lbs))

# Return the origin of the highest amount_lbs of scrap
head(scrap_pounds, 1)

# Plot the total amount_lbs by origin
ggplot(scrap_pounds, aes(x = origin, y = amount_lbs)) + 
  geom_col()

Complete the mission!

For the item Ion engine, which origin has the most amount_lbs?

Tuanul
Cratertown
Outskirts
Reestki


Show solution

Cratertown

Yes!! You receive the ship parts to repair Rey’s Millennium Falcon. Onward!

First mission complete!

Nice work! Rey got a great deal on her engines and even traded in some scrap for spending cash. You’ve got the music in you, oh, oh, oh… Sorry.


Time to get off this dusty planet, we’re flying to Endor!


8 Save our data


You can’t break your original dataset if you name your edited data something new. Let’s use the readr package to save our new CSV with the tons converted to pounds.

# Save data as a CSV file 
write_csv(scrap_pounds, "scrap_in_pounds.csv")

Where did R save the file?

Hint: You can look in the lower right [Files] tab.


Let’s create a new results/ folder to keep our processed data separate from any raw data we receive. Now we can add results/ to our file path when we save the file.

# Save data as a CSV file to results folder
write_csv(scrap_pounds, "results/scrap_in_pounds.csv")


9 Plots with ggplot2


Plot the data, Plot the data, Plot the data


The ggplot() sandwich

A ggplot has 3 ingredients.

1. The base plot

library(ggplot2)
ggplot(scrap)


we load version 2 of the package library(ggplot2), but the function to make the plot is only ggplot(). No 2. Sorry.

2. The the X, Y aesthetics

The aesthetics assign the columns from the data that you want to use in the chart. This is where you set the X-Y variables that determine the dimensions of the plot.

ggplot(scrap, aes(x = origin, y = amount)) 

3. The layers or geometries

ggplot(scrap, aes(x = origin, y = amount)) + geom_col()


Colors


Now let’s change the fill color to match the origin.

ggplot(scrap, aes(x = origin, y = amount, fill = origin)) +
  geom_col() 

Explore!

Try making a column plot showing the total amount of scrap for each destination or for each item.

ggplot(scrap, aes(x = destination, y = amount )) + geom_col()


10 Shutdown complete


When you close R for the first time you’ll see some options about saving your workspace, history, and other files. In general, we advise against saving these files. This will help RStudio open a fresh and clean environment every time you start it. And it’s easy enough to re-run your script next time you open your project.

Follow these steps to set these options permanently, and not have to see the pop-up windows when you close R.

Turn off Save Workspace

  1. Go to Tools > Global Options.... on the top RStudio navigation bar
  2. On the first screen:
    • [ ] Uncheck Restore .Rdata into workspace at startup
    • Set Save workspace to .RData on exit to [“Never?].
    • [ ] Uncheck Always save history


REVIEW: Functions & arguments

Functions perform steps based on inputs called arguments and usually return an output object. There are functions in R that are really complex but most boil down to the same general setup:

new_output <- function(argument_input1, argument_input2)

You can make your own functions in R and name them almost anything you like, even my_amazing_starwars_function().



You can think of a function like a plan for making Clone Troopers.


create_clones(host       = "Jango Fett", 
              n_troopers = 2000)


The function above creates Clone Troopers based on two arguments: the host and n_troopers. When we have more than one argument, we use a comma to separate them. With some luck, the function will successfully return a new object - a group of 2,000 Clone Troopers.

The sum() function

We can use the sum() function to find the sum age of our Star Wars characters.

starwars_ages  <- c(19,19,25)

# Call the sum function with starwars_ages as input
ages_sum <- sum(starwars_ages) # Assigns the output to starwars_ages_sum

# Print the starwars_ages_sum value to the console
ages_sum
## [1] 63


The sum() function takes the starwars_ages vector as input, performs a calculation, and returns a number. Note that we assigned the output to the name ages_sum.

If we don’t assign the output it will be printed to the console and not saved.

# Alternative without assigning output
sum(starwars_ages) 
## [1] 63


Note

The original starwars_ages vector has not changed. Each function has its own “environment” and its calculations happen inside a bubble. In general, what happens inside a function won’t change your objects outside of the function.

starwars_ages
## [1] 19 19 25



New data -> New project!


It’s time for some Earth data. Use your new R skills to explore some data from closer to home.

Set up your project

  • Open a new project
  • Open a new R script
  • Create a data folder in your project directory
  • Copy your data to that folder

Begin your analysis

1. Read data into R

library(readr)
library(janitor)

# Read a CSV file
air_data <- read_csv("data/My-data.csv")


# Have an EXCEL file?
## You can use read_excel() from the readxl package
install.packages(readxl)

library(readxl)

# Read an EXCEL file
air_data <- read_excel("data/My-data.xlsx")

2. Clean the column names

air_data <- clean_names(air_data)

3. Get to know your data

Hint: summary(), glimpse(), nrow(), names()

4. Plot the data

library(ggplot2)

# Remember the ggplot sandwich!
ggplot(air_data, aes(x = TEMP_F, y = OZONE)) +
    geom_point(aes(color = site_name))

Keep exploring…


Return to Homebase