summarize() allows you to apply a summary function like
median() to a column and collapse the data down to a single row. To dig into
summarize you’ll want to learn some more summary functions like
sum() to find the total credits from all the scrap.
summarize(scrap, total_credits = sum(credits))
mean() to calculate the mean
price_per_pound in the scrap report.
summarize(scrap, mean_price = mean(price_per_pound, na.rm = T))
What’s the average of missing data? I don’t know.
Did you see the
na.rm = TRUE inside the
mean() function. This tells R to ignore empty cells or missing values that show up in R as
NA. If you leave
na.rm out, the mean function will return ‘NA’ if it finds a missing value anywhere in the data.
Use summarize to calculate the median price_per_pound in the scrap report.
summarize(scrap, median_price = median(price_per_pound, na.rm = T))
Use summarize to calculate the maximum price per pound any scrapper got for their scrap.
summarize(scrap, max_price = max(price_per_pound, na.rm = T))
Use summarize to calculate the minimum price per pound any scrapper got for their scrap.
summarize(scrap, min_price = min(price_per_pound, na.rm = T))
What is the standard deviation of the credits?
summarize(scrap, stdev_credits = sd(credits))
Quantiles are useful for finding the upper or lower range of a column. Use the
quantile() function to find the the 5th and 95th quantile of the prices.
summarize(scrap, price_5th_pctile = quantile(price_per_pound, 0.05, na.rm = T), price_95th_pctile = quantile(price_per_pound, 0.95))
na.rm = T to
n() stands for count.
Use summarize and
n() to count the number of reported scrap records.
summarize(scrap, scrap_records = n())
Create a summary of the scrap data that includes 3 of the summary functions above. The following is one example.
summary <- summarize(scrap, max_credits = __________, weight_90th_pct = quantile(Weight, 0.90), count_records = __________,
Use summarize and
n() to count the number of reported scrap records going to
niima_scrap <- filter(scrap, destination == "Niima Outpost") niima_scrap <- summarize(niima_scrap, scrap_records = n())
What if we wanted to count the number for each destination?
That sounds like a whole lot of summarizing.
It’d be nice if we could easily find the mean for every city. Then we could summarize once and move on.
Who’s selling goods for cheap? Use
group_by with the column Origin, and then use
summarize to find the
mean(price_per_pound) at each Origin City.
scrap_summary <- group_by(scrap, origin) %>% summarize(mean_price = mean(price_per_pound, na.rm = T))
You can round the prices to a certain number of digits using the
round() function. We can finish by adding the
arrange() function to sort the table by our new column.
scrap_means <- group_by(scrap, origin) %>% summarize(mean_price = mean(price_per_pound, na.rm = T), mean_price_round = round(mean_price, digits = 2)) %>% arrange(mean_price_round) %>% ungroup()
round() function in R does not automatically round values ending in 5 up. Instead it uses scientific rounding, which rounds values ending in 5 to the nearest even number. So 2.5 rounded to the nearest whole number rounds down to 2, and 3.5 rounded to the nearest whole number rounds up 4.
Who’s making lots of transactions? Try using
group_by with the column origin and then
summarize to count the number of scrap records at each city.
scrap_counts <- group_by(scrap, origin) %>% summarize(origin_count = n()) %>% ungroup()
ungroup() is good practice. This prevents your data from staying grouped after the summarizing has been completed.
3 Save files
Let’s save the mean price summary table we created to a CSV. That way we can transfer it to a droid courier for delivery to Rey. To save a data frame we can use the
write_csv() function from our favorite
# Write the file to your results folder write_csv(scrap_means, "results/prices_by_origin.csv")
By default, when saving R will overwrite a file if the file already exists in the same folder. It will not ask for confirmation. To be safe, save processed data to a new folder called
results/ and not to your raw
We can bring back
mutate to add a column based on the grouped values in a data set. For example, you may want to add a column showing the mean price by origin to the whole table, but still keep all of the records. This is a good way to add values to the table to serve as a reference point.
How does the price of Item X compare to the average price?
When you combine
mutate the new column will be calculated based on the values within each group. Let’s group by
origin to find the
mean() price per pound at each origin.
scrap <- group_by(scrap, origin) %>% mutate(origin_mean_price = mean(price_per_pound, na.rm = T)) %>% ungroup()
Star Wars edition
Are you the best Jedi detective out there? Let’s play a game to find out.
Guess what else comes with the
dplyr package? A Star Wars data set.
Open the data set:
- Load the
dplyrpackage from your
- Pull the Star Wars dataset into your environment.
library(dplyr) people <- starwars
- You have a top secret identity.
- Scroll through the Star Wars dataset and find a character you find interesting.
- Or run
sample_n(starwars_data, 1)to choose one at random.
- Or run
- Keep it hidden! Don’t show your neighbor the character you chose.
- Take turns asking each other questions about your partner’s Star Wars character.
- Use the answers to build a
filter()function and narrow down the potential characters your neighbor may have picked.
For example: Here’s a
filter() statement that filters the data to the character Plo Koon.
mr_koon <- filter(people, mass < 100, eye_color != "blue", gender == "male", homeworld == "Dorin", birth_year > 20)
Elusive answers are allowed. For example, if someone asks: What is your character’s mass?
- You can respond:
- My character’s mass is equal to one less than their age.
- Or if you’re feeling generous you can respond:
- My character’s mass is definitely more than 100, but less than 140.
My character has NO hair! (Missing values)
Sometimes a character will be missing a specific attribute. We learned earlier how R stores missing values as
NA. If your character has a missing value for hair color, one of your filter statements would be
The winner is the first to guess their neighbor’s character.
WINNERS Click here!
Want to rematch?
How about make it best of 3 games?
[If this is true],
"Otherwise do this"
Here’s a handy
ifelse statement to help you identify lightsabers.
ifelse(Lightsaber is GREEN?, Yes!
Then Yoda's it is, No!
Then not Yoda's)
Or say we want to label all the porgs over 60 cm as
tall, and everyone else as
short. Whenever we want to add a column where the value depends on the value found in another column. We can use
Or maybe we’re trying to save some money and want to flag all the items that cost less than 500 credits. How?
ifelse() is powerful!
On the cheap
ifelse() to add a column named
affordable to our scrap data.
# Add an affordable column scrap <- scrap %>% mutate(affordable = ifelse(price_per_unit < 500, "Cheap", "Expensive"))
Use your new column and
filter() to create a new
What is the cheapest item?
CONGRATULATIONS of galactic proportions to you.
We now have a clean and tidy data set. If BB8 ever receives new data again, we can re-run this script and in seconds we’ll have it all cleaned up.
6 Plots with ggplot2
Plot the data, Plot the data, Plot the data
ggplot has 3 ingredients.
1. The base plot
we load version 2 of the package
library(ggplot2), but the function to make the plot is only
2. The the X, Y aesthetics
The aesthetics assign the columns from the data that you want to use in the chart. This is where you set the
Y variables that determine the dimensions of the plot.
ggplot(scrap, aes(x = origin, y = amount))
3. The layers or geometries
ggplot(scrap, aes(x = origin, y = amount)) + geom_col()
Now let’s change the fill color to match the origin.
ggplot(scrap, aes(x = origin, y = amount, fill = origin)) + geom_col()
Try making a column plot showing the total amount of scrap for each
destination or for each
ggplot(scrap, aes(x = destination, y = amount )) + geom_col()
Try making a scatterplot of any two columns.
Hint: Numeric variables will be more informative.
ggplot(scrap, aes(x = __column1__, y = __column2__)) + geom_point()
Now let’s use color to show the origins of the scrap
ggplot(scrap, aes(x = destination, y = credits, color = origin)) + geom_point()
This is a A LOT of detail. Let’s make a bar chart and add up the sales to make it easier to understand.
ggplot(scrap, aes(x = destination, y = credits, fill = origin)) + geom_col()
It’s still tricky to compare sales by origin. Let’s change the position of the columns.
ggplot(scrap, aes(x = destination, y = credits, fill = origin)) + geom_col(position = "dodge")
7 More Plots
Now let’s use color to show the destinations of the scrap.
ggplot(scrap, aes(x = origin, y = credits, color = destination)) + geom_point()
One easy way to experiment with colors is to add layers like
scale_colour_brewer to your plot which will link to RcolorBrewer palettes so you can have accessible color schemes.
This is way too much detail. Let’s simplify and make a bar chart that adds up all the sales. Note that we use
aes() instead of
color=. If we use color, we get a colorful outline and gray bars.
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col()
Let’s change the position of the bars to make it easier to compare sales by destination for each origin? Remember, you can use
help(geom_col) to learn about the different options for that plot. Feel free to do the same with other
geom_’s as well.
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge")
Does the chart feel crowded to you? Let’s use the
facet wrap function to put each origin on a separate chart.
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge") + facet_wrap("destination")
We can add lables to the chart by adding the
labs() layer. Let’s give our chart from above a title.
Titles and labels
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge") + facet_wrap("destination") + labs(title = "Scrap sales by origin and destination", subtitle = "Planet Jakku", x = "Origin", y = "Total sales")
More layers! Let’s say we were advised to avoid sales that were over 50 Billion credits. Let’s add that as a horizontal line to our chart. For that, we use
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge") + facet_wrap("destination") + labs(title = "Scrap sales by origin and destination", subtitle = "Planet Jakku", x = "Origin", y = "Total sales") + geom_hline(yintercept = 5E+10, color = "orange", size = 2)
2.2e+06 scientific notation
Want to get rid of that ugly scientific notation? We can use
options(scipen = 999). Note that this is a general setting in R. Once you use
options(scipen = 999) in your current session, you don’t have to use it again. (Like loading a package, you only need to run the line once when you start a new R session).
options(scipen = 999) ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge") + facet_wrap("destination") + theme_bw() + labs(title = "Scrap sales by origin and destination", x = "Origin", y = "Total sales")
Let’s say we don’t like printing so many zeros and want the labels to be in Millions of credits. Any ideas on how to make that happen?
You may not like the appearance of these plots.
theme functions to change the appearance of a plot. Try some.
ggplot(scrap, aes(x = origin, y = credits, fill = destination)) + geom_col(position = "dodge") + facet_wrap("destination") + theme_bw()
Be bold and make a boxplot. We’ve covered how to do a scatterplot with
geom_point and a bar chart with
geom_col, but how would you make a boxplot showing the prices at each destination? Feel free to experiment with
May the force be with you.
You’ve made some plots you can be proud of, so let’s learn to save them so we can cherish them forever. There’s a function called
ggsave to do just that. How do we
ggsave our plots?
# Get help help(ggsave) ?ggsave # Run the R code for your favorite plot first ggplot(data, aes()) + .... + .... # Then save your plot to a png file of your choosing ggsave("results/plot_name.png")
Sometimes you may want to make a plot and save it for later. For that, you give your plot a name. Any name will do.
# Name the ggplot you want to save my_plot <- ggplot(...) + geom_point(...) # Save it ggsave(filename = "results/Save_it_here.png", plot = my_plot)
Learn more about saving plots: http://stat545.com/
It’s Finn time
Seriously, let’s pay that ransom already.
Q: Where should we go to get our 10,000 Black boxes?
Step 1: Make a
geom_col() plot showing the total pounds of Black boxes shipped to each destination.
ggplot(cheap_scrap, aes(x = ______ , y = ______ )) + geom_
ggplot(cheap_scrap, aes(x = destination, y = total_pounds) ) + geom_col()
Which destination has the most pounds of the cheapest item?
Woop! Go get em! So long Jakku - see you never!
Super-serious kudos to you. You have earned yourself a great award.
You can customize the look of your plot by adding a
- How to modify the gridlines behind your chart?
- Try the different themes at the end of this lesson:
- Or modify the color and size with
theme(panel.grid.minor = element_line(colour = "white", size = 0.5))
- There’s even
- Try the different themes at the end of this lesson:
- How do you set the x and y scale manually?
- Here is an example with a scatter plot:
ggplot() + geom_point() + xlim(beginning, end) + ylim(beginning, end)
- Warning: Values above or below the limits you set will not be shown. This is another great way to lie with data.
- Here is an example with a scatter plot:
- How do you get rid of the legend if you don’t need it?
geom_point(aes(color = facility_name), show.legend = FALSE)
- The R Cookbook shows a number of ways to get rid of legends.
- I only like dashed lines. How do you change the linetype to a dashed line?
geom_line(aes(color = facility_name), linetype = "dashed")
- You can also try
"dotdash", or even
- How many colors are there in R? How does R know
hotpinkis a color?
- Keyboard shortcuts for RStudio
- There is a Shortcut cheatsheet
- In RStudio you can go to Help > Keyboard Shortcuts Help
- Load one of the data sets below into R
- Create 2 plots using the data.
- Don’t worry if it looks really weird. Consider it art and try again.
When you add more layers using
+ remember to place it at the end of each line.
# This will work ggplot(scrap, aes(x = origin, y = credits)) + geom_point() # So will this ggplot(scrap, aes(x = origin, y = credits)) + geom_point() # But this won't ggplot(scrap, aes(x = origin, y = credits)) + geom_point()