Air, wind and roses

Let’s take a look at some air pollution data stored in an Excel file.

Download the data

DOWNLOAD — La Jolla PM25 data

Install packages

# Get devtools for installing new packages from GitHub
install.packages("devtools", dependencies = TRUE)

# Load the devtools library 
library(devtools)

# Install openair for windrose functions
install_github('davidcarslaw/openair')

Load the data

# Load packages
library(openair)
library(readxl)
library(ggplot2)
library(dplyr)
library(lubridate)

# Set the path to your the Excel file
excel_path <- "data/la_jolla_pm25_wind_data.xls"

# Read the file
air_data <- read_excel(excel_path)

Explore the data

What are the column names?

names(air_data)

glimpse(air_data)

summary(air_data)

Simplify the column names

# Drop special characters and shorten names
# Set all names to lowercase
air_data <- air_data %>% 
            rename(pm25 = "PM2.5 Conc (ug/m3)", 
                   wd   = "Wind Direction (Degrees)",
                   ws   = "Wind Speed (mph)") %>%
            rename_all(tolower)

# We need numbers for our data, not text
# Set wind speed and wind direction to numeric
air_data <- air_data %>% 
            mutate(wd   = as.numeric(wd),
                   ws   = as.numeric(ws))

## Warning: NAs introduced by coercion

Plot the data

Create a plot to show the distribution of each of the columns containing observations: wind speed, wind direction, and concentration.

ggplot(air_data, aes(x = ?, y = ?)) + geom_point()

Clean ship

Let’s drop the non-sense values. We can’t use the rows that have a missing windspeed or wind direction observation.

air_data <- filter(air_data, 
                   !is.na(ws), 
                   !is.na(wd),
                   !is.na(pm25),
                   ws > 0)

Wind rose

Now let’s make some wind roses.

# Plot the data
windRose(air_data)

polarFreq(air_data)

#-- Fine tune  wind rose
polarFreq(air_data, ws.int =5, ws.upper = 35)

polarFreq(air_data, ws.int =0.8, breaks = seq(2:30))

Pollution rose

To make a pollution rose we can replace the name of the wind speed column with the name of PM2.5 column - "PM2.5 Conc (ug/m3)"

# Pollution concentrations based on wind directions
pollutionRose(air_data, 
              pollutant = "pm25",
              key.footer = "PM2.5 ug/m3")

Time series

DOWNLOAD — Ozone air data

library(lubridate)
library(readr)
library(ggplot2)

# Read the file
excel_path <- "data/2014_AQS_FondduLac.xlsx"

air_data <- read_excel(excel_path)

## # A tibble: 5 x 12
##   StateCode CountyCode SiteNum Latitude Longitude Date               
##       <dbl>      <dbl>   <dbl>    <dbl>     <dbl> <dttm>             
## 1        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 2        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 3        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 4        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 5        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## # ... with 6 more variables: Time <dttm>, Hour <dbl>, Parameter <dbl>,
## #   Conc <dbl>, site_catid <chr>, Year <dbl>

Explore the data

names(air_data)

glimpse(air_data)

Simplify the column names

# Set all names to lowercase
air_data <- air_data %>% 
            rename_all(tolower) %>%
            rename(site = site_catid)

Let’s focus on PM2.5

Use filter to select only the parameter code 88101.

air_data <- filter(air_data, parameter == ??)

air_data <- filter(air_data, parameter == 88101)

Let’s summarize the observations by day and then make a time series chart to see how the pollution concentrations are changing over time.

Add monthly statistics

# Add a month column
air_data <- air_data %>% 
            mutate(day   = day(date),
                   month = month(date),
                   year  = year(date))


# Find average PM25 concentration for each day
#   - And upper and lower 10th percentile concentration
air_summary <- group_by(air_data, site, year, month, day, date) %>%
               summarize(conc_avg  = mean(conc, na.rm = T),
                         conc_10th = quantile(conc, 0.10, na.rm = T),
                         conc_90th = quantile(conc, 0.90, na.rm = T)) %>%
               ungroup()

Plot a line chart

ggplot(air_summary, aes(x = date, y = conc_avg)) +
   geom_line() +
   facet_wrap(~ site)

Now we can add a confidence band behind the line showing the upper and lower 10th percentile of the observations.

ggplot(air_summary, aes(x = date, y = conc_avg)) + 
   geom_smooth(method ="loess", level = 0.90) +
   geom_line(color = "tomato") +
   facet_wrap(~site)

What happens when you increase the level = 0.90 up to 0.999, making the shaded band a 99% confidence interval?

Try adding another confidence band, but make it a linear model: method = "lm", and set the color to “black”. Does the new line predict the concentration to be going down or up at each site?

ggplot(air_summary, aes(x = date, y = conc_avg)) + 
   geom_smooth(method = "loess", level = 0.999) +
   geom_line(color = "tomato") +
   geom_smooth(method = "lm", level = 0.90, color = "black") +
   facet_wrap(~site)

Download the data

Install packages

Load the data

Explore the data

Simplify the column names

Plot the data

Clean ship

Wind rose

Pollution rose

Time series

Explore the data

Simplify the column names

Let’s focus on PM2.5

Return to Day 3-AM