Air, wind and roses

Let’s take a look at some air pollution data stored in an Excel file.

Install packages

# Get devtools for installing new packages from GitHub
install.packages("devtools", dependencies = TRUE)

library(devtools)

# Install openair for windrose functions
install_github('davidcarslaw/openair')

# Load packages
library(openair)
library(ggplot2)
library(dplyr)
library(lubridate)

# Set the path to your the Excel file
excel_path <- "data/la_jolla_pm25_wind_data.xls"
# Read the file
air_data <- read_excel(excel_path)

Explore the data

What are the column names?

names(air_data)

glimpse(air_data)

summary(air_data)

Simplify the column names

# Drop special characters and shorten names
# Set all names to lowercase
air_data <- air_data %>%
rename(pm25 = "PM2.5 Conc (ug/m3)",
wd   = "Wind Direction (Degrees)",
ws   = "Wind Speed (mph)") %>%
rename_all(tolower)

# We need numbers for our data, not text
# Set wind speed and wind direction to numeric
air_data <- air_data %>%
mutate(wd   = as.numeric(wd),
ws   = as.numeric(ws))
## Warning: NAs introduced by coercion

Plot the data

Create a plot to show the distribution of each of the columns containing observations: wind speed, wind direction, and concentration.

ggplot(air_data, aes(x = ?, y = ?)) + geom_point()

Clean ship

Let’s drop the non-sense values. We can’t use the rows that have a missing windspeed or wind direction observation.

air_data <- filter(air_data,
!is.na(ws),
!is.na(wd),
!is.na(pm25),
ws > 0)

Wind rose

Now let’s make some wind roses.

# Plot the data
windRose(air_data)

polarFreq(air_data)

#-- Fine tune  wind rose
polarFreq(air_data, ws.int =5, ws.upper = 35)

polarFreq(air_data, ws.int =0.8, breaks = seq(2:30))

Pollution rose

To make a pollution rose we can replace the name of the wind speed column with the name of PM2.5 column - "PM2.5 Conc (ug/m3)"

# Pollution concentrations based on wind directions
pollutionRose(air_data,
pollutant = "pm25",
key.footer = "PM2.5 ug/m3")

Time series

library(lubridate)
library(ggplot2)

excel_path <- "data/2014_AQS_FondduLac.xlsx"

air_data <- read_excel(excel_path)
## # A tibble: 5 x 12
##   StateCode CountyCode SiteNum Latitude Longitude Date
##       <dbl>      <dbl>   <dbl>    <dbl>     <dbl> <dttm>
## 1        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 2        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 3        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 4        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 5        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## # ... with 6 more variables: Time <dttm>, Hour <dbl>, Parameter <dbl>,
## #   Conc <dbl>, site_catid <chr>, Year <dbl>

Explore the data

names(air_data)

glimpse(air_data)

Simplify the column names

# Set all names to lowercase
air_data <- air_data %>%
rename_all(tolower) %>%
rename(site = site_catid)

Let’s focus on PM2.5

Use filter to select only the parameter code 88101.

air_data <- filter(air_data, parameter == ??)
air_data <- filter(air_data, parameter == 88101)

Let’s summarize the observations by day and then make a time series chart to see how the pollution concentrations are changing over time.

# Add a month column
air_data <- air_data %>%
mutate(day   = day(date),
month = month(date),
year  = year(date))

# Find average PM25 concentration for each day
#   - And upper and lower 10th percentile concentration
air_summary <- group_by(air_data, site, year, month, day, date) %>%
summarize(conc_avg  = mean(conc, na.rm = T),
conc_10th = quantile(conc, 0.10, na.rm = T),
conc_90th = quantile(conc, 0.90, na.rm = T)) %>%
ungroup()

Plot a line chart

ggplot(air_summary, aes(x = date, y = conc_avg)) +
geom_line() +
facet_wrap(~ site)

Now we can add a confidence band behind the line showing the upper and lower 10th percentile of the observations.

ggplot(air_summary, aes(x = date, y = conc_avg)) +
geom_smooth(method ="loess", level = 0.90) +
geom_line(color = "tomato") +
facet_wrap(~site)

What happens when you increase the level = 0.90 up to 0.999, making the shaded band a 99% confidence interval?

Try adding another confidence band, but make it a linear model: method = "lm", and set the color to “black”. Does the new line predict the concentration to be going down or up at each site?

ggplot(air_summary, aes(x = date, y = conc_avg)) +
geom_smooth(method = "loess", level = 0.999) +
geom_line(color = "tomato") +
geom_smooth(method = "lm", level = 0.90, color = "black") +
facet_wrap(~site)