Air, wind and roses

Let’s take a look at some air pollution data stored in an Excel file.

Download the data

DOWNLOAD — La Jolla PM25 data

Install packages

# Get devtools for installing new packages from GitHub
install.packages("devtools", dependencies = TRUE)

# Load the devtools library
library(devtools)

# Install openair for windrose functions
install_github("davidcarslaw/openair")

Load the data

# Load packages
library(readxl)
library(dplyr)
library(ggplot2)
library(openair)

# Set the path to your Excel file
excel_path <- "data/la_jolla_pm25_wind_data.xls"

# Read the file
air_data <- read_excel(excel_path)

Explore the data

What are the column names?

names(air_data)

Simplify the column names

# Drop special characters and shorten names
# Set all names to lowercase
air_data <- air_data %>% 
            rename(pm25 = "PM2.5 Conc (ug/m3)", 
                   wd   = "Wind Direction (Degrees)",
                   ws   = "Wind Speed (mph)") %>%
            rename_all(tolower)

# We need numbers for our data, not text
# Set wind speed and wind direction to numeric
air_data <- air_data %>% 
            mutate(wd   = as.numeric(wd),
                   ws   = as.numeric(ws))
## Warning: NAs introduced by coercion
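
The warning above is expected when a text column contains entries that are not numbers. As a minimal sketch, assuming the wind speed column holds a mix of numeric text and placeholder text (the values in `ws_text`, including "Calm", are made up for illustration and are not taken from the La Jolla file):

```r
# Hypothetical wind speed values as they might arrive from Excel as text
ws_text <- c("5.2", "3.1", "Calm", "")

# Entries that cannot be parsed as numbers become NA
ws_num <- suppressWarnings(as.numeric(ws_text))
ws_num

# Count how many values were lost in the conversion
n_missing <- sum(is.na(ws_num))
n_missing
```

Counting the NA values after conversion is a quick way to see how many rows will be affected when the data are cleaned later.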

Plot the data

Create a plot to show the distribution of each of the columns containing observations: wind speed, wind direction, and concentration.

ggplot(air_data, aes(x = ?, y = ?)) + geom_point()
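
A histogram is another option for looking at the distribution of a single column. The sketch below uses a small randomly generated data frame (`demo_data` is hypothetical and only mimics the three column names), so it runs without the Excel file:

```r
library(ggplot2)

# Made-up observations, for illustration only
set.seed(1)
demo_data <- data.frame(
  ws   = runif(100, min = 0, max = 20),   # wind speed
  wd   = runif(100, min = 0, max = 360),  # wind direction
  pm25 = rexp(100, rate = 1 / 10)         # concentration
)

# Histogram of wind speed; swap in wd or pm25 to see the other columns
p <- ggplot(demo_data, aes(x = ws)) +
  geom_histogram(bins = 20)
p
```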

Clean ship

Let’s drop the nonsense values. We can’t use rows that are missing a wind speed or wind direction observation.

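# Keep rows with a wind direction and a positive wind speed
# (ws > 0 also drops rows where ws is NA)
air_data <- filter(air_data, 
                   !is.na(wd),
                   ws > 0)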
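
filter() keeps a row only when every condition is TRUE, so a comparison such as ws > 0 also removes rows where ws is NA. A small sketch with made-up values (`demo` is a hypothetical stand-in for the air data) shows the effect:

```r
library(dplyr)

# Four made-up rows: one good, one missing ws, one zero ws, one missing wd
demo <- tibble(
  ws = c(5.2, NA, 0, 3.1),
  wd = c(270, 45, 90, NA)
)

# Only the first row passes both conditions
clean <- filter(demo, !is.na(wd), ws > 0)
clean
```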

Wind rose

Now let’s make some wind roses.

# Plot the data
polarFreq(air_data)

#-- Fine tune the wind rose
polarFreq(air_data, ws.int = 5, ws.upper = 35)

polarFreq(air_data, ws.int = 0.8, breaks = seq(2:30))
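
Behind the plot, a wind rose is essentially a count of observations in each combination of wind direction sector and wind speed bin. The base R sketch below illustrates that idea with made-up values and four coarse 90-degree sectors. It is a simplification for illustration, not how openair computes its bins:

```r
# Made-up wind observations
demo <- data.frame(
  wd = c(10, 20, 100, 190, 200, 350),  # degrees
  ws = c(3, 7, 12, 4, 6, 2)            # mph
)

# Assign each observation to a 90-degree sector and a 5 mph speed bin
demo$sector <- cut(demo$wd,
                   breaks = seq(0, 360, by = 90),
                   labels = c("NE", "SE", "SW", "NW"),
                   include.lowest = TRUE)
demo$speed_bin <- cut(demo$ws,
                      breaks = seq(0, 35, by = 5),
                      include.lowest = TRUE)

# Frequency table: rows are sectors, columns are speed bins
counts <- table(demo$sector, demo$speed_bin)
counts
```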

Pollution rose

To make a pollution rose, we swap the wind speed column for the PM2.5 column, which we renamed to pm25 above.

# Pollution concentrations based on wind directions
pollutionRose(air_data,
              pollutant = "pm25",
              key.footer = "PM2.5 ug/m3")
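
A related numeric check is to average the concentration within each wind direction sector. This is a simpler summary than what a pollution rose draws, and the values and the four 90-degree sectors below are made up for illustration:

```r
library(dplyr)

# Made-up observations
demo <- tibble(
  wd   = c(10, 20, 100, 190, 200, 350),
  pm25 = c(8, 12, 20, 5, 7, 30)
)

# Average concentration for each coarse wind direction sector
by_sector <- demo %>%
  mutate(sector = cut(wd,
                      breaks = seq(0, 360, by = 90),
                      labels = c("NE", "SE", "SW", "NW"),
                      include.lowest = TRUE)) %>%
  group_by(sector) %>%
  summarize(pm25_avg = mean(pm25))
by_sector
```

A sector with a noticeably higher average hints at a source in that direction, which is the same question the pollution rose answers visually.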

Time series

DOWNLOAD — Ozone air data


# Set the path to the Excel file
excel_path <- "data/2014_AQS_FondduLac.xlsx"

# Read the file
air_data <- read_excel(excel_path)

# Show the first 5 rows
head(air_data, 5)
## # A tibble: 5 x 12
##   StateCode CountyCode SiteNum Latitude Longitude Date               
##       <dbl>      <dbl>   <dbl>    <dbl>     <dbl> <dttm>             
## 1        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 2        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 3        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 4        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## 5        27        137    7001     47.5     -92.5 2014-01-01 00:00:00
## # ... with 6 more variables: Time <dttm>, Hour <dbl>, Parameter <dbl>,
## #   Conc <dbl>, site_catid <chr>, Year <dbl>

Explore the data



Simplify the column names

# Set all names to lowercase
air_data <- air_data %>% 
            rename_all(tolower) %>%
            rename(site = site_catid)

Let’s focus on PM2.5

Use filter to select only the parameter code 88101.

air_data <- filter(air_data, parameter == ??)
air_data <- filter(air_data, parameter == 88101)
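
Before filtering, it can help to see which parameter codes are present and how many rows each has. The sketch below uses a hypothetical three-row table. The code 44201 (ozone) is included only as an example of a second parameter and is an assumption about what such a file might contain:

```r
library(dplyr)

# Made-up rows with two parameter codes
demo <- tibble(
  parameter = c(88101, 88101, 44201),
  conc      = c(7.5, 9.1, 0.031)
)

# Count the rows for each parameter code
count(demo, parameter)

# Keep only the PM2.5 rows
pm25_only <- filter(demo, parameter == 88101)
pm25_only
```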

Let’s summarize the observations by day and then make a time series chart to see how the pollution concentrations are changing over time.

Add daily statistics

# Load lubridate for day(), month(), and year()
library(lubridate)

# Add day, month, and year columns
air_data <- air_data %>% 
            mutate(day   = day(date),
                   month = month(date),
                   year  = year(date))
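
As a quick check of what these helpers return, the sketch below applies them to a single made-up timestamp (the value of `d` is hypothetical):

```r
library(lubridate)

# One made-up date-time value
d <- as.POSIXct("2014-03-15 13:00:00", tz = "UTC")

# Each helper pulls out one component as a number
parts <- c(day = day(d), month = month(d), year = year(d))
parts
```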

# Find the average PM2.5 concentration for each day
#   - And the 10th and 90th percentile concentrations
air_summary <- group_by(air_data, site, year, month, day, date) %>%
               summarize(conc_avg  = mean(conc, na.rm = TRUE),
                         conc_10th = quantile(conc, 0.10, na.rm = TRUE),
                         conc_90th = quantile(conc, 0.90, na.rm = TRUE)) %>%
               ungroup()
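
To see what the three summary statistics do on their own, here is a minimal sketch on a short made-up vector (the values in `conc` are hypothetical). The na.rm = TRUE argument tells each function to skip missing values instead of returning NA:

```r
# Made-up concentrations with one missing value
conc <- c(2, 4, 6, 8, 10, NA)

conc_avg  <- mean(conc, na.rm = TRUE)
conc_10th <- quantile(conc, 0.10, na.rm = TRUE)
conc_90th <- quantile(conc, 0.90, na.rm = TRUE)

conc_avg
conc_10th
conc_90th
```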

Plot a line chart

ggplot(air_summary, aes(x = date, y = conc_avg)) +
   geom_line() +
   facet_wrap(~ site)

Now we can add a smoothed trend line with a shaded 90% confidence band behind the line of daily averages.

ggplot(air_summary, aes(x = date, y = conc_avg)) + 
   geom_smooth(method ="loess", level = 0.90) +
   geom_line(color = "tomato") +
   facet_wrap(~ site)
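
The band drawn by geom_smooth() is a confidence interval around the fitted trend, not the spread of the observations. To shade the range between the 10th and 90th percentile columns instead, one option is geom_ribbon(). The sketch below uses a small made-up summary table (`demo_summary` is hypothetical and only mimics the column names of air_summary):

```r
library(ggplot2)

# Made-up daily summary values
demo_summary <- data.frame(
  date      = as.Date("2014-01-01") + 0:4,
  conc_avg  = c(6, 8, 7, 10, 9),
  conc_10th = c(4, 5, 5, 7, 6),
  conc_90th = c(9, 11, 10, 14, 12)
)

# Shade between the percentiles, then draw the average on top
p <- ggplot(demo_summary, aes(x = date, y = conc_avg)) +
  geom_ribbon(aes(ymin = conc_10th, ymax = conc_90th), fill = "grey80") +
  geom_line(color = "tomato")
p
```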

What happens when you increase level = 0.90 to 0.999, making the shaded band a 99.9% confidence interval?

Try adding another smoothed line with its own confidence band, but make it a linear model: method = "lm", and set the color to "black". Does the new line show the concentration going down or up at each site?

ggplot(air_summary, aes(x = date, y = conc_avg)) + 
   geom_smooth(method = "loess", level = 0.999) +
   geom_line(color = "tomato") +
   geom_smooth(method = "lm", level = 0.90, color = "black") +
   facet_wrap(~ site)
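
The direction of the linear trend can also be read from the slope of a fitted model rather than from the chart. As a sketch under stated assumptions, the example below fits lm() to five made-up daily averages for a single site (`demo` and its `days` column are hypothetical):

```r
# Made-up daily averages for one site
demo <- data.frame(
  date     = as.Date("2014-01-01") + 0:4,
  conc_avg = c(10, 9, 9, 7, 6)
)

# Days since the first observation, so the slope is in units per day
demo$days <- as.numeric(demo$date - min(demo$date))

# Fit a straight line and pull out the slope
fit   <- lm(conc_avg ~ days, data = demo)
slope <- coef(fit)[["days"]]
slope
```

A negative slope means the fitted line is going down over time, and a positive slope means it is going up.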

