Reproducible Research Final Project

This article is the final project for the Reproducible Research course offered on Coursera by Johns Hopkins University. I examine the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to assess the impact of severe weather events on population health and the economy.

Data Loading

The first step is to load the data. I used read_csv from readr (part of the tidyverse), which reads the file directly in its compressed .bz2 form.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

storm_data <- read_csv("repdata_data_StormData.csv.bz2")

Rows: 902297 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
lgl  (1): COUNTYENDN

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
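The message above suggests declaring column types explicitly. As a minimal sketch of that idea, the example below writes a small toy table (hypothetical values, not NOAA data) to a temporary .bz2 file and reads it back with a col_types specification. The names toy, toy_path and toy_read are illustrative only.

```r
library(readr)
library(tibble)

# Toy stand-in for the storm file (hypothetical values, not NOAA data)
toy <- tibble(
  EVTYPE = c("TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, 2),
  PROPDMGEXP = c("K", "M", NA)
)

# write_csv() compresses automatically when the path ends in .bz2
toy_path <- tempfile(fileext = ".csv.bz2")
write_csv(toy, toy_path)

# Declaring the column types up front avoids type guessing
# and suppresses the column specification message
toy_read <- read_csv(
  toy_path,
  col_types = cols(
    EVTYPE = col_character(),
    FATALITIES = col_double(),
    PROPDMGEXP = col_character()
  )
)
toy_read
```

For the real file, the same col_types argument could list only the columns used in the analysis, leaving the rest to be guessed.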

Data Cleaning and Transformation

To see how the data look and which variables they contain, I used the glimpse function.

# Explore the data
glimpse(storm_data)

Rows: 902,297
Columns: 37
$ STATE__    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ BGN_DATE   <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0…
$ BGN_TIME   <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0100", "09…
$ TIME_ZONE  <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CS…
$ COUNTY     <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, 49, 107,…
$ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN", "LAUD…
$ STATE      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
$ EVTYPE     <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR…
$ BGN_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ BGN_AZI    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BGN_LOCATI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ END_DATE   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ END_TIME   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ END_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ END_AZI    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ END_LOCATI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LENGTH     <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, 1.3, 4.7…
$ WIDTH      <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400, 400, 2…
$ F          <dbl> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, 1, 1, 1,…
$ MAG        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,…
$ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, …
$ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, …
$ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",…
$ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ CROPDMGEXP <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ WFO        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ STATEOFFIC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ZONENAMES  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LATITUDE   <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 3334, 3336,…
$ LONGITUDE  <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 8740, 8738,…
$ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, 0, 3432,…
$ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, 0, 8540,…
$ REMARKS    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ REFNUM     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Since EVTYPE is the most important variable in this analysis, I need to ensure that it has no missing values, which could cause problems when plotting or computing summary statistics.
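One way to check for missing values before filtering is to count them per column. The sketch below does this on a small toy table (hypothetical values); with the real data, storm_data would take the place of toy.

```r
library(dplyr)
library(tibble)

# Toy table with one missing value in each column (hypothetical values)
toy <- tibble(
  EVTYPE = c("TORNADO", NA, "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, NA, 2)
)

# Count the missing values in every column
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```

A count of zero for EVTYPE would mean the filter on that column removes no rows.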

I also computed the total fatalities, injuries and property damage, as these are important in determining the health and economic impact of the events. Before summing, property damage is scaled according to its exponent code (PROPDMGEXP).

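# Data Cleaning
# Drop rows with no event type, then multiply property damage by 1,000
# when its exponent code is "K" or "M" (all other codes are left unscaled)
cleaned_storm_data <- storm_data |> 
  filter(!is.na(EVTYPE)) |> 
  mutate(PROPDMG = PROPDMG * ifelse(PROPDMGEXP %in% c("K", "M"), 1000, 1))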


# Summary Statistics
summary_stats <- cleaned_storm_data |> 
  summarise(
    total_fatalities = sum(FATALITIES),
    total_injuries = sum(INJURIES),
    total_damage = sum(PROPDMG)
  )
summary_stats

# A tibble: 1 × 3
  total_fatalities total_injuries total_damage
             <dbl>          <dbl>        <dbl>
1            15145         140528 10875995063.
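The cleaning step above applies the same factor of 1,000 to both "K" and "M". A possible refinement, sketched below on toy values, maps each exponent code to its own multiplier. It assumes that "K", "M" and "B" stand for thousands, millions and billions; any other code is left unscaled here, which is a simplification. The column names multiplier and PROPDMG_USD are illustrative. The totals printed above come from the original code, so adopting this mapping would change them.

```r
library(dplyr)
library(tibble)

# Toy damage figures with their exponent codes (hypothetical values)
toy <- tibble(
  PROPDMG = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", NA)
)

scaled <- toy |>
  mutate(
    # Assumed meaning of the codes: K = 1e3, M = 1e6, B = 1e9
    multiplier = case_when(
      PROPDMGEXP %in% c("K", "k") ~ 1e3,
      PROPDMGEXP %in% c("M", "m") ~ 1e6,
      PROPDMGEXP %in% c("B", "b") ~ 1e9,
      TRUE ~ 1  # missing or unrecognised codes are left unscaled
    ),
    PROPDMG_USD = PROPDMG * multiplier
  )
scaled
```

Writing the result to a new column keeps the raw PROPDMG values available for checking.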

Results

Population Health Impact

To assess the population health impact, the data are grouped by event type, and fatalities and injuries are added together into a single total_harm value per event type. The result is stored in a variable called event_harm, and the top 10 event types are then plotted.

event_harm <- cleaned_storm_data |> 
  group_by(EVTYPE) |> 
  summarise(total_harm = sum(FATALITIES + INJURIES)) |> 
  arrange(desc(total_harm))

# Visualization
event_harm |> 
  top_n(10, total_harm) |> 
  ggplot(aes(x = reorder(EVTYPE, -total_harm), y = total_harm)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(
    title = "Top 10 Events with Highest Population Health Impact",
    x = "Event Type",
    y = "Total Harm"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
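The dplyr documentation marks top_n() as superseded in favour of slice_max(), and geom_col() draws the same bars as geom_bar(stat = "identity"). As an optional sketch, the same selection and plot could be written as follows on toy values (hypothetical, not taken from the storm data). The names toy_harm, top_harm and p are illustrative.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy harm totals per event type (hypothetical values)
toy_harm <- tibble(
  EVTYPE = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  total_harm = c(120, 45, 60, 5)
)

# Keep the 3 rows with the largest total_harm, sorted from largest to smallest
top_harm <- toy_harm |>
  slice_max(total_harm, n = 3)

# geom_col() uses the y values as bar heights directly
p <- ggplot(top_harm, aes(x = reorder(EVTYPE, -total_harm), y = total_harm)) +
  geom_col(fill = "blue") +
  labs(x = "Event Type", y = "Total Harm") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

With the real data, event_harm and n = 10 would replace toy_harm and n = 3.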

Economic Consequences

The final step is to analyse the economic consequences. Property damage is a natural metric for the economic consequences of events like these. The data is grouped by event type and the total property damage for each event type is calculated. The result is stored in a variable called event_damage, which is then visualized.

event_damage <- cleaned_storm_data |> 
  group_by(EVTYPE) |> 
  summarise(total_damage = sum(PROPDMG)) |> 
  arrange(desc(total_damage))

# Visualization
event_damage |> 
  top_n(10, total_damage) |> 
  ggplot(aes(x = reorder(EVTYPE, -total_damage), y = total_damage)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(
    title = "Top 10 Events with Greatest Economic Consequences",
    x = "Event Type",
    y = "Total Damage"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

Tornadoes are the most harmful event type for population health. They are also the most damaging economically, followed in order by flash floods, TSTM wind, flood, thunderstorm wind, hail, lightning, thunderstorm winds, high wind and winter storm, which together make up the top ten events by property damage. TSTM wind, thunderstorm wind and thunderstorm winds appear as separate entries because the event type labels were used as recorded.
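Several entries in the ranking above are variants of the same label. A possible follow-up, sketched here on toy values (hypothetical, not NOAA data), is to normalize EVTYPE before grouping. The regular expression and the column name EVTYPE_CLEAN are illustrative assumptions and cover only the thunderstorm wind variants.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy rows with inconsistent event labels (hypothetical values)
toy <- tibble(
  EVTYPE = c("TSTM WIND", "Thunderstorm Wind", "THUNDERSTORM WINDS", " hail "),
  PROPDMG = c(10, 20, 30, 5)
)

normalized <- toy |>
  mutate(
    EVTYPE_CLEAN = EVTYPE |>
      str_to_upper() |>   # unify letter case
      str_squish() |>     # trim and collapse whitespace
      # Merge the thunderstorm wind variants into one label
      str_replace("^TSTM WIND$|^THUNDERSTORM WINDS?$", "THUNDERSTORM WIND")
  ) |>
  group_by(EVTYPE_CLEAN) |>
  summarise(total_damage = sum(PROPDMG)) |>
  arrange(desc(total_damage))
normalized
```

Applied to the full dataset, a step like this could change the order of the top ten, since the damage now split across several labels would be combined.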