Welcome

The goal of this project is to get a idea of the nature of traffic accidents in Texas. To help with the goal, an analysis of weekdays vs. weekends will be done to give a better pictures of how accidents occur. Ontop of this, a interactive map will be created to better show the accident pattern in Texas

Data Source

Our dataset is a compilation of countrywide traffic accidents in 49 US states. The data ranges from Februrary 2016 to December 2019 and has about 3 million records. To make the data more managable, we decided to only look at records from Texas, which was still 300000 records.

There are 49 columns included in the dataset.

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

####### ALL LIBRARIES USED ###########

#declare libraries
options(stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
library(highcharter)
library(lubridate)
library(padr)
library(tm)
library(DT)
library(knitr)
library(kableExtra)
library(maps)
library(leaflet)
library(htmltools)
library(geojsonio)

#reading in clean data
raw <- read.csv("cleaned-accidents.csv")

info=c("ID", "This is a unique identifier of the accident record.", 
      "Source", "Indicates source of the accident report (i.e. the API which reported the accident.).",
      "TMC", "A traffic accident may have a Traffic Message Channel (TMC) code which provides more detailed description of the event.",
      "Severity", "Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short           delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).",
      "Start_Time", "Shows start time of the accident in local time zone.",
      "End_Time", "Shows end time of the accident in local time zone.",
      "Start_Lat", "Shows latitude in GPS coordinate of the start point.",
      "Start_Lng", "Shows longitude in GPS coordinate of the start point.",
      "End_Lat", "Shows latitude in GPS coordinate of the end point.",
      "End_Lng", "Shows longitude in GPS coordinate of the end point.",
      "Distance(mi)", "The length of the road extent affected by the accident.",
      "Description", "Shows natural language description of the accident.",
      "Number", "Shows the street number in address field.",
      "Street", "Shows the street name in address field.",
      "Side", "Shows the relative side of the street (Right/Left) in address field.",
      "City", "Shows the city in address field.",
      "County", "Shows the county in address field.",
      "State", "Shows the state in address field.",
      "Zipcode", "Shows the zipcode in address field.",
      "Country", "Shows the country in address field.",
      "Timezone", "Shows timezone based on the location of the accident (eastern, central, etc.).",
      "Airport_Code", "Denotes an airport-based weather station which is the closest one to location of the accident.",
      "Weather_Timestamp", "Shows the time-stamp of weather observation record (in local time).",
      "Temperature(F)", "Shows the temperature (in Fahrenheit).",
      "Wind_Chill(F)", "Shows the wind chill (in Fahrenheit).",
      "Humidity(%)", "Shows the humidity (in percentage).",
      "Pressure(in)", "Shows the air pressure (in inches).",
      "Visibility(mi)", "Shows visibility (in miles).",
      "Wind_Direction", "Shows wind direction.",
      "Wind_Speed(mph)", "Shows wind speed (in miles per hour).",
      "Precipitation(in)", "Shows precipitation amount in inches, if there is any.",
      "Weather_Condition", "Shows the weather condition (rain, snow, thunderstorm, fog, etc.).",
      "Amenity", "A Point-Of-Interest (POI) annotation which indicates presence of amenity in a nearby location.",
      "Bump", "A POI annotation which indicates presence of speed bump or hump in a nearby location.",
      "Crossing", "A POI annotation which indicates presence of crossing in a nearby location.",
      "Give_Way", "A POI annotation which indicates presence of give_way sign in a nearby location.",
      "Junction", "A POI annotation which indicates presence of junction in a nearby location.",
      "No_Exit", "A POI annotation which indicates presence of no_exit sign in a nearby location.",
      "Railway", "A POI annotation which indicates presence of railway in a nearby location.",
      "Roundabout", "A POI annotation which indicates presence of roundabout in a nearby location.)",
      "Station", "A POI annotation which indicates presence of station (bus, train, etc.) in a nearby location.",
      "Stop", "A POI annotation which indicates presence of stop sign in a nearby location.", 
      "Traffic_Calming", "A POI annotation which indicates presence of traffic_calming means in a nearby location.",
      "Traffic_Signal", "A POI annotation which indicates presence of traffic_signal in a nearby location.", 
      "Turning_Loop", "A POI annotation which indicates presence of turning_loop in a nearby location.", 
      "Sunrise_Sunset", "Shows the period of day (i.e. day or night) based on sunrise/sunset.", 
      "Civil_Twilight", "Shows the period of day (i.e. day or night) based on civil twilight.", 
      "Nautical_Twilight", "Shows the period of day (i.e. day or night) based on nautical twilight.",
      "Astronomical_Twilight", "Shows the period of day (i.e. day or night) based on astronomical twilight.)")

infodf <- data.frame(ColumnName=info[seq(from=1,to=97, by=2)], Description=info[seq(from=2, to=98, by=2)], stringsAsFactors = FALSE)
infodf %>% kable() %>% kable_styling(bootstrap_options = "striped")

ColumnName	Description
ID	This is a unique identifier of the accident record.
Source	Indicates source of the accident report (i.e. the API which reported the accident.).
TMC	A traffic accident may have a Traffic Message Channel (TMC) code which provides more detailed description of the event.
Severity	Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
Start_Time	Shows start time of the accident in local time zone.
End_Time	Shows end time of the accident in local time zone.
Start_Lat	Shows latitude in GPS coordinate of the start point.
Start_Lng	Shows longitude in GPS coordinate of the start point.
End_Lat	Shows latitude in GPS coordinate of the end point.
End_Lng	Shows longitude in GPS coordinate of the end point.
Distance(mi)	The length of the road extent affected by the accident.
Description	Shows natural language description of the accident.
Number	Shows the street number in address field.
Street	Shows the street name in address field.
Side	Shows the relative side of the street (Right/Left) in address field.
City	Shows the city in address field.
County	Shows the county in address field.
State	Shows the state in address field.
Zipcode	Shows the zipcode in address field.
Country	Shows the country in address field.
Timezone	Shows timezone based on the location of the accident (eastern, central, etc.).
Airport_Code	Denotes an airport-based weather station which is the closest one to location of the accident.
Weather_Timestamp	Shows the time-stamp of weather observation record (in local time).
Temperature(F)	Shows the temperature (in Fahrenheit).
Wind_Chill(F)	Shows the wind chill (in Fahrenheit).
Humidity(%)	Shows the humidity (in percentage).
Pressure(in)	Shows the air pressure (in inches).
Visibility(mi)	Shows visibility (in miles).
Wind_Direction	Shows wind direction.
Wind_Speed(mph)	Shows wind speed (in miles per hour).
Precipitation(in)	Shows precipitation amount in inches, if there is any.
Weather_Condition	Shows the weather condition (rain, snow, thunderstorm, fog, etc.).
Amenity	A Point-Of-Interest (POI) annotation which indicates presence of amenity in a nearby location.
Bump	A POI annotation which indicates presence of speed bump or hump in a nearby location.
Crossing	A POI annotation which indicates presence of crossing in a nearby location.
Give_Way	A POI annotation which indicates presence of give_way sign in a nearby location.
Junction	A POI annotation which indicates presence of junction in a nearby location.
No_Exit	A POI annotation which indicates presence of no_exit sign in a nearby location.
Railway	A POI annotation which indicates presence of railway in a nearby location.
Roundabout	A POI annotation which indicates presence of roundabout in a nearby location.)
Station	A POI annotation which indicates presence of station (bus, train, etc.) in a nearby location.
Stop	A POI annotation which indicates presence of stop sign in a nearby location.
Traffic_Calming	A POI annotation which indicates presence of traffic_calming means in a nearby location.
Traffic_Signal	A POI annotation which indicates presence of traffic_signal in a nearby location.
Turning_Loop	A POI annotation which indicates presence of turning_loop in a nearby location.
Sunrise_Sunset	Shows the period of day (i.e. day or night) based on sunrise/sunset.
Civil_Twilight	Shows the period of day (i.e. day or night) based on civil twilight.
Nautical_Twilight	Shows the period of day (i.e. day or night) based on nautical twilight.
Astronomical_Twilight	Shows the period of day (i.e. day or night) based on astronomical twilight.)

Let us take a look at what our data looks like!

raw %>% 
  as.tbl()

## # A tibble: 298,062 x 24
##    Source Severity Start_Time End_Time Distance.mi. Description City  County
##    <chr>     <int> <chr>      <chr>           <dbl> <chr>       <chr> <chr> 
##  1 MapQu~        2 2016-11-3~ 2016-11~         0.01 Accident o~ Aust~ Travis
##  2 MapQu~        2 2016-11-3~ 2016-11~         0    Accident o~ Aust~ Travis
##  3 MapQu~        2 2016-11-3~ 2016-11~         0    Accident o~ Aust~ Travis
##  4 MapQu~        2 2016-11-3~ 2016-11~         0.01 #2 lane bl~ Fort~ Tarra~
##  5 MapQu~        3 2016-11-3~ 2016-11~         0.01 #2 lane bl~ Fort~ Tarra~
##  6 MapQu~        2 2016-11-3~ 2016-11~         0.01 Accident o~ Dall~ Dallas
##  7 MapQu~        3 2016-11-3~ 2016-11~         0.01 #2 / #3 la~ Dall~ Dallas
##  8 MapQu~        2 2016-11-3~ 2016-11~         0    Accident o~ Dall~ Dallas
##  9 MapQu~        2 2016-11-3~ 2016-11~         0    Accident o~ Dall~ Dallas
## 10 MapQu~        3 2016-11-3~ 2016-11~         0.01 HOV lane b~ Dall~ Dallas
## # ... with 298,052 more rows, and 16 more variables: Temperature.F. <dbl>,
## #   Weather_Condition <chr>, Crossing <chr>, Give_Way <chr>, Junction <chr>,
## #   No_Exit <chr>, Railway <chr>, Roundabout <chr>, Station <chr>, Stop <chr>,
## #   Sunrise_Sunset <chr>, Traffic_Calming <chr>, Traffic_Signal <chr>,
## #   Turning_Loop <chr>, Start_Lat <dbl>, Start_Lng <dbl>

If you would like to learn more about the dataset, please visit: https://www.kaggle.com/sobhanmoosavi/us-accidents

Acknowledgments

As part of the license for utilizing this dataset, it is required that we cite 2 papers that this dataset was used for as part of research. The papers are as follows if you would like to learn more.

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

A Quick Overview

######### READING IN CLEANED DATA ##########

raw <- raw %>% 
    separate(Start_Time, c("Start_Date", "Start_Time"), sep = " ") %>% 
  separate(End_Time, c("End_Date", "End_Time"), sep = " ") %>% 
  mutate(
    wday = weekdays(as.Date(Start_Date, format = "%Y-%m-%d"))
  )

##### GETTING DALLAS DATA #######

#list for counties
dallas_plano_irving <- c("Collin", "Dallas", "Denton", "Ellis", "Hunt", "Kaufman", "Rockwall")

dfw <- data.frame() #create new variable

for (county in dallas_plano_irving){ #loop to grab all
  dfw <- raw %>% 
  subset(County == county) %>% 
    bind_rows(dfw)
}

####### GENERAL SUMMARIES #######

#getting counts for all cities
city_count <- raw %>% 
  group_by(City) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting counts for all counties
county_count <- raw %>% 
  group_by(County) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting counts for all dates
date_count <- raw %>% 
  group_by(Start_Date) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting counts for all severities
severity_count <- raw %>% 
  group_by(Severity) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting counts for all weekdays
weekday <- raw %>% 
  group_by(wday) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting missing date values
date_count <- date_count %>% 
  mutate(
    Date = as.Date(Start_Date, format="%Y-%m-%d")
  )

#first and last day of data
min_d<- min(date_count$Date)
max_d<- max(date_count$Date)

Lets start with taking a look at a general overview of the data.

raw %>% 
  subset(select = c(Severity, Temperature.F., Distance.mi.)) %>% 
  summary()

##     Severity     Temperature.F.    Distance.mi.     
##  Min.   :1.000   Min.   :-40.00   Min.   :  0.0000  
##  1st Qu.:2.000   1st Qu.: 60.80   1st Qu.:  0.0000  
##  Median :2.000   Median : 73.40   Median :  0.0000  
##  Mean   :2.289   Mean   : 71.26   Mean   :  0.1152  
##  3rd Qu.:3.000   3rd Qu.: 82.90   3rd Qu.:  0.0100  
##  Max.   :4.000   Max.   :161.60   Max.   :130.4900  
##                  NA's   :5951

Here we can see the general statistics for the continuous data.

The average severity in Texas is 2.289
The average Temperature is 71.26 F
The average Distance is 0.1 miles

Overview of Severities

##### TREEMAP FOR SEVERITY ########
hc <- severity_count %>% 
  hchart(type="treemap", hcaes(x=paste("Severity:", Severity, sep=" "), value=accidents, color=accidents)) %>% 
  hc_colorAxis(minColor = "#F4D03F", maxColor="red") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Total Accidents by Severity, for Texas",
           align = "left") %>%
  hc_subtitle(text = "color corrosponds to the number of accidents",
           align = "left") %>%
  hc_tooltip(formatter = JS("function(){
                            return (' Severity: ' + this.point.Severity + ' <br> # Accidents: ' + this.point.accidents)}"))

hc

This Treemap can tell us a few things about the Severity Category in the data, for example:

There are 4 Severity levels in the data,
- the Severity ranges on a scale of 1 - 4,
  - Severity 1: 120 accidents
  - Severity 2: 215027 accidents
  - Severity 3: 79467 accidents
  - Severity 4: 3448 accidents
98% of the recorded accidents are categorized as Severity level 2 or 3.
- 72% of the recorded accidents are level 2, respectively.
- 26% of the recorded accidents are level 3, respectively.
1.1% of the recorded accidents are categorized as level 4, the most severe accident.
While less than 0.04% of the recorded accidents are considered a Severity level of 1.
- This could be due to an underreporting of these accidents.
The total number of recorded accidents in Texas is 298062 accidents.

Overview of Counties

#### TREEMAP FOR COUNTY COUNT ######
hc <- county_count %>% 
  slice(1:8) %>% 
  hchart(type = "bar",
         hcaes(x=County, y=accidents, color=County)) %>%
  hc_add_theme(hc_theme_flat()) %>% 
  hc_yAxis(title = list(text = "# of Accidents")) %>% 
  hc_title(text = "Total Accidents by County, in Texas",
           align = "left") %>% 
  hc_tooltip(formatter = JS("function(){
                            return (' County: ' + this.point.County + ' <br> # Accidents: ' + this.point.accidents + ' <br> Severity: ' + this.point.mean_sev.toFixed(2))}"))

hc

From the above, we can see the four most accident prone Counties:

Harris County, TX:
- 99023 recorded accidents
- mean severity level of 2.24
Dallas County, TX:
- 68149 recorded accidents
- mean severity level of 2.41
Travis County, TX:
- 63603 recorded accidents
- mean severity level of 2.28
Bexar County, TX:
- 22206 recorded accidents
- mean severity level of 2.28

The graph shows the disparity of the top three Counties and the preceeding Counties. With the third highest County, Travis County, having 41397 more accidents than the fourth highest County, Bexar County.

Travis County (22206 accidents)
Bexar County (22206 accidents)

Overview of Cities

#### TREEMAP FOR CITY COUNTS ######
hc <- city_count %>% 
  slice(1:8) %>% 
  hchart(type = "treemap",
         hcaes(x=City, value=accidents, color=mean_sev)) %>%
  hc_colorAxis(minColor = "#F4D03F", maxColor="red") %>%
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Total Accidents by City, in Texas",
           align = "left") %>% 
  hc_subtitle(text = "color corrosponds to the mean severity level",
           align = "left") %>%
  hc_tooltip(formatter = JS("function(){
                            return (' City: ' + this.point.City + ' <br> # Accidents: ' + this.point.accidents + ' <br> Severity: ' + this.point.mean_sev.toFixed(2))}"))

hc

From the graph, The five most accident prone Cities in Texas are:

Houston, TX
- 93245 recorded accidents
- mean severity level of 2.23
Austin, TX
- 58553 recorded accidents
- mean severity level of 2.13
Dallas, TX
- 57823 recorded accidents
- mean severity level of 2.38
San Antonio, TX
- 21613 recorded accidents
- mean severity level of 2.27
El Paso, TX
- 9352 recorded accidents
- mean severity level of 2.12

Again, we are seeing a huge disparity of the top three Cities and the preceeding Cities.

Houston, TX has 31.2% of the total accidents in Texas.
Austin, TX has 19.6% of the total accidents in Texas.
Dallas, TX has 19.3% of the total accidents in Texas.

These three cities combined have 70% of the total recorded accidents. meaning the other 710 Cities make up for the remaining 30%.

Complete Timeline

#get all dates
alldates <- seq(min_d, max_d,1)

#take out dates that occur in data
alldates <- alldates %>% 
  subset(!(alldates %in% date_count$Date))

#create dataframe of all dates, with 0 accidents
alldates <- data.frame(Date=alldates, accidents=0, mean_sev=0)

#combining missing dates with old data
total <- alldates %>% 
  bind_rows(date_count) %>% 
  subset(select = c(Date, accidents, mean_sev)) %>% 
  arrange(Date) %>% 
  mutate(
    wday = weekdays(Date) #get weekdays
  )

total %>% 
  hchart(type="column", hcaes(x=Date, y=accidents), color="black") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Total Accidents by Date",
           align = "left") %>% 
  hc_yAxis(title = list(text = "# of Accidents")) %>% 
  hc_xAxis(title = "", type = "datetime", dateTimeLabelFormats = list(day = '%Y/%m')) %>% 
  hc_tooltip(formatter = JS("function(){
                            return (' Date: ' + this.point.Date)+ ' <br> Weekday: ' + this.point.wday + ' <br> # Accidents: ' + this.point.accidents + ' <br> Mean Severity: ' + this.point.mean_sev.toFixed(2)}"))

This graph shows exactly how much data is in this data set, as well as any patterns in the number of Accidents. For example,

The graph seems to show high variation throughout the individual weeks,
- this can be tested by looking at the accidents per weekday.
There only seems to be one major dip in the data from May 24, 2017 to June 1, 2017.
There might be a small decline in the number of accidents over time,
- more analysis on this will need to be done to see if this is true.

Pearsons r and r²

Lets see if Severity, Distance of Accidents, and Temperature are correlated! To do this, we need use use pearsons r and pearsons r squared.

First, lets take a look at pearsons r.

raw_cor <- raw %>% 
  subset(!is.na(Severity)) %>% 
  subset(!is.na(Distance.mi.)) %>%
  subset(!is.na(Temperature.F.)) %>%
  subset(select = c(Severity, Distance.mi., Temperature.F.)) %>% 
  as.matrix() %>% 
  cor()

raw_cor

##                   Severity Distance.mi. Temperature.F.
## Severity        1.00000000   0.14062302    -0.02974895
## Distance.mi.    0.14062302   1.00000000    -0.01100098
## Temperature.F. -0.02974895  -0.01100098     1.00000000

Pearsons r cant tell us a lot, however we can gather a few bits of information:

Severity and Temperature appear to be negatively correlated.
- The pearsons r value is negative, telling us that as the Severity declines, the Temperature Rises.
  - This might suggest that the colder the day, the more sever an accident might be.
Distance and Temperature are also negatively correlated.
- Again, suggesting that as the Temperature gets colder, the distance of an accident might be lower.
Severity and Distance have a positive correlation.
- This would suggest that as the distance of an accident increases, the severity would also increase.

Lets check how well correlated these factors are, we can do this simply by just squaring the above correlation.

#lets take a look at the severity matched with the start time
raw_cor_2 <- raw_cor %>% 
  '^'(2) %>% 
    round(5) %>% #round
  hchart() %>% #graph
  hc_colorAxis(minColor = "#F4D03F", maxColor="red") %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_title(text = "Pearsons r squared matchup",
           align = "left") %>% 
  hc_subtitle(text = "For TX accidents",
              align = "left") %>% 
   hc_xAxis(categories = list("Severity", "Distance of Accident", "Temperature")) %>% 
  hc_yAxis(categories = list("Severity", "Distance of Accident", "Temperature"))%>% 
   hc_legend(align = "left") %>% 
  hc_plotOptions(
           series = list(
             boderWidth = 0,
             dataLabels = list(enabled = TRUE)))

raw_cor_2

All of the r squared values are very low, with the highest being Severity and Distance of the Accident (0.01977). This tells us that only ~1.9% of the Severity samples can be explained by the Distance of the Accident.

As all of these r² values are very low, this suggests that none of the factors can be explained by the other.

ie, the Distance an Accident occurs has little to do with the Severity of the Accident.
- The same can be said for all of the other matchups as well.

Analysis of Weekdays

In this Analysis, I will dive into the differences of the recorded accidents according to the day of the week!

#### CLEARING VARIABLES ###
rm(list = ls(all.names = TRUE))

### GETTING A NEW RAW
# This is to make merging the analysis into the main file easier
raw <- read.csv("cleaned-accidents.csv")

raw <- raw %>% 
    separate(Start_Time, c("Start_Date", "Start_Time"), sep = " ") %>% 
  separate(End_Time, c("End_Date", "End_Time"), sep = " ") %>% 
  mutate(
    wday = wday(Start_Date, label=TRUE)
  )

#getting counts for all dates
date_count <- raw %>% 
  group_by(Start_Date) %>% 
  dplyr::summarise(
    accidents = dplyr::n(),
    mean_sev = mean(Severity)
  ) %>% 
  arrange(-accidents)

#getting missing date values
date_count <- date_count %>% 
  mutate(
    Date = as.Date(Start_Date, format="%Y-%m-%d")
  )

#first and last day of data
min_d<- min(date_count$Date)
max_d<- max(date_count$Date)

#### PLOTTING ######
hc <- date_count %>% 
  subset(select = c(Date, accidents, mean_sev)) %>% 
  mutate(
    wday = wday(Date, label=TRUE) #get weekdays
  ) %>% 
  group_by(wday) %>% 
  dplyr::summarise(
    accidents = sum(accidents),
    mean_sev = mean(mean_sev)
  ) %>% 
  ungroup() %>% 
  hchart(type= "bar", hcaes(x=wday, y=accidents)) %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Total Accidents by Day of Week",
           align = "left") %>% 
  hc_yAxis(title = list(text = "# of Accidents")) %>% 
  hc_xAxis(title = list(text = "")) %>% 
  hc_tooltip(formatter = JS("function(){
                            return (' Day of Week: ' + this.point.wday + ' <br> # Accidents: ' + this.point.accidents + ' <br> Mean Severity: ' + this.point.mean_sev.toFixed(2))}"))

#call plot
hc

As shown, there is a great desperity between the Monday - Friday, and Saturday/Sunday.

While Monday to Friday has a range of 3,346 accidents. Monday to Sunday has a range of 44,344 accidents.
- Monday has 18.7% of the total accidents.
- Tuesday has 18.5% of the total accidents.
- Wednesday has 18.5% of the total accidents.
- Thursday has 17.9% of the total accidents.
- Friday has 17.6% of the total accidents.
- Saturday has 4.7% of the total accidents.
- Sunday has 3.8% of the total accidents.

While this huge drop is apparent on the weekend, one reason why might this occur is; there are less people are working, causing less people to be on the roads.

Now lets take a look into if this has any differences throughout the years.

#### PLOTTING #######

date_count %>% 
  subset(select = c(Date, accidents, mean_sev)) %>% 
  mutate(
    Year = as.numeric(format(date_count$Date, "%Y"))
  ) %>% 
  mutate(
    wday = wday(Date, label=TRUE) #get weekdays
  ) %>% 
  group_by(wday,Year) %>% 
  dplyr::summarise(
    accidents = sum(accidents),
    mean_sev = mean(mean_sev)
  ) %>% 
  ungroup() %>% #plot starts below
  hchart(type= "bar", hcaes(x=wday, y=accidents, group = Year)) %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Total Accidents by Day of Week",
           align = "left") %>% 
  hc_yAxis(title = list(text = "# of Accidents")) %>% 
  hc_xAxis(title = list(text = "")) %>% 
  hc_tooltip(formatter = JS("function(){
                            return (' Year: ' + this.point.Year + '<br>  Day of Week: ' + this.point.wday + ' <br> # Accidents: ' + this.point.accidents + ' <br> Mean Severity: ' + this.point.mean_sev.toFixed(2))}"))

You can de-select anything in the legend to get a closer look at the years.

Here we can see a significant drop in the number of accidents in 2016, this is due to not having the complete year. Remember, this dataset starts on July 14, 2016.

Despite this, we still see a massive drop in accidents occuring on a Saturday or a Sunday.

Lets dive into the text data to see if there are any indcations to why there is a massive drop on Saturday and Sunday.

####### FUNCTION TO GRAB HIGHWAYS #######
get_hwy <- function(data){
  new_data <- data.frame()
  #get temp year
  temp <- data
    
    #seperate text data into words, by weekday
  new_data <- stack(tapply(temp$Description, temp$wday, function(x) scan(text=x, what=''))) %>% 
      mutate(
        values = ifelse(grepl(values, pattern="\\-"), removePunctuation(values, preserve_intra_word_dashes = TRUE), NA)
      ) %>% 
      mutate( #remove blank values
        values = ifelse(values == "", NA, values)
      ) %>% 
      subset(!is.na(values)) %>% #delete NA's
      group_by(ind) %>% 
      mutate( #toupper all letters
        values = toupper(values)
      ) %>% 
      dplyr::rename( #rename
        "words" = values,
        "wday" = ind
      ) %>% 
      bind_rows(new_data) #bind back
  
  return(new_data)
}


#get text data
text_wday <- raw %>% 
  subset(select = c(wday, Description, Start_Date)) %>% 
  mutate(
    year = as.numeric(format(as.Date(raw$Start_Date,format = "%Y-%m-%d"), "%Y"))
  ) %>% 
  subset(select = -c(Start_Date)) %>% 
  get_hwy()

#subset into week
wk = c("Mon", "Tue", "Wed", "Thu", "Fri")
week <- data.frame()
for(day in wk){
  week <- text_wday %>% 
  subset(wday == day) %>% 
    bind_rows(week)
}
week <- week %>% 
  subset(select = words) %>% 
  dplyr::count(words) %>% 
  arrange(-n) %>% 
  mutate(
    "weekday hwy" = words,
    "weekday count" = n,
    rank = 1:nrow(.)
  ) %>% 
  subset(select = c("weekday hwy", "weekday count", "rank"))

#subset into weekend
wkd = c("Sat", "Sun")
weekend <- data.frame()
for(day in wkd){
  weekend <- text_wday %>% 
  subset(wday == day) %>% 
    bind_rows(weekend)
}
weekend <- weekend %>% 
  subset(select = words) %>% 
  dplyr::count(words) %>% 
  arrange(-n) %>% 
  mutate(
    "weekend hwy" = words,
    "weekend count" = n,
    rank = 1:nrow(.)
  ) %>% 
  subset(select = c("weekend hwy", "weekend count", "rank"))

#combining datasets for a data table
week %>% 
  slice(1:80) %>% 
  bind_cols(weekend %>% 
              slice(1:80)) %>% 
  subset(select = -c(rank1)) %>% 
  datatable()

This is a lookup table, you can use this to search through the top 80 ranks!

These are the top 10 accident prone highways for the weekdays and the weekend!

We can see that I-35 is, by far, the worst highway in texas. Besides I-35, the only highway that stays on the same rank is I-45.

Every other highway tends to change ranks, which could suggest that certain highways get proportionally less traffic on the weekdays, as opposed to the weekends, or vice versa!

Interactive Map of Accidents in 2019

Data Preparation

Some transformation is required to get dates and times in a more managable state as well as adding in a duration column. The years are filtered on just 2019 to get an overview of accidents for the year and to prevent crashes to to volume, coordinate columns are renamed and specific columns are selected to display later in a label on the map.

tx_accidents_2019 <- raw %>%
  mutate(start = Start_Date,
    end = End_Date,
    Start_Time = hms(Start_Time),
    End_Time = hms(End_Time),
    Duration.minutes = End_Time - Start_Time)%>% 
  filter(start >= "2019-01-01") %>%
  rename(Longitude = Start_Lng, Latitude = Start_Lat) %>%
  select(start, Severity, Duration.minutes, Sunrise_Sunset, Weather_Condition, 
         Temperature.F., Description, Longitude, Latitude, end)

Map of Texas Accidents in 2019

Here is an interactive map of Accidents in 2019 that occur within Texas. Data points are clustered together based off the level of Severity.

Feel free to play around with some of the controls.

You can do the following:

Zoom In and Out with your Scroll Wheel
Click on Specific Clusters of Data Points to Zoom In
Use the Filters on the Right to Display Different Levels of Severity

# Set color palette 
pal <- colorBin("YlOrRd", domain = tx_accidents_2019$Severity, 1:4)

# Creating groups of individual severity levels for filtering
groups <- as.character(sort(unique(tx_accidents_2019$Severity), decreasing = TRUE))

# Creating html labels for map
labs <- lapply(seq(nrow(tx_accidents_2019)), function(i) {
  paste0(
    'Date: ', tx_accidents_2019[i, "start"], '<br/>',
    'Severity: ', tx_accidents_2019[i, "Severity"], '<br/>',
    'Duration: ', tx_accidents_2019[i, "Duration.minutes"], ' minutes.<br/>',
    'Time of Day: ', tx_accidents_2019[i, "Sunrise_Sunset"], '<br/>',
    'Weather Condition: ', tx_accidents_2019[i, "Weather_Condition"], '<br/>',
    'Tempurature: ', tx_accidents_2019[i, "Temperature.F."], ' F<br/>',
    tx_accidents_2019[i, "Description"])
})

# Create a df for the labels
labeldf <- data.frame(Severity =  tx_accidents_2019$Severity, Labels = unlist(labs), stringsAsFactors = FALSE)

# Create a map from leaflet to build from
map <- leaflet(tx_accidents_2019, options = leafletOptions(minZoom = 5, maxZoom = 18)) %>%
  setView(lng = -99.9018, lat = 31.9686, zoom = 5) %>%
  setMaxBounds(-109.9018, 37.9686, -89.9018, 24.9686) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addLayersControl(overlayGroups = groups,
                   options = layersControlOptions(collapsed = TRUE))

# Create layers for each severity level to be filtered
for (x in groups) {
  indiGroup = tx_accidents_2019[tx_accidents_2019$Severity == x,]
  map <- map %>%
    addCircleMarkers(data = indiGroup,
    clusterOptions = markerClusterOptions(),
    ~Longitude, ~Latitude,
    group = x,
    radius = 5,
    color = ~pal(Severity),
    stroke = FALSE,
    fillOpacity = 0.5,
    label = lapply(labeldf[labeldf$Severity == x,]$Labels, htmltools::HTML)
  )
}

# Display graph
map

Conclusion

While occurences of accidents is seemingly independent of weather, temperature, duration, and severity, it is clear to see that accidents are more likely to happen on major highways. Using this it is clear to see there is a underlying pattern in what highways are more accident prone, or less accident prone, based on the day of the week. This could help the surrounding cities better plan for future accidents, and could help reduce response time!

Challenges

Since most of the accidents took place on highways, the categorical columns that had information on stop signs or trafic lights were not as useful when attempting to model accidents across Texas.
Severity, while calculated on how much an accident disrupted traffic flow, had no other context of how many injuries or deaths, which could have added additional meaning to the data.
Descriptions generally had little context other than to describe where an accident occured and if a road was closed/how many lanes were blocked. However, it was possible to extract the particular highways that are most effected.
It could be helpful to have a column that listed a cause for the accident, whether it be weather/visibilty conditions, alcohol or drug-use related, or purely accidental. This type of data could assist in forecasting accidents for each category.

Additional Work to be Done

Creation of a Shiny Dashboard with hierarchical levels of granularity and interactivity to filter out specific types of accidents.
Changes to the interactive map to add more filters (Months, Years, Highways vs Roads), display different information at various zoom levels (County clustered accidents, City Clustered Accidents, Road-specific Accidents), and to facilitate those additions, a geojson file containing the shapes of the Counties/Cities as well as data from out dataset that could be displayed in a simliar labelling fashion.

Texas Accidents Group Analysis

Alexander Kahanek and Austyn McLaughlin

4/21/2020

Welcome

Data Source

Acknowledgments

A Quick Overview

Overview of Severities

Overview of Counties

Overview of Cities

Complete Timeline

Pearsons r and r²

Analysis of Weekdays

Interactive Map of Accidents in 2019

Data Preparation

Map of Texas Accidents in 2019

Conclusion

Challenges

Additional Work to be Done

Texas Accidents Group Analysis

Alexander Kahanek and Austyn McLaughlin

4/21/2020

Welcome

Data Source

Acknowledgments

A Quick Overview

Overview of Severities

Overview of Counties

Overview of Cities

Complete Timeline

Pearsons r and r2

Analysis of Weekdays

Interactive Map of Accidents in 2019

Data Preparation

Map of Texas Accidents in 2019

Conclusion

Challenges

Additional Work to be Done

Pearsons r and r²