Data Gathering
Data gathering is the first step in the data science pipeline, and it can be challenging and time-consuming. It involves collecting data from sources such as databases, APIs, surveys, or web scraping. The quality and relevance of the collected data greatly affect the success of the project, so the data must be accurate, complete, and relevant to the research question being addressed. Data gathering can also be limited by the availability and accessibility of data, which is a real obstacle for some projects. Even so, effort invested in data gathering at the start benefits every later stage of the project.
I am using data from several sources for this project. The sources fall into two major groups, static datasets and an API, which are listed and described below:
Static Data
A Countrywide Traffic Accident Dataset (2016 - 2021)
About this Dataset:
Description
This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to December 2021, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about 2.8 million accident records in this dataset. Check here to learn more about this dataset.
Acknowledgements
Please cite the following papers if you use this dataset:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
Content
This dataset has been collected in real time, using multiple traffic APIs. Currently, it contains accident data collected from February 2016 to December 2021 for the contiguous United States.
Inspiration
US-Accidents can be used for numerous applications such as real-time car accident prediction, studying car accidents hotspot locations, casualty analysis and extracting cause and effect rules to predict car accidents, and studying the impact of precipitation or other environmental stimuli on accident occurrence. The most recent release of the dataset can also be useful to study the impact of COVID-19 on traffic behavior and accidents.
Usage Policy and Legal Disclaimer
This dataset is distributed only for research purposes, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By downloading it, you agree to use this data only for non-commercial, research, or academic applications. You may need to cite the above papers if you use this dataset.
Data Attributes
The data is provided as a CSV file. The following table describes the data attributes (visit the paper to learn more about these attributes and how they were obtained):
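Because the data ships as a single CSV file with millions of rows, a first look at it can be done by streaming the file instead of loading it all at once. The sketch below is only an illustration: the helper name `count_by_column`, the column names `ID`, `State`, and `Severity`, and the three sample rows are assumptions made for the example, not values taken from the dataset.

```python
import csv
import io
from collections import Counter
from typing import IO


def count_by_column(csv_file: IO[str], column: str) -> Counter:
    """Stream a CSV file and count how often each value of `column` occurs."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    counts: Counter = Counter()
    for row in reader:  # one row at a time, so memory use stays small
        counts[row[column]] += 1
    return counts


# Tiny made-up sample standing in for the real file (assumed column names).
sample = io.StringIO("ID,State,Severity\nA-1,OH,2\nA-2,OH,3\nA-3,CA,2\n")
state_counts = count_by_column(sample, "State")
print(state_counts.most_common())
```

With the real file, the `io.StringIO` object would be replaced by a handle from `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.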
Traffic crash records on city streets within the City of Chicago.
About this Dataset:
Description
Crash data shows information about each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of the Chicago Police Department (CPD). Data are shown as is from the electronic crash reporting system (E-Crash) at CPD, excluding any personally identifiable information. Records are added to the data portal when a crash report is finalized or when amendments are made to an existing report in E-Crash. Data from E-Crash are available for some police districts in 2015, but citywide data are not available until September 2017. About half of all crash reports, mostly minor crashes, are self-reported at the police district by the driver(s) involved, and the other half are recorded at the scene by the police officer responding to the crash. Many of the crash parameters, including street condition data, weather condition, and posted speed limits, are recorded by the reporting officer based on the best available information at the time, but many of these may disagree with posted information or other assessments on road conditions. If any new or updated information on a crash is received, the reporting officer may amend the crash report at a later time. Traffic crashes within the city limits for which CPD is not the responding police agency, typically crashes on interstate highways, freeway ramps, and local roads along the City boundary, are excluded from this dataset.
As per Illinois statute, only crashes with a property damage value of $1,500 or more or involving bodily injury to any person(s) and that happen on a public roadway and that involve at least one moving vehicle, except bike dooring, are considered reportable crashes. However, CPD records every reported traffic crash event, regardless of the statute of limitations, and hence any formal Chicago crash dataset released by the Illinois Department of Transportation may not include all the crashes listed here.
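The reportability rule above can be written as a small predicate. This is a sketch under stated assumptions: the `Crash` class and its field names are hypothetical and do not correspond to columns in the Chicago dataset, and "except bike dooring" is read here as waiving the moving-vehicle requirement for dooring crashes.

```python
from dataclasses import dataclass

DAMAGE_THRESHOLD_USD = 1500  # property damage threshold quoted in the text


@dataclass
class Crash:
    # Hypothetical fields, named for illustration only.
    damage_usd: float
    has_injury: bool
    on_public_roadway: bool
    moving_vehicle_involved: bool
    is_bike_dooring: bool = False


def is_reportable(crash: Crash) -> bool:
    """Apply the reportable-crash rule as described in the paragraph above."""
    serious_enough = crash.damage_usd >= DAMAGE_THRESHOLD_USD or crash.has_injury
    # Assumption: a bike-dooring crash counts even with no moving vehicle.
    vehicle_condition = crash.moving_vehicle_involved or crash.is_bike_dooring
    return serious_enough and crash.on_public_roadway and vehicle_condition


print(is_reportable(Crash(2000, False, True, True)))   # damage above threshold
print(is_reportable(Crash(300, False, True, True)))    # minor, no injury
```

Since CPD records every reported crash, a filter like this would be one way to approximate which rows of the Chicago data might also appear in a state-level release; it is not a documented procedure.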
Data Owner
Chicago Police Department
Data Attributes
The following table describes the data attributes:
US Traffic Fatalities and Fatality Rates by State
Breakdown by Licensed Drivers and Fatalities per state, 2016
About this Dataset:
Description
This is a public dataset that was released by the United States Department of Transportation's National Highway Traffic Safety Administration (NHTSA). This data shows traffic fatalities and the fatality rates based on population, licensed drivers and registered vehicles.
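A fatality rate of this kind is simply a count of fatalities scaled by a base quantity such as population, licensed drivers, or registered vehicles. The sketch below shows the arithmetic only; the function name, the per-100,000 scale, and the numbers in the usage lines are illustrative assumptions, not figures from the NHTSA dataset.

```python
def fatality_rate(fatalities: int, base: int, per: int = 100_000) -> float:
    """Return fatalities per `per` units of `base` (e.g. per 100,000 residents)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return fatalities * per / base


# Made-up numbers for a hypothetical state, for illustration only.
fatalities = 500
population = 10_000_000
licensed_drivers = 8_000_000

print(fatality_rate(fatalities, population))        # per 100,000 residents
print(fatality_rate(fatalities, licensed_drivers))  # per 100,000 licensed drivers
```

Normalizing by different bases matters because a state can rank high on one rate and low on another, for example when it has many registered vehicles relative to its population.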
Acknowledgement
This dataset was provided by the National Highway Traffic Safety Administration.
API
Fatal car crashes for 2015-2016
About this Dataset:
Description
This is a public dataset created by the United States Department of Transportation's National Highway Traffic Safety Administration (NHTSA) using the Fatality Analysis Reporting System (FARS). It includes about 40 separate tables that describe numerous aspects of traffic accidents that resulted in fatalities, such as the types of cars and roads, the maneuvers that preceded the accident, and the involvement of pedestrians and cyclists.
The dataset is hosted in Google BigQuery, a fully managed enterprise data warehouse for managing and analyzing large and complex data.
The Fatality Analysis Reporting System (FARS) was created in the United States by the National Highway Traffic Safety Administration (NHTSA) to provide an overall measure of highway safety, to help suggest solutions, and to help provide an objective basis to evaluate the effectiveness of motor vehicle safety standards and highway safety programs.
FARS contains data on a census of fatal traffic crashes within the 50 States, the District of Columbia, and Puerto Rico. To be included in FARS, a crash must involve a motor vehicle traveling on a trafficway customarily open to the public and result in the death of a person (occupant of a vehicle or a non-occupant) within 30 days of the crash. FARS has been operational since 1975 and has collected information on over 989,451 motor vehicle fatalities. It records more than 100 different coded data elements that characterize the crash, the vehicle, and the people involved.
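The inclusion rule can be expressed as a simple check. The following is a minimal sketch, assuming that "within 30 days" means an elapsed time of at most 30 × 24 hours measured from the crash; the function and parameter names are hypothetical and are not FARS field names.

```python
from datetime import datetime, timedelta
from typing import Optional

FARS_WINDOW = timedelta(days=30)  # assumed reading of "within 30 days"


def meets_fars_criteria(
    crash_time: datetime,
    death_time: Optional[datetime],
    motor_vehicle_involved: bool,
    on_public_trafficway: bool,
) -> bool:
    """Check the inclusion rule described above for a single crash."""
    if death_time is None:
        return False  # no fatality, so the crash is out of scope
    if not (motor_vehicle_involved and on_public_trafficway):
        return False
    elapsed = death_time - crash_time
    return timedelta(0) <= elapsed <= FARS_WINDOW


crash = datetime(2016, 3, 1, 14, 30)
print(meets_fars_criteria(crash, crash + timedelta(days=10), True, True))
print(meets_fars_criteria(crash, crash + timedelta(days=45), True, True))
```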
FARS is vital to the mission of NHTSA to reduce the number of motor vehicle crashes and deaths on the nation's highways, and subsequently, reduce the associated economic loss to society resulting from those motor vehicle crashes and fatalities. FARS data is critical to understanding the characteristics of the environment, trafficway, vehicles, and persons involved in the crash.
NHTSA has a cooperative agreement with an agency in each state government to provide information in a standard format on fatal crashes in the state. Data is collected, coded, and submitted into a micro-computer data system and transmitted to Washington, D.C. Quarterly files are produced for analytical purposes to study trends and evaluate the effectiveness of highway safety programs.
Querying BigQuery tables
We can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. The query output then has to be reshaped into the desired format for further use. Tables are addressed as bigquery-public-data.nhtsa_traffic_fatalities.[TABLENAME].
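A query against one of these tables might look like the sketch below. It assumes the `google-cloud-bigquery` package is installed and that credentials are available in the environment; the table name `accident_2016` is an assumption used for illustration, and the helper names `table_path`, `preview_sql`, and `preview_table` are hypothetical.

```python
import re
from typing import Any, Dict, List

PROJECT = "bigquery-public-data"
DATASET = "nhtsa_traffic_fatalities"


def table_path(table_name: str) -> str:
    """Build the fully qualified table path from the pattern given above."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return f"{PROJECT}.{DATASET}.{table_name}"


def preview_sql(table_name: str, limit: int = 5) -> str:
    """Return a read-only query that fetches the first few rows of a table."""
    return f"SELECT * FROM `{table_path(table_name)}` LIMIT {int(limit)}"


def preview_table(table_name: str, limit: int = 5) -> List[Dict[str, Any]]:
    """Run the preview query and convert each result row to a plain dict."""
    # Imported lazily so the helpers above work without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up the environment's default credentials
    rows = client.query(preview_sql(table_name, limit)).result()
    return [dict(row.items()) for row in rows]


print(preview_sql("accident_2016"))  # assumed table name, for illustration
```

Calling `preview_table("accident_2016")` would then return a list of dictionaries, which is one simple way to reshape the output for further use.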
Acknowledgement
This dataset was provided by the National Highway Traffic Safety Administration.
BigQuery API
The BigQuery API enables a user or a group of users to create, analyze, share, and manage complex datasets. Check here to learn more about the BigQuery API.