Sheldon Sebastian
The costs of fatalities and injuries due to traffic accidents have a great impact on society. This project explores deaths caused by car accidents in the DC area. A car crash dataset was analyzed, and the target FATAL class was found to be severely imbalanced. To predict fatalities caused by car accidents, Logistic Regression, Decision Tree, Random Forest, and XGBoost models, along with an ensemble of the best models, were used for classification. The performance of the models is compared, and feature importance is reported for each model.
Keywords: Binary Classification, Class Imbalance, Logistic Regression, Decision Tree, Random Forest, XGBoost, Ensembling
Workflow
Folder Structure
Performance Comparison
Model Feature Importance
Conclusion
Road traffic crashes are one of the world’s largest public health and injury prevention problems. The problem is all the more acute because the victims are overwhelmingly healthy before their crashes. This project focuses on predicting fatalities caused by car crashes in DC. If we can predict a person’s risk of fatality from the input features, emergency services can be dispatched with high priority. This could reduce the death rate and help save lives.
This project can help emergency agencies and services dispatch an appropriate response based on the risk of fatality for individuals in a car crash.
To predict fatality for a person in a car crash, the dataset was downloaded from the Metropolitan Police Department’s Crash Data Management System. It can be accessed here.
Data dictionary:
Column Name | Description | Notes |
---|---|---|
OBJECTID | Unique Identifier to the dataset | |
CRIMEID | Foreign key to external dataset | |
CCN | Foreign key to external dataset | |
PERSONID | Unique Identifier for person | |
PERSONTYPE | Type of participant | Possible values are: Passenger, Driver, Pedestrian, Bicyclist |
AGE | Age of participants | Numeric column |
MAJORINJURY | Participant suffered Major Injuries | Yes-No |
MINORINJURY | Participant suffered Minor Injuries | Yes-No |
VEHICLEID | Unique Identifier for vehicle | |
INVEHICLETYPE | Type of vehicle | Passenger car, large truck, taxi, government, bicycle, pedestrian, etc |
TICKETISSUED | Whether the person was issued a ticket | Yes-No |
LICENSEPLATESTATE | For a vehicle, the state (jurisdiction) that issued the license plate (not the license plate number) | 50 States |
IMPAIRED | Whether the person was deemed ‘impaired’ | Yes-No |
SPEEDING | Was person in vehicle where speeding was indicated | Yes-No |
FATAL | Fatality of the person involved in car crash | This is the TARGET variable and is binary. Yes-No |
The following tasks were performed in this project:
The following diagram describes the folder structure:
The model_trainer folder contains the following folders:
The notebook detailing data preprocessing and cleaning can be accessed here.
The original dataset had 596381 rows. The distribution of the FATAL column: N: 595964, Y: 417
The following data cleaning steps were performed:
INVEHICLETYPE column: It originally had 21 levels with varying frequency. They were combined into 9 levels as follows:
[‘Passenger Car’, ‘Other Vehicle’, ‘Suv (sport Utility Vehicle)’, ‘Large/heavy Truck’, ‘Cargo Van’, ‘Other Small/light Truck’, ‘Bus’, ‘2 wheeler’, ‘Pickup Truck’]
AGE column: The numeric age column was converted into categorical with the following levels:
[0-17, 18-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90]
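The AGE binning above could be reproduced roughly as follows. This is a minimal sketch, not the notebook's actual code: the bin edges, the inclusive boundaries, and the helper name `bin_age` are assumptions chosen to match the listed levels.

```python
import pandas as pd

# Assumed bin edges: each label covers (previous edge, current edge],
# with the first bin also including age 0.
AGE_EDGES = [0, 17, 20, 30, 40, 50, 60, 70, 80, 90]
AGE_LABELS = ["0-17", "18-20", "21-30", "31-40", "41-50",
              "51-60", "61-70", "71-80", "81-90"]


def bin_age(age: pd.Series) -> pd.Series:
    """Convert numeric ages to the categorical levels listed above.

    Ages outside 0-90 fall outside every bin and become missing values.
    """
    return pd.cut(age, bins=AGE_EDGES, labels=AGE_LABELS, include_lowest=True)


# Hypothetical sample ages, including one out-of-range value (120).
ages = pd.Series([0, 17, 18, 25, 90, 120])
binned = bin_age(ages)
print(binned)
```

Out-of-range ages turning into missing values is a design choice of this sketch; how the original notebook treated such rows is not stated here.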
After cleaning, the dataset had 8181 rows. The distribution of the FATAL column: N: 8008, Y: 173
The detailed code of the EDA can be found here. After the data was cleaned, the distributions of the feature and target columns were plotted.
1. FATAL:
This is the target column that we are going to predict. There is a severe class imbalance: only 173 fatal encounters versus 8008 non-fatal encounters.
2. PERSONTYPE:
The distribution of the PERSONTYPE column is shown below. The majority of values are Driver.
3. AGE:
The distribution of the AGE column is shown below. The majority of participants are in the 21-30 range.
4. MAJORINJURY:
The distribution of MAJORINJURY column is shown below.
5. MINORINJURY:
The distribution of MINORINJURY column is shown below.
6. INVEHICLETYPE:
The distribution of the INVEHICLETYPE column is shown below. The majority of vehicles are passenger cars.
7. TICKETISSUED:
The distribution of TICKETISSUED column is shown below.
8. LICENSEPLATESTATE:
The distribution of the LICENSEPLATESTATE column is shown below. The majority of values are DC.
9. IMPAIRED:
The distribution of IMPAIRED column is shown below.
10. SPEEDING:
The distribution of SPEEDING column is shown below.
The detailed code of the Train-Validation-Test split can be found here. The data was split into train, validation, and test sets in a stratified manner, so that each set keeps the same proportion of FATAL cases.
The scheme for splitting the data is: 60% Train - 20% Validation - 20% Test
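A stratified 60/20/20 split can be obtained with two calls to scikit-learn's `train_test_split`. The sketch below uses toy stand-in data; the variable names, the random seed, and the two-step approach are assumptions, not necessarily what the project notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned data: 1000 rows, 2% positive (FATAL) class.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 4))
y = np.array([1] * 20 + [0] * 980)

# Step 1: hold out 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20),
# leaving 60% of the original rows for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(y_train), len(y_val), len(y_test))
```

Passing `stratify` in both steps keeps the rare positive class at the same rate in all three sets, which matters when only a few hundred positive rows exist.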
The detailed code of handling missing data can be found here.
The train-validation-test data had the following percentage of missing data:
The missing data was imputed using the most_frequent strategy. The imputation statistics were computed on the training data only, and the same statistics were then applied to the validation and test data.
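The fit-on-train, apply-everywhere pattern can be sketched with scikit-learn's `SimpleImputer`. The tiny data frames below are invented for illustration; only the `most_frequent` strategy and the train-only fitting come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/validation frames with missing entries.
train = pd.DataFrame(
    {"SPEEDING": ["N", "N", "Y", np.nan], "IMPAIRED": ["N", np.nan, "N", "Y"]},
    dtype=object,
)
val = pd.DataFrame(
    {"SPEEDING": [np.nan, "Y"], "IMPAIRED": ["Y", np.nan]},
    dtype=object,
)

# Learn the most frequent value of each column from the training data only.
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training statistics on validation data; never refit on it.
val_imputed = pd.DataFrame(imputer.transform(val), columns=val.columns)

print(val_imputed)
```

Fitting on the training set alone avoids leaking information from the validation and test sets into the preprocessing step.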
The detailed code of encoding categorical data can be found here. The categorical feature data was encoded using One Hot Encoding, and the target data was encoded using Label Encoder.
After the data was cleaned, split, encoded, and imputed, the feature columns and the target column were separated for the train, validation, and test sets and stored as pickle files here.
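As a rough illustration of this encoding step, the sketch below applies `OneHotEncoder` to two feature columns and `LabelEncoder` to the target. The sample rows and the `handle_unknown="ignore"` setting are assumptions for the example, not details taken from the project code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical training rows using column names from the data dictionary.
X_train = pd.DataFrame(
    {"PERSONTYPE": ["Driver", "Passenger", "Pedestrian"], "SPEEDING": ["N", "Y", "N"]}
)
y_train = pd.Series(["N", "Y", "N"], name="FATAL")

# One binary column per category level; categories unseen during fit
# are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train).toarray()

# Map the target labels to integers: "N" -> 0, "Y" -> 1.
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)

print(X_train_enc.shape, list(label_encoder.classes_), y_train_enc)
```

As with imputation, the encoders would be fitted on the training split and then reused to transform the validation and test splits.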
Since the target FATAL class is extremely imbalanced, accuracy is not a suitable metric for evaluating the models.
For this project False Negatives are costlier than False Positives.
In this context, FALSE NEGATIVE is:
In this context, FALSE POSITIVE is:
Thus, we use the F2 score, which weights recall more heavily than precision, to evaluate our models.
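The F2 score is the F-beta score with beta = 2, that is, 5 * precision * recall / (4 * precision + recall). The sketch below, on invented labels, shows why accuracy is misleading on imbalanced data and checks the formula against scikit-learn's `fbeta_score`.

```python
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score

# Invented labels: 4 positive (FATAL) cases out of 100.
y_true = [1] * 4 + [0] * 96

# A model that never predicts FATAL looks accurate but finds no fatal cases.
y_never = [0] * 100
never_accuracy = accuracy_score(y_true, y_never)
never_f2 = fbeta_score(y_true, y_never, beta=2, zero_division=0)

# A model that catches 3 of the 4 fatal cases at the cost of 2 false alarms.
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 94
precision = precision_score(y_true, y_pred)  # 3 / (3 + 2)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)

# F-beta with beta = 2: recall counts four times as much as precision.
f2_manual = 5 * precision * recall / (4 * precision + recall)
f2 = fbeta_score(y_true, y_pred, beta=2)

print(never_accuracy, never_f2, round(f2, 4))
```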
The detailed code for Dummy Classifier can be found here.
Since False Negatives are costlier than False Positives, the dummy classifier always predicts the FATAL class. Based on this we get:
Model | F2 Score | Split |
---|---|---|
Dummy Classifier | 0.02584 | Validation |
This baseline model has no skill, and all trained models should achieve a higher F2 score than this.
The detailed code for Logistic Regression can be found here.
The following steps were performed for Logistic Regression:
The performance of the best model on validation data is:
Model | F2 Score | Split |
---|---|---|
Logistic Regression | 0.462046 | Validation |
The detailed code for Decision Tree Classifier can be found here.
The following steps were performed for Decision Tree Classifier:
The performance of the best model on validation data is:
Model | F2 Score | Split |
---|---|---|
Decision Tree Classifier | 0.397351 | Validation |
The detailed code for Random Forest Classifier can be found here.
The following steps were performed for Random Forest Classifier:
The performance of the best model on validation data is:
Model | F2 Score | Split |
---|---|---|
Random Forest Classifier | 0.416667 | Validation |
The detailed code for XGBoost can be found here.
The following steps were performed for XGBoost:
The performance of the best model on validation data is:
Model | F2 Score | Split |
---|---|---|
XGBoost | 0.432277 | Validation |
The detailed code for ensembling can be found here.
The top 3 models by validation score were combined using hard voting. They include:
The performance of the ensemble on validation data is:
Model | F2 Score | Split |
---|---|---|
Ensemble | 0.50793 | Validation |
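Hard voting means each member model casts one vote per sample and the majority class wins. A minimal sketch with scikit-learn's `VotingClassifier` follows. It makes several assumptions: the data is synthetic, the hyperparameters are library defaults rather than the tuned ones, and `GradientBoostingClassifier` stands in for XGBoost so the example needs only scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score

# Synthetic imbalanced data (about 10% positives), not the crash dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Three member models; "gb" is a stand-in for XGBoost (assumption).
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote on predicted labels, not on probabilities
)
ensemble.fit(X, y)
pred = ensemble.predict(X)

# Scored on the training data here purely to show the call.
print(fbeta_score(y, pred, beta=2))
```

With three members and two classes, a sample is labeled FATAL only when at least two models agree, which tends to smooth out the individual models' errors.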
All trained models were evaluated on the test dataset, and the F2 scores are as follows:
Model | F2 Score | Split |
---|---|---|
Dummy Classifier (Baseline) | 0.0265 | Test |
Logistic Regression | 0.4029 | Test |
Decision Tree Classifier | 0.3532 | Test |
Random Forest Classifier | 0.4078 | Test |
XGBoost Classifier | 0.3743 | Test |
Ensemble (LR + RFC + XGB) | 0.4302 | Test |
The Ensemble model was built from the 3 models with the highest F2 scores on the validation dataset. All models achieved a better F2 score than the Dummy Classifier.
The detailed code for model feature importance can be found here.
Feature importance for logistic regression is shown below:
Feature importance for decision tree is shown below:
Feature importance for random forest is shown below:
Feature importance for XGBoost is shown below:
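For readers who want to regenerate such rankings, one common approach is to read the absolute coefficients of a linear model and the `feature_importances_` attribute of tree-based models. The sketch below uses synthetic data and hypothetical one-hot column names; it is not the project's plotting code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical one-hot feature names; the toy target mostly follows the
# first column, with 10% of the labels flipped as noise.
feature_names = ["MAJORINJURY_Y", "SPEEDING_Y", "IMPAIRED_Y"]
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
flip = rng.random(500) < 0.1
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Linear model: the magnitude of each coefficient serves as importance.
lr = LogisticRegression(max_iter=1000).fit(X, y)
lr_importance = pd.Series(np.abs(lr.coef_[0]), index=feature_names)
lr_importance = lr_importance.sort_values(ascending=False)

# Tree ensemble: impurity-based importances, which sum to 1.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_importance = pd.Series(rf.feature_importances_, index=feature_names)
rf_importance = rf_importance.sort_values(ascending=False)

print(lr_importance)
print(rf_importance)
```

Coefficient magnitudes are comparable across features here only because all inputs are 0/1 indicators; with features on different scales they would need standardizing first.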
In conclusion, the best model was the hard-voting ensemble of Logistic Regression, Random Forest, and XGBoost. It can be used to predict the risk of fatality for a person in a car accident and to prioritize the deployment of emergency services.
Future Scope:
Future scope includes exploring feature engineering, adding external datasets, and trying different data sampling strategies to improve the overall performance of the models.