- zahrahosseini1
- Oct 5, 2020
Updated: Nov 14, 2020

Image copied from https://dailyhive.com/montreal/montreal-traffic-study-february-2017?auto=true (Image: FOTOimage Montreal/ Shutterstock)
Introduction
This post presents a machine learning approach to predicting the severity of a collision, given that a collision occurs. Such predictions can inform road development plans or changes to traffic regulations that lower the likelihood of collisions with severe outcomes. They could also be used in map applications to warn users about routes where severe collisions are more likely and to suggest safer alternative routes or other means of commuting.
The car collision dataset for the city of Montreal was downloaded from: https://open.canada.ca/data/en/dataset/cd722e22-376b-4b89-9bc2-7c7ab317ef6b
The primary language of the dataset is French, and it includes 190,552 collision records from 2012 to 2019. The dataset has 68 columns providing information about data quality, collision severity, date, hour, weather and road conditions at the time of the collision, geographical location, and several other parameters.
For this project, only the data from 2019 was used, as it was deemed the most relevant for future predictions.
Definition of Severity
Several columns in the dataset relate to collision severity, such as the number and type of vehicles involved and the number of seriously or slightly injured people. For this project, the column ‘GRAVITE’ is used to categorize collision severity. Data entries fall into 5 categories based on the labels in this column:
1. Minor Property Damage: No casualties, and the damage assessment is lower than or equal to the reporting threshold of $2,000
2. Major Property Damage: No casualties, and the damage assessment is above the reporting threshold of $2,000
3. Non-Hospitalized Injury: One or more victims slightly injured, and none seriously injured or killed (injuries not requiring hospitalization, even if they require treatment from a doctor or in a hospital center)
4. Hospitalized Injury: No fatalities and at least one victim seriously injured (injuries requiring hospitalization, including those for which the person remains under observation in hospital)
5. Fatal: At least one victim died within 30 days of the accident
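In this project, the categories involving any type of injury (3, 4 and 5) were treated as severe and merged into a single class. This leaves three classes for the model: minor property damage, major property damage, and injury.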
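The grouping of the five categories can be sketched as a simple lookup. This is only an illustration: the numeric codes 1 to 5 follow the list above and are an assumption, since the actual ‘GRAVITE’ column holds text labels that would first need to be mapped to these codes. The class names and the helper to_model_class are hypothetical.

```python
# Assumed numeric codes 1-5, following the numbered list of categories.
# The real GRAVITE column holds text labels that would need mapping first.
SEVERITY_CLASS = {
    1: "minor_property_damage",
    2: "major_property_damage",
    3: "injury",  # non-hospitalized injury
    4: "injury",  # hospitalized injury
    5: "injury",  # fatal
}

def to_model_class(category: int) -> str:
    """Collapse one of the five severity categories into a model class."""
    return SEVERITY_CLASS[category]

categories = [1, 2, 3, 4, 5, 2]  # made-up sample of category codes
classes = [to_model_class(c) for c in categories]
print(classes)
```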

Feature Selection
The table below lists the dataset columns used for feature extraction, with a description of each. The third column of the table summarizes the processing steps and the extracted features that were fed to the machine learning model.

Modelling
The dataset was split into three sets:
1. Training: This set was used to fit the model parameters. 60% of the samples were used for the training set.
2. Test: This set was used to tune the model hyperparameters for maximum accuracy (a role commonly called a validation set). 20% of the samples were used for the test set.
3. Evaluation: This set was held out to evaluate the accuracy of the final model. 20% of the samples were used for evaluation.
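A 60/20/20 split like the one listed above can be sketched in a few lines. The post does not say how the samples were assigned to each set, so the random shuffle, the fixed seed, and the function name three_way_split are assumptions made for illustration. Libraries such as scikit-learn offer train_test_split for the same purpose.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into training, test and evaluation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(fractions[0] * len(shuffled))
    n_test = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, evaluation

rows = list(range(1000))  # stand-in for the 2019 collision records
train, test, evaluation = three_way_split(rows)
print(len(train), len(test), len(evaluation))
```

Keeping the evaluation set untouched until the very end means its accuracy is not influenced by the tuning step.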
Two models were used: logistic regression and a decision tree. For the logistic regression, the optimal value of C, the inverse of the regularization strength, was found to be 0.6. For the decision tree, the optimal number of layers (the maximum tree depth) was found to be 6, as shown in the figures below.

Model accuracy vs model parameter for the logistic regression (left) and decision tree (right)
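The tuning behind these curves amounts to trying each candidate setting and keeping the one with the highest accuracy on the tuning set. The sketch below is model-agnostic and is not the code used for this project: pick_best is a hypothetical helper, and toy_accuracy is a made-up stand-in for "fit the model with this setting on the training set and score it on the test set".

```python
def pick_best(candidates, score_fn):
    """Return (best_value, best_score) for the highest-scoring candidate."""
    scored = [(score_fn(value), value) for value in candidates]
    best_score, best_value = max(scored, key=lambda pair: pair[0])
    return best_value, best_score

# Made-up scoring curve that peaks at depth 4. A real score_fn would fit
# a model with the given setting and return its accuracy on the tuning set.
def toy_accuracy(depth):
    return 1.0 - abs(depth - 4) / 10

best_depth, best_score = pick_best(range(1, 11), toy_accuracy)
print(best_depth, best_score)
```

The same loop works for C in logistic regression by passing a list of candidate C values instead of depths.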
Results
The decision tree and the logistic regression reach similar accuracy. Both models predict collisions with minor property damage and collisions with injury reasonably well, but they are less reliable at predicting collisions with major property damage. The overall accuracy of both models is 0.53.
The logistic regression model was then refit with third-degree polynomial features added. The largest improvement is seen for collisions with a major property damage outcome, where recall rises from 0.22 to 0.26. The overall accuracy, however, remains almost unchanged.
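For reference, overall accuracy and per-class recall (the share of each class's true cases that the model recovers) can be computed directly from true and predicted labels. The labels below are made up purely to show the calculation and are not results from this project; the function names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels, for illustration only.
y_true = ["minor", "minor", "major", "major", "injury", "injury"]
y_pred = ["minor", "minor", "minor", "major", "injury", "minor"]

overall = accuracy(y_true, y_pred)
recalls = recall_per_class(y_true, y_pred)
print(overall, recalls)
```

Looking at recall per class rather than accuracy alone is what reveals a weak class hidden behind a reasonable overall score.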
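To make the idea of third-degree polynomial features concrete, the sketch below expands a feature vector into all monomials of degree 1 to 3. It is a hand-rolled illustration, not the code used for this project, and polynomial_features is a hypothetical name. scikit-learn's PolynomialFeatures does the same kind of expansion and, by default, also adds a constant bias column, which this sketch omits.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(x, degree=3):
    """Return all monomials of the inputs from degree 1 up to `degree`."""
    features = []
    for d in range(1, degree + 1):
        # Each combination of indices (with repeats) is one monomial.
        for combo in combinations_with_replacement(range(len(x)), d):
            features.append(prod(x[i] for i in combo))
    return features

# Two inputs a=2, b=3 give: a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3
expanded = polynomial_features([2.0, 3.0], degree=3)
print(expanded)
```

The number of features grows quickly with the number of inputs and the degree, so the expanded model has far more parameters to fit than the linear one.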

Discussion and Concluding Remarks
Two machine learning algorithms, logistic regression and a decision tree, were used to predict the severity of a collision given circumstances such as weather and road conditions, the hour, day and month of the accident, road type and configuration, and other features. Both models performed similarly, with an acceptable accuracy of around 0.6 for predicting light collisions (property damage below $2,000 and no injury) and more severe collisions involving injury. However, prediction quality drops for collisions with an intermediate outcome: no injury, but damage above $2,000.
One reason for this poor performance could be the wide range of collisions flagged as ‘Major Property Damage’. For example, a collision with damage slightly above the reporting threshold (e.g., $2,500) and a more serious collision with much higher damage but no injury are both lumped into this category. If this is the case, a more refined classification is required to separate collisions with more severe outcomes from those with lighter consequences within the Major Property Damage category. One approach could be to combine the severity labels with other pieces of information, such as the number and type of vehicles involved in the crash, to refine the collision severity classes further. Further improvements might also be possible with a more sophisticated machine learning algorithm, such as a neural network. These investigations are left as future work beyond the scope of this capstone project.