Traffic Volume Prediction
Sheldon Sebastian
Abstract
We are predicting the traffic volume per day for the I94 interstate. The traffic volume per day is the number of cars that use the I94 interstate between St. Paul and Minneapolis. To make accurate forecasts, six models were used: the Average model, Naïve model, Drift model, Holt-Winters model, Multiple Linear Regression, and ARMA. The performance of all models is compared and the best-performing model is recommended for forecasting traffic volume.
Keywords: Forecasting, Traffic, Average model, Naïve model, Drift model, Holt-Winters, ARMA, Linear Regression
Table of contents
- Introduction
- Data description and preprocessing
- Stationarity check
- Average Model
- Naive Model
- Drift Model
- Time series decomposition
- Holt-Winters method
- Multiple Linear Regression
- ARMA model
- Final Model Selection
- Conclusion
1. Introduction
We are predicting the number of cars per day on the I94 interstate between St. Paul and Minneapolis, as shown in the figure below:
Business value of project:
- Avoid Traffic Congestion: We can predict the days when there will be heavy traffic congestion and thus take contingencies to avoid them.
- Road Maintenance: Using the traffic volume predictions we can estimate how long before the road needs repairs, and we can schedule those repairs for when traffic volume is lowest.
For achieving the goal of predicting traffic volume, we consider six prediction models: Average, Naïve, Drift, Holt-Winters, ARMA, and Multiple Linear Regression.
In the average model, all future predictions are the average of the training data. In the naïve model, we predict all future values using the last value of the training dataset. In the drift model, we draw a line from the first point of the training data to the last point and extend it to predict all future values. In the Holt-Winters method, we determine whether the trend and seasonality of traffic volume are additive or multiplicative and then make predictions.
For the Linear Regression model, we scale the feature variables, perform data cleaning, and then make predictions. Finally, for the ARMA model, we estimate the order of the ARMA process using the GPAC table, estimate the ARMA parameters, and check whether the residuals pass the chi-square test.
Once all the models are created, we will compare the performance and recommend the best performing model.
2. Data description and preprocessing
The dataset has hourly traffic volume from October 2012 to September 2018. Traffic volume is defined as the count of cars in an hour on the interstate. As described previously, the hourly traffic volume is tracked between Minneapolis and St. Paul, MN.
The dataset is sourced from the following website:
https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
Resampling strategy
For computational purposes and model interpretability, the hourly data was resampled into daily data. We also focus on traffic volume data from September 2016 to September 2018.
When resampling, the following aggregation functions were applied to the variables:
- Mean: temp, clouds_all, traffic_volume, rain_1h, snow_1h.
- First: weather_main, holiday.
After resampling, the shape of the dataset is (731, 7).
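A minimal sketch of this resampling step, assuming the raw UCI CSV with its date_time column (the file name and the exact window boundaries are illustrative):

```python
import pandas as pd

# Load the hourly data and index it by timestamp (file name is illustrative).
df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv", parse_dates=["date_time"])
df = df.set_index("date_time")

# Restrict to the analysis window; the exact boundary dates are an
# assumption chosen to reproduce the (731, 7) shape reported above.
df = df.loc["2016-09-30":"2018-09-30"]

# Resample hourly observations to daily frequency: numeric readings are
# averaged, categorical columns keep the first value of each day.
daily = df.resample("D").agg({
    "temp": "mean",
    "clouds_all": "mean",
    "traffic_volume": "mean",
    "rain_1h": "mean",
    "snow_1h": "mean",
    "weather_main": "first",
    "holiday": "first",
})
print(daily.shape)  # (731, 7)
```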
Summary Statistics
The summary statistics for numeric columns are as follows:
We notice that the snow_1h column has all values equal to zero, thus we remove that column.
The summary statistics for categorical columns are as follows:
We notice that the holiday column has only 22 non-null values, thus we replace all the NaN values with “No Holiday”. After replacing the NaNs, the value counts for the holiday column are:
We also notice that the weather_main column contains 9 unique values which are:
We condense this information as follows:
- Rain additionally covers the values Drizzle, Thunderstorm
- Fog additionally covers the values Mist, Haze, Smoke
Thus, after condensing the value counts are as follows:
Finally, after resampling and data cleaning, the count of NaN values per column is:
Hence, we do not need to perform any data imputation.
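The cleaning steps described above can be consolidated into a short pandas sketch (continuing from the daily frame built earlier):

```python
# Drop snow_1h, which is zero for every day in this window.
daily = daily.drop(columns=["snow_1h"])

# Fill the missing holiday entries.
daily["holiday"] = daily["holiday"].fillna("No Holiday")

# Condense the rarer weather labels into broader buckets.
daily["weather_main"] = daily["weather_main"].replace({
    "Drizzle": "Rain", "Thunderstorm": "Rain",
    "Mist": "Fog", "Haze": "Fog", "Smoke": "Fog",
})

print(daily["holiday"].value_counts())
print(daily["weather_main"].value_counts())
print(daily.isna().sum())  # no NaN values should remain
```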
Traffic Volume over time
We plot the traffic volume over time; the data is resampled to daily frequency and spans 09/2016 to 09/2018.
ACF of traffic volume
For the ACF plot we set the number of lags to 200.
We notice that the ACF values are still decaying at lag 200.
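A sketch of this plot using statsmodels:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# ACF of the daily traffic volume for the first 200 lags.
plot_acf(daily["traffic_volume"], lags=200)
plt.show()
```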
Train-Test Split
We split the resampled data into train and test datasets. The dimensions of the train dataset are (584, 6), and the dimensions of the test dataset are (147, 6).
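Because the split must be chronological for time series data, it reduces to simple slicing:

```python
# First 584 days for training, remaining 147 days for testing.
train = daily.iloc[:584]
test = daily.iloc[584:]
print(train.shape, test.shape)  # (584, 6) (147, 6)
```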
3. Stationarity check
To check whether traffic volume is stationary, we perform the Augmented Dickey-Fuller (ADF) test.
From ADF test we conclude that traffic volume is stationary.
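A sketch of the ADF test with statsmodels; rejecting the unit-root null at the usual 0.05 level (an assumed threshold) indicates stationarity:

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (is non-stationary).
adf_stat, p_value, *_ = adfuller(daily["traffic_volume"])
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
# A p-value below 0.05 rejects the null, i.e. the series is stationary.
```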
4. Average Model
We compute the mean of the training data and use it as the forecast for all h steps over the test horizon.
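A minimal sketch of this forecast, together with the naïve and drift baselines used in the next two sections; the helper mirrors the metrics reported in the tables:

```python
import numpy as np

y_train = train["traffic_volume"].to_numpy()
y_test = test["traffic_volume"].to_numpy()
h = len(y_test)

# Average model: every h-step forecast is the training mean.
avg_forecast = np.full(h, y_train.mean())

# Naive model (section 5): repeat the last training observation.
naive_forecast = np.full(h, y_train[-1])

# Drift model (section 6): extend the line joining the first and last
# training points h steps into the future.
slope = (y_train[-1] - y_train[0]) / (len(y_train) - 1)
drift_forecast = y_train[-1] + slope * np.arange(1, h + 1)

def report(name, forecast):
    errors = y_test - forecast
    mse = np.mean(errors ** 2)
    print(f"{name}: RMSE={np.sqrt(mse):.3f}, MSE={mse:.3f}, "
          f"mean={errors.mean():.3f}, variance={errors.var():.3f}")

for name, f in [("Average", avg_forecast),
                ("Naive", naive_forecast),
                ("Drift", drift_forecast)]:
    report(name, f)
```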
The model statistics are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 531.918 | 282937.565 | 279859.840 | 55.477 |
The plot of the forecasted values along with the actual values is shown below:
The ACF of residuals is as follows:
We conclude that the ACF plot does not resemble white noise.
5. Naive Model
We take the last sample of the training data and use it as the forecast for all h steps over the test horizon.
The model statistics are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 1091.679 | 1191763.077 | 279859.840 | 954.936 |
The plot of the forecasted values along with the actual values is shown below:
The ACF of residuals is as follows:
We conclude that the ACF plot does not resemble white noise.
6. Drift Model
The performance measures for the Drift Model are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 1223.722 | 1497497.754 | 284349.208 | 1101.430 |
The plot of the forecasted values along with the actual values is shown below:
The ACF of residuals is as follows:
We conclude that the ACF plot does not resemble white noise.
7. Time series decomposition
We will decompose traffic volume to determine whether its trend and seasonality are additive or multiplicative.
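A sketch of both decompositions with statsmodels; the 7-day period is an assumption consistent with a weekly cycle in daily data:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

series = train["traffic_volume"]

# Additive: y = trend + seasonal + residual.
seasonal_decompose(series, model="additive", period=7).plot()
# Multiplicative: y = trend * seasonal * residual.
seasonal_decompose(series, model="multiplicative", period=7).plot()
plt.show()
```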
The Multiplicative time series decomposition plot is shown below:
The Additive time series decomposition plot is shown below:
We notice that the additive residuals have high variance, ranging from about -1000 to +1000, whereas the multiplicative residuals are all close to one.
Thus, the multiplicative decomposition best represents the traffic volume data; we see a strong seasonal component but no visible trend.
8. Holt-Winters method
Based on the time series decomposition, we configure the Holt-Winters parameters for predicting the test data: seasonality is set to multiplicative and trend to None.
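A sketch of this configuration with statsmodels' ExponentialSmoothing; the weekly seasonal period is an assumption carried over from the decomposition:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Multiplicative seasonality, no trend component.
hw = ExponentialSmoothing(
    train["traffic_volume"],
    trend=None,
    seasonal="mul",
    seasonal_periods=7,
).fit()
hw_forecast = hw.forecast(len(test))
```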
The performance measures for the Holt-Winters model are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 291.017 | 84690.827 | 84197.577 | -22.209 |
The plot of the Holt-Winters model predictions along with the actual values is shown below:
We notice from the above plot that the Holt-Winters predictions are close to the actual values.
ACF of residuals: We notice that the ACF plot for the Holt-Winters method resembles white noise.
9. Multiple Linear Regression
Linear Model with all Features:
We will now perform multiple linear regression; for this we need to scale the data and convert the categorical columns into numerical columns.
We scale the feature variables using sklearn.preprocessing's MinMaxScaler and then compute the MSE between the test data and the predicted values.
We convert the categorical values into numerical values using pandas' get_dummies(…) function.
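A sketch of this preprocessing together with the initial fit, using statsmodels' OLS so that the summary discussed below (F-test, t-tests, AIC, BIC) is directly available:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

# One-hot encode categoricals, then scale every feature to [0, 1].
features = pd.get_dummies(daily.drop(columns=["traffic_volume"]))
scaled = pd.DataFrame(MinMaxScaler().fit_transform(features),
                      index=features.index, columns=features.columns)

X_train, X_test = scaled.iloc[:584], scaled.iloc[584:]
y_train = daily["traffic_volume"].iloc[:584]
y_test = daily["traffic_volume"].iloc[584:]

# OLS with an intercept on all features.
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())
```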
The summary of linear model with all variables is:
Conclusions based on above summary:
F-test: The F-test passes since the Prob(F-statistic) is less than 0.05, and thus our model performs better than the null model.
AIC: The AIC value is 9028
BIC: The BIC value is 9111
T-test: Some variables fail the t-test; to fix this we drop those variables.
Linear Model after Feature Selection:
Since some variables fail the t-test in the all-variables model, we remove them and refit the linear model until all remaining variables pass the t-test.
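A sketch of this elimination loop; the exact drop order used in the report may differ, here we greedily remove the least significant feature each round:

```python
# Backward elimination: refit, dropping the feature with the largest
# p-value until every remaining feature passes the t-test at 0.05.
selected = list(X_train.columns)
while True:
    fit = sm.OLS(y_train, sm.add_constant(X_train[selected])).fit()
    pvalues = fit.pvalues.drop("const")
    if pvalues.max() < 0.05:
        break
    selected.remove(pvalues.idxmax())
print(fit.summary())
```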
The summary of linear model after feature selection is:
Conclusions based on above model:
F-test: The Prob(F-statistic) < 0.05, thus our model performs better than the null model and passes the F-test.
AIC: The AIC value is 9016, which is lower than that of the all-variables model.
BIC: The BIC value is 9060, which is lower than that of the all-variables model.
T-test: The t-test passes for all variables since the p-value < 0.05 for each variable.
R-squared: The R-squared value is 0.115.
Adjusted R-squared: The adjusted R-squared value is 0.101, which is better than that of the all-variables model.
The performance measures for the Linear Model after feature selection are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 506.384 | 256424.508 | 255166.372 | -35.470 |
ACF of residuals:
We observe that the residuals decay in the ACF plot, but they do not resemble white noise.
The Q value of the residuals of the linear model is 1333.365.
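This Q value is consistent with the Box-Pierce statistic, n times the sum of squared residual autocorrelations; a sketch of its computation (the number of lags is an assumption):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def box_pierce_q(residuals, lags):
    """Box-Pierce Q: n * sum of squared residual ACF values at lags 1..lags."""
    r = acf(residuals, nlags=lags)
    return len(residuals) * np.sum(r[1:] ** 2)

lm_residuals = y_test - fit.predict(sm.add_constant(X_test[selected]))
print(f"Q = {box_pierce_q(lm_residuals, lags=50):.3f}")
```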
Correlation Coefficient Matrix:
The correlation coefficient matrix includes only those variables which have been used for final linear model. After feature selection the final independent features are:
holiday_Columbus Day, holiday_No Holiday, holiday_State Fair, holiday_Veterans Day, temp, weather_main_Clear, weather_main_Clouds, weather_main_Fog, weather_main_Rain, and weather_main_Snow.
We notice there is no strong relationship between the variables and thus no multicollinearity is present. The correlation coefficient matrix is as follows:
The plot of the linear model predictions along with the actual values is shown below:
10. ARMA model
We will now predict the test data using an ARMA process. For this we create the GPAC table and find potential orders of the ARMA process. Once we find the order of the ARMA process, we estimate its parameters.
We computed the mean of the training data and subtracted it from the training data. This is done to satisfy the zero-mean assumption of the ARMA model: since the ARMA model does not include an intercept, fitting data with a non-zero mean would be a challenge.
Once the order and parameters are estimated, we forecast the values and add the training mean back. After adding the mean, we check whether the residuals of the forecasted and actual values are significant or not using the chi-squared diagnostic test.
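A sketch of this procedure, using statsmodels' ARIMA as a stand-in for the parameter estimation and instantiating it with the final order ARMA(4,6) identified later in this section:

```python
from statsmodels.tsa.arima.model import ARIMA

# Center the training series so the zero-mean ARMA assumption holds.
y_mean = train["traffic_volume"].mean()
centered = train["traffic_volume"] - y_mean

# A zero-intercept ARMA(4,6) is ARIMA(4, 0, 6) with no trend term.
arma = ARIMA(centered, order=(4, 0, 6), trend="n").fit()

# Forecast the test horizon and add the training mean back.
arma_forecast = arma.forecast(steps=len(test)) + y_mean
```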
GPAC Table:
The GPAC table with j = 12 and k = 12 is shown below:
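A sketch of how such a GPAC table can be constructed: each entry phi_kk(j) is a ratio of determinants of autocorrelation matrices derived from the Yule-Walker equations:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def gpac_table(y, J=12, K=12):
    """GPAC entries phi_kk(j) for j = 0..J-1 (rows) and k = 1..K (columns)."""
    ry = acf(y, nlags=J + K)
    r = lambda m: ry[abs(m)]  # the autocorrelation is symmetric in the lag
    table = np.zeros((J, K))
    for j in range(J):
        for k in range(1, K + 1):
            # Denominator: k x k matrix of autocorrelations; the numerator
            # replaces the last column with r(j+1), ..., r(j+k).
            den = np.array([[r(j + a - b) for b in range(k)] for a in range(k)])
            num = den.copy()
            num[:, -1] = [r(j + a + 1) for a in range(k)]
            table[j, k - 1] = np.linalg.det(num) / np.linalg.det(den)
    return table

print(np.round(gpac_table(centered.to_numpy()), 2))
```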
From the GPAC table we consider the following orders for ARMA parameter estimation.
(n_a, n_b) = [(2, 5), (2, 7), (4, 0), (4, 2), (4, 5), (4, 7), (6, 5), (10, 3)]
We noticed that none of the ARMA orders identified from the GPAC table pass the chi-squared test.
Thus, we try all possible order combinations in a brute-force manner; ARMA(4,6) passes the chi-square test even though it shows no pattern in the GPAC table. This is plausible since we have only 584 samples in the training data.
Chi Square Test:
As noted above, after the brute-force search over all order combinations, ARMA(4,6) is the model that passes the chi-square test.
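A sketch of this diagnostic, comparing Q on the ARMA residuals against the chi-square critical value with lags - n_a - n_b degrees of freedom (the lag count and significance level are assumptions):

```python
from scipy.stats import chi2

na, nb, lags, alpha = 4, 6, 50, 0.01  # lags and alpha are assumed here
q = box_pierce_q(arma.resid, lags)    # helper defined in section 9
dof = lags - na - nb
critical = chi2.ppf(1 - alpha, dof)
verdict = "passes" if q < critical else "fails"
print(f"Q = {q:.3f}, critical value = {critical:.3f} -> {verdict}")
```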
Parameter Estimation:
The estimated parameters based on n_a = 4 and n_b = 6 are:
Summary of ARMA(4,6) model:
The summary of ARMA(4,6) model is as follows:
Simplification of Model:
We will now simplify the ARMA(4,6) model by checking whether zero is included in any of the parameter confidence intervals.
The confidence intervals for the parameters are as follows:
We notice that none of the confidence intervals contain zero. Thus, no simplification is needed.
We also simplify the model based on zero/pole cancellation by checking the roots of the numerator and denominator polynomials. The roots of the AR and MA processes are:
None of the roots coincide, thus no zero/pole cancellation is required. Hence the final ARMA model after simplification is ARMA(4,6).
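A sketch of both simplification checks on the fitted model, printing the parameter confidence intervals and the roots of the AR and MA polynomials:

```python
import numpy as np

# Zero-in-interval check on the estimated parameters.
print(arma.conf_int())

# Roots of the AR polynomial 1 - phi_1 z - ... - phi_4 z^4 and the
# MA polynomial 1 + theta_1 z + ... + theta_6 z^6; np.roots expects
# coefficients ordered from the highest power down.
print("AR roots:", np.roots(np.r_[1, -arma.arparams][::-1]))
print("MA roots:", np.roots(np.r_[1, arma.maparams][::-1]))
```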
Plot of Prediction: The plot of the forecasted values along with the actual values is shown below:
ACF of residuals: The ACF of the residuals for ARMA(4,6) is shown below:
We conclude that the ACF plot does not resemble white noise.
A possible reason for the poor performance of the ARMA model is that our data contains seasonality.
The performance measures for the ARMA model are as follows:
| RMSE | MSE | Residual Variance | Residual Mean |
| --- | --- | --- | --- |
| 919.591 | 845648.87 | 842603.408 | 55.185 |
Biased or unbiased model: Since the absolute value of the mean of the residuals is greater than 0.05, we say that the model is biased. We can remove the bias by adding the residual mean to all the predictions.
Variance of error of estimated parameters:
The estimated variance of error for n_a = 4 and n_b = 6 is 119649.321.
11. Final Model Selection
In our analysis for predicting traffic volume we have used the Average, Naïve, Drift, Holt-Winters, Multiple Linear Regression, and ARMA models. We will now compare the outputs of these models and provide conclusions.
| Model | MSE | RMSE | Residual Mean | Residual Variance |
| --- | --- | --- | --- | --- |
| Holt-Winters Model | 84690.827 | 291.016884 | -22.209225 | 84197.577336 |
| Multiple Linear Regression Model | 256424.508 | 506.383756 | -35.470217 | 255166.372084 |
| Average Model | 282937.565 | 531.918758 | 55.477248 | 279859.840159 |
| ARMA(4, 6) Model | 845648.870 | 919.591687 | 55.185695 | 842603.408654 |
| Naive Model | 1191763.077 | 1091.679017 | 954.936248 | 279859.840159 |
| Drift Model | 1497497.754 | 1223.722907 | 1101.430227 | 284349.208489 |
When we consider RMSE as the performance metric, the Holt-Winters model has the smallest RMSE and thus performs best.
Hence, for the traffic volume prediction problem, the Holt-Winters model is the best model.
The plots of the predicted and actual values for all models are shown below:
12. Conclusion
In conclusion, based on the RMSE values, the Holt-Winters model is recommended for traffic volume prediction. As future work, we may explore other models such as SARIMA or recurrent neural networks.
References
1. Link to all code files: https://github.com/sheldonsebastian/Traffic_Prediction_Using_Time_Series