Electric Vehicle Range Prediction

Code Availability

View on GitHub

Introduction

In recent years, more and more people have been buying Electric Vehicles (EVs) for environmental, aesthetic, and financial reasons, and the number of car companies producing EVs is increasing. Companies such as Tesla, Ford, and Rivian are taking advantage of this shift toward EVs. The goal of our project is to examine how much the range of an EV changes due to factors such as battery pack capacity. In this paper we analyze the relationship between range and variables such as acceleration, top speed, battery pack, efficiency, fast-charging speed, and price, in order to gain a deeper understanding of which elements influence EV range the most. The results also suggest what manufacturers could do to improve their EVs and provide useful data for car companies.

Background

Dataset

Divyanshu Gupta (2021). *Cars Dataset with Battery Capacity* [Data file]. Retrieved from Kaggle.

Data Collection Method

The data was collected from different companies such as Tesla, Porsche, and BMW, and includes the specific make and model of each car. The dataset contains 14 explanatory variables, 1 response variable, and a total of 102 data points.

Preliminary Analysis

Hypothesized Variables That Impact Electric Vehicle Range

The main covariate that we believe will have the largest impact on the range of an electric vehicle is battery pack capacity in kilowatt-hours (kWh). The other quantitative variables we examine are acceleration, top speed, efficiency, fast-charging speed, and price. The qualitative variables are plug type, number of seats, powertrain (front-, rear-, or all-wheel drive), and body style. This mix of qualitative and quantitative variables gives a clearer understanding of how different factors affect the range of EVs.

| # | Brand | Model | AccelSec | TopSpeed_KmH | Range_Km | Battery_Pack Kwh | Efficiency_WhKm | FastCharge_KmH | RapidCharge | PowerTrain | PlugType | BodyStyle | Segment | Seats | PriceEuro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tesla | Model 3 Long Range Dual Motor | 4.6 | 233 | 460 | 70.0 | 161 | 940 | Yes | AWD | Type 2 CCS | Sedan | D | 5 | 55480 |
| 1 | Volkswagen | ID.3 Pure | 10.0 | 160 | 270 | 45.0 | 167 | 250 | Yes | RWD | Type 2 CCS | Hatchback | C | 5 | 30000 |
| 2 | Polestar | 2 | 4.7 | 210 | 400 | 75.0 | 181 | 620 | Yes | AWD | Type 2 CCS | Liftback | D | 5 | 56440 |
| 3 | BMW | iX3 | 6.8 | 180 | 360 | 74.0 | 206 | 560 | Yes | RWD | Type 2 CCS | SUV | D | 5 | 68040 |
| 4 | Honda | e | 9.5 | 145 | 170 | 28.5 | 168 | 190 | Yes | RWD | Type 2 CCS | Hatchback | B | 4 | 32997 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 97 | Nissan | Ariya 63kWh | 7.5 | 160 | 330 | 63.0 | 191 | 440 | Yes | FWD | Type 2 CCS | Hatchback | C | 5 | 45000 |
| 98 | Audi | e-tron S Sportback 55 quattro | 4.5 | 210 | 335 | 86.5 | 258 | 540 | Yes | AWD | Type 2 CCS | SUV | E | 5 | 96050 |
| 99 | Nissan | Ariya e-4ORCE 63kWh | 5.9 | 200 | 325 | 63.0 | 194 | 440 | Yes | AWD | Type 2 CCS | Hatchback | C | 5 | 50000 |
| 100 | Nissan | Ariya e-4ORCE 87kWh Performance | 5.1 | 200 | 375 | 87.0 | 232 | 450 | Yes | AWD | Type 2 CCS | Hatchback | C | 5 | 65000 |
| 101 | Byton | M-Byte 95 kWh 2WD | 7.5 | 190 | 400 | 95.0 | 238 | 480 | Yes | AWD | Type 2 CCS | SUV | E | 5 | 62000 |

Exploring Influence of Outliers

We tested different outlier-removal methods, such as removing data points more than two or three standard deviations from the mean. Figure 1 compares the linear relationship of each covariate against Range_Km with and without outliers:

*Figure 1: best-fit lines of each covariate against Range_Km, with and without outliers.*
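The standard-deviation filter described above can be sketched as follows; this is a minimal illustration on a toy frame, not the project's actual code, and the column name is assumed to match the dataset preview:

```python
import numpy as np
import pandas as pd

def remove_outliers(df, column, n_std=2):
    """Keep rows whose value in `column` lies within n_std sample
    standard deviations of the column mean."""
    mean, std = df[column].mean(), df[column].std()
    mask = (df[column] - mean).abs() <= n_std * std
    return df[mask]

# Toy illustrative frame (not the real dataset): one extreme range value
cars = pd.DataFrame(
    {"Range_Km": [460, 270, 400, 360, 170, 330, 335, 325, 375, 400, 2000]}
)
filtered = remove_outliers(cars, "Range_Km", n_std=2)  # drops the 2000 km row
```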

After analyzing the best-fit lines with and without outliers, we examined the corresponding residual plots. Figure 2 compares the residual plots of each covariate against Range_Km with and without outliers:

*Figure 2: residual plots of each covariate against Range_Km, with and without outliers.*

We found a minimal difference in the strength of the linear relationship (measured with $R^2$) between the covariates and the response variable when filtering outliers. Additionally, our dataset is relatively small, with only 102 samples. For these reasons, we decided not to remove outliers in later stages of our analysis.

| Covariate | R² Before | R² After |
|---|---|---|
| AccelSec | 0.460 | 0.522 |
| TopSpeed_KmH | 0.560 | 0.463 |
| Battery_Pack Kwh | 0.829 | 0.753 |
| Efficiency_WhKm | 0.098 | 0.068 |
| FastCharge_KmH | 0.569 | 0.600 |
| PriceEuro | 0.458 | 0.411 |

Table 1

Normal Quantile Plot

To verify one of the central assumptions of linear regression, normally distributed residuals, we created Q-Q plots (Figure 3) for all of our covariates and observed that this assumption holds for most of them.

*Figure 3: normal quantile (Q-Q) plots for the covariates.*
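A Q-Q comparison of this kind can be produced programmatically; the sketch below uses `scipy.stats.probplot` on simulated stand-in residuals (the real analysis would use the fitted model's residuals). A fit correlation `r` close to 1 supports the normality assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in residuals: normal noise at roughly the model's RMSE scale
residuals = rng.normal(loc=0.0, scale=28.6, size=102)

# probplot returns ordered data vs. theoretical normal quantiles,
# plus the least-squares line fitted to that Q-Q relationship.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```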

Transformations

Analyzing the residual plots, we see that most residuals appear random, normally distributed, independent, and homoscedastic. However, these conditions are not met for the covariates Battery Pack and Price in Euro, so we decided to transform those two covariates.

We tested two conventional transformations. First, we applied square-root transformations to Battery Pack and Price in Euro; next, we applied log transformations to the same two covariates. The log transformations led to a stronger linear relationship ($R^2$), so our final choice was a log transform of Battery Pack and Price. Figure 4 shows the improved residual plots after the log transformation:

*Figure 4: residual plots after the log transformation.*
| Covariate | R² Before | R² After |
|---|---|---|
| AccelSec | 0.460 | 0.539 |
| TopSpeed_KmH | 0.560 | 0.552 |
| Battery_Pack Kwh | 0.829 | 0.768 |
| Efficiency_WhKm | 0.098 | 0.087 |
| FastCharge_KmH | 0.569 | 0.187 |
| PriceEuro | 0.458 | 0.529 |

Table 2
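The transformation comparison can be reproduced in outline. The snippet below is an illustrative sketch on synthetic data (a log-shaped relation between a stand-in battery variable and range), not the project's code; it only demonstrates how $R^2$ before and after a log transform would be compared:

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

rng = np.random.default_rng(1)
battery = rng.uniform(20, 100, size=102)           # synthetic kWh values
range_km = 120 * np.log(battery) + rng.normal(0, 20, size=102)  # toy relation

raw_r2 = r_squared(battery, range_km)              # R^2 before transform
log_r2 = r_squared(np.log(battery), range_km)      # R^2 after log transform
```

For a truly log-linear relationship, the transformed covariate yields the stronger linear fit, mirroring the improvement seen for Battery Pack and Price in Table 2.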

Main Results

Multicollinearity

We created a correlation coefficient matrix (Figure 5) to test for multicollinearity among our covariates. Range was highly correlated with all of the covariates, a good sign that our model is predictive of range. We noticed that acceleration has a negative correlation with most of the other variables, including range, and that efficiency had the lowest correlation coefficient with range, at 0.48. Price is highly correlated with the other covariates, which we take as evidence of multicollinearity; removing price may reduce the multicollinearity issues we encounter. It may be best to use stepwise forward regression to allow the model to drop covariates that cause multicollinearity.

*Figure 5: correlation coefficient matrix of the covariates and Range_Km.*
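A correlation matrix like Figure 5 can be computed with `pandas.DataFrame.corr`; the example below uses synthetic stand-in columns (not the real dataset) to show the mechanics, including a negatively correlated acceleration-like variable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 102
battery = rng.uniform(20, 100, n)
df = pd.DataFrame({
    "Battery_Pack_Kwh": battery,
    # Range rises with battery capacity; acceleration time falls with it
    "Range_Km": 4.0 * battery + rng.normal(0, 30, n),
    "AccelSec": 12 - 0.07 * battery + rng.normal(0, 1, n),
})

corr = df.corr()  # Pearson correlation coefficient matrix
```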

With the covariates that remained after feature selection, we measured multicollinearity using the variance inflation factor (VIF) given by:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

The results (Table 3) show that all of the selected covariates have VIF values below 5, indicating that multicollinearity is not a serious concern among them. TopSpeed is less significant than the other covariates, but because it is significant at the 5% level and has a VIF below 10, we believe it should still be included in the analysis.

| Term | Estimate | Std Error | t Ratio | Prob > \|t\| | VIF |
|---|---|---|---|---|---|
| Intercept | -511.9363 | 34.95776 | -14.64 | < .0001* | . |
| TopSpeed_KmH | 0.4494041 | 0.175015 | 2.57 | 0.0120* | 1.9033032 |
| Log(Battery_Pack KWh) | 235.09752 | 14.19828 | 16.56 | < .0001* | 2.7759986 |
| Efficiency_WhKm | -1.013424 | 0.184575 | -5.49 | < .0001* | 1.8698509 |

Table 3
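The VIF formula above can be implemented directly by regressing each covariate on the others. The snippet below is a generic sketch on synthetic data (two deliberately correlated columns and one independent column), not the project's code:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_i = 1 / (1 - R_i^2), where R_i^2
    comes from regressing column i on the remaining columns (plus an
    intercept)."""
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.5, size=200)  # strongly correlated with x1
x3 = rng.normal(size=200)                  # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))  # high for x1/x2, near 1 for x3
```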

Variable Selection

All five variable-selection methods converged on the same model, with an adjusted R² of 0.941:

$$\hat{y} = -909.39 + 0.29x_1 + 39.04x_2 - 2.08x_3 + 288.90x_4 - 28.80x_5 + 8.22x_6 - 45.13x_7 - 52.62x_8 + 70.71x_9 + 43.78x_{10}$$

| Method | Prob to Enter | Prob to Leave | Adjusted R² |
|---|---|---|---|
| Forward Selection | 0.25 | 0.10 | 0.941 |
| Backward Elimination | 0.25 | 0.10 | 0.941 |
| Mixed | 0.25 | 0.25 | 0.941 |
| AICc | N/A | N/A | 0.941 |
| BIC | N/A | N/A | 0.941 |

Table 4
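A forward-selection loop can be sketched as below. Note that this simplified variant greedily maximizes adjusted R² rather than using the p-value-to-enter/leave thresholds from Table 4 (which our analysis applied in JMP), and it runs on synthetic data for illustration:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y, names):
    """Greedily add the covariate that most improves adjusted R^2;
    stop when no candidate improves the score."""
    chosen, best = [], -np.inf
    while True:
        scores = {j: adj_r2(X[:, chosen + [j]], y)
                  for j in range(X.shape[1]) if j not in chosen}
        if not scores:
            break
        j, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break
        chosen.append(j)
        best = score
    return [names[j] for j in chosen], best

rng = np.random.default_rng(4)
n = 102
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=n)
selected, score = forward_select(X, y, ["x1", "x2", "x3", "x4"])
```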

Model Fit

The transformations included in our model were the log of Price and the log of Battery Pack. We initially achieved a perfect $R_a^2$ and realized that the Model and Brand covariates had high cardinality (the number of factor levels equaled the number of data points), causing the fitted model to overfit. For this reason, we removed the Model and Brand covariates and were left with a much more realistic adjusted $R_a^2$ and a low RMSE for predicting range.

Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.954078 |
| RSquare Adj | 0.949031 |
| Root Mean Square Error | 28.60422 |
| Mean of Response | 338.6275 |
| Observations (or Sum Wgts) | 102 |

Table 5

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 10 | 1546901.5 | 154690 | 189.0612 | <.0001* |
| Error | 91 | 74456.3 | 818 | | |
| C. Total | 101 | 1621357.8 | | | |

Table 6
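The reported fit statistics are internally consistent; the quick check below recomputes the adjusted R² and F ratio from the values in Tables 5 and 6:

```python
# Sanity-check the reported fit statistics (Tables 5 and 6).
r2, n, p = 0.954078, 102, 10          # R^2, observations, model terms

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F ratio = (SS_model / df_model) / (SS_error / df_error)
f_ratio = (1546901.5 / 10) / (74456.3 / 91)
```

Both quantities match the tables: adj_r2 ≈ 0.949031 and f_ratio ≈ 189.0612.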

| Term | Estimate | Std Error | t Ratio | Prob > \|t\| |
|---|---|---|---|---|
| Intercept | -909.3919 | 117.3835 | -7.75 | < .0001 |
| TopSpeed_KmH | 0.2942874 | 0.140142 | 2.10 | 0.0385 |
| RapidCharge[No] | 39.042448 | 8.552891 | 4.56 | < .0001 |
| Efficiency_WhKm | -2.078849 | 0.161668 | -12.86 | < .0001 |
| Log(Battery_Pack Kwh) | 288.89749 | 14.30833 | 20.19 | < .0001* |
| BodyStyle{SPV&Hatchback&MPV&SUV-Station&Sedan&Cabrio&Liftback&Pickup} | -28.80209 | 5.374477 | -5.36 | < .0001* |
| BodyStyle{SPV&Hatchback-MPV&SUV} | 8.2153396 | 3.623798 | 2.27 | 0.0258 |
| BodyStyle{Station&Sedan&Cabrio&Liftback-Pickup} | -45.12672 | 12.9488 | -3.49 | 0.0008* |
| BodyStyle{Station&Sedan-Cabrio&Liftback} | -52.61664 | 6.97113 | -7.55 | < .0001* |
| BodyStyle{Cabriolet-Liftback} | 70.712853 | 11.46589 | 6.17 | < .0001* |
| Log(PriceEuro) | 43.782414 | 14.52427 | 3.01 | 0.0033* |

Table 7

Using forward selection as our variable-selection method, with a p-value to enter of 0.25 and a p-value to leave of 0.1, we determined our final model to be as follows:

$$
\begin{aligned}
\hat{y} = {} & -909.39 + 0.29x_1 + 39.04x_2 - 2.08x_3 + 288.90x_4 \\
& - 28.80x_5 + 8.22x_6 - 45.13x_7 - 52.62x_8 \\
& + 70.71x_9 + 43.78x_{10}
\end{aligned}
$$

where the $x_i$ are listed in the table below:

| Variable | Covariate |
|---|---|
| $x_1$ | TopSpeed_KmH |
| $x_2$ | RapidCharge |
| $x_3$ | Efficiency_WhKm |
| $x_4$ | log(Battery_Pack Kwh) |
| $x_5$ | BodyStyle{SPV&Hatchback&MPV&SUV - Station&Sedan&Cabriolet&Liftback&Pickup} |
| $x_6$ | BodyStyle{SPV&Hatchback - MPV&SUV} |
| $x_7$ | BodyStyle{Station&Sedan&Cabriolet&Liftback - Pickup} |
| $x_8$ | BodyStyle{Station&Sedan - Cabriolet&Liftback} |
| $x_9$ | BodyStyle{Cabriolet - Liftback} |
| $x_{10}$ | log(PriceEuro) |

Table 8

We found our final model to be adequate, as it has a high $R_a^2$ and all of its covariates are significant at the $\alpha = 0.05$ level. We used forward selection because it reduces the issue of multicollinearity and produced the simplest model. Figure 9 shows the performance of our model; in particular, its $R_a^2$ is 0.949 and its F statistic of 189.061 is statistically significant in the ANOVA test.

Model Interpretation

Each covariate has a natural interpretation: its coefficient shows how much that factor affects the range of an EV. Our results show that battery pack capacity affects range the most. Other than the powertrain covariate, all of the covariates were significant at the 5% level. We also found that the different body styles of cars affect range.

Conclusion

Based on the results of our model, there is a strong positive relationship between battery pack capacity and range. This can help both consumers and car manufacturers: when designing new EVs, larger battery packs increase range, and in combination with body style we can see which cars achieve the best range. For example, cabriolets (convertibles) and hatchbacks have the greatest impact on range, with their coefficient being 42, while body styles such as SUVs and pickup trucks have a negative effect on range.

Another interesting covariate was rapid charge, which had the third-largest $\beta$ that we found. This makes sense, as faster charging supports longer trips and can make the car more practical to use. Price also impacts range, which makes sense: as price increases, the technology and range of the EV tend to increase as well. Our model allows car manufacturers, consumers, and investors to understand the factors that most affect EV range, enabling better EVs in the future, helping the environment, and shrinking the market for gas-powered vehicles.

👥 Collaborators

  • Yuhang Du (Boston University)
  • Matthew George (Boston University)
  • Xiangru He (Boston University)
  • Yujie Yang (Boston University)