Car correlations.

Hitchhiker’s guide to Used Car Prices Estimation

What is your car’s worth? This tutorial will guide you through the steps for estimating it’s value by using Machine Learning techniques. I will use my Peugeot 106 as an example!

By the way, join our community and ask your question for free here!

Why Machine Learning at all?

I discussed the price of my over 10-year-old vehicle with a family member. I asked a simple question: What do you think of the value of my car? He laughed and searched for some Peugeot cars on the internet and reported the average price of the cars he saw. This is a great first approach, but I think we can do better. And, spoiler alert, we can! The Machine Learning approach presented in this article will give us some valuable insights in the estimation of the price. Let’s start on our used car prices estimation journey!

Hitchhiker’s guide to Used Car Prices

In this tutorial, we will go through the following steps:

  1. Dataset creation.
  2. Relations between the variables.
  3. Variable selection.
  4. Function approximation.

That’s all! First, we need to create a dataset by gathering information about your car (here, my Peugeot 106). Then, we will figure out what the relations are between the different variables. After that, we will only use the most influencial variables. Having all the variables is great, but not all variables contribute a lot to the price of a car. Finally, by having our predictor variables, we can try to figure out a relation between these variables and the ask price. Let’s start!

Step 0: Importing dependencies

The code has a few dependencies. You can load the dependencies using the following code snippet:

# The imports
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Creating the dataset

Before we can do any machine learning, we need data. I collected the data using a online used car selling platform in the Netherlands. The following script writes the dataset to a file and then loads the data into a dataframe:

data = """
Brand,Type,Color,Construction Year,Odometer,Ask Price,Days Until MOT,HP
Peugeot 106,1.0,blue,2002,166879,999,138,60
Peugeot 106,1.0,blue,1998,234484,999,346,60
Peugeot 106,1.1,black,1997,219752,500,-5,60
Peugeot 106,1.1,red,2001,223692,750,-87,60
Peugeot 106,1.1,grey,2002,120275,1650,356,59
Peugeot 106,1.1,red,2003,131358,1399,266,60
Peugeot 106,1.1,green,1999,304277,799,173,57
Peugeot 106,1.4,green,1998,93685,1300,0,75
Peugeot 106,1.1,white,2002,225935,950,113,60
Peugeot 106,1.4,green,1997,252319,650,133,75
Peugeot 106,1.0,black,1998,220000,700,82,50
Peugeot 106,1.1,black,1997,212000,700,75,60
Peugeot 106,1.1,black,2003,255134,799,197,60

# Write this dataset to a file and read it
with open('data.csv', 'w') as output_file:
df = pd.read_csv('data.csv')

This will then result into the following dataframe:

BrandTypeColorConstruction YearOdometerAsk PriceDays Until MOTHP
0Peugeot 1061.0blue200216687999913860
1Peugeot 1061.0blue199823448499934660
2Peugeot 1061.1black1997219752500-560
3Peugeot 1061.1red2001223692750-8760
4Peugeot 1061.1grey2002120275165035659
5Peugeot 1061.1red2003131358139926660
6Peugeot 1061.1green199930427779917357
7Peugeot 1061.4green1998936851300075
8Peugeot 1061.1white200222593595011360
9Peugeot 1061.4green199725231965013375
10Peugeot 1061.0black19982200007008250
11Peugeot 1061.1black19972120007007560
12Peugeot 1061.1black200325513479919760

As you can see, I collected the brand (Peugeot 106), the type (1.0, 1.1, …), the color of the car (black, blue, …) the construction year of the car, the odometer of the car (which is the distance in kilometers (km) traveled with the car at this point in space and time), the ask price of the car (in Euro’s), the days until the MOT (Ministry of Transport test, a required periodical check-up of your car) and the horse power (HP) of the car. Feel free to use your own variables/units!

Cleaning the dataset

There are a few issues with the dataset. First of all, not all variables are of ordered types. For example, there is no logical way of ordering the color. Therefore, we will convert these columns to binary columns which simply says per color whether the car is that color or not. I use the following script for transforming categorical variables to binary variables:

# Convert the color column to one binary column for each color
df_colors = df['Color'].str.get_dummies().add_prefix('Color: ')
# Convert the type column to one binary column for each type
df_type = df['Type'].apply(str).str.get_dummies().add_prefix('Type: ')

# Add all dummy columns
df = pd.concat([df, df_colors, df_type], axis=1)
# And drop all categorical columns
df = df.drop(['Brand', 'Type', 'Color'], axis=1)

The output is now the following:

Car dataset.

Car dataset.

Great! Now we can dive into the relations between the variables.

Update: I obtained better results when I take the log of the log of the odometer variable. To do so, you need to execute the following:

df['Odometer Log Log'] = np.log(np.log(df['Odometer']))

Step 2: Relations between the variables

It is quite simple using Pandas to find out correlations between our variables:

matrix = df.corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, vmax=0.7, square=True)
plt.title('Car Price Variables')

This results into the following plot:

Car correlations.

Car correlations.

Now things get interesting. The Ask Price has a strong positive correlation with the construction year which makes a lot of sense. Newer cars are worth more money. It also has a positive correlation with the days until the MOT. This also makes a lot of sense! If you just did a MOT, the buyer is more certain that nothing is wrong with your car. The person is definitely willing to pay more if you just did a MOT. There is a strong negative correlation with the odometer. If you drove a lot with your car, it is more likely that the car will get more and more issues. Also interesting: the color grey is also correlated. But that is because of underlying (latent) variables. In my dataset, the grey cars were sport types of the Peugeot 106. Therefore, they have more horse power and are more optimized then a normal Peugeot 106 and that is why the ask price is higher. But this has nothing to do with the color of the car.

Here is a plot of the interesting variables:

Interesting variables.

Interesting variables.

Step 3: Variable selection

As pointed out, we will use the following variables for predicting the ask price:

  • Days until MOT (periodical check-up).
  • Odometer (traveled distance of the car).
  • Construction year.

Update: For the odometer, I used the log of the log of the original odometer variable

The following plot reveals the relation between the interesting variables:

sns.pairplot(df[['Construction Year', 'Days Until MOT', 'Odometer Log Log', 'Ask Price']], size=1.5)
Relations between variables.

Relations between variables.

By the way, using this plot I figured out that the log of the log of the odometer worked better for me.

Step 4: The Ask Price function

How can we approximate the ask price given the variables? Using a neural network ofcourse! We will use a Multi-Layered Perceptron to estimate the price. Also note that the variables are not normalized. Normalization is important for a neural network. If you will not normalize your data, then certain variables will have a different scale than other variables. Take for example the construction year variable. The construction year will be way lower than the traveled distance. An example construction year is 1999 and an example traveled distance is 250.000 km. This is approximately 1000x more than the construction year! Therefore, we will normalize such that all numbers are centered around zero (and have a standard deviation of approximately 1). Luckily, scikit-learn (sklearn) can take care of that!

from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
import numpy as np

X = df[['Construction Year', 'Days Until MOT', 'Odometer Log Log']]
y = df['Ask Price'].reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

X_normalizer = StandardScaler()
X_train = X_normalizer.fit_transform(X_train)
X_test = X_normalizer.transform(X_test)

y_normalizer = StandardScaler()
y_train = y_normalizer.fit_transform(y_train)
y_test = y_normalizer.transform(y_test)

model = MLPRegressor(hidden_layer_sizes=(100, 100), random_state=42), y_train)

Now we can predict prices:

y_pred = model.predict(X_test)

y_pred_inv = y_normalizer.inverse_transform(y_pred)
y_test_inv = y_normalizer.inverse_transform(y_test)

# Build a plot
plt.scatter(y_pred_inv, y_test_inv)
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(500, 1500, 100)
plt.plot(diagonal, diagonal, '-r')
plt.xlabel('Predicted ask price (€)')
plt.ylabel('Ask price (€)')

This resulted into the following plot for me:

Price estimation.

Price estimation.

Perfect! The red line shows the perfect predictions. If all blue dots would lie on this line, then the predictions would be perfect. But this is not bad! We are able to predict the ask price quite good.

Bonus: Relations between the variables

Now we are able to predict the ask price, we can ask ourselves the following question: what are the relations between the several variables? We will explore the relations (in case of my Peugeot 106) in this section.

Distance traveled (odometer) versus ask price

distances = np.linspace(150000, 250000, 10000)

my_car = pd.DataFrame([
        'Construction Year': 1998,
        'Odometer': odometer,
        'Days Until MOT': 150
for odometer in distances])

my_car['Odometer Log Log'] = np.log(np.log(my_car['Odometer']))
X_custom = my_car[['Construction Year', 'Days Until MOT', 'Odometer Log Log']]
X_custom = X_normalizer.transform(X_custom)

y_pred = model.predict(X_custom)
price_prediction = y_normalizer.inverse_transform(y_pred)

fig, ax = plt.subplots(1, 1)
ax.plot(distances, price_prediction)
ax.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_xlabel('Distance travelled (km)')
ax.set_ylabel('Predicted ask price (€)')
plt.title('Predicted ask price versus Distance travelled')




I think this relation is general for most cars. Please report if you find a different relation for you car! As you can see, the more you travel with a car, the less the ask price becomes. The difference in the price is almost €200,-!

Construction year versus ask price

This will definitely differ per car type, but you will probably see a positive relation. If the car is younger, then the ask price will be higher.

construction_years = list(range(1997, 2001 + 1))

my_car = pd.DataFrame([
        'Construction Year': construction_year,
        'Odometer': 220000,
        'Days Until MOT': 150
for construction_year in construction_years])

my_car['Odometer Log Log'] = np.log(np.log(my_car['Odometer']))
X_custom = my_car[['Construction Year', 'Days Until MOT', 'Odometer Log Log']]
X_custom = X_normalizer.transform(X_custom)

y_pred = model.predict(X_custom)
price_prediction = y_normalizer.inverse_transform(y_pred)

fig, ax = plt.subplots(1, 1)
ax.plot(construction_years, price_prediction)
plt.xticks(construction_years, construction_years)
ax.set_xlabel('Construction year')
ax.set_ylabel('Predicted ask price (€)')
plt.title('Predicted ask price versus Construction year')
Construction year.

Construction year.

Periodical check-up (MOT) versus ask price

days_until_MOT = np.linspace(-365, 365, 100)

my_car = pd.DataFrame([
        'Construction Year': 1998,
        'Odometer': 220000,
        'Days Until MOT': days
for days in days_until_MOT])

my_car['Odometer Log Log'] = np.log(np.log(my_car['Odometer']))
X_custom = my_car[['Construction Year', 'Days Until MOT', 'Odometer Log Log']]
X_custom = X_normalizer.transform(X_custom)

y_pred = model.predict(X_custom)
price_prediction = y_normalizer.inverse_transform(y_pred)

fig, ax = plt.subplots(1, 1)
ax.plot(days_until_MOT, price_prediction)
ax.set_xlabel('Days until MOT')
ax.set_ylabel('Predicted ask price (€)')
ax.set_xlim(0, 365)
plt.title('Predicted ask price versus Days until MOT')
Days until MOT.

Days until MOT.

If the MOT costs less than $1100-700=400$ Euro, then it is definitely worth it to do a check-up! The predicted ask price is approximately $700$ Euro when the MOT is soon. there are 365 days until MOT left when the MOT just happened. When there are 365 days until MOT, the predicted ask price is approximately $1100$ Euro. So definitely do a MOT if you want to sell your car and for this car, the MOT costs less than $400$ Euro.

What is the value of my Peugeot?

My car has the following properties:

  • Construction year: 1998.
  • Traveled distance: 220.000 km.
  • Days until MOT: 150.

Using our neural network (MLP) model, I can approximate the ask price!

my_car = pd.DataFrame([
        'Construction Year': 1998,
        'Odometer': 220000,
        'Days Until MOT': 150

my_car['Odometer Log Log'] = np.log(np.log(my_car['Odometer']))
X_custom = my_car[['Construction Year', 'Days Until MOT', 'Odometer Log Log']]
X_custom = X_normalizer.transform(X_custom)

y_pred = model.predict(X_custom)
price_prediction = y_normalizer.inverse_transform(y_pred)
print('Predicted ask price: €%.2f' % price_prediction)

The output was €828.86. That is great since I bought it for only €800,-. What is the approximated price for your car?

Conclusion (TL;DR)

Using machine learning for predicting the ask price of your car is great! It reveals the underlying variables which are important for the car price. In my case (for my Peugeot 106), the traveled distance, days until periodical check-up and the construction year were the most important variables. Using a neural network regressor, I was able to predict the ask price for my Peugeot 106. If you have any comments or questions, feel free to ask! If you liked this article, then please share it on Social Media! If you are interested in predicting house prices using machine learning, then you should read this article.

Kevin Jacobs

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: