What is your car worth? This tutorial will guide you through the steps for estimating its value using Machine Learning techniques. I will use my Peugeot 106 as an example!
Why Machine Learning at all?
I discussed the price of my over 10-year-old vehicle with a family member. I asked a simple question: what do you think my car is worth? He laughed, searched for some Peugeot cars on the internet and reported the average price of the cars he saw. This is a great first approach, but I think we can do better. And, spoiler alert, we can! The Machine Learning approach presented in this article will give us some valuable insights into how the price can be estimated. Let’s start our used car price estimation journey!
Hitchhiker’s guide to Used Car Prices
In this tutorial, we will go through the following steps:
- Dataset creation.
- Relations between the variables.
- Variable selection.
- Function approximation.
That’s all! First, we need to create a dataset by gathering information about your car (here, my Peugeot 106). Then, we will figure out what the relations are between the different variables. After that, we will keep only the most influential variables. Having all the variables is great, but not all of them contribute much to the price of a car. Finally, with our predictor variables in hand, we can try to figure out a relation between these variables and the ask price. Let’s start!
Step 0: Importing dependencies
The code has a few dependencies. You can load the dependencies using the following code snippet:
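The original snippet is not reproduced here. As a rough sketch, and assuming the tutorial relies on NumPy, Pandas and scikit-learn (the libraries named in the text), the imports could look like this:

```python
# Assumed dependencies for this tutorial (the original import list is unknown).
import numpy as np                                   # numeric helpers, e.g. logarithms
import pandas as pd                                  # dataframes and correlations
from sklearn.neural_network import MLPRegressor      # the Multi-Layered Perceptron
from sklearn.pipeline import make_pipeline           # chain scaling and the model
from sklearn.preprocessing import StandardScaler     # normalization
```

Plotting libraries are left out on purpose, since the text does not say which one was used.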
Step 1: Creating the dataset
Before we can do any machine learning, we need data. I collected the data from an online used car selling platform in the Netherlands. The following script writes the dataset to a file and then loads the data into a dataframe:
This will then result in the following dataframe:
| Brand | Type | Color | Construction Year | Odometer | Ask Price | Days Until MOT | HP |
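The original script and data are not available here, so the following is only a minimal sketch of the idea. The rows are made-up placeholders (not the collected listings), and the file name and the `write_and_load` helper are hypothetical. The column names follow the dataframe described in this step.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["Brand", "Type", "Color", "Construction Year",
           "Odometer", "Ask Price", "Days Until MOT", "HP"]

# Made-up placeholder rows for illustration only, NOT the author's data.
ROWS = [
    ("Peugeot 106", "1.0", "blue", 1997, 230000, 750, 40, 50),
    ("Peugeot 106", "1.1", "black", 1999, 180000, 950, 210, 60),
    ("Peugeot 106", "1.4", "grey", 2001, 140000, 1400, 330, 75),
]


def write_and_load(path):
    """Write the rows to a CSV file and read them back into a dataframe."""
    pd.DataFrame(ROWS, columns=COLUMNS).to_csv(path, index=False)
    # Keep "Type" as text; otherwise "1.0" would be parsed as a number.
    return pd.read_csv(path, dtype={"Type": str})


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "peugeot_106.csv")  # hypothetical file name
    df = write_and_load(csv_path)

print(df)
```

Replace `ROWS` with the listings you gathered for your own car model.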
As you can see, I collected the brand (Peugeot 106), the type (1.0, 1.1, …), the color of the car (black, blue, …), the construction year of the car, the odometer reading (the distance in kilometers (km) traveled with the car at this point in space and time), the ask price of the car (in euros), the days until the MOT (Ministry of Transport test, a required periodical check-up of your car) and the horsepower (HP) of the car. Feel free to use your own variables/units!
Cleaning the dataset
There are a few issues with the dataset. First of all, not all variables are of ordered types. For example, there is no logical way of ordering the colors. Therefore, we will convert such columns to binary columns, which simply say per color whether the car has that color or not. I use the following script for transforming categorical variables to binary variables:
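The original script is not shown, but one common way to do this in Pandas is `pd.get_dummies`. The sketch below uses a tiny made-up dataframe and assumes the column is called `Color`:

```python
import pandas as pd

# Tiny made-up example; only the "Color" column is categorical here.
df = pd.DataFrame({
    "Color": ["blue", "grey", "black", "grey"],
    "Odometer": [180000, 120000, 250000, 150000],
})

# One binary (0/1) column per color; the original "Color" column is dropped.
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
```

Each car now has a 1 in exactly one of the `Color_…` columns and a 0 in the others.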
The output is now the following:
Great! Now we can dive into the relations between the variables.
Update: I obtained better results when I take the log of the log of the odometer variable. To do so, you need to execute the following:
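A minimal sketch of that transformation, assuming the column is named `Odometer` and every value is larger than 1 km (otherwise the inner logarithm is not positive and the outer one is undefined). The new column name is a hypothetical choice:

```python
import numpy as np
import pandas as pd

# Made-up odometer readings in km.
df = pd.DataFrame({"Odometer": [120000, 180000, 250000]})

# Natural log applied twice; requires Odometer > 1 so that log(Odometer) > 0.
df["Odometer (log log)"] = np.log(np.log(df["Odometer"]))
print(df)
```

The double logarithm compresses the large odometer values strongly while keeping their order.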
Step 2: Relations between the variables
It is quite simple using Pandas to find out correlations between our variables:
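As a sketch of what this can look like: `DataFrame.corr()` computes the pairwise (Pearson) correlations between numeric columns. The values below are made up for illustration and are not the collected listings.

```python
import pandas as pd

# Made-up numeric columns for illustration only.
df = pd.DataFrame({
    "Construction Year": [1996, 1998, 1999, 2001, 2003],
    "Odometer": [260000, 230000, 200000, 150000, 110000],
    "Days Until MOT": [30, 200, 90, 300, 360],
    "Ask Price": [500, 700, 800, 1100, 1500],
})

corr = df.corr()  # Pearson correlation matrix of all numeric columns

# Correlation of every variable with the ask price, from most negative to most positive.
print(corr["Ask Price"].drop("Ask Price").sort_values())
```

The full matrix `corr` is what a correlation plot would visualize.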
This results in the following plot:
Now things get interesting. The Ask Price has a strong positive correlation with the construction year, which makes a lot of sense: newer cars are worth more money. It also has a positive correlation with the days until the MOT. This also makes a lot of sense! If you just did an MOT, the buyer is more certain that nothing is wrong with your car, and is willing to pay more for it. There is a strong negative correlation with the odometer. If you drove a lot with your car, it is more likely that the car will get more and more issues. Also interesting: the color grey is also correlated with the price. But that is because of underlying (latent) variables. In my dataset, the grey cars were sport types of the Peugeot 106. Therefore, they have more horsepower and are more optimized than a normal Peugeot 106, and that is why the ask price is higher. This has nothing to do with the color of the car itself.
Here is a plot of the interesting variables:
Step 3: Variable selection
As pointed out, we will use the following variables for predicting the ask price:
- Days until MOT (periodical check-up).
- Odometer (traveled distance of the car).
- Construction year.
Update: For the odometer, I used the log of the log of the original odometer variable.
The following plot reveals the relation between the interesting variables:
By the way, using this plot I figured out that the log of the log of the odometer worked better for me.
Step 4: The Ask Price function
How can we approximate the ask price given the variables? Using a neural network of course! We will use a Multi-Layered Perceptron to estimate the price. Also note that the variables are not normalized yet. Normalization is important for a neural network. If you do not normalize your data, then certain variables will have a very different scale than other variables. Take for example the construction year variable. The construction year will be way smaller than the traveled distance. An example construction year is 1999 and an example traveled distance is 250,000 km. That is roughly 125 times larger than the construction year! Therefore, we will normalize such that all variables are centered around zero (and have a standard deviation of approximately 1). Luckily, scikit-learn (sklearn) can take care of that!
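A minimal sketch of this setup, assuming scikit-learn's `StandardScaler` and `MLPRegressor` chained in a pipeline. The training data is synthetic (generated from a made-up pricing rule), and the network size and the `lbfgs` solver are my own assumptions, not necessarily the settings used for the results in this article.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the pricing rule below is made up for illustration.
rng = np.random.default_rng(0)
n = 200
year = rng.integers(1992, 2004, size=n)            # construction year
odometer_km = rng.uniform(80_000, 300_000, size=n)  # traveled distance
days_until_mot = rng.integers(0, 366, size=n)
price = (600 + 60 * (year - 1998) - 0.002 * (odometer_km - 200_000)
         + 1.0 * days_until_mot + rng.normal(0, 50, size=n))

# Predictors: year, log of the log of the odometer, days until MOT.
X = np.column_stack([year, np.log(np.log(odometer_km)), days_until_mot])

# StandardScaler centers each column around 0 with standard deviation 1,
# then the MLP is fitted on the scaled values.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                 max_iter=2000, random_state=0),
)
model.fit(X, price)
print("R^2 on the training data:", model.score(X, price))
```

Putting the scaler inside the pipeline means the same scaling is applied automatically whenever the model predicts.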
Now we can predict prices:
This resulted in the following plot for me:
Perfect! The red line marks perfect predictions. If all blue dots were on this line, then the predictions would be perfect. But this is not bad! We are able to predict the ask price quite well.
Bonus: Relations between the variables
Now that we are able to predict the ask price, we can ask ourselves the following question: how does each of the variables relate to the ask price? We will explore these relations (in the case of my Peugeot 106) in this section.
Distance traveled (odometer) versus ask price
I think this relation holds for most cars. Please report if you find a different relation for your car! As you can see, the more you travel with a car, the lower the ask price becomes. The difference in price is almost €200,-!
Construction year versus ask price
This will definitely differ per car type, but you will probably see a positive relation. If the car is younger, then the ask price will be higher.
Periodical check-up (MOT) versus ask price
When the MOT is due soon, the predicted ask price is approximately $700$ Euro. Right after an MOT, there are 365 days left until the next one, and the predicted ask price is then approximately $1100$ Euro. So if the MOT costs less than $1100-700=400$ Euro for this car, it is definitely worth doing a check-up before you sell it!
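The same reasoning as a tiny calculation. The two predicted prices come from the text above; the MOT cost of €250 is a hypothetical value, so plug in your own quote.

```python
def mot_gain(price_fresh_mot, price_expiring_mot, mot_cost):
    """Extra money left after paying for the MOT (negative means not worth it)."""
    return (price_fresh_mot - price_expiring_mot) - mot_cost


# 1100 and 700 are the predicted ask prices from the text; 250 is a made-up MOT cost.
print(mot_gain(1100, 700, 250))  # prints 150
```

A positive result means the higher ask price more than pays for the check-up.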
What is the value of my Peugeot?
My car has the following properties:
- Construction year: 1998.
- Traveled distance: 220,000 km.
- Days until MOT: 150.
Using our neural network (MLP) model, I can approximate the ask price!
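A sketch of how such a prediction could be made. The stand-in model is fitted on synthetic data from a made-up pricing rule, so it will not reproduce the number I got from my real dataset; the `car_features` helper and the feature order are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def car_features(year, odometer_km, days_until_mot):
    """Feature rows in the assumed order: year, log(log(odometer)), days until MOT."""
    return np.column_stack([
        np.atleast_1d(year),
        np.log(np.log(np.atleast_1d(odometer_km))),
        np.atleast_1d(days_until_mot),
    ])


# Stand-in model on synthetic data (made-up pricing rule, not real listings).
rng = np.random.default_rng(1)
years = rng.integers(1992, 2004, size=150)
odometers = rng.uniform(80_000, 300_000, size=150)
mot_days = rng.integers(0, 366, size=150)
prices = 900 + 60 * (years - 1998) - 0.002 * (odometers - 200_000) + (mot_days - 180)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                 max_iter=2000, random_state=0),
)
model.fit(car_features(years, odometers, mot_days), prices)

# The properties of my car, as listed above.
my_car = car_features(1998, 220_000, 150)
estimate = model.predict(my_car)[0]
print(f"Estimated ask price: €{estimate:.2f}")
```

The important detail is that the new car goes through exactly the same transformations (log of the log, scaling) as the training data.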
The output was €828.86. That is great since I bought it for only €800,-. What is the approximated price for your car?
Using machine learning for predicting the ask price of your car is great! It reveals the underlying variables which are important for the car price. In my case (for my Peugeot 106), the traveled distance, days until periodical check-up and the construction year were the most important variables. Using a neural network regressor, I was able to predict the ask price for my Peugeot 106. If you have any comments or questions, feel free to ask! If you liked this article, then please share it on Social Media! If you are interested in predicting house prices using machine learning, then you should read this article.