In this blog post, I will use machine learning and Python for predicting house prices. I will use a Random Forest regressor, since the target variable is a continuous number. In the end, I will demonstrate my Random Forest Python algorithm!
There is no law except the law that there is no law. – John Archibald Wheeler
Data Science is about discovering hidden patterns (laws) in your data. Observing your data is as important as discovering patterns in it: without examining the data, your pattern detection will be imperfect, and without pattern detection, you cannot draw any conclusions about your data. Therefore, both are needed for drawing conclusions.
The remainder of this notebook is divided into the following chapters:
- Study the variables. What is the problem about? What are the target variables and what do the variables represent?
- Variable analysis. We will focus on the target variable and the predictor variables and try to clean up as many variables as possible.
- Machine Learning. Here we will build and test our pattern detection algorithm. Yay!
Let’s give it a try!
```python
# Loading stuff
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set()
pd.set_option('display.max_columns', 1000)
warnings.filterwarnings('ignore')
%matplotlib inline
```
```python
# Load the data
df_train = pd.read_csv('../input/train.csv')
```
```python
# Explore the columns
print(df_train.columns.values)
print('No. variables:', len(df_train.columns.values))
```
```
['Id' 'MSSubClass' 'MSZoning' 'LotFrontage' 'LotArea' 'Street' 'Alley'
 'LotShape' 'LandContour' 'Utilities' 'LotConfig' 'LandSlope'
 'Neighborhood' 'Condition1' 'Condition2' 'BldgType' 'HouseStyle'
 'OverallQual' 'OverallCond' 'YearBuilt' 'YearRemodAdd' 'RoofStyle'
 'RoofMatl' 'Exterior1st' 'Exterior2nd' 'MasVnrType' 'MasVnrArea'
 'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual' 'BsmtCond'
 'BsmtExposure' 'BsmtFinType1' 'BsmtFinSF1' 'BsmtFinType2' 'BsmtFinSF2'
 'BsmtUnfSF' 'TotalBsmtSF' 'Heating' 'HeatingQC' 'CentralAir'
 'Electrical' '1stFlrSF' '2ndFlrSF' 'LowQualFinSF' 'GrLivArea'
 'BsmtFullBath' 'BsmtHalfBath' 'FullBath' 'HalfBath' 'BedroomAbvGr'
 'KitchenAbvGr' 'KitchenQual' 'TotRmsAbvGrd' 'Functional' 'Fireplaces'
 'FireplaceQu' 'GarageType' 'GarageYrBlt' 'GarageFinish' 'GarageCars'
 'GarageArea' 'GarageQual' 'GarageCond' 'PavedDrive' 'WoodDeckSF'
 'OpenPorchSF' 'EnclosedPorch' '3SsnPorch' 'ScreenPorch' 'PoolArea'
 'PoolQC' 'Fence' 'MiscFeature' 'MiscVal' 'MoSold' 'YrSold' 'SaleType'
 'SaleCondition' 'SalePrice']
No. variables: 81
```
Study the variables
So there are roughly 80 variables. That is a lot and we probably don’t need most of them. The ‘SalePrice’ variable is our target variable. We would like to predict this variable given the other variables. What are the other variables and what are their types? At this point, I will not throw away any variable unless it does not give any information.
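To get a quick overview of the variable types, pandas can list the dtype of every column and split them into numeric and categorical groups. A minimal sketch on a hypothetical miniature of the data (the column names come from the list above, but the values are made up):

```python
import pandas as pd

# Hypothetical miniature of the training data, just to illustrate the calls;
# the real df_train is loaded from train.csv.
df = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL'],
    'LotArea': [8450, 9600, 11250],
    'OverallQual': [7, 6, 7],
    'SalePrice': [208500, 181500, 223500],
})

print(df.dtypes)  # the type of every column
numeric_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print('Numeric:', numeric_cols)
print('Categorical:', categorical_cols)
```

On the real data frame, the same two `select_dtypes` calls separate the roughly 80 variables into numeric and categorical groups.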
Clean missing data
Now let’s take a look at which variables contain lots of NaNs. We will dump these variables since they do not contribute a lot to the predictability of the target variable.
```python
num_missing = df_train.isnull().sum()
percent = num_missing / df_train.isnull().count()
df_missing = pd.concat([num_missing, percent], axis=1, keys=['MissingValues', 'Fraction'])
df_missing = df_missing.sort_values('Fraction', ascending=False)
df_missing[df_missing['MissingValues'] > 0]
```
To simplify the problem, we will throw away any variable with missing values. This will make our prediction worse, but it also ensures we do not have to make any assumptions about these variables (which could also be dangerous).
```python
variables_to_keep = df_missing[df_missing['MissingValues'] == 0].index
df_train = df_train[variables_to_keep]
```
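Dropping is the simple option. As an aside, a common alternative is imputation: filling the missing values with, for example, the column median, so that the variable can still be used. A minimal sketch with a hypothetical LotFrontage-style column (the values here are made up):

```python
import pandas as pd

# Hypothetical column with missing values; in the real data,
# LotFrontage is one of the columns containing NaNs.
s = pd.Series([65.0, 80.0, None, 60.0, None], name='LotFrontage')

# Median imputation keeps the column instead of dropping it
filled = s.fillna(s.median())
print(filled.tolist())
```

Imputation introduces assumptions of its own, which is exactly what we are avoiding here, but it is worth knowing the option exists.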
Variable analysis
Here we will do a quick analysis of the variables and their underlying relations. Let’s build a correlation matrix.
```python
# Build the correlation matrix
matrix = df_train.corr()
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(matrix, vmax=0.7, square=True)
```
Now we can zoom in on the SalePrice and determine which variables are strongly correlated to it.
```python
interesting_variables = matrix['SalePrice'].sort_values(ascending=False)
# Keep only variables strongly correlated with SalePrice (|v| >= 0.6)
# and drop the target variable itself
interesting_variables = interesting_variables[abs(interesting_variables) >= 0.6]
interesting_variables = interesting_variables[interesting_variables.index != 'SalePrice']
interesting_variables
```
```
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
TotalBsmtSF    0.613581
1stFlrSF       0.605852
Name: SalePrice, dtype: float64
```
Nice! So apparently, the overall quality is the most predictive variable so far. That makes sense, but it is also quite vague: what exactly does this score mean? Let’s zoom in on it.
```python
values = np.sort(df_train['OverallQual'].unique())
print('Unique values of "OverallQual":', values)
```
```
Unique values of "OverallQual": [ 1  2  3  4  5  6  7  8  9 10]
```
So apparently, we have an ordinal variable “OverallQual” with a score from 1 to 10. According to the description of the variables, 1 means Very Poor, 5 means Average and 10 means Very Excellent. Let’s plot the relationship between “OverallQual” and “SalePrice”:
```python
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
data.plot.scatter(x='OverallQual', y='SalePrice')
```
Okay, the trend is clearly visible. Now let’s analyse all of our variables-of-interest.
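The trend can also be quantified instead of eyeballed, for instance by grouping on the quality score and looking at the median price per group. A sketch with made-up numbers (on the real data you would group `df_train` the same way):

```python
import pandas as pd

# Hypothetical sample standing in for df_train
data = pd.DataFrame({
    'OverallQual': [3, 3, 5, 5, 7, 7, 9, 9],
    'SalePrice':   [90000, 110000, 140000, 160000, 210000, 230000, 330000, 370000],
})

# Median sale price per quality score; an upward trend confirms the scatter plot
median_by_qual = data.groupby('OverallQual')['SalePrice'].median()
print(median_by_qual)
```

If the medians increase monotonically with the quality score, the visual trend is confirmed numerically.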
```python
cols = interesting_variables.index.values.tolist() + ['SalePrice']
sns.pairplot(df_train[cols], height=2.5)
plt.show()
```
This plot reveals a lot. It gives clues about the types of the different variables. There are a few discrete variables (OverallQual, GarageCars) and some continuous variables (GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF).
We will now zoom in on the heatmap we produced earlier by only showing the variables of interest. This could potentially reveal some underlying relations!
```python
# Build the correlation matrix for the variables of interest
matrix = df_train[cols].corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, vmax=1.0, square=True)
```
I definitely see some clusters here! GarageCars and GarageArea are strongly correlated, which makes a lot of sense. TotalBsmtSF and 1stFlrSF are also correlated, which makes sense as well. And, as intended, every variable we kept is correlated with SalePrice, which is also visible in this plot. Great! Now we will start with some Machine Learning and try to predict the SalePrice!
Machine Learning (Random Forest regression)
In this chapter, I will use a Random Forest regressor, since the target variable is a continuous real number. I will split the train set into a train and a test set, since the labels of the competition’s test set are not available and I am not interested in submitting; this way we can evaluate the model ourselves. Let’s find out how well the model works!
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

pred_vars = [v for v in interesting_variables.index.values if v != 'SalePrice']
target_var = 'SalePrice'
X = df_train[pred_vars]
y = df_train[target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
```python
y_pred = model.predict(X_test)

# Build a plot
plt.scatter(y_pred, y_test)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(0, np.max(y_test), 100)
plt.plot(diagonal, diagonal, '-r')
plt.show()
```
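As an aside, a fitted Random Forest can also report how much each predictor contributed via its `feature_importances_` attribute, which is a nice sanity check against the correlation analysis above. A self-contained sketch on synthetic data (not the actual housing data; here the first column is constructed to drive the target):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
# Synthetic stand-in for the predictor matrix: column 0 drives the target,
# the other two columns are pure noise
X = rng.rand(200, 3)
y = 100000 + 250000 * X[:, 0] + 5000 * rng.rand(200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
print(model.feature_importances_)  # one score per column, summing to 1
```

On the housing model, one would expect OverallQual and GrLivArea to dominate, mirroring their correlation scores.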
That is great! The red line shows the perfect predictions: if a prediction equals the real value, the point lies exactly on the red line. Here you can see some deviations and a few outliers, but that is mainly the case for prices which are extremely high. There are also some outliers in the low range, and it would be interesting to find out what causes them. To conclude, we can compute a few error metrics:
```python
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred))
```
```
MAE:	$23552.62
MSLE:	0.03613
```
A mean absolute deviation of about $23K, which is mainly due to the extreme outliers, is not too bad for a quick try! This is definitely something to keep in mind when buying a house!
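A Root Mean Squared error (RMSE) can be computed in the same style; it penalizes large misses more heavily than the MAE. A minimal sketch with hypothetical prediction/price pairs (the numbers are made up, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true prices and predictions, just to show the computation
y_true = np.array([200000.0, 150000.0, 320000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0])

# RMSE = sqrt(mean of squared errors)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print('RMSE:\t$%.2f' % rmse)
```

Swapping in `y_test` and `y_pred` gives the RMSE of the housing model.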
With the help of just a Random Forest regressor, it is possible to predict house prices fairly well! So, if you are about to buy a house, please contact me! Oh, and if you are interested in learning more about Pandas, definitely check out this article. If you are interested in writing an article on Data Blogger, please do so here! Yes, you will get paid :-).
The description of the competition can be found on Kaggle and my final notebook can be found here. Interested in predicting the value of your car? Then definitely read this article which uses a Neural Network for the price prediction. Another article on another Kaggle competition about restaurant reservations can be found here.