A machine learning project recipe
Introduction
Let's see how to run a machine learning project from beginning to end: from data preprocessing and model training to model selection and evaluation. First, data comes in many shapes and sizes, so the preprocessing step is crucial in helping the algorithm process the data and surface the best findings in it. Training a model then requires a solid understanding of statistics and other mathematical concepts: you need to know when a model overfits and how to use regularization to relieve the symptoms. After carefully calibrating different models, you may also need to select among them or ensemble them together for better results. In the end, we choose the metrics that decide what to show to the world.
Data preprocessing
Data preprocessing is a critical step. It involves cleaning, transforming, and organizing the data to prepare it for the modeling phase. If you have good data, it greatly helps the performance and accuracy of the model. The preprocessing step can involve several techniques such as handling missing values, removing outliers, data distribution investigation, normalization, categorical variable encoding, and feature engineering. Each technique aims to prepare the data in a way that best suits the machine learning algorithm that will be used for the task at hand. By the end of this recipe, you will have a clear understanding of the importance of data preprocessing and the tools and techniques necessary to prepare your data for machine learning.
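As a preview of how these steps are often chained together in scikit-learn, here is a minimal sketch of a preprocessing pipeline (the column lists are placeholders; in this recipe we will instead perform each step by hand to see what it does):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
num_cols = ['Rooms', 'Distance']    # placeholder numerical columns
cat_cols = ['Type', 'Regionname']   # placeholder categorical columns
# numerical columns: impute missing values, then standardize
num_pipeline = Pipeline([('impute', SimpleImputer(strategy='median')),
                         ('scale', StandardScaler())])
# categorical columns: impute with the most frequent value, then one-hot encode
cat_pipeline = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                         ('encode', OneHotEncoder(handle_unknown='ignore'))])
# apply each sub-pipeline to its own set of columns
preprocess = ColumnTransformer([('num', num_pipeline, num_cols),
                                ('cat', cat_pipeline, cat_cols)])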
We will use a dataset of housing prices in Melbourne. You can use any dataset of your choice. Let's first look at the histograms of the numerical variables:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
housing=pd.read_csv('melbourne-housing/melbourne_housing.csv')
housing.hist(bins=50, figsize=(15, 10))
plt.show()
Notice that in the histograms the Price axis is shown on a rescaled range from 0 to about 1.25 (matplotlib scales down the large values). Price, Distance, and several other variables are skewed; Rooms, Price, Bedroom2, Bathroom, and Car have very long right tails. We can address those issues later. First, let's look at the DataFrame info to handle the missing values:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 34857 non-null object
1 Address 34857 non-null object
2 Rooms 34857 non-null int64
3 Type 34857 non-null object
4 Price 27247 non-null float64
5 Method 34857 non-null object
6 SellerG 34857 non-null object
7 Date 34857 non-null object
8 Distance 34856 non-null float64
9 Postcode 34856 non-null float64
10 Bedroom2 26640 non-null float64
11 Bathroom 26631 non-null float64
12 Car 26129 non-null float64
13 Landsize 23047 non-null float64
14 BuildingArea 13742 non-null float64
15 YearBuilt 15551 non-null float64
16 CouncilArea 34854 non-null object
17 Lattitude 26881 non-null float64
18 Longtitude 26881 non-null float64
19 Regionname 34854 non-null object
20 Propertycount 34854 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 5.6+ MB
housing.head()
Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
5 rows × 21 columns
Before anything else, notice that there are 21 attributes: 13 of them are numerical and 8 are categorical. We will need to encode those categorical variables later. Now let's address the missing values.
Missing values
You can see that the Bedroom2 variable has 26640/34857 non-null observations, so about 24% of its data is missing, which is a lot. The same goes for Price, Bathroom, Car, and others; BuildingArea and YearBuilt have fewer than half of the observations.
There are different ways to handle missing data: remove the attribute altogether (e.g. drop the Bedroom2 column), remove the 24% of rows with missing data, or fill the gaps in (impute). For imputation, you can use a mean, median, or mode strategy. In general, we want to fill in the blanks with the central tendency of the data. Mean imputation fills in the average of the column; the median (the middle value) is preferred when the distribution is skewed; the mode is the most frequent value in the column. Apart from those simple techniques, there are more statistical approaches. One is to predict missing values from their relationship with the other features, using a regression or classification model. Another is to find similar or nearby observations in the dataset (for example with a k-means or nearest-neighbors algorithm) and borrow their values. We can also generate multiple plausible imputed datasets and combine them to estimate the uncertainty of the imputation (multiple imputation).
Decide on a suitable imputation method, or combination of methods, depending on the type, distribution, and amount of missing values, as well as the context and assumptions of the task. In our example, we can simply replace missing values with the mean or mode. Beyond that, since we understand this dataset to be house prices by suburb in Melbourne, we could find similar suburbs (by location, area, or population) and impute from those "nearest neighbors", so to speak.
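For example, the nearest-neighbors idea can be sketched with scikit-learn's KNNImputer (just an illustration; it is not the approach we take below):
from sklearn.impute import KNNImputer
# fill each numerical gap from the 5 rows that are closest on the other numerical features
num_cols = ['Rooms', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
knn_imputed = housing.copy()
knn_imputed[num_cols] = KNNImputer(n_neighbors=5).fit_transform(knn_imputed[num_cols])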
Specifically, we will drop BuildingArea, YearBuilt, and Landsize, keep only the rows where Price is present, and impute the rest with the mean for numerical variables and the mode for categorical variables. Keep in mind that imputing 24% of a column is quite a lot, though.
housing = housing.drop(['BuildingArea', 'YearBuilt', 'Landsize'], axis=1)
housing = housing[housing['Price'].notna()]
housingcopy = housing.copy()
# fill numerical gaps with the (truncated) column mean or mode, and categorical gaps with the mode;
# note that fillna with a full mode() Series (as done for CouncilArea and Regionname) aligns on index,
# so most of those gaps stay unfilled -- see the info() output below; use .mode()[0] to fill them all
housingcopy['Distance'].fillna(int(housingcopy['Distance'].mean()), inplace=True)
housingcopy['Postcode'].fillna(int(housingcopy['Postcode'].mode()[0]), inplace=True)
housingcopy['Bedroom2'].fillna(int(housingcopy['Bedroom2'].mode()[0]), inplace=True)
housingcopy['Bathroom'].fillna(int(housingcopy['Bathroom'].mode()[0]), inplace=True)
housingcopy['Car'].fillna(int(housingcopy['Car'].mode()[0]), inplace=True)
housingcopy['CouncilArea'].fillna(housingcopy['CouncilArea'].mode(), inplace=True)
housingcopy['Lattitude'].fillna(int(housingcopy['Lattitude'].mode()[0]), inplace=True)
housingcopy['Longtitude'].fillna(int(housingcopy['Longtitude'].mode()[0]), inplace=True)
housingcopy['Regionname'].fillna(housingcopy['Regionname'].mode(), inplace=True)
housingcopy['Propertycount'].fillna(int(housingcopy['Propertycount'].mode()[0]), inplace=True)
housingcopy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27247 entries, 1 to 34856
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 27247 non-null object
1 Address 27247 non-null object
2 Rooms 27247 non-null int64
3 Type 27247 non-null object
4 Price 27247 non-null float64
5 Method 27247 non-null object
6 SellerG 27247 non-null object
7 Date 27247 non-null object
8 Distance 27247 non-null float64
9 Postcode 27247 non-null float64
10 Bedroom2 27247 non-null float64
11 Bathroom 27247 non-null float64
12 Car 27247 non-null float64
13 CouncilArea 27244 non-null object
14 Lattitude 27247 non-null float64
15 Longtitude 27247 non-null float64
16 Regionname 27244 non-null object
17 Propertycount 27247 non-null float64
dtypes: float64(9), int64(1), object(8)
memory usage: 3.9+ MB
Skewness and Outliers
housingcopy.hist(bins=50, figsize=(15, 10))
plt.show()
The histograms now look a bit different. Next we address the other notable issues: skew and outliers. Price and Distance have very long right tails (skew), while Rooms, Postcode, Bedroom2, Bathroom, and Car have many outliers. To address skewed distributions we can apply a log transform; for outliers we can remove or cap them.
Skewness is a measure of the asymmetry of a distribution, or how much the distribution deviates from a normal distribution. A skewed distribution can affect the performance of machine learning models, especially those that assume a normal distribution or have sensitivity to outliers.
# calculate the skewness of each numerical column in the DataFrame
skewness = housingcopy.skew(numeric_only=True)
# print the skewness values
print(skewness)
Rooms 0.511276
Price 2.588969
Distance 1.478797
Postcode 3.981854
Bedroom2 0.909076
Bathroom 1.639065
Car 1.598751
Lattitude 1.095955
Longtitude -1.082893
Propertycount 1.016217
dtype: float64
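To quantify how much a log transform helps, compare the skew of a column before and after taking logs; for Price it drops from about 2.59 to roughly 0.33 (the value we will see later for logPrice):
print(housingcopy['Price'].skew())            # about 2.59: long right tail
print(np.log10(housingcopy['Price']).skew())  # about 0.33: much closer to symmetric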
We are going to address the skewness of Price and Distance by adding log-transformed versions of them. We will not address the skewness of Postcode despite it being large. For the other attributes, we handle outliers by capping them.
minDistance = 0.1
housingcopy['Distance'] = housingcopy['Distance'].where(housingcopy['Distance'] >= minDistance, minDistance)
maxRooms = 8
housingcopy['Rooms']=housingcopy['Rooms'].where(housingcopy['Rooms'] <= maxRooms, maxRooms)
maxPostcode=3400
housingcopy['Postcode']=housingcopy['Postcode'].where(housingcopy['Postcode'] <= maxPostcode, maxPostcode)
maxBedroom2=7.5
housingcopy['Bedroom2']=housingcopy['Bedroom2'].where(housingcopy['Bedroom2'] <= maxBedroom2, maxBedroom2)
maxBathroom=4
housingcopy['Bathroom']=housingcopy['Bathroom'].where(housingcopy['Bathroom'] <= maxBathroom, maxBathroom)
maxCar=7.5
housingcopy['Car']=housingcopy['Car'].where(housingcopy['Car'] <= maxCar, maxCar)
maxLattitude=-37.2
housingcopy['Lattitude']=housingcopy['Lattitude'].where(housingcopy['Lattitude'] <= maxLattitude, maxLattitude)
minLongtitude=144.4
housingcopy['Longtitude']=housingcopy['Longtitude'].where(housingcopy['Longtitude'] >= minLongtitude, minLongtitude)
housingcopy['logPrice'] = np.log10(housingcopy['Price'])
housingcopy['logDistance'] = np.log10(housingcopy['Distance'])
minlogDistance=0.1
housingcopy['logDistance']=housingcopy['logDistance'].where(housingcopy['logDistance'] >= minlogDistance, minlogDistance)
housingcopy=housingcopy.drop(['Price', 'Distance'], axis=1)
housingcopy.hist(bins=50, figsize=(20, 15))
plt.show()
Voilà! The histograms now look much better.
Feature engineering
To see if we can combine variables into new attributes, we first look at the correlation among them:
from pandas.plotting import scatter_matrix
attributes = ["logPrice", "Rooms", "Bedroom2",
"Bathroom","Postcode","Car","logDistance"]
scatter_matrix(housingcopy[attributes], figsize=(12, 12))
plt.show()
corr_matrix = housingcopy.corr()
corr_matrix["logPrice"].sort_values(ascending=False)
logPrice 1.000000
Rooms 0.529889
Bedroom2 0.428506
Bathroom 0.391019
Longtitude 0.212303
Car 0.171917
Postcode 0.110744
Propertycount -0.085084
logDistance -0.162746
Lattitude -0.193534
Name: logPrice, dtype: float64
It seems that price is most correlated with the number of rooms. We can combine the number of bedrooms and bathrooms into a new variable: total_restrooms.
housingcopy['total_restrooms']=housingcopy['Bedroom2']+housingcopy['Bathroom']
corr_matrix = housingcopy.corr()
corr_matrix["logPrice"].sort_values(ascending=False)
logPrice 1.000000
Rooms 0.529889
total_restrooms 0.464473
Bedroom2 0.428506
Bathroom 0.391019
Longtitude 0.212303
Car 0.171917
Postcode 0.110744
Propertycount -0.085084
logDistance -0.162746
Lattitude -0.193534
Name: logPrice, dtype: float64
Combined, total_restrooms is more correlated with logPrice than either Bedroom2 or Bathroom alone, so we could drop those two columns and use the total_restrooms variable instead.
Categorical variable encoding
There are different ways to encode categorical variables into numerical variables: label encoding, one hot encoding, and the like. In label encoding, we assign a unique integer or ordinal number to each category, based on their order or frequency. This method works well for categorical variables that have a natural order or hierarchy, such as education level or income bracket. However, it may not be suitable for variables that have no inherent order, such as colors or brands. In one-hot encoding, we create a binary vector or matrix for each category, where each element represents the presence or absence of the category. This method works well for categorical variables that have no order or hierarchy, and can handle multiple categories for each observation. However, it may create a large number of features or variables, and can lead to overfitting or sparsity issues.
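As a quick illustration of the difference on a tiny made-up column (not our dataset):
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
toy = pd.DataFrame({'Type': ['h', 'u', 't', 'h']})
# label/ordinal encoding: one integer per category, which implies an order that may not exist
print(OrdinalEncoder().fit_transform(toy))           # [[0.], [2.], [1.], [0.]]
# one-hot encoding: one binary column per category, no implied order
print(OneHotEncoder().fit_transform(toy).toarray())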
housingcopy.info()
Looking into those categorical variables, we will drop Address, SellerG, and Date, and use one-hot encoding for the rest, since those variables are not hierarchical (living in one region is not inherently superior to living in another).
housing_cat = housingcopy.select_dtypes(include=[object])
housing_cat=housing_cat.drop(['Address','SellerG','Date'],axis=1)
housing_cat
Suburb | Type | Method | CouncilArea | Regionname | |
---|---|---|---|---|---|
1 | Abbotsford | h | S | Yarra City Council | Northern Metropolitan |
2 | Abbotsford | h | S | Yarra City Council | Northern Metropolitan |
4 | Abbotsford | h | SP | Yarra City Council | Northern Metropolitan |
5 | Abbotsford | h | PI | Yarra City Council | Northern Metropolitan |
6 | Abbotsford | h | VB | Yarra City Council | Northern Metropolitan |
... | ... | ... | ... | ... | ... |
34852 | Yarraville | h | PI | Maribyrnong City Council | Western Metropolitan |
34853 | Yarraville | h | SP | Maribyrnong City Council | Western Metropolitan |
34854 | Yarraville | t | S | Maribyrnong City Council | Western Metropolitan |
34855 | Yarraville | h | SP | Maribyrnong City Council | Western Metropolitan |
34856 | Yarraville | h | PI | Maribyrnong City Council | Western Metropolitan |
27247 rows × 5 columns
np.set_printoptions(suppress=True)
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot=pd.DataFrame.sparse.from_spmatrix(housing_cat_1hot)
housing_cat_1hot.columns=cat_encoder.get_feature_names_out()
# reindexing to labels 1..n drops the original first row and adds one empty row at the end
housing_cat_1hot = housing_cat_1hot.reindex(range(1, len(housing_cat_1hot) + 1))
from sklearn.impute import SimpleImputer
# fill the empty row (and any remaining gaps) with the most frequent value of each column
modeimputer = SimpleImputer(strategy="most_frequent")
modeimputer.fit(housing_cat_1hot)
housing_cat_1hot = modeimputer.transform(housing_cat_1hot)
housing_cat_1hot=pd.DataFrame.sparse.from_spmatrix(housing_cat_1hot)
housing_cat_1hot.columns=cat_encoder.get_feature_names_out()
housing_cat_1hot
Suburb_Abbotsford | Suburb_Aberfeldie | Suburb_Airport West | Suburb_Albanvale | Suburb_Albert Park | Suburb_Albion | Suburb_Alphington | Suburb_Altona | Suburb_Altona Meadows | Suburb_Altona North | ... | CouncilArea_nan | Regionname_Eastern Metropolitan | Regionname_Eastern Victoria | Regionname_Northern Metropolitan | Regionname_Northern Victoria | Regionname_South-Eastern Metropolitan | Regionname_Southern Metropolitan | Regionname_Western Metropolitan | Regionname_Western Victoria | Regionname_nan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27242 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27243 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27244 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27245 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27246 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
27247 rows × 396 columns
Normalize
After encoding the categorical variables, let’s normalize the numerical attributes. The goal of normalization is to scale the values of different features or variables to a common range, so that they have similar magnitudes and are not dominated by one feature. We can use techniques like standardization (subtracting the mean and dividing by the standard deviation) or min-max scaling (scaling the values to a range between 0 and 1). We will use the standardization technique.
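For reference, min-max scaling would look like this on a small made-up frame (we stick with standardization below):
from sklearn.preprocessing import MinMaxScaler
# min-max scaling maps each column to [0, 1]: (x - min) / (max - min)
toy = pd.DataFrame({'Rooms': [1, 2, 3, 8], 'Distance': [0.1, 2.5, 10.0, 25.0]})
print(MinMaxScaler().fit_transform(toy))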
housing_num=housingcopy.select_dtypes(include=[np.number])
housing_num
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | total_restrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3067.0 | 2.0 | 1.0 | 1.0 | -37.79960 | 144.99840 | 4019.0 | 6.170262 | 0.397940 | 3.0 |
2 | 2 | 3067.0 | 2.0 | 1.0 | 0.0 | -37.80790 | 144.99340 | 4019.0 | 6.014940 | 0.397940 | 3.0 |
4 | 3 | 3067.0 | 3.0 | 2.0 | 0.0 | -37.80930 | 144.99440 | 4019.0 | 6.165838 | 0.397940 | 5.0 |
5 | 3 | 3067.0 | 3.0 | 2.0 | 1.0 | -37.79690 | 144.99690 | 4019.0 | 5.929419 | 0.397940 | 5.0 |
6 | 4 | 3067.0 | 3.0 | 1.0 | 2.0 | -37.80720 | 144.99410 | 4019.0 | 6.204120 | 0.397940 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34852 | 4 | 3013.0 | 4.0 | 1.0 | 3.0 | -37.81053 | 144.88467 | 6543.0 | 6.170262 | 0.799341 | 5.0 |
34853 | 2 | 3013.0 | 2.0 | 2.0 | 1.0 | -37.81551 | 144.88826 | 6543.0 | 5.948413 | 0.799341 | 4.0 |
34854 | 2 | 3013.0 | 2.0 | 1.0 | 2.0 | -37.82286 | 144.87856 | 6543.0 | 5.848189 | 0.799341 | 3.0 |
34855 | 3 | 3013.0 | 3.0 | 1.0 | 2.0 | -37.20000 | 144.40000 | 6543.0 | 6.056905 | 0.799341 | 4.0 |
34856 | 2 | 3013.0 | 2.0 | 1.0 | 0.0 | -37.81810 | 144.89351 | 6543.0 | 6.008600 | 0.799341 | 3.0 |
27247 rows × 11 columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(housing_num)
scaled_num = pd.DataFrame(scaled,index=housing_num.index,columns=housing_num.columns)
scaled_num.head()
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | total_restrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | -1.046448 | -0.52353 | -1.260957 | -0.696023 | -0.916577 | -0.493070 | 0.509046 | -0.789709 | 0.937812 | -2.094451 | -1.141261 |
2 | -1.046448 | -0.52353 | -1.260957 | -0.696023 | -2.083830 | -0.524085 | 0.490689 | -0.789709 | 0.245997 | -2.094451 | -1.141261 |
4 | 0.009182 | -0.52353 | -0.041078 | 0.856158 | -2.083830 | -0.529316 | 0.494360 | -0.789709 | 0.918107 | -2.094451 | 0.398809 |
5 | 0.009182 | -0.52353 | -0.041078 | 0.856158 | -0.916577 | -0.482981 | 0.503539 | -0.789709 | -0.134922 | -2.094451 | 0.398809 |
6 | 1.064812 | -0.52353 | -0.041078 | -0.696023 | 0.250676 | -0.521469 | 0.493259 | -0.789709 | 1.088619 | -2.094451 | -0.371226 |
# reindex to labels 1..n: labels missing from the filtered data become NaN rows (filled below), and rows with labels above n are dropped
scaled_num = scaled_num.reindex(range(1, len(scaled_num) + 1))
scaled_num
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | total_restrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -0.916577 | -0.493070 | 0.509046 | -0.789709 | 0.937812 | -2.094451 | -1.141261 |
2 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -2.083830 | -0.524085 | 0.490689 | -0.789709 | 0.245997 | -2.094451 | -1.141261 |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -2.083830 | -0.529316 | 0.494360 | -0.789709 | 0.918107 | -2.094451 | 0.398809 |
5 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -0.916577 | -0.482981 | 0.503539 | -0.789709 | -0.134922 | -2.094451 | 0.398809 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27243 | 0.009182 | -0.823824 | -0.041078 | 0.856158 | 1.417930 | -0.240095 | 0.309322 | -0.018541 | -0.013639 | -0.161467 | 0.398809 |
27244 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.276526 | -0.161467 | -0.371226 |
27245 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | 0.100486 | -0.161467 | -0.371226 |
27246 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.228164 | -0.161467 | -0.371226 |
27247 | 0.009182 | -1.006612 | -0.041078 | 0.856158 | 0.250676 | -0.803217 | -0.409169 | 1.774102 | -0.683839 | 0.703768 | 0.398809 |
27247 rows × 11 columns
meanimputer = SimpleImputer(strategy="mean")
meanimputer.fit(scaled_num)
scaled_num2=meanimputer.transform(scaled_num)
scaled_num2=pd.DataFrame(scaled_num2,columns=scaled_num.columns)
scaled_num2
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | total_restrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -0.916577 | -0.493070 | 0.509046 | -0.789709 | 0.937812 | -2.094451 | -1.141261 |
1 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -2.083830 | -0.524085 | 0.490689 | -0.789709 | 0.245997 | -2.094451 | -1.141261 |
2 | -0.031743 | -0.029750 | -0.069013 | -0.004629 | -0.067320 | -0.086359 | 0.077531 | -0.010363 | 0.007633 | -0.059730 | -0.045860 |
3 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -2.083830 | -0.529316 | 0.494360 | -0.789709 | 0.918107 | -2.094451 | 0.398809 |
4 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -0.916577 | -0.482981 | 0.503539 | -0.789709 | -0.134922 | -2.094451 | 0.398809 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27242 | 0.009182 | -0.823824 | -0.041078 | 0.856158 | 1.417930 | -0.240095 | 0.309322 | -0.018541 | -0.013639 | -0.161467 | 0.398809 |
27243 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.276526 | -0.161467 | -0.371226 |
27244 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | 0.100486 | -0.161467 | -0.371226 |
27245 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.228164 | -0.161467 | -0.371226 |
27246 | 0.009182 | -1.006612 | -0.041078 | 0.856158 | 0.250676 | -0.803217 | -0.409169 | 1.774102 | -0.683839 | 0.703768 | 0.398809 |
27247 rows × 11 columns
After preprocessing numerical and categorical variables, we concatenate them into one dataframe:
result = pd.concat([scaled_num2, housing_cat_1hot], axis=1, join='inner')
result.shape
result
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | ... | CouncilArea_nan | Regionname_Eastern Metropolitan | Regionname_Eastern Victoria | Regionname_Northern Metropolitan | Regionname_Northern Victoria | Regionname_South-Eastern Metropolitan | Regionname_Southern Metropolitan | Regionname_Western Metropolitan | Regionname_Western Victoria | Regionname_nan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -0.916577 | -0.493070 | 0.509046 | -0.789709 | 0.937812 | -2.094451 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | -1.046448 | -0.523530 | -1.260957 | -0.696023 | -2.083830 | -0.524085 | 0.490689 | -0.789709 | 0.245997 | -2.094451 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | -0.031743 | -0.029750 | -0.069013 | -0.004629 | -0.067320 | -0.086359 | 0.077531 | -0.010363 | 0.007633 | -0.059730 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -2.083830 | -0.529316 | 0.494360 | -0.789709 | 0.918107 | -2.094451 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.009182 | -0.523530 | -0.041078 | 0.856158 | -0.916577 | -0.482981 | 0.503539 | -0.789709 | -0.134922 | -2.094451 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27242 | 0.009182 | -0.823824 | -0.041078 | 0.856158 | 1.417930 | -0.240095 | 0.309322 | -0.018541 | -0.013639 | -0.161467 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27243 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.276526 | -0.161467 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27244 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | 0.100486 | -0.161467 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27245 | 0.009182 | -0.823824 | -0.041078 | -0.696023 | 0.250676 | 1.747460 | -1.687915 | -0.018541 | -0.228164 | -0.161467 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
27246 | 0.009182 | -1.006612 | -0.041078 | 0.856158 | 0.250676 | -0.803217 | -0.409169 | 1.774102 | -0.683839 | 0.703768 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
27247 rows × 407 columns
EDA
If we plot the data by longitude and latitude with respect to the price, we can see that the houses are clustered in the middle, where prices are also highest. Moving out from the centre, prices seem to increase toward the south-east of Melbourne, while prices in the north-west stay mostly the same. This pattern shows up in the log-price scatter plot too.
housing.plot(kind="scatter", x="Longtitude", y="Lattitude", grid=True,alpha=0.2)
<AxesSubplot:xlabel='Longtitude', ylabel='Lattitude'>
housing.plot(kind="scatter", x="Longtitude", y="Lattitude", grid=True,alpha=0.2,
s=housing["Rooms"], label="Rooms",
c="Price", cmap="jet", colorbar=True,
legend=True, sharex=False, figsize=(10, 7))
plt.show()
result.plot(kind="scatter", x="Longtitude", y="Lattitude", grid=True,alpha=0.2,
# s=housing["Rooms"], label="Rooms",
c="logPrice", cmap="jet", colorbar=True,
legend=True, sharex=False, figsize=(10, 7))
plt.show()
Training
Linear Regression with regularization
Linear regression is a simple yet powerful method for modeling the relationship between a dependent variable and one or more independent variables. We will run a plain linear regression and then add some regularization. Regularization usually helps the model cut down on overfitting, so it generalizes better and performs better on unseen data.
There are two main types of regularization: L1 regularization and L2 regularization. L1 regularization, also known as Lasso regularization, adds a penalty term proportional to the absolute value of the model coefficients. This results in sparse models, where some of the coefficients are set to zero, effectively removing some of the features from the model.
Lasso (least absolute shrinkage and selection operator) adds the absolute form on constraint (\(\alpha \sum_{j=1}^{p} \mid \theta_{j} \mid\)) to the parameters \(\theta\) in the loss function:
\[Loss = \sum_{i=1}^{n}(y_i - \sum_{j=1}^{p} x_{ij} \theta_j)^2 + \alpha \sum_{j=1}^{p} \mid \theta_{j} \mid\]L2 regularization, also known as Ridge regularization, adds a penalty term proportional to the squared magnitude of the model coefficients. This results in models with smaller coefficients and smoother decision boundaries, since it penalizes parameters with large magnitude (whether positive or negative).
Ridge Regression adds a squared form of constraint (\(\alpha \sum_{j=1}^{p} \theta_{j}^{2}\)) on the parameter \(\theta\) in the loss function:
\[Loss = \sum_{i=1}^{n}(y_i - \sum_{j=1}^{p} x_{ij} \theta_j)^2 + \alpha \sum_{j=1}^{p} \theta_{j}^{2}\]Elastic Net is a regularization technique used in linear regression models that combines the strengths of both L1 (Lasso) and L2 (Ridge) regularization. Elastic Net adds a penalty term to the loss function of the model that is a weighted combination of the L1 and L2 penalties. The weight of each penalty is controlled by a hyperparameter that determines the strength of the regularization.
The elastic net penalty term encourages the model to learn sparse representations, similar to L1 regularization, while also ensuring that correlated features are shrunk together, similar to L2 regularization. This makes it a particularly effective technique for dealing with high-dimensional datasets with correlated features, where L1 and L2 regularization alone may not be sufficient.
Here is the elastic net loss term:
\(Loss = \sum_{i=1}^{n}(y_i - \sum_{j=1}^{p} x_{ij} \theta_j)^2 + \alpha \sum_{j=1}^{p} \theta_{j}^{2} + \beta \sum_{j=1}^{p} \mid \theta_{j} \mid\) with hyperparameters \(\alpha\) and \(\beta\) balancing the L2 and L1 penalties.
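A quick way to see the sparsity effect of L1 versus the shrinkage of L2 is to fit both on synthetic data and count exact zeros among the coefficients (a toy sketch, separate from our housing data):
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression
# 100 features, only 10 of which are actually informative
X_toy, y_toy = make_regression(n_samples=200, n_features=100, n_informative=10, noise=1.0, random_state=0)
print((Lasso(alpha=1.0).fit(X_toy, y_toy).coef_ == 0).sum())   # many coefficients are exactly zero
print((Ridge(alpha=1.0).fit(X_toy, y_toy).coef_ == 0).sum())   # typically none; they are only shrunk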
In our example, Ridge gives us the best root mean squared error on the test set (0.196).
# target: the log price; features: the preprocessed matrix
# (note that result still contains the scaled logPrice column; strictly, it should be dropped from X)
y = housingcopy['logPrice']
X = np.array(result)
y = np.array(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
lin_rmse = mean_squared_error(y_test, y_pred,
squared=False)
lin_rmse
6923604520.172732
clf = Lasso(alpha=0.1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
lin_rmse = mean_squared_error(y_test, y_pred,
squared=False)
lin_rmse
0.2233975109057049
clf2 = Ridge(alpha=1.0)
clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)
lin_rmse = mean_squared_error(y_test, y_pred,
squared=False)
lin_rmse
0.196456119297143
regr = ElasticNet(random_state=0)
regr.fit(X_train, y_train)
y_pred=regr.predict(X_test)
lin_rmse = mean_squared_error(y_test, y_pred,
squared=False)
lin_rmse
0.2233975109057049
K-fold cross validation
For example, 10-fold cross validation splits the training data into 10 non-overlapping subsets called folds. In each round it holds out one fold for evaluation and trains on the other 9, producing 10 scores (here root mean squared error) from 10 training runs. We compare Ridge regression, a Decision Tree, and a Random Forest; Ridge again achieves the best (lowest) score of about 0.19, Random Forest is a bit worse at around 0.2, and the Decision Tree is around 0.27.
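Under the hood, cross_val_score (used just below) does roughly the following — a hand-rolled sketch for illustration:
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
kf = KFold(n_splits=10)
rmse_scores = []
for train_idx, val_idx in kf.split(X_train):
    # train on 9 folds, evaluate RMSE on the held-out fold
    model = Ridge(alpha=1.0)
    model.fit(X_train[train_idx], y_train[train_idx])
    pred = model.predict(X_train[val_idx])
    rmse_scores.append(mean_squared_error(y_train[val_idx], pred, squared=False))
print(np.mean(rmse_scores))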
from sklearn.model_selection import cross_val_score
lin_rmses = -cross_val_score(clf2, X_train, y_train,
scoring="neg_root_mean_squared_error", cv=10)
lin_rmses
array([0.19274845, 0.19486979, 0.1949542 , 0.19586131, 0.19846673,
0.19671158, 0.19747097, 0.1977136 , 0.20391569, 0.19843234])
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)
y_pred = tree_reg.predict(X_test)
tree_rmse = mean_squared_error(y_test, y_pred,
squared=False)
tree_rmse
0.2700937748340098
from sklearn.model_selection import cross_val_score
tree_rmses = -cross_val_score(tree_reg, X_train, y_train,
scoring="neg_root_mean_squared_error", cv=10)
tree_rmses
array([0.26613102, 0.26311385, 0.26655769, 0.26467363, 0.26733308,
0.27428677, 0.26872589, 0.27553391, 0.26702197, 0.2702954 ])
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_rmses = -cross_val_score(forest_reg, X_train, y_train,
scoring="neg_root_mean_squared_error", cv=10)
forest_rmses
array([0.20237999, 0.20497054, 0.20594208, 0.20602234, 0.20787097,
0.20677218, 0.20866224, 0.21074428, 0.21268268, 0.21056594])
Grid Search
A grid search lets you fine-tune the hyperparameters over a provided array (grid) of candidate values. There are two common variants: grid search and randomized search. In a randomized search you specify a range (or distribution) for each hyperparameter and candidate values are sampled from it at random. Since we got a good result with Ridge regression, we continue searching for its best hyperparameter. The result is that the best alpha for Ridge is 3, and the root mean squared error drops from 0.1965 to 0.1964.
grid_space = {'alpha': [0.5, 1, 3, 10]}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(clf2,param_grid=grid_space,cv=3,scoring='neg_root_mean_squared_error')
model_grid = grid.fit(X_train,y_train)
print('Best hyperparameters are: '+str(model_grid.best_params_))
print('Best score is: '+str(model_grid.best_score_))
Best hyperparameters are: {'alpha': 3}
Best score is: -0.19693259524684636
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {'alpha': randint(low=1, high=50),}
rnd_search = RandomizedSearchCV(
clf2, param_distributions=param_distribs, n_iter=10, cv=3,
scoring='neg_root_mean_squared_error', random_state=42)
rnd_search.fit(X_train, y_train)
# note: these lines re-print the grid-search results; inspect rnd_search.best_params_ and
# rnd_search.best_score_ to see what the randomized search found
print('Best hyperparameters are: '+str(model_grid.best_params_))
print('Best score is: '+str(model_grid.best_score_))
Best hyperparameters are: {'alpha': 3}
Best score is: -0.19693259524684636
clf2 = Ridge(alpha=3)
clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)
lin_rmse = mean_squared_error(y_test, y_pred,
squared=False)
lin_rmse
0.19635727066256592
Monitor
One way to monitor Melbourne housing prices for anomalies, so that people are not ripped off, is to watch the tails of the distribution. Since the log price is nicely distributed, we will consider anything more than 5 standard deviations above or below the mean to be an outlier. We can print out the outliers and notify the admin if necessary.
result['logPrice'].hist()
<AxesSubplot:>
result['logPrice'].skew()
0.33355745693169997
result['logPrice'].describe()
count 27247.000000
mean 0.007633
std 0.893743
min -4.589008
25% -0.510495
50% 0.007633
75% 0.483157
max 4.852758
Name: logPrice, dtype: float64
print("Upper limit",result['logPrice'].mean() + 5*result['logPrice'].std())
print("Lower limit",result['logPrice'].mean() - 5*result['logPrice'].std())
Upper limit 4.476346245938641
Lower limit -4.461080863899662
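Wrapped up as a small helper for ongoing monitoring (a sketch; the function name is our own, it simply reuses the 5-standard-deviation rule above):
def flag_price_anomalies(log_prices, n_std=5):
    # boolean mask: True where the (scaled) log price is more than n_std standard deviations from the mean
    mean, std = log_prices.mean(), log_prices.std()
    return (log_prices > mean + n_std * std) | (log_prices < mean - n_std * std)
anomalies = result[flag_price_anomalies(result['logPrice'])]
print(len(anomalies))  # 2 listings are flagged in our data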
result[(result['logPrice'] > 4.48) | (result['logPrice'] < -4.46)]
Rooms | Postcode | Bedroom2 | Bathroom | Car | Lattitude | Longtitude | Propertycount | logPrice | logDistance | ... | CouncilArea_nan | Regionname_Eastern Metropolitan | Regionname_Eastern Victoria | Regionname_Northern Metropolitan | Regionname_Northern Victoria | Regionname_South-Eastern Metropolitan | Regionname_Southern Metropolitan | Regionname_Western Metropolitan | Regionname_Western Victoria | Regionname_nan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4377 | -2.102077 | -1.254681 | -2.480836 | -0.696023 | -2.083830 | -0.461308 | 0.111067 | 0.000371 | -4.589008 | -0.609686 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
25634 | 1.064812 | 1.030168 | 1.178802 | 2.408339 | 0.250676 | -0.843387 | 0.465099 | 0.669859 | 4.852758 | 0.172301 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
2 rows × 407 columns
housing=housing.reindex(range(0,len(scaled_num)))
housing.iloc[4377]
Suburb Footscray
Address 4/33 Ballarat Rd
Rooms 3
Type t
Price 585000.0
Method S
SellerG Nelson
Date 3/09/2016
Distance 6.4
Postcode 3011.0
Bedroom2 3.0
Bathroom 1.0
Car 1.0
Landsize 259.0
BuildingArea NaN
YearBuilt NaN
CouncilArea Maribyrnong City Council
Lattitude -37.7955
Longtitude 144.9063
Regionname Western Metropolitan
Propertycount 7570.0
Name: 4377, dtype: object
housing.iloc[25634]
Suburb Brighton
Address 78 Champion St
Rooms 3
Type h
Price 2040000.0
Method S
SellerG Marshall
Date 28/10/2017
Distance 10.5
Postcode 3186.0
Bedroom2 3.0
Bathroom 3.0
Car 2.0
Landsize 446.0
BuildingArea NaN
YearBuilt NaN
CouncilArea Bayside City Council
Lattitude -37.9213
Longtitude 145.00454
Regionname Southern Metropolitan
Propertycount 10579.0
Name: 25634, dtype: object