Keras Regression — King County House Dataset

Huseyn Kishiyev

In this tutorial I will build a regression model on a numerical house dataset to predict house prices in the King County area. I use TensorFlow and Keras to build and evaluate the model, and include a brief Exploratory Data Analysis (EDA) along the way.
First, we start by importing the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, we are going to load our dataset in our Python environment

df = pd.read_csv('kc_house_data.csv')

Data Cleaning

Here are the columns of our dataset:

df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')

The “id” and “date” columns are irrelevant for this example, so we can easily drop them from the dataset. You may also notice that the “zipcode” column is a collection of numerical values representing zip codes in King County. If we train on “zipcode” without the necessary feature engineering, the model will assume it is a continuous feature, and that assumption will hurt the accuracy of the label prediction (the “price” column in this case). Let’s use the .drop() method with two arguments: the list of columns to drop and the axis.

df = df.drop(['id', 'date', 'zipcode'], axis=1) #axis=1 means column axis
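As an aside, instead of dropping “zipcode” entirely, you could one-hot encode it so the model treats it as a categorical feature rather than a continuous one. A minimal sketch of that alternative (not used in the rest of this tutorial):

#Hypothetical alternative: one-hot encode zipcode instead of dropping it
df_raw = pd.read_csv('kc_house_data.csv') #fresh copy with zipcode intact
zipcode_dummies = pd.get_dummies(df_raw['zipcode'], prefix='zipcode')
df_encoded = pd.concat([df_raw.drop(['id', 'date', 'zipcode'], axis=1), zipcode_dummies], axis=1)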

We can also check whether our dataset contains any null values, so we can think of a couple of methods to replace them with values inferred from correlated columns.

df.isnull().sum() #Total number of null values

Output:
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64
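This dataset has no missing values, but if it did, one simple option would be to fill numeric gaps with a column median, or with a group-wise median based on a correlated column. A purely illustrative sketch (these calls are no-ops here, since there is nothing to fill):

#Fill a numeric column with its overall median
df['sqft_basement'] = df['sqft_basement'].fillna(df['sqft_basement'].median())
#Group-wise fill: median bathrooms of houses with the same bedroom count
df['bathrooms'] = df['bathrooms'].fillna(
    df.groupby('bedrooms')['bathrooms'].transform('median'))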

A Brief Exploratory Data Analysis (EDA)

There are several ways to start analyzing a dataset. You can begin by observing statistical measures and plotting histograms, box plots, count plots, and so on.

df.describe() #Gives you count, mean, std, min, max and the quartiles
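Since “price” is our label, it is also worth checking which features correlate with it most strongly. One quick way to do that with the pandas and seaborn imports above (a sketch, not a full EDA):

#Correlation of every feature with price, sorted from strongest to weakest
print(df.corr()['price'].sort_values(ascending=False))

#Scatter plot of one strong numeric relationship: sqft_living vs price
plt.figure(figsize=(10,5))
sns.scatterplot(x='sqft_living', y='price', data=df)
plt.show()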

Visualizing this Dataset

Let’s pick some variables from the dataset and plot their histograms. We could plot each variable’s histogram individually, but writing a small reusable function saves time and produces cleaner code. In this example, I will show how to expand your toolbox by writing a function to visualize your data. For instance, I can pick the numerical columns ‘bedrooms’, ‘bathrooms’, ‘sqft_living’, ‘sqft_lot’ and ‘floors’, put them into a list, and visualize all of them at once.

def plotHistogram(variable):
    """
    Input: Variable/Column name
    Output: Histogram
    """
    plt.figure(figsize=(10,5))
    plt.hist(df[variable], bins=85, color="blue")
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title(f"Data Frequency - {variable}")
    plt.show()

As you can see, the plotHistogram function takes one parameter called variable, creates a figure, and sets its properties one by one. We can call this function for each variable so that the output is a separate histogram for each one.

numerical_variables = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors']
plot_ = [plotHistogram(i) for i in numerical_variables]

This is a practical illustration of visualizing more than one histogram by writing a reusable function.

You might have already guessed that the output of this code is a set of histograms, one per variable. We could continue with further exploratory data analysis, but this is primarily a regression tutorial.

Splitting Labels and Features

#X represents the features: dropping the label column leaves the remaining feature columns
X = df.drop('price', axis=1)
#y is the label: the price column as a NumPy array
y = df['price'].values

Proceeding with the train/test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=101)

We perform feature scaling right after the train/test split. This way we fit the scaler only on the training set, which prevents data leakage from the test set. Technically, we can use any scaler we wish (e.g. StandardScaler), but in this case MinMaxScaler is a simple choice.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
#Change the definition of the training set as a scaled version.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Notice that we apply the same procedure to our test set, but while scaling the test set we call only .transform() instead of .fit_transform(). We don’t fit on the test set because we don’t want to assume prior information about it. Now that we’ve applied the necessary splitting and preprocessing steps, let’s build the model.

Building the model

Essentially, you will need both TensorFlow and Keras installed in your environment. While creating a Sequential model, bear in mind that the number of neurons in each layer will be based on the actual number of features.

X_train.shape #Output: (15117, 19)

Notice that the activation functions in all four hidden layers are the same (ReLU, the rectified linear unit). Now that we have constructed a deep neural network, we just need to add one last layer with a single neuron: the output layer that produces the prediction.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))
model.add(Dense(19, activation='relu'))

model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
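One small robustness note before we discuss the compile call: instead of hard-coding 19 neurons, you can read the layer width from the training data so the architecture follows the feature count automatically. An equivalent sketch of the same model:

n_features = X_train.shape[1] #19 in this dataset

model = Sequential()
for _ in range(4):
    model.add(Dense(n_features, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')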

Let’s look at the model compilation part of this code. You may notice that we use the Adam optimizer and mean squared error as the loss function. Adam is a popular default optimizer for good reasons: it generally converges quickly, adapts the learning rate per parameter, and requires relatively little tuning. Mean squared error, in turn, is a standard way of measuring error in regression models and a common default loss for regression problems.
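If you want more control, you can pass an optimizer instance instead of the string, for example to set the learning rate explicitly (0.001 is the Keras default for Adam). This compile call is equivalent to the one above:

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')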

Training the model

model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
batch_size=128, epochs=400)

Since we are dealing with a relatively large dataset, we train the model in batches. Batch sizes are usually set to powers of 2 (e.g. 64, 128, 256), and the smaller the batch size, the longer training will take. Smaller batches also add noise to the gradient updates, which can help reduce overfitting, since you are not passing the whole training set at once.

Finally, I chose an arbitrary number of epochs. An early stopping mechanism, sketched below, can pick the stopping point for you instead.
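Keras provides an EarlyStopping callback that watches the validation loss and stops training once it stops improving, which removes the guesswork from choosing the epoch count. A minimal sketch of the same fit call with early stopping (the patience value here is just a reasonable starting point):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)

model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
          batch_size=128, epochs=400, callbacks=[early_stop])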

Checking for Overfitting

To see whether our model is overfitting to the training data, we need to collect the training losses so we can compare them to the validation losses.

loss = pd.DataFrame(model.history.history) #Per-epoch loss and val_loss

We can proceed by comparing the loss on the training data with the loss on the validation data to find out whether we are overfitting. Let’s plot the loss and val_loss columns.

loss.plot()

Here’s the plot:

loss vs val_loss

We can clearly see that both losses drop sharply and then hold steady after a certain number of epochs. This is a strong indicator that we could keep training without overfitting to our training set. If, instead, val_loss started to increase after a certain epoch while the training loss kept falling, it would mean we were overfitting to the training data.
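If you prefer a number over an eyeball check, you can read the epoch with the lowest validation loss straight from the same DataFrame:

best_epoch = loss['val_loss'].idxmin() #Epoch index (0-based) with the lowest val_loss
print(best_epoch, loss['val_loss'].min())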

Evaluation

I will use some evaluation metrics to compare the original values and predictions.

from sklearn.metrics import mean_squared_error, mean_absolute_error

predictions = model.predict(X_test)
predictions #Array of predicted prices
mean_squared_error(y_test, predictions) #Compare y_test and predictions using MSE
#Output: 27452360549.322517

Since the raw squared-error value is hard to interpret, we can use mean absolute error to find the average absolute error of the predictions.

mean_absolute_error(y_test, predictions)
#Output: 102609.94870394483
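To judge whether an average error of roughly $100k is acceptable, it helps to put it next to the scale of the label, for example the mean house price, and to check the explained variance score:

from sklearn.metrics import explained_variance_score

print(df['price'].mean()) #Mean house price, for scale
print(mean_absolute_error(y_test, predictions) / df['price'].mean()) #Relative error
print(explained_variance_score(y_test, predictions)) #Closer to 1.0 is better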

