Machine Learning: Regression in Python

Jan 9, 2022

In my first article, I will explain how to implement a Simple Linear Regression model in Python. Before we get started, I will cover the theory, then demonstrate how it works in practice.

Example: Visualization of Simple Linear Regression

Theory behind Linear Regression

Simple Linear Regression is a regression algorithm that models the relationship between a dependent variable (usually y) and a single independent variable (x). The model represents that relationship as a straight line with a slope and an intercept: it is "linear" because of the straight line and "simple" because there is only one independent variable. It has two main goals:

Understand the relationship between two variables.
Predict new observations.

The mathematical representation is shown below, where a0 stands for the y-intercept, a1 stands for the slope, and ε stands for the error term.

                          y = a0 + a1x + ε
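To make the formula concrete, here is a minimal sketch of how a0 and a1 can be estimated with the standard ordinary least squares formulas. The numbers are made up for illustration and do not come from the salary dataset used later.

```python
import numpy as np

# Hypothetical toy data (not from Salary_Data.csv): y is roughly 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# Ordinary least squares estimates for y = a0 + a1*x + error.
x_mean, y_mean = x.mean(), y.mean()
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a0 = y_mean - a1 * x_mean                                             # y-intercept

# The residuals are the sample counterpart of the error term.
residuals = y - (a0 + a1 * x)

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
```

This is the same kind of fit that scikit-learn's LinearRegression performs for us in the implementation part, so there is no need to code the formulas by hand in practice.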

Problem Statement: Salary vs Years of Experience

We are going to define this problem first, then implement our solution based on Linear Regression. Salary and years of experience tend to be positively correlated: for most of our careers, we earn more as we gain experience and develop our skills. We will use Linear Regression to model the relationship between salary and years of working experience.

Implementation — Importing Libraries

Let’s get our hands dirty and dive into the implementation part. The first step of our implementation is Data Pre-Processing. We will start by importing libraries that will help us to manipulate data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Importing Dataset & Dividing it into two parts: Features and Labels

dataset = pd.read_csv('Salary_Data.csv')

To divide the data into two parts, we will use .iloc with the appropriate index ranges. Notice that X stands for the features (the independent variable, years of experience) and y stands for the label (the dependent variable, salary).

X = dataset.iloc[:, :-1].values 
y = dataset.iloc[:, -1].values
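As a quick illustration of what this indexing produces, here is a small sketch using a made-up three-row DataFrame in place of the CSV file. The column names and values are assumptions for the example only.

```python
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv; column names and values are made up.
dataset = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0],
    "Salary": [40000.0, 45000.0, 55000.0],
})

X = dataset.iloc[:, :-1].values  # every column except the last -> 2D array
y = dataset.iloc[:, -1].values   # only the last column -> 1D array

print(X.shape)  # (3, 1)
print(y.shape)  # (3,)
```

X stays two-dimensional even with a single feature, which is the shape scikit-learn estimators expect for their inputs.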

Now that we have imported and divided data into two parts, we will proceed with taking care of missing data.

Taking Care of Missing Data

There are several ways of dealing with missing data. A common and simple strategy is to replace each missing numerical value with the mean of the existing values in the same column. We need SimpleImputer() from Scikit-learn to accomplish this step. See the Scikit-learn documentation for SimpleImputer for more details.

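from sklearn.impute import SimpleImputer
# Replace missing values (NaN) with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# X has a single feature column here, so we fit on all of X
# (slicing X[:, 1:3] would select no columns and raise an error)
imputer.fit(X)
X = imputer.transform(X)
print(X)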
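To see what mean imputation does, here is a minimal sketch on a made-up column with one missing value. The array X_demo is hypothetical and only serves the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single feature column with one missing value.
X_demo = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_demo)  # fit and transform in one call

# The NaN is replaced by the mean of 1, 2 and 6, which is 3.0.
print(X_filled.ravel())
```

Note that fit_transform() is just a shortcut for calling fit() and then transform() on the same data, as we did above.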

Splitting the Dataset into the Training Set and Test Set

Now that the required Data Pre-processing steps are done, the next step is to split our Dataset into a Training Set and a Test Set. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its inputs are given to the trained model, and the resulting predictions are compared to the expected values. This second subset is referred to as the test dataset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
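Here is a small sketch of what the split returns, assuming a made-up dataset of 30 observations (the arrays below are hypothetical). With test_size = 1/3, one third of the rows go to the test set, and fixing random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 observations of a single feature.
X_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
print(len(X_tr), len(X_te))  # 20 training rows, 10 test rows

# Using the same random_state again yields exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```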

Training the Simple Linear Regression Model on the Training Set

It's time to build our Linear Regression model. First, we are going to import LinearRegression from sklearn.linear_model and create an instance of it. Then, we are going to call the .fit() method and pass two arguments: X_train and y_train. To predict the test results, we call .predict() and pass X_test as an argument.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
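After fitting, the learned parameters can be read back from the model and matched to the equation from the theory part: intercept_ corresponds to a0 and coef_ to a1. The sketch below uses made-up, noise-free data on the line y = 100 + 5x, and r2_score is just one possible way (my choice, not part of the workflow above) to compare predictions with the expected values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noise-free data lying exactly on y = 100 + 5x.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 100.0 + 5.0 * X_demo.ravel()

model = LinearRegression()
model.fit(X_demo[:7], y_demo[:7])   # train on the first 7 rows
y_hat = model.predict(X_demo[7:])   # predict the 3 held-out rows

print(model.intercept_)             # estimate of a0, approximately 100
print(model.coef_[0])               # estimate of a1, approximately 5
print(r2_score(y_demo[7:], y_hat))  # close to 1.0 because the data has no noise
```

On real data such as the salary dataset, the score will be lower than 1 because the points do not lie exactly on a line.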

Visualizing the Training set results

So far we have applied Data Preprocessing and Model Training. Next, we will visualize our training and test set results to see how well the line of best fit (the line that represents the predicted values) matches the data. We already imported matplotlib.pyplot as plt, so we can plot the training set directly.

plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Visualizing the Test set results

plt.scatter(X_test, y_test, color='red')
# The regression line is the same for both sets, so it is still drawn from X_train
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()