
Case Study 1 - Artificial Intelligence: Machine Learning - Part 2

Data Preprocessing and Feature Engineering

Data preprocessing, often called data preparation, is a vital step in the analysis process. Almost every naturally occurring data set has discrepancies that can lead to incorrect predictions. Before building features out of the data objects present in the input, it is important to standardize these inputs and make them ready for analysis. Below are a few key steps we take while preparing the data for analysis.
  1. Standardization: Standardization is a commonly used scaling technique that rescales values onto a standard scale centered around the mean of the feature being scaled, with unit standard deviation. After standardization, the mean of the scaled column becomes 0 and its standard deviation (S.D.) becomes 1.
    X' = (X - μ) / σ
    Here μ (mu) represents the mean of the feature values and σ (sigma) represents its standard deviation.
  2. Normalization: Normalization is another scaling technique applied to input data. It shifts and rescales the values of a sample so that they fall between 0 and 1. This benefits distance-based algorithms, since every feature then shares the same small, predefined range.
    X' = (X - X_min) / (X_max - X_min)
    Here X_max and X_min represent the maximum and minimum observations present in the sample. A short sketch of both transformations follows this list.
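Below is a minimal, self-contained sketch of both transformations using scikit-learn's StandardScaler and MinMaxScaler. The small LotArea sample is purely illustrative and is not part of the case-study pipeline.

# Illustrative sketch of standardization and normalization (not part of the case-study pipeline)
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaling_sample = pd.DataFrame({'LotArea': [8450, 9600, 11250, 9550, 14260]})
standardized = StandardScaler().fit_transform(scaling_sample)   # (X - mean) / S.D. -> mean 0, S.D. 1
normalized = MinMaxScaler().fit_transform(scaling_sample)       # (X - min) / (max - min) -> values in [0, 1]
print(standardized.ravel().round(2))
print(normalized.ravel().round(2))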
In any real-world project, the data we receive is raw, captured as it occurred; cleanly separated training and testing sets are not readily available. To keep the preprocessing consistent, we will combine the training and testing data in this step: preprocessing the combined data ensures that both sets go through exactly the same transformations. Once the preprocessing is complete on the combined set, we will separate the testing and training data once again. Let us now start the preprocessing and feature engineering process on our dataset; the general shape of this combine-and-split-back pattern is sketched below.
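Before moving to the actual code, here is a minimal sketch of the pattern, written with hypothetical frame names (train_features, test_features) for illustration only: remember how many rows belong to the training set, concatenate, preprocess the combined frame, and slice it back at that same boundary.

# Sketch of the combine -> preprocess -> split-back pattern (hypothetical frame names, illustration only)
n_train = len(train_features)                                      # remember where the training rows end
combined = pd.concat([train_features, test_features], axis=0, sort=False)
# ... shared preprocessing on `combined` goes here ...
train_processed = combined[:n_train]                               # first n_train rows -> training set
test_processed = combined[n_train:]                                # remaining rows -> test set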

# << Code continued from the first section >>
# Separating Target variable and Features
var_target = housing_train_data['SalePrice']
housing_test_id = housing_test_data['Id']
housing_test_data = housing_test_data.drop(['Id'],axis = 1)
housing_data2 = housing_train_data.drop(['SalePrice'], axis = 1)
# Concatenating the train & test datasets
housing_train_test = pd.concat([housing_data2,housing_test_data], axis=0, sort=False)

# Finding the NaN variables from this dataset and Plotting all NaN
nan_df = pd.DataFrame(housing_train_test.isna().sum(), columns = ['Sum_NaN'])
nan_df['feature_set'] = nan_df.index
nan_df['amount_percent'] = (nan_df['Sum_NaN'] / len(housing_train_test)) * 100  # percentage of rows missing in the combined set
plt.figure(figsize = (32,8))
sns.barplot(x = nan_df['feature_set'], y = nan_df['amount_percent'])
plt.xticks(rotation=45)
plt.title('Features and their corresponding NaN Weightage')
plt.xlabel('Features Sets')
plt.ylabel('Amount of Missing Data')
plt.show()
>>>

[Figure 6-1: Features and their corresponding NaN weightage]

# Although the number of NaN values shown is high, not all of them are truly missing. According to the
# column descriptions, some values are expected to be absent and are therefore not real NaN. We will
# impute these values accordingly.

# Converting categorical variables that are stored as integers into strings
housing_train_test['MSSubClass'] = housing_train_test['MSSubClass'].apply(str)
housing_train_test['YrSold'] = housing_train_test['YrSold'].apply(str)
housing_train_test['MoSold'] = housing_train_test['MoSold'].apply(str)

# Manually filling up the categorical NaN values
housing_train_test['Functional'] = housing_train_test['Functional'].fillna('Typ')
housing_train_test['Electrical'] = housing_train_test['Electrical'].fillna("SBrkr")
housing_train_test['KitchenQual'] = housing_train_test['KitchenQual'].fillna("TA")
housing_train_test["PoolQC"] = housing_train_test["PoolQC"].fillna("None")
housing_train_test["Alley"] = housing_train_test["Alley"].fillna("None")
housing_train_test['FireplaceQu'] = housing_train_test['FireplaceQu'].fillna("None")
housing_train_test['Fence'] = housing_train_test['Fence'].fillna("None")
housing_train_test['MiscFeature'] = housing_train_test['MiscFeature'].fillna("None")

# Performing Mode Imputation on some columns
housing_train_test['Exterior1st'] = housing_train_test['Exterior1st'].fillna(housing_train_test['Exterior1st'].mode()[0])
housing_train_test['Exterior2nd'] = housing_train_test['Exterior2nd'].fillna(housing_train_test['Exterior2nd'].mode()[0])
housing_train_test['SaleType'] = housing_train_test['SaleType'].fillna(housing_train_test['SaleType'].mode()[0])

# Removing the variables that will not contribute to the prediction
non_affecting_cols = ['GarageYrBlt','YearRemodAdd'] 
housing_train_test = housing_train_test.drop(non_affecting_cols, axis = 1)

# We now move into Feature Engineering. Here we will create new features by combining existing
# variables, which should help improve the model's performance.
# Combining features to create new ones
housing_train_test["SqFtPerRoom"] = housing_train_test["GrLivArea"] / (housing_train_test["TotRmsAbvGrd"] + housing_train_test["FullBath"] + housing_train_test["HalfBath"] + housing_train_test["KitchenAbvGr"])
housing_train_test['Total_Home_Quality'] = housing_train_test['OverallQual'] + housing_train_test['OverallCond']
housing_train_test['Total_Bathrooms'] = (housing_train_test['FullBath'] + (0.5 * housing_train_test['HalfBath']) + housing_train_test['BsmtFullBath'] + (0.5 * housing_train_test['BsmtHalfBath']))
housing_train_test["HighQualSF"] = housing_train_test["1stFlrSF"] + housing_train_test["2ndFlrSF"]
# Creating dummy variables
housing_train_test_dummy = pd.get_dummies(housing_train_test)
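# As an aside, pd.get_dummies performs one-hot encoding: every categorical column is replaced by one
# indicator column per category. A tiny, purely illustrative example (the toy frame below is hypothetical):
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
pd.get_dummies(toy)   # produces the indicator columns 'Street_Grvl' and 'Street_Pave'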

With this we conclude the Feature Engineering phase of the data set. Next, we begin training the machine learning models to start with the prediction.

Building the Machine Learning Model and Performing Predictions

Machine Learning branches out of the broader field of Artificial Intelligence and concentrates on methodologies for data analysis. It is based on the idea that systems and algorithms can learn from data when they are trained on a set of given features and patterns. This enables the algorithm to make decisions with little or no human intervention. Machine Learning evolved from pattern recognition and is built on the theory that computer programs can learn when trained, without needing every output to be explicitly declared. The growing volume of data observed every day is the reason for the recent rise in interest in machine learning, which in turn makes cheap computational processing and cloud computing ever more important.
We start the Machine Learning implementation on this data set below.

# Preparation of test and training data
housing_train_data = housing_train_test_dummy[0:1460]   # first 1460 rows belong to the original training set
housing_test_data = housing_train_test_dummy[1460:]     # remaining 1459 rows belong to the test set
housing_test_data['Id'] = housing_test_id
housing_train_data.shape


>>> (1460, 326)


# Importing libraries for building the machine learning model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, BayesianRidge 
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import accuracy_score

# We take the natural logarithm (log1p) of the target variable (SalePrice), element-wise, to compress its range and reduce skew
housing_sample = housing_train_data[0:1460]
target_log = np.log1p(var_target)
target_log.shape


>>> (1460,)


housing_sample.shape


>>> (1460, 326)


housing_sample = housing_sample.fillna(housing_sample.mean())
# Train, Test and Split the data for analysis
X_train,X_val,y_train,y_val = train_test_split(housing_sample,target_log,test_size = 0.1,random_state=42)
# We start fitting the data to multiple machine learning algorithms. It is always advisable to use more
# than one algorithm and compare the results in order to find the best fit
# 1. Linear Regression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
linear_regression_score = linear_regression.score(X_train, y_train)
# 2. Support Vector Machine Regression
support_vector_regressor = SVR()
support_vector_regressor.fit(X_train, y_train)
support_vector_score = support_vector_regressor.score(X_train, y_train)
# 3. Decision Tree Regression
decision_tree_regressor = DecisionTreeRegressor()
decision_tree_regressor.fit(X_train, y_train)
decision_tree_score = decision_tree_regressor.score(X_train, y_train)
# 4. Random Forest Regression
random_forest_regressor = RandomForestRegressor(n_estimators=150)
random_forest_regressor.fit(X_train, y_train)
random_forest_score = random_forest_regressor.score(X_train, y_train)
# 5. Bayesian Ridge Regression
bayesian_ridge_regressor = BayesianRidge(compute_score=True)
bayesian_ridge_regressor.fit(X_train, y_train)
bayesian_ridge_score = bayesian_ridge_regressor.score(X_train, y_train)
# 6. Gradient Boost Regression
gradient_boost_regressor = GradientBoostingRegressor()
gradient_boost_regressor.fit(X_train, y_train)
gradient_boost_score = gradient_boost_regressor.score(X_train, y_train)
# Comparison of individual accuracy scores
accuracy_scores = []
Used_ML_Models = ['Linear Regression','Support Vector Machines','Decision Trees',
                   'Random Forest Regression','Bayesian Ridge Regression',
                   'Gradient Boost Regression']
accuracy_scores.append(linear_regression_score)
accuracy_scores.append(support_vector_score)
accuracy_scores.append(decision_tree_score)
accuracy_scores.append(random_forest_score)
accuracy_scores.append(bayesian_ridge_score)
accuracy_scores.append(gradient_boost_score)
score_comparisons = pd.DataFrame(Used_ML_Models, columns = ['Regressors'])
score_comparisons['Accuracy on Training Data'] = accuracy_scores
score_comparisons


>>> 
    Regressors	                     Accuracy on Training Data
0	Linear Regression	             0.946865
1	Support Vector Machines	         0.946865
2	Decision Trees	                 0.946865
3	Random Forest Regression         0.982092
4	Bayesian Ridge Regression        0.926543
5	Gradient Boost Regression        0.939485


# We observe that most of the models score in a similar range, although the Random Forest Regressor is the most accurate on this training data set
# Predict on the validation split and take the final Random Forest score on the training data
final_prediction = random_forest_regressor.predict(X_val)
rfc_score = random_forest_regressor.score(X_train, y_train)
rfc_score


>>> 0.9818293405075341
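
# Note that the scores above are computed on the training data, so they can be optimistic. As a sketch,
# the held-out validation split created earlier (X_val, y_val) can be used to sanity-check the Random
# Forest model; this block is illustrative and assumes the objects created above are still in scope
from sklearn.metrics import mean_squared_error
validation_r2 = random_forest_regressor.score(X_val, y_val)                # R^2 on unseen rows
validation_rmse = np.sqrt(mean_squared_error(y_val, final_prediction))     # RMSE on the log-scaled prices
print(validation_r2, validation_rmse)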


# Ranking the features by importance and plotting the top 15
important_features = pd.Series(random_forest_regressor.feature_importances_, index = housing_sample.columns)
top_features = important_features.sort_values(ascending = False)[:15]
plt.figure(figsize = (15,5))
plt.barh(top_features.index[::-1], top_features.values[::-1])
plt.xlabel('Feature Importance')
>>>

[Figure 6-2: Feature importances from the Random Forest model]

# Now that we know which features hold the most importance, we will run a few sample predictions on the test data and derive the house prices for a few Ids
# Test the predictions on the test data set and derive the prices for a few samples
housing_test_sample = housing_test_data.drop(['Id'], axis = 1)
housing_test_sample = housing_test_sample.fillna(housing_test_sample.mean())
Pred_test = random_forest_regressor.predict(housing_test_sample)
Sale_price_pred = pd.DataFrame(housing_test_id, columns = ['Id'])
Pred_test = np.expm1(Pred_test)
Sale_price_pred['SalePrice'] = Pred_test
Sale_price_pred.head()


>>> 
        	Id	SalePrice
0	1461	205511.779437
1	1462	176017.744645
2	1463	223587.840662
3	1464	156579.649258
4	1465	263174.543961


# With this we conclude the analysis of house prices

The aim of this project was to understand which factors matter most in determining the price of a house in Iowa, and to run the model on test data to predict house prices. Conclusions drawn from the machine learning model building and analysis above:
  1. Successful house price predictions were made using the Random Forest Regression model.
  2. The Random Forest regressor performed better than the other machine learning algorithms we tried on this data set.
  3. In terms of feature importance, the overall quality of the house was the most important factor, followed by the above-ground living area and the engineered high-quality square footage (HighQualSF) feature.

Concluding Machine Learning and the Way Forward

Machine Learning works on the notion that computer systems learn from data and can make predictions based on the input data provided for training. These predictions also work on data that the model has never seen before, which is what makes the machine intelligent. The idea we started with in the first chapter, building intelligent machines to solve human problems, has proven helpful in this project. Machine Learning makes up a large portion of Artificial Intelligence's real-world implementations; the majority of AI systems running today rely on ML, and this project is the start of your journey into the world of machine learning.

In the upcoming chapters we will dive into other aspects of Artificial Intelligence that work on different kinds of data. We will learn about logic programming in Python, which helps in building AI applications. We will also work with data based on natural language, images and voice, which fall under the category of intuitive and cognitive artificial intelligence. Coming up next are:
  1. Logic Programming in Python
  2. Working with natural languages in text and speech
  3. Learning from voice-based data