Streamlining ML Workflow: Tidy Models vs. Scikit-Learn

A Deep Dive into Two Powerful Frameworks for Streamlined Machine Learning Workflows

Introduction:

A data science experiment or model-building exercise is a series of steps followed in a particular sequence, more often than not over multiple iterations. The focus quickly shifts towards tracking model accuracy and related metrics, but there are many other aspects that can make the entire process more structured, making the code not only easier to read but also easier to reproduce. In this blog, we will explore both scikit-learn pipelines and tidymodels workflows on the same dataset. We will first look at building machine learning models in an unstructured way, and then build the same models with pipelines in Python and then with workflows in R.

Table of Contents

  1. Unstructured approach to model building
  2. Data exploration & processing
  3. Model building with pipelines in Python
  4. Model building with pipelines in R
  5. Further exploration/experimentation

Load Python Packages & Data: We will load the required packages; if you don’t have a library installed, install it with a simple pip install <libraryname>. In this blog, we will be using the HCV dataset.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Load Dataset
df = pd.read_csv("hcvdata.csv")

Exploratory Data Analysis: Let us explore the data to understand the data types of each of the variables in the dataset.

df.describe()
df.dtypes

Output:
Category     object
Age           int64
Sex          object
ALB         float64
ALP         float64
ALT         float64
AST         float64
BIL         float64
CHE         float64
CHOL        float64
CREA        float64
GGT         float64
PROT        float64

Let us also find the percentage of missing values on the dataset.

total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)*100
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Output:
          Total   Percent
ALP          18  2.926829
CHOL         10  1.626016
ALB           1  0.162602
ALT           1  0.162602
PROT          1  0.162602
Category      0  0.000000
Age           0  0.000000
Sex           0  0.000000
AST           0  0.000000
BIL           0  0.000000
CHE           0  0.000000
CREA          0  0.000000
GGT           0  0.000000

Observations:

  • There are two categorical variables namely Category and Sex, of which Category is our target variable. These two variables will need some processing which we will do in the next section.
  • There are missing values in some of the variables which need to be addressed before we start model building.

Data Processing: We will encode the two categorical variables and impute the NA values with 0 for the sake of simplicity.

cat_dict = {'0=Blood Donor':0, '0s=suspect Blood Donor':0, '1=Hepatitis':1,
            '2=Fibrosis':2, '3=Cirrhosis':3}
df['Category'] = df['Category'].map(cat_dict)
df['Sex'] = df['Sex'].map({'m':1, 'f':0})
df = df[['Age', 'Sex', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT', 'Category']]
df.fillna(0, inplace=True)

Data Splitting:

# Features & Labels
X_features = df.drop('Category', axis=1)
y_labels = df['Category']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels,
                                                    test_size=0.3, random_state=42)

Model Building in Python: We will build a logistic regression model.

lr = LogisticRegression()
lr.fit(X_train, y_train)

# Prediction Accuracy
print("Accuracy:", lr.score(X_test, y_test))

Output:
Accuracy: 0.9243243243243243

Let us now build the same model, but this time with pipelines.

Model Building with Pipelines: There are specific steps in building a model, right from loading the data to measuring the final model metrics, and each of these steps is an object. The purpose of a pipeline is to collate all of these objects into one structured sequence. This greatly improves code readability and reusability.
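As a quick illustration of the idea, here is a minimal, self-contained sketch (the toy_pipe name and the scaler/model combination are just for illustration): chaining two objects under one Pipeline means a single fit() call runs every step in order.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Two objects (a scaler and a classifier) collated into one sequence;
# calling fit()/score() on the pipeline runs each step in order.
toy_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
# toy_pipe.fit(X_train, y_train); toy_pipe.score(X_test, y_test)

With the idea in place, let us now build the full pipeline for our dataset.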

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Reload the raw dataset so the original dtypes are available for column selection
df = pd.read_csv("hcvdata.csv")

# Train Test Split (reusing X_features and y_labels from the earlier processing step)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels,
                                                    test_size=0.3, random_state=42)

# Segregate numeric and categorical columns, dropping the index column and the target
numeric_features = df.select_dtypes(include=['int64', 'float64']).drop(['Unnamed: 0'], axis=1).columns
categorical_features = df.select_dtypes(include=['object']).drop(['Category'], axis=1).columns

Preprocessing: In this step, we define the kind of transformations that we need to carry out on our numeric_features and categorical_features, e.g. missing value imputation, standardizing the data, the type of encoding, etc.

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=2)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

Now that we have defined our features and the respective transformations, we will wrap them in a pipeline in the next section.

Building Pipeline in Python: We will define a pipeline and plug in the objects we have already defined.

pipe_rf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier())])
pipe_rf.fit(X_train, y_train)
print('Accuracy score: ', pipe_rf.score(X_test, y_test))

Output:
Accuracy score: 0.9135
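Because the preprocessing lives inside the pipeline, the same imputation, scaling and encoding are applied automatically at prediction time. A minimal usage sketch, assuming the pipe_rf object and X_test from above:

# The preprocessor step transforms the raw rows before the classifier is called
sample = X_test.iloc[:5]
print(pipe_rf.predict(sample))        # predicted classes
print(pipe_rf.predict_proba(sample))  # class probabilities from the random forest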

Building Multiple Models & Comparison: Let us build KNeighbors, Random Forest and Logistic Regression classifiers, leveraging the pipeline that we have already built (reusability in action).

classifiers = [KNeighborsClassifier(3), RandomForestClassifier(),
               LogisticRegression()]

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))

Output:
KNeighborsClassifier(n_neighbors=3)
model score: 0.854
RandomForestClassifier()
model score: 0.892
LogisticRegression()
model score: 0.892

We have now tried both approaches, and the pipeline-based process makes the workflow structured and sequenced, with improved code readability and easy reproducibility.
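A further benefit of this structure is that the fitted pipeline bundles the preprocessing and the model into a single object, so it can be persisted and reloaded as one artifact. A minimal sketch using joblib (the file name and the pipe_rf object are illustrative, not part of the original code):

import joblib

# Save the fitted pipeline (preprocessing + model) as a single artifact
joblib.dump(pipe_rf, "hcv_pipeline.joblib")

# Later, reload it and score raw feature rows directly
loaded_pipe = joblib.load("hcv_pipeline.joblib")
print("Reloaded model score: %.3f" % loaded_pipe.score(X_test, y_test))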

Model Building in R: Similar to pipelines in Python, we have workflows in R. In R, the term pipeline is widely used for stitching a sequence of operations together with %>%, so to avoid ambiguity, the sequence of operations in the context of model building is termed a workflow. In the next section, we will try to replicate all of the steps in R.

Load Packages & Data:

library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)
library(nnet)
setwd('<Set your working directory here>')
df <- read.csv('hcvdata.csv')

Preprocessing: We will be splitting the data and segregating the numeric and categorical variables.

hcv_split <- initial_split(df, prop = 3/4)
# extract training and testing sets
hcv_train <- training(hcv_split)
hcv_test <- testing(hcv_split)

# segregate numeric and categorical variables
numeric_features <- df %>%
  select_if(is.numeric)
names(numeric_features)
categorical_features <- df %>%
  select_if(is.character) %>%
  select(Sex)
names(categorical_features)

Building Pipeline in R: The terminology is slightly different in tidymodels. Let’s familiarize ourselves with some of the key terms.

  • Recipe — A function to define the transformations for data preprocessing.
  • Juice — A function to extract the processed training data from a prepared recipe.
  • Bake — A function to apply the transformations learned on the training set to new data, such as the test set.

Define Recipe: In the recipe, we define the target, predictors, and list of transformations.

# define the recipe
hcv_recipe <-
  # which consists of the formula (outcome ~ predictors)
  recipe(Category ~ Age + Sex + ALB + ALP + ALT + AST + BIL + CHE + CHOL + CREA + GGT + PROT,
         data = df) %>%
  # and some pre-processing steps
  step_normalize(all_numeric()) %>%
  step_knnimpute(all_predictors())

Define Model Specifications: Here we define two model specifications: a multinomial regression model with the nnet engine and a decision tree with the rpart engine.

glm_spec <-
  multinom_reg(mode = "classification") %>%
  set_engine("nnet")

rpart_spec <-
  decision_tree(mode = "classification") %>%
  set_engine("rpart")

Defining Workflow: We will create a workflow and add a recipe along with a model.

hcv_wf <- workflow() %>%
add_recipe(hcv_recipe) %>%
add_model(rpart_spec)

Final Fit and Model Metrics: We will use our workflow to build the model and track the model metrics. The last_fit() function fits the workflow on the entire training set and evaluates it on the test set.

final_res <- last_fit(hcv_wf, split = hcv_split)
collect_metrics(final_res)

Output:
# A tibble: 2 x 4
  .metric  .estimator .estimate .config
  <chr>    <chr>          <dbl> <chr>
1 accuracy multiclass     0.909 Preprocessor1_Model1
2 roc_auc  hand_till      0.793 Preprocessor1_Model1

Note: The purpose of this blog is to understand pipelines in both Python and R, hence a lot of data exploration and feature engineering has been skipped. For example, the dataset is imbalanced:

df.groupby([df.Category]).size()/len(df)*100

Output:
0=Blood Donor             86.666667
0s=suspect Blood Donor     1.138211
1=Hepatitis                3.902439
2=Fibrosis                 3.414634
3=Cirrhosis                4.878049

The 0=Blood Donor class accounts for 86.6% of the target, and the remaining four classes make up the rest. The dataset will need to be balanced before the model-building stage with under-sampling or over-sampling.

Under Sampling:

from imblearn.under_sampling import RandomUnderSampler

undersamp = RandomUnderSampler(random_state=1)
X_under, y_under = undersamp.fit_resample(X_features, y_labels)
y_under.groupby([y_under]).size()/len(y_under)*100

Output:
Category
0=Blood Donor             20.0
0s=suspect Blood Donor    20.0
1=Hepatitis               20.0
2=Fibrosis                20.0
3=Cirrhosis               20.0

Over Sampling:

from imblearn.over_sampling import RandomOverSampler

oversamp = RandomOverSampler(random_state=1)
X_over, y_over = oversamp.fit_resample(X_features, y_labels)
y_over.groupby([y_over]).size()/len(y_over)*100

Output:
Category
0=Blood Donor             20.0
0s=suspect Blood Donor    20.0
1=Hepatitis               20.0
2=Fibrosis                20.0
3=Cirrhosis               20.0

Once the above transformations are finalized, the same pipeline that we defined earlier can be reused for model building, as shown in the sketch below.
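For example, the over-sampled data can be pushed through the very same preprocessing and model objects. A minimal sketch, assuming the X_over, y_over and preprocessor objects created above:

# Split the balanced data and reuse the existing preprocessing + model pipeline
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_over, y_over, test_size=0.3, random_state=42)

pipe_balanced = Pipeline(steps=[('preprocessor', preprocessor),
                                ('classifier', RandomForestClassifier())])
pipe_balanced.fit(X_train_b, y_train_b)
print("Balanced model score: %.3f" % pipe_balanced.score(X_test_b, y_test_b))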

Closing Note:

We started with an unstructured version and then moved to a more structured pipeline version in both Python and R. Code in which every logical step is a separate entity is easier to manage and debug, and we saw how these entities were stitched together in pipelines and workflows in a defined sequence, making every step seamless.

Hope this blog was helpful and you enjoyed reading it. Happy learning!

You can connect with me on LinkedIn.

You can find the code on GitHub.

References:

https://www.tidymodels.org/

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

https://www.tmwr.org
