Exploring MLOps Solutions: DVC Studio vs MLflow

An In-depth Comparison of Features, Capabilities, and Integration of DVC Studio and MLflow


Introduction

The machine learning lifecycle encompasses various stages, each with its unique data setup, ETL process, ML models, and configurable parameters. Additionally, multiple teams and individuals contribute to research and experimentation. Consequently, there arises a need for change tracking and timestamping to facilitate effective monitoring.

In web development, where multiple developers make code changes and integrate them into a larger project, version control tools and DevOps have simplified the process. So, how can we leverage the benefits of DevOps and apply them to machine learning projects?

Enter MLOps, a discipline that combines the principles of DevOps with the nuances of data science. In MLOps, we not only track code changes but also monitor data, models, parameters, model metrics, plots, and much more. This blog delves into two prominent tracking platforms, DVC Studio and MLflow, that cater to these MLOps requirements.

Any Prerequisites?

To understand MLOps & DVC, you will need a very basic knowledge of model building in Python and a GitHub account. I will be using Visual Studio Code as the editor; you can use any editor you are comfortable with.

We will be using Kaggle's South Africa Heart Disease dataset. Our objective is to predict chd, i.e., coronary heart disease (yes = 1 or no = 0), with a simple binary classification model. Our project folder structure is shown in the setup section below; as we process the files in the later sections, each of these folders/files will be updated.

Although not necessary, feel free to read the blogs listed below for background on MLOps and DVC.

Bring DevOps To Data Science With MLOps

MLOps | Tracking ML Experiments With Data Version Control

MLOps | Versioning Datasets with Git & DVC

DVC Studio

DVC Studio can be accessed via https://studio.iterative.ai/ and can be connected to various version control providers to import our ML projects for tracking with DVC. Its UI allows us to work with data and carry out experiments from the web application, without having to make code changes before running experiments.

  • It helps us to manage data and models, run and track experiments, and visualize and share results.
  • Enables us to track our code, experiments, and data all the time.

Setting Up Work Environment

Please refer to the blog — MLOps | Tracking ML Experiments With Data Version Control to set up the project.

DVC setup: If you don't have DVC installed, install it as below. For more information, refer to the DVC installation documentation.

pip install dvc

To start with, our project folder structure is as below. You can also clone the repository from GitHub.

........
+---data
|   +---processed
|   |       test_SAHeart.csv
|   |       train_SAHeart.csv
|   \---raw
|           SAHeart.csv
+---data_given
|       SAHeart.csv
+---report
+---saved_models
|       model.joblib
\---src
        get_data.py
        load_data.py
        split_data.py
        train_and_evaluate.py
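If you are setting the project up from scratch rather than cloning it, a minimal DVC initialization might look like the sketch below (the raw data path follows the structure above; adapt it to your own setup):

# initialize DVC inside an existing git repository
git init
dvc init
# let DVC track the raw dataset instead of git
dvc add data_given/SAHeart.csv
# commit the DVC metadata to git
git add data_given/SAHeart.csv.dvc data_given/.gitignore .dvc .dvcignore
git commit -m "initialize DVC and track raw data"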

Using DVC, we can run experiments and track them as well. Please read more on DVC experiments in the blog MLOps | Versioning Datasets with Git & DVC. Here is a sample DVC experiment and its output.
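For reference, running and inspecting experiments from the terminal typically looks like the commands below (assuming the pipeline stages are already defined in dvc.yaml):

# run the pipeline as an experiment and list the tracked results
dvc exp run
dvc exp show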

This is good, but it is not interactive, i.e., we need to write commands for any further investigation. Many common operations, such as comparing models, viewing plots, and any kind of CRUD operation, need a more sophisticated UI than the terminal, and DVC Studio provides exactly that. We will explore this in detail in the next sections.

DVC Studio: Here are the steps to follow to set up DVC Studio.

Step 1: Navigate to https://studio.iterative.ai and sign in with your GitHub, GitLab, or Bitbucket account. I have used GitHub in this case.

Step 2: Click on the Add a view button at the top right of the screen.

Step 3: We need to map our GitHub repo to the studio, which we can do by clicking on configure Git integrations settings.

Note: This is only done the first time we map the Git repo. Once mapped, we will be able to see our repo in the dropdown.

Step 4: Click on the Configure Git integrations settings link. This will open the Git integrations section of your profile page. Select the repository and provide access.

Step 5: Now that the repository has been mapped, it will be available for creating a view as below.

Step 6: Once the repository has been added, click on it to open the tracker.

Compare the models: Select the models and click on Compare

Run the experiments:

  • We can make code changes locally, run the model, check the metrics, and, if satisfied with the results, push the code to GitHub; DVC Studio automatically pulls the defined metrics and relevant data from GitHub (a sample dvc.yaml sketch follows this list).
  • Alternatively, we can run experiments directly from DVC Studio and push the code to GitHub.
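For DVC Studio to pull metrics automatically, the pipeline's dvc.yaml needs to declare them. Below is a minimal sketch; the stage name, dependencies, parameter keys, and the report/scores.json metrics file are assumptions based on the folder structure shown earlier, not the exact file from the referenced blogs:

stages:
  train_and_evaluate:
    cmd: python src/train_and_evaluate.py --config=params.yaml
    deps:
      - data/processed/train_SAHeart.csv
      - data/processed/test_SAHeart.csv
      - src/train_and_evaluate.py
    params:
      # keys below must exist in params.yaml
      - estimators.LogisticRegression.params.solver
    metrics:
      - report/scores.json:
          cache: false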

DVC Studio takes away all the hassle of tracking and monitoring the model and its corresponding metrics. Data scientists/analysts can focus more on feature engineering and model tuning rather than the operations side of tracking.

Let us take a look at MLflow.

MLflow

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It helps us manage the experiments and also to carry out a comparative analysis. It also enables data scientists to collaborate and share results with other teams more seamlessly.

Setting Up Work Environment

Let's install the mlflow library with the command below.

pip install mlflow

We will use the same dataset and build the same model as in the DVC section above. To keep things simple for this blog, the project structure for the MLflow case is as below.

Note: We could very well use the same project structure in this case as well.

MLFLOW
|   classification.py
|   config.yaml
+---Data
|       SAHeart.csv
\---mlruns

Load the libraries: Our focus is on the mlflow library; the rest are standard libraries used for splitting the data, building the model, and measuring model metrics.

......
import os
import yaml
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from urllib.parse import urlparse
......

Define config.yaml file: Let us define all the parameters in the config file. This would help us to keep the code clean, easy to maintain and experiment with various parameters.

data_directory: "./Data/"
data_name: "SAHeart.csv"
drop_columns: ["sbp"]
target_name: "chd"
test_size: 0.3
random_state: 123
solver: 'saga'
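To try a new experiment, we can simply change a value here and re-run the script; MLflow logs each run separately. For example, a hypothetical variation could be:

# hypothetical second experiment: different solver and split
solver: 'lbfgs'
test_size: 0.25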

Loading the config file: The below chunk of code loads config.yaml.

# folder to load config file
CONFIG_PATH = "./"

# Function to load yaml configuration file
def load_config(config_name):
    with open(os.path.join(CONFIG_PATH, config_name)) as file:
        config = yaml.safe_load(file)
    return config

config = load_config("config.yaml")

Model building: We will build a logistic regression model to predict chd.

if __name__ == "__main__":
    # load data
    df = pd.read_csv(os.path.join(config["data_directory"], config["data_name"]))
    df = df.dropna()
    df = pd.get_dummies(df, columns=['famhist'], drop_first=True)
    df.drop(config['drop_columns'], axis=1, inplace=True)

    # Split data into train and test
    train, test = train_test_split(df, test_size=config['test_size'], random_state=config['random_state'])
    train_x = train.drop(config['target_name'], axis=1)
    test_x = test.drop(config['target_name'], axis=1)
    train_y = train[config['target_name']]
    test_y = test[config['target_name']]

    with mlflow.start_run():
        model = LogisticRegression(solver=config['solver'], random_state=config['random_state']).fit(train_x, train_y)
        train_score = model.score(train_x, train_y) * 100
        print(train_score)
        test_score = model.score(test_x, test_y) * 100
        print(test_score)
        predicted_val = model.predict(test_x)
        roc_auc = roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])
        accuracy = accuracy_score(test_y, predicted_val)

        # Generating model metrics
        cm = confusion_matrix(test_y, predicted_val)
        precision = precision_score(test_y, predicted_val, labels=[1, 2], average='micro')
        recall = recall_score(test_y, predicted_val, average='micro')

        # Logging and tracking model metrics
        mlflow.log_param("Train Score", train_score)
        mlflow.log_param("Test Score", test_score)
        mlflow.log_metric("roc_auc", roc_auc)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)

        # Register the model only when a remote tracking server is configured
        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
        if tracking_url_type_store != "file":
            mlflow.sklearn.log_model(model, "model", registered_model_name="Logistic Regression")
        else:
            mlflow.sklearn.log_model(model, "model")
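The script can then be run from the project root; with no remote tracking server configured, MLflow writes the run data to the local ./mlruns folder by default.

# execute the training script; runs are recorded under ./mlruns
python classification.py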

Monitoring Model Metrics: Typically, the metrics are tracked as below, where we put the data in tabular form and compare.

Path                 Metric              Old        New        Change
report/scores.json   Logistic Accuracy   0.62069    0.65517    0.03448
report/scores.json   roc_auc             0.65093    0.72764    0.07671
report/scores.json   test_score          62.06897   65.51724   3.44828
report/scores.json   train_score         71.96532   74.27746   2.31214
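For context, a comparison table like the one above is what, for example, the dvc metrics diff command prints in the DVC workflow from the earlier sections, assuming the training script writes its scores to report/scores.json:

# compare metrics in the workspace against the last committed version
dvc metrics diff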

This is perfectly fine and serves our purpose, but imagine running a huge number of iterations while plotting and tracking metrics. As you would have guessed, it becomes cumbersome, and we tend to deviate from the modeling objective and start worrying about tracking metrics, plots, their format, etc.

It would be great to have a tool that does all the tracking and lets us focus on the model-building code. Let us explore this with the MLflow UI.

MLflow UI: Once the code in classification.py has run successfully, the model metrics can be tracked and viewed in the UI, launched with the command below.

mlflow ui

## Here is the output
INFO:waitress:Serving on http://127.0.0.1:5000

Navigate to the URL in the browser and we can see the below screen.

  • It lists the experiments we have run, along with the various metrics that were defined to be tracked.
  • It has a compare option similar to DVC Studio, whereby we can select models and compare their performance/metrics (runs can also be queried programmatically, as sketched below).
  • We can perform basic CRUD operations from the UI.
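Besides the UI, the logged runs can also be queried programmatically, which is handy for custom comparisons or reports. Below is a minimal sketch, assuming the default local ./mlruns store and the metric names logged above:

import mlflow

# fetch the runs of the active experiment as a pandas DataFrame
runs = mlflow.search_runs()

# keep the columns of interest and rank the runs by ROC AUC
cols = ["run_id", "metrics.roc_auc", "metrics.Precision", "metrics.Recall"]
print(runs[cols].sort_values("metrics.roc_auc", ascending=False).head())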

Conclusion

This blog aims to highlight the setup and features of two powerful tools, DVC Studio and MLflow, which simplify the MLOps process. Both platforms offer user-friendly interfaces that streamline the tracking of the various aspects typically involved in any ML project or experiment. By using these tools, you can monitor and manage the crucial elements of your experiments without the tediousness of manual tracking. We hope this article has provided valuable insights and wish you the best of luck with your project tracking endeavors!

You can connect with me — LinkedIn

You can find the code for reference — GitHub

References:

https://dvc.org/
https://www.mlflow.org/docs/latest/quickstart.html
