Data Science - DevOps Part 2

Several tools exist to solve the problems mentioned in the first part of this article and make our lives easier; some are open source and others are proprietary. Each one brings solutions to different problems (and often to the same ones).

In the following, we will present two open source solutions (DVC and MLflow), accompanied by Git, which, combined, solve most of the versioning problems that we data scientists encounter on a daily basis.

We will use Git to manage versions of our code and configuration files, DVC to manage data, pipelines and models, and MLflow for deployment and for comparing metrics across different experiments.

 

Git

Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows (Wikipedia).

 

Our developer friends know the beast well: it is their favorite tool, sometimes the only one they use, to keep the history of changes to their source code. So it’s no surprise that it is the tool we’re going to use to manage the versioning of our code, especially since the other two tools we’re going to use get along perfectly with Git.

 

Git is also ideal because most code repository managers on the web support it (e.g. GitHub, GitLab, Bitbucket…), so we can easily host our project on remote servers without having to manage the maintenance ourselves, and therefore sleep well at night.

 

In this post, we will not dwell on how Git works, because many articles and tutorials explaining the basics and ingredients of Git already exist, to name a few: the official documentation and Pro Git 2.

 

Why not limit ourselves to Git alone, why complicate our lives? Hmm, actually, we’re not going to try to change the role of Git, but rather add new bricks to this already well-stocked building, to house the parts that are intrinsic to the world of data science (i.e. data and models).

 

Git was not made to handle large files; everyone who has tried has had, sooner or later, to bite the dust, especially when uploading these files to a remote server. Since we are going to juggle very large volumes of data, this concerns us directly, and the same goes for models.

 

There are, however, solutions created specifically to manage large files with Git, such as Git LFS and git-annex, but these were not designed for our whimsical world, which is why we will use other solutions tailored to coexist with the magicians that we are :).

 

DVC (Data Science Version Control)

DVC is an open source versioning tool for data science. It is agnostic to machine learning libraries (e.g. scikit-learn, TensorFlow…) and to programming languages; in other words, anything that can be executed with a command line in a terminal can be used with DVC. It sits on top of Git, focusing only on managing the different versions of data and models, as well as their links with the source code (three birds with one stone :)), all in such an elegant way (even David Beckham would be envious), leaving Git the task of versioning the source code and configuration files (e.g. those created by DVC).

Fig.2 : dvc Workflow (source : dvc)

Figure 2 shows the general workflow of versioning with DVC.

At this point you must surely think this is science fiction, so we will see in detail how DVC achieves this magic trick. Before going further, I must confess something: DVC looks a lot like Git, in its functionality but especially in the commands it uses (iPhone and Huawei? I’ll let you guess who is who).

Installation of dvc

DVC can be installed through pip install dvc, or using apt-get (Linux) and Homebrew (macOS); for further details, please check https://dvc.org/doc/get-started/install

 

1-Data management

 

DVC, like its compatriot Git (do we have the right to use that word?), uses commands to track data. There are mainly two commands to tell DVC to manage the versioning of a file or folder:

 

  • dvc add folder/file: this command manually gives DVC control over the input folder or file, while asking Git to ignore it by adding the file/folder to .gitignore.
  • dvc run command: executes a script (command) with inputs and outputs; after executing the command, DVC takes control of the input files (code and data) and the output files (data and/or models).
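To make the idea concrete, here is a toy sketch (in Python, not DVC's real code) of what taking control of a file boils down to conceptually: hash the file, store a copy in a cache folder under a path derived from the hash, and tell Git to ignore the original. The function name `toy_dvc_add` and the exact layout are illustrative assumptions.

```python
# Toy sketch of what `dvc add` does conceptually (NOT DVC's actual code):
# hash the file, copy it into the cache, and add it to .gitignore.
import hashlib
import os
import shutil

def toy_dvc_add(path, cache_dir=".dvc/cache"):
    # 1. Hash the file content (DVC uses md5).
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    # 2. Copy it into the cache; the first two hex chars form a subfolder.
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(path, dest)
    # 3. Ask Git to ignore the original file.
    with open(".gitignore", "a") as gi:
        gi.write(path + "\n")
    return md5
```

Every distinct version of the file thus ends up as a separate, hash-named entry in the cache, which is exactly what makes versioning possible.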

 

But what is this control we have been talking about all along?

 

I’m glad you asked; I’m already salivating, it’s so elegant and finely played by DVC (calm your joy). Okay, let’s get to the heart of the matter.

 

You remember when I said that DVC was almost a copy of Git? Well, it’s time to justify that “accusation”.

To manage the different versions of data and models, DVC creates a .dvc folder at the root of the project, just like the .git folder created by Git, and as if that were not enough, the command to initialize DVC in a project is… dvc init.

More seriously, all this is done with the objective of making DVC easy to learn and use: most people already know Git, so it would be pointless to tire them out with new commands to memorize, because, unlike models, our memory capacities are ephemeral (fortunately).

 

Where things differ is in the content of the .git and .dvc folders; yes, DVC’s creators still have plenty of their own material to sell. In the .dvc folder, we find a cache folder (ignored by Git) containing the different versions of our data and models (i.e. all the folders/files tracked by DVC).

It is from this folder that DVC recovers the requested data, when possible (the data can also be saved on a remote server; patience, patience, we will get there later).

When DVC takes control of a folder/file, it creates a file with the extension .dvc ({folder|file}.dvc), where it stores all the information necessary to find the file in the cache folder, such as the md5 hash of the file.

In addition to that, it automatically adds the controlled file to .gitignore, to tell its friend Git to let it be responsible for tracking the file in question.

The question now is how DVC links the files in the cache folder with the files in our workspace (the folder of our project). After all, it is in this folder that we work, so our files must be available there. This is where the magic happens: for each file tracked by DVC, a link is made between the workspace file and the file placed in the cache folder (with the right version, of course).

From this, two questions arise:

How does DVC know the correct version of the files to link with the file in our workspace?

 

To answer this question, we will analyze the content of the .dvc file created after executing dvc add data1.xml:

data1.xml.dvc

md5: c864bbaa763461692f917d7390334bee
outs:
- cache: true
  md5: 51efaecfd3502c661ffa70b7f2b753b4
  metric: false
  path: data1.xml
wdir: .

  • The first md5 is the hash of the .dvc file that was created.
  • Then come the outputs (outs), which indicate whether a given output should be cached, the hash of the file concerned (here data1.xml, indicated by the path attribute), and the working directory (wdir) giving the relative path.

From this information, DVC is able to know whether the file in the cache is the same as the one in the workspace, but also to manage the versioning of this file: if we change it and add it again with dvc add, the corresponding .dvc file will be modified, and there will then be two versions in the cache.

One can recover the old .dvc file through Git, and DVC will then have the information necessary to retrieve the correct version of the file containing the data.
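Continuing the toy sketch from earlier, restoring a version then boils down to reading the md5 recorded in the .dvc file, rebuilding the cache path, and materializing the blob in the workspace. Again, `toy_checkout` and the two-character cache layout are illustrative assumptions, not DVC's actual implementation.

```python
# Toy sketch of a checkout: given the md5 recorded in a .dvc file,
# rebuild the cache path and copy the blob back into the workspace.
import os
import shutil

def toy_checkout(md5, workspace_path, cache_dir=".dvc/cache"):
    blob = os.path.join(cache_dir, md5[:2], md5[2:])  # cache layout: ab/cdef...
    shutil.copy(blob, workspace_path)  # real DVC prefers links over copies

def demo():
    # Fake a cached version first, as if `dvc add` had stored it earlier.
    md5 = "51efaecfd3502c661ffa70b7f2b753b4"
    os.makedirs(os.path.join(".dvc/cache", md5[:2]), exist_ok=True)
    with open(os.path.join(".dvc/cache", md5[:2], md5[2:]), "w") as f:
        f.write("version 1 of data1.xml")
    toy_checkout(md5, "data1.xml")
    return open("data1.xml").read()
```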

What is a link, and how does DVC use it?

A link = A “shortcut”.

Hummm, you’re not helping here. 

Okay: on operating systems such as Linux, a link allows you to redirect a file to another one, and this link can be made in different ways.

DVC uses 4 types of links, depending on the user’s operating system, in the following order of preference (unless the user explicitly defines a desired link type):

  • A reflink: a new type of link supported by recent file systems (such as APFS on newer versions of macOS), which manages links at the level of blocks instead of whole files; this avoids duplicating the entire content of a file when it is modified and you still want to keep both versions, since only the affected block is duplicated.

For example, take a file that occupies 10 blocks; create a reflink, then modify the file, and suppose that only block 5 has been modified: we will now have 11 blocks in total. Only one block has been recreated, and the two files still share the other nine blocks (efficiency, they say).

  • A hard link: in this type of link, a new file is created pointing to the same inode as the original file, so only one copy of the content is kept; modifying one therefore affects the other, which means data can be lost accidentally.
  • A symbolic link: similar to a hard link, except that the new file, instead of pointing to the same inode, points to the original file’s path; the dependency on the original is therefore stronger, because if you delete the original file, the link is no longer valid and an error will be raised.
  • The good old Ctrl+C/Ctrl+V: this is DVC’s last resort; here, as you may suspect, the file is simply copied to the cache folder. This results in slower data recovery from the cache to the workspace, and also occupies much more disk space.
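The difference between hard links, symlinks and copies is easy to see with Python's standard library (reflinks need file-system support and have no portable Python API, so they are left out of this sketch; the file names are made up for the demo):

```python
# Demonstrating hard links, symbolic links and plain copies with the
# standard library. Assumes a Unix-like OS.
import os
import shutil

with open("original.txt", "w") as f:
    f.write("hello")

os.link("original.txt", "hard.txt")      # hard link: same inode, same bytes
os.symlink("original.txt", "sym.txt")    # symlink: points at the path
shutil.copy("original.txt", "copy.txt")  # plain copy: independent bytes

same_inode = os.stat("original.txt").st_ino == os.stat("hard.txt").st_ino

# Delete the original: the hard link keeps the content alive,
# while the symlink becomes dangling.
os.remove("original.txt")
hard_still_readable = open("hard.txt").read() == "hello"
sym_broken = not os.path.exists("sym.txt")  # exists() follows the dangling link
```

This is precisely the trade-off described above: a hard link survives the deletion of the original (shared inode), while a symbolic link breaks, and a copy is safe but costs disk space.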

It is therefore not surprising that DVC uses this order of preference to manage the links between the files in the cache and those in our workspace.

To bounce back once more on the similarities between DVC and Git (yes, when there are no more, there are still some), let’s talk about the other commands that allow us to version data; we will also explore how DVC communicates with remote servers (e.g. S3):

  • dvc commit ~ git commit
  • dvc push ~ git push
  • dvc pull ~ git pull
  • dvc checkout ~ git checkout

Witness the perfect symmetry with Git!

Let’s take a look under the hood to see what these commands are trying to hide under this Git cover.

dvc push and dvc pull work almost like their Git counterparts, except for the remote servers with which they interact: DVC communicates with file storage servers (surprise of the century), often in the cloud, such as S3 or Google Cloud Storage, but also HDFS or a dedicated server.

The configuration for this communication lives in the .dvc/config file. This file can be modified by hand or with the command dvc config [option to be modified].

Example of the content of the config file :

['remote "myremote"']
url = /tmp/dvc-storage

[core]
remote = myremote

We can see that we indicate to DVC the name of the remote repository to use to replicate the data in the cache. The specified path here is just a local folder; we could just as well indicate a cloud location such as s3://…
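With a local-directory remote like the /tmp/dvc-storage example above, a push is conceptually nothing more than copying over the cache blobs the remote does not have yet. Here is a toy sketch of that idea (`toy_push` is a made-up name; real DVC of course also supports S3, GCS, HDFS and other backends):

```python
# Toy sketch of `dvc push` with a local-directory remote: copy any
# cache blob the remote does not have yet. NOT DVC's actual code.
import os
import shutil

def toy_push(cache_dir, remote_dir):
    pushed = []
    for sub in sorted(os.listdir(cache_dir)):           # e.g. "ab"
        os.makedirs(os.path.join(remote_dir, sub), exist_ok=True)
        for name in sorted(os.listdir(os.path.join(cache_dir, sub))):
            src = os.path.join(cache_dir, sub, name)
            dst = os.path.join(remote_dir, sub, name)
            if not os.path.exists(dst):                 # only upload missing blobs
                shutil.copy(src, dst)
                pushed.append(sub + name)
    return pushed
```

Because blobs are addressed by content hash, pushing the same cache twice transfers nothing the second time, which is why dvc push is cheap once a version has been uploaded.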

The dvc checkout command (illustrated in Figure 3) goes into the cache folder to recover the files specified in the .dvc files, which are created by the commands that give DVC control of a file.


Fig.3 : brotherhood between code and data (source : dvc)

An important notion must be explained here: .dvc files are always versioned by Git, and this is what allows DVC’s pull and checkout commands to walk through these files to place the right versions of the files (data and models) in the current workspace.

To learn more about these commands, I encourage you to read the dvc documentation (it’s the best documentation I’ve ever read).

2-Pipeline management

In any data science project, there are several steps to execute, very often sequential and repetitive, from data acquisition and feature creation all the way to obtaining the model. Figure 4 shows a schematic view of this pipeline.

Fig.4 : reproducibility (source : dvc)

Keeping a trace of these steps is essential for the reproducibility of experiments, but also for sharing experiments between team members: for example, each team member can focus on one step using the outputs of the other steps, and DVC will automatically manage the dependencies and sequencing of the steps. Of course, as always in this field, magic is just the tip of the iceberg; to accomplish this trick, DVC relies on its run command, which assumes that each step is executed with the dvc run command mentioned above.

So, dvc run, how does it do it?

First, let’s analyze the syntax of the command:

dvc run -f Dvcfile  \

-d train.py -d data -d matrix.pkl \

-o model.pkl  \

python train.py matrix.pkl model.pkl

Each time the dvc run command is executed, it receives not only the command to be executed (here python train.py …), but also the dependencies with the -d option and the outputs with the -o option. This allows DVC to take control of these files (dependencies and outputs), but also to record them in the resulting .dvc file, so that it can reproduce the step or steps with just the dvc repro command (enjoy the inner beauty of these commands).

dvc run allows the dynamic creation of pipelines: with each execution, you can give the outputs of a previous step (executed with dvc run) as dependencies of another step, and so on. DVC automatically tracks these dependencies between steps, which allows it to implicitly build pipelines 🙂

The -f option indicates the name of the file that will be created to save the information about the executed step; if this option is not provided, DVC automatically creates a file named after the first specified output, with the extension .dvc appended.

P.S.: the dependencies and outputs provided to the dvc run command are not directly passed to the command given as input to dvc run; these parameters are mainly there to allow DVC to take control of the files used or created by the executed command (script). In other words, the script’s execution does not take these parameters into account; only DVC is concerned.

N.B.: it is recommended to also give the outputs and inputs as arguments of the command to be executed; this ensures that there is no mismatch with the dependencies and outputs indicated in the dvc run command, because otherwise you may fail to take control of some files due to a misspelling.

Content of the Dvcfile :

cmd: python train.py matrix.pkl model.pkl
deps:
- md5: abe54bd55fd3643ad05705c2a852bec4
  path: train.py
- md5: 2d06b4e4890dd35fcac069d3d91e977f
  path: matrix.pkl
md5: ce99c88bc6102b52d39347ac8292709b
outs:
- cache: true
  md5: 443c2129e5398d14c4c81ca7472cdb41
  metric: false
  path: model.pkl
wdir: .

We can see that DVC saves the command used to execute the step in question, as well as the dependencies and outputs with their md5 hashes, in order to track changes to these files.

 

dvc repro Dvcfile

To reproduce this step, just type the command above and DVC does the rest, taking into account the steps well before it, such as the step used to create the features in matrix.pkl.

The icing on the cake is that the dvc repro command analyzes the changes made to the dependencies, and if there has been no change, no computation is performed. When the .dvc file depends on the output of another step, the repro command rewinds to recreate the complete pipeline and executes only the steps whose dependencies have been modified since the last execution (computation only when needed).
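The staleness check underlying this behavior can be sketched in a few lines: re-hash each dependency listed in the stage file and compare against the stored hash. This is a conceptual simulation only (`needs_rerun` is a made-up name), but it is the same idea as comparing the md5 fields of the Dvcfile above:

```python
# Toy sketch of how `dvc repro` decides whether to re-run a step:
# re-hash each dependency and compare with the hash recorded in the
# stage (.dvc) file; recompute only when something changed.
import hashlib

def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def needs_rerun(stage):
    """stage: dict with a 'deps' list of {'path': ..., 'md5': ...} entries."""
    return any(file_md5(d["path"]) != d["md5"] for d in stage["deps"])
```

If no dependency hash has drifted, the step is skipped entirely; applied recursively up the chain of .dvc files, this gives the incremental pipeline execution described above.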

Displaying a pipeline with dvc pipeline show

dvc pipeline show

 

3-Model management

Fig.5 : saving and extracting models (source : dvc)

The management of models (Figure 5) is identical to that of data: a .dvc file (managed by Git) is associated with a given model to manage its different versions, and the model itself is managed by DVC (saved in the cache).

For good management of models and of the dependencies between source code and data, it is recommended (“mandatory”, even) to create a new branch for each experiment, to facilitate reproducibility across the different experiments.

Unfortunately, DVC does not handle the deployment aspect, nor does it provide a UI to compare the different models built, at least not at the moment; so let’s make way for MLflow to handle this part.

Let’s make some room for MLflow… in the next article.
