N.B. This article will be divided into three parts for ease of reading. The first part will introduce and present the problem, the second will cover Git and DVC and the last part will present MLflow and the conclusion.
As data scientists, we are constantly working on data science projects. Hmm, well, not all the time. Our work consists in transforming data into solid gold, or even diamonds when luck favours us. But before we reach the stage where we receive awards for our hard work, we have to search, go round in circles and dig into wild, dark, slippery and vast mines. Without the right tools, surviving in this world is an impossible mission (Tom Cruise, I see you).
There are so many holes to dig, so much sand to sift, so many corners to explore and, above all, to remember, so that we don’t have to reinvent the wheel and can learn from our past mistakes. All of that leads us to look for tips and tricks to better organize our excavations, this time taking into account not only the code but also the data and the models. After all, we are data scientists, not “miners”.
Let’s first see what we (data scientists) manipulate all day long (Figure 1):
- The data: the cornerstone, the raw material, the very essence of our existence (OK, go easy on it);
- The code: the car that carries the data, transforms it along the way and takes it to the promised land;
- The models: the long-awaited fruits of the promised land, which laymen regard either as black magic or as a mere rag; it all depends on the quality of the model, and therefore on us;
- The configuration: the subtle touches that help the code reach the promised land.
Fig. 1: The “four” components of data science (source: DVC)
The million-dollar question then arises:
How do we orchestrate these four components into a symphony worthy of Beethoven and, even better, travel through time with this symphony?
To answer this question, it might be more appropriate to take a short trip back in time to meet a man named Linus Torvalds. Well, maybe not so far back for historians, but far back for us computer scientists. For those who don’t know Linus, he is Linux in human form (if you don’t know Linux …). He is also the creator of Git, a source code management tool widely used by our software engineering friends, and by us too, which lets us travel through the history of our source code files.
Enough with history. Now we’re going to visit a parallel world (not that parallel; after all, the code binds us) to try to learn from its successes but also from its failures (yes, there are always bad experiences going on).
Software engineers need to version their work, which consists “only” of source code, so they only need to version their code, which makes things a little easier for them. As a result, they can use Git to get a symphony that works perfectly.
Guess who isn’t lucky enough to version their work that easily… Come on, try harder… Haaaa OK, US, the data scientists. Yes, we are the guardians of the three temples (i.e. data, code and models), and as the proverb says, “With great power comes great responsibility”.
Let’s take the example of an IT company (Baamtu) with a team of data scientists, in this case Datamation (quite a cool name), which works on data science projects (what a scoop!). The members of this team necessarily work on the three main components of any data science project: they manipulate code, data and the models that result from combining the first two.
A new project for Wolof speech recognition has just been started, and the team had the good idea of using Git to version the project. Yes, it’s time to have fun and make our fellow citizens proud of their local language. After several iterations, the project had progressed well and our model was starting to be able to transcribe the audio. But after the last iteration, we realized that the previous model was better than the current one, so we obviously wanted to go back to the previous one.
Fortunately, Git was there to save the day…or not. Indeed, while using Git in our project, we chose to ignore the models because they were too large to be sent to the remote repository (Git is not very well suited for large files).
Total panic within the team. Calm down, guys: we still have the code that was used to produce the old model, so one git checkout of that branch and the job is done (YIPPEE?).
After a tremendous amount of time waiting for the training to end, the model we obtained did not have the expected performance. Actually, the data had also been modified along the way by transformation scripts. Guess what else we didn’t version? The data from the intermediate transformations, of course, for the same reason that the models weren’t versioned.
The pressure is increasing (sweat all over the room). The deadline is approaching and we’re definitely not going to deploy the new model, which is not as good as the old one.
“Since no one is expected to accomplish the impossible” (not even the magicians of data), we recovered the original data set without any transformation and went through all the steps of the model’s creation again, from cleaning the data set to building the final model (this assumes, of course, that someone in the team can remember exactly all the steps we went through, and in which order).
I’ll let you imagine the computation time it took to redo all those processing steps. And also: bye-bye to the deadline and to our credibility within the company.
All that because we relied on Git versioning alone: no link had been made between the data, the code and the models produced.
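To make that missing link a bit more concrete, here is a minimal sketch of what “linking” could look like: a small manifest that records, for one version of the code, the fingerprint of the data and of the model that go with it. It is only an illustration, not how any particular tool works; the function names (file_digest, build_manifest, save_manifest) and the manifest layout are assumptions made up for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so large data sets and models fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_version, data_path, model_path):
    """Tie one code version to the exact data and model that go with it."""
    return {
        "code": code_version,              # hypothetical: e.g. a Git commit hash
        "data": file_digest(data_path),    # fingerprint of the data set
        "model": file_digest(model_path),  # fingerprint of the trained model
    }


def save_manifest(manifest, manifest_path):
    """Write the manifest as JSON; the file is tiny, so Git can version it."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

The large files themselves stay out of Git; only the small manifest is committed, so checking out an old commit at least tells us exactly which data and which model belonged to it.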
After such an experience, needless to say that reliving it is not an option. So it’s time to take a break and see the different solutions that exist to solve this versioning problem in the world of data science.
We can also mention other versioning-related problems in data science, such as:
- sharing data between data scientists;
- deploying models;
- avoiding re-running a step in the creation of a model when the code and data for that step have not changed;
- creating static and dynamic pipelines;
- easily comparing metrics from different iterations.
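The idea of not redoing a step whose code and data have not changed can be sketched in a few lines: fingerprint the inputs of the step and reuse the previous result when the fingerprint is the same. This is a toy, in-memory illustration under our own assumptions (the names fingerprint and run_step and the dictionary used as a cache are hypothetical), not the implementation of any of the tools presented later.

```python
import hashlib


def fingerprint(*parts):
    """Hash the inputs of a step (code text, data, ...) into a single key."""
    digest = hashlib.sha256()
    for part in parts:
        data = part if isinstance(part, bytes) else str(part).encode("utf-8")
        # Prefix each part with its length so ("ab", "c") differs from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def run_step(name, code, data, func, cache):
    """Run func(data) only if the step's code or data changed since the last run.

    Returns a pair (result, was_rerun).
    """
    key = fingerprint(code, data)
    previous = cache.get(name)
    if previous is not None and previous["key"] == key:
        return previous["result"], False  # nothing changed: reuse the old result
    result = func(data)
    cache[name] = {"key": key, "result": result}
    return result, True
```

Calling run_step twice with the same code and data runs func only once; changing either one triggers a new run. A real tool would persist the cache on disk and hash files rather than in-memory values.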
Several tools exist in order to try and solve the problems mentioned above and make our lives easier, some are open source and others are proprietary. Each one brings solutions to different problems (often the same).
In the following, we will present two open source solutions (DVC and MLflow), accompanied by Git, which, combined, solve most of the versioning problems that we data scientists encounter on a daily basis.
We will use Git to manage versions of our code and configuration files, DVC to manage data, pipelines and models, and MLflow for deployment and for comparing metrics from different experiments.