
Reducing Time to Value for Data Science Projects: Part 1
Introduction
The experimentation and development phase of a data science project is where data scientists are meant to shine. Trying out different data treatments, feature combinations, model choices and so on all factor into arriving at a final setup that forms the proposed solution to your business needs. The technical capability required to carry out these experiments and critically evaluate them is what data scientists were trained for. The business relies on data scientists to deliver solutions ready to be productionised as quickly as possible; the time taken to do so is known as the time to value.
Despite all this, I have found from personal experience that the experimentation phase can become a large time sink and can threaten to completely derail a project before it has barely begun. Over-reliance on Jupyter Notebooks, experiment parallelisation by manual effort and poor implementation of software best practices are just a few of the reasons why experimentation and the iteration of ideas end up taking significantly longer than they should, delaying the point at which a business starts to see value.
This article begins a series in which I introduce some principles that have helped me be more structured and focussed in my approach to running experiments. They have allowed me to streamline large-scale parallel experimentation, freeing up my time to focus on other areas such as liaising with stakeholders, working with data engineering to source new data feeds, or working on the next steps towards productionisation. This has reduced the time to value of my projects, ensuring I deliver to the business as quickly as possible.
We Need To Talk About Notebooks
Jupyter Notebooks, love them or hate them, are firmly entrenched in the mindset of every data scientist. Their ability to run code interactively, create visualisations and intersperse code with Markdown makes them an invaluable resource. When moving onto a new project or faced with a new dataset, the first steps are almost always to spin up a notebook, load in the data and start exploring.
Using a notebook in a clean and clear manner. Image created by author.
While they bring great value, I see notebooks misused and mistreated, forced to perform actions they are not suited to. Out-of-sync code block executions, functions defined within blocks, and credentials or API keys hardcoded as variables are just some of the bad behaviours that using a notebook can amplify.
Example of bad notebook habits. Image created by author.
In particular, leaving functions defined within notebooks comes with a host of problems. They cannot easily be tested to ensure they are correct and follow best practices. They can also only be used within the notebook itself, so the functionality cannot be shared anywhere else. Breaking free of this coding silo is critical to running experiments efficiently at scale.
Local vs Global Functionality
Some data scientists are aware of these bad habits and instead employ better practices when developing code, namely:
Develop within a notebook
Extract out functionality into a source directory
Import function for use within the notebook
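As a minimal sketch of this workflow (the file, directory and function names here are illustrative assumptions, not taken from the article's images), the extracted function might live in a small module inside the project's source directory and be imported back into the notebook:

```python
# src/cleaning.py -- extracted from the notebook so it can be imported, re-used and unit tested
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with exact duplicate rows removed."""
    return df.drop_duplicates().reset_index(drop=True)


# Back in the notebook, the function is simply imported and called:
# from src.cleaning import drop_duplicate_rows
# df = drop_duplicate_rows(df)
```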
This approach is a significant improvement over leaving functions defined within a notebook, but something is still lacking. Throughout your career you will work across multiple projects and write a lot of code. You may want to re-use code you have written in a previous project; I find this is quite commonplace, as there tends to be a lot of overlap between pieces of work.
What I usually see when code is shared is that it is copy+pasted wholesale from one repository to another. This creates a maintainability headache: if issues are found in one copy of these functions, significant effort is required to find all other existing copies and ensure the fixes are applied. A secondary problem arises when your function is too specific for the job at hand, so the copy+paste also requires small modifications to change its utility. This leads to multiple functions that share 90% identical code with only slight tweaks.
Similar functions bloat your script for little gain. Image created by author.
This philosophy of creating code at the moment it is required and then abstracting it into a local directory also creates a longevity problem. Scripts increasingly become bloated with functionality that has little to no cohesion or relation to the rest.
Storing all functionality into a single script is not sustainable. Image created by author.
Taking the time to think about how and where code should be stored sets you up for future success. Looking beyond your current project, start considering what can be done with your code now to make it future-proof. To this end I suggest creating an external repository to host any code you develop, with the aim of having deployable building blocks that can be chained together to efficiently answer business needs.
Focus On Building Components, Not Just Functionality
What do I mean by building blocks? Consider, for example, the task of carrying out various data preparation techniques before feeding the data into a model. You need to consider aspects like dealing with missing data, numerical scaling, categorical encoding, class balancing (if working on classification) and so on. If we focus on dealing with missing data, we have multiple methods available:
Remove records with missing data
Remove features with missing data (possibly above a certain threshold)
Simple imputation methods (e.g. zero, mean)
Advanced imputation methods (e.g. MICE)
If you are running experiments and want to try out all these methods, how do you go about it? Manually editing code blocks between experiments to switch out implementations is straightforward but becomes a management nightmare: how do you remember which code setup you had for each experiment if you are constantly overwriting it? A better approach is to write conditional statements to switch between them easily, but having this defined within the notebook still brings issues around re-usability. The implementation I recommend is to abstract all this functionality into a wrapper function with an argument that lets you choose which treatment to carry out. In this scenario no code needs to be changed between experiments, and your function is general and can be applied elsewhere.
Three methods of switching between different data treatments. Image created by author.
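As a rough sketch of the recommended wrapper approach (the function name, argument names and the exact set of treatments are illustrative assumptions, not the author's exact code), switching treatments becomes a single argument change:

```python
import pandas as pd


def handle_missing_data(df: pd.DataFrame, method: str = "drop_rows",
                        threshold: float = 0.5) -> pd.DataFrame:
    """Apply the chosen missing-data treatment and return a new DataFrame."""
    if method == "drop_rows":
        # Remove records containing any missing values
        return df.dropna()
    if method == "drop_features":
        # Remove features whose fraction of missing values exceeds the threshold
        return df.loc[:, df.isna().mean() <= threshold]
    if method == "impute_mean":
        # Simple imputation: fill numeric gaps with the column mean
        return df.fillna(df.mean(numeric_only=True))
    raise ValueError(f"Unknown missing-data method: {method}")


# Each experiment only changes the argument, never the code:
# clean_df = handle_missing_data(raw_df, method="impute_mean")
```

Advanced treatments such as MICE would slot in as further branches (or delegate to a dedicated imputer) without changing how the function is called.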
This process of abstracting implementation details will help to streamline your data science workflow. Instead of rebuilding similar functionality or copy+pasting pre-existing code, a code repository of generalised components lets you re-use them trivially. This can be done for many different steps of your data transformation process, which can then be chained together into a single cohesive pipeline:
Different data transformations can be added to create a cohesive pipeline. Image created by author.
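One simple way to chain such components (assuming each step is a DataFrame-in, DataFrame-out function like the hypothetical handle_missing_data sketched above; scale_numeric_features and encode_categoricals are placeholder names) is to apply a configured list of steps in order:

```python
import pandas as pd


def run_pipeline(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Apply each (function, kwargs) step in turn to produce a model-ready frame."""
    for func, kwargs in steps:
        df = func(df, **kwargs)
    return df


# The pipeline definition is just data, so it is easy to log, version and
# compare between experiments.
steps = [
    (handle_missing_data, {"method": "impute_mean"}),     # from the sketch above
    # (scale_numeric_features, {"method": "standard"}),   # further placeholder components
    # (encode_categoricals, {"method": "one_hot"}),       # would slot in the same way
]

# model_ready_df = run_pipeline(raw_df, steps)
```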
This can be extended not just to different data transformations, but to each step in the model creation process. The change in mindset from building functions to accomplish the task at hand to designing a re-usable, multi-purpose code asset is not an easy one. It requires more initial planning about implementation details and expected user interaction, and it is not as immediately useful as having code accessible within your project. The benefit is that you only need to write the functionality once, and it is then available across any project you work on.
Design Considerations
When structuring this external code repository for use there are many design decisions to think about. The final configuration will reflect your needs and requirements, but some considerations are:
Where will different components be stored in your repository?
How will functionality be stored within these components?
How will functionality be executed?
How will different functionality be configured when using the components?
This checklist is not meant to be exhaustive but serves as a starter for your journey in designing your repository.
One setup that has worked for me is the following:
Have a separate directory per component. Image created by author.
Have a class that contains all the functionality a component needs. Image created by author.
Have a single execution method that carries out the steps. Image created by author.
Note that which functionality the class carries out is controlled by a configuration file; this will be explored in a later article.
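A minimal sketch of what one such component might look like, with a plain dictionary standing in for the configuration file discussed later (the class name, directory layout and configuration keys are assumptions, not the author's exact setup):

```python
# components/missing_data/handler.py -- one directory per component, one class per component
import pandas as pd


class MissingDataHandler:
    """Data-quality component whose behaviour is driven entirely by configuration."""

    def __init__(self, config: dict):
        # e.g. config = {"method": "impute_mean", "threshold": 0.5}
        self.method = config.get("method", "drop_rows")
        self.threshold = config.get("threshold", 0.5)

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        """Single execution method: dispatch to the configured treatment."""
        if self.method == "drop_rows":
            return df.dropna()
        if self.method == "drop_features":
            return df.loc[:, df.isna().mean() <= self.threshold]
        if self.method == "impute_mean":
            return df.fillna(df.mean(numeric_only=True))
        raise ValueError(f"Unknown method: {self.method}")
```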
Accessing the methods from this repository is straightforward; you can:
Clone the contents, either to a separate repository or as a sub-repository of your project
Turn this centralised repository into an installable package
Easily import and call execution methods. Image created by author.
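If the centralised repository is packaged, use within a project reduces to an install and an import (the package, module and class names below are placeholders for whatever your repository exposes):

```python
# pip install ds-toolbox   <- hypothetical internal package built from the central repository
from ds_toolbox.data_quality import MissingDataHandler  # placeholder import path

import pandas as pd

raw_df = pd.read_csv("training_data.csv")  # example input
handler = MissingDataHandler(config={"method": "impute_mean"})
model_ready_df = handler.run(raw_df)
```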
A Centralised, Neutral Repository Allows More Powerful Tools To Be Built Collaboratively
Having a toolbox of common data science steps sounds like a good idea, but why the need for the separate repository? This has been partially answered above, where the idea of decoupling implementation details from business application encourages us to write more flexible code that can be redeployed in a variety of different scenarios.
Where I see real strength in this approach is when you consider not just yourself, but your teammates and colleagues within your organisation. Imagine the volume of code generated by all the data scientists at your company. How much of it do you think is truly unique to their projects? Certainly some of it, but not all. The volume of re-implemented code goes unnoticed, but it quickly adds up and becomes a silent drain on resources.

Now consider the alternative, where common data science tools live in a central location. Having functionality that covers steps like data quality, feature selection, hyperparameter tuning and so on, immediately available off the shelf, will greatly speed up the rate at which experimentation can begin.
Having everyone use the same code also opens up the opportunity to create more reliable and general-purpose tools. More users increase the probability that issues or bugs are detected, and deploying the code across multiple projects forces it to become more generalised. A single repository only requires one suite of tests, and care can be taken to ensure they are comprehensive with sufficient coverage.
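Because all projects share the same components, that single suite protects every one of them; a minimal pytest example against the hypothetical MissingDataHandler sketched earlier:

```python
import pandas as pd
from ds_toolbox.data_quality import MissingDataHandler  # placeholder import path


def test_impute_mean_fills_numeric_gaps():
    df = pd.DataFrame({"x": [1.0, None, 3.0]})
    handler = MissingDataHandler(config={"method": "impute_mean"})
    result = handler.run(df)
    # The missing value should be replaced by the column mean (2.0)
    assert result["x"].isna().sum() == 0
    assert result.loc[1, "x"] == 2.0
```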
As a user of such a tool, there may be cases where the functionality you require is not present in the codebase, or you have a particular technique you like to use that is not implemented. You could choose not to use the centralised code repository, but why not contribute to it instead? Working together as a team, or even as a whole company, to actively build up a centralised repository opens up a whole host of possibilities. By leveraging the strengths of each data scientist as they contribute the techniques they routinely use, we get an internal open-source model that fosters collaboration among colleagues, with the end goal of speeding up the data science experimentation process.
Conclusion
This article has kicked off a series in which I address common data science mistakes I have seen that greatly inhibit the experimentation phase of a project. The consequence is that the time taken to deliver value increases greatly, or in extreme cases no value is delivered at all because the project fails. Here I focussed on ways of writing and storing code so that it is modular and decoupled from any particular project. These components can be re-used across multiple projects, allowing solutions to be developed faster and with greater confidence in the results. Such a code repository can be opened up to all members of an organisation, allowing powerful, flexible and robust tools to be built collaboratively.