Sprint 3 — The Docs
This is a continuation of a series that started here. The article is based on my own experience as an ML PM; yours may be different.
During sprint 2, I wrote about the product team from the perspective of a software company that uses ML to implement specific features. I hope that I managed to convey the following:
The development team writes code to create a maintainable, optimized software product using a process to ensure an acceptable level of quality. These objectives are different from yours.
What does this mean? Well, most likely (in my experience) your model design code will not be in the product. Not that there is anything wrong with Python scripts and Jupyter notebooks. But when you are experimenting, your coding goals are different from the Devs’ goals. If you are handing over your work to be operationalized by another team, you’d better document it before it leaves the safety and security of your IDE.
If you are new to product, which is likely if you are reading this, then you may be accustomed to sharing your scripts with other data scientists or sharing the model inferences via a nice report or database. In these scenarios, commenting your scripts was enough for an expert user. In product, however, models are often handed over to the Dev team as part of a feature. Most Dev teams know code but are not experts in your field. So handing them only your code is akin to saying, “I am too lazy to explain what I have done… you go figure it out”. Seriously.
Case in point. As a PM working with a DS team, I have asked people to document their work with the same template that I use myself to create specifications. What I wanted was a description of what needed to be built. What I got was an explanation of how to run their code and a view into all the workarounds required by their lack of tools and of knowledge of data engineering and software best practices. In all fairness, my job is to bridge the knowledge gap between DS and Dev, so I expect to assist. My mistake was not appreciating the different points of view when requesting the documentation.
With these gaps between DS and Dev in mind, here are some tips for documenting your work as a data scientist within a product team.
Write docs for yourself.
- Identify what you tried while working on your solution. Even failures are sources of information.
- Describe why you approached the problem the way you did, what methods were tried, what data was used. It will help you when assessing the next course of action.
- Keep your code neat and your variable names self-explanatory, and remove dead code. You may have to put your project away and not come back to it for several months. Will you know what you were thinking when you wrote it?
- Get into the habit of documenting as you go. Writing for 30 minutes per week is easy when your memory is fresh; doing it all months later is not.
Write docs for your DS peers.
- Share your work: tips and tricks as well as the latest algorithms. Being transparent within the team helps everyone do better.
- Consider agreeing on how to share, such as a common document format and code repos.
- Make functions out of code that will be useful again and test them, even if just once.
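To make the last tip concrete, here is a minimal sketch of what "a function plus one test" can look like. The helper `min_max_scale` is hypothetical and chosen only for illustration; substitute whatever snippet your team keeps copying between notebooks.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1].

    Hypothetical helper, used here only to illustrate the tip.
    A constant input has no spread, so it maps to all zeros.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_min_max_scale() -> None:
    # One quick test documents the expected behavior for your peers.
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


test_min_max_scale()
```

The docstring and the test together tell a peer what the function is for and how it behaves on an edge case, without them having to read your notebook.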
Write docs for the dev team.
- Identify the steps that are executed to get to the solution. Your implementation (the ‘how’) is often only useful for the lines of code that train, validate and test the model. The rest of your experimentation is more about what is calculated, retrieved or manipulated, and data engineers will do it differently.
- Clearly state the purpose of the model, the input data, the features derived from the data, the model training algorithm and parameters, packages used, output format, validation methods, testing methods and baseline performance. Dev will often need to create a training pipeline, so be explicit. This is information that you will have in your own notes anyway.
- Identify when the model will fail. The downstream consumer must get an output even when the model can’t process the case, so consider what to replace the inference with.
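One way to capture the last two tips is a structured record that travels with the model, plus a thin wrapper that returns a documented fallback when the model cannot score a case. This is only a sketch under my own assumptions: `ModelSpec`, its field names, `predict_with_fallback` and the toy model are all hypothetical, and the placeholder strings are there to be replaced with your own notes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ModelSpec:
    """Hand-off summary for Dev. All field names are illustrative."""
    purpose: str
    input_data: str
    features: List[str]
    algorithm: str
    parameters: Dict[str, Any]
    packages: List[str]
    output_format: str
    validation_method: str
    testing_method: str
    baseline_performance: str
    # What the downstream consumer receives when the model cannot score a case.
    fallback_output: Any = None


def predict_with_fallback(
    predict: Callable[[Dict[str, Optional[float]]], Any],
    row: Dict[str, Optional[float]],
    spec: ModelSpec,
) -> Dict[str, Any]:
    """Return a prediction, or the documented fallback if the row cannot be scored."""
    missing = [name for name in spec.features if row.get(name) is None]
    if missing:
        return {"output": spec.fallback_output, "used_fallback": True,
                "reason": "missing features: " + ", ".join(missing)}
    try:
        return {"output": predict(row), "used_fallback": False, "reason": ""}
    except Exception as exc:  # deliberately broad in this sketch
        return {"output": spec.fallback_output, "used_fallback": True,
                "reason": f"model error: {exc}"}


# Placeholder values only; fill these in from your own notes.
spec = ModelSpec(
    purpose="placeholder: what the model is for",
    input_data="placeholder: source tables or files",
    features=["x1", "x2"],
    algorithm="placeholder: training algorithm",
    parameters={},
    packages=[],
    output_format="placeholder: e.g. a score between 0 and 1",
    validation_method="placeholder",
    testing_method="placeholder",
    baseline_performance="placeholder: metric and value from your own report",
    fallback_output=0.0,
)


def toy_model(row: Dict[str, Optional[float]]) -> float:
    # Stand-in for a trained model, not a real one.
    return 1.0 if row["x1"] + row["x2"] > 1.0 else 0.0


print(predict_with_fallback(toy_model, {"x1": 0.7, "x2": 0.6}, spec))
print(predict_with_fallback(toy_model, {"x1": 0.7}, spec))
```

The second call has no `x2`, so the wrapper returns the fallback and states the reason. That is exactly the behavior Dev and QA will ask you to describe.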
Write docs for the QA.
- Store and version the data used for training, validation and testing.
- Provide population statistics of the inputs as they were when the model was trained, so that QA can use them to design test cases and PM can use them to design the monitoring system.
- Describe the model features in terms of their range (when one exists) and give example input data with the expected corresponding features.
- Give examples of when the model will fail and what the output will look like when that happens.
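As a rough sketch of the population statistics and ranges mentioned above, the snippet below summarizes one input column at training time and flags a new value that falls outside what the model has seen. It uses only the Python standard library. The function names, the `age` column and the toy rows are hypothetical, made up for illustration.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def population_stats(rows: List[Row], feature: str) -> Dict[str, float]:
    """Summarize one input column as it looked when the model was trained."""
    values = [r.get(feature) for r in rows]
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError(f"no non-missing values for {feature!r}")
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        # Sample standard deviation needs at least two points.
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


def out_of_range(value: Optional[float], stats: Dict[str, float]) -> bool:
    """True if the value is missing or outside the range seen in training."""
    return value is None or not (stats["min"] <= value <= stats["max"])


# Toy training rows, invented for this example.
training_rows: List[Row] = [{"age": 20.0}, {"age": 30.0}, {"age": None}, {"age": 40.0}]
age_stats = population_stats(training_rows, "age")
print(age_stats)
print(out_of_range(55.0, age_stats))
```

QA can turn the minimum, maximum and missing count into boundary test cases, and the same summary gives PM a baseline for monitoring drift in production inputs.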
At the end of this process you will generate documents that are useful within the DS team and also to the rest of the product team. As an ML PM, I put together the final specification for Dev, QA and the other PMs in the team. The items listed in these tips are the items that I most often go back to the data scientist for.
If you work for a company where the DS team is the most resource constrained, you will have another project waiting for you. As an ML product manager, I am your partner in the integration of the model into the product. Good documentation makes the hand-off easy.