Practical Tips on How to Introduce Agile Principles into Machine Learning (ML) Pipelines

Unless you have been living under a rock, you certainly must have heard of ML. If you are a software developer, you know for a fact that machine learning is a hot trend right now.

As developers, it is important to use the tried-and-tested agile methodology in machine learning, so that as new data and research methods come to the fore, we can react accordingly and build highly reliable models.

In this article, we discuss 3 ways of introducing agile principles into the ML pipeline. We also discuss ways to reduce risk and help ML engineers work seamlessly with researchers.

Before building an ML solution, you want to make sure that you are taking the right step. Let’s start with that.

Before Building an ML-Based Product…

The following principles help reduce risk when building an ML-based product. Remember that ML is not the product you sell. For example, users are interested in buying self-driving cars, not the ML algorithms powering them.

Don’t Build ML for the Sake of It

There is a lot of hype around ML, both from tech companies and the media. Let’s look at a few examples.

According to Manish Singhal, “products that don’t use AI or ML will die a natural death.” As a business owner, if you come across Singhal’s statement, you might find yourself thinking of introducing ML to your business as soon as possible.

An article published by Accenture states that AI “will be a key point of distinction for your business versus competitors, and so must be added to business leaders’ strategic agenda.”

And that’s not all…

According to a 2016 article by Narrative Science, 62% of companies expected to be using ML in their product by 2018. Whichever way you look at it, 62% is a massive number showing commercial interest in ML.

Additionally, Python rose to the top of the 2018 IEEE Spectrum programming language rankings. In part, this is because of its ML libraries and capabilities.

With such hype, it’s advisable to first think things through and determine whether you actually need an ML-based solution. This way, you’ll avoid getting into a bubble where companies build ML solutions for the sake of it. Remember the dot-com bubble of 2000 and how it burst?

Do You Need ML in Your MVP to Test Product-Market Fit?

You don’t want to spend a fortune hiring machine learning engineers and setting up infrastructure if your product won’t sell. If you already have a way to collect/synthesize data, wait till your product shows signs of market success before investing in ML engineers and infrastructure.

Is Your ML Solution Mission Critical?

Self-driving cars are mission critical: the algorithms have to be dependable before you can even approach customers. Mission-critical applications could cost people their lives and/or fortunes if things go wrong.

On the other hand, a service that recommends tags for photos on a platform like Instagram is not mission critical. The user can ignore the suggested tags or use some or all of them. Whether or not they use the recommended tags does not put their own or other people’s lives at risk.

That brings us to the next important point…

Use Exceptional Code Quality Standards in ML Solutions

ML is simply a branch of software engineering. You have to apply the same quality standards normally used when developing other software products. These are things like clean code, testing and modularity.

You may have a team of researchers who lack an engineering mindset. Even if you have engineers doing the research, the fact remains that your team will be doing a lot of open-ended research.

Now think about it…

With research, an ML engineer may wake up at night with an idea. Rather than start with the coding best practices like writing tests, they simply start to hack something out to see if it works.

The thing is that you need to approach all your research and products with an engineering mindset from the start. In an ML pipeline, only a very small percentage of the code is for prediction and training.

There will be many other activities in the periphery such as data collection, data clean up, testing and monitoring.

Components that are used over and over again in experiments should be tested. When several experiments are running, this saves time because nothing has to be rebuilt from scratch, and it gives you confidence that every project meets the same minimum quality bar.
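As a minimal sketch of what a tested, reusable component could look like, here is a small scaling helper with pytest-style tests. The function name min_max_scale and its behavior for empty or constant inputs are assumptions made for illustration, not part of any particular library.

```python
def min_max_scale(values):
    """Scale a list of numbers into the range [0, 1].

    Hypothetical shared preprocessing helper. An empty list stays empty,
    and a constant list maps to all zeros to avoid dividing by zero.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# pytest-style tests: any experiment that imports the helper can rely on them.
def test_scales_into_unit_range():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_constant_input_maps_to_zeros():
    assert min_max_scale([3, 3, 3]) == [0.0, 0.0, 0.0]


def test_empty_input_stays_empty():
    assert min_max_scale([]) == []
```

Once a helper like this lives in a shared module with its tests, each new experiment can import it instead of re-implementing (and re-debugging) the same step.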

Avoid Machine Learning Anti-Patterns

Eliminate Dead Experimental Code

Knight Capital lost more than $440 million in about 45 minutes in 2012 after a deployment accidentally reactivated obsolete code that was still sitting in its trading system. The resulting flood of erroneous orders nearly bankrupted the company.

If you want to avoid a similar or worse fate, get rid of dead code in your ML codebase.

Configuration Debt

A lot of time goes to writing configuration for machine learning systems. It’s important to ensure that configuration code is manageable, testable and easily changeable.
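One way to keep configuration manageable and testable is to treat it as typed, validated code. The sketch below uses a Python dataclass; the class name TrainingConfig, its fields, and the validation limits are hypothetical and would differ from project to project.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training configuration, validated on creation."""
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 10

    def __post_init__(self):
        # Fail fast on bad values instead of discovering them mid-training.
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError("learning_rate must be between 0 and 1")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if self.epochs <= 0:
            raise ValueError("epochs must be positive")


# A shared baseline, plus an experiment that changes exactly one value.
baseline = TrainingConfig()
experiment = replace(baseline, learning_rate=0.001)
```

Because the config is immutable and replace() builds a new validated object, every experiment states explicitly how it differs from the baseline, and invalid settings raise an error before any training starts.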

Glue Code and Pipeline Jungles

This concerns how you build your ML pipeline. Over time, the pipeline is likely to become overly complex. If your system is characterized by pipeline jungles and glue code, it indicates that you have integration problems.
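A rough sketch of one way to keep a pipeline from turning into a jungle: express it as an explicit, ordered list of small functions that can each be tested alone. The step names and the sample data below are made up for illustration.

```python
def strip_and_drop_empty(rows):
    """Trim whitespace and discard rows that end up empty."""
    return [r.strip() for r in rows if r.strip()]


def lowercase(rows):
    """Normalize casing so duplicates can be detected."""
    return [r.lower() for r in rows]


def deduplicate(rows):
    """Remove repeated rows while keeping the original order."""
    seen = set()
    unique = []
    for r in rows:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique


def run_pipeline(rows, steps):
    """Apply each step in order; the output of one step feeds the next."""
    for step in steps:
        rows = step(rows)
    return rows


# The whole pipeline is visible in one place instead of being buried in glue code.
CLEANING_STEPS = [strip_and_drop_empty, lowercase, deduplicate]
cleaned = run_pipeline(["Cat", "", "cat", "Dog "], CLEANING_STEPS)
print(cleaned)  # ['cat', 'dog']
```

Adding, removing, or reordering a step is a one-line change to CLEANING_STEPS, which makes it easier for researchers and engineers to share the same pipeline.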

Pipeline jungles and glue code are also a sign that your research roles are too far removed from your engineering roles. To deal with this, make sure that your researchers approach their work with an engineering mindset. This brings us to our last, but very important, section:

Adopt Agile Thinking In ML

Have Research Sprints, Just Like You Have Coding Sprints

Have 2-week development sprints tied together with research sprints. At the beginning of the sprint, the researchers should have a hypothesis that they want to prove.

They should spend the sprint trying to prove or disprove their hypotheses. The goal is to find out whether each hypothesis is worth pursuing or should be discarded.

Now:

One advantage of tying the research and development in a sprint is that you get closer and better communication between the researchers and developers.

Also, at the end of the sprint, the teams get to present some cool demos, even if things didn’t work out. It also helps the team define checkpoints so that they kill off unproductive research as early as possible.

Build > Measure > Learn

In his New York Times bestseller The Lean Startup, Eric Ries describes the feedback loop at the heart of lean startups: Build, Measure, Learn. The same loop applies to choosing ML tools. Take TensorFlow by Google, a very popular tool in the ML ecosystem.

However, should you use TensorFlow in your ML project? Not necessarily. Depending on the project, some companies may be perfectly comfortable using simple rule-based solutions, such as regular expressions.

Other companies may use traditional off-the-shelf libraries. For example, Stanford has a very good natural language processing (NLP) library that’s written in Java. Apache also has an NLP library called OpenNLP. With some configuration and training, these can perform really well in some tasks.

If the above two categories of tools don’t suffice, you can turn to deep learning and sophisticated ML frameworks like TensorFlow. Whether or not you use a certain tool or approach should depend on what you measure and learn.
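To make the build, measure, learn loop concrete, here is a minimal sketch of a rule-based baseline built on a regular expression and measured on a handful of labelled examples. The task (spotting refund requests), the pattern, and the example messages are all invented for illustration.

```python
import re

# Hypothetical rule: a message is a refund request if it mentions
# "refund" or "money back".
REFUND_PATTERN = re.compile(r"\b(refund|money back)\b", re.IGNORECASE)


def rule_based_predict(text):
    """Build: the simplest thing that could possibly work."""
    return bool(REFUND_PATTERN.search(text))


def accuracy(predict, labelled_examples):
    """Measure: the fraction of examples the predictor gets right."""
    correct = sum(1 for text, label in labelled_examples if predict(text) == label)
    return correct / len(labelled_examples)


# Made-up labelled messages; a real project would use held-out data.
EXAMPLES = [
    ("I want a refund for my order", True),
    ("Please send my money back", True),
    ("Where is my parcel?", False),
    ("Can I get reimbursed?", True),
]

baseline_accuracy = accuracy(rule_based_predict, EXAMPLES)
print(baseline_accuracy)  # 0.75
```

The learn step is deciding what to do with that number: if the baseline is good enough for the product, you can stop here. If it misses too many cases (such as "reimbursed" above), that is evidence for trying an off-the-shelf library or, eventually, a deep learning model.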

Conclusion

In this article, we have covered three ways of introducing agile principles into your ML pipelines: use exceptional code quality standards, avoid ML anti-patterns, and adopt agile thinking in ML.

What do you think about it?