Software Engineering Practices Applied to Machine Learning



Today’s Tuesday thoughts come from this great article on Software Engineering Practices Applied to Machine Learning by Pavlos Mitsoulis Ntompos and Stefano Bonetti. This is a great in-depth read. Happy reading!


Are you planning to productionise a new machine learning powered system? Or maybe you’re looking to step up your system by scrapping that naïve heuristic logic with a shiny new machine learning model?

While you are allowed to start fantasising about the weirdest, most powerful ML models that will pave your way to victory, you cannot ignore the fact that these models will still be part of a software system. No matter how smart your code is, it will still need to be deployed and maintained.

Design patterns, automated testing, production monitoring and alerting are only a few of the common traits of high quality, enterprise-grade software systems. Too often in machine learning these traits are neglected, leading to considerable issues concerning maintainability and extensibility. This post is going to touch on three aspects of rolling out an enterprise-grade system, and how they can be applied to ML-powered ones.

To make this more concrete, let’s set up an example.

At HomeAway, we onboard partners (i.e. property owners) who want to list their properties on our website. For marketing purposes, we want to predict our partners’ LTV. Ever heard of LTV? Nope, it’s not a TV channel. This is how Wikipedia defines it:

In marketing, […] life-time value (LTV) is a prediction of the net profit attributed to the entire future relationship with a customer. The prediction model can have varying levels of sophistication and accuracy, ranging from a crude heuristic to the use of complex predictive analytics techniques.

In our case, LTV is defined practically as “how much money a partner will make us in its first year”. Because of our business’ seasonality, this is a good indicator of a total LTV.

Accurately predicting LTV is important for our digital marketing strategy, to understand how well our marketing campaigns are performing. We cannot wait for a year or so to get a reading on how much money we made out of a newly listed property, so we need to get our best prediction and use that instead.

So we want to build a service that predicts our LTV.

Design and development

At HomeAway we believe in evolutionary architecture. This means that we like to start with targeting the simplest possible solution, implement it end-to-end and deploy it to production.

BEWARE: this means making sacrifices in terms of the quantity of the solution, not in terms of its quality.

Our first cut will be fully tested and monitored. We just want to strip down the requirements to the bare minimum to be able to go to production as soon as possible. This will enable us to get valuable feedback from our clients first to drive the following iterations.

If we consider our LTV predictor as an example, the question we need to ask ourselves is: what can we do to deliver an LTV as soon as possible? One possibility is to serve static values first. Easy. For example, we could start with a rule-based solution that takes the new listing, checks out its location, and looks up its LTV from a static table.

So our requirements have shrunk considerably, and they involve no ML.

Rule-based system

There are two advantages in architecting the system this way:

  1. we are forced to think about a contract for our service now, even if we are in the very early stages of our product;
  2. we have separated the web service layer from the prediction layer, to allow them to evolve independently.
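
A minimal sketch of this separation, assuming a hypothetical `LtvPredictor` contract, a hypothetical `Listing` payload and a made-up static table (none of these names or values come from the HomeAway system):

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class Listing:
    """Hypothetical request payload: only the fields the first cut needs."""
    listing_id: str
    country_code: str


class LtvPredictor(Protocol):
    """The contract the web service layer depends on."""

    def predict(self, listing: Listing) -> float: ...


class RuleBasedPredictor:
    """First cut: look up a static LTV by location, no ML involved."""

    def __init__(self, ltv_by_country: Dict[str, float], default_ltv: float) -> None:
        self._ltv_by_country = ltv_by_country
        self._default_ltv = default_ltv

    def predict(self, listing: Listing) -> float:
        # Unknown locations fall back to a default value.
        return self._ltv_by_country.get(listing.country_code, self._default_ltv)


def handle_request(predictor: LtvPredictor, listing: Listing) -> dict:
    """Web layer: knows the contract only, not how the prediction is made."""
    return {"listing_id": listing.listing_id, "ltv": predictor.predict(listing)}


# Illustrative values only.
predictor = RuleBasedPredictor({"ES": 1200.0, "FR": 1500.0}, default_ltv=1000.0)
response = handle_request(predictor, Listing("abc-123", "FR"))
```

Under these assumptions, moving to an ML-backed predictor later would only require another class with the same `predict` method; `handle_request` would stay untouched.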

Once we have deployed our first cut to production, we might start thinking about making it smarter. This is obviously where machine learning comes into the picture.

ML-powered system

It’s important to note that both prediction and training services need features to operate. However, they need them in quite different ways. Prediction needs current features in real time, while training needs historical features in batch mode.

One solution to this problem is to get features from different sources depending on the need; however, we found this carries a risk: data inconsistency. In large organisations data representation can vary quite wildly, and HomeAway is no exception. Whether it is countries represented with their full names or their ISO codes, or different flavours of property types, we have faced this issue many times.

Data inconsistency in features

This can become a tricky problem to manage, as you need an extra layer of code that converts and encodes all these features to a unified format. The risk here is that every single team (and therefore every single application) will find its own way of representing data, leading to a massive duplication of effort.

So an ideal solution would be to encapsulate this functionality in a feature service, which is able to provide a unified view of features for both training and prediction phases. This would make sure we have the feature engineering code in one place, tested, and well modularised — so if any other system needs features, it can get them from here.
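
As a rough illustration of keeping the encoding logic in one place, the sketch below uses a single `build_features` function for both a live request and a batch of historical records. The mappings, field names and function names are assumptions made for the example, not the actual feature service:

```python
from typing import Dict, List

# Hypothetical mappings; a real feature service would cover far more cases.
COUNTRY_NAME_TO_ISO: Dict[str, str] = {
    "spain": "ES",
    "france": "FR",
    "united kingdom": "GB",
}
PROPERTY_TYPE_ALIASES: Dict[str, str] = {
    "apt": "apartment",
    "flat": "apartment",
    "condo": "apartment",
}


def normalise_country(raw: str) -> str:
    """Return a two-letter country code from either a code or a full name."""
    value = raw.strip()
    if len(value) == 2 and value.isalpha():
        return value.upper()
    try:
        return COUNTRY_NAME_TO_ISO[value.lower()]
    except KeyError:
        raise ValueError(f"Unknown country: {raw!r}") from None


def normalise_property_type(raw: str) -> str:
    """Collapse known aliases onto one canonical property type."""
    value = raw.strip().lower()
    return PROPERTY_TYPE_ALIASES.get(value, value)


def build_features(record: dict) -> dict:
    """Single encoding path, used for a live request and for a batch row."""
    return {
        "country": normalise_country(record["country"]),
        "property_type": normalise_property_type(record["property_type"]),
        "bedrooms": int(record["bedrooms"]),
    }


def build_features_batch(records: List[dict]) -> List[dict]:
    """Training path: the same function applied to historical records."""
    return [build_features(record) for record in records]
```

Because both paths go through `build_features`, a record that says "Spain" and another that says "es" end up with the same encoded country at training and at prediction time.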

However, especially when the data gets big, it can become an issue to serve the same data in real time and in batch mode efficiently. Event-driven feature engineering can help to overcome this challenge. It consists of sourcing domain events from the core of your platform, processing them in a universal feature engineering module, and then storing the resulting features in a variety of forms to fit all needs. A fast, Cassandra-like store would be useful at prediction time, while a distributed storage system like S3 would be useful at training time.
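
The flow described above might be sketched as follows. The two store classes are in-memory stand-ins for a low-latency store and for bulk storage, and the event schema is hypothetical; this is an illustration of the idea, not a description of the real pipeline:

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OnlineStore:
    """Stand-in for a low-latency key-value store read at prediction time."""
    rows: Dict[str, dict] = field(default_factory=dict)

    def put(self, key: str, features: dict) -> None:
        self.rows[key] = features  # the latest features for a key win

    def get(self, key: str) -> dict:
        return self.rows[key]


@dataclass
class OfflineStore:
    """Stand-in for append-only bulk storage read at training time."""
    lines: List[str] = field(default_factory=list)

    def append(self, features: dict) -> None:
        self.lines.append(json.dumps(features, sort_keys=True))


def engineer_features(event: dict) -> dict:
    """Universal feature engineering step (hypothetical event schema)."""
    return {
        "listing_id": event["listing_id"],
        "bedrooms": int(event["bedrooms"]),
        "country": event["country"].upper(),
        "event_time": event["timestamp"],
    }


def on_event(event: dict, online: OnlineStore, offline: OfflineStore) -> None:
    """Process one domain event and write the result to both stores."""
    features = engineer_features(event)
    online.put(features["listing_id"], features)
    offline.append(features)


online, offline = OnlineStore(), OfflineStore()
on_event({"listing_id": "l1", "bedrooms": "2", "country": "es",
          "timestamp": "2018-01-01T00:00:00Z"}, online, offline)
on_event({"listing_id": "l1", "bedrooms": "3", "country": "es",
          "timestamp": "2018-02-01T00:00:00Z"}, online, offline)
```

After the two events, the online store holds only the current features for listing `l1`, while the offline store keeps both historical rows for training.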

Event-driven feature engineering pipeline


Continuous integration and continuous delivery are widely used practices in software engineering. A variety of tools and platforms are available to provide CI/CD capabilities with different flavours. However, all CI/CD pipelines share the same broad steps:

  • all code changes are committed to some flavour of version control system, such as Git;
  • a CI/CD tool detects the new change, runs automated tests and builds all relevant artifacts;
  • if tests pass, the artifacts are deployed to all relevant environments and finally, pending verification, to production.

CI/CD brings great advantages to software systems. Its ultimate objective is to boost confidence in delivering new features without introducing bugs and regressions, thereby increasing the delivery pace. Testing is one of the cornerstones of a healthy CI/CD pipeline: the fewer the tests, the lower the confidence in released features.

In ML-powered systems particular care has to be paid when setting up a CI/CD pipeline. Testing feature engineering code is more akin to classic unit testing — and therefore more straightforward to achieve. On the other hand, testing ML models might prove trickier. However, a machine learning model is still software, and like all software, it needs to be tested.


By testing ML models, we are asking the following question: “is my new model better than the baseline?”. Our simple approach to answer this question goes as follows:

  1. select a hold-out dataset to be used for model tests. This dataset will have to be reasonably small to avoid slowing down the CI/CD pipeline too much;
  2. calculate an accuracy metric for both the new and the baseline model by making predictions against the hold-out dataset;
  3. assert that the new model is more accurate than the baseline.

At HomeAway we are currently using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as accuracy metrics in our LTV predictor’s model tests. Both are error measures, so a more accurate model is one with lower values:

def test_ml_models(hold_out_dataset):
    # Mean Absolute Error and Root Mean Square Error for the baseline model
    baseline_mae, baseline_rmse = _error_for_baseline(hold_out_dataset)
    # Mean Absolute Error and Root Mean Square Error for the new model
    new_model_mae, new_model_rmse = _error_for_new(hold_out_dataset)
    assert new_model_mae < baseline_mae
    assert new_model_rmse < baseline_rmse
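
The helpers called in the test above are not shown in the article. A minimal sketch of what such a helper might compute, assuming the hold-out dataset is a sequence of (features, actual LTV) pairs and a model is any callable that returns a prediction (both are assumptions for illustration, as are the toy models and data):

```python
import math
from typing import Callable, Sequence, Tuple

HoldOut = Sequence[Tuple[dict, float]]


def mae_and_rmse(predict: Callable[[dict], float], hold_out: HoldOut) -> Tuple[float, float]:
    """Return (MAE, RMSE) of `predict` over the hold-out examples."""
    if not hold_out:
        raise ValueError("hold-out dataset must not be empty")
    errors = [predict(features) - actual for features, actual in hold_out]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


def baseline(features: dict) -> float:
    """Toy baseline: always predicts the same static value."""
    return 200.0


def candidate(features: dict) -> float:
    """Toy stand-in for a trained model."""
    return 90.0 * features["bedrooms"]


# Toy hold-out data, purely illustrative.
hold_out = [({"bedrooms": 1}, 100.0), ({"bedrooms": 2}, 200.0), ({"bedrooms": 3}, 300.0)]

baseline_mae, baseline_rmse = mae_and_rmse(baseline, hold_out)
new_mae, new_rmse = mae_and_rmse(candidate, hold_out)
```

RMSE penalises large individual errors more heavily than MAE, which is one reason to assert on both rather than on a single metric.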


Once your system is in production, you have to keep it monitored. There might still be issues that haven’t been caught by your tests: memory leaks, network outages, or breaking changes from upstream or downstream systems. We want to be the first to know if any of these things happen.

Effective monitoring goes hand in hand with alerting. Anything out of the ordinary should result in an alert to the relevant team. Alerts should be actionable, meaning it should be clear what to do when an alert is received.

By definition, the results produced by a machine learning powered system are going to be less deterministic in comparison with a classic software system. For this reason, it can be harder to meaningfully monitor predictions.

One approach, simple but effective, is to track the max, mean and min predicted values that we observe in production. Going back to our LTV predictor example, this would mean tracking the predicted LTV values. Alerts can be set up for when the LTV is negative or exceeds a specific LTV max threshold.
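
A minimal in-process sketch of this idea is shown below. The `PredictionMonitor` class and the threshold value are hypothetical; in a real deployment these statistics would more likely be sent to a metrics and alerting system than held in memory:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PredictionMonitor:
    """Tracks min, mean and max of predictions and flags out-of-range values."""
    max_threshold: float
    count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def record(self, ltv: float) -> List[str]:
        """Update running statistics and return alert messages for this value."""
        self.count += 1
        self.total += ltv
        self.minimum = ltv if self.minimum is None else min(self.minimum, ltv)
        self.maximum = ltv if self.maximum is None else max(self.maximum, ltv)

        alerts = []
        if ltv < 0:
            alerts.append(f"negative LTV predicted: {ltv}")
        if ltv > self.max_threshold:
            alerts.append(f"LTV {ltv} exceeds max threshold {self.max_threshold}")
        return alerts

    @property
    def mean(self) -> Optional[float]:
        """Mean of all recorded predictions, or None if nothing was recorded."""
        return self.total / self.count if self.count else None


monitor = PredictionMonitor(max_threshold=5000.0)  # illustrative threshold
for predicted_ltv in (1000.0, -5.0, 6000.0):
    for message in monitor.record(predicted_ltv):
        print(message)
```

Each alert message names the offending value and the rule it broke, which helps keep the alert actionable.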

Monitoring machine learning predictions

One scenario that could lead to a predicted LTV value higher than the max threshold is a change in feature distribution. For example, let’s assume the HomeAway website was changed to allow properties to be on-boarded with a higher number of bedrooms. Such a change would propagate through our feature engineering logic and then to our model. Monitoring the distribution of the features can help detect these misalignments as soon as possible, potentially avoiding inaccurate predictions.
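
One simple way to check for such a change, sketched here under stated assumptions, is to compare recent production values of a feature with the values seen at training time. The two checks and their thresholds below are illustrative choices, not the method used at HomeAway:

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift_in_stds(training: Sequence[float], recent: Sequence[float]) -> float:
    """Distance between recent and training means, in training standard deviations."""
    spread = pstdev(training)
    if spread == 0:
        raise ValueError("training values have zero variance")
    return abs(mean(recent) - mean(training)) / spread


def share_outside_training_range(training: Sequence[float], recent: Sequence[float]) -> float:
    """Fraction of recent values below the training min or above the training max."""
    low, high = min(training), max(training)
    outside = sum(1 for value in recent if value < low or value > high)
    return outside / len(recent)


def feature_has_drifted(
    training: Sequence[float],
    recent: Sequence[float],
    max_shift: float = 1.0,      # assumed threshold, tune per feature
    max_outside: float = 0.05,   # assumed threshold, tune per feature
) -> bool:
    """Flag a feature when either check exceeds its threshold."""
    return (
        mean_shift_in_stds(training, recent) > max_shift
        or share_outside_training_range(training, recent) > max_outside
    )


# Toy bedroom counts: training data topped out at 5 bedrooms.
training_bedrooms = [1, 2, 2, 3, 3, 3, 4, 4, 5]
recent_bedrooms = [3, 4, 8, 10, 12]
drifted = feature_has_drifted(training_bedrooms, recent_bedrooms)
```

In the bedrooms example, listings with 8, 10 or 12 bedrooms fall outside anything the model saw during training, so the check flags the feature before the inflated predictions have to be noticed downstream.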

Monitoring features distribution


  1. build your solution incrementally, start simple with a rule-based approach and go to production with it. This will give you a first baseline to compare against while evolving to machine learning. If you don’t, you might spend a lot of time designing a model without knowing if it’s production ready;
  2. in an ML-powered system, be sure you test how your model compares with your previous iteration, and make this test part of your automated pipeline. If you don’t, you will have zero confidence in what you’re deploying;
  3. set up monitoring and alerting. These are crucial to catch issues as soon as possible. If you don’t, you’ll never know if your system is really working or not.

For more insights on productionising ML-powered systems, make sure you check out this awesome paper from Google.

As a final side note, we really think it’s important for data scientists to gain a production mindset, because ML models lose a lot of their potential if they are not production-ready. Conversely, software engineers should be exposed to machine learning. In particular, gaining visibility into feature engineering is a very feasible first step that we encourage. There is much to be gained from having interdisciplinary teams of engineers and data scientists, to reap what’s best from both worlds.

This article was written by Pavlos Mitsoulis Ntompos and Stefano Bonetti and originally posted on Medium