Building Machine Learning Powered Applications (O'Reilly) Book Review
Almost every book about machine learning is either for business people who know nothing about machine learning, or for developers who want to augment their machine learning skills. Building Machine Learning Powered Applications (buy on Amazon or read on O'Reilly Safari) from Emmanuel Ameisen fits in neither of these categories. This is a book about overseeing the building of a machine learning product. It's ideal for product managers, as well as software developers who know how to train a model but want to understand what comes next.
I feel the need to emphasize this: it is not a book where you'll be coding. There is code in the book, and early on it appears that you'll be building a project by following that code. You will not, so don't try to code along with the included examples. I lost time doing this.
Which is too bad, because it’s a great book that I wish I had finished sooner. Ameisen walks you through the creation of an ML-driven product, starting with creating a plan, then building a working pipeline, iterating on your models, and deploying and monitoring the application. Through this, he identifies four steps: identifying the right approach, building an initial prototype without ML, iterating on the models, and then deployment. If it sounds like building any other product, it is.
Identifying the right approach
Alright, let's get started with our machine learning powered application. Or, well, maybe not. Ameisen quickly points out that you should first be sure that ML will even be a part of your product. Sometimes ML cannot do what you want it to do: it may not yet be at the level you think it is, and the level your product would need. More commonly, ML isn't going to power your product because you need data upfront, and you just don't have the data. If you want to build a product that automatically drafts follow-up emails after a sales call, you are not going to find a dataset of follow-up emails along with quality ratings from the recipients. Of course, you shouldn't end your project simply because you don't have access to precisely the data you want. Ameisen points out that gathering data is a big part of creating an ML-driven product, and we "rarely… find the exact [data we want]."
You will be able to find other data, which falls into one of four levels:
- Labeled data that is a fit for what we want
- Weakly labeled data, or data that approximates what we want
- Unlabeled data
- No data
The more labeled the data is, and the closer it sits to our target use case, the more useful it will be.
Another reason we might not use ML in a product is that ML simply isn't necessary for it. Ameisen says to never use machine learning when a set of rules will suffice. This touches on the most useful thing I took from the entire book, via an interview with an ML professional: anything you are thinking of doing with ML, do it manually for an hour first. In doing so, you will discover whether ML is necessary at all. If it is, you will have a much better idea of the type of data you have, which features you can select, and which kinds of models you should explore.
The book also goes into the metrics you should be using to determine if the product is working. You need business, model, freshness, and speed metrics. The business goal is the most important one. If the business metrics aren’t looking good, no one cares how fast you can train the model. The problem, of course, is that you won’t have business metrics when doing the training, so you need model metrics that approximate the business ones.
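To make the four tiers concrete, here is a minimal sketch of how they might be tracked side by side; the metric names, values, and the 30-day freshness threshold are all invented for illustration, not taken from the book:

```python
from datetime import datetime, timedelta

# Hypothetical metrics for an ML-assisted writing product.
metrics = {
    "business":  {"suggestion_acceptance_rate": 0.31},   # what actually matters
    "model":     {"validation_precision": 0.87},         # proxy for the business metric
    "freshness": {"last_trained": datetime(2024, 1, 2)},
    "speed":     {"p95_inference_ms": 140},
}

# The model metric only matters insofar as it tracks the business metric;
# freshness and speed are guardrails around both.
is_stale = datetime.now() - metrics["freshness"]["last_trained"] > timedelta(days=30)
print("consider retraining" if is_stale else "model is fresh enough")
```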
Building and iterating
Ameisen recommends always starting off "lackluster by design." Use a heuristic that you are certain will not be your final stop, but that gives you a place to start. For example, to estimate the number of trees in satellite imagery, you might simply count the green pixels. The heuristic is meant to guide you to the correct approach, and it will help inform the datasets that you need. (Datasets, plural, because these will be iterative. Just as you start off with a lackluster prediction, you should start with a fairly lackluster dataset before you spend time gathering a large one that might not be what you really need.)
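As a sketch of what such a baseline might look like in code (assuming RGB satellite tiles loaded with Pillow; the green-dominance threshold and the pixels-per-tree ratio are made-up numbers, purely for illustration):

```python
import numpy as np
from PIL import Image

def green_pixel_tree_estimate(path: str, pixels_per_tree: int = 900) -> int:
    """Crude baseline: count green-dominant pixels and divide by an assumed
    pixels-per-tree ratio. Not a model -- just a heuristic to beat later."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.int16)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # A pixel counts as "green" if the green channel clearly dominates.
    green_mask = (g > r + 20) & (g > b + 20)
    return int(green_mask.sum()) // pixels_per_tree

# print(green_pixel_tree_estimate("tile_001.png"))
```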
Again, here Ameisen recommends starting manually. Label the data by hand, acting "as the model." By doing this, you will start to pick up patterns. Are negative tweets generally short and positive tweets generally long? That would tell you whether a feature representing tweet length is worth having.
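A quick check of that particular hunch might look like this (a sketch assuming a pandas DataFrame; the column names and example tweets are mine):

```python
import pandas as pd

# Hypothetical data: "text" holds the tweet, "sentiment" is 0 (negative) or 1 (positive).
tweets = pd.DataFrame({
    "text": ["love this!", "great, thanks so much for the quick and friendly help",
             "no", "worst service ever"],
    "sentiment": [1, 1, 0, 0],
})

# Candidate feature discovered by labeling data by hand: tweet length.
tweets["length"] = tweets["text"].str.len()

# If the classes separate on length, the feature is worth keeping.
print(tweets.groupby("sentiment")["length"].mean())
```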
Once you have the data, you'll need more of it. Ameisen provides some recommendations for deciding what to add, such as looking at the examples in your validation set where the model is least confident in its predictions and adding similar examples to the training set. Or, train your model on a very small percentage (e.g. 1%) of your data, then train a second model on an equal number of labeled and unlabeled examples, whose job is to predict whether an example came from the labeled set. If it predicts an example is not labeled, that example is different from your existing training set and will have the largest impact on future training runs. (Of course, even with these tips, you still need to select the validation set randomly.)
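A sketch of the first tip, picking the least confident validation examples (assuming scikit-learn and a classifier that exposes predict_proba; the data here is random, just to make the snippet runnable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data; in practice these come from your real training/validation sets.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_val = rng.normal(size=(50, 5))

model = LogisticRegression().fit(X_train, y_train)

# Confidence = probability of the predicted class; low confidence means the
# model is most unsure about that validation example.
confidence = model.predict_proba(X_val).max(axis=1)
least_confident = np.argsort(confidence)[:10]

# These are the kinds of examples worth collecting (and labeling) more of.
print(least_confident)
```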
Model training and evaluation
Choosing a model, even if your manual work and data exploration point you to a specific class of models, can be difficult. On Hugging Face alone, there are over seven and a half thousand models. Because of this, Ameisen recommends against choosing a lot of models and testing them against each other. That will take too much time, and different models have different assumptions and different inputs. Instead, he recommends starting as simple as possible. Your first model will probably not be the one you end with, so don't jump to the most complex. Additionally, you need a model that is easy to debug, especially early on. Complex, esoteric models are also unlikely to have a significant community behind them, which matters when you run into problems. Finally, any model you choose needs to be easily deployable, factoring in prediction latency, training time, and concurrency support.
Here Ameisen recommends creating a table that scores each candidate on ease of implementation (well understood, vetted implementation), understandability (easy to debug, easy to extract feature importance), and deployability (inference time, training time), each on a 1-to-5 scale (5 being best), then summing the scores:
| Model Name | Well Understood | Vetted Implementation | Easy to Debug | Easy to Extract Feature Importance | Inference Time | Training Time | Total Simplicity Score |
|---|---|---|---|---|---|---|---|
| Model One | 3/5 | 4/5 | 4/5 | 4/5 | 2/5 | 2/5 | 19 |
| Model Two | 2/5 | 1/5 | 3/5 | 3/5 | 5/5 | 3/5 | 17 |
When evaluating model quality, everyone knows to split off a subset of the data for a validation set. What people do less often, however, is to further split off data for a test set. The test set is data you evaluate on only after you have made all of your manual changes, such as adjusting the number of epochs or doing feature engineering. Keep 70% of your data for training, 20% for the validation set, and 10% for the test set.
You need this test set because the changes you make to the model and training process are themselves hyperparameters, and tuning them is a form of training in its own right. The test set represents what might come through from your customers once you go into production, and it will expose whether you have unknowingly overfitted to the validation set through your own process of human learning.
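A minimal sketch of that 70/20/10 split using two chained calls to scikit-learn's train_test_split (the arrays here are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # placeholder features
y = np.arange(1000) % 2             # placeholder labels

# First set aside 30%, to be split again into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Split that 30% into 20% validation and 10% test of the original data.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 200 100
```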
And this brings up the second most important point from the book: “The more pleasantly surprised you are by the performance of your model on your test data, the more likely you are to have an error in your pipeline.”
Here, Ameisen recommends the "top-k method" to debug results. Pick a number k that is small enough to inspect by hand; for a single-person project, this might be between 10 and 15. Then choose the k best performing, k worst performing, and k most uncertain predictions. Look at these predictions and visualize them. Were there features in the best performers that you can bring to the worst? Are the worst performers missing key features?
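A sketch of how you might pull those three groups out of a validation set for a binary classifier (the error measure and the choice of k are my own, for illustration):

```python
import numpy as np

def top_k_groups(y_true, proba_positive, k=10):
    """Indices of the k best, k worst, and k most uncertain predictions
    for a binary classifier that outputs P(class == 1)."""
    error = np.abs(y_true - proba_positive)                    # distance from the true label
    best = np.argsort(error)[:k]                               # smallest error
    worst = np.argsort(error)[-k:]                             # largest error
    uncertain = np.argsort(np.abs(proba_positive - 0.5))[:k]   # closest to 0.5
    return best, worst, uncertain

# Example with fake validation labels and predicted probabilities.
rng = np.random.default_rng(0)
y_val, proba = rng.integers(0, 2, 200), rng.random(200)
best, worst, uncertain = top_k_groups(y_val, proba)
```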
The most common errors come about in a way that unit tests won't capture: data arrives in a poor format for the model. An example from the book is a value you expect to be a number that is often a string or null. Other errors are harder to debug, such as data that falls outside the expected range, like an age of 150. Data problems are so common that an MIT study found roughly 3.4% of labels to be incorrect across widely used datasets such as MNIST and ImageNet.
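A sketch of the kind of lightweight check that catches these problems before the model ever sees the data (the column name and the 0–120 age bound are made up for illustration):

```python
import pandas as pd

def find_bad_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose 'age' column would silently break or mislead a model."""
    problems = pd.DataFrame(index=df.index)
    age = pd.to_numeric(df["age"], errors="coerce")   # strings and nulls become NaN
    problems["missing_or_not_numeric"] = age.isna()
    problems["out_of_range"] = ~age.between(0, 120)
    return df[problems.any(axis=1)]

rows = pd.DataFrame({"age": [34, "thirty", None, 150]})
print(find_bad_rows(rows))  # flags "thirty", None, and 150
```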
Deploying and monitoring
Creating a working model from good data is where most books end. Building Machine Learning Powered Applications, meanwhile, still has a third of the book remaining. In the final four chapters Ameisen covers final rounds of validation before deployment, how and where to deploy, building a robust production environment, continuous integration, systemizing retraining, and monitoring in production.
How you deploy is driven by a number of different factors, unique to your model and to your application. Is it important that your users get predictions right away? Then you’ll need run-time predictions. Or, can you share predictions at a set time each day? Then you’ll want to batch predictions. (Of course, if you have enough traffic, you can do a “stream-style” batching, where you batch up N predictions and run them together, but that needs enough traffic not to force users to wait. You probably don’t have that level of traffic.)
There's also the consideration of where to deploy the model. Deploying on a server is the easiest option, and probably the most common. But for sensitive data, or when prediction latency is a major consideration, you can also deploy on the client. This increases deployment complexity, since you are training the model elsewhere before sending it to the client, and it limits how powerful the model can be, but it is an option nonetheless. Client-side deployment can happen either on the device or in the browser, through libraries such as TensorFlow.js. Client-side predictions can also be combined with federated learning, where there's a base model and each client has its own fine-tuned copy. Think, for example, of keyboard predictions.
Ameisen also goes into how to build safeguards around your models. Jeremy Howard of fast.ai talks about how people often want to get to 100% computer-driven predictions, but most often it's good enough for the computer to do 90% of the work and a human to do the remaining 10%. This is an example of a "human in the loop" safeguard. Reviewing predictions before they reach the end user isn't always possible, however. In those cases, there are other approaches: look at the inputs, the outputs, and the model's confidence.
For inputs, check to see that all of the necessary features are present and the values are valid. If the inputs aren’t valid, the model should not run. Run a heuristic, perhaps, but don’t run the model. For outputs, again check that they are within valid ranges.
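A sketch of what such a guard might look like around a prediction call; the feature names, ranges, fallback value, and dummy model are all placeholders of my own, not from the book:

```python
def safe_predict(features: dict, model) -> float:
    """Validate inputs, fall back to a heuristic on bad data, and clamp the output."""
    age, income = features.get("age"), features.get("income")
    inputs_valid = (
        isinstance(age, (int, float)) and 0 <= age <= 120
        and isinstance(income, (int, float)) and income >= 0
    )
    if not inputs_valid:
        return 0.5  # don't run the model on bad data; use a dumb-but-safe heuristic

    prediction = float(model.predict([[age, income]])[0])
    return min(max(prediction, 0.0), 1.0)  # keep the output in the expected range

class DummyModel:
    """Stand-in for a trained model."""
    def predict(self, rows):
        return [0.7 for _ in rows]

print(safe_predict({"age": 34, "income": 52000}, DummyModel()))   # 0.7
print(safe_predict({"age": 150, "income": 52000}, DummyModel()))  # falls back to 0.5
```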
Finally, there is monitoring once the model is deployed. Monitoring matters for the same reason that monitoring overall product performance matters, but it is also necessary to catch feedback loops and feature drift, where the kinds of data or the values coming into the model differ from what you initially used to train it.
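One simple way to watch for feature drift is to compare the distribution of an incoming feature against its distribution at training time, for example with a two-sample Kolmogorov–Smirnov test (a sketch using SciPy; the feature, data, and alert threshold are placeholders):

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder data: the "age" feature at training time vs. in production.
rng = np.random.default_rng(0)
train_age = rng.normal(35, 10, 5000)
live_age = rng.normal(42, 10, 500)   # the incoming population has shifted

# Two-sample KS test: a small p-value suggests the distributions differ.
result = ks_2samp(train_age, live_age)
if result.pvalue < 0.01:
    print(f"Possible drift in 'age' (KS statistic {result.statistic:.3f})")
```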
Conclusion
Again, Ameisen's Building Machine Learning Powered Applications is a different kind of ML book. As such, it can reach a wide audience, and it serves that audience well. The software engineer who knows how to train models will learn best practices for deployment and monitoring, while the product manager will learn how to identify the right approach. The book occupies a useful place in the recently exploding landscape of machine learning education, and it is worth a read.