
Practical Machine Learning for Industry

Updated: Apr 4

Machine learning has become a staple of our daily lives. We have access to technologies such as ChatGPT [1], a chatbot that has revolutionized the way we access information; even before that, we saw Large Language Models being adopted to improve the results of traditional search engines [2], as well as the original problem that gave rise to the Transformer architecture: automatic machine translation [3]. We also have image-generation tools [4] that have gone as far as winning art competitions [5] and, most recently, tools that generate life-like video [6].

Nevertheless, there is a harsh reality check when trying to apply machine learning to our own projects, whether in industry or academia: the operational cost it carries [7], both for training and for inference. And even though the fancy APIs from OpenAI or Google might sound exciting, there are still many scenarios where more classical tools like Scikit-Learn [8], or the amazingly CPU-optimized library FastText [9], are more than enough to reach performance on specific tasks that is competitive with deep learning models that are far larger and costlier to train and maintain [10, 11].

In this article, I will share some of my experience, both as a researcher and as a freelance machine learning engineer and data scientist, on why a classical machine learning model sometimes makes the most sense in the long run.



"Machine Learning Vision" - CC0 Public Domain

Performance vs. Maintainability


The cost of reaching State-of-the-Art


When talking about the desirable properties of a machine learning model, "state-of-the-art" (a.k.a. SOTA) performance is usually among the most coveted. The model with the best performance is typically the one that wins the competition [12] or has a higher chance of getting accepted as a research paper [13]. However, do you really need a SOTA model? Have you considered what you are giving up by insisting on performing at that level? [14]

Remember that "there is no such thing as a free lunch": to reach top performance in anything, you will most likely have to give something up in return.

Let's take a classic example of what you give up when pushing for SOTA results: the winner of the "Otto Group Product Classification Challenge" [15]. The model was an ensemble of three stacked layers: the first layer was composed of 33 models whose outputs became 33 meta-features; these meta-features were fed to the three models of the second layer, whose outputs were combined in a weighted mean in the third and final layer. Of course, this model won the competition. It was crafted to reach the best possible result for that particular competition, without any regard for other uses the model might have.

Another example is large language models, i.e., the technology behind ChatGPT or Gemini [16]. These are the holy grail of machine learning models nowadays. Still, even these modern marvels of engineering have their critics, both regarding how they are trained [17] and how well they generalize outside of their training data [18].

In my personal experience working as an industry researcher studying different representation learning techniques for products at an e-commerce company, after investing many hours of research into complex techniques, I came to the harsh conclusion that simpler models like FastText or even Bag-of-Words can sometimes achieve very impressive results, especially when dealing with real-world data [11].


The importance of iterative development


A couple of years ago, when machine learning was becoming a substantial part of the technology industry, but before it became the juggernaut we have today, Andrew Ng published a free e-book called "Machine Learning Yearning" [19]. In it, Ng provided a series of sound principles to follow when doing machine learning, especially in industry.

Among the principles stated by Ng, one of the most important is to have an iterative process when developing machine learning models. An iterative process helps you keep improving the models you are serving, which is essential if you want to stay ahead of anyone offering a similar product. What is the point of a model that makes perfect predictions on your evaluation dataset if your real-world data is constantly changing? If you are preoccupied with having the best F1-score, you lose focus on what's essential in many industries: reaching the market first.

Iterative development that allows you to keep improving your models gives you leverage over your competitors. To achieve that, losing a couple of performance points on the evaluation data is sometimes worth it. Moreover, the evaluation data rarely reflects the real data, and even when it does, it may only do so for a brief period of time.


So, what should be our primary focus while doing machine learning in the industry?


Maintainability might not sound as cool as "state-of-the-art" performance, but in most scenarios a maintainable model is only a few points behind the SOTA model performance-wise, and you end up winning because you have far less of a development nightmare when working iteratively.

Simplicity is another critical feature to aspire to in an industry machine learning setting. Stacked models or neural networks with complex architectures might be the ones that win competitions with a fixed, predefined evaluation set. Still, when you are building something from scratch, simple white-box models such as a Decision Tree can help you understand your data better [20].

Finally, before starting your machine learning journey, you have to be honest about your situation and your limitations. If you are working on a personal project or even a startup, you usually don't have access to the armies of engineers that tech giants such as Google or Microsoft have. You have to be honest about what technology you have available and what you can accomplish with it, and understand that having less of something doesn't necessarily mean you'll be worse off, just that you need to adapt to the needs you and your business have at that moment.

The rest of this article covers some of the tools and techniques I use, and the order in which I prefer to use them, in my day-to-day work, especially when doing machine learning engineering for freelance clients. I prefer open-source tools that I can set up for free, but many of them have counterparts offered as cloud services.


Strategies for Machine Learning


When looking for a solution to a machine learning problem, there are many ways to improve our results: more or better-quality data, pre-processing (feature engineering, normalization, dimensionality reduction, etc.), model hyperparameter optimization, more training time, and so on. So, what should we focus on?


Dataset Selection


First, we require data for training, which is essential to start developing our models. However, we also need data for development and evaluation.

The training data is usually the largest portion of the data. Sometimes it takes work to gather all the data required for our purposes. The good thing is that training data doesn't have to be curated to mimic the real-world data we expect our model to work with, so we can sometimes get by with more generic data for training.

The development data, also called dev or validation data, is what we use to adjust the model's hyperparameters, perform error analysis, or perform ablation tests.

The evaluation data, or test data, is what we use to obtain the final model results, and, it should go without saying, we should never use it to make modeling decisions; it should only be used at the last stage. The evaluation and development data should come from the same distribution and be as close as possible to the real-world data the model will be used on [21]. This data should also be more carefully curated, to the point that in some cases it is even helpful to annotate examples ourselves to build a high-quality evaluation dataset, using tools like Doccano [22] or Label Studio [23].
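As a minimal sketch of how I usually set up these three splits with Scikit-Learn (the toy dataset and the split ratios here are illustrative assumptions, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the project's real data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for dev + test, stratified so class proportions are
# preserved in every split.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Split the held-out block in half: dev for tuning and error analysis,
# test only for the final report.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_dev), len(X_test))
```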


Metrics


Metrics give us numbers that are useful for assessing the performance of our models. There are a lot of metrics available, and depending on the problem, some are better suited than others. A good rule of thumb is to optimize a single metric, like accuracy or F1-score, and have at most a couple of secondary metrics to satisfy, like latency or minimum thresholds on other specific metrics. Scikit-Learn offers a great variety of metrics to choose from [24]. Some metrics I optimize for are:

  • Binary Classification: The F1-score is good when the data is not highly imbalanced. If there's a significant imbalance, I tend to go with the Average Precision Score, which is more resistant to heavy imbalance (but requires probabilistic scores).

  • Multi-class Classification: Depending on the importance of the minority classes in the final performance, either the macro-averaged F1-score or balanced accuracy is a good metric.

  • Regression: I usually use the root mean squared error, which is in the same units as the problem labels.

  • Ranking/Scoring: The ROC AUC is a good metric when what matters most is scoring positive examples above negative ones.

I avoid the use of accuracy because of its problems with class imbalance.

I also visualize the results. For classification, I usually use a heatmap of the confusion matrix. For regression, I plot the distribution of the ground truth vs. the distribution of the predictions.
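Below is a small, self-contained sketch of how these metric choices and the confusion-matrix heatmap translate into Scikit-Learn and Matplotlib calls; the toy arrays are placeholders for real predictions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, average_precision_score,
    balanced_accuracy_score, f1_score, mean_squared_error, roc_auc_score,
)

# Tiny toy arrays, just to make the snippet self-contained.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.9, 0.2, 0.4])  # probabilistic scores

# Binary classification: F1 when classes are roughly balanced,
# average precision when the positive class is rare (needs scores, not labels).
print("F1:", f1_score(y_true, y_pred))
print("Average precision:", average_precision_score(y_true, y_score))

# Multi-class classification: macro-averaged F1 or balanced accuracy
# (reusing the binary labels here; with more classes the calls are identical).
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))

# Regression: RMSE, in the same units as the target.
y_reg_true = np.array([2.0, 3.5, 4.0])
y_reg_pred = np.array([2.5, 3.0, 4.5])
print("RMSE:", mean_squared_error(y_reg_true, y_reg_pred) ** 0.5)

# Ranking/scoring: ROC AUC over the probabilistic scores.
print("ROC AUC:", roc_auc_score(y_true, y_score))

# Visual check for classification: the confusion matrix as a heatmap.
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```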


Dummy Baselines


Dummy baselines are critical. If our model can't outperform a most-frequent-class baseline or a random baseline, then there's something fundamentally wrong with it. Scikit-Learn offers dummy estimators we can use to establish a lower bound for model performance.
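A minimal sketch of such a lower bound with Scikit-Learn's DummyClassifier (the dataset here is just a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Lower bounds that any real model should beat.
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    print(strategy, dummy.score(X_test, y_test))
```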


First Iteration


The first iteration should always begin with a basic model. We should go with something as simple as possible to see where we stand. Models like Logistic Regression, Naive Bayes, Decision Trees, and linear SVMs, all available in Scikit-Learn, are usually my starting points for tabular data.
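As an illustration, a first-iteration baseline for tabular data might look like the following sketch, with a toy dataset standing in for the real one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy tabular dataset standing in for the project's actual data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, stratify=y, random_state=42)

# A deliberately simple first model: feature scaling + logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Dev F1:", f1_score(y_dev, baseline.predict(X_dev)))
```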

If I am working with text classification or some other task involving text data, I try FastText. It's simple to set up and extremely fast, without needing large hardware. Sometimes I also use SpaCy [25] or Gensim [26] because they provide solutions that are very well thought out for industry applications.
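For text, a FastText supervised baseline can be trained in a few lines; in this sketch the file names, labels, and hyperparameter values are assumptions for illustration:

```python
import fasttext

# Assumption: "train.txt" and "dev.txt" already exist, one example per line
# in FastText's supervised format, e.g.:
#   __label__electronics wireless noise cancelling headphones
#   __label__home ceramic coffee mug set of 4
model = fasttext.train_supervised(
    input="train.txt",
    epoch=25,          # illustrative hyperparameters, not tuned values
    lr=0.5,
    wordNgrams=2,
)

# Precision/recall at 1 on the dev file, plus a quick single prediction.
n_examples, precision_at_1, recall_at_1 = model.test("dev.txt")
print(n_examples, precision_at_1, recall_at_1)
print(model.predict("stainless steel kitchen knife"))
```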

When dealing with more complex models that require neural networks (like computer vision or language generation), I work with libraries that give me less flexibility but better out-of-the-box solutions: Hugging Face [27], Lightning [28], and Keras [29] are all good options.


Hyperparameter Selection


When dealing with different hyperparameters, we need an excellent tool to help us keep track of our experiments and the parameters associated with them. I have already written about MLFlow [30]; it's my tool of choice. I usually combine it with both random and grid searches for hyperparameter selection.

I first use random search to explore a larger space of hyperparameters and compare them with MLFlow's scatter plots to discard those with significantly worse performance. Once the hyperparameter space has been reduced, I do a grid search to choose the best hyperparameters overall.
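A rough sketch of that two-step search with MLFlow autologging and Scikit-Learn; the parameter ranges and the toy dataset are illustrative assumptions:

```python
import mlflow
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Autolog records the search parameters, metrics, and best candidates as MLFlow runs.
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="random-search"):
    # Broad random exploration of the regularization strength.
    random_search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions={"C": loguniform(1e-3, 1e3)},
        n_iter=20,
        scoring="f1",
        random_state=42,
    )
    random_search.fit(X, y)

with mlflow.start_run(run_name="grid-search"):
    # Finer grid around the region the random search pointed to (values illustrative).
    grid_search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.5, 1.0, 2.0, 5.0]},
        scoring="f1",
    )
    grid_search.fit(X, y)

print(random_search.best_params_, grid_search.best_params_)
```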


Bias and Variance


Bias and variance are the two sources of error we need to reduce when doing machine learning.

The first thing to control is bias. High bias means the model cannot even fit its training data, which indicates that the problem is too complex for the model or that there is a problem with the data. If I have a hard time reducing this type of error, I usually take a large enough model (e.g., a multilayer perceptron or a model with polynomial features), fit it on a small percentage of the dataset (around 10%), and try to get the training error as close to zero as possible. If this fails, there's a clear problem with the model or the data that requires revision.

High variance is usually a more complex problem that cannot be solved perfectly. It concerns the model's ability to generalize and shows up as the error on the validation data. High variance means the model is memorizing the training data instead of generalizing to unseen data. It usually calls for simpler models, regularization, or more data.
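The bias check described above might look roughly like this sketch, where the toy dataset, the MLP size, and the 10% subsample are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, stratify=y, random_state=42)

# Bias check: a deliberately large model on ~10% of the training data
# should be able to drive the training error close to zero.
rng = np.random.RandomState(42)
subset = rng.choice(len(X_train), size=len(X_train) // 10, replace=False)
big_model = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000, random_state=42)
big_model.fit(X_train[subset], y_train[subset])

train_acc = big_model.score(X_train[subset], y_train[subset])
dev_acc = big_model.score(X_dev, y_dev)
print(f"train accuracy: {train_acc:.3f}  dev accuracy: {dev_acc:.3f}")
# Low train accuracy  -> high bias: revisit the model or the data.
# High train, low dev -> high variance: regularize, simplify, or add data.
```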


Error Analysis


The final step in each iteration of machine learning development is error analysis. Once you reach a plateau in whatever metric you are optimizing, you need to take a closer look at what the model is lacking that makes it fail. This is usually best done with some manually inspected data, what Ng calls the "Eyeball Development Dataset": a dataset of no more than 100 examples that the model usually gets wrong. You can use the error your model assigns to its predictions and pick the predictions with the largest errors to check for problems manually. It may be that the data is bad, the labels are wrong, or you may spot a pattern that explains why the model is consistently wrong on the elements of the Eyeball Dataset.

I usually use MLFlow's artifacts for this, writing out the actual labels, the predicted labels, and the data itself (e.g., the text, the features, the images, etc.) and checking what is going on.
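A rough sketch of how such an artifact could be produced; here `texts`, `y_dev`, and the predicted probabilities `y_prob` are hypothetical names, assumed to come from an already trained binary classifier and its development set:

```python
import mlflow
import numpy as np
import pandas as pd

# Assumption: `texts`, `y_dev`, and `y_prob` (dev-set inputs, true labels,
# and predicted probabilities) already exist as arrays of the same length.
errors = pd.DataFrame({
    "text": texts,
    "label": y_dev,
    "predicted": (y_prob >= 0.5).astype(int),
    # How far the predicted probability is from the true label.
    "error": np.abs(y_prob - y_dev),
})

# Eyeball dataset: the ~100 worst predictions, to inspect by hand.
eyeball = errors.sort_values("error", ascending=False).head(100)

with mlflow.start_run(run_name="error-analysis"):
    eyeball.to_csv("eyeball_dev_set.csv", index=False)
    mlflow.log_artifact("eyeball_dev_set.csv")
```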


Keep Iterating


Those are the basic steps of the machine learning process. Once you reach the final one, you start over and keep iterating. You release the models and, if possible, run A/B tests on your usage scenarios to see how well they work, and use that information to improve your models further for the next iteration.


Final Thoughts


Keeping a machine learning model up and ready is a continuous task, very similar to software development. A model is not going to be valid forever; in fact, it can become outdated quite fast, so developing good habits for keeping your models updated is crucial if you wish to succeed in your business.

Reaching the state of the art sounds nice, but it can often become a burden, especially if you don't have the time or resources to keep up with larger companies that basically own the majority of the Internet's infrastructure. Most of the time, offering good enough performance at a fraction of the cost, with a model that is easier to maintain, is a much better solution.



References

[1] OpenAI. (2022, November 30). Introducing ChatGPT. OpenAI Blog. https://www.openai.com/blog/chatgpt

[2] Nayak, P. (2019). Understanding searches better than ever before. https://blog.google/products/search/search-language-understanding-bert/

[3] Uszkoreit, J. (2017) Transformer: A Novel Neural Network Architecture for Language Understanding. https://blog.research.google/2017/08/transformer-novel-neural-network.html

[4] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (p./pp. 10684--10695).

[5] Roose, K. (2022). An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy. https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html

[6] OpenAI. Sora. Creating video from text. https://openai.com/sora

[7] Knight, W. (2023). OpenAI’s CEO Says the Age of Giant AI Models Is Already Over. https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/

[8] Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... & Varoquaux, G. (2013). API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.

[9] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the association for computational linguistics, 5, 135-146.

[10] Cardellino, C., & Carrascosa, R. (2021). A Study on Multiple Tasks for e-Commerce Marketplaces. The International FLAIRS Conference Proceedings, 34. https://doi.org/10.32473/flairs.v34i1.128469

[11] Cardellino, C., & Carrascosa, R. (2022). A Study on Title Encoding Methods for e-Commerce Downstream Tasks. The International FLAIRS Conference Proceedings, 35. https://doi.org/10.32473/flairs.v35i.130550

[12] "The Netflix Prize". Archived from the original on 2009-09-24. Retrieved 2012-07-09.

[13] Beckham, C. (2022). The obsession with SOTA needs to stop https://beckham.nz/2022/08/02/sota.html

[14] Britz, D. (2020) Replication Issues in AI research https://dennybritz.com/posts/ai-replication-issues/

[16] Pichai S., Hassabis D. (2024). Our next-generation model: Gemini 1.5 https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/

[17] Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the dangers of stochastic parrots: Can language models be too big?🦜. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610-623).

[18] Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., & Zhang, C. (2022). Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.

[19] Ng, A. (2017). Machine learning yearning. URL: http://www.mlyearning.org/ (96), 139, 30.

[20] Milla, D. (2017). Introduction to Decision Trees (Titanic dataset). https://www.kaggle.com/code/dmilla/introduction-to-decision-trees-titanic-dataset

[21] Assawiel, N. (2018). What to do when your training and testing data come from different distributions. https://www.freecodecamp.org/news/what-to-do-when-your-training-and-testing-data-come-from-different-distributions-d89674c6ecd8/

[22] Doccano: Open source annotation tool for machine learning practitioners. https://github.com/doccano/doccano

[23] Label Studio: Open source data labeling platform. https://labelstud.io/

[24] Scikit-Learn. Metrics and scoring: quantifying the quality of predictions. https://scikit-learn.org/stable/modules/model_evaluation.html

[25] SpaCy. Industrial-Strength Natural Language Processing in Python. https://spacy.io/

[26] Gensim. Topic Modelling for Humans. https://radimrehurek.com/gensim/

[27] Hugging Face Inc. https://huggingface.co/

[28] Lightning AI. https://lightning.ai/

[29] Keras: Deep learning for humans. https://keras.io/

[30] Cardellino, C. (2024). Keeping Track of Experiments with MLFlow. Transcendent AI. https://www.transcendent-ai.com/post/keeping-track-of-experiments-with-mlflow
