Even with a perfect stack of MLOps tools, teams still struggle to deliver ML products. So if tools aren’t the only piece of the puzzle, what does that leave? In my last post, I argued that the remaining pieces are Culture and Process.
Let’s dive into the MLOps process — in particular, some of the foundational pieces that most teams get wrong. What does a successful MLOps process look like, and how can individual ML practitioners help to build that process?
- Start with a product, not a model
- Survey the data in production, not in your warehouse
- Start simple — with data and models
- Partner with engineers
Let’s dive into each of these pillars in more detail.
Start with an ML product
Perhaps the most important practice that enables successful ML projects is to design a product, not a model. One of the biggest pitfalls I’ve seen across dozens of companies is to hand “projects” to data teams, instead of involving them in the product design phase.
To build a successful ML product, three stakeholders need to be involved in designing the product:
- PM / Business Stakeholder: What does success look like?
- ML Person: What is (likely) possible with ML?
- Product Engineer: What is feasible, and what are the constraints?
Most important: these three personas must be highly aligned throughout the development of an ML product. When ML projects fail, it’s typically due to an alignment problem!
Some examples of poorly aligned teams:
- ML person optimizes for model accuracy (instead of business outcomes!)
- Projects get started that aren’t feasible to solve with ML
- The model doesn’t meet performance constraints in production
- The features are challenging or impossible to compute in production
Some examples of good alignment:
- ML person understands the tradeoffs of accuracy vs. time to market
- Monitoring built on day zero to ensure business outcomes are consistently measured
- Engineer helps ML person understand the production data landscape
- Model SLAs are clearly defined and measured
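Two of those alignment wins — day-zero monitoring and measured SLAs — can start as something very small. Below is a hedged sketch, not a prescription: every name in it (`SLA_MS`, `predict_with_monitoring`, the in-memory metrics sink) is invented for illustration, and a real system would ship these records to a metrics store.

```python
import time

# Hedged sketch of "monitoring from day zero": wrap each prediction so that
# latency (a model SLA) and the prediction itself are recorded on every call.
# All names here are invented for illustration.

SLA_MS = 200  # assumed latency budget agreed with the product engineer

def predict_with_monitoring(model, features, metrics_sink):
    start = time.perf_counter()
    prediction = model(features)
    latency_ms = (time.perf_counter() - start) * 1000
    metrics_sink.append({
        "latency_ms": latency_ms,
        "sla_met": latency_ms <= SLA_MS,
        "prediction": prediction,
    })
    return prediction

# Toy usage: a stand-in "model" and an in-memory metrics sink.
metrics = []
toy_model = lambda f: sum(f) > 1.0
predict_with_monitoring(toy_model, [0.4, 0.9], metrics)
```

The point isn’t the code — it’s that the latency budget and the recorded outcome are agreed on with the PM and the engineer before the model ships.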
Survey the data in production
This appeared as an example of good alignment in the last section, but it deserves a section of its own. Almost all ML builders I’ve seen start their ML projects with a survey of the available data. The problem? They typically survey the data that is available for training, not the data that will be available in production.
Some might ask — shouldn’t all of the data available for training be available in a production system?
Most of the time, the answer is yes, but with a bunch of asterisks. How quickly is that data available? How fresh is that data? How much preprocessing needs to be done on the prod data to make it consumable? Who owns that data?
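Some of those asterisks can be checked empirically rather than debated. A toy sketch of answering “how fresh is that data?” — the names and the five-minute threshold are invented for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hedged sketch: answer "how fresh is that data?" by comparing each event's
# timestamp to the current time. Names and thresholds are illustrative.
def staleness(event_timestamp):
    return datetime.now(timezone.utc) - event_timestamp

# An event that landed 7 minutes ago is already too stale for a
# hypothetical feature that assumes a 5-minute window.
event_ts = datetime.now(timezone.utc) - timedelta(minutes=7)
too_stale = staleness(event_ts) > timedelta(minutes=5)
```

Running a check like this against real production events surfaces freshness problems long before a model depends on them.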
So many ML projects stall out because of issues with production data. I’ve seen over and over again that there is a huge disconnect between the ML person and the product engineer. Consider two innocuous-looking features with dramatically different requirements:
- A user’s home zip code: probably dirt simple to use in production. Query a database.
- A user’s average location in the last five minutes: probably a PITA! Is the user’s location data on a Kafka stream? How fresh does it need to be? Streaming aggregations are hard, and training/serving skew is almost guaranteed!
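To make the contrast concrete, here’s a hedged sketch of the “dirt simple” feature — the table and column names are invented, and SQLite stands in for whatever database production actually uses:

```python
import sqlite3

# Hedged sketch: the "dirt simple" feature is one indexed lookup.
# Table and column names are invented for illustration.
def home_zip_code(conn, user_id):
    row = conn.execute(
        "SELECT home_zip FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

# The streaming feature, by contrast, would need (at minimum):
#   - a consumer on the location stream
#   - a 5-minute sliding-window average per user
#   - handling for late / out-of-order events
#   - the *same* aggregation logic reproduced at training time
# That's an order of magnitude more moving pieces than one SELECT.

# Toy usage with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, home_zip TEXT)")
conn.execute("INSERT INTO users VALUES (1, '48109')")
zip_code = home_zip_code(conn, 1)
```

One feature is a SELECT; the other is a distributed-systems project. That gap is exactly what the ML person and the product engineer need to surface together.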
Spend extra time understanding what data is available in production, and what constraints apply to that data. It’ll save you months on the way to a functional solution.
Start simple — with data and models
Probably the most common piece of ML advice, but it’s good advice: start with a simple solution.
My addition — most folks will tell you to start with a simple model, but it’s equally important to start with simple data! To play off of the example above:
- A user’s average location in the last five minutes: hard
- A user’s most recent location: probably much easier!
You may discover that a model built with “a user’s average location in the last five minutes” is more accurate, but you may be able to get a model built with “a user’s most recent location” into production two weeks earlier.
You probably take that trade every time. You can always build a V2 with the fancier feature, and it will be much easier to build incremental improvements than to ship something complicated the first time.
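The V1 feature can be as small as one key-value read. A hedged sketch — in production the store might be Redis or a feature store, but here a plain dict stands in, and all names are invented:

```python
# Hedged sketch: the V1 "most recent location" feature is one key-value
# read. A dict stands in for a real key-value store; names are invented.
latest_location = {"user_42": (42.28, -83.74)}  # user_id -> (lat, lon)

def most_recent_location(store, user_id):
    return store.get(user_id)  # None if we have no location for this user

loc = most_recent_location(latest_location, "user_42")
```

Shipping this first gets the end-to-end product working; the five-minute windowed average can replace it in V2 without changing the model’s interface.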
Partner with engineers
Odds are that if you’re a data scientist, not all of the questions posed above are easy to answer. When I was last hands-on building ML models, I had no idea what streaming data was (let alone how to think about it).
The solution is to befriend engineers and work with them throughout the development of an ML model. No software project should be built in a silo, and ML in a silo is even worse. Engineers can help answer “what data is available,” “what constraints should I know about,” “what SLAs are feasible,” and more.
Work with an engineer early and often. You’ll build projects way faster.
These steps aren’t a comprehensive view of an MLOps process — there are a lot of moving pieces that lead to success (code reviews, CI/CD, monitoring, …). This is a starting point. As mentioned above, most of the ML failures I’ve seen are alignment issues. These process guidelines are primarily meant to help you align your team for success.
You need a strong foundation to build an exceptional MLOps practice.
David Hershey is an investor at Unusual Ventures, where he invests in machine learning and data infrastructure. David started his career at Ford Motor Company, where he started their ML infrastructure team. Recently, he worked at Tecton and Determined AI, helping MLOps teams adopt those technologies. If you’re building a data or ML infrastructure company, reach out to David on LinkedIn.