Observability in Large Language Models

In engineering, we often hear about observability. Unfortunately, in the market, every company that sells observability tools has its own definition of the word, and it can be a little confusing. In this article, I’m going to borrow insights from this podcast on Observability for Large Language Models and unpack the core principles of observability to get a better understanding of the topic. While at it, I’ll share how observability is evolving as we transition into the AI (LLM) era.

Observability in general

Observability is the ability to understand the state of a system without having to change that system. It’s about asking questions about what’s going on and continually getting answers that help you narrow down the behavior that you’re seeing.

In layman’s terms: when your system is running in production, how do you check its health? More importantly, how do you debug errors or weird behaviors?

The general principle is that when you’re debugging code and it’s easy to reproduce something on your local machine, that’s great. You have your code and a fancy debugger to help you get more information.

But what if you can’t do that? That’s the reality of an application running in production. The most common tools for observing it are Logging, Metrics, Tracing, and Profiling.
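
To make the first two pillars concrete, here is a minimal Python sketch (my own illustration, not from the podcast): one structured log line per request plus a crude latency metric and request id. The handle_request function and the field names are hypothetical.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

def handle_request(payload: dict) -> dict:
    """Hypothetical request handler instrumented with a structured log line,
    a latency metric, and a request id so individual requests can be traced later."""
    request_id = str(uuid.uuid4())      # poor man's trace id
    started = time.perf_counter()
    status = "ok"
    try:
        return {"echo": payload}        # stand-in for real business logic
    except Exception:
        status = "error"
        raise
    finally:
        latency_ms = (time.perf_counter() - started) * 1000
        # One structured log line per request: greppable, filterable, aggregatable.
        logger.info(json.dumps({
            "request_id": request_id,
            "event": "request_handled",
            "status": status,
            "latency_ms": round(latency_ms, 2),
        }))

if __name__ == "__main__":
    handle_request({"q": "hello"})
```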

Icing on top of traditional observability

Just having these four alone is insufficient. As developers, the more tools we have for debugging, the better, because the job of debugging and finding the root cause in production is so darn difficult.

That’s where you see different products in the market that try to solve observability for their specific domain: for example, serverless applications, container applications, framework-specific stacks (like the JVM), and more.

Observability in large language models (LLMs)

An LLM is a whole new beast of its own. It is a black box that you submit text to and let the box do its magic. The inputs and outputs of this box are wildly variable: JSON, paragraphs, code, bullet points, and the list goes on.

Now, imagine a production environment where the logs are streaming non-stop and you have no way to filter them properly to reproduce or debug an issue. For example, how do you differentiate requests with malicious intent? At today’s state of engineering, we have not reached a point where we can unpack step by step what is happening inside the model.

With large language models, pretty much everything is in a sense unreproducible, non-debuggable, and non-deterministic in its outputs.

The solution to LLM observability

Because the inputs and outputs are so variable, it is impossible to test them exhaustively up front; QA has to be dealt with on the fly. A good engineering practice that helps maintain stability in this state of chaos is to be comfortable with incrementality and fast releases, which matter much more here (in case of a bad release, you can recover quickly).
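
As a rough illustration of that incrementality, here is a sketch of a percentage-based rollout that sends only a small slice of traffic to a new prompt version and can be dialed back quickly. The version names and the 5% threshold are assumptions for the example, not anything prescribed by the podcast.

```python
import hashlib

# Hypothetical rollout config: what fraction of traffic sees the new prompt version.
ROLLOUT_PERCENT = 5            # start small; bump gradually if metrics stay healthy
NEW_PROMPT_VERSION = "v2"
STABLE_PROMPT_VERSION = "v1"

def prompt_version_for(user_id: str) -> str:
    """Deterministically bucket a user so they always see the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return NEW_PROMPT_VERSION if bucket < ROLLOUT_PERCENT else STABLE_PROMPT_VERSION

if __name__ == "__main__":
    print(prompt_version_for("user-123"))
```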

Because of this nature and the lack of traditional QA, LLMs are a perfect fit for observability. What are all the things upstream of a large language model that potentially influence the call? And what are all the things that happen downstream, and what can you do with that output? All of this has to be dealt with in production.
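
One way to capture that upstream and downstream context is to wrap every model call and emit a single structured event. The sketch below assumes a hypothetical call_model client and made-up field names; swap in your real provider SDK.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm")

def call_model(prompt: str, **params) -> dict:
    """Hypothetical LLM client; replace with your real provider call."""
    return {"text": "stubbed completion", "prompt_tokens": 42, "completion_tokens": 7}

def observed_completion(prompt: str, *, prompt_version: str, user_id: str, **params) -> str:
    started = time.perf_counter()
    response = call_model(prompt, **params)
    latency_ms = (time.perf_counter() - started) * 1000
    # Record what went in upstream (prompt, version, parameters, caller) and
    # what came out downstream (tokens, latency, a preview of the output)
    # so production behavior can be sliced and debugged later.
    logger.info(json.dumps({
        "event": "llm_call",
        "user_id": user_id,
        "prompt_version": prompt_version,
        "params": params,
        "prompt_chars": len(prompt),
        "prompt_tokens": response.get("prompt_tokens"),
        "completion_tokens": response.get("completion_tokens"),
        "latency_ms": round(latency_ms, 2),
        "output_preview": response["text"][:80],
    }))
    return response["text"]

if __name__ == "__main__":
    observed_completion("Summarize this ticket...", prompt_version="v1",
                        user_id="user-123", temperature=0.2)
```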

Monitor for regressions, or establish a process to test for them. The model itself is non-deterministic, so it’s very easy to regress something that was previously working without necessarily knowing it upfront. That is about the best you can do in the form of QA, and the worst thing that could happen when you’re delivering value is to go backward.
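
A lightweight way to do that, sketched below, is a “golden prompt” sweep: re-run a fixed set of prompts on every release and apply cheap checks to the answers. The prompts, the checks, and the stubbed call_model are all illustrative assumptions.

```python
def call_model(prompt: str) -> str:
    """Hypothetical model client; replace with your real provider call."""
    return "Paris"  # stub so the sketch runs end to end

GOLDEN_CASES = [
    # (prompt, check that returns True when the answer still looks right)
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("Name the capital city of France in one word.", lambda out: "Paris" in out),
]

def run_regression_sweep() -> bool:
    failures = []
    for prompt, still_ok in GOLDEN_CASES:
        output = call_model(prompt)
        if not still_ok(output):
            failures.append((prompt, output))
    for prompt, output in failures:
        print(f"REGRESSION: {prompt!r} -> {output!r}")
    return not failures

if __name__ == "__main__":
    print("sweep passed" if run_regression_sweep() else "sweep failed")
```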

Differentiate bot vs. human input. Most of the traffic hitting our network is bots. Filtering out that noise can drastically reduce the amount of information that needs to be processed for debugging.
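
As a toy example of that filtering, the sketch below tags likely-bot traffic with a couple of crude heuristics (user agent and request rate) so it can be excluded from debugging views. Real systems rely on much richer signals; the pattern and threshold here are assumptions.

```python
import re

# Crude, illustrative heuristics for tagging likely-bot traffic.
BOT_UA_PATTERN = re.compile(r"(bot|crawl|spider|curl|python-requests)", re.I)
MAX_HUMAN_REQS_PER_MINUTE = 60   # assumed threshold for the sketch

def looks_like_bot(user_agent: str, reqs_last_minute: int) -> bool:
    """Return True when the request looks automated rather than human."""
    if BOT_UA_PATTERN.search(user_agent or ""):
        return True
    return reqs_last_minute > MAX_HUMAN_REQS_PER_MINUTE

if __name__ == "__main__":
    print(looks_like_bot("Mozilla/5.0 (Macintosh)", 12))   # False -> keep in debug view
    print(looks_like_bot("python-requests/2.31", 500))     # True  -> filter as noise
```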