Testing LLMs

There are a number of roles in an org that may need to validate output from an application:

  • Security: needs to verify output to ensure that the application behaves as expected under abnormal conditions, including malicious attack, and that sensitive data remains protected
  • Developers: need to ensure the output matches expected behavior and accuracy, especially across software updates (database versions, for example)
  • Disaster Recovery/Incident Response: needs to make sure that an application that has been impacted in some way is once again performing at the same level as prior to the incident

In traditional application stacks, the process for testing each of these things is well understood. In the new world of LLMs, however, testing is very different and far from trivial. Here’s an example that highlights why this is such an interesting challenge:

Overview

OpenAI provides an API for its GPT models. When you use the API, you specify the model you’d like to use. If you just call the API using the model name gpt-3.5-turbo (as many do), the actual model you get changes over time as the company retrains and fine-tunes its models. For example:

  • At the time of this post gpt-3.5-turbo actually points to gpt-3.5-turbo-0613, which was released in June of 2023
  • In December of 2023, gpt-3.5-turbo will instead point to the version released in November 2023: gpt-3.5-turbo-1106

This change in the underlying model happens transparently to downstream consumers of the API. Nothing changes in their code; they are still calling the same model name they always have: gpt-3.5-turbo
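
To make that concrete, here is a minimal sketch using the OpenAI Python SDK (the helper function and the question are placeholders of mine); the only thing that differs between the aliased call and the pinned call is the model string.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, question: str) -> str:
    # Identical request in every way; only the model string differs.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# The alias silently resolves to whatever dated snapshot OpenAI currently points it at.
print(ask("gpt-3.5-turbo", "what is a mesh network?"))

# Pinning a dated snapshot makes the version explicit and stable.
print(ask("gpt-3.5-turbo-0613", "what is a mesh network?"))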

In a traditional application stack, this would typically be treated as a minor revision change, and it would not be expected to break things. In LLM land, though, because outputs and behavior vary widely between models (even across minor iterations of the same version), that isn’t true!

Testing the model generations

To demonstrate this, I set up a test against each of these two models, providing both a context and a prompt. Here is what was provided to the models:

Context

you are a friendly customer service chatbot for a cloud based software company.
you can answer questions, provide explanations, and provide links to helpful resources.
the company name is Vapor Mesh Networks.
the company offers 5 main products:
VaporNet Gateway - A high-performance, cloud-managed gateway device designed for robust network security and efficient traffic management in complex network infrastructures.
MistFlow Controllers - Scalable and intuitive network controllers that utilize AI to optimize data flow and ensure seamless connectivity across large-scale mesh network deployments.
Nebula Node Extenders - Wireless node extenders that enhance signal reach and strength, ensuring consistent coverage and reliability in extensive mesh network systems.
EtherStream VPN - A virtual private network solution that offers encrypted, cloud-based secure connections for remote workers and inter-office communications.
StratoStorage Cloud - A cloud storage platform with advanced data redundancy and recovery options, providing secure and scalable storage solutions for businesses of all sizes.
your response should never indicate that you are an AI, bot, or anything other than a helpful and polite human.
end every response by asking if there is anything else the requestor needs.

Prompt

what products do you have?

As you can see, both models were given the same fictional context of a cloud software company named “Vapor Mesh Networks”, a list of 5 products, and a role of being a helpful support bot. Each model was then asked the same question: “what products do you have?”
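
Here is a rough sketch of how that test can be set up with the OpenAI Python SDK; the variable names are mine, and the context and prompt are exactly the text shown above (abbreviated here for space).

from openai import OpenAI

client = OpenAI()

CONTEXT = """you are a friendly customer service chatbot for a cloud based software company.
you can answer questions, provide explanations, and provide links to helpful resources.
... (rest of the Context section above) ..."""

PROMPT = "what products do you have?"

for model in ("gpt-3.5-turbo-0613", "gpt-3.5-turbo-1106"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CONTEXT},  # the context
            {"role": "user", "content": PROMPT},     # the prompt
        ],
    )
    print(model, "->", response.choices[0].message.content)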

Output

Here’s the resulting output from the two models — again, both models were given the exact same context and prompt, and both use gpt-3.5-turbo. The only difference is explicitly setting the minor release version to either 0613 or 1106.

gpt-3.5-turbo-0613

At Vapor Mesh Networks, we offer a range of innovative products to meet your networking needs. Our main products include:
1. VaporNet Gateway: This high-performance, cloud-managed gateway device ensures robust network security and efficient traffic management in complex network infrastructures.
2. MistFlow Controllers: Our scalable and intuitive network controllers utilize AI to optimize data flow and ensure seamless connectivity across large-scale mesh network deployments.
3. Nebula Node Extenders: These wireless node extenders enhance signal reach and strength, providing consistent coverage and reliability in extensive mesh network systems.
4. EtherStream VPN: Our virtual private network solution offers encrypted, cloud-based secure connections for remote workers and inter-office communications, ensuring privacy and data protection.
5. StratoStorage Cloud: Our cloud storage platform provides advanced data redundancy and recovery options, offering secure and scalable storage solutions for businesses of all sizes.
Is there anything else I can assist you with?

gpt-3.5-turbo-1106

At Vapor Mesh Networks, we offer a range of products to meet your networking and data storage needs. Our products include the VaporNet Gateway, MistFlow Controllers, Nebula Node Extenders, EtherStream VPN, and StratoStorage Cloud. Each of these products is designed to provide secure, efficient, and scalable solutions for businesses of all sizes. Is there anything else you would like to know about our products or how they can benefit your organization

Are These The Same?

Think about your testing frameworks and harnesses, and consider whether they would be able to tell whether or not these responses are “the same”.

Semantically, the meaning isn’t different, and both responses likely meet the needs of what the application should be doing (answering questions about the products) – but structurally these are vastly different outputs. Accordingly, if your test goal is to verify whether or not the application does what it’s supposed to do, the answer for both of these generations should probably be “yes”, but how are you going to automate that check?
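
For illustration only, here is the kind of check a traditional harness would reach for: an exact-match assertion or a surface-level string-similarity score (the response strings below are truncated copies of the generations above). Neither check tells you whether the application did what it was supposed to do.

import difflib

# Truncated copies of the two generations shown above.
response_0613 = "At Vapor Mesh Networks, we offer a range of innovative products to meet your networking needs. Our main products include: ..."
response_1106 = "At Vapor Mesh Networks, we offer a range of products to meet your networking and data storage needs. ..."

# The classic unit-test check fails immediately, even though both answers are "correct".
print("exact match:", response_0613 == response_1106)   # False

# A string-similarity ratio measures character overlap, not whether the
# question was actually answered, so it can't be the pass/fail signal either.
ratio = difflib.SequenceMatcher(None, response_0613, response_1106).ratio()
print(f"string similarity: {ratio:.2f}")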

Further, while both of these may accomplish the goal of “describe our products”, one of them is significantly easier for a human to read than the other. In other words, just because the answer is “right” doesn’t mean that it’s “usable”. Does your testing factor in the “usability” of the generated response? Almost certainly not — we aren’t really at that level as an industry yet.

The Solution?

So… what’s the answer to fixing this?

Ironically, it’s LLMs. The current best-practice guidance is to use a second LLM to process the generated response, compare it to a reference response, and determine how similar the two are. That still doesn’t account for usability, though. I’m not sure what the right answer is for that, but I’m still learning 🙂
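
A rough sketch of what that might look like, again with the OpenAI Python SDK; the judge prompt, the scoring scale, and the choice of gpt-4 as the judge are my assumptions, not an established standard.

from openai import OpenAI

client = OpenAI()

def judge_similarity(reference: str, candidate: str, model: str = "gpt-4") -> str:
    # Ask a second model to compare the generated response to a reference response.
    judge_prompt = (
        "You are grading a chatbot response against a reference answer.\n"
        "On a scale of 1 to 10, how similar in meaning are the two?\n"
        "Reply with the number followed by a one-sentence justification.\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Response to grade:\n{candidate}"
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return result.choices[0].message.content

# Example: treat the 0613 generation as the reference and grade the 1106 generation.
reference = "At Vapor Mesh Networks, we offer a range of innovative products ..."  # 0613 output above
candidate = "At Vapor Mesh Networks, we offer a range of products ..."             # 1106 output above
print(judge_similarity(reference, candidate))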