How to test if your LLM's performance deteriorates

Last week I heard two people involved in LLM product development claim that “LLMs deteriorate over time”. Have you experienced similar issues? I had to dig further and see whether there were other reports of this.

The first mention of large language model deterioration dates from July 2023, when three computer scientists (two from Stanford and one from Berkeley) tested GPT-4 and GPT-3.5, first in March and then in June, and saw model performance drop.

Last month, another article on the topic was released, again suggesting that newer LLMs perform worse than older models.

Why might LLMs deteriorate over time?

In summary, the reasons for the supposed model deterioration may lie in:

The training data
The parameters defining the model

New models vs. old models

New models can be affected by both the training data and the model parameters. Old models, however, can suffer only if their parameters are modified after release.

Here's what Lauren Leffer says (source):

“Unlike in a traditional computer program, where each line of code serves a clear purpose, developers of generative AI models often cannot draw an exact one-to-one relationship between a single parameter and a single corresponding trait. This means that modifying the parameters can have unexpected impacts on the AI’s behavior.”

Yes, one cannot bet on LLMs being deterministic. Their ability to communicate like humans and vary their wording so that they sound more natural is one of their best features.

Is there anything you can do about it?

Because LLM performance depends heavily on the input, you cannot be sure how an app built on a particular large language model will perform over time unless you run tests.

The tests can consist of 20-25 questions that matter for your app or business case. Run those questions at regular intervals and record the answers. By comparing the answers from different runs, you can see whether the model still does a good job, or whether you need to change your prompts or even switch to a new LLM. Check here for more info on LLM testing.
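
A recurring test like this could be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive implementation: `ask_model` is a hypothetical stand-in for whatever function calls your LLM, the names `run_suite` and `compare_runs` are made up for this example, and the 0.8 threshold is an arbitrary starting point. Plain text similarity (from the standard library's `difflib`) is a crude proxy for quality, so treat a low score as a prompt for human review rather than proof of deterioration.

```python
import json
from datetime import datetime, timezone
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def run_suite(ask_model: Callable[[str], str], questions: List[str]) -> Dict:
    """Ask every question once and return a timestamped record of the answers.

    `ask_model` is a hypothetical callable wrapping your own LLM call.
    """
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "answers": {q: ask_model(q) for q in questions},
    }


def save_run(run: Dict, path: str) -> None:
    """Store a run as JSON so that later runs can be compared against it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(run, f, ensure_ascii=False, indent=2)


def load_run(path: str) -> Dict:
    """Load a previously saved run."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def compare_runs(baseline: Dict, current: Dict, threshold: float = 0.8) -> List[Dict]:
    """Return the questions whose new answer differs noticeably from the baseline.

    The threshold is an assumption; tune it for your own use case.
    """
    drifted = []
    for question, old_answer in baseline["answers"].items():
        new_answer = current["answers"].get(question, "")
        similarity = SequenceMatcher(None, old_answer, new_answer).ratio()
        if similarity < threshold:
            drifted.append(
                {
                    "question": question,
                    "old": old_answer,
                    "new": new_answer,
                    "similarity": round(similarity, 2),
                }
            )
    return drifted
```

In practice you would call `run_suite` on a schedule (for example monthly), save each result with `save_run`, and pass the oldest and newest runs to `compare_runs`. Since LLM answers vary in wording even when the model is unchanged, it helps to review the flagged questions by hand before changing prompts or switching models.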