Thanks to prompt engineering best practices like few-shot learning and chain-of-thought reasoning, we’re able to get better outputs from LLMs.
Thanks to these best practices, prompts are getting longer, a whole lot longer. Output quality and prompt length tend to rise together. That’s great for API providers like OpenAI and Anthropic, and better outputs are great for you. Longer prompts, however, don’t do your wallet any favors.
Towards the end of last year, a research team from Microsoft released a paper: LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. The paper set out to solve the problem of really long prompts by compressing them, with the goal of keeping the outputs at the same level of quality. Better outputs, lower prices, Papa John’s.
The headline takeaway was that they were able to compress prompts by as much as a factor of 20 while seeing minimal degradation in output quality. Let’s see how they do it.
How LLMLingua works - A high-level overview
At a very high level, LLMLingua works by trimming the fat. Spoken and written English is naturally verbose, and that verbosity carries over into our writing and our prompt engineering. LLMLingua compresses prompts by removing unnecessary and repetitive information.
It’s a little more technical than that, so let’s take a deeper look.
How LLMLingua works - an example
Prompt compression works by removing unnecessary and repetitive tokens.
Here’s an example from the paper.
Original prompt: 253 tokens
Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He rearranged five of these boxes into packages of six highlighters each and sold them for $3 per package. He sold the rest of the highlighters separately at the rate of three pens for $2. How much profit did he make in total, in dollars?
Let’s think step by step
Sam bought 12 boxes x $10 = $120 worth of highlighters.
He bought 12 * 30 = 360 highlighters in total.
Sam then took 5 boxes × 6 highlighters/box = 30 highlighters.
He sold these boxes for 5 * $3 = $15
After selling these 5 boxes there were 360 - 30 = 330 highlighters remaining.
These form 330 / 3 = 110 groups of three pens.
He sold each of these groups for $2 each, so made 110 * 2 = $220 from them.
In total, then, he earned $220 + $15 = $235.
Since his original cost was $120, he earned $235 - $120 = $115 in profit.
The answer is 115
Now let’s look at the compressed prompt
Compressed Prompt:
143 tokens, Compression rate: 43% (1 - 143/253 ≈ 0.43)
: Sam bought a dozen boxes each 30 highl pens inside, $10 each. He reanged five of boxes into of six each $3 per. He sold the thelters separately at the of three $2. much make total,
Lets think step
bought boxes x0 oflters
He 2 3ters in
Sam then boxes 6lters/box 0ters
He sold these boxes 5
Afterelling these boxes there 36030lters
ese00 of three
sold groups2 each so made *2 $20 from
In total, he015
Since his he $ - $120 = $115 in profit.
The answer is 115
These prompts performed within 1% of each other at scale. The difference is that the compressed prompt costs 43% less to run.
Now let’s talk about how the compression works.
How LLMLingua works, technically
Step 1: Compress the demonstrations (sentence-level compression)
The compression framework starts by focusing on the in-context demonstrations. A small LLM is used to compute the perplexity of these demonstrations.
Perplexity is a measurement of how well a model can predict a given sequence (formally, the exponential of the average negative log-likelihood per token). High perplexity means the sequence was less predictable; low perplexity means it was more predictable for the model, based on its training.
The demonstrations with lower perplexity, the ones the small model found most predictable and therefore least informative, are dropped, while the higher-perplexity demonstrations are kept, subject to a configurable compression rate (token budget).
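To make this concrete, here’s a minimal sketch of the idea in Python (not the paper’s actual implementation). It uses a small causal language model from Hugging Face’s transformers library to score each demonstration’s perplexity, then keeps the most informative ones until a token budget is used up. The model choice (GPT-2) and the budget are illustrative assumptions, not what LLMLingua ships with.

```python
# Rough sketch of perplexity-based demonstration selection (not the official
# LLMLingua code). GPT-2 stands in for "a small LLM" purely for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep the highest-perplexity (most informative) demos within a token budget."""
    kept, used = [], 0
    for demo in sorted(demos, key=perplexity, reverse=True):
        n_tokens = len(tokenizer(demo)["input_ids"])
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept
```

LLMLingua’s budget controller also splits the overall token budget across demonstrations, instructions, and the question; the sketch above only shows the keep/drop decision for demonstrations.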
Step 2: Compress the instructions (token-level compression)
Now that we’ve handled the demonstrations, next up is the instruction part of the prompt.
The instructions are generally more important than the demonstrations. A prompt can survive without in-context demonstrations, but not without instructions.
Because of this, rather than compressing the instructions at the sentence level like the demonstrations, they are compressed at the token level.
Here’s how it works in practice (a rough code sketch follows the list):
Divide the whole prompt, comprising the demonstrations and instructions, into segments.
Iterate over the tokens with a small LLM and calculate the conditional probabilities (very similar to perplexity)
Remove the tokens the small LLM found most predictable (lowest perplexity), since they add the least information
Continue until all segments have been analyzed
Combine all the segments to produce the final, compressed prompt
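Here’s a simplified sketch of that token-level idea, reusing the same small stand-in model as the previous snippet. It scores every token by how surprising it was to the small model (its negative log-probability given the preceding tokens) and drops the most predictable ones. The real algorithm works iteratively, segment by segment, conditioning each segment on the already-compressed text and deriving a perplexity threshold from the target compression ratio, rather than the fixed keep ratio assumed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same small stand-in model as the previous sketch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def compress_tokens(text: str, keep_ratio: float = 0.6) -> str:
    """Drop the tokens the small LM found most predictable (lowest surprisal)."""
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"][0]
    with torch.no_grad():
        logits = model(**inputs).logits[0]

    # Log-probability of each token given the tokens that precede it.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    token_log_probs = log_probs.gather(1, input_ids[1:].unsqueeze(1)).squeeze(1)
    surprisal = -token_log_probs  # high surprisal = informative, so keep it

    # Keep the first token plus the most surprising keep_ratio of the rest,
    # preserving their original order.
    n_keep = max(1, int(len(surprisal) * keep_ratio))
    keep_positions = (surprisal.topk(n_keep).indices + 1).sort().values
    kept_ids = [input_ids[0].item()] + input_ids[keep_positions].tolist()
    return tokenizer.decode(kept_ids)
```

Running something like this over the highlighter prompt produces the same kind of patchy-but-still-recoverable text as the compressed example above, though not the exact output LLMLingua would give you.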
Step 0: Model alignment
This step actually happens before compression, but it’s easier to understand once you’ve seen how the compression works, so I’ve put it here.
The small LLM used to calculate the conditional probabilities and perplexities is tuned to be aligned with the larger LLM that will eventually run the prompt.
The goal is to ensure consistency between the small and larger models in terms of perplexity calculations and compression decisions.
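The paper does this with instruction tuning: the large target LLM’s own outputs become training data for the small model. The snippet below is only a sketch of the data-collection side of that process; generate_with_large_llm is a hypothetical placeholder for a call to the target model’s API, and the actual alignment would then be standard supervised fine-tuning on these pairs.

```python
# Hypothetical sketch of the alignment step's data collection (not the paper's
# training code). generate_with_large_llm is a placeholder for an API call to
# the large target model (e.g. GPT-3.5 or Claude).
def build_alignment_dataset(instructions, generate_with_large_llm):
    return [
        {"prompt": instruction, "completion": generate_with_large_llm(instruction)}
        for instruction in instructions
    ]

# The small LLM is then fine-tuned on these (prompt, completion) pairs so that
# its perplexity estimates track what the large model actually finds predictable.
```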
Experiment results
For a deeper dive into the experiments and results, check out our full analysis on the PromptHub blog here. If video is your preferred format, we also put together a video dissecting prompt compression here.
A few takeaways
Even at high compression ratios (20x in some cases), the compressed prompts were able to perform at similar levels (within 5%) to the uncompressed prompt.
The compressed prompts tended to work better on reasoning tasks compared to summarization tasks
In terms of the final performance of the compressed prompt, the compression algorithm was most effective when the downstream model was Claude-v1.3 rather than GPT-3.5-Turbo-0301
Once you start to compress above 15-20x, performance falls off a cliff
Wrapping up
Running LLMs in production isn’t cheap, especially if you want to use the best models like GPT-4 and Claude 2. Compressing prompts is one way to reduce your LLM provider bill, while retaining similar performance levels.
It’s possible that you could just wait 3 months, and I’m sure token costs from Anthropic and OpenAI will drop, but for those who can’t wait, compression is something to explore!
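If you want to experiment with this yourself, the researchers open-sourced an llmlingua Python package. The snippet below reflects my understanding of its interface at the time of writing; treat the parameter names as assumptions and double-check them against the repo before relying on them.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# Illustrative placeholders for the pieces of your prompt.
demonstrations = ["Question: ...\nLet's think step by step\n..."]  # few-shot examples
instruction_text = "Answer the question, showing your reasoning."
question_text = "Question: Sam bought a dozen boxes..."

# Loads a small open-source LM to do the perplexity scoring.
compressor = PromptCompressor()

result = compressor.compress_prompt(
    context=demonstrations,        # in-context demonstrations to compress
    instruction=instruction_text,  # task instruction
    question=question_text,        # the actual question
    target_token=200,              # rough token budget for the compressed prompt
)

print(result["compressed_prompt"])
```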