Pre-prompt checklist: Success criteria, test cases, and evals
Everything you need to do before writing your first prompt.
When working with teams on LLM features or products, I often found that they jumped right into writing prompts. While prompt engineering is obviously a key part of delivering a great LLM experience, diving straight into it can actually make it harder to succeed.
After all, how will you know if the prompt is working well if you haven't defined what success looks like?
This post covers what you should do before writing your first prompt: defining success criteria, developing test cases, and running evaluations.
Defining Success Criteria
When starting with a new LLM project, it’s tempting to dive straight into prompt engineering. But without defining what success looks like, it's going to be hard or impossible to measure progress or know when you've actually hit the mark. Success criteria are the benchmarks that guide your project, ensuring you stay aligned with your goals.
The best way to define solid success criteria is by using the SMART framework—making them Specific, Measurable, Achievable, Relevant, and Time-bound.
Specific: Rather than "Improve chatbot experiences," aim for "Increase the accuracy of chatbot responses to customer inquiries by 20%."
Measurable: Favor quantitative metrics over qualitative ones.
Achievable: Within the capabilities of current models.
Relevant: Tied to the value your application's users actually get.
Time-bound: By X date, we want to be at Y level of success.
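It can also help to capture criteria like these in a machine-readable form that your evals can check against later. Here is a minimal sketch in Python; the metric names, baselines, targets, and deadlines are all illustrative assumptions, not recommendations.

```python
# Hypothetical success criteria for a customer-support chatbot, written as data
# so that an eval harness can check measured results against them automatically.
SUCCESS_CRITERIA = [
    {
        "name": "response_accuracy",
        "description": "Accuracy of chatbot responses to customer inquiries",
        "baseline": 0.70,          # where we are today (illustrative)
        "target": 0.90,            # measurable, achievable goal
        "higher_is_better": True,
        "deadline": "end of Q2",   # time-bound
    },
    {
        "name": "p95_latency_seconds",
        "description": "95th percentile end-to-end response time",
        "baseline": 6.0,
        "target": 3.0,
        "higher_is_better": False,
        "deadline": "end of Q2",
    },
]

def meets_target(criterion: dict, measured: float) -> bool:
    """Return True if a measured value satisfies the criterion's target."""
    if criterion["higher_is_better"]:
        return measured >= criterion["target"]
    return measured <= criterion["target"]
```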
Clear success criteria are important because they allow you to:
Track progress more effectively and objectively.
Set clear goals for your team.
Measure whether a prompt is performing as desired.
By establishing these early on, you not only save time but also give everyone involved in the project a clear direction. And when the direction is clear, it's much easier for everyone to contribute.
Examples of success criteria
Here are a few examples of good and bad success criteria, including an example from the prompt generator we recently launched at PromptHub.
PromptHub prompt generator
Chatbot to help internal users make order requests to vendors
More examples are in our full blog post here.
Other common success criteria
Error Rate
User Satisfaction
Accuracy
Consistency
Task Fidelity
Relevance and Coherence
Context Utilization
Tone and Style
Privacy Preservation
Latency
Cost
Developing test cases
After establishing your success criteria, the next step is to create test cases that simulate real-world scenarios for your model. These test cases are designed to validate whether the model performs well under various conditions and edge cases.
Test cases should include a range of inputs:
Standard Scenarios: Common “happy path” use cases that your LLM will handle regularly.
Edge Cases: Unusual or challenging inputs, such as ambiguous or overly complex requests, to stress-test the model and cover your bases.
Negative Tests: Scenarios where the model is expected to fail gracefully, like incomplete or irrelevant data.
For example, let’s say you're building a customer service chatbot:
A standard test case might involve handling a simple return request.
An edge case could be a long, rambling question full of irrelevant details.
A negative test might be a sarcastic comment or a nonsensical query.
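To make that concrete, here is a minimal sketch of what a small test suite for such a chatbot could look like in Python. The inputs and expected behaviors are purely illustrative.

```python
# Illustrative test cases for a hypothetical customer service chatbot.
# Each case records its category, the user input, and what a good response should do.
TEST_CASES = [
    {   # standard "happy path" scenario
        "category": "standard",
        "input": "I'd like to return the shoes I bought last week, order #12345.",
        "expected_behavior": "Explains the return policy and starts a return for order 12345.",
    },
    {   # edge case: long, rambling, full of irrelevant detail
        "category": "edge",
        "input": (
            "So my cousin visited last month and we went hiking, anyway the boots "
            "I ordered, or maybe they were sneakers, showed up late, and also I just "
            "moved apartments, so can I get a refund or an exchange or something?"
        ),
        "expected_behavior": "Asks a clarifying question instead of guessing which order is meant.",
    },
    {   # negative test: sarcastic / nonsensical query
        "category": "negative",
        "input": "Oh great, another chatbot. Do you even lift?",
        "expected_behavior": "Stays polite and redirects the conversation to supported topics.",
    },
]
```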
The key with test cases is to continuously expand them based on real-world inputs you encounter in production. This approach helps identify where your prompts are failing and provides a path to iterate and re-test, ensuring they succeed in the future.
You can use LLMs to help generate test cases! Using a PromptHub form we built, you can generate test cases that cover happy paths and edge cases. Substack doesn't like embedded code, so you can try it out by opening it in a new tab via this link.
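If you'd rather script this yourself, the underlying idea is simple: describe the feature to a model and ask it for structured test cases. Below is a rough sketch that assumes the official openai Python package and an illustrative model name; it is not how the PromptHub form works under the hood.

```python
import json
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_test_cases(feature_description: str, n: int = 5) -> list[dict]:
    """Ask a model to propose happy-path, edge, and negative test cases."""
    prompt = (
        f"You are helping design an eval suite for: {feature_description}\n"
        f"Propose {n} test cases as a JSON array. Each item must have the keys "
        '"category" (standard, edge, or negative), "input", and "expected_behavior". '
        "Return only the JSON array."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice you may need to strip markdown fences before parsing.
    return json.loads(response.choices[0].message.content)

cases = generate_test_cases("a chatbot that handles customer return requests")
```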
Running Evaluations
Once you’ve defined success criteria and developed test cases, the next step is evaluating your LLM's performance. Evaluations (or "evals") help measure how well the model meets your success criteria and highlight areas for improvement.
Every evaluation consists of four key parts:
Input Prompt: The prompt that’s fed to the model.
Model Output: The response generated by the model.
Golden Answer: The ideal or expected output.
Score: A numerical grade based on how well the model output matches the golden answer.
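Putting those four parts together, a single eval can be as small as a record plus a scoring function. Here is a minimal sketch using exact-match scoring, which only suits tasks with one clearly correct answer.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    input_prompt: str   # the prompt fed to the model
    model_output: str   # the response the model generated
    golden_answer: str  # the ideal or expected output
    score: float        # numerical grade, here 1.0 or 0.0

def exact_match_score(model_output: str, golden_answer: str) -> float:
    """Code-based grading: 1.0 if the normalized strings match, otherwise 0.0."""
    return 1.0 if model_output.strip().lower() == golden_answer.strip().lower() else 0.0

result = EvalResult(
    input_prompt="What is the capital of France? Answer with one word.",
    model_output="Paris",
    golden_answer="Paris",
    score=exact_match_score("Paris", "Paris"),
)
```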
Types of Evaluation Methods
Code-Based Grading: Fast and reliable - best for tasks with clear right or wrong answers.
Human Grading: The most flexible but time-consuming - best for subjective and/or creative tasks that require human judgment.
LLM-Based Grading: Automated and scalable - ideal for tasks like tone, coherence, or context evaluation. For more on using LLMs as evaluators, check out our other post: Can You Use LLMs as Evaluators? An LLM Evaluation Framework
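For the LLM-based option, the grader is itself a prompt: you show a judge model the output and the golden answer (or a rubric) and ask for a score. A minimal sketch, again assuming an OpenAI-style client and an illustrative rubric:

```python
from openai import OpenAI

client = OpenAI()

def llm_grade(model_output: str, golden_answer: str) -> float:
    """LLM-based grading: ask a judge model to score the output against a reference."""
    judge_prompt = (
        "You are grading a model's answer against a reference answer.\n"
        f"Reference answer:\n{golden_answer}\n\n"
        f"Model answer:\n{model_output}\n\n"
        "On a scale of 0 to 10, how well does the model answer match the reference "
        "in meaning, tone, and coherence? Respond with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return float(response.choices[0].message.content.strip()) / 10.0
```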
Example evaluations
You can check out a variety of evaluation types, paired with specific use cases and grading methods in our full blog post here.
Conclusion
Building effective prompts goes beyond just writing and iterating. To truly set your LLM project up for success, it's essential to first define clear success criteria, develop test cases that stress-test the model, and run thorough evaluations to measure performance.
This structured approach will help make sure your LLM is aligned with the goals you've set and can handle what the real world throws at it.
Success with LLMs is a continuous process: things change quickly, so you should regularly revisit your criteria, test cases, and evals.