Prompt Engineering
The process of writing a prompt that gets the best result from an LLM for a given set of requirements.
**Translating from human language to AI language**
Evaluating and Iterating Prompts
Why we need to evaluate prompts
Proprietary LLMs are updated frequently, which can change the behavior of a prompt that previously worked.
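Same prompt != same results: LLM output is non-deterministic, so an identical prompt can return different responses across runs.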
For these reasons, it is good practice to take an iterative approach and re-evaluate prompts regularly.
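To make "same prompt != same results" concrete, the sketch below calls one prompt several times and counts the distinct outputs. The `generate` callable is a hypothetical stand-in for whatever function wraps your LLM call; it is not part of any library, and the fake generator exists only so the example runs without an API key.

```python
from collections import Counter
from typing import Callable
import itertools


def output_distribution(generate: Callable[[str], str], prompt: str, runs: int = 5) -> Counter:
    """Call `generate` `runs` times with the same prompt and count each distinct output."""
    # Strip whitespace so a trailing newline does not count as a different answer.
    return Counter(generate(prompt).strip() for _ in range(runs))


def agreement_rate(distribution: Counter) -> float:
    """Fraction of runs that produced the most common output (1.0 = fully consistent)."""
    total = sum(distribution.values())
    if total == 0:
        return 0.0
    return distribution.most_common(1)[0][1] / total


# Hypothetical stand-in for a real LLM call: cycles through canned answers.
_canned = itertools.cycle(["Pfizer | 1849", "Pfizer | 1849", "Pfizer, 1849"])


def fake_generate(prompt: str) -> str:
    return next(_canned)


if __name__ == "__main__":
    dist = output_distribution(fake_generate, "Name one pharmaceutical company.", runs=6)
    print(dist)
    print(f"agreement: {agreement_rate(dist):.2f}")
```

A low agreement rate is not necessarily a problem, but it signals that a single successful run says little about how the prompt behaves in general.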
Evaluation Methods
LLM as Judge
We can pass the response of one LLM to another LLM and have it evaluate that response. For example, let's say we prompt a model to generate a table of the 5 most popular pharmaceutical companies.
prompt_generation = """
Generate a table with the 5 most popular pharmaceutical companies and their foundation years.
The response should include only the table, with no additional text.
Use the following example format:
---
Company | Foundation Year
Microsoft | 1975
Apple | 1976
Google | 1998
Amazon | 1994
Meta | 2004
---"""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Making the API call
response = client.responses.create(
 model="gpt-5",
 instructions="You are a helpful assistant.",
 input=prompt_generation,
 reasoning={'effort':'minimal'},
)
generated_response = response.output_text
print(generated_response)
It gives this response:
Company | Foundation Year
Pfizer | 1849
Johnson & Johnson | 1886
Roche | 1896
Novartis | 1996
Merck & Co. | 1891
You can then use another LLM call as a judge to check that the table has the expected format:
prompt_check_table_new = """Your task is to verify if a given table matches the exact format and structure of a provided example table.
Here's an example of the format that the table should have:
---
Company | Foundation Year
Microsoft | 1975
Apple | 1976
Google | 1998
Amazon | 1994
Meta | 2004
---
Table to Check:
{table_to_check}
Instructions:
- The table to check should match the structure, headers, and format of the Example Table exactly.
- The column names must be "Company" and "Foundation Year".
- The values in each row should have the company names and their corresponding foundation years.
- If the given table matches the example table in all these aspects, write "Yes".
- Write "No" if there are any differences in structure or headers, or if any row is missing a company name or foundation year. The companies and years themselves do not need to match the example.
Only respond with "Yes" or "No".
"""
formatted_input = prompt_check_table_new.format(table_to_check=generated_response)
# Using the LLM as a judge to check the format
response = client.responses.create(
 model="gpt-5",
 instructions="You are a strict judge. Evaluate inputs based on the given criteria and provide only the required response.",
 input=formatted_input,
 reasoning={'effort':'minimal'},
)
print(response.output_text)
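The judge replies in free text, so it helps to convert its answer into a boolean before using it in automated checks. A minimal sketch, assuming the judge was told to answer only "Yes" or "No"; `parse_verdict` is a hypothetical helper, not part of the OpenAI SDK. Anything other than a clean Yes/No raises an error, so a rambling judge is not silently counted as a pass.

```python
def parse_verdict(judge_output: str) -> bool:
    """Map a judge reply of "Yes"/"No" to True/False and reject anything else."""
    # Tolerate surrounding whitespace, quotes, a trailing period, and case differences.
    normalized = judge_output.strip().strip("\"'").rstrip(".").strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"Unexpected judge output: {judge_output!r}")
```

With the judge call above, this would be used as `passed = parse_verdict(response.output_text)`.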
A single check is not enough. Evaluate the prompt across different use cases, and repeat runs at varying temperature settings (where the model exposes that parameter) to catch edge cases.
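One way to organize such an evaluation is a small harness that runs each test case several times and reports a pass rate per case. The sketch below assumes two hypothetical callables, `generate(prompt)` and `judge(output)`, which in a real setup would wrap the generation and judge calls (and any temperature setting). The stubs here only make the example runnable without an API key.

```python
from typing import Callable, Dict, List


def pass_rates(
    cases: Dict[str, str],
    generate: Callable[[str], str],
    judge: Callable[[str], bool],
    runs_per_case: int = 3,
) -> Dict[str, float]:
    """Run every named prompt `runs_per_case` times and return the fraction judged as passing."""
    if runs_per_case < 1:
        raise ValueError("runs_per_case must be at least 1")
    results: Dict[str, float] = {}
    for name, prompt in cases.items():
        verdicts: List[bool] = [judge(generate(prompt)) for _ in range(runs_per_case)]
        results[name] = sum(verdicts) / runs_per_case
    return results


# Stubs for illustration only (assumptions, not real model calls).
def stub_generate(prompt: str) -> str:
    # Pretend the model returns nothing when asked for zero rows.
    return "" if "0 " in prompt else "Company | Foundation Year\nPfizer | 1849"


def stub_judge(output: str) -> bool:
    return output.startswith("Company | Foundation Year")


if __name__ == "__main__":
    cases = {
        "typical": "Generate a table with the 5 most popular pharmaceutical companies.",
        "edge_zero_rows": "Generate a table with the 0 most popular pharmaceutical companies.",
    }
    for name, rate in pass_rates(cases, stub_generate, stub_judge).items():
        print(f"{name}: {rate:.0%}")
```

Reporting a rate per case, rather than one overall number, makes it easier to see which edge case a prompt change broke.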
Version Management
Another best practice in evaluating and iterating prompts is maintaining a version history of all the prompts you use. By keeping track of each version, you can easily revert to an earlier version if a new one proves less effective in specific cases.
A straightforward method for versioning prompts is using Git, a version control system. With Git, you can track changes to your prompts, annotate why each change was made, and roll back to a previous version if needed. However, a downside of this approach is that it doesn’t inherently integrate the evaluations and feedback alongside the prompt versions.
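One lightweight way to close that gap is to keep evaluation results in a file that is committed alongside the prompt. A minimal sketch, assuming a hypothetical JSON-lines log format (field names such as `prompt_sha` and `pass_rate` are illustrative, not a standard): each record ties a hash of the exact prompt text to a score and a note, so any Git revision carries its own evaluation history.

```python
import hashlib
import json
from pathlib import Path


def prompt_fingerprint(prompt_text: str) -> str:
    """Short, stable identifier for an exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def record_evaluation(log_path: Path, prompt_text: str, pass_rate: float, note: str) -> dict:
    """Append one evaluation record to a JSON-lines file and return it."""
    record = {
        "prompt_sha": prompt_fingerprint(prompt_text),
        "pass_rate": pass_rate,
        "note": note,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_version(log_path: Path) -> dict:
    """Return the logged record with the highest pass rate."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return max(records, key=lambda r: r["pass_rate"])
```

Because the log is plain text, it diffs cleanly in Git, and the note field can carry the same reasoning you would otherwise put in a commit message.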
There are other tools, such as LangSmith by LangChain, that are built specifically for this purpose.