Prompt Engineering

The process of writing a prompt so that the LLM/AI produces the best result for the given requirements.

**Translating from human language to AI language**

Evaluating and Iterating Prompts

Why we need to evaluate prompts

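Proprietary LLMs are updated frequently, which can change the behavior of a previously working prompt.

Same prompt != same results: LLM output is non-deterministic, so the same prompt can return different responses across runs.

For these reasons, it's good practice to take an iterative approach and evaluate prompts regularly.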
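One way to make "same prompt != same results" measurable is to run the same prompt several times and check how often the most common response appears. The sketch below assumes a hypothetical `generate` callable that wraps whatever model call you use; `agreement_rate` and `canned_generate` are illustrative names, not part of any library.

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Return the share of runs that produced the most common response.

    `generate` is a hypothetical wrapper around your model call (an assumption).
    A value of 1.0 means every run returned identical text.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Strip surrounding whitespace so trivial differences are not counted.
    responses = [generate(prompt).strip() for _ in range(runs)]
    _, top_count = Counter(responses).most_common(1)[0]
    return top_count / runs


# Stand-in generator that replays canned answers, so the sketch runs
# without an API key. Replace it with a real model call in practice.
_canned = iter(["A", "A", "B", "A"])


def canned_generate(prompt: str) -> str:
    return next(_canned)


rate = agreement_rate(canned_generate, "any prompt", runs=4)
print(rate)  # 3 of the 4 canned answers are "A"
```

Exact string matching is a deliberately strict comparison; for free-form answers you would probably want a looser notion of "same result".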

Evaluation Methods

LLM as Judge

We can pass the response of one LLM to another LLM, which evaluates whether the response meets the prompt's requirements. For example, let's say we give a prompt asking for the 5 most popular pharmaceutical companies.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

prompt_generation = """
Generate a table with the 5 most popular pharmaceutical companies and their foundation years.
The response should include only the table, with no additional text.
Use the following example format:
---
Company | Foundation Year
Microsoft | 1975
Apple | 1976
Google | 1998
Amazon | 1994
Meta | 2004
---"""


# Making the API call
response = client.responses.create(
    model="gpt-5",
    instructions="You are a helpful assistant.",
    input=prompt_generation,
    reasoning={"effort": "minimal"},
)

# Plain-text output of the model
generated_response = response.output_text
print(generated_response)

It gives this response:


Company | Foundation Year
Pfizer | 1849
Johnson & Johnson | 1886
Roche | 1896
Novartis | 1996
Merck & Co. | 1891


You can then use another LLM as a judge to check whether the response follows the required format:

prompt_check_table_new = """Your task is to verify if a given table matches the exact format and structure of a provided example table.

Here's an example of the format that the table should have:
---
Company | Foundation Year
Microsoft | 1975
Apple | 1976
Google | 1998
Amazon | 1994
Meta | 2004
---

Table to Check:
{table_to_check}

Instructions:
- The table to check should match the structure, headers, and format of the Example Table exactly.
- The column names must be "Company" and "Foundation Year".
- The values in each row should have the company names and their corresponding foundation years.
- If the given table matches the example table in all these aspects, write "Yes".
- Write "No" if there are any differences in structure, headers, or if any company/year is missing or incorrect.

Only respond with "Yes" or "No".

"""

formatted_input = prompt_check_table_new.format(table_to_check=generated_response)


# Using an LLM as a judge to check the format
response = client.responses.create(
    model="gpt-5",
    instructions="You are a strict judge. Evaluate inputs based on the given criteria and provide only the required response",
    input=formatted_input,
    reasoning={"effort": "minimal"},
)
print(response.output_text)
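The judge is asked to answer only "Yes" or "No", but nothing guarantees that it will. A minimal sketch of turning the judge's text into a boolean, assuming you would rather fail loudly than count an unexpected answer as a pass (`parse_verdict` is a hypothetical helper, not part of the OpenAI SDK):

```python
def parse_verdict(judge_output: str) -> bool:
    """Map the judge's text to True ("Yes") or False ("No").

    Raises ValueError for anything else, so a judge that ignores the
    'Only respond with "Yes" or "No"' instruction is not silently
    counted as a pass or a fail.
    """
    # Tolerate surrounding whitespace, quotes, and a trailing period.
    normalized = judge_output.strip().strip(".\"'").lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"Unexpected judge output: {judge_output!r}")


print(parse_verdict("Yes"))
print(parse_verdict(' "No." '))
```

With the code above, this would be called as `parse_verdict(response.output_text)`.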

You should evaluate the prompt across different use cases and try to catch edge cases, for example by varying the temperature.
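A rough sketch of such an evaluation loop: every test prompt is run at every temperature, and the pass rate per temperature is reported. Both `generate` and `judge` are hypothetical callables standing in for real model calls, and the stubs below only imitate a model that adds extra text at high temperature. Note that not every model exposes a temperature setting, so treat that parameter as an assumption about your setup.

```python
from typing import Callable


def pass_rates(
    generate: Callable[[str, float], str],
    judge: Callable[[str], bool],
    prompts: list[str],
    temperatures: list[float],
) -> dict[float, float]:
    """For each temperature, return the fraction of prompts whose output passed the judge."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    results: dict[float, float] = {}
    for temperature in temperatures:
        passed = sum(judge(generate(prompt, temperature)) for prompt in prompts)
        results[temperature] = passed / len(prompts)
    return results


def stub_generate(prompt: str, temperature: float) -> str:
    # Pretend that a high temperature makes the model add text before the table.
    table = "Company | Foundation Year\nPfizer | 1849"
    return table if temperature < 1.0 else "Sure! Here is the table:\n" + table


def stub_judge(output: str) -> bool:
    # Passes only if the very first line is the expected header.
    return output.splitlines()[0] == "Company | Foundation Year"


rates = pass_rates(stub_generate, stub_judge, ["case 1", "case 2"], [0.0, 1.0])
print(rates)
```

Passing the model call and the judge in as functions keeps the loop testable without an API key; in practice `judge` could itself be an LLM call.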

Version Management

Another best practice in evaluating and iterating prompts is maintaining a version history of all the prompts you use. By keeping track of each version, you can easily revert to an earlier version if a new one proves less effective in specific cases.

A straightforward method for versioning prompts is using Git, a version control system. With Git, you can track changes to your prompts, annotate why each change was made, and roll back to a previous version if needed. However, a downside of this approach is that it doesn’t inherently integrate the evaluations and feedback alongside the prompt versions.
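One possible way to narrow that gap while staying with Git is to keep a small log file next to the prompts that stores each version together with its evaluation result. The sketch below is an illustration under assumptions: `PromptVersion`, `best_version`, and the pass rates are made up for the example and do not come from any tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PromptVersion:
    prompt: str
    note: str          # why the change was made
    pass_rate: float   # evaluation result recorded with this version
    version_id: str = field(init=False)

    def __post_init__(self) -> None:
        # Short content hash: the same prompt text always gets the same id.
        digest = hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()
        self.version_id = digest[:8]


def best_version(history: list[PromptVersion]) -> PromptVersion:
    """Return the version with the highest recorded pass rate (first wins ties)."""
    return max(history, key=lambda version: version.pass_rate)


# Illustrative history; the pass rates are invented numbers.
history = [
    PromptVersion("Generate a table of 5 companies.", "initial draft", 0.6),
    PromptVersion(
        "Generate a table of 5 companies. Output only the table.",
        "forbid extra text",
        0.9,
    ),
]

# One JSON object per line; this text could be written to a file and
# committed to Git alongside the prompts.
log_lines = [json.dumps(asdict(version)) for version in history]
print(best_version(history).note)
```

Because the evaluation score travels with the prompt text, rolling back to "the best version so far" becomes a lookup rather than a search through commit messages.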

There are other tools, such as LangSmith by LangChain, that are built specifically for this purpose.