Evaluator-Optimizer LLM Workflow: A Pattern for Self-Improving AI Systems

Article: Evaluator-Optimizer Workflow, Jupyter notebook

What the article covers

The Anthropic Cookbook provides code and guides designed to help developers build with Claude, offering copyable code snippets that you can easily integrate into your own projects.

“Building Effective Agents Cookbook” - Reference implementation for Building Effective Agents by Erik Schluntz and Barry Zhang.

Evaluator-Optimizer Workflow: In this workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

This workflow is particularly effective when we have:

  • Clear evaluation criteria
  • Value from iterative refinement

The two signs of good fit are:

  • LLM responses can be demonstrably improved when feedback is provided
  • The LLM can provide meaningful feedback itself

My Thoughts

Context

I’ve been diving into the Anthropic Cookbook lately, exploring different patterns for building AI systems.

The Cookbook itself is a goldmine of practical implementations, providing ready-to-use code for building with Claude.

The Evaluator-Optimizer pattern caught my attention because it represents something I’ve been thinking about: how do we move from static prompt engineering to truly dynamic AI systems?

What makes this particular pattern fascinating is how it approaches the challenge of creating self-improving AI systems.

Key Insight

The key insight here is elegant in its simplicity: use one LLM to generate solutions and another to evaluate them in a continuous feedback loop until the success criteria are met.

This separation of concerns is more powerful than it might first appear.

  1. Pattern Recognition: This represents a shift from static to dynamic AI implementations
  2. Strategic Value: Enables scalable, self-improving systems while maintaining control
  3. Implementation Path: Start simple, evolve with confidence
  4. Future Direction: Framework for building learning systems, not just response systems

Let’s take a look at the key components needed to make the Evaluator-Optimizer LLM workflow work; a minimal sketch of how they fit together follows the list.

Key Components

  1. Generator: Creates solutions based on the task, any initial examples, and, on later iterations, the evaluator’s feedback
  2. Evaluator: Assesses solutions against explicit criteria
  3. Feedback Loop: Enables iterative refinement
  4. Success Criteria: Defines when to exit the loop

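To make these components concrete, here is a minimal sketch of how they might fit together. This is not the Cookbook’s exact implementation: the helper names (`llm`, `generate`, `evaluate`, `run_loop`), the model string, and the prompt wording are illustrative assumptions, and the code assumes the official `anthropic` Python SDK with an API key available in the environment.

```python
# Minimal sketch of the Evaluator-Optimizer loop (not the Cookbook's exact code).
# Assumes the official `anthropic` Python SDK and ANTHROPIC_API_KEY in the environment;
# the model name and helper names are illustrative choices, not fixed requirements.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # assumed model name; swap in whichever model you use


def llm(prompt: str) -> str:
    """Single LLM call returning the text of the first content block."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def generate(task: str, feedback: str = "") -> str:
    """Generator: creates a solution, optionally conditioned on prior feedback."""
    prompt = f"Solve the following task:\n{task}"
    if feedback:
        prompt += (
            f"\n\nA reviewer gave this feedback on your previous attempt:\n{feedback}"
            "\nRevise accordingly."
        )
    return llm(prompt)


def evaluate(task: str, solution: str) -> tuple[bool, str]:
    """Evaluator: assesses the solution against explicit criteria, returns (passed, feedback)."""
    verdict = llm(
        f"Task:\n{task}\n\nProposed solution:\n{solution}\n\n"
        "Evaluate the solution against the task requirements. Do NOT write your own solution. "
        "Reply with 'PASS' on the first line if it fully meets the requirements, "
        "otherwise 'FAIL' followed by specific, actionable feedback."
    )
    return verdict.strip().upper().startswith("PASS"), verdict


def run_loop(task: str, max_iterations: int = 3) -> str:
    """Feedback loop: iterate until the success criteria pass or the iteration budget is spent."""
    feedback, solution = "", ""
    for _ in range(max_iterations):
        solution = generate(task, feedback)
        passed, feedback = evaluate(task, solution)
        if passed:  # success criteria met: exit the loop
            return solution
    return solution  # budget exhausted: return the latest attempt
```

For example, `run_loop("Write a 50-word product description for a reusable water bottle")` would alternate generation and evaluation until the evaluator replies PASS or three attempts have been spent.
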
Though the components seem straightforward, implementing them effectively requires careful consideration of several key technical factors.

Technical Implementation Framework

The Evaluator-Optimizer pattern requires careful attention to both immediate technical requirements and long-term operational considerations.

Here’s a comprehensive framework that covers key implementation aspects:

  1. Rigorous loop-run and exit criteria
  • Implement circuit breakers to prevent infinite loops and resource exhaustion (each LLM generation and evaluation costs time and money); a hardened-loop sketch follows this list
  • Keep generation and evaluation prompts explicitly separated
  2. Clear generation guidelines
  • The LLM needs context for what it should generate while being encouraged to be creative
  • Examples can be given up front (no examples = zero-shot, 1 example = one-shot, 2+ examples = few-shot)
  • Structure generator prompts to encourage exploration within bounded constraints
  3. Clear evaluation criteria
  • The LLM needs context for how to evaluate (and what not to do, such as trying to generate a solution itself)
  • Examples can be given up front (no examples = zero-shot, 1 example = one-shot, 2+ examples = few-shot)
  • Evaluation criteria must be both machine-readable and business-relevant
  4. Clear success conditions and failure recovery
  • Enable graceful degradation (stop if no solution is found, or keep looping until one is)
  • Embed quality assurance into the system architecture rather than bolting it on afterwards (this can be human oversight and intervention or autonomous LLM oversight and intervention)
  • Fall back to simpler responses when optimal solutions aren’t found
  5. Measuring success
  • Monitoring and observability to ensure system quality
  • Memory management of the chain of thought
  • Keeping track of experiments (both immediate improvements and long-term learning patterns)
  6. Long-term evaluation
  • Are generated solutions similar enough that answers can be cached?
  • As business goals evolve, how do the solutions evolve?
  • Does the evaluator hallucinate success and/or failure (Type I and Type II errors)?

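Several of these points can be expressed directly in code. The sketch below is one illustrative way to harden the basic loop: a few-shot generator prompt, a machine-readable JSON verdict from the evaluator, circuit breakers on both iterations and estimated cost, and a graceful fallback to the best attempt so far. The budget figures, per-call cost, field names, and model string are assumptions, not recommendations.

```python
# Hardened Evaluator-Optimizer loop sketch: circuit breakers, machine-readable
# evaluation, and graceful fallback. Helper names, budget numbers, and the model
# string are illustrative assumptions.
import json

import anthropic

client = anthropic.Anthropic()
MAX_ITERATIONS = 5             # circuit breaker: never loop forever
MAX_ESTIMATED_COST_USD = 0.50  # circuit breaker: assumed per-task budget


def llm(prompt: str) -> str:
    """Single LLM call returning the text of the first content block."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def build_generator_prompt(task: str, examples: list[str], feedback: str) -> str:
    """Few-shot generator prompt: zero-shot if `examples` is empty, few-shot otherwise."""
    prompt = f"Task:\n{task}\n"
    if examples:
        prompt += "\nExamples of acceptable solutions:\n" + "\n---\n".join(examples) + "\n"
    if feedback:
        prompt += f"\nReviewer feedback on the previous attempt:\n{feedback}\nRevise accordingly.\n"
    return prompt + "\nBe creative, but stay within the task constraints."


def evaluate_structured(task: str, solution: str) -> dict:
    """Evaluator returning a machine-readable verdict: PASS/FAIL, a 0-10 score, and feedback."""
    raw = llm(
        f"Task:\n{task}\n\nProposed solution:\n{solution}\n\n"
        "Evaluate the solution against the task requirements only; do not write your own solution. "
        'Respond with JSON only: {"verdict": "PASS" or "FAIL", "score": 0-10, "feedback": "..."}'
    )
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Evaluators fail too: treat unparseable output as a failed, zero-score evaluation.
        return {"verdict": "FAIL", "score": 0, "feedback": "Evaluator returned non-JSON output."}


def run_hardened_loop(task: str, examples: list[str] | None = None,
                      cost_per_call_usd: float = 0.02) -> dict:
    """Loop with iteration and cost circuit breakers, falling back to the best attempt so far."""
    examples, feedback, spent, completed = examples or [], "", 0.0, 0
    best = {"solution": "", "score": -1}
    for _ in range(MAX_ITERATIONS):
        if spent + 2 * cost_per_call_usd > MAX_ESTIMATED_COST_USD:
            break  # cost circuit breaker: the next generate+evaluate round would exceed the budget
        solution = llm(build_generator_prompt(task, examples, feedback))
        result = evaluate_structured(task, solution)
        spent += 2 * cost_per_call_usd
        completed += 1
        if result.get("score", 0) > best["score"]:
            best = {"solution": solution, "score": result.get("score", 0)}
        if result.get("verdict") == "PASS":  # success criteria met: exit the loop
            return {"solution": solution, "status": "passed",
                    "iterations": completed, "estimated_cost_usd": spent}
        feedback = result.get("feedback", "")
    # Graceful degradation: nothing passed, so surface the best attempt and flag it for review.
    return {"solution": best["solution"], "status": "needs_human_review",
            "iterations": completed, "estimated_cost_usd": spent}
```

Returning a status field rather than raising on failure keeps the degradation path explicit and gives downstream code, or a human reviewer, a natural checkpoint.
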
While these technical considerations are crucial, we must also carefully evaluate the operational costs (a rough back-of-the-envelope estimate follows the list):

  1. Cost Evaluation
  • Monetary: Cost of running a loop of LLM API Calls
  • Time: Latency of letting the loop run its course (the fast and slow improvement cycles)
  • Failure: How costly is a non-successful loop?
    • Direct costs (wasted compute)
    • Indirect costs (incorrect solutions making it to production)
    • Opportunity costs (time spent in failed loops)

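To make the monetary dimension concrete, here is a back-of-the-envelope estimate of what a single loop costs. The token counts and per-million-token prices are placeholder assumptions; substitute your model’s actual pricing and your observed usage.

```python
# Back-of-the-envelope loop cost estimate. All numbers are placeholder assumptions;
# substitute your model's real pricing and your observed token counts.
INPUT_PRICE_PER_MTOK = 3.00    # assumed $ per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed $ per million output tokens


def loop_cost_usd(iterations: int, gen_in: int = 1_000, gen_out: int = 500,
                  eval_in: int = 1_500, eval_out: int = 200) -> float:
    """Cost of `iterations` rounds, each round = one generation call + one evaluation call."""
    per_round_in = gen_in + eval_in
    per_round_out = gen_out + eval_out
    return iterations * (per_round_in * INPUT_PRICE_PER_MTOK
                         + per_round_out * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# Example: a 4-iteration loop with the assumed token counts costs roughly
# 4 * (2500 * $3 + 700 * $15) / 1e6 ≈ $0.072, before retries or caching.
print(f"{loop_cost_usd(4):.3f}")
```

Running this with your real token counts makes the trade-off between extra iterations and a cheaper single-shot prompt easy to quantify.
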
The real power of this pattern becomes apparent when we consider its broader applications.

Strategic Implications

  1. Technical Evolution:
  • Self-improving systems are now feasible at the application level
  • Shift from static to dynamic AI implementations
  • Decreased focus on implementation and increased focus on goal description and achievement
  2. Governance & Control:
  • Autonomous quality control with reduced human oversight
  • Natural checkpoints for audit trails and governance
  • Progressive automation pathways (start with human oversight and remove it slowly as trust in the solution grows)
  3. Business Matters:
  • This pattern enables AI systems that can improve autonomously while maintaining alignment with business objectives
  • Organizations can implement progressive automation strategies that start simple and grow in sophistication
  • The explicit separation of generation and evaluation creates natural points for human oversight and intervention
  • Teams can focus on defining success criteria rather than perfecting prompts, leading to more scalable AI implementations
  • The pattern provides a framework for balancing innovation (generator) with consistency (evaluator)

Key Takeaways for AI Engineers

  • Quality assurance can be embedded directly into the system architecture rather than applied as an afterthought
  • The evaluator-optimizer pattern can be applied at multiple scales ranging from single responses to entire conversation flows
  • System design should account for both “fast” (single iteration) and “slow” (multiple iteration) improvement cycles
  • Engineers should focus on designing robust evaluation criteria rather than perfect generation prompts
  • The pattern enables graceful degradation so that systems can fall back to simpler responses when optimal solutions aren’t found

Forward-Looking Applications

With this implementation framework in mind, let’s explore some practical applications of this pattern:

Example Use Cases

  • Autonomous code review and improvement
  • Content generation with quality guarantees
  • Self-improving chatbot responses
  • Automated documentation refinement

Integration Opportunities

  • CI/CD pipelines for LLM applications
  • Quality assurance automation
  • Progressive system improvement
  • Governance and audit trails

Personal Notes

When I recently implemented this pattern, I found that the key to success wasn’t in perfecting the prompts but in designing comprehensive evaluation criteria.

This matches a broader pattern in AI engineering: focus on clearly defining success and letting the system explore paths to achieve it.