# Optimize Outputs

> AI-powered prompt optimization that learns from failures and automatically rewrites prompts, with no code required. In the example on this page, a tool at 5% accuracy reaches 100% on its retested cases.

Optimize Outputs uses AI to analyze failed task outputs, identify patterns, and automatically rewrite prompts for improved accuracy, turning underperforming tools into high-accuracy ones.

<iframe src="https://app.supademo.com/embed/cmhofy0rw0suvck4d4nmf39r1" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style={{width: "100%", height: "450px"}} />

## Understanding Optimize Outputs

Watch AI agents learn from failures and fix themselves automatically, with no code required.

**Learning Hub** - Tracks tool performance across all workflow nodes and identifies underperforming tools that fall below accuracy thresholds

**Feedback-Driven Optimization** - You mark what went wrong in failed outputs, and the AI uses those examples to rewrite prompts with better context

**Automatic Prompt Rewriting** - The AI analyzes failures, identifies patterns, and rewrites prompts with clearer instructions and structured logic

**Validation Testing** - Automatically retests new prompts against the same failed cases to verify improvement before deployment

**Key Benefit**: In the example on this page, giving the AI feedback on what went wrong takes a tool from 5% accuracy to 100% on the retested cases, with roughly 30 seconds of optimization time.

## Accessing Learning Hub

Monitor tool performance and identify optimization opportunities across your agent workflows.

<Steps>
  <Step title="Navigate to Learning">
    Open your agent and click "Learning" in the left sidebar to open the Learning Hub dashboard.
  </Step>

  <Step title="Review Tool Performance">
    View accuracy scores for all tools in your workflow. Tools below the 90% threshold are highlighted as needing optimization.
  </Step>

  <Step title="Identify Underperforming Tools">
    Locate tools with low accuracy scores (e.g., "Debt Reminder Tier Classifier: 5%"). Compare them with high-performing tools (e.g., "Email Content Classifier: 100%") to see what a well-configured tool can achieve.
  </Step>

  <Step title="Access Optimization">
    Click the "Optimize" button next to an underperforming tool to begin the improvement process.
  </Step>
</Steps>

<AccordionGroup>
  <Accordion title="Performance Metrics">
    **Accuracy Score:**

    * Percentage of tool outputs meeting evaluation criteria
    * Based on [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework) validation
    * Updated with each task execution

    **Accuracy Thresholds:**

    * 90-100%: Excellent performance
    * 70-89%: Good, minor optimization beneficial
    * 50-69%: Moderate issues, optimization recommended
    * Below 50%: Significant problems, optimization critical

    **Execution Count:**

    * Number of times tool has run
    * Larger sample size = more reliable accuracy metric
    * Minimum 5-10 executions for meaningful optimization
  </Accordion>

  <Accordion title="Comparing Tool Performance">
    **High vs Low Performers:**

    * Tools at 100% accuracy show what is achievable when a prompt is configured correctly
    * Low accuracy tools (5-20%) usually indicate prompt issues, not capability limits
    * Similar tasks performing differently usually point to a difference in prompt quality

    **Optimization Priority:**

    1. Critical workflow tools with \<50% accuracy
    2. High-volume tools with 50-89% accuracy
    3. Recently added tools needing calibration
    4. Tools with declining accuracy trends
  </Accordion>
</AccordionGroup>
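
The accuracy bands and the execution-count guidance above can be written down as a small lookup. This is a minimal sketch for illustration only: `ToolStats`, `accuracy_band`, and `needs_optimization` are hypothetical names rather than part of any Beam API, and the pass counts are made-up numbers chosen to reproduce the 5% and 100% scores mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    """Hypothetical per-tool summary; not a Beam API object."""
    name: str
    passed: int      # executions that met the evaluation criteria
    executions: int  # total executions recorded for the tool

    @property
    def accuracy(self) -> float:
        """Accuracy as a percentage of executions that passed."""
        if self.executions == 0:
            return 0.0
        return 100.0 * self.passed / self.executions


def accuracy_band(accuracy: float) -> str:
    """Map an accuracy percentage to the bands listed above."""
    if accuracy >= 90:
        return "excellent"
    if accuracy >= 70:
        return "good"
    if accuracy >= 50:
        return "moderate"
    return "critical"


def needs_optimization(stats: ToolStats, min_executions: int = 5) -> bool:
    """Flag tools under the 90% threshold once enough runs exist."""
    return stats.executions >= min_executions and stats.accuracy < 90


# Illustrative numbers only.
tools = [
    ToolStats("Debt Reminder Tier Classifier", passed=1, executions=20),
    ToolStats("Email Content Classifier", passed=20, executions=20),
]
for tool in tools:
    print(tool.name, accuracy_band(tool.accuracy), needs_optimization(tool))
```

The `min_executions` guard reflects the point that a score built on very few runs is not yet a reliable basis for optimization.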

## Reviewing Failed Outputs

Examine failed task executions to understand what went wrong before optimization.

<Frame>
  <img src="https://mintcdn.com/beamai/YDqllBKSmU7636m6/04-observability-analytics/optimize-outputs/CDYgW1i62rOsgPJd3-8s4.jpg?fit=max&auto=format&n=YDqllBKSmU7636m6&q=85&s=31d8cbdaf237b893fb63437e4b575ede" alt="Optimize tool modal showing list of failed outputs all at 0% accuracy for Debt Reminder Tier Classifier" width="2560" height="1440" data-path="04-observability-analytics/optimize-outputs/CDYgW1i62rOsgPJd3-8s4.jpg" />
</Frame>

**Failed Outputs List:**

* Each row represents a single task execution
* Shows the task ID, description, thumbs up/down rating, accuracy score (0%), and timestamp
* All failures displayed for pattern identification
* Checkbox for selecting outputs to provide feedback

<Steps>
  <Step title="Open Optimization Modal">
    Click the "Optimize" button in the Learning Hub for an underperforming tool to open the feedback interface.
  </Step>

  <Step title="Review Failed Executions">
    Examine the list of failed outputs showing 0% accuracy. Identify common patterns across the failures.

    **Example Pattern:**
    Seven classification attempts, all failed (0% accuracy), all attempting the same task: "Classify tier of debt reminder based on days-past-due, balance, and reminders"
  </Step>

  <Step title="Select Examples for Feedback">
    Choose 3-5 representative failures that cover different scenarios. Including some good outputs alongside the bad ones gives the AI a better learning signal.

    **Pro Tip:** Select diverse failures rather than identical ones so the AI can learn broader patterns
  </Step>
</Steps>

<AccordionGroup>
  <Accordion title="Output Review Criteria">
    **What to Look For:**

    * Consistent error patterns (same mistake repeated)
    * Missing information in outputs
    * Incorrect classifications or extractions
    * Hallucinations (AI making up data not in input)
    * Format issues (wrong structure, missing fields)

    **Select Representative Failures:**

    * Different input scenarios
    * Various error types
    * Edge cases and common cases
    * Recent and older failures
  </Accordion>

  <Accordion title="Filter Options">
    **"Select all good outputs":**

    * Check outputs that were correct
    * Helps AI learn what success looks like
    * Provides positive examples alongside failures

    **"Select all bad outputs":**

    * Quickly select all failed cases
    * Useful when all outputs have same issue
    * Uncheck outliers that failed for different reasons

    **Individual Selection:**

    * Choose specific mix of good and bad
    * Recommended: 60-70% bad, 30-40% good
    * Provides balanced learning signal
  </Accordion>
</AccordionGroup>
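
The recommended mix of roughly 60-70% bad and 30-40% good outputs can be sketched as a small selection helper. The selection itself happens through the checkboxes in the modal; the function below only illustrates the arithmetic, and `pick_feedback_examples` and the task IDs are hypothetical.

```python
import random


def pick_feedback_examples(bad, good, total=5, bad_share=0.6, seed=0):
    """Pick about `bad_share` failed outputs and fill the rest with good ones.

    Returns (selected_bad, selected_good). Fewer items are returned when
    either pool is smaller than its share.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_bad = min(len(bad), round(total * bad_share))
    n_good = min(len(good), total - n_bad)
    return rng.sample(bad, n_bad), rng.sample(good, n_good)


# Hypothetical task IDs.
failed = ["task-101", "task-102", "task-103", "task-104"]
passed = ["task-201", "task-202", "task-203"]
selected_bad, selected_good = pick_feedback_examples(failed, passed)
print(len(selected_bad), "bad,", len(selected_good), "good")  # 3 bad, 2 good
```

Random sampling is only a stand-in here. In practice, choose examples deliberately so they cover different scenarios and error types.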

## Providing Feedback

Mark what went wrong in failed outputs so AI can learn and improve prompts.

<Frame>
  <img src="https://mintcdn.com/beamai/tUbNiSLV6K1eNRa9/04-observability-analytics/optimize-outputs/vTC6UsPxNTyISRa9nxGeL.jpg?fit=max&auto=format&n=tUbNiSLV6K1eNRa9&q=85&s=9b21d70db8d2c885ffbbe2d8f155b67b" alt="Feedback interface showing selected outputs with hallucination feedback and improved prompt on right side" width="2560" height="1440" data-path="04-observability-analytics/optimize-outputs/vTC6UsPxNTyISRa9nxGeL.jpg" />
</Frame>

<Steps>
  <Step title="Select Failed Output">
    Click the checkbox next to a failed output to review its details. The left panel shows the output details, and the right panel shows the improved prompt being generated.
  </Step>

  <Step title="Review Output Content">
    Examine the actual output produced (e.g., "DebtTier: Active"). Compare it against the expected correct output.
  </Step>

  <Step title="Choose Feedback Type">
    Select either "Feedback" (explain what's wrong) or "Ideal output" (provide the correct answer):

    **Feedback (Recommended):** Explain the error, for example "Hallucinations with output not being accurate"

    **Ideal Output:** Provide the exact correct output for this input
  </Step>

  <Step title="Tag Error Type">
    Select an error category from the tags:

    * **Data loss in execution**: Information from input missing in output
    * **Missing task inputs**: Required data not provided to tool
    * **Missing context**: Tool lacks background information needed
    * **Incorrect memory lookup**: Wrong reference data retrieved
    * **Hallucinations**: AI inventing data not in input (common for low-accuracy tools)
  </Step>

  <Step title="Repeat for Multiple Examples">
    Provide feedback on 3-5 selected outputs. More diverse examples lead to better optimization.

    **Status:** "3 selected" indicates the AI has enough examples to identify patterns
  </Step>
</Steps>

<AccordionGroup>
  <Accordion title="Effective Feedback Strategies">
    **Be Specific:**

    * "Output classified as 'Active' but should be 'Tier 2' based on days past due"
    * Not just "Wrong classification"

    **Explain Why It's Wrong:**

    * "Missing consideration of balance amount in tier determination"
    * "Hallucinating 'Active' status not present in input data"

    **Provide Context:**

    * "Tool should use days\_past\_due, balance, and reminders count"
    * "Classification requires comparing against tier thresholds"

    **Use Error Tags Correctly:**

    * Hallucinations: AI making up data
    * Data loss: Correct data ignored
    * Missing context: Need domain knowledge added to prompt
    * Missing inputs: Input schema incomplete
  </Accordion>

  <Accordion title="Ideal Output vs Feedback">
    **Ideal Output (Best for):**

    * Extraction tasks (exact values to extract)
    * Classification (correct category)
    * Structured data output (fill in JSON)
    * Clear right/wrong answers

    **Feedback (Best for):**

    * Explaining reasoning errors
    * Complex decision logic
    * Nuanced improvements
    * Process problems vs output problems

    **Combination Approach:**

    * Provide ideal output for 1-2 examples
    * Explain what's wrong via feedback for others
    * Gives AI both target and reasoning
  </Accordion>

  <Accordion title="Common Error Patterns">
    **Hallucinations (Most Common):**

    * AI inventing classifications not in input
    * Making up field values
    * Creating data from assumptions

    **Data Loss:**

    * Ignoring key input fields
    * Missing important context
    * Overlooking edge case data

    **Logic Errors:**

    * Wrong decision criteria
    * Misunderstanding task requirements
    * Incorrect priority/weighting

    **Format Issues:**

    * Wrong output structure
    * Missing required fields
    * Incorrect data types
  </Accordion>
</AccordionGroup>
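
One way to keep feedback consistent across reviewers is to treat each entry as a small record: the output that was produced, one error tag, and either an explanation or an ideal output. The sketch below assumes such a record for illustration. `FeedbackEntry`, `ErrorTag`, and the vagueness check are hypothetical and do not describe how Beam stores feedback; the tag labels are copied from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorTag(Enum):
    """Error categories as they are listed in the feedback interface."""
    DATA_LOSS = "Data loss in execution"
    MISSING_INPUTS = "Missing task inputs"
    MISSING_CONTEXT = "Missing context"
    INCORRECT_MEMORY_LOOKUP = "Incorrect memory lookup"
    HALLUCINATIONS = "Hallucinations"


# Phrases this page calls out as too vague to be useful.
VAGUE_PHRASES = {"wrong", "bad output", "doesn't work", "wrong classification"}


@dataclass
class FeedbackEntry:
    """Hypothetical record of feedback on one failed output."""
    task_id: str
    actual_output: str
    tag: ErrorTag
    feedback: Optional[str] = None
    ideal_output: Optional[str] = None

    def problems(self) -> List[str]:
        """Return reasons this entry would be a weak learning signal."""
        issues = []
        if not self.feedback and not self.ideal_output:
            issues.append("provide feedback or an ideal output")
        if self.feedback and self.feedback.strip().lower() in VAGUE_PHRASES:
            issues.append("feedback is too vague to act on")
        return issues


entry = FeedbackEntry(
    task_id="task-101",  # hypothetical ID
    actual_output="DebtTier: Active",
    tag=ErrorTag.HALLUCINATIONS,
    feedback="Output classified as 'Active' but should be 'Tier 2' based on days past due",
)
print(entry.problems())  # []
```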

## AI Optimization Process

Watch AI analyze failures, rewrite prompts, and validate improvements automatically.

**Optimization Steps:**

1. **Analysis (\~10 seconds):** The AI reviews all selected outputs and feedback to identify failure patterns
2. **Prompt Rewriting (\~15 seconds):** It generates an improved prompt with better role context, structured logic, and clear output requirements
3. **Validation Testing (\~5 seconds):** The new prompt is automatically retested against the same failed cases
4. **Results Display:** A before/after comparison shows the accuracy improvements

<AccordionGroup>
  <Accordion title="What AI Changes in Prompts">
    **Role Context Added:**

    * "You are a skilled debt classification specialist..."
    * Provides domain expertise framing
    * Sets expectations for task complexity

    **Structured Classification Logic:**

    * Breaks down decision process into steps
    * Defines exact criteria for each tier
    * Specifies how to weigh different factors

    **Output Requirements:**

    * Exact format specifications
    * Required fields clearly listed
    * Data type expectations (string, number, boolean)
    * Validation rules embedded

    **Edge Case Handling:**

    * What to do when data missing
    * How to handle boundary conditions
    * Fallback logic defined

    **Before (Vague):**
    "Classify debt tier based on account information"

    **After (Precise):**
    "You are a skilled debt classification specialist. Analyze overdue accounts to determine appropriate tier. Extract days\_past\_due (numeric), balance (amount), reminders (count). Classify into: Tier 1 (0-30 days), Tier 2 (31-60 days), Tier 3 (61+ days). Output JSON with debt\_tier field."
  </Accordion>

  <Accordion title="Validation Testing">
    **Automatic Retest:**

    * Runs new prompt on same inputs that previously failed
    * Compares outputs against evaluation criteria
    * Calculates new accuracy scores
    * Shows improvement percentage

    **Example Results:**

    * Original: 0% accuracy on 3 test cases
    * Optimized: 100% accuracy on same 3 cases
    * Improvement: +100 percentage points

    **What Gets Tested:**

    * All outputs you provided feedback on
    * Additional recent failures if available
    * Diverse input scenarios
    * Edge cases from feedback
  </Accordion>

  <Accordion title="Optimization Success Indicators">
    **Strong Improvement:**

    * Accuracy jumps 50+ percentage points
    * All test cases now passing
    * Clear pattern recognition visible
    * Prompt significantly more detailed

    **Moderate Improvement:**

    * Accuracy increases 20-49 points
    * Most test cases passing
    * Some edge cases still failing
    * May need additional feedback iteration

    **Minimal Improvement:**

    * Accuracy increases \<20 points
    * Many test cases still failing
    * Pattern unclear or complex
    * Need more diverse feedback examples

    **Action on Minimal Improvement:**

    * Add more feedback examples (aim for 5-10)
    * Include diverse error types
    * Provide ideal outputs
    * Consider if tool has right input data
  </Accordion>
</AccordionGroup>
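
The retest and the success indicators above amount to simple arithmetic, sketched here under stated assumptions. The tier boundaries are taken from the example "After" prompt; the three cases, their outputs, and the function names are hypothetical, and the real validation runs inside Beam against your evaluation criteria rather than a hard-coded rule.

```python
from typing import Sequence


def expected_tier(days_past_due: int) -> str:
    """Tier boundaries copied from the example "After" prompt."""
    if days_past_due <= 30:
        return "Tier 1"
    if days_past_due <= 60:
        return "Tier 2"
    return "Tier 3"


def accuracy(outputs: Sequence[str], days: Sequence[int]) -> float:
    """Percentage of outputs that match the expected tier for each case."""
    hits = sum(out == expected_tier(d) for out, d in zip(outputs, days))
    return 100.0 * hits / len(days)


def improvement_band(before: float, after: float) -> str:
    """Classify the gain in percentage points using the indicators above."""
    gain = after - before
    if gain >= 50:
        return "strong"
    if gain >= 20:
        return "moderate"
    return "minimal"


# Three hypothetical failed cases, identified by days past due.
days = [12, 45, 75]
before = accuracy(["Active", "Active", "Active"], days)
after = accuracy(["Tier 1", "Tier 2", "Tier 3"], days)
print(before, after, improvement_band(before, after))  # 0.0 100.0 strong
```

Note that a 100% result on three retested cases says little about production accuracy, which is why the monitoring steps later on this page still apply.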

## Applying Optimizations

Deploy improved prompts to production after validating accuracy improvements.

<Frame>
  <img src="https://mintcdn.com/beamai/YDqllBKSmU7636m6/04-observability-analytics/optimize-outputs/6khwc-oXlrdHaHyvtDkE3.jpg?fit=max&auto=format&n=YDqllBKSmU7636m6&q=85&s=49655bfbc277090a69444cac1ea93457" alt="Flow builder showing Optimisation applied success message with Publish button to deploy changes" width="2560" height="1440" data-path="04-observability-analytics/optimize-outputs/6khwc-oXlrdHaHyvtDkE3.jpg" />
</Frame>

<Steps>
  <Step title="Review Optimization Results">
    Examine the improved prompt and the validation test results. Verify that the accuracy improvement meets expectations (target: 90%+).
  </Step>

  <Step title="Click Apply Optimizations">
    Click the "Apply optimizations" button to update the workflow with the new prompt. Changes are saved but not yet live.
  </Step>

  <Step title="Review in Flow Builder">
    Return to the Flow builder. A green "Optimisation applied" message confirms that the changes were saved.
  </Step>

  <Step title="Publish Changes">
    Click the "Publish" button to deploy the improved prompt to production. The new prompt takes effect on the next agent run.

    **Important:** Changes are not live until published. Test in an E2E environment first if one is available.
  </Step>

  <Step title="Monitor Performance">
    Track tool accuracy in the Learning Hub after deployment. Verify that the improvement persists with new production data.
  </Step>
</Steps>

<AccordionGroup>
  <Accordion title="Pre-Publishing Checklist">
    **Verify Improvements:**

    * ✅ Accuracy increased significantly (ideally 90%+)
    * ✅ All test cases passing
    * ✅ Prompt changes logical and clear
    * ✅ No unintended side effects visible

    **Test Safely:**

    * Run manual test task with new prompt
    * Verify output format unchanged
    * Check integration compatibility
    * Test with edge case inputs

    **Rollback Plan:**

    * Document original prompt before publishing
    * Monitor first 10-20 production tasks closely
    * Revert if accuracy drops unexpectedly
    * Iterate with more feedback if needed
  </Accordion>

  <Accordion title="Post-Publishing Monitoring">
    **First 24 Hours:**

    * Check Learning Hub for updated accuracy
    * Review first 10 task outputs manually
    * Monitor for new error patterns
    * Compare pre/post optimization metrics

    **First Week:**

    * Track daily accuracy trends
    * Analyze any failures with new prompt
    * Gather feedback from users/reviewers
    * Iterate if accuracy below target

    **Ongoing:**

    * Weekly accuracy reviews
    * Monthly optimization opportunities check
    * Quarterly full prompt review
    * Continuous feedback collection
  </Accordion>

  <Accordion title="Multiple Optimization Iterations">
    **When to Re-Optimize:**

    * Initial optimization improved but still \<90% accuracy
    * New failure patterns emerge over time
    * Input data characteristics changed
    * Business rules updated

    **Iterative Process:**

    1. Deploy first optimization
    2. Monitor for 1-2 weeks
    3. Collect new failure examples
    4. Run optimization again with fresh feedback
    5. Repeat until target accuracy achieved

    **Diminishing Returns:**

    * First optimization: Often a 40-60 percentage-point improvement
    * Second optimization: 10-20 percentage points
    * Third+ optimization: \<10 percentage points
    * Consider whether the limits of prompting have been reached
  </Accordion>
</AccordionGroup>
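
The rollback guidance above ("monitor the first 10-20 production tasks, revert if accuracy drops") can be made concrete with a simple check. This is a sketch under one stated assumption: that "drops" means falling below the accuracy the tool had before the optimization. `should_revert` is a hypothetical helper, not a Beam feature, and reverting remains a manual step.

```python
from typing import Sequence


def should_revert(
    baseline_accuracy: float,
    recent_passes: Sequence[bool],
    min_tasks: int = 10,
) -> bool:
    """Return True when post-publish accuracy is below the old baseline.

    `recent_passes` holds one boolean per production task run since
    publishing (True = met the evaluation criteria). No decision is made
    until at least `min_tasks` results are available.
    """
    if len(recent_passes) < min_tasks:
        return False  # not enough production data yet
    current = 100.0 * sum(recent_passes) / len(recent_passes)
    return current < baseline_accuracy


# Old prompt scored 50%; the new one passed 3 of its first 10 tasks.
print(should_revert(50.0, [True] * 3 + [False] * 7))  # True
```

A stricter policy could compare against the 90% target instead of the old baseline; which threshold is appropriate depends on how critical the tool is to the workflow.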

## Optimization Best Practices

<AccordionGroup>
  <Accordion title="Select Quality Feedback Examples">
    **Diversity Matters:**

    * Include different error types
    * Cover various input scenarios
    * Mix recent and older failures
    * Representative of production data

    **Quantity Guidelines:**

    * Minimum: 3 examples (basic optimization)
    * Recommended: 5-7 examples (good optimization)
    * Maximum useful: 10-15 examples (comprehensive)
    * Beyond 15: Diminishing returns

    **Balance Good and Bad:**

    * 60-70% bad outputs (what to fix)
    * 30-40% good outputs (what to preserve)
    * Helps AI maintain good behavior while fixing issues
  </Accordion>

  <Accordion title="Write Clear, Actionable Feedback">
    **Good Feedback Examples:**

    * "Output classified as 'Active' but input shows 45 days past due, should be Tier 2"
    * "Missing consideration of balance amount - only used days\_past\_due"
    * "Hallucinating 'paid' status not present in input data"

    **Poor Feedback Examples:**

    * "Wrong" (too vague)
    * "Bad output" (not actionable)
    * "Doesn't work" (no specific guidance)

    **Feedback Template:**

    * What's wrong: "Output shows X"
    * Why it's wrong: "But input indicates Y"
    * What's needed: "Should classify as Z based on \[criteria]"
  </Accordion>

  <Accordion title="Validate Before Publishing">
    **Test in E2E Environment:**

    1. Apply optimizations (don't publish)
    2. Copy prompt to E2E agent version
    3. Run test dataset (see [Test Datasets](/03-running-operations/debugging-testing/test-datasets/test-datasets))
    4. Verify 90%+ accuracy maintained
    5. Publish to production if validated

    **Manual Spot Checks:**

    * Run 5-10 manual test tasks
    * Review outputs for correctness
    * Check for format consistency
    * Verify edge case handling

    **Rollback Readiness:**

    * Save original prompt version
    * Document changes made
    * Have revert process ready
    * Monitor closely post-deployment
  </Accordion>

  <Accordion title="Iterate Based on Results">
    **Strong Improvement (90%+ accuracy):**

    * Publish and monitor
    * Document what worked
    * Apply learnings to other tools

    **Moderate Improvement (70-89% accuracy):**

    * Run second optimization with more examples
    * Add ideal outputs for clarity
    * Test again before publishing

    **Minimal Improvement (\<70% accuracy):**

    * Review whether the tool has the right inputs
    * Check whether the task is too complex for a single prompt
    * Consider a workflow redesign
    * Consult [Debug Tools](/03-running-operations/debugging-testing/debug-tools/debug-tools)

    **No Improvement:**

    * Verify feedback quality and diversity
    * Check whether the evaluation criteria are correct
    * Review whether fundamental data is missing
    * You may need human-in-the-loop review (see [Automation Modes](/03-running-operations/task-management/automation-modes/automation-modes))
  </Accordion>
</AccordionGroup>
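
The "Iterate Based on Results" guidance can be read as a decision rule on the accuracy reached after optimization. The sketch below encodes that reading; `next_action` is a hypothetical helper, and the returned strings simply summarize the bullets above.

```python
def next_action(before: float, after: float) -> str:
    """Suggest a next step from accuracy before and after optimization."""
    if after <= before:
        return "no improvement: check feedback quality, evaluation criteria, and input data"
    if after >= 90:
        return "publish and monitor"
    if after >= 70:
        return "run a second optimization with more examples"
    return "review tool inputs and task complexity; consider workflow redesign"


print(next_action(5, 100))  # publish and monitor
print(next_action(5, 75))   # run a second optimization with more examples
```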

## Integration with Evaluation Framework

Optimize Outputs relies on the [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework) to score outputs, which lets the two features form a continuous quality improvement loop.

**Connected Workflow:**

1. **Evaluation Framework** defines validation criteria for outputs
2. **Task executions** generate accuracy scores against criteria
3. **Learning Hub** aggregates scores to identify low-performing tools
4. **Optimize Outputs** uses failed evaluations as feedback for improvement
5. **Improved prompts** increase future evaluation scores
6. **Analytics** track improvement trends over time

**Benefits:**

* Automated quality measurement
* Data-driven optimization
* Quantifiable improvements
* Continuous learning loop

## Next Steps

<CardGroup cols={2}>
  <Card title="Evaluation Framework" icon="clipboard-check" href="/04-observability-analytics/evaluation-framework/evaluation-framework">
    Set evaluation criteria to measure optimization success
  </Card>

  <Card title="Test Datasets" icon="vial" href="/03-running-operations/debugging-testing/test-datasets/test-datasets">
    Validate optimized prompts with test datasets
  </Card>

  <Card title="Rerunning Tasks" icon="rotate" href="/03-running-operations/debugging-testing/rerunning-tasks/rerunning-tasks">
    Rerun failed tasks after optimization to demonstrate improvement
  </Card>

  <Card title="Task Executions" icon="chart-line" href="/03-running-operations/task-management/task-executions/task-executions">
    Monitor improved accuracy in production task executions
  </Card>
</CardGroup>
