Optimize Outputs uses AI to analyze failed task outputs, identify patterns, and automatically rewrite prompts for improved accuracy—transforming underperforming tools into high-accuracy agents.

Understanding Optimize Outputs

Watch AI agents learn from failures and fix themselves automatically, no code required.
  • Learning Hub - Tracks tool performance across all workflow nodes, identifying underperforming tools below accuracy thresholds
  • Feedback-Driven Optimization - Mark what went wrong in failed outputs, and AI uses those examples to rewrite prompts with better context
  • Automatic Prompt Rewriting - AI analyzes failures, identifies patterns, and rewrites prompts with clearer instructions and structured logic
  • Validation Testing - Automatically retests new prompts against the same failed cases to verify improvement before deployment
Key Benefit: Transform a 5% accuracy tool into a 100% accuracy tool in ~30 seconds by giving the AI feedback on what went wrong.

Accessing Learning Hub

Monitor tool performance and identify optimization opportunities across your agent workflows.
1. Navigate to Learning

Open your agent and click “Learning” in the left sidebar to access the Learning Hub dashboard.
2. Review Tool Performance

View accuracy scores for all tools in your workflow. Tools below the 90% threshold are highlighted as needing optimization.
3. Identify Underperforming Tools

Locate tools with low accuracy scores (e.g., “Debt Reminder Tier Classifier: 5%”). Compare them with high-performing tools (e.g., “Email Content Classifier: 100%”) to understand the potential for improvement.
4. Access Optimization

Click the “Optimize” button next to an underperforming tool to begin the improvement process.
Accuracy Score:
  • Percentage of tool outputs meeting evaluation criteria
  • Based on Evaluation Framework validation
  • Updated with each task execution
Accuracy Thresholds:
  • 90-100%: Excellent performance
  • 70-89%: Good, minor optimization beneficial
  • 50-69%: Moderate issues, optimization recommended
  • Below 50%: Significant problems, optimization critical
Execution Count:
  • Number of times tool has run
  • Larger sample size = more reliable accuracy metric
  • Minimum 5-10 executions for meaningful optimization
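As a rough illustration only, the accuracy score and threshold bands above could be computed as in the sketch below; the function names and data shape are hypothetical, not part of the platform.

```python
# Hypothetical sketch: the accuracy score and threshold bands described above.
# The data shape (a list of pass/fail evaluation results) is assumed, not a real API.

def accuracy_score(evaluation_results: list[bool]) -> float:
    """Percentage of tool outputs that met the evaluation criteria."""
    if not evaluation_results:
        return 0.0
    return 100.0 * sum(evaluation_results) / len(evaluation_results)

def threshold_band(accuracy: float) -> str:
    """Map an accuracy percentage to the bands used in the Learning Hub."""
    if accuracy >= 90:
        return "Excellent performance"
    if accuracy >= 70:
        return "Good, minor optimization beneficial"
    if accuracy >= 50:
        return "Moderate issues, optimization recommended"
    return "Significant problems, optimization critical"

# Example: 1 pass out of 20 executions -> 5% accuracy, deep in the critical band,
# and well above the 5-10 execution minimum for meaningful optimization.
results = [True] + [False] * 19
score = accuracy_score(results)
print(score, threshold_band(score))  # 5.0 Significant problems, optimization critical
```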
High vs Low Performers:
  • 100% accuracy tools show optimization works when configured correctly
  • Low accuracy tools (5-20%) indicate prompt issues, not capability limits
  • Similar tasks performing differently = prompt quality difference
Optimization Priority:
  1. Critical workflow tools with <50% accuracy
  2. High-volume tools with 50-89% accuracy
  3. Recently added tools needing calibration
  4. Tools with declining accuracy trends

Reviewing Failed Outputs

Examine failed task executions to understand what went wrong before optimization.
Optimize tool modal showing list of failed outputs all at 0% accuracy for Debt Reminder Tier Classifier
Failed Outputs List:
  • Each row represents single task execution
  • Shows task ID, description, thumbs up/down rating, accuracy score (0%), timestamp
  • All failures displayed for pattern identification
  • Checkbox for selecting outputs to provide feedback
1. Open Optimization Modal

Click the “Optimize” button in the Learning Hub for the underperforming tool to open the feedback interface.
2. Review Failed Executions

Examine the list of failed outputs showing 0% accuracy and identify common patterns across failures.
Example Pattern: Seven classification attempts, all failed (0% accuracy), all attempting the same task: “Classify tier of debt reminder based on days-past-due, balance, and reminders”
3. Select Examples for Feedback

Choose 3-5 representative failures covering different scenarios. A mix of good and bad outputs provides a better learning signal.
Pro Tip: Select diverse failures rather than identical ones so the AI learns broader patterns.
What to Look For:
  • Consistent error patterns (same mistake repeated)
  • Missing information in outputs
  • Incorrect classifications or extractions
  • Hallucinations (AI making up data not in input)
  • Format issues (wrong structure, missing fields)
Select Representative Failures:
  • Different input scenarios
  • Various error types
  • Edge cases and common cases
  • Recent and older failures
“Select all good outputs”:
  • Check outputs that were correct
  • Helps AI learn what success looks like
  • Provides positive examples alongside failures
“Select all bad outputs”:
  • Quickly select all failed cases
  • Useful when all outputs have same issue
  • Uncheck outliers that failed for different reasons
Individual Selection:
  • Choose specific mix of good and bad
  • Recommended: 60-70% bad, 30-40% good
  • Provides balanced learning signal
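The sketch below illustrates the 60-70% bad / 30-40% good guideline as a simple sampling routine; the output records and their passed flag are assumed shapes, not a platform API.

```python
# Hypothetical sketch of the 60-70% bad / 30-40% good selection guideline.
# Output records are assumed to carry a boolean "passed" flag; this is not a real API.
import random

def select_feedback_examples(outputs: list[dict], total: int = 5, bad_ratio: float = 0.7) -> list[dict]:
    bad = [o for o in outputs if not o["passed"]]
    good = [o for o in outputs if o["passed"]]
    n_bad = min(len(bad), round(total * bad_ratio))     # mostly failures: what to fix
    n_good = min(len(good), total - n_bad)              # some successes: what to preserve
    return random.sample(bad, n_bad) + random.sample(good, n_good)

outputs = [{"task_id": i, "passed": i % 4 == 0} for i in range(12)]
print(select_feedback_examples(outputs))  # e.g. 4 failed and 1 passed output
```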

Providing Feedback

Mark what went wrong in failed outputs so AI can learn and improve prompts.
Feedback interface showing selected outputs with hallucination feedback and improved prompt on right side
1. Select Failed Output

Click the checkbox next to a failed output to review its details. The left panel shows the output details; the right panel shows the improved prompt being generated.
2. Review Output Content

Examine the actual output produced (e.g., “DebtTier: Active”) and compare it against the expected correct output.
3. Choose Feedback Type

Select either “Feedback” (explain what’s wrong) or “Ideal output” (provide the correct answer):
Feedback (Recommended): Explain the error, e.g., “Hallucinations with output not being accurate”
Ideal Output: Provide the exact correct output for this input
4. Tag Error Type

Select the error category from the tags:
  • Data loss in execution: Information from input missing in output
  • Missing task inputs: Required data not provided to tool
  • Missing context: Tool lacks background information needed
  • Incorrect memory lookup: Wrong reference data retrieved
  • Hallucinations: AI inventing data not in input (common for low-accuracy tools)
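For illustration only, a single piece of feedback ties together the output under review, the feedback type, an error tag, and a specific explanation; every field name below is hypothetical, not the platform’s schema.

```python
# Hypothetical feedback entry; field names are illustrative, not the platform's schema.
feedback_entry = {
    "task_id": "task_0123",
    "tool": "Debt Reminder Tier Classifier",
    "feedback_type": "feedback",          # or "ideal_output"
    "error_tag": "Hallucinations",        # one of the tags listed above
    "comment": (
        "Output classified as 'Active', but 'Active' is not a valid tier and does not "
        "appear in the input. Input shows 45 days past due, so this should be Tier 2."
    ),
}
```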
5. Repeat for Multiple Examples

Provide feedback on 3-5 selected outputs. More diverse examples = better optimization.
Status: “3 selected” shows the AI has enough examples to identify patterns.
Be Specific:
  • “Output classified as ‘Active’ but should be ‘Tier 2’ based on days past due”
  • Not just “Wrong classification”
Explain Why It’s Wrong:
  • “Missing consideration of balance amount in tier determination”
  • “Hallucinating ‘Active’ status not present in input data”
Provide Context:
  • “Tool should use days_past_due, balance, and reminders count”
  • “Classification requires comparing against tier thresholds”
Use Error Tags Correctly:
  • Hallucinations: AI making up data
  • Data loss: Correct data ignored
  • Missing context: Need domain knowledge added to prompt
  • Missing inputs: Input schema incomplete
Ideal Output (Best for):
  • Extraction tasks (exact values to extract)
  • Classification (correct category)
  • Structured data output (fill in JSON)
  • Clear right/wrong answers
Feedback (Best for):
  • Explaining reasoning errors
  • Complex decision logic
  • Nuanced improvements
  • Process problems vs output problems
Combination Approach:
  • Provide ideal output for 1-2 examples
  • Explain what’s wrong via feedback for others
  • Gives AI both target and reasoning
Hallucinations (Most Common):
  • AI inventing classifications not in input
  • Making up field values
  • Creating data from assumptions
Data Loss:
  • Ignoring key input fields
  • Missing important context
  • Overlooking edge case data
Logic Errors:
  • Wrong decision criteria
  • Misunderstanding task requirements
  • Incorrect priority/weighting
Format Issues:
  • Wrong output structure
  • Missing required fields
  • Incorrect data types

AI Optimization Process

Watch AI analyze failures, rewrite prompts, and validate improvements automatically.
Optimization Steps:
  1. Analysis (~10 seconds): AI reviews all selected outputs and feedback to identify failure patterns
  2. Prompt Rewriting (~15 seconds): Generates improved prompt with better role context, structured logic, and clear output requirements
  3. Validation Testing (~5 seconds): Automatically retests new prompt against same failed cases
  4. Results Display: Shows before/after comparison with accuracy improvements
Role Context Added:
  • “You are a skilled debt classification specialist…”
  • Provides domain expertise framing
  • Sets expectations for task complexity
Structured Classification Logic:
  • Breaks down decision process into steps
  • Defines exact criteria for each tier
  • Specifies how to weigh different factors
Output Requirements:
  • Exact format specifications
  • Required fields clearly listed
  • Data type expectations (string, number, boolean)
  • Validation rules embedded
Edge Case Handling:
  • What to do when data missing
  • How to handle boundary conditions
  • Fallback logic defined
Before (Vague): “Classify debt tier based on account information”
After (Precise): “You are a skilled debt classification specialist. Analyze overdue accounts to determine appropriate tier. Extract days_past_due (numeric), balance (amount), reminders (count). Classify into: Tier 1 (0-30 days), Tier 2 (31-60 days), Tier 3 (61+ days). Output JSON with debt_tier field.”
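For reference, the decision logic the optimized prompt spells out is simple enough to express directly. The sketch below only restates that logic for clarity; it is not code the platform generates or runs.

```python
# Illustrative only: the tier logic described by the optimized prompt above.
import json

def classify_debt_tier(days_past_due: int) -> str:
    # Tier boundaries as stated in the optimized prompt.
    if days_past_due <= 30:
        return "Tier 1"
    if days_past_due <= 60:
        return "Tier 2"
    return "Tier 3"

# The prompt also has the model extract balance and reminders; only the
# day-based boundaries it states are shown here.
print(json.dumps({"debt_tier": classify_debt_tier(45)}))  # {"debt_tier": "Tier 2"}
```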
Automatic Retest:
  • Runs new prompt on same inputs that previously failed
  • Compares outputs against evaluation criteria
  • Calculates new accuracy scores
  • Shows improvement percentage
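Conceptually, the retest re-runs the previously failed inputs through the rewritten prompt and recomputes accuracy on the same cases. The sketch below is a hypothetical outline; run_tool and evaluate stand in for platform behavior.

```python
# Hypothetical sketch of the validation retest; run_tool and evaluate are placeholders.
def retest(failed_cases, new_prompt, run_tool, evaluate) -> float:
    """Re-run previously failed inputs with the rewritten prompt and recompute accuracy."""
    passed = 0
    for case in failed_cases:
        output = run_tool(new_prompt, case["input"])    # re-run with the improved prompt
        if evaluate(output, case["criteria"]):          # same evaluation criteria as before
            passed += 1
    return 100.0 * passed / len(failed_cases)

# Toy usage with stand-in functions:
cases = [{"input": 45, "criteria": "Tier 2"}, {"input": 10, "criteria": "Tier 1"}]
print(retest(cases, "improved prompt",
             run_tool=lambda prompt, days: "Tier 2" if days > 30 else "Tier 1",
             evaluate=lambda out, expected: out == expected))  # 100.0
```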
Example Results:
  • Original: 0% accuracy on 3 test cases
  • Optimized: 100% accuracy on same 3 cases
  • Improvement: +100 percentage points
What Gets Tested:
  • All outputs you provided feedback on
  • Additional recent failures if available
  • Diverse input scenarios
  • Edge cases from feedback
Strong Improvement:
  • Accuracy jumps 50+ percentage points
  • All test cases now passing
  • Clear pattern recognition visible
  • Prompt significantly more detailed
Moderate Improvement:
  • Accuracy increases 20-49 points
  • Most test cases passing
  • Some edge cases still failing
  • May need additional feedback iteration
Minimal Improvement:
  • Accuracy increases <20 points
  • Many test cases still failing
  • Pattern unclear or complex
  • Need more diverse feedback examples
Action on Minimal Improvement:
  • Add more feedback examples (aim for 5-10)
  • Include diverse error types
  • Provide ideal outputs
  • Consider if tool has right input data

Applying Optimizations

Deploy improved prompts to production after validating accuracy improvements.
Flow builder showing Optimisation applied success message with Publish button to deploy changes
1

Review Optimization Results

Examine improved prompt and validation test results. Verify accuracy improvement meets expectations (target: 90%+).
2. Click Apply Optimizations

Click the “Apply optimizations” button to update the workflow with the new prompt. Changes are saved but not yet live.
3. Review in Flow Builder

Return to the Flow builder. A green “Optimisation applied” message confirms the changes were saved successfully.
4. Publish Changes

Click the “Publish” button to deploy the improved prompt to production. The new prompt takes effect on the next agent run.
Important: Changes are not live until published. Test in an E2E environment first if available.
5. Monitor Performance

Track tool accuracy in the Learning Hub after deployment. Verify the improvement persists with new production data.
Verify Improvements:
  • ✅ Accuracy increased significantly (ideally 90%+)
  • ✅ All test cases passing
  • ✅ Prompt changes logical and clear
  • ✅ No unintended side effects visible
Test Safely:
  • Run manual test task with new prompt
  • Verify output format unchanged
  • Check integration compatibility
  • Test with edge case inputs
Rollback Plan:
  • Document original prompt before publishing
  • Monitor first 10-20 production tasks closely
  • Revert if accuracy drops unexpectedly
  • Iterate with more feedback if needed
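A minimal sketch of the monitor-and-revert idea, assuming you track pass/fail for the first production tasks yourself; none of the names below are platform APIs.

```python
# Hypothetical sketch: revert if post-deployment accuracy drops below a floor.
def should_rollback(recent_results: list[bool], baseline_accuracy: float, floor: float = 0.9) -> bool:
    """recent_results: pass/fail for the first 10-20 production tasks after publishing."""
    if len(recent_results) < 10:                      # wait for a minimal sample
        return False
    accuracy = sum(recent_results) / len(recent_results)
    return accuracy < min(floor, baseline_accuracy)   # dropped unexpectedly -> revert

print(should_rollback([True] * 7 + [False] * 5, baseline_accuracy=1.0))  # True -> revert
```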
First 24 Hours:
  • Check Learning Hub for updated accuracy
  • Review first 10 task outputs manually
  • Monitor for new error patterns
  • Compare pre/post optimization metrics
First Week:
  • Track daily accuracy trends
  • Analyze any failures with new prompt
  • Gather feedback from users/reviewers
  • Iterate if accuracy below target
Ongoing:
  • Weekly accuracy reviews
  • Monthly optimization opportunities check
  • Quarterly full prompt review
  • Continuous feedback collection
When to Re-Optimize:
  • Initial optimization improved but still <90% accuracy
  • New failure patterns emerge over time
  • Input data characteristics changed
  • Business rules updated
Iterative Process:
  1. Deploy first optimization
  2. Monitor for 1-2 weeks
  3. Collect new failure examples
  4. Run optimization again with fresh feedback
  5. Repeat until target accuracy achieved
Diminishing Returns:
  • First optimization: Often 40-60% improvement
  • Second optimization: 10-20% improvement
  • Third+ optimization: <10% improvement
  • Consider if prompting limits reached

Optimization Best Practices

Diversity Matters:
  • Include different error types
  • Cover various input scenarios
  • Mix recent and older failures
  • Representative of production data
Quantity Guidelines:
  • Minimum: 3 examples (basic optimization)
  • Recommended: 5-7 examples (good optimization)
  • Maximum useful: 10-15 examples (comprehensive)
  • Beyond 15: Diminishing returns
Balance Good and Bad:
  • 60-70% bad outputs (what to fix)
  • 30-40% good outputs (what to preserve)
  • Helps AI maintain good behavior while fixing issues
Good Feedback Examples:
  • “Output classified as ‘Active’ but input shows 45 days past due, should be Tier 2”
  • “Missing consideration of balance amount - only used days_past_due”
  • “Hallucinating ‘paid’ status not present in input data”
Poor Feedback Examples:
  • “Wrong” (too vague)
  • “Bad output” (not actionable)
  • “Doesn’t work” (no specific guidance)
Feedback Template:
  • What’s wrong: “Output shows X”
  • Why it’s wrong: “But input indicates Y”
  • What’s needed: “Should classify as Z based on [criteria]”
Test in E2E Environment:
  1. Apply optimizations (don’t publish)
  2. Copy prompt to E2E agent version
  3. Run test dataset (see Test Datasets)
  4. Verify 90%+ accuracy maintained
  5. Publish to production if validated
Manual Spot Checks:
  • Run 5-10 manual test tasks
  • Review outputs for correctness
  • Check for format consistency
  • Verify edge case handling
Rollback Readiness:
  • Save original prompt version
  • Document changes made
  • Have revert process ready
  • Monitor closely post-deployment
Strong Improvement (90%+ accuracy):
  • Publish and monitor
  • Document what worked
  • Apply learnings to other tools
Moderate Improvement (70-89% accuracy):
  • Run second optimization with more examples
  • Add ideal outputs for clarity
  • Test again before publishing
Minimal Improvement (<70% accuracy):
  • Review if tool has right inputs
  • Check if task too complex for single prompt
  • Consider workflow redesign
  • Consult Debug Tools
No Improvement:
  • Verify feedback quality and diversity
  • Check if evaluation criteria correct
  • Review if fundamental data missing
  • May need human-in-the-loop (see Automation Modes)

Integration with Evaluation Framework

Optimize Outputs works seamlessly with the Evaluation Framework for continuous quality improvement.
Connected Workflow:
  1. Evaluation Framework defines validation criteria for outputs
  2. Task executions generate accuracy scores against criteria
  3. Learning Hub aggregates scores to identify low-performing tools
  4. Optimize Outputs uses failed evaluations as feedback for improvement
  5. Improved prompts increase future evaluation scores
  6. Analytics track improvement trends over time
Benefits:
  • Automated quality measurement
  • Data-driven optimization
  • Quantifiable improvements
  • Continuous learning loop

Next Steps