Understanding Test Datasets
Test datasets enable systematic quality validation across representative scenarios.

- Test Input - Sample data representing real workflows (invoices, emails, forms, documents)
- Expected Output - Ground truth results the agent should produce for each input
- Evaluation Criteria - Node-level validation rules measuring extraction and classification accuracy (see Evaluation Framework)
- Batch Execution - Run all test cases together via webhook to measure overall agent performance

Creating Test Datasets
Build datasets covering common scenarios and edge cases to ensure comprehensive validation.

1. Collect Test Inputs
Gather 10-50 representative examples covering standard cases (70%), edge cases (20%), and error scenarios (10%). Use anonymized production data or synthetic test cases.
2. Define Expected Outputs
Specify ground truth for each input. Use successful production task outputs, manually define correct results, or run initial tests and correct the outputs. See Evaluation Framework for creating expected outputs from task executions.
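A test case pairs one input with its ground truth. A minimal sketch of such a record; the field names (`invoice_amount`, `due_date`, and so on) are illustrative, not a required schema:

```python
import json

# One test case: the raw input the agent receives, and the exact
# output it should produce. All field names here are illustrative.
test_case = {
    "input": {
        "document_type": "invoice",
        "text": "Invoice #1042 from Acme Corp, total due: $1,250.00 by 2024-07-01",
    },
    "expected_output": {
        "invoice_number": "1042",
        "vendor": "Acme Corp",
        "invoice_amount": 1250.00,   # a number, not the string "$1,250.00"
        "due_date": "2024-07-01",    # ISO 8601 date string
    },
}

print(json.dumps(test_case, indent=2))
```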
3. Set Node Evaluation Criteria
Configure validation rules for critical workflow steps. Define field-specific checks like “invoice_amount must be a number” or “priority must be High/Medium/Low”. See Evaluation Framework for detailed criteria setup.
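The two example rules above can be expressed as simple per-field predicates. A sketch, assuming field names like `invoice_amount` from the earlier examples:

```python
# Field-level validation rules keyed by output field name.
# Each rule is a predicate returning True when the value is valid.
criteria = {
    "invoice_amount": lambda v: isinstance(v, (int, float)),   # must be a number
    "priority": lambda v: v in {"High", "Medium", "Low"},      # must be an allowed label
}

def validate(output: dict) -> dict:
    """Return a pass/fail result for every field that has a rule."""
    return {
        field: field in output and rule(output[field])
        for field, rule in criteria.items()
    }

# invoice_amount passes (it is a number); priority fails ("Urgent" is not allowed).
print(validate({"invoice_amount": 1250.0, "priority": "Urgent"}))
```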
4. Organize Dataset
Group related test cases into named collections. Example: “Invoice Processing - Standard Cases” or “Customer Inquiry - Edge Cases”.
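Grouping can be as simple as keying test cases by collection name. A small sketch using the example names above:

```python
# Named collections of test cases; collection names follow the
# "Workflow - Scenario Type" pattern from the examples above.
datasets = {
    "Invoice Processing - Standard Cases": [],
    "Invoice Processing - Edge Cases": [],
    "Customer Inquiry - Edge Cases": [],
}

datasets["Invoice Processing - Standard Cases"].append(
    {
        "input": {"text": "Invoice #1042 from Acme Corp"},
        "expected_output": {"invoice_number": "1042"},
    }
)

for name, cases in datasets.items():
    print(f"{name}: {len(cases)} case(s)")
```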
Test Coverage Strategy
Standard Cases (70%):
- Most common workflow paths
- Typical data formats
- Expected user inputs

Edge Cases (20%):
- Boundary values (zero, negative, very large)
- Missing or incomplete data
- Unusual formats

Error Scenarios (10%):
- Invalid inputs requiring error handling
- Malformed data
- Out-of-range values

Dataset Size Guidelines:
- Small agents: 10-20 test cases
- Medium agents: 20-50 test cases
- Critical workflows: 50-100 test cases
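The 70/20/10 split above can be turned into concrete case counts for a chosen dataset size. A quick sketch:

```python
def coverage_targets(total_cases: int) -> dict:
    """Split a dataset size into 70% standard / 20% edge / 10% error targets."""
    standard = round(total_cases * 0.70)
    edge = round(total_cases * 0.20)
    # Give error scenarios the remainder so the counts always sum to the total.
    return {
        "standard": standard,
        "edge": edge,
        "error": total_cases - standard - edge,
    }

print(coverage_targets(50))  # e.g. a medium agent: 35 standard, 10 edge, 5 error
```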
Expected Output Best Practices
Be Precise:
- Match exact field names from the workflow output
- Use specific values, not ranges
- Use correct data types (string, number, boolean)
- Use proper date/currency formats
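These precision rules matter because comparison against expected outputs is typically exact: a type or format mismatch counts as a failure. A sketch using the illustrative invoice fields from earlier:

```python
def exact_match(expected: dict, actual: dict) -> bool:
    """Exact comparison: field names, values, and types must all match."""
    return all(
        field in actual
        and actual[field] == expected[field]
        and type(actual[field]) is type(expected[field])
        for field in expected
    )

expected = {"invoice_amount": 1250.00, "due_date": "2024-07-01"}

# A sloppy candidate output: string amount, non-ISO date.
actual_bad = {"invoice_amount": "1250", "due_date": "07/01/2024"}
actual_good = {"invoice_amount": 1250.00, "due_date": "2024-07-01"}

print(exact_match(expected, actual_bad))   # False: wrong type and date format
print(exact_match(expected, actual_good))  # True
```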
Running Test Datasets
Execute datasets via webhook to batch-process all test cases and measure accuracy.

1. Configure Webhook Trigger
Set up webhook trigger for your agent. See Triggers & Webhooks for detailed setup.
2. Prepare Batch Payload
Structure dataset as JSON array with test inputs and expected outputs.
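A batch payload is simply the test cases serialized as one JSON array. A sketch; the `test_cases` wrapper key and field names are assumptions for illustration, not a required schema:

```python
import json

# Two test cases - one standard, one edge case - serialized as a single batch.
batch = [
    {
        "input": {"text": "Invoice #1042 from Acme Corp, total: $1,250.00"},
        "expected_output": {"invoice_number": "1042", "invoice_amount": 1250.00},
    },
    {
        # Edge case: zero-amount invoice.
        "input": {"text": "Invoice #1043 from Globex, total: $0.00"},
        "expected_output": {"invoice_number": "1043", "invoice_amount": 0.00},
    },
]

payload = json.dumps({"test_cases": batch})
print(payload)
```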
3. Execute via Webhook
Send an HTTP POST request to trigger the agent with the entire dataset batch.
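One way to build and send that request, sketched with Python's standard library. The webhook URL is a placeholder: the actual URL format comes from your agent's trigger configuration (see Triggers & Webhooks):

```python
import json
import urllib.request

# Placeholder - substitute the webhook URL from your agent's trigger setup.
WEBHOOK_URL = "https://example.com/webhook/agent-id"

def build_batch_request(url: str, test_cases: list) -> urllib.request.Request:
    """Build a POST request carrying the whole dataset as JSON."""
    body = json.dumps({"test_cases": test_cases}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

cases = [
    {
        "input": {"text": "Invoice #1042 from Acme Corp"},
        "expected_output": {"invoice_number": "1042"},
    },
]

req = build_batch_request(WEBHOOK_URL, cases)
# To actually send it and read the response:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read())
print(req.method, req.get_header("Content-type"))
```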
4. Monitor Results
View execution results in Task Executions. Review accuracy scores, identify failures, and analyze discrepancies.
Interpreting Accuracy Scores
95-100% Match: Agent performing as expected - ready for production
85-94% Match: Minor issues - investigate failures; may need prompt optimization
Below 85%: Significant problems - review evaluation criteria, update prompts, or verify expected outputs are correct

Failure Analysis:
- Which specific test cases failed?
- Are failures clustered (e.g., all date extraction errors)?
- Random failures or consistent patterns?
- Review failed tasks using Debug Tools
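The clustering question above can be answered mechanically by grouping failures by the field (or node) that failed. A sketch over hypothetical per-test result records:

```python
from collections import Counter

# Hypothetical failure records: which field each failed test broke on.
failures = [
    {"test_id": 3, "failed_field": "due_date"},
    {"test_id": 7, "failed_field": "due_date"},
    {"test_id": 9, "failed_field": "due_date"},
    {"test_id": 12, "failed_field": "vendor"},
]

clusters = Counter(f["failed_field"] for f in failures)

# A dominant field (here due_date) suggests a systematic extraction issue
# worth a targeted prompt fix, rather than random noise.
print(clusters.most_common())
```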
Webhook Automation Benefits
CI/CD Integration:
- Run datasets automatically before deployment
- Block releases if accuracy drops below threshold
- Generate test reports for stakeholders
Scheduled Testing:
- Daily runs on critical datasets
- Weekly comprehensive validation
- Track accuracy trends over time

Automation:
- Trigger from external systems
- Integrate with monitoring tools
- Automate regression testing
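The release gate mentioned above can be a few lines in a CI step. A sketch, assuming your pipeline can read the batch run's overall accuracy score; in CI you would pass the function's return value to `sys.exit` so the step fails when accuracy regresses:

```python
ACCURACY_THRESHOLD = 0.95  # block releases below this score

def gate(accuracy: float, threshold: float = ACCURACY_THRESHOLD) -> int:
    """Exit-code style result: 0 allows the release, 1 blocks it."""
    if accuracy < threshold:
        print(f"FAIL: accuracy {accuracy:.1%} below threshold {threshold:.1%}")
        return 1
    print(f"PASS: accuracy {accuracy:.1%}")
    return 0

# In CI the score would come from the batch run's results,
# e.g. sys.exit(gate(score)).
print(gate(0.97))
```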
Iterative Improvement
Use test dataset results to systematically improve agent performance.

Improvement Workflow:
- Baseline: Run dataset → Record accuracy scores
- Identify Issues: Review failures → Determine root causes
- Optimize: Update prompts, criteria, or workflow (see Optimize Outputs)
- Retest: Rerun dataset → Validate improvements
- Deploy: Publish if accuracy meets targets (95%+)
Common Optimization Scenarios
Prompt Optimization:
- Failed test: Agent extracting wrong invoice date
- Action: Use Optimize Outputs to improve prompt
- Retest: Run dataset to verify fix
- Result: Date extraction accuracy improves from 82% to 96%
Criteria Adjustment:
- Issue: False failures on valid outputs
- Action: Adjust criteria thresholds in Evaluation Framework
- Retest: Validate adjusted criteria
- Result: Reduce false failures while maintaining quality

Expected Output Correction:
- Pattern: Multiple tests failing with similar outputs
- Action: Review expected outputs for errors
- Fix: Update incorrect expected values
- Result: Test accuracy reflects true agent performance
Dataset Maintenance
Quarterly Updates:
- Add new test cases from production issues
- Remove outdated scenarios
- Update expected outputs if requirements changed
- Expand dataset for new features
- Add regression tests for bug fixes
- Archive deprecated test cases
Version Control:
- Track dataset changes over time
- Link versions to agent releases
- Document what changed and why
Best Practices
Start Small, Expand Strategically
Week 1: Create 10-15 core test cases covering common scenarios
Weeks 2-4: Add 10-20 edge cases discovered during initial testing
Months 2-3: Expand to 50+ cases including all production issues
Ongoing: Add a test for every new bug or production failure
Combine Manual and Automated Testing
Manual Review:
- Initial dataset creation
- Complex edge case validation
- Expected output verification
Automated Testing:
- Webhook-triggered batch runs
- Scheduled regression testing
- CI/CD pipeline integration
- Continuous monitoring
Track Trends Over Time
Metrics to Monitor:
- Overall dataset accuracy (weekly)
- Per-test-case pass rates
- Failure patterns and clusters
- Accuracy trends (improving/degrading?)
Why It Matters:
- Detecting performance drift
- Validating optimization impact
- Demonstrating ROI to stakeholders
- Identifying systematic issues
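A minimal drift check compares the average of recent runs against the runs before them. A sketch; the weekly accuracy numbers and the window/tolerance values are illustrative:

```python
# Weekly overall-accuracy scores, oldest first (illustrative numbers).
history = [0.96, 0.95, 0.96, 0.93, 0.91, 0.90]

def is_degrading(scores, window=3, tolerance=0.02):
    """Flag drift when the recent average falls below the earlier average
    by more than the tolerance."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two windows
    earlier = sum(scores[-2 * window:-window]) / window
    recent = sum(scores[-window:]) / window
    return earlier - recent > tolerance

print(is_degrading(history))  # the last three weeks trend clearly downward
```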