Test datasets validate agent accuracy by processing multiple test cases simultaneously and comparing results against expected outputs.

Understanding Test Datasets

Test datasets enable systematic quality validation across representative scenarios.
  • Test Input - Sample data representing real workflows (invoices, emails, forms, documents)
  • Expected Output - Ground truth results the agent should produce for each input
  • Evaluation Criteria - Node-level validation rules measuring extraction and classification accuracy (see Evaluation Framework)
  • Batch Execution - Run all test cases together via webhook to measure overall agent performance

Creating Test Datasets

Build datasets covering common scenarios and edge cases to ensure comprehensive validation.
1. Collect Test Inputs

Gather 10-50 representative examples covering standard cases (70%), edge cases (20%), and error scenarios (10%). Use anonymized production data or synthetic test cases.
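As a quick sketch, the 70/20/10 mix above can be turned into concrete case counts for a chosen dataset size (the helper name and rounding strategy here are illustrative, not part of Beam):

```python
# Sketch: translate the 70/20/10 guideline into concrete case counts.
# dataset_split is a hypothetical helper; the rounding strategy is a choice.

def dataset_split(total_cases: int) -> dict:
    """Suggested counts of standard, edge, and error cases."""
    standard = round(total_cases * 0.70)
    edge = round(total_cases * 0.20)
    error = total_cases - standard - edge  # remainder keeps the total exact
    return {"standard": standard, "edge": edge, "error": error}

print(dataset_split(30))  # {'standard': 21, 'edge': 6, 'error': 3}
```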
2. Define Expected Outputs

Specify ground truth for each input. Use successful production task outputs, manually define correct results, or run initial tests and correct the outputs. See Evaluation Framework for creating expected outputs from task executions.
3. Set Node Evaluation Criteria

Configure validation rules for critical workflow steps. Define field-specific checks like “invoice_amount must be a number” or “priority must be High/Medium/Low”. See Evaluation Framework for detailed criteria setup.
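A minimal sketch of such field-level rules in Python (the dict-of-checks format is illustrative only; in Beam these criteria are configured through the Evaluation Framework, not written as code):

```python
# Sketch: node evaluation criteria as field-level checks.
# The rule format below is illustrative, not Beam's criteria syntax.

CRITERIA = {
    "invoice_amount": lambda v: isinstance(v, (int, float)),
    "priority": lambda v: v in {"High", "Medium", "Low"},
}

def evaluate(output: dict) -> dict:
    """Run every configured check against one agent output."""
    return {field: check(output.get(field)) for field, check in CRITERIA.items()}

print(evaluate({"invoice_amount": 1500.0, "priority": "Urgent"}))
# {'invoice_amount': True, 'priority': False}
```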
4. Organize Dataset

Group related test cases into named collections. Example: “Invoice Processing - Standard Cases” or “Customer Inquiry - Edge Cases”.
Standard Cases (70%):
  • Most common workflow paths
  • Typical data formats
  • Expected user inputs
Edge Cases (20%):
  • Boundary values (zero, negative, very large)
  • Missing or incomplete data
  • Unusual formats
Error Cases (10%):
  • Invalid inputs requiring error handling
  • Malformed data
  • Out-of-range values
Recommended Sizes:
  • Small agents: 10-20 test cases
  • Medium agents: 20-50 test cases
  • Critical workflows: 50-100 test cases
Be Precise:
  • Match exact field names from workflow output
  • Use specific values, not ranges
  • Correct data types (string, number, boolean)
  • Proper date/currency formats
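The precision rules above can be sketched as a strict comparator that enforces exact field names, values, and data types (a simplified illustration, not Beam's actual matching logic):

```python
# Sketch: strict comparison of an actual output against its expected output,
# enforcing exact field names, values, and data types.

def strict_match(expected: dict, actual: dict) -> list:
    """Return a list of mismatch descriptions; an empty list means a pass."""
    problems = []
    for field, want in expected.items():
        if field not in actual:
            problems.append(f"missing field: {field}")
        elif type(actual[field]) is not type(want):
            problems.append(f"wrong type for {field}")
        elif actual[field] != want:
            problems.append(f"wrong value for {field}")
    return problems

expected = {"invoice_number": "INV-2025-001234", "amount_due": 1500.00}
print(strict_match(expected, {"invoice_number": "INV-2025-001234", "amount_due": "1500.00"}))
# ['wrong type for amount_due']  -- the amount came back as a string
```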
Example Invoice Extraction:
{
  "invoice_number": "INV-2025-001234",
  "invoice_date": "2025-01-15",
  "amount_due": 1500.00,
  "vendor_name": "Acme Corp"
}
Example Classification:
{
  "category": "Billing Inquiry",
  "priority": "High",
  "requires_human_review": false
}

Running Test Datasets

Execute datasets via webhook to batch-process all test cases and measure accuracy.
1. Configure Webhook Trigger

Set up webhook trigger for your agent. See Triggers & Webhooks for detailed setup.
2. Prepare Batch Payload

Structure dataset as JSON array with test inputs and expected outputs.
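A sketch of assembling that payload in Python (the "dataset" and "test_cases" keys follow the curl example on this page; the "..." placeholders stand in for real invoice content):

```python
# Sketch: build the batch payload as JSON with a test-case array.
# "..." placeholders stand in for real invoice content (e.g. a file URL).
import json

test_cases = [
    {"invoice_pdf": "...", "expected": {"amount": 1500, "vendor": "Acme"}},
    {"invoice_pdf": "...", "expected": {"amount": 2300, "vendor": "TechCo"}},
]

payload = json.dumps({"dataset": "invoice-processing-q1-2025", "test_cases": test_cases})
print(payload)
```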
3. Execute via Webhook

Send an HTTP POST request to trigger the agent with the entire dataset batch.

Webhook URL Format:
https://api.beamstudio.ai/webhooks/{agent_id}/{trigger_id}
Example Batch Request:
curl -X POST https://api.beamstudio.ai/webhooks/{agent_id}/{trigger_id} \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "invoice-processing-q1-2025",
    "test_cases": [
      {"invoice_pdf": "...", "expected": {"amount": 1500, "vendor": "Acme"}},
      {"invoice_pdf": "...", "expected": {"amount": 2300, "vendor": "TechCo"}},
      {"invoice_pdf": "...", "expected": {"amount": 890, "vendor": "SupplyCo"}}
    ]
  }'
4. Monitor Results

View execution results in Task Executions. Review accuracy scores, identify failures, and analyze discrepancies.
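As a sketch, the overall score these thresholds are judged against is simply the share of test cases whose output matched the expected output:

```python
# Sketch: overall match percentage across a batch of test results.
# Each boolean records whether one test case matched its expected output.

def accuracy(results: list) -> float:
    """Percentage of test cases that matched; 0.0 for an empty batch."""
    return 100.0 * sum(results) / len(results) if results else 0.0

print(accuracy([True, True, True, False]))  # 75.0
```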
95-100% Match: Agent performing as expected - ready for production
85-94% Match: Minor issues - investigate failures, may need prompt optimization
Below 85%: Significant problems - review evaluation criteria, update prompts, or verify expected outputs are correct
Failure Analysis:
  • Which specific test cases failed?
  • Are failures clustered (e.g., all date extraction errors)?
  • Random failures or consistent patterns?
  • Review failed tasks using Debug Tools
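A sketch of the clustering check above, counting which fields fail most often (the input shape is an assumption):

```python
# Sketch: group failed test cases by failing field to spot clustered
# failures (e.g. all date-extraction errors pointing at one root cause).
from collections import Counter

failures = [
    {"case": "inv-001", "failed_fields": ["invoice_date"]},
    {"case": "inv-007", "failed_fields": ["invoice_date", "amount_due"]},
    {"case": "inv-012", "failed_fields": ["invoice_date"]},
]

by_field = Counter(f for case in failures for f in case["failed_fields"])
print(by_field.most_common())  # [('invoice_date', 3), ('amount_due', 1)]
```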
CI/CD Integration:
  • Run datasets automatically before deployment
  • Block releases if accuracy drops below threshold
  • Generate test reports for stakeholders
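A minimal sketch of such a release gate (the 95% threshold follows this page's guidance; how the exit code actually blocks a pipeline depends on your CI system):

```python
# Sketch: block a release when batch accuracy falls below a threshold.
# A CI step can fail the pipeline when the returned code is nonzero.

THRESHOLD = 95.0

def gate(match_percentage: float) -> int:
    """0 allows the release, 1 blocks it."""
    if match_percentage < THRESHOLD:
        print(f"Blocking release: {match_percentage:.1f}% < {THRESHOLD}%")
        return 1
    print(f"Accuracy {match_percentage:.1f}% meets the {THRESHOLD}% threshold")
    return 0

print("exit code:", gate(96.5))
```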
Scheduled Testing:
  • Daily runs on critical datasets
  • Weekly comprehensive validation
  • Track accuracy trends over time
Programmatic Control:
  • Trigger from external systems
  • Integrate with monitoring tools
  • Automate regression testing

Iterative Improvement

Use test dataset results to systematically improve agent performance.

Improvement Workflow:
  1. Baseline: Run dataset → Record accuracy scores
  2. Identify Issues: Review failures → Determine root causes
  3. Optimize: Update prompts, criteria, or workflow (see Optimize Outputs)
  4. Retest: Rerun dataset → Validate improvements
  5. Deploy: Publish if accuracy meets targets (95%+)
Prompt Optimization:
  • Failed test: Agent extracting wrong invoice date
  • Action: Use Optimize Outputs to improve prompt
  • Retest: Run dataset to verify fix
  • Result: Date extraction accuracy improves from 82% to 96%
Evaluation Criteria Tuning:
  • Issue: False failures on valid outputs
  • Action: Adjust criteria thresholds in Evaluation Framework
  • Retest: Validate adjusted criteria
  • Result: Reduce false failures while maintaining quality
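For example, an exact-match amount check that produces false failures on rounding differences can be loosened with a tolerance (the 0.01 value is an illustrative choice, not a Beam default):

```python
# Sketch: loosen an exact-match rule into a tolerance-based check to
# reduce false failures on valid outputs (e.g. rounding differences).

def amount_matches(expected: float, actual: float, tolerance: float = 0.01) -> bool:
    """Pass when the amounts agree within the configured tolerance."""
    return abs(expected - actual) <= tolerance

print(amount_matches(1500.00, 1500.004))  # True: within tolerance
print(amount_matches(1500.00, 1500.50))   # False: a real mismatch
```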
Expected Output Corrections:
  • Pattern: Multiple tests failing with similar outputs
  • Action: Review expected outputs for errors
  • Fix: Update incorrect expected values
  • Result: Test accuracy reflects true agent performance
Quarterly Updates:
  • Add new test cases from production issues
  • Remove outdated scenarios
  • Update expected outputs if requirements changed
After Major Changes:
  • Expand dataset for new features
  • Add regression tests for bug fixes
  • Archive deprecated test cases
Version Control:
  • Track dataset changes over time
  • Link versions to agent releases
  • Document what changed and why

Best Practices

Week 1: Create 10-15 core test cases covering common scenarios
Weeks 2-4: Add 10-20 edge cases discovered during initial testing
Months 2-3: Expand to 50+ cases including all production issues
Ongoing: Add a test for every new bug or production failure
Manual Review:
  • Initial dataset creation
  • Complex edge case validation
  • Expected output verification
Automated Execution:
  • Webhook-triggered batch runs
  • Scheduled regression testing
  • CI/CD pipeline integration
  • Continuous monitoring

Next Steps