# Test Datasets

> Validate agent performance at scale by consolidating test inputs, defining expected outputs, and batch-testing workflows

Test datasets validate agent accuracy by processing multiple test cases simultaneously and comparing results against expected outputs.

## Understanding Test Datasets

Test datasets enable systematic quality validation across representative scenarios.

**Test Input** - Sample data representing real workflows (invoices, emails, forms, documents)

**Expected Output** - Ground-truth results the agent should produce for each input

**Evaluation Criteria** - Node-level validation rules measuring extraction and classification accuracy (see [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework))

**Batch Execution** - Run all test cases together via webhook to measure overall agent performance

## Creating Test Datasets

Build datasets covering common scenarios and edge cases to ensure comprehensive validation.

<Steps>
  <Step title="Collect Test Inputs">
    Gather 10-50 representative examples covering standard cases (70%), edge cases (20%), and error scenarios (10%). Use anonymized production data or synthetic test cases.
  </Step>

  <Step title="Define Expected Outputs">
    Specify ground truth for each input. Use successful production task outputs, manually define correct results, or run initial tests and correct the outputs. See [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework) for creating expected outputs from task executions.
  </Step>

  <Step title="Set Node Evaluation Criteria">
    Configure validation rules for critical workflow steps. Define field-specific checks such as "`invoice_amount` must be a number" or "`priority` must be High/Medium/Low". See [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework) for detailed criteria setup.
  </Step>

  <Step title="Organize Dataset">
    Group related test cases into named collections. Example: "Invoice Processing - Standard Cases" or "Customer Inquiry - Edge Cases".
  </Step>
</Steps>
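
The 70/20/10 split from the first step can be turned into concrete case counts for a given dataset size. The following is a minimal Python sketch; `coverage_plan` is a hypothetical helper written for illustration and is not part of Beam.

```python
def coverage_plan(total_cases: int) -> dict[str, int]:
    """Split a dataset size into standard/edge/error counts (70/20/10)."""
    if total_cases < 1:
        raise ValueError("total_cases must be positive")
    edge = round(total_cases * 0.20)
    error = round(total_cases * 0.10)
    # Give the remainder to standard cases so the counts always sum to the total.
    standard = total_cases - edge - error
    return {"standard": standard, "edge": edge, "error": error}


print(coverage_plan(30))
```

For small datasets the rounding can shift a case between categories, so treat the counts as a starting point rather than a strict quota.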

<AccordionGroup>
  <Accordion title="Test Coverage Strategy">
    **Standard Cases (70%):**

    * Most common workflow paths
    * Typical data formats
    * Expected user inputs

    **Edge Cases (20%):**

    * Boundary values (zero, negative, very large)
    * Missing or incomplete data
    * Unusual formats

    **Error Cases (10%):**

    * Invalid inputs requiring error handling
    * Malformed data
    * Out-of-range values

    **Recommended Sizes:**

    * Small agents: 10-20 test cases
    * Medium agents: 20-50 test cases
    * Critical workflows: 50-100 test cases
  </Accordion>

  <Accordion title="Expected Output Best Practices">
    **Be Precise:**

    * Match exact field names from workflow output
    * Use specific values, not ranges
    * Correct data types (string, number, boolean)
    * Proper date/currency formats

    **Example Invoice Extraction:**

    ```json
    {
      "invoice_number": "INV-2025-001234",
      "invoice_date": "2025-01-15",
      "amount_due": 1500.00,
      "vendor_name": "Acme Corp"
    }
    ```

    **Example Classification:**

    ```json
    {
      "category": "Billing Inquiry",
      "priority": "High",
      "requires_human_review": false
    }
    ```
  </Accordion>
</AccordionGroup>
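
The precision rules above (exact field names, correct data types, proper date formats) can be checked mechanically before a dataset is run. The sketch below is one possible approach, not a Beam feature: the field names come from the invoice example, while `INVOICE_SCHEMA`, its format, and `check_expected_output` are assumptions made for illustration.

```python
from datetime import date

# Field names and types follow the invoice example above; the schema
# format itself is an assumption made for this sketch.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "invoice_date": "iso_date",
    "amount_due": (int, float),
    "vendor_name": str,
}


def check_expected_output(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if expected_type == "iso_date":
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not a YYYY-MM-DD date")
            continue
        # bool is a subclass of int, so reject it unless bool is what we expect.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            problems.append(f"{field}: wrong type {type(value).__name__}")
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

Running a check like this over every expected output catches typos in field names and stringified numbers before they show up as false test failures.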

## Running Test Datasets

Execute datasets via webhook to batch-process all test cases and measure accuracy.

<Steps>
  <Step title="Configure Webhook Trigger">
    Set up a webhook trigger for your agent. See [Triggers & Webhooks](/03-running-operations/task-management/triggers-webhooks/triggers-webhooks) for detailed setup.
  </Step>

  <Step title="Prepare Batch Payload">
    Structure the dataset as a JSON object whose `test_cases` array pairs each test input with its expected output.
  </Step>

  <Step title="Execute via Webhook">
    Send an HTTP POST request to trigger the agent with the entire dataset batch.

    **Webhook URL Format:**

    ```
    https://api.beamstudio.ai/webhooks/{agent_id}/{trigger_id}
    ```

    **Example Batch Request:**

    ```bash
    curl -X POST https://api.beamstudio.ai/webhooks/{agent_id}/{trigger_id} \
      -H "Content-Type: application/json" \
      -d '{
        "dataset": "invoice-processing-q1-2025",
        "test_cases": [
          {"invoice_pdf": "...", "expected": {"amount": 1500, "vendor": "Acme"}},
          {"invoice_pdf": "...", "expected": {"amount": 2300, "vendor": "TechCo"}},
          {"invoice_pdf": "...", "expected": {"amount": 890, "vendor": "SupplyCo"}}
        ]
      }'
    ```
  </Step>

  <Step title="Monitor Results">
    View execution results in [Task Executions](/03-running-operations/task-management/task-executions/task-executions). Review accuracy scores, identify failures, and analyze discrepancies.
  </Step>
</Steps>
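
For scripted runs, the same batch request can be built with the Python standard library. This sketch mirrors the payload shape of the curl example; `build_batch_request` is a hypothetical helper, the IDs are placeholders, and any authentication your trigger requires is not shown because this page does not describe it.

```python
import json
import urllib.request

# URL format as documented above; the IDs are filled in per agent and trigger.
WEBHOOK_URL = "https://api.beamstudio.ai/webhooks/{agent_id}/{trigger_id}"


def build_batch_request(agent_id, trigger_id, dataset, test_cases):
    """Build (but do not send) the POST request for a dataset batch."""
    payload = {"dataset": dataset, "test_cases": test_cases}
    return urllib.request.Request(
        WEBHOOK_URL.format(agent_id=agent_id, trigger_id=trigger_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_request(
    "my-agent-id",  # placeholder
    "my-trigger-id",  # placeholder
    "invoice-processing-q1-2025",
    [{"invoice_pdf": "...", "expected": {"amount": 1500, "vendor": "Acme"}}],
)

# To actually send the batch, uncomment:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status)
```

Separating request construction from sending keeps the payload easy to inspect and test without triggering a real agent run.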

<AccordionGroup>
  <Accordion title="Interpreting Accuracy Scores">
    **95-100% Match**: Agent performing as expected - ready for production

    **85-94% Match**: Minor issues - investigate failures, may need prompt optimization

    **Below 85%**: Significant problems - review evaluation criteria, update prompts, or verify expected outputs are correct

    **Failure Analysis:**

    * Which specific test cases failed?
    * Are failures clustered (e.g., all date extraction errors)?
    * Are failures random, or do they follow a consistent pattern?
    * Review failed tasks using [Debug Tools](/03-running-operations/debugging-testing/debug-tools/debug-tools)
  </Accordion>

  <Accordion title="Webhook Automation Benefits">
    **CI/CD Integration:**

    * Run datasets automatically before deployment
    * Block releases if accuracy drops below threshold
    * Generate test reports for stakeholders

    **Scheduled Testing:**

    * Daily runs on critical datasets
    * Weekly comprehensive validation
    * Track accuracy trends over time

    **Programmatic Control:**

    * Trigger from external systems
    * Integrate with monitoring tools
    * Automate regression testing
  </Accordion>
</AccordionGroup>
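
The accuracy bands and the failure-clustering questions above can be sketched in a few lines of Python. This is an illustration under stated assumptions: each case is a dict with hypothetical `expected` and `actual` keys, fields are compared by exact equality (Beam's evaluation criteria may be richer), and scores between 94% and 95% are treated as part of the "minor issues" band.

```python
from collections import Counter


def failures_by_field(cases: list[dict]) -> Counter:
    """Count mismatches per field to reveal clustered failures."""
    counts = Counter()
    for case in cases:
        for field, expected in case["expected"].items():
            if case["actual"].get(field) != expected:
                counts[field] += 1
    return counts


def accuracy(cases: list[dict]) -> float:
    """Percentage of cases in which every expected field matched."""
    if not cases:
        raise ValueError("no test cases")
    passed = sum(1 for case in cases if not failures_by_field([case]))
    return 100.0 * passed / len(cases)


def interpret(score: float) -> str:
    """Map an accuracy percentage onto the bands described above."""
    if score >= 95:
        return "ready for production"
    if score >= 85:
        return "minor issues"
    return "significant problems"
```

In a CI pipeline, a step could exit with a non-zero status when `accuracy(cases)` falls below the chosen threshold, which blocks the release as described under CI/CD Integration.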

## Iterative Improvement

Use test dataset results to systematically improve agent performance.

**Improvement Workflow:**

1. **Baseline**: Run dataset → Record accuracy scores
2. **Identify Issues**: Review failures → Determine root causes
3. **Optimize**: Update prompts, criteria, or workflow (see [Optimize Outputs](/04-observability-analytics/optimize-outputs/optimize-outputs))
4. **Retest**: Rerun dataset → Validate improvements
5. **Deploy**: Publish if accuracy meets targets (95%+)
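
The baseline and retest steps are easiest to compare case by case. The sketch below assumes each run has been reduced to a mapping from a test case ID to a pass/fail flag; `compare_runs` and the ID scheme are hypothetical and only illustrate the comparison.

```python
def compare_runs(
    baseline: dict[str, bool], retest: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify test cases by how their pass/fail status changed between runs."""
    shared = baseline.keys() & retest.keys()
    return {
        "fixed": sorted(c for c in shared if not baseline[c] and retest[c]),
        "regressed": sorted(c for c in shared if baseline[c] and not retest[c]),
        "still_failing": sorted(
            c for c in shared if not baseline[c] and not retest[c]
        ),
    }


baseline = {"inv-001": True, "inv-002": False, "inv-003": False, "inv-004": True}
retest = {"inv-001": True, "inv-002": True, "inv-003": False, "inv-004": False}
print(compare_runs(baseline, retest))
```

A non-empty `regressed` list is a signal to hold the deploy step even when overall accuracy has improved, since the optimization broke cases that previously passed.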

<AccordionGroup>
  <Accordion title="Common Optimization Scenarios">
    **Prompt Optimization:**

    * Failed test: Agent extracting wrong invoice date
    * Action: Use [Optimize Outputs](/04-observability-analytics/optimize-outputs/optimize-outputs) to improve prompt
    * Retest: Run dataset to verify fix
    * Result: Date extraction accuracy improves from 82% to 96%

    **Evaluation Criteria Tuning:**

    * Issue: False failures on valid outputs
    * Action: Adjust criteria thresholds in [Evaluation Framework](/04-observability-analytics/evaluation-framework/evaluation-framework)
    * Retest: Validate adjusted criteria
    * Result: Reduce false failures while maintaining quality

    **Expected Output Corrections:**

    * Pattern: Multiple tests failing with similar outputs
    * Action: Review expected outputs for errors
    * Fix: Update incorrect expected values
    * Result: Test accuracy reflects true agent performance
  </Accordion>

  <Accordion title="Dataset Maintenance">
    **Quarterly Updates:**

    * Add new test cases from production issues
    * Remove outdated scenarios
    * Update expected outputs if requirements have changed

    **After Major Changes:**

    * Expand dataset for new features
    * Add regression tests for bug fixes
    * Archive deprecated test cases

    **Version Control:**

    * Track dataset changes over time
    * Link versions to agent releases
    * Document what changed and why
  </Accordion>
</AccordionGroup>
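
One lightweight way to link dataset versions to agent releases is to record a content fingerprint alongside each release. The following sketch is an assumption about how that could be done, not a Beam feature; `dataset_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def dataset_fingerprint(test_cases: list[dict]) -> str:
    """Return a short, stable hash of the dataset content."""
    # Sorting keys and fixing separators makes the JSON canonical, so the
    # fingerprint changes only when the data itself changes.
    canonical = json.dumps(test_cases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Storing the fingerprint with each recorded accuracy score makes it clear whether a change in accuracy came from the agent or from an edited dataset.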

## Best Practices

<AccordionGroup>
  <Accordion title="Start Small, Expand Strategically">
    **Week 1**: Create 10-15 core test cases covering common scenarios

    **Weeks 2-4**: Add 10-20 edge cases discovered during initial testing

    **Months 2-3**: Expand to 50+ cases including all production issues

    **Ongoing**: Add a test for every new bug or production failure
  </Accordion>

  <Accordion title="Combine Manual and Automated Testing">
    **Manual Review:**

    * Initial dataset creation
    * Complex edge case validation
    * Expected output verification

    **Automated Execution:**

    * Webhook-triggered batch runs
    * Scheduled regression testing
    * CI/CD pipeline integration
    * Continuous monitoring
  </Accordion>

  <Accordion title="Track Trends Over Time">
    **Metrics to Monitor:**

    * Overall dataset accuracy (weekly)
    * Per-test-case pass rates
    * Failure patterns and clusters
    * Accuracy trends (improving or degrading)

    **Use Trends For:**

    * Detecting performance drift
    * Validating optimization impact
    * Demonstrating ROI to stakeholders
    * Identifying systematic issues
  </Accordion>
</AccordionGroup>
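
Performance drift can be flagged automatically from the weekly accuracy history. The sketch below compares the latest score with the mean of the preceding runs; `detect_drift`, the four-run window, and the three-point tolerance are illustrative assumptions that should be tuned to your own targets.

```python
from statistics import mean


def detect_drift(
    weekly_accuracy: list[float], window: int = 4, tolerance: float = 3.0
) -> bool:
    """True when the latest score is more than `tolerance` points below
    the mean of up to `window` preceding scores."""
    if len(weekly_accuracy) < 2:
        return False  # not enough history to compare against
    *history, latest = weekly_accuracy
    baseline = mean(history[-window:])
    return baseline - latest > tolerance


print(detect_drift([96, 97, 96, 95, 91]))
```

Comparing against a rolling mean rather than only the previous run reduces false alarms from a single unusually good or bad week.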

## Next Steps

<CardGroup cols={2}>
  <Card title="Evaluation Framework" icon="clipboard-check" href="/04-observability-analytics/evaluation-framework/evaluation-framework">
    Configure node-level validation criteria for test datasets
  </Card>

  <Card title="Triggers & Webhooks" icon="webhook" href="/03-running-operations/task-management/triggers-webhooks/triggers-webhooks">
    Set up webhook triggers for batch dataset execution
  </Card>

  <Card title="Task Executions" icon="chart-line" href="/03-running-operations/task-management/task-executions/task-executions">
    Monitor test dataset execution results and accuracy
  </Card>

  <Card title="Optimize Outputs" icon="sparkles" href="/04-observability-analytics/optimize-outputs/optimize-outputs">
    Use dataset failures to guide prompt optimization
  </Card>
</CardGroup>
