GitHub Actions Integration¶

Run automated tests for your voice agents in CI/CD pipelines with the LayerCode Gym GitHub Action.

Overview¶

The LayerCode Gym GitHub Action enables you to:

Test multiple personas in parallel for maximum speed
Scripted conversations for deterministic regression testing
Automated judging with LLM-based quality evaluation
Track quality over time with historical test results
Continuous regression testing on every commit
Zero setup complexity - just configure and run

Quick Start¶

1. Set Up Secrets¶

Go to your repository: Settings → Secrets and variables → Actions → New repository secret

Add these required secrets:

Secret Name	Description	Where to Get
`SERVER_URL`	Your backend server URL	Your infrastructure
`LAYERCODE_AGENT_ID`	LayerCode agent ID	LayerCode Dashboard
`OPENAI_API_KEY`	OpenAI API key	OpenAI Platform

Optional but recommended:

Secret Name	Description
`LOGFIRE_TOKEN`	LogFire token for observability

2. Create Workflow File¶

Create .github/workflows/test-agent.yml:

name: Test Voice Agent
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    # IMPORTANT: Prevent concurrent runs
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4

      - name: Test agent
        uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: You are a potential customer
              intent: Learn about pricing and features
          judge-enabled: true
          judge-criteria: |
            - Did the agent provide clear pricing information?
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

3. Run Your Tests¶

Push your code and watch the tests run automatically!

Configuration¶

Conversation Types¶

The action supports two types of conversations that can be mixed:

AI Personas (Dynamic)¶

Personas define users with AI-driven responses. Each persona has:

background: Who the user is (role, context, characteristics)
intent: What they want to achieve

personas: |
  - background: You are a 35-year-old small business owner interested in AI
    intent: Learn how voice AI can help your customer service

  - background: You are a technical developer evaluating APIs
    intent: Understand integration requirements and documentation

Tips for writing good personas:

Be specific about who the user is
Include relevant context (frustrations, goals, technical level)
Make intents clear and actionable
Avoid making personas too similar

Scripted Messages (Deterministic)¶

Scripted conversations send exact message sequences for reproducible tests:

personas: |
  - messages:
      - Hello, I need to check my account balance
      - My account number is 12345
      - Thank you, goodbye

Use cases for scripted conversations:

Regression testing specific scenarios
Testing exact edge cases
Reproducing reported issues
Load testing with known inputs

Mixed Configurations¶

Combine both types in a single test run:

personas: |
  # Dynamic persona for exploratory testing
  - background: You are a frustrated customer
    intent: Get a refund for a defective product

  # Scripted conversation for regression testing
  - messages:
      - Hello, what are your business hours?
      - Do you have weekend support?
      - Thanks

  # Another dynamic persona
  - background: You are a technical user
    intent: Ask about API rate limits

Judge Criteria¶

The judge evaluates whether conversations meet your quality standards. Each criterion is a yes/no question that gets evaluated independently—all must pass for the conversation to pass.

Good criteria are: - Specific, measurable yes/no questions - Focused on agent behavior - Clear pass/fail conditions

Examples:

# Simple criteria
judge-criteria: |
  - Did the agent provide accurate product information?

# Detailed criteria
judge-criteria: |
  - Did the agent greet the user professionally?
  - Did the agent answer all questions with accurate information?
  - Did the agent offer next steps (demo, documentation, contact)?
  - Did the agent maintain a helpful and empathetic tone throughout?

# Domain-specific criteria
judge-criteria: |
  - Did the agent mention API authentication methods?
  - Did the agent provide documentation links?
  - Did the agent explain rate limits correctly?
  - Did the agent avoid providing incorrect technical details?

Audio Storage¶

By default, audio storage is disabled in the GitHub Action for faster CI execution. The judge only needs text transcripts, not audio files.

To enable audio storage (requires ffmpeg on the runner):

- uses: ./.github/actions/layercode-gym-test
  with:
    store-audio: true  # Enable audio file storage
    # ... other config

Full Configuration Options¶

See the Action README for complete documentation of all inputs and outputs.

Use Cases¶

Regression Testing¶

Run tests on every push to catch regressions. Use scripted conversations for deterministic tests:

name: Regression Tests
on: [push, pull_request]

jobs:
  regression:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            # Scripted regression tests for known scenarios
            - messages:
                - Hello, I want to check my order status
                - Order number 12345
                - Thanks

            # AI persona for dynamic testing
            - background: Returning customer
              intent: Check order status

            - background: New user
              intent: Create account
          judge-enabled: true
          judge-criteria: |
            - Did the agent handle the request correctly?
            - Did the agent avoid errors during the conversation?
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Nightly Quality Checks¶

Run comprehensive tests daily:

name: Nightly Quality Checks
on:
  schedule:
    - cron: '0 2 * * *'  # 2 AM daily

jobs:
  quality:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: Happy customer
              intent: Provide positive feedback

            - background: Angry customer
              intent: Demand refund

            - background: Confused user
              intent: Understand product features

            - background: Price-sensitive shopper
              intent: Find cheapest option

            - background: Enterprise buyer
              intent: Discuss bulk pricing
          max-turns: 10
          judge-enabled: true
          judge-criteria: |
            - Did the agent adapt tone to user's emotional state?
            - Did the agent provide accurate information?
            - Did the agent offer appropriate solutions?
            - Did the agent maintain professionalism?
          model: anthropic:claude-sonnet-4-5  # Use best model for quality checks
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          logfire-token: ${{ secrets.LOGFIRE_TOKEN }}

Pre-Deployment Validation¶

Require tests to pass before deploying:

name: Deploy Pipeline
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - name: Run pre-deployment tests
        uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: Critical user journey 1
              intent: Complete purchase

            - background: Critical user journey 2
              intent: Get support

            - background: Critical user journey 3
              intent: Update account
          judge-enabled: true
          judge-criteria: |
            - Did all critical paths complete successfully?
            - Did the agent avoid errors during the conversation?
          fail-on-judge-failure: true  # Block deployment on failure
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: echo "Deploying..."

A/B Testing Different Agent Versions¶

Test changes against baseline:

name: A/B Test Agent Changes

on: [pull_request]

jobs:
  test-baseline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: main  # Test against main branch

      - uses: ./.github/actions/layercode-gym-test
        id: baseline
        with:
          personas: ${{ env.TEST_PERSONAS }}
          judge-enabled: true
          # ... other config

  test-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # Test with PR changes

      - uses: ./.github/actions/layercode-gym-test
        id: changes
        with:
          personas: ${{ env.TEST_PERSONAS }}
          judge-enabled: true
          # ... other config

  compare:
    needs: [test-baseline, test-changes]
    runs-on: ubuntu-latest
    steps:
      - name: Compare results
        run: |
          echo "Baseline: ${{ needs.test-baseline.outputs.conversations-passed }} passed"
          echo "Changes: ${{ needs.test-changes.outputs.conversations-passed }} passed"

Best Practices¶

1. Use Concurrency Control¶

Always add concurrency control to prevent webhook conflicts:

concurrency:
  group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
  cancel-in-progress: false

LayerCode agents support one webhook at a time. The concurrency group ensures tests don't overlap.

2. Start Small, Scale Up¶

Begin with a few core personas and add more over time:

# Week 1: Core happy paths
personas: |
  - background: Happy customer
    intent: Complete a purchase

  - background: New user
    intent: Learn about the product

# Week 2: Add edge cases
personas: |
  - background: Happy customer
    intent: Complete a purchase
  - background: Frustrated customer
    intent: Request a refund

# Week 3: Add scripted regression tests
personas: |
  - background: Happy customer
    intent: Complete a purchase
  - messages:
      - Hello, I want to cancel my order
      - Order number 12345

3. Use Different Models for Different Tests¶

Fast CI (PR checks): openai:gpt-4o-mini
Nightly quality: anthropic:claude-sonnet-4-5
Critical path: openai:gpt-4o

# Fast PR checks
- uses: ./.github/actions/layercode-gym-test
  with:
    model: openai:gpt-4o-mini  # Fast and cheap
    max-turns: 3

# Comprehensive nightly
- uses: ./.github/actions/layercode-gym-test
  with:
    model: anthropic:claude-sonnet-4-5  # High quality
    max-turns: 10

4. Enable LogFire for Production Tests¶

Set LOGFIRE_TOKEN for deep observability:

Real-time conversation monitoring
Performance metrics
Debugging failed tests
Historical trends

- uses: ./.github/actions/layercode-gym-test
  with:
    logfire-token: ${{ secrets.LOGFIRE_TOKEN }}  # Enable observability
    # ... other config

5. Create Separate Jobs for Different Test Suites¶

Organize tests into logical groups:

jobs:
  test-happy-paths:
    # Test successful scenarios

  test-error-handling:
    # Test edge cases and errors

  test-security:
    # Test authentication, authorization

  test-performance:
    # Test with long conversations

6. Archive Important Test Runs¶

Keep artifacts for critical tests:

- uses: ./.github/actions/layercode-gym-test
  with:
    upload-artifacts: true

- uses: actions/upload-artifact@v4
  with:
    name: production-validation-${{ github.sha }}
    path: conversations/
    retention-days: 90  # Keep for 90 days

Troubleshooting¶

Tests Timing Out¶

Symptom: Conversations don't complete

Solutions: - Verify SERVER_URL is correct and accessible - Check server logs for errors - Increase max-turns if legitimate - Reduce number of personas to isolate issue

Judge Always Failing¶

Symptom: All conversations fail judge evaluation

Solutions: - Review judge feedback in artifacts - Simplify criteria to isolate issue - Test locally first with CLI - Set fail-on-judge-failure: false temporarily

Concurrent Run Conflicts¶

Symptom: Tests interfere with each other

Solution: Ensure all workflows have matching concurrency groups:

concurrency:
  group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
  cancel-in-progress: false

Advanced Topics¶

Custom Post-Processing¶

Process results after tests complete:

- uses: ./.github/actions/layercode-gym-test
  id: test
  with:
    # ... config

- name: Analyze results
  run: |
    python scripts/analyze_results.py \
      --results-dir ${{ steps.test.outputs.results-path }} \
      --passed ${{ steps.test.outputs.conversations-passed }} \
      --failed ${{ steps.test.outputs.conversations-failed }}

Integration with Other Tools¶

Send results to external systems:

- name: Send to Slack
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "Voice agent tests failed: ${{ steps.test.outputs.conversations-failed }} conversations"
      }

Matrix Testing¶

Test multiple configurations:

strategy:
  matrix:
    agent-version: [v1, v2]
    model: [openai:gpt-4o-mini, anthropic:claude-sonnet-4-5]

steps:
  - uses: ./.github/actions/layercode-gym-test
    with:
      model: ${{ matrix.model }}
      # ... other config

Next Steps¶

See complete action documentation
View example workflows
api-agents CLI - Swap webhook URLs to test PR backends in CI
Read API reference
Explore advanced features

Support¶

Issues: GitHub Issues
Discussions: GitHub Discussions
LayerCode Docs: docs.layercode.com