Skip to content

GitHub Actions Integration

Run automated tests for your voice agents in CI/CD pipelines with the LayerCode Gym GitHub Action.

Overview

The LayerCode Gym GitHub Action enables you to:

  • Test multiple personas in parallel for maximum speed
  • Scripted conversations for deterministic regression testing
  • Automated judging with LLM-based quality evaluation
  • Track quality over time with historical test results
  • Continuous regression testing on every commit
  • Zero setup complexity - just configure and run

Quick Start

1. Set Up Secrets

Go to your repository: Settings → Secrets and variables → Actions → New repository secret

Add these required secrets:

Secret Name Description Where to Get
SERVER_URL Your backend server URL Your infrastructure
LAYERCODE_AGENT_ID LayerCode agent ID LayerCode Dashboard
OPENAI_API_KEY OpenAI API key OpenAI Platform

Optional but recommended:

Secret Name Description
LOGFIRE_TOKEN LogFire token for observability

2. Create Workflow File

Create .github/workflows/test-agent.yml:

name: Test Voice Agent
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    # IMPORTANT: Prevent concurrent runs
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4

      - name: Test agent
        uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: You are a potential customer
              intent: Learn about pricing and features
          judge-enabled: true
          judge-criteria: |
            - Did the agent provide clear pricing information?
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

3. Run Your Tests

Push your code and watch the tests run automatically!

Configuration

Conversation Types

The action supports two types of conversations that can be mixed:

AI Personas (Dynamic)

Personas define users with AI-driven responses. Each persona has:

  • background: Who the user is (role, context, characteristics)
  • intent: What they want to achieve
personas: |
  - background: You are a 35-year-old small business owner interested in AI
    intent: Learn how voice AI can help your customer service

  - background: You are a technical developer evaluating APIs
    intent: Understand integration requirements and documentation

Tips for writing good personas:

  • Be specific about who the user is
  • Include relevant context (frustrations, goals, technical level)
  • Make intents clear and actionable
  • Avoid making personas too similar

Scripted Messages (Deterministic)

Scripted conversations send exact message sequences for reproducible tests:

personas: |
  - messages:
      - Hello, I need to check my account balance
      - My account number is 12345
      - Thank you, goodbye

Use cases for scripted conversations:

  • Regression testing specific scenarios
  • Testing exact edge cases
  • Reproducing reported issues
  • Load testing with known inputs

Mixed Configurations

Combine both types in a single test run:

personas: |
  # Dynamic persona for exploratory testing
  - background: You are a frustrated customer
    intent: Get a refund for a defective product

  # Scripted conversation for regression testing
  - messages:
      - Hello, what are your business hours?
      - Do you have weekend support?
      - Thanks

  # Another dynamic persona
  - background: You are a technical user
    intent: Ask about API rate limits

Judge Criteria

The judge evaluates whether conversations meet your quality standards. Each criterion is a yes/no question that gets evaluated independently—all must pass for the conversation to pass.

Good criteria are: - Specific, measurable yes/no questions - Focused on agent behavior - Clear pass/fail conditions

Examples:

# Simple criteria
judge-criteria: |
  - Did the agent provide accurate product information?

# Detailed criteria
judge-criteria: |
  - Did the agent greet the user professionally?
  - Did the agent answer all questions with accurate information?
  - Did the agent offer next steps (demo, documentation, contact)?
  - Did the agent maintain a helpful and empathetic tone throughout?

# Domain-specific criteria
judge-criteria: |
  - Did the agent mention API authentication methods?
  - Did the agent provide documentation links?
  - Did the agent explain rate limits correctly?
  - Did the agent avoid providing incorrect technical details?

Audio Storage

By default, audio storage is disabled in the GitHub Action for faster CI execution. The judge only needs text transcripts, not audio files.

To enable audio storage (requires ffmpeg on the runner):

- uses: ./.github/actions/layercode-gym-test
  with:
    store-audio: true  # Enable audio file storage
    # ... other config

Full Configuration Options

See the Action README for complete documentation of all inputs and outputs.

Use Cases

Regression Testing

Run tests on every push to catch regressions. Use scripted conversations for deterministic tests:

name: Regression Tests
on: [push, pull_request]

jobs:
  regression:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            # Scripted regression tests for known scenarios
            - messages:
                - Hello, I want to check my order status
                - Order number 12345
                - Thanks

            # AI persona for dynamic testing
            - background: Returning customer
              intent: Check order status

            - background: New user
              intent: Create account
          judge-enabled: true
          judge-criteria: |
            - Did the agent handle the request correctly?
            - Did the agent avoid errors during the conversation?
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Nightly Quality Checks

Run comprehensive tests daily:

name: Nightly Quality Checks
on:
  schedule:
    - cron: '0 2 * * *'  # 2 AM daily

jobs:
  quality:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: Happy customer
              intent: Provide positive feedback

            - background: Angry customer
              intent: Demand refund

            - background: Confused user
              intent: Understand product features

            - background: Price-sensitive shopper
              intent: Find cheapest option

            - background: Enterprise buyer
              intent: Discuss bulk pricing
          max-turns: 10
          judge-enabled: true
          judge-criteria: |
            - Did the agent adapt tone to user's emotional state?
            - Did the agent provide accurate information?
            - Did the agent offer appropriate solutions?
            - Did the agent maintain professionalism?
          model: anthropic:claude-sonnet-4-5  # Use best model for quality checks
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          logfire-token: ${{ secrets.LOGFIRE_TOKEN }}

Pre-Deployment Validation

Require tests to pass before deploying:

name: Deploy Pipeline
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    concurrency:
      group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
      cancel-in-progress: false

    steps:
      - uses: actions/checkout@v4
      - name: Run pre-deployment tests
        uses: ./.github/actions/layercode-gym-test
        with:
          personas: |
            - background: Critical user journey 1
              intent: Complete purchase

            - background: Critical user journey 2
              intent: Get support

            - background: Critical user journey 3
              intent: Update account
          judge-enabled: true
          judge-criteria: |
            - Did all critical paths complete successfully?
            - Did the agent avoid errors during the conversation?
          fail-on-judge-failure: true  # Block deployment on failure
          server-url: ${{ secrets.SERVER_URL }}
          layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: echo "Deploying..."

A/B Testing Different Agent Versions

Test changes against baseline:

name: A/B Test Agent Changes

on: [pull_request]

jobs:
  test-baseline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: main  # Test against main branch

      - uses: ./.github/actions/layercode-gym-test
        id: baseline
        with:
          personas: ${{ env.TEST_PERSONAS }}
          judge-enabled: true
          # ... other config

  test-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # Test with PR changes

      - uses: ./.github/actions/layercode-gym-test
        id: changes
        with:
          personas: ${{ env.TEST_PERSONAS }}
          judge-enabled: true
          # ... other config

  compare:
    needs: [test-baseline, test-changes]
    runs-on: ubuntu-latest
    steps:
      - name: Compare results
        run: |
          echo "Baseline: ${{ needs.test-baseline.outputs.conversations-passed }} passed"
          echo "Changes: ${{ needs.test-changes.outputs.conversations-passed }} passed"

Best Practices

1. Use Concurrency Control

Always add concurrency control to prevent webhook conflicts:

concurrency:
  group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
  cancel-in-progress: false

LayerCode agents support one webhook at a time. The concurrency group ensures tests don't overlap.

2. Start Small, Scale Up

Begin with a few core personas and add more over time:

# Week 1: Core happy paths
personas: |
  - background: Happy customer
    intent: Complete a purchase

  - background: New user
    intent: Learn about the product

# Week 2: Add edge cases
personas: |
  - background: Happy customer
    intent: Complete a purchase
  - background: Frustrated customer
    intent: Request a refund

# Week 3: Add scripted regression tests
personas: |
  - background: Happy customer
    intent: Complete a purchase
  - messages:
      - Hello, I want to cancel my order
      - Order number 12345

3. Use Different Models for Different Tests

  • Fast CI (PR checks): openai:gpt-4o-mini
  • Nightly quality: anthropic:claude-sonnet-4-5
  • Critical path: openai:gpt-4o
# Fast PR checks
- uses: ./.github/actions/layercode-gym-test
  with:
    model: openai:gpt-4o-mini  # Fast and cheap
    max-turns: 3

# Comprehensive nightly
- uses: ./.github/actions/layercode-gym-test
  with:
    model: anthropic:claude-sonnet-4-5  # High quality
    max-turns: 10

4. Enable LogFire for Production Tests

Set LOGFIRE_TOKEN for deep observability:

  • Real-time conversation monitoring
  • Performance metrics
  • Debugging failed tests
  • Historical trends
- uses: ./.github/actions/layercode-gym-test
  with:
    logfire-token: ${{ secrets.LOGFIRE_TOKEN }}  # Enable observability
    # ... other config

5. Create Separate Jobs for Different Test Suites

Organize tests into logical groups:

jobs:
  test-happy-paths:
    # Test successful scenarios

  test-error-handling:
    # Test edge cases and errors

  test-security:
    # Test authentication, authorization

  test-performance:
    # Test with long conversations

6. Archive Important Test Runs

Keep artifacts for critical tests:

- uses: ./.github/actions/layercode-gym-test
  with:
    upload-artifacts: true

- uses: actions/upload-artifact@v4
  with:
    name: production-validation-${{ github.sha }}
    path: conversations/
    retention-days: 90  # Keep for 90 days

Troubleshooting

Tests Timing Out

Symptom: Conversations don't complete

Solutions: - Verify SERVER_URL is correct and accessible - Check server logs for errors - Increase max-turns if legitimate - Reduce number of personas to isolate issue

Judge Always Failing

Symptom: All conversations fail judge evaluation

Solutions: - Review judge feedback in artifacts - Simplify criteria to isolate issue - Test locally first with CLI - Set fail-on-judge-failure: false temporarily

Concurrent Run Conflicts

Symptom: Tests interfere with each other

Solution: Ensure all workflows have matching concurrency groups:

concurrency:
  group: layercode-gym-${{ secrets.LAYERCODE_AGENT_ID }}
  cancel-in-progress: false

Advanced Topics

Custom Post-Processing

Process results after tests complete:

- uses: ./.github/actions/layercode-gym-test
  id: test
  with:
    # ... config

- name: Analyze results
  run: |
    python scripts/analyze_results.py \
      --results-dir ${{ steps.test.outputs.results-path }} \
      --passed ${{ steps.test.outputs.conversations-passed }} \
      --failed ${{ steps.test.outputs.conversations-failed }}

Integration with Other Tools

Send results to external systems:

- name: Send to Slack
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "Voice agent tests failed: ${{ steps.test.outputs.conversations-failed }} conversations"
      }

Matrix Testing

Test multiple configurations:

strategy:
  matrix:
    agent-version: [v1, v2]
    model: [openai:gpt-4o-mini, anthropic:claude-sonnet-4-5]

steps:
  - uses: ./.github/actions/layercode-gym-test
    with:
      model: ${{ matrix.model }}
      # ... other config

Next Steps

Support