CI/CD Issues Analysis and Fix Proposal

Current State Analysis

Problems Identified

  1. Multiple Redundant CI Workflows

    • ci-pr.yml - PR validation workflow
    • ci-optimized.yml - Docker layer reuse workflow
    • ci-native.yml - GitHub Actions + Docker Buildx workflow
    • Multiple backup workflows in .github/workflows/backup*/
    • Total: 39 workflow files
  2. Resource Contention on Self-Hosted Runner

    • All workflows use [self-hosted, Linux, X64] runner
    • 3 workflows stuck in "queued" state waiting for runner
    • Test job stuck for >1 hour running cargo test --workspace in Docker container
    • No timeout on Docker test execution
  3. Inconsistent Concurrency Settings

    • ci-pr.yml: Has concurrency with cancel-in-progress: true
    • ci-optimized.yml: Has concurrency with cancel-in-progress: true
    • ci-native.yml: Has concurrency but cancel-in-progress: true commented out
    • ci-main.yml: No concurrency settings
  4. Missing Job Timeouts

    • Docker test commands (ci-optimized.yml:307-313) have no timeout
    • Tests can hang indefinitely waiting for network/resources
    • Some jobs have no timeout-minutes defined
  5. Workflow Triggers Overlap

    • All CI workflows trigger on pull_request events
    • Multiple workflows trigger simultaneously on PR creation/update
    • Creates workflow explosion on every push

Immediate Fixes

Fix 1: Add Timeout to Docker Test Execution

File: .github/workflows/ci-optimized.yml

Change: Add timeout wrapper around Docker test command

# BEFORE (line 307-313):
- name: Run tests
  run: |
    docker run --rm \
      -v $PWD:/workspace \
      -w /workspace \
      ${{ needs.build-base-image.outputs.image-tag }} \
      cargo test --workspace

# AFTER:
- name: Run tests
  timeout-minutes: 15
  run: |
    docker run --rm \
      -v $PWD:/workspace \
      -w /workspace \
      ${{ needs.build-base-image.outputs.image-tag }} \
      bash -c "timeout 10m cargo test --workspace || exit 1"

Fix 2: Standardize Concurrency Settings

Files: All CI workflow files

Change: Ensure consistent concurrency settings

# Add to all CI workflows:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

Fix 3: Disable Redundant Workflows

Action: Disable ci-optimized.yml and ci-native.yml temporarily

# Rename to disable:
mv .github/workflows/ci-optimized.yml .github/workflows/ci-optimized.yml.disabled
mv .github/workflows/ci-native.yml .github/workflows/ci-native.yml.disabled

# Keep only:
# - ci-pr.yml for PR validation
# - ci-main.yml for main branch builds

Fix 4: Add Comprehensive Timeouts

Files: All CI workflow files

Change: Add job-level timeouts to all jobs

# Example timeouts:
- setup: 5 minutes
- build-frontend: 20 minutes
- rust-build: 45 minutes
- test: 15 minutes
- lint-and-format: 10 minutes

Fix 5: Improve Test Execution

Files: ci-pr.yml, ci-optimized.yml

Change: Run tests with better timeout and parallelization

- name: Run tests
  timeout-minutes: 15
  run: |
    docker run --rm \
      -v $PWD:/workspace \
      -w /workspace \
      ${{ needs.build-base-image.outputs.image-tag }} \
      bash -c "
        # Set reasonable test timeout
        export RUST_TEST_TIMEOUT=600

        # Run with limited parallelism
        cargo test --workspace \
          --timeout 600 \
          --test-threads 2 \
          -- -Z unstable-options \
          --report-time \
          --test-threads=2
      "

Fix 6: Clean Up Backup Workflows

Action: Move backup workflows to archive

mkdir -p .github/workflows/archive
mv .github/workflows/backup_* .github/workflows/archive/
mv .github/workflows/backup/ .github/workflows/archive/

Long-Term Improvements

1. Single Source of Truth CI Workflow

Create one unified CI workflow that handles:

  • PR validation (with quick checks)
  • Main branch builds (with full builds)
  • Matrix builds for different targets

2. Workflow Trigger Management

  • Use workflow_dispatch for testing
  • Restrict automatic triggers to essential events
  • Implement workflow prioritization

3. Runner Configuration

  • Add multiple self-hosted runners if possible
  • Use GitHub-hosted runners for non-critical jobs
  • Implement runner labels for job routing

4. Monitoring and Alerting

  • Add workflow duration metrics
  • Alert on stuck workflows (>30 min)
  • Implement automatic cancellation of stale workflows

5. Caching Strategy

  • Improve Cargo caching
  • Use Docker layer caching effectively
  • Cache build artifacts across workflows

Implementation Priority

  1. CRITICAL (Do immediately):

    • Add timeout to Docker test execution (Fix 1)
    • Disable redundant workflows (Fix 3)
  2. HIGH (This week):

    • Standardize concurrency settings (Fix 2)
    • Add comprehensive timeouts (Fix 4)
    • Improve test execution (Fix 5)
  3. MEDIUM (Next sprint):

    • Clean up backup workflows (Fix 6)
    • Implement single unified CI workflow
  4. LOW (Future):

    • Long-term improvements
    • Monitoring and alerting

Testing the Fixes

After implementing fixes:

  1. Create test PR to validate CI works
  2. Monitor workflow execution times
  3. Verify no workflows get stuck
  4. Check that queued jobs complete
  5. Confirm no duplicate job execution

Success Criteria

  • CI completes within 30 minutes for PRs
  • CI completes within 60 minutes for main branch
  • No jobs stuck >15 minutes
  • Only 1 CI workflow active per branch
  • Clear, actionable error messages on failure