CI/CD Issues Analysis and Fix Proposal
Current State Analysis
Problems Identified
-
Multiple Redundant CI Workflows
ci-pr.yml- PR validation workflowci-optimized.yml- Docker layer reuse workflowci-native.yml- GitHub Actions + Docker Buildx workflow- Multiple backup workflows in
.github/workflows/backup*/ - Total: 39 workflow files
-
Resource Contention on Self-Hosted Runner
- All workflows use
[self-hosted, Linux, X64]runner - 3 workflows stuck in "queued" state waiting for runner
- Test job stuck for >1 hour running
cargo test --workspacein Docker container - No timeout on Docker test execution
- All workflows use
-
Inconsistent Concurrency Settings
ci-pr.yml: Has concurrency withcancel-in-progress: trueci-optimized.yml: Has concurrency withcancel-in-progress: trueci-native.yml: Has concurrency butcancel-in-progress: truecommented outci-main.yml: No concurrency settings
-
Missing Job Timeouts
- Docker test commands (ci-optimized.yml:307-313) have no timeout
- Tests can hang indefinitely waiting for network/resources
- Some jobs have no timeout-minutes defined
-
Workflow Triggers Overlap
- All CI workflows trigger on
pull_requestevents - Multiple workflows trigger simultaneously on PR creation/update
- Creates workflow explosion on every push
- All CI workflows trigger on
Immediate Fixes
Fix 1: Add Timeout to Docker Test Execution
File: .github/workflows/ci-optimized.yml
Change: Add timeout wrapper around Docker test command
# BEFORE (line 307-313):
- name: Run tests
run: |
docker run --rm \
-v $PWD:/workspace \
-w /workspace \
${{ needs.build-base-image.outputs.image-tag }} \
cargo test --workspace
# AFTER:
- name: Run tests
timeout-minutes: 15
run: |
docker run --rm \
-v $PWD:/workspace \
-w /workspace \
${{ needs.build-base-image.outputs.image-tag }} \
bash -c "timeout 10m cargo test --workspace || exit 1"Fix 2: Standardize Concurrency Settings
Files: All CI workflow files
Change: Ensure consistent concurrency settings
# Add to all CI workflows:
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: trueFix 3: Disable Redundant Workflows
Action: Disable ci-optimized.yml and ci-native.yml temporarily
# Rename to disable:
# Keep only:
# - ci-pr.yml for PR validation
# - ci-main.yml for main branch buildsFix 4: Add Comprehensive Timeouts
Files: All CI workflow files
Change: Add job-level timeouts to all jobs
# Example timeouts:
- setup: 5 minutes
- build-frontend: 20 minutes
- rust-build: 45 minutes
- test: 15 minutes
- lint-and-format: 10 minutesFix 5: Improve Test Execution
Files: ci-pr.yml, ci-optimized.yml
Change: Run tests with better timeout and parallelization
- name: Run tests
timeout-minutes: 15
run: |
docker run --rm \
-v $PWD:/workspace \
-w /workspace \
${{ needs.build-base-image.outputs.image-tag }} \
bash -c "
# Set reasonable test timeout
export RUST_TEST_TIMEOUT=600
# Run with limited parallelism
cargo test --workspace \
--timeout 600 \
--test-threads 2 \
-- -Z unstable-options \
--report-time \
--test-threads=2
"Fix 6: Clean Up Backup Workflows
Action: Move backup workflows to archive
Long-Term Improvements
1. Single Source of Truth CI Workflow
Create one unified CI workflow that handles:
- PR validation (with quick checks)
- Main branch builds (with full builds)
- Matrix builds for different targets
2. Workflow Trigger Management
- Use workflow_dispatch for testing
- Restrict automatic triggers to essential events
- Implement workflow prioritization
3. Runner Configuration
- Add multiple self-hosted runners if possible
- Use GitHub-hosted runners for non-critical jobs
- Implement runner labels for job routing
4. Monitoring and Alerting
- Add workflow duration metrics
- Alert on stuck workflows (>30 min)
- Implement automatic cancellation of stale workflows
5. Caching Strategy
- Improve Cargo caching
- Use Docker layer caching effectively
- Cache build artifacts across workflows
Implementation Priority
-
CRITICAL (Do immediately):
- Add timeout to Docker test execution (Fix 1)
- Disable redundant workflows (Fix 3)
-
HIGH (This week):
- Standardize concurrency settings (Fix 2)
- Add comprehensive timeouts (Fix 4)
- Improve test execution (Fix 5)
-
MEDIUM (Next sprint):
- Clean up backup workflows (Fix 6)
- Implement single unified CI workflow
-
LOW (Future):
- Long-term improvements
- Monitoring and alerting
Testing the Fixes
After implementing fixes:
- Create test PR to validate CI works
- Monitor workflow execution times
- Verify no workflows get stuck
- Check that queued jobs complete
- Confirm no duplicate job execution
Success Criteria
- CI completes within 30 minutes for PRs
- CI completes within 60 minutes for main branch
- No jobs stuck >15 minutes
- Only 1 CI workflow active per branch
- Clear, actionable error messages on failure