Terraphim AI Testing Infrastructure Improvement Plan
Created: Tue Nov 11 2025 Status: ACTIVE Priority: HIGH
Executive Summary
This plan addresses critical testing infrastructure issues identified in the test and benchmark report. The focus is on creating a robust, modular testing framework that can execute all tests and benchmarks reliably within reasonable timeframes.
Phase 1: Immediate Fixes (Critical Issues)
1.1 Fix Multi-Agent Benchmark Configuration
Issue: test-utils feature flag missing from benchmark execution
Root Cause: run-benchmarks.sh doesn't include required feature flags
Status: ⏳ PENDING
Solution:
- Update `run-benchmarks.sh` line 46 to include `--features test-utils`
- Modify the `cargo bench` command: `cargo bench --features test-utils --bench agent_operations`
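The fix above can be sketched as a small helper that assembles the corrected invocation, so the feature flag travels with the command instead of being repeated by hand (a sketch; the surrounding structure of `run-benchmarks.sh` is assumed):

```shell
# Build the corrected `cargo bench` command line for a given bench target.
# Keeping --features test-utils inside the builder means no call site can
# drop it again.
bench_cmd() {
  printf '%s' "cargo bench --features test-utils --bench $1"
}
```

A caller would then run `$(bench_cmd agent_operations)` or log the string before executing it.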
1.2 Create Modular Test Execution Strategy
Issue: Full test suite compilation exceeds timeout limits
Root Cause: MCP server and heavy dependencies slow down compilation
Status: ⏳ PENDING
Solution:
- Create separate test categories: `core-tests`, `integration-tests`, `mcp-tests`
- Implement a `--category` flag in `run_all_tests.sh`
- Exclude the MCP server from core test runs
Phase 2: Enhanced Test Infrastructure
2.1 Create Tiered Test Scripts
New Scripts to Create:
- `scripts/run_core_tests.sh` - Fast unit tests only
- `scripts/run_integration_tests.sh` - Service-dependent tests
- `scripts/run_mcp_tests.sh` - MCP-specific tests
- `scripts/run_performance_tests.sh` - All benchmarks with proper feature flags
2.2 Optimize Test Execution
Improvements:
- Add parallel test execution with tuned `--test-threads`
- Implement test result caching
- Add timeout management per test category
- Create test dependency graph for optimal execution order
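The per-category timeout management above can be sketched as a thin wrapper around coreutils `timeout` (the per-category second counts are assumptions drawn from the targets later in this plan):

```shell
# Run a command under a hard time limit and report the outcome on stdout.
# `timeout` exits non-zero both on command failure and on expiry, so the
# two cases are reported together here.
run_with_timeout() {
  local seconds="$1"; shift
  if timeout "$seconds" "$@"; then
    echo "PASS"
  else
    echo "TIMEOUT_OR_FAIL"
    return 1
  fi
}
```

For example, the core category might use `run_with_timeout 300 scripts/run_core_tests.sh`.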
Phase 3: Benchmark Infrastructure Fixes
3.1 Fix All Benchmark Configurations
Actions:
- Update `run-benchmarks.sh` to handle feature flags per crate
- Add benchmark timeout management
- Implement fallback for missing dependencies
- Create benchmark-specific Cargo.toml profiles
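Per-crate feature-flag handling could look like the following sketch; the `agent_operations` entry reflects the fix in 1.1, while any other bench names and their flags are assumptions to be filled in:

```shell
# Map each bench target to the feature flags its crate requires.
# Targets absent from the map get no extra flags.
declare -A BENCH_FEATURES=(
  [agent_operations]="test-utils"
)

# Print the --features argument for a bench target, or nothing.
bench_flags() {
  local feats="${BENCH_FEATURES[$1]:-}"
  if [ -n "$feats" ]; then
    printf -- '--features %s' "$feats"
  fi
}
```

`run-benchmarks.sh` would then call `cargo bench $(bench_flags "$target") --bench "$target"`.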
3.2 Add Missing Benchmark Features
Enhancements:
- Add memory usage benchmarks for persistence
- Create performance regression detection
- Implement automated baseline comparison
- Add benchmark result archiving
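Result archiving might be as simple as copying Criterion's output directory into a timestamped folder (a sketch; `target/criterion` is Criterion's default output path, and the archive root is an assumption):

```shell
# Copy benchmark output into a dated archive directory and print its path,
# so successive runs accumulate for trend analysis.
archive_benchmarks() {
  local src="${1:-target/criterion}" dest_root="${2:-bench-archive}"
  local dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest"
  cp -r "$src" "$dest/"
  echo "$dest"
}
```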
Phase 4: CI/CD Integration
4.1 Implement Parallel CI Pipeline
Strategy:
- Split CI into stages: unit, integration, performance
- Use matrix builds for different test categories
- Add test result aggregation
- Implement performance regression alerts
4.2 Add Test Coverage Reporting
Implementation:
- Install `cargo-tarpaulin` for coverage
- Generate coverage reports per crate
- Create coverage thresholds
- Add coverage badges to README
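The threshold enforcement above could be wired through tarpaulin's own flags; a sketch (the flags shown exist in recent `cargo-tarpaulin` releases, but verify against the installed version):

```shell
# Build the coverage invocation; --fail-under makes the CI step fail when
# workspace coverage drops below the given percentage.
coverage_cmd() {
  printf '%s' "cargo tarpaulin --workspace --out Xml --fail-under ${1:-80}"
}
```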
Phase 5: Advanced Testing Features
5.1 Performance Baselines
Create:
- Define performance thresholds for all benchmarks
- Implement automated performance regression detection
- Add performance trend analysis
- Create performance dashboard
5.2 Integration Test Environment
Setup:
- Create dedicated test environment configuration
- Implement service health checks
- Add test data fixtures
- Create test isolation mechanisms
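The service health checks above can be sketched as a retry loop that gates integration tests on service readiness (the probe command and retry budget are assumptions):

```shell
# Poll a probe command until it succeeds or the retry budget runs out.
# Prints "healthy" or "unhealthy" and returns a matching exit status.
wait_for_service() {
  local retries="$1"; shift
  local i
  for ((i = 1; i <= retries; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    sleep 1
  done
  echo "unhealthy"
  return 1
}
```

A typical probe might be `wait_for_service 30 curl -fsS http://localhost:8000/health` (URL hypothetical).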
Detailed Implementation Steps
Step 1: Fix Benchmark Script (Immediate)
File: scripts/run-benchmarks.sh
Line: 46
Change: add `--features test-utils` to the `cargo bench` invocation at line 46.
Step 2: Create Core Test Script
New File: scripts/run_core_tests.sh
Focus: terraphim_types, terraphim_automata, terraphim_persistence
Exclude: MCP server, integration tests
Timeout: 5 minutes
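The command this script would run can be sketched from the focus list above (package names come from the plan; restricting to `--lib` to keep only fast unit tests is an assumption):

```shell
# Assemble the core-crate unit-test command for run_core_tests.sh.
core_test_cmd() {
  local cmd="cargo test --lib"
  local c
  for c in terraphim_types terraphim_automata terraphim_persistence; do
    cmd+=" -p $c"
  done
  printf '%s' "$cmd"
}
```

The script would then execute it under the 5-minute cap, e.g. `timeout 300 $(core_test_cmd)`.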
Step 3: Update Main Test Script
File: scripts/run_all_tests.sh
Changes:
- Add a `--category` flag
- Implement conditional test execution
- Add better timeout management
- Separate MCP testing
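The `--category` dispatch could be a simple case statement; the category names follow the tiered scripts in Phase 2, and the mapping itself is an assumption:

```shell
# Resolve a --category value to the script that runs it; unknown
# categories fail loudly rather than silently running everything.
category_cmd() {
  case "$1" in
    core)        printf '%s' "scripts/run_core_tests.sh" ;;
    integration) printf '%s' "scripts/run_integration_tests.sh" ;;
    mcp)         printf '%s' "scripts/run_mcp_tests.sh" ;;
    *)           echo "unknown category: $1" >&2; return 1 ;;
  esac
}
```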
Step 4: Add Performance Regression Detection
New File: scripts/check_performance_regression.sh
Functionality:
- Compare current benchmarks with baselines
- Alert on performance degradation > 10%
- Generate performance trend reports
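The comparison logic could be sketched as follows. The input format (one `name mean_ns` pair per line, sorted by name, extracted from Criterion output beforehand) is an assumption; only the >10% threshold comes from the plan:

```shell
# Join baseline and current measurements by benchmark name and flag any
# benchmark whose current mean exceeds the baseline by more than 10%.
# Both files must be sorted by name, as `join` requires.
check_regression() {
  local baseline="$1" current="$2"
  join "$baseline" "$current" | awk '
    $3 > $2 * 1.10 { print $1 " REGRESSED"; bad = 1 }
    END { exit bad }'
}
```

The non-zero exit status on regression lets CI fail the step directly.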
Step 5: Implement Test Coverage
Add to CI pipeline:
- Generate coverage badges
- Set minimum coverage thresholds
Expected Outcomes
Test Execution Time Improvements
- Core Tests: < 2 minutes (vs current timeout)
- Integration Tests: < 10 minutes
- Full Suite: < 15 minutes (vs current failure)
- Benchmarks: Complete execution with all features
Quality Improvements
- Test Coverage: > 80% across all crates
- Performance Regression: Automated detection
- CI Pipeline: Parallel execution with clear stages
- Benchmark Reliability: Consistent execution with proper feature flags
Monitoring and Reporting
- Real-time Test Status: Dashboard with test progress
- Performance Trends: Historical benchmark data
- Coverage Reports: Per-crate coverage metrics
- Regression Alerts: Automated notifications
Implementation Priority
High Priority (Day 1)
- [ ] Fix benchmark feature flags
- [ ] Create core test script
- [ ] Update main test script with categories
Medium Priority (Week 1)
- [ ] Implement modular test execution
- [ ] Add coverage reporting
- [ ] Fix all benchmark configurations
Low Priority (Month 1)
- [ ] Advanced performance monitoring
- [ ] CI optimization
- [ ] Performance dashboard
Progress Tracking
Completed Tasks
- [x] Initial test and benchmark analysis
- [x] Issue identification and root cause analysis
- [x] Comprehensive plan creation
- [x] Fix benchmark script to include test-utils feature flag
- [x] Create core test script for fast unit tests
- [x] Update main test script with modular execution
- [x] Test core script execution (successful: 53 tests passed)
- [x] Verify benchmark script fix (compilation successful, execution in progress)
- [x] Complete benchmark execution verification (benchmarks compile and run properly)
- [x] Create comprehensive MCP test script for isolated MCP testing
- [x] Test MCP script execution (successful: 5 middleware tests passed, server compiles)
In Progress
- [ ] Fix remaining compilation issues in workspace (minor TUI feature flags)
Next Immediate Actions
- [ ] Create integration test script for service-dependent tests
- [ ] Implement performance regression detection script
- [ ] Add test coverage reporting with cargo-tarpaulin
Blocked
- [ ] Integration test environment setup
- [ ] CI/CD pipeline modifications
Risk Assessment
High Risk
- Timeout Issues: May require additional timeout optimizations
- Dependency Conflicts: MCP server dependencies may cause compilation issues
Medium Risk
- Performance Regression: New test infrastructure may affect performance
- Coverage Tooling: cargo-tarpaulin may have compatibility issues
Low Risk
- Script Modifications: Low risk, easily reversible
- Feature Flag Changes: Isolated impact
Success Metrics
Quantitative
- Test execution time reduced by 80%
- Benchmark success rate increased to 100%
- Test coverage > 80%
- Zero timeout failures
Qualitative
- Improved developer experience
- Reliable CI/CD pipeline
- Better performance monitoring
- Clear test categorization
Last Updated: Tue Nov 11 2025 Next Review: Daily during implementation phase