Terraphim AI Testing Infrastructure Improvement Plan
Created: Tue Nov 11 2025 Status: ACTIVE Priority: HIGH
Executive Summary
This plan addresses critical testing infrastructure issues identified in the test and benchmark report. The focus is on creating a robust, modular testing framework that can execute all tests and benchmarks reliably within reasonable timeframes.
Phase 1: Immediate Fixes (Critical Issues)
1.1 Fix Multi-Agent Benchmark Configuration
Issue: test-utils feature flag missing from benchmark execution
Root Cause: run-benchmarks.sh doesn't include required feature flags
Status: ⏳ PENDING
Solution:
- Update `run-benchmarks.sh` line 46 to include `--features test-utils`
- Modify the `cargo bench` command: `cargo bench --features test-utils --bench agent_operations`
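The fix above can be sketched as a small helper that assembles the corrected invocation, so the feature flag travels with the command instead of being repeated by hand (a sketch; the surrounding structure of `run-benchmarks.sh` is assumed):

```shell
# Build the corrected `cargo bench` command line for a given bench target.
# Keeping --features test-utils inside the builder means no call site can
# drop it again.
bench_cmd() {
  printf '%s' "cargo bench --features test-utils --bench $1"
}
```

A caller would then run `$(bench_cmd agent_operations)` or log the string before executing it.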
1.2 Create Modular Test Execution Strategy
Issue: Full test suite compilation exceeds timeout limits
Root Cause: MCP server and heavy dependencies slow down compilation
Status: ⏳ PENDING
Solution:
- Create separate test categories: `core-tests`, `integration-tests`, `mcp-tests`
- Implement a `--category` flag in `run_all_tests.sh`
- Exclude the MCP server from core test runs
Phase 2: Enhanced Test Infrastructure
2.1 Create Tiered Test Scripts
New Scripts to Create:
- `scripts/run_core_tests.sh` - Fast unit tests only
- `scripts/run_integration_tests.sh` - Service-dependent tests
- `scripts/run_mcp_tests.sh` - MCP-specific tests
- `scripts/run_performance_tests.sh` - All benchmarks with proper feature flags
2.2 Optimize Test Execution
Improvements:
- Add parallel test execution with tuned `--test-threads`
- Implement test result caching
- Add timeout management per test category
- Create test dependency graph for optimal execution order
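The per-category timeout management above can be sketched as a thin wrapper around coreutils `timeout` (the per-category second counts are assumptions drawn from the targets later in this plan):

```shell
# Run a command under a hard time limit and report the outcome on stdout.
# `timeout` exits non-zero both on command failure and on expiry, so the
# two cases are reported together here.
run_with_timeout() {
  local seconds="$1"; shift
  if timeout "$seconds" "$@"; then
    echo "PASS"
  else
    echo "TIMEOUT_OR_FAIL"
    return 1
  fi
}
```

For example, the core category might use `run_with_timeout 300 scripts/run_core_tests.sh`.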
Phase 3: Benchmark Infrastructure Fixes
3.1 Fix All Benchmark Configurations
Actions:
- Update `run-benchmarks.sh` to handle feature flags per crate
- Add benchmark timeout management
- Implement fallback for missing dependencies
- Create benchmark-specific Cargo.toml profiles
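Per-crate feature-flag handling could look like the following sketch; the `agent_operations` entry reflects the fix in 1.1, while any other bench names and their flags are assumptions to be filled in:

```shell
# Map each bench target to the feature flags its crate requires.
# Targets absent from the map get no extra flags.
declare -A BENCH_FEATURES=(
  [agent_operations]="test-utils"
)

# Print the --features argument for a bench target, or nothing.
bench_flags() {
  local feats="${BENCH_FEATURES[$1]:-}"
  if [ -n "$feats" ]; then
    printf -- '--features %s' "$feats"
  fi
}
```

`run-benchmarks.sh` would then call `cargo bench $(bench_flags "$target") --bench "$target"`.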
3.2 Add Missing Benchmark Features
Enhancements:
- Add memory usage benchmarks for persistence
- Create performance regression detection
- Implement automated baseline comparison
- Add benchmark result archiving
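Result archiving might be as simple as copying Criterion's output directory into a timestamped folder (a sketch; `target/criterion` is Criterion's default output path, and the archive root is an assumption):

```shell
# Copy benchmark output into a dated archive directory and print its path,
# so successive runs accumulate for trend analysis.
archive_benchmarks() {
  local src="${1:-target/criterion}" dest_root="${2:-bench-archive}"
  local dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest"
  cp -r "$src" "$dest/"
  echo "$dest"
}
```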
Phase 4: CI/CD Integration
4.1 Implement Parallel CI Pipeline
Strategy:
- Split CI into stages: unit, integration, performance
- Use matrix builds for different test categories
- Add test result aggregation
- Implement performance regression alerts
4.2 Add Test Coverage Reporting
Implementation:
- Install `cargo-tarpaulin` for coverage
- Generate coverage reports per crate
- Create coverage thresholds
- Add coverage badges to README
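The threshold enforcement above could be wired through tarpaulin's own flags; a sketch (the flags shown exist in recent `cargo-tarpaulin` releases, but verify against the installed version):

```shell
# Build the coverage invocation; --fail-under makes the CI step fail when
# workspace coverage drops below the given percentage.
coverage_cmd() {
  printf '%s' "cargo tarpaulin --workspace --out Xml --fail-under ${1:-80}"
}
```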
Phase 5: Advanced Testing Features
5.1 Performance Baselines
Create:
- Define performance thresholds for all benchmarks
- Implement automated performance regression detection
- Add performance trend analysis
- Create performance dashboard
5.2 Integration Test Environment
Setup:
- Create dedicated test environment configuration
- Implement service health checks
- Add test data fixtures
- Create test isolation mechanisms
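The service health checks above can be sketched as a retry loop that gates integration tests on service readiness (the probe command and retry budget are assumptions):

```shell
# Poll a probe command until it succeeds or the retry budget runs out.
# Prints "healthy" or "unhealthy" and returns a matching exit status.
wait_for_service() {
  local retries="$1"; shift
  local i
  for ((i = 1; i <= retries; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    sleep 1
  done
  echo "unhealthy"
  return 1
}
```

A typical probe might be `wait_for_service 30 curl -fsS http://localhost:8000/health` (URL hypothetical).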
Detailed Implementation Steps
Step 1: Fix Benchmark Script (Immediate)
File: scripts/run-benchmarks.sh
Line: 46
Change: add `--features test-utils` to the `cargo bench` invocation at line 46.
Step 2: Create Core Test Script
New File: scripts/run_core_tests.sh
Focus: terraphim_types, terraphim_automata, terraphim_persistence
Exclude: MCP server, integration tests
Timeout: 5 minutes
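The command this script would run can be sketched from the focus list above (package names come from the plan; restricting to `--lib` to keep only fast unit tests is an assumption):

```shell
# Assemble the core-crate unit-test command for run_core_tests.sh.
core_test_cmd() {
  local cmd="cargo test --lib"
  local c
  for c in terraphim_types terraphim_automata terraphim_persistence; do
    cmd+=" -p $c"
  done
  printf '%s' "$cmd"
}
```

The script would then execute it under the 5-minute cap, e.g. `timeout 300 $(core_test_cmd)`.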
Step 3: Update Main Test Script
File: scripts/run_all_tests.sh
Changes:
- Add a `--category` flag
- Implement conditional test execution
- Add better timeout management
- Separate MCP testing
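The `--category` dispatch could be a simple case statement; the category names follow the tiered scripts in Phase 2, and the mapping itself is an assumption:

```shell
# Resolve a --category value to the script that runs it; unknown
# categories fail loudly rather than silently running everything.
category_cmd() {
  case "$1" in
    core)        printf '%s' "scripts/run_core_tests.sh" ;;
    integration) printf '%s' "scripts/run_integration_tests.sh" ;;
    mcp)         printf '%s' "scripts/run_mcp_tests.sh" ;;
    *)           echo "unknown category: $1" >&2; return 1 ;;
  esac
}
```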
Step 4: Add Performance Regression Detection
New File: scripts/check_performance_regression.sh
Functionality:
- Compare current benchmarks with baselines
- Alert on performance degradation > 10%
- Generate performance trend reports
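The comparison logic could be sketched as follows. The input format (one `name mean_ns` pair per line, sorted by name, extracted from Criterion output beforehand) is an assumption; only the >10% threshold comes from the plan:

```shell
# Join baseline and current measurements by benchmark name and flag any
# benchmark whose current mean exceeds the baseline by more than 10%.
# Both files must be sorted by name, as `join` requires.
check_regression() {
  local baseline="$1" current="$2"
  join "$baseline" "$current" | awk '
    $3 > $2 * 1.10 { print $1 " REGRESSED"; bad = 1 }
    END { exit bad }'
}
```

The non-zero exit status on regression lets CI fail the step directly.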
Step 5: Implement Test Coverage
Add to CI pipeline:
- Generate coverage badges
- Set minimum coverage thresholds
Expected Outcomes
Test Execution Time Improvements
- Core Tests: < 2 minutes (vs current timeout)
- Integration Tests: < 10 minutes
- Full Suite: < 15 minutes (vs current failure)
- Benchmarks: Complete execution with all features
Quality Improvements
- Test Coverage: > 80% across all crates
- Performance Regression: Automated detection
- CI Pipeline: Parallel execution with clear stages
- Benchmark Reliability: Consistent execution with proper feature flags
Monitoring and Reporting
- Real-time Test Status: Dashboard with test progress
- Performance Trends: Historical benchmark data
- Coverage Reports: Per-crate coverage metrics
- Regression Alerts: Automated notifications
Implementation Priority
High Priority (Day 1)
- [ ] Fix benchmark feature flags
- [ ] Create core test script
- [ ] Update main test script with categories
Medium Priority (Week 1)
- [ ] Implement modular test execution
- [ ] Add coverage reporting
- [ ] Fix all benchmark configurations
Low Priority (Month 1)
- [ ] Advanced performance monitoring
- [ ] CI optimization
- [ ] Performance dashboard
Progress Tracking
Completed Tasks
- [x] Initial test and benchmark analysis
- [x] Issue identification and root cause analysis
- [x] Comprehensive plan creation
- [x] Fix benchmark script to include test-utils feature flag
- [x] Create core test script for fast unit tests
- [x] Update main test script with modular execution
- [x] Test core script execution (successful: 53 tests passed)
- [x] Verify benchmark script fix (compilation successful, execution in progress)
- [x] Complete benchmark execution verification (benchmarks compile and run properly)
- [x] Create comprehensive MCP test script for isolated MCP testing
- [x] Test MCP script execution (successful: 5 middleware tests passed, server compiles)
In Progress
- [ ] Fix remaining compilation issues in workspace (minor TUI feature flags)
Next Immediate Actions
- [ ] Create integration test script for service-dependent tests
- [ ] Implement performance regression detection script
- [ ] Add test coverage reporting with cargo-tarpaulin
Blocked
- [ ] Integration test environment setup
- [ ] CI/CD pipeline modifications
Risk Assessment
High Risk
- Timeout Issues: May require additional timeout optimizations
- Dependency Conflicts: MCP server dependencies may cause compilation issues
Medium Risk
- Performance Regression: New test infrastructure may affect performance
- Coverage Tooling: cargo-tarpaulin may have compatibility issues
Low Risk
- Script Modifications: Low risk, easily reversible
- Feature Flag Changes: Isolated impact
Success Metrics
Quantitative
- Test execution time reduced by 80%
- Benchmark success rate increased to 100%
- Test coverage > 80%
- Zero timeout failures
Qualitative
- Improved developer experience
- Reliable CI/CD pipeline
- Better performance monitoring
- Clear test categorization
Last Updated: Tue Nov 11 2025 Next Review: Daily during implementation phase