VM Execution Testing Plan
This document outlines the comprehensive testing strategy for the LLM-to-Firecracker VM code execution system.
Overview
The VM execution system enables LLM agents to safely execute code in isolated Firecracker microVMs. Testing must validate security, functionality, performance, and integration across multiple layers.
Test Architecture
┌─────────────────────────────────────────────────────────┐
│                    End-to-End Tests                     │
│              (Agent → HTTP/WebSocket → VM)              │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│                    Integration Tests                    │
│      (HTTP API, WebSocket Protocol, VM Management)      │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│                       Unit Tests                        │
│       (Code Extraction, Validation, Client Logic)       │
└─────────────────────────────────────────────────────────┘
Test Categories
1. Unit Tests (crates/terraphim_multi_agent/tests/vm_execution_tests.rs)
Code Extraction Tests:
- Multi-language code block extraction (Python, JavaScript, Bash, Rust)
- Inline executable code detection
- Confidence scoring accuracy
- Code metadata preservation
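As a rough illustration of what the extraction tests exercise, the sketch below pulls language-tagged fenced blocks out of an LLM response. The `CodeBlock` type and `extract_code_blocks` function are hypothetical names chosen for this example, not the API of `terraphim_multi_agent`, and confidence scoring is left out.

```rust
/// A fenced code block pulled out of an LLM response.
/// Hypothetical type for illustration; not the crate's actual API.
#[derive(Debug, PartialEq)]
pub struct CodeBlock {
    pub language: String,
    pub code: String,
}

/// Extracts Markdown-style fenced blocks that carry a language tag.
/// Untagged blocks and unterminated blocks are skipped.
pub fn extract_code_blocks(text: &str) -> Vec<CodeBlock> {
    // Build the three-backtick fence marker at runtime so this example
    // can itself be embedded in a fenced block.
    let fence = "`".repeat(3);
    let mut blocks = Vec::new();
    let mut current: Option<(String, Vec<&str>)> = None;

    for line in text.lines() {
        let trimmed = line.trim();
        if let Some(rest) = trimmed.strip_prefix(fence.as_str()) {
            match current.take() {
                // A fence seen while inside a block closes that block.
                Some((language, lines)) => {
                    if !language.is_empty() {
                        blocks.push(CodeBlock {
                            language,
                            code: lines.join("\n"),
                        });
                    }
                }
                // Otherwise the fence opens a block; the remainder of the
                // line is treated as the language tag.
                None => current = Some((rest.trim().to_lowercase(), Vec::new())),
            }
        } else if let Some((_, lines)) = current.as_mut() {
            lines.push(line);
        }
    }
    blocks
}
```

A unit test can feed a response containing Python and Bash blocks and assert on the number of blocks, their language tags, and that the code body is preserved verbatim.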
Execution Intent Detection:
- High confidence trigger detection ("run this", "execute")
- Medium confidence context detection ("can you run", "test this")
- Low confidence handling ("here's an example")
- False positive prevention
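The confidence tiers above could be approximated by phrase matching, as in this minimal sketch. The trigger phrases come from the list above; the `IntentConfidence` enum, the negation list, and the matching order are assumptions made for the example and may differ from the real detector.

```rust
/// Confidence that the user wants code to be executed.
/// Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum IntentConfidence {
    NoIntent,
    Low,
    Medium,
    High,
}

// Assumed negation phrases, used here to show false positive prevention.
const NEGATIONS: [&str; 2] = ["don't run", "do not run"];
const MEDIUM_PHRASES: [&str; 2] = ["can you run", "test this"];
const HIGH_PHRASES: [&str; 2] = ["run this", "execute"];
const LOW_PHRASES: [&str; 1] = ["here's an example"];

fn has_any(text: &str, phrases: &[&str]) -> bool {
    phrases.iter().any(|p| text.contains(*p))
}

/// Classifies a user message by simple case-insensitive substring matching.
pub fn detect_execution_intent(message: &str) -> IntentConfidence {
    let text = message.to_lowercase();
    if has_any(&text, &NEGATIONS) {
        return IntentConfidence::NoIntent;
    }
    // Medium phrases are checked first because a hedged question such as
    // "can you run this" also contains the high-confidence "run this".
    if has_any(&text, &MEDIUM_PHRASES) {
        IntentConfidence::Medium
    } else if has_any(&text, &HIGH_PHRASES) {
        IntentConfidence::High
    } else if has_any(&text, &LOW_PHRASES) {
        IntentConfidence::Low
    } else {
        IntentConfidence::NoIntent
    }
}
```

Substring matching is deliberately crude; the point is that each tier, plus the negative case, gets at least one explicit assertion in the unit tests.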
Security Validation:
- Dangerous pattern detection (rm -rf, curl | sh, eval)
- Language-specific restriction enforcement
- Code length limit validation
- Safe code acceptance
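A validator covering the pattern and length checks might look like the following sketch. The pattern list mirrors the examples named above; the `MAX_CODE_LEN` value, the `ValidationError` type, and the whitespace normalization step are assumptions for illustration only.

```rust
/// Reasons a snippet is rejected before it reaches a VM.
/// Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    TooLong { len: usize, max: usize },
    DangerousPattern(&'static str),
}

/// Assumed limit; the real limit is a configuration detail.
pub const MAX_CODE_LEN: usize = 10_000;

/// Substrings treated as dangerous, based on the examples in this plan.
const DANGEROUS_PATTERNS: [&str; 4] = ["rm -rf", "| sh", "|sh", "eval("];

/// Rejects oversized code and code containing a known dangerous substring.
pub fn validate_code(code: &str) -> Result<(), ValidationError> {
    if code.len() > MAX_CODE_LEN {
        return Err(ValidationError::TooLong {
            len: code.len(),
            max: MAX_CODE_LEN,
        });
    }
    // Collapse whitespace runs so "rm   -rf" still matches "rm -rf".
    let normalized = code.split_whitespace().collect::<Vec<_>>().join(" ");
    for pattern in DANGEROUS_PATTERNS.iter() {
        if normalized.contains(*pattern) {
            return Err(ValidationError::DangerousPattern(*pattern));
        }
    }
    Ok(())
}
```

Plain substring checks produce both false positives and false negatives, which is why the plan pairs them with VM-level isolation tests rather than relying on them alone.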
Client Logic:
- HTTP client configuration
- Request/response serialization
- Timeout handling
- Error propagation
- Authentication token handling
2. Integration Tests (scratchpad/firecracker-rust/fcctl-web/tests/llm_api_tests.rs)
HTTP API Endpoints:
- /api/llm/execute: Direct code execution
- /api/llm/parse-execute: LLM response parsing and execution
- /api/llm/vm-pool/{agent_id}: VM pool management
Request Validation:
- Required field validation
- Language support verification
- Payload size limits
- Invalid JSON handling
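The request validation cases can be expressed as a single function with one error variant per failure mode, as sketched below. The field names, the 64 KiB limit, and the error type are hypothetical; the supported languages are the four listed earlier in this plan. JSON parsing is out of scope for this example.

```rust
/// Validation failures for an execute request. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    MissingField(&'static str),
    UnsupportedLanguage(String),
    PayloadTooLarge(usize),
}

/// Already-parsed request body; field names are assumptions.
pub struct ExecuteRequest<'a> {
    pub agent_id: &'a str,
    pub language: &'a str,
    pub code: &'a str,
}

const SUPPORTED_LANGUAGES: [&str; 4] = ["python", "javascript", "bash", "rust"];

/// Assumed payload limit for illustration.
pub const MAX_PAYLOAD_BYTES: usize = 64 * 1024;

pub fn validate_request(req: &ExecuteRequest<'_>) -> Result<(), RequestError> {
    // Required fields must be present and non-blank.
    if req.agent_id.trim().is_empty() {
        return Err(RequestError::MissingField("agent_id"));
    }
    if req.language.trim().is_empty() {
        return Err(RequestError::MissingField("language"));
    }
    if req.code.trim().is_empty() {
        return Err(RequestError::MissingField("code"));
    }
    // Language check is case-insensitive.
    let language = req.language.to_lowercase();
    if !SUPPORTED_LANGUAGES.iter().any(|l| *l == language.as_str()) {
        return Err(RequestError::UnsupportedLanguage(language));
    }
    if req.code.len() > MAX_PAYLOAD_BYTES {
        return Err(RequestError::PayloadTooLarge(req.code.len()));
    }
    Ok(())
}
```

Each bullet above then maps to one assertion: a blank field, an unknown language, and an oversized payload should each produce a distinct error.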
Execution Scenarios:
- Successful multi-language execution
- Syntax error handling
- Runtime error management
- Timeout enforcement
- Resource limit validation
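Timeout enforcement can be tested without a VM by wrapping a closure in a deadline. The sketch below uses a worker thread and `recv_timeout` from the standard library; `run_with_timeout` and `ExecOutcome` are hypothetical names, and a real implementation would stop the VM rather than leave a thread running.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Result of running work under a deadline. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum ExecOutcome<T> {
    Completed(T),
    TimedOut,
    /// The worker ended without sending a result (for example, it panicked).
    Failed,
}

/// Runs `work` on a separate thread and waits at most `limit` for its result.
/// Note: on timeout the worker thread is not cancelled; it is only abandoned.
pub fn run_with_timeout<T, F>(work: F, limit: Duration) -> ExecOutcome<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => ExecOutcome::Completed(value),
        Err(RecvTimeoutError::Timeout) => ExecOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => ExecOutcome::Failed,
    }
}
```

A timeout test then needs only a fast closure that should complete and a sleeping closure that should not, which keeps the case deterministic as long as the gap between the sleep and the limit is generous.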
Security Testing:
- Code injection prevention
- Dangerous pattern blocking
- Network access restrictions
- File system isolation
- Privilege escalation prevention
Performance Testing:
- Concurrent execution handling
- Large output management
- Request throughput measurement
- Memory usage validation
3. WebSocket Tests (scratchpad/firecracker-rust/fcctl-web/tests/websocket_tests.rs)
Protocol Testing:
- Message type handling (LlmExecuteCode, LlmParseExecute)
- Streaming output delivery
- Execution completion notifications
- Error message propagation
Real-time Features:
- Live execution output streaming
- Execution cancellation
- Connection state management
- Multiple client support
Error Handling:
- Invalid message format handling
- Unknown message type responses
- Missing field validation
- Connection failure recovery
Performance Testing:
- High-frequency message handling
- Large output streaming
- Concurrent client management
- Memory usage under load
4. End-to-End Tests (tests/agent_vm_integration_tests.rs)
Agent Integration:
- TerraphimAgent VM execution capability
- Configuration-based VM enabling/disabling
- Multi-language agent support
- Execution metadata handling
Workflow Testing:
- Complete user request to execution flow
- Code extraction from LLM responses
- Automatic execution based on intent
- Result formatting and presentation
Security Integration:
- Agent-level security policy enforcement
- VM isolation verification
- Resource limit compliance
- Dangerous code blocking at agent level
Production Scenarios:
- Multiple concurrent agents
- VM pool resource management
- Long-running execution handling
- Error recovery and graceful degradation
Test Data and Fixtures
Safe Code Examples
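# Basic computation (variable names and values are illustrative)
result = 5 + 3
# Data processing
data = [1, 2, 3, 4, 5]
total = sum(data)
average = total / len(data)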
Dangerous Code Patterns
# File system destruction
rm -rf /
# Network exploitation (<url> is a placeholder)
curl <url> | sh
wget -qO- <url> | sh
# Code injection
eval(<untrusted input>)
Performance Test Cases
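# Memory stress test (sizes are illustrative)
data = [0] * 10_000_000
copies = [data[:] for _ in range(10)]
# CPU intensive
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
Test Execution Strategy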
Local Development Testing
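# Unit tests
cargo test vm_execution
# Integration tests (requires fcctl-web server)
cargo test llm_api_tests
# WebSocket tests (requires server)
cargo test websocket_tests
# End-to-end tests (requires full setup)
cargo test agent_vm_integration_tests -- --ignored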
Mock Testing Setup
- WireMock for HTTP API testing
- In-memory VM state simulation
- Controlled execution environment
- Deterministic test outcomes
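In-memory simulation with deterministic outcomes is often done by hiding execution behind a trait and substituting a mock with canned responses. The following is a minimal sketch of that idea; `CodeExecutor`, `MockExecutor`, and `ExecResult` are hypothetical names and do not describe the project's actual interfaces.

```rust
use std::collections::HashMap;

/// Outcome of one execution. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct ExecResult {
    pub exit_code: i32,
    pub stdout: String,
}

/// Abstraction that both a real VM client and a mock could implement.
pub trait CodeExecutor {
    fn execute(&mut self, language: &str, code: &str) -> Result<ExecResult, String>;
}

/// In-memory stand-in that returns pre-registered results and records calls.
pub struct MockExecutor {
    canned: HashMap<(String, String), ExecResult>,
    pub calls: Vec<(String, String)>,
}

impl MockExecutor {
    pub fn new() -> Self {
        MockExecutor {
            canned: HashMap::new(),
            calls: Vec::new(),
        }
    }

    /// Registers the result to return for an exact (language, code) pair.
    pub fn with_response(mut self, language: &str, code: &str, result: ExecResult) -> Self {
        self.canned
            .insert((language.to_string(), code.to_string()), result);
        self
    }
}

impl CodeExecutor for MockExecutor {
    fn execute(&mut self, language: &str, code: &str) -> Result<ExecResult, String> {
        let key = (language.to_string(), code.to_string());
        self.calls.push(key.clone());
        self.canned
            .get(&key)
            .cloned()
            .ok_or_else(|| format!("no canned response for {} code", language))
    }
}
```

Because the mock records every call, tests can assert both on the returned result and on what the agent actually tried to execute.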
Live Testing Requirements
- fcctl-web server running on localhost:8080
- Firecracker VM capabilities available
- Network isolation configured
- Resource limits enforced
Test Environment Configuration
Test VM Configuration
Test Agent Configuration
Security Testing Requirements
Code Injection Prevention
- SQL injection patterns in code strings
- Shell injection via command concatenation
- Python eval/exec exploitation
- JavaScript prototype pollution
VM Escape Prevention
- Container breakout attempts
- Kernel vulnerability exploitation
- Network stack attacks
- File system boundary violations
Resource Exhaustion Testing
- Memory bomb detection
- CPU intensive infinite loops
- Fork bomb prevention
- Disk space exhaustion
Performance Testing Metrics
Execution Performance
- Cold start time: VM provisioning latency
- Warm execution: Pre-warmed VM execution time
- Throughput: Executions per minute per host
- Concurrency: Maximum parallel executions
Resource Utilization
- Memory usage: Peak memory per execution
- CPU utilization: Average CPU load during execution
- Network bandwidth: WebSocket streaming throughput
- Disk I/O: Temporary file operations
Scalability Targets
- Agent capacity: 20+ concurrent agents per host
- Execution throughput: 100+ executions per minute
- Response latency: <500ms for simple executions
- Stream latency: <100ms for output streaming
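The targets above become checkable once measured samples are reduced to a summary statistic. The sketch below computes a nearest-rank percentile and a per-minute throughput; using the 95th percentile for the latency check is an assumption of this example, since the plan does not say which statistic the targets refer to, and all function names are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile of the samples; `pct` must be within 0..=100.
pub fn percentile(samples: &[Duration], pct: f64) -> Option<Duration> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest-rank: the smallest value with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the 95th percentile latency is strictly below `target`.
/// (Assumption: p95 is the statistic of interest.)
pub fn meets_latency_target(samples: &[Duration], target: Duration) -> bool {
    percentile(samples, 95.0).map_or(false, |p95| p95 < target)
}

/// Converts a count of executions over an elapsed window to executions per minute.
/// `elapsed` must be non-zero.
pub fn executions_per_minute(count: usize, elapsed: Duration) -> f64 {
    count as f64 * 60.0 / elapsed.as_secs_f64()
}
```

For example, 50 executions completed in 30 seconds correspond to 100 executions per minute, which sits exactly at the throughput target listed above.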
Error Handling Test Cases
Network Failures
- fcctl-web server unavailable
- WebSocket connection drops
- HTTP request timeouts
- DNS resolution failures
VM Failures
- VM provisioning failures
- VM crash during execution
- Resource limit exceeded
- VM pool exhaustion
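Pool exhaustion is easier to test against a small model of the pool than against real VMs. This sketch models a fixed-capacity pool that returns an explicit error when no VM is free; `VmPool`, `PoolError`, and the integer VM ids are hypothetical and stand in for whatever fcctl-web actually uses.

```rust
/// Errors returned by the pool model. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum PoolError {
    Exhausted,
}

/// Fixed-capacity pool of VM slots identified by small integers.
pub struct VmPool {
    free: Vec<u32>,
    capacity: u32,
}

impl VmPool {
    pub fn new(capacity: u32) -> Self {
        // Reverse so that `pop` hands out the lowest id first.
        VmPool {
            free: (0..capacity).rev().collect(),
            capacity,
        }
    }

    /// Takes a free VM, or reports exhaustion instead of blocking.
    pub fn acquire(&mut self) -> Result<u32, PoolError> {
        self.free.pop().ok_or(PoolError::Exhausted)
    }

    /// Returns a VM to the pool; unknown ids and double releases are ignored.
    pub fn release(&mut self, id: u32) {
        if id < self.capacity && !self.free.contains(&id) {
            self.free.push(id);
        }
    }

    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```

The exhaustion test acquires every slot, asserts that the next acquire fails cleanly, and then checks that releasing a slot makes the pool usable again.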
Code Failures
- Syntax errors in submitted code
- Runtime exceptions
- Infinite loops and timeouts
- Memory allocation failures
Continuous Integration
Test Automation
# GitHub Actions workflow
vm_execution_tests:
  runs-on: ubuntu-latest
  services:
    fcctl-web:
      image: fcctl-web:latest
      ports:
        - 8080:8080
  steps:
    - name: Unit Tests
      run: cargo test vm_execution
    - name: Integration Tests
      run: cargo test llm_api_tests
    - name: E2E Tests
      run: cargo test agent_vm_integration_tests -- --ignored
Test Reporting
- Test coverage measurement
- Performance regression detection
- Security vulnerability scanning
- Integration test status dashboard
Manual Testing Procedures
Security Audit Checklist
- [ ] Dangerous code patterns blocked
- [ ] VM network isolation verified
- [ ] File system access restricted
- [ ] Resource limits enforced
- [ ] Privilege escalation prevented
User Experience Testing
- [ ] Agent responds appropriately to code requests
- [ ] Execution results properly formatted
- [ ] Error messages are helpful
- [ ] Performance is acceptable
- [ ] Security warnings are clear
Production Readiness Testing
- [ ] Load testing under realistic conditions
- [ ] Failure recovery mechanisms work
- [ ] Monitoring and alerting functional
- [ ] Documentation accurate and complete
- [ ] Deployment process validated
Test Maintenance
Regular Updates Required
- Update dangerous code patterns as new threats emerge
- Refresh test data for realistic scenarios
- Update performance benchmarks
- Maintain compatibility with fcctl-web updates
Test Environment Hygiene
- Regular cleanup of test VMs
- Reset test databases between runs
- Clear temporary files and state
- Validate test isolation boundaries
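Cleanup of temporary files and state can be made automatic by tying it to scope, so a failing test still cleans up after itself. The sketch below uses a guard type whose `Drop` implementation removes a scratch directory; `TempDirGuard` is a hypothetical helper, not part of the existing test suite.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Owns a scratch directory and deletes it when dropped.
/// Hypothetical helper for illustration.
pub struct TempDirGuard {
    path: PathBuf,
}

impl TempDirGuard {
    /// Creates a directory under the system temp dir, suffixed with the
    /// process id so concurrent test processes do not collide.
    pub fn create(name: &str) -> io::Result<Self> {
        let path = std::env::temp_dir().join(format!("{}-{}", name, std::process::id()));
        fs::create_dir_all(&path)?;
        Ok(TempDirGuard { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempDirGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because drop cannot fail.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

The same pattern could wrap test VMs, with the drop handler issuing the teardown call, so that cleanup does not depend on every test reaching its final line.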
This testing plan is intended to verify that the VM execution system is secure, reliable, and performant enough for production use with TerraphimAgent.