VM Execution API Design - Comprehensive Architecture
Overview
This document outlines the complete VM execution API design for terraphim-ai, leveraging the existing firecracker-rust infrastructure to provide secure, isolated code execution with sub-2-second response times.
Architecture Components
1. Core Infrastructure (Existing)
VmManager (firecracker-rust/fcctl-core/src/vm/manager.rs)
- Purpose: Low-level VM lifecycle management
- Key Methods:
  - create_vm(config, domain) → Result<String>
  - start_vm(vm_id) → Result<()>
  - stop_vm(vm_id) → Result<()>
  - get_vm_status(vm_id) → Result<VmState>
VmPoolManager (firecracker-rust/fcctl-web/src/vm_pool/mod.rs)
- Purpose: IP allocation and VM-IP mapping
- IP Range: 172.26.0.2 - 172.26.0.254 (253 IPs)
- Key Methods:
  - allocate_ip(vm_id) → Result<String>
  - deallocate_ip(vm_id) → Result<()>
  - get_vm_ip(vm_id) → Option<String>
  - restore_from_database() → Result<()>
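The IP bookkeeping described above can be illustrated with a small in-memory sketch. This is not the fcctl-web implementation: the `IpPool` type, the `Result<_, String>` error type, and the "lowest free address first" policy are assumptions made for illustration, and database persistence (`restore_from_database`) is left out.

```rust
use std::collections::{BTreeSet, HashMap};
use std::net::Ipv4Addr;

/// Hypothetical in-memory stand-in for the VM-to-IP mapping.
pub struct IpPool {
    free: BTreeSet<Ipv4Addr>,
    by_vm: HashMap<String, Ipv4Addr>,
}

impl IpPool {
    /// Seeds the pool with 172.26.0.2 through 172.26.0.254 (253 addresses).
    pub fn new() -> Self {
        let free = (2u8..=254).map(|host| Ipv4Addr::new(172, 26, 0, host)).collect();
        Self { free, by_vm: HashMap::new() }
    }

    /// Returns the VM's existing address, or hands out the lowest free one.
    pub fn allocate_ip(&mut self, vm_id: &str) -> Result<String, String> {
        if let Some(ip) = self.by_vm.get(vm_id) {
            return Ok(ip.to_string());
        }
        let ip = self
            .free
            .pop_first()
            .ok_or_else(|| "IP pool exhausted".to_string())?;
        self.by_vm.insert(vm_id.to_string(), ip);
        Ok(ip.to_string())
    }

    /// Returns the VM's address to the free set.
    pub fn deallocate_ip(&mut self, vm_id: &str) -> Result<(), String> {
        let ip = self
            .by_vm
            .remove(vm_id)
            .ok_or_else(|| format!("no IP allocated for {vm_id}"))?;
        self.free.insert(ip);
        Ok(())
    }

    /// Looks up the address currently mapped to a VM, if any.
    pub fn get_vm_ip(&self, vm_id: &str) -> Option<String> {
        self.by_vm.get(vm_id).map(|ip| ip.to_string())
    }
}
```

Making `allocate_ip` idempotent per VM keeps retries from leaking addresses, which matters when the range holds only 253 entries.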
Session Management (firecracker-rust/fcctl-repl/src/session.rs)
- Purpose: Command execution with snapshot/rollback
- Key Methods:
  - execute_command(command, timeout) → Result<ExecutionResult>
  - create_snapshot(name) → Result<String>
  - rollback_to_snapshot(snapshot_id) → Result<()>
2. API Layer (Existing)
LLM Execution API (firecracker-rust/fcctl-web/src/api/llm.rs)
- Endpoint: POST /api/llm/execute
- Request: LlmExecuteRequest { code, language, agent_id, vm_id?, timeout_ms? }
- Response: LlmExecuteResponse { execution_id, vm_id, exit_code, stdout, stderr, duration_ms, timestamps }
VM Pool API (firecracker-rust/fcctl-web/src/api/vm_pool.rs)
- Endpoints:
  - GET /api/vm-pool/stats → Pool statistics
  - GET /api/vm-pool/list → VMs with IPs
  - POST /api/vm-pool/allocate → Allocate IP for VM
3. Enhanced VM Execution Flow Design
3.1 VM Pool Management System
Pre-warmed VM Pool Strategy:
Pool Management Algorithm:
- Target Pool Size: 10 prewarmed VMs (configurable)
- Warm-up Strategy: Maintain 8-12 prewarmed VMs
- VM Lifecycle: New → Warming (30-45 seconds) → Prewarmed → Active → Cleanup
- Resource Recycling: VMs used >50 executions or >2 hours get recycled
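The pool rules above can be sketched as a small policy type. The names (`VmState`, `PoolPolicy`) and the refill rule are assumptions: the sketch treats 8-12 as a tolerance band around the target of 10, topping up to the target only when the pool falls below 8, so that a single allocation does not trigger a new warm-up.

```rust
use std::time::Duration;

/// Lifecycle states in the order given above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    New,
    Warming,
    Prewarmed,
    Active,
    Cleanup,
}

impl VmState {
    /// Next state in the linear lifecycle; Cleanup is terminal.
    pub fn next(self) -> Option<VmState> {
        match self {
            VmState::New => Some(VmState::Warming),
            VmState::Warming => Some(VmState::Prewarmed),
            VmState::Prewarmed => Some(VmState::Active),
            VmState::Active => Some(VmState::Cleanup),
            VmState::Cleanup => None,
        }
    }
}

/// Hypothetical policy object holding the numbers from the list above.
pub struct PoolPolicy {
    pub target_size: usize,
    pub min_prewarmed: usize,
    pub max_prewarmed: usize,
    pub max_executions: u32,
    pub max_age: Duration,
}

impl Default for PoolPolicy {
    fn default() -> Self {
        Self {
            target_size: 10,
            min_prewarmed: 8,
            max_prewarmed: 12,
            max_executions: 50,
            max_age: Duration::from_secs(2 * 60 * 60),
        }
    }
}

impl PoolPolicy {
    /// A VM is recycled once it exceeds either limit (>50 executions or >2 hours).
    pub fn should_recycle(&self, executions: u32, age: Duration) -> bool {
        executions > self.max_executions || age > self.max_age
    }

    /// Number of VMs to start warming. VMs already warming count toward the
    /// pool so that repeated calls do not overshoot the target.
    pub fn vms_to_warm(&self, prewarmed: usize, warming: usize) -> usize {
        let pending = prewarmed + warming;
        if pending >= self.min_prewarmed {
            0
        } else {
            self.target_size.saturating_sub(pending)
        }
    }

    /// Number of idle prewarmed VMs to retire when above the upper band.
    pub fn vms_to_retire(&self, prewarmed: usize) -> usize {
        prewarmed.saturating_sub(self.max_prewarmed)
    }
}
```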
3.2 Fast Execution Flow (<2 seconds)
Request Processing Pipeline:
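As a rough check on the pipeline, the fixed per-stage budgets from the breakdown in section 4.1 can be summed to see how much of the 2000 ms target is left for user code. The constant and function names below are illustrative, not part of the existing codebase.

```rust
/// Fixed stage budgets in milliseconds, copied from the section 4.1 breakdown.
/// Code execution is the only variable stage and is passed in separately.
pub const FIXED_STAGES_MS: [(&str, u64); 6] = [
    ("vm_allocation", 50),
    ("snapshot_creation", 100),
    ("code_transfer", 50),
    ("execution_setup", 100),
    ("result_collection", 100),
    ("response_building", 50),
];

/// End-to-end target from the same breakdown.
pub const TOTAL_TARGET_MS: u64 = 2000;

/// Sum of all stages other than code execution.
pub fn fixed_overhead_ms() -> u64 {
    FIXED_STAGES_MS.iter().map(|(_, ms)| ms).sum()
}

/// Time left for the user's code once fixed overhead is subtracted.
pub fn execution_budget_ms() -> u64 {
    TOTAL_TARGET_MS.saturating_sub(fixed_overhead_ms())
}

/// True when a request with the given execution time stays under the target.
pub fn within_target(execution_ms: u64) -> bool {
    fixed_overhead_ms() + execution_ms < TOTAL_TARGET_MS
}
```

The fixed stages add up to 450 ms, so the documented 1000-1500 ms execution range yields 1450-1950 ms end to end, leaving 50 ms of headroom at the upper end.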
VM Allocation Strategy:
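One plausible allocation order, sketched under assumptions: reuse the VM already bound to the agent, otherwise take a prewarmed VM from the pool, otherwise fall back to a cold start. The `Allocator` and `Allocation` types are hypothetical, and honoring an explicit `vm_id` from the request is omitted for brevity.

```rust
use std::collections::{HashMap, VecDeque};

/// Result of an allocation attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Allocation {
    /// The agent already had a VM; reuse it.
    Reused(String),
    /// A prewarmed VM was taken from the pool.
    FromPool(String),
    /// The pool was empty; the caller must boot a new VM.
    ColdStart,
}

/// Hypothetical allocator: a FIFO of prewarmed VM ids plus agent bindings.
#[derive(Default)]
pub struct Allocator {
    prewarmed: VecDeque<String>,
    by_agent: HashMap<String, String>,
}

impl Allocator {
    /// Registers a VM that has finished warming.
    pub fn add_prewarmed(&mut self, vm_id: &str) {
        self.prewarmed.push_back(vm_id.to_string());
    }

    /// Picks a VM for the agent using the fallback order described above.
    pub fn allocate(&mut self, agent_id: &str) -> Allocation {
        // 1. Agent affinity: keep using the VM bound to this agent.
        if let Some(vm_id) = self.by_agent.get(agent_id) {
            return Allocation::Reused(vm_id.clone());
        }
        // 2. Fast path: oldest prewarmed VM first.
        if let Some(vm_id) = self.prewarmed.pop_front() {
            self.by_agent.insert(agent_id.to_string(), vm_id.clone());
            return Allocation::FromPool(vm_id);
        }
        // 3. Slow path: nothing prewarmed is available.
        Allocation::ColdStart
    }
}
```

Taking the oldest prewarmed VM first is a design choice that keeps pool members from aging out unused; a LIFO order would instead favor the most recently warmed VM.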
3.3 Snapshot and Rollback System
Execution Snapshots:
Smart Rollback Strategy:
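A minimal sketch of one rollback policy, assuming the simplest rule: snapshot before every execution and restore the snapshot whenever the command fails. The `Session` trait mirrors the method names listed in section 1 with simplified, assumed signatures (synchronous, `String` errors, exit code as the result); `FakeSession` exists only to exercise the flow, and selective restore is not covered.

```rust
/// Simplified view of the session operations; not the fcctl-repl signatures.
pub trait Session {
    /// Runs a command and returns its exit code.
    fn execute_command(&mut self, command: &str) -> Result<i32, String>;
    fn create_snapshot(&mut self, name: &str) -> Result<String, String>;
    fn rollback_to_snapshot(&mut self, snapshot_id: &str) -> Result<(), String>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Succeeded,
    RolledBack { snapshot_id: String, exit_code: Option<i32> },
}

/// Snapshot first, then execute; restore the snapshot on any failure.
pub fn execute_with_rollback<S: Session>(
    session: &mut S,
    command: &str,
) -> Result<Outcome, String> {
    let snapshot_id = session.create_snapshot("pre-execution")?;
    let exit_code = match session.execute_command(command) {
        Ok(0) => return Ok(Outcome::Succeeded),
        Ok(code) => Some(code), // command ran but exited non-zero
        Err(_) => None,         // command could not be run at all
    };
    session.rollback_to_snapshot(&snapshot_id)?;
    Ok(Outcome::RolledBack { snapshot_id, exit_code })
}

/// In-memory fake: "state" is just the list of commands applied so far.
#[derive(Default)]
pub struct FakeSession {
    pub state: Vec<String>,
    snapshots: Vec<Vec<String>>,
}

impl Session for FakeSession {
    fn execute_command(&mut self, command: &str) -> Result<i32, String> {
        self.state.push(command.to_string());
        // Any command containing "fail" simulates a non-zero exit.
        Ok(if command.contains("fail") { 1 } else { 0 })
    }

    fn create_snapshot(&mut self, _name: &str) -> Result<String, String> {
        self.snapshots.push(self.state.clone());
        Ok(format!("snap-{}", self.snapshots.len() - 1))
    }

    fn rollback_to_snapshot(&mut self, snapshot_id: &str) -> Result<(), String> {
        let saved = snapshot_id
            .strip_prefix("snap-")
            .and_then(|n| n.parse::<usize>().ok())
            .and_then(|i| self.snapshots.get(i).cloned())
            .ok_or_else(|| format!("unknown snapshot {snapshot_id}"))?;
        self.state = saved;
        Ok(())
    }
}
```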
4. Performance Optimization
4.1 Sub-2 Second Boot Time Optimization
Techniques:
- Kernel Caching: Pre-loaded kernel in memory
- Root FS Optimization: Minimal root filesystem with essential tools
- Memory Pre-allocation: VM memory reserved in host
- CPU Pinning: Dedicated CPU cores for VM pool
- Network Pre-configuration: TAP interfaces pre-created
End-to-End Latency Breakdown (prewarmed VM):
Total Target: <2000ms
├── VM Allocation: 50ms
├── Snapshot Creation: 100ms
├── Code Transfer: 50ms
├── Execution Setup: 100ms
├── Code Execution: 1000-1500ms (variable)
├── Result Collection: 100ms
└── Response Building: 50ms
4.2 Resource Management
Memory Management:
CPU Management:
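One way to enforce host-level limits is an admission check run before a VM is started. The sketch below is an assumption, not the documented design: it reuses the 80% memory threshold from the alert conditions in section 6.2 as a hard ceiling and refuses to oversubscribe vCPUs beyond the host core count. All type and function names are hypothetical.

```rust
/// Total resources available on the host.
#[derive(Debug, Clone, Copy)]
pub struct HostCapacity {
    pub memory_mb: u64,
    pub cpu_cores: u32,
}

/// Resources already committed to running VMs.
#[derive(Debug, Default, Clone, Copy)]
pub struct Usage {
    pub memory_mb: u64,
    pub vcpus: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Admit,
    RejectMemory,
    RejectCpu,
}

/// Assumed ceiling: keep committed memory at or below 80% of the host.
pub const MEMORY_CEILING_PERCENT: u64 = 80;

/// Decides whether a VM with the given size may be started.
pub fn admit(host: HostCapacity, used: Usage, vm_memory_mb: u64, vm_vcpus: u32) -> Admission {
    let memory_after = used.memory_mb + vm_memory_mb;
    // Integer comparison avoids floating-point rounding at the boundary.
    if memory_after * 100 > host.memory_mb * MEMORY_CEILING_PERCENT {
        return Admission::RejectMemory;
    }
    // No vCPU oversubscription in this sketch.
    if used.vcpus + vm_vcpus > host.cpu_cores {
        return Admission::RejectCpu;
    }
    Admission::Admit
}
```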
5. Security Model
5.1 VM Isolation
Network Isolation:
- Each VM in isolated network namespace
- Only outbound HTTPS allowed (port 443)
- No inbound connections from external networks
- DNS filtering to prevent exfiltration
File System Isolation:
- Read-only root filesystem
- Temporary writable overlay for execution
- No access to host filesystem
- Automatic cleanup on VM destruction
Process Isolation:
- No privileged processes allowed
- Resource limits (CPU, memory, disk)
- Process monitoring and termination
- No access to host system calls
5.2 Code Validation
Input Validation Pipeline:
Security Patterns Detected:
- File system access outside sandbox
- Network connections to non-HTTPS endpoints
- System calls for privilege escalation
- Infinite loops and resource exhaustion
- Cryptocurrency mining patterns
- Data exfiltration attempts
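As an illustration of the first stage of such a pipeline, here is a deliberately naive substring scanner. The rule list is hypothetical and easy to evade, so a check like this could only complement the VM isolation described in section 5.1, never replace it. Detecting infinite loops, mining, or exfiltration would need runtime limits and monitoring, and is not attempted here.

```rust
/// Categories this sketch can flag (a subset of the patterns listed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Finding {
    FilesystemEscape,
    InsecureNetwork,
    PrivilegeEscalation,
}

/// Hypothetical deny-list: (substring, category). Illustrative only.
const RULES: &[(&str, Finding)] = &[
    ("../", Finding::FilesystemEscape),
    ("/etc/shadow", Finding::FilesystemEscape),
    ("http://", Finding::InsecureNetwork),
    ("sudo ", Finding::PrivilegeEscalation),
    ("setuid", Finding::PrivilegeEscalation),
];

/// Returns each matched category once, in rule order.
pub fn scan(code: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (needle, finding) in RULES {
        if code.contains(*needle) && !findings.contains(finding) {
            findings.push(*finding);
        }
    }
    findings
}
```

A caller would typically reject the request, or log it and continue, depending on which categories appear in the result.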
6. Monitoring and Observability
6.1 Metrics Collection
Execution Metrics:
VM Pool Metrics:
6.2 Alerting
Alert Conditions:
- VM pool size < 5 prewarmed VMs
- Average execution time > 5 seconds
- Memory utilization > 80%
- VM failure rate > 10%
- Security pattern detection rate > 1%
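The alert conditions above translate directly into threshold checks. The `Metrics` struct and `Alert` enum below are hypothetical names; the thresholds are the ones listed, with each comparison strict, as written.

```rust
/// Snapshot of the values the alert rules need.
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub prewarmed_vms: u32,
    pub avg_execution_ms: u64,
    pub memory_utilization_percent: f64,
    pub vm_failure_rate_percent: f64,
    pub security_detection_rate_percent: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alert {
    PoolLow,
    SlowExecution,
    HighMemory,
    HighVmFailureRate,
    HighSecurityDetectionRate,
}

/// Returns every alert whose condition currently holds.
pub fn evaluate(m: &Metrics) -> Vec<Alert> {
    let mut alerts = Vec::new();
    if m.prewarmed_vms < 5 {
        alerts.push(Alert::PoolLow);
    }
    if m.avg_execution_ms > 5_000 {
        alerts.push(Alert::SlowExecution);
    }
    if m.memory_utilization_percent > 80.0 {
        alerts.push(Alert::HighMemory);
    }
    if m.vm_failure_rate_percent > 10.0 {
        alerts.push(Alert::HighVmFailureRate);
    }
    if m.security_detection_rate_percent > 1.0 {
        alerts.push(Alert::HighSecurityDetectionRate);
    }
    alerts
}
```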
7. API Endpoints Specification
7.1 Core Execution Endpoints
Execute Code:
POST /api/v2/execute
Content-Type: application/json
Authorization: Bearer <token>
{
"code": "print('Hello, World!')",
"language": "python",
"agent_id": "agent-123",
"vm_id": "vm-456", // optional
"timeout_ms": 10000, // optional, default 10000
"snapshot_strategy": "pre-execution", // optional
"resource_limits": { // optional
"memory_mb": 512,
"timeout_secs": 10
}
}
Response:
{
"execution_id": "exec-789",
"vm_id": "vm-456",
"agent_id": "agent-123",
"status": "success",
"exit_code": 0,
"stdout": "Hello, World!\n",
"stderr": "",
"duration_ms": 1234,
"memory_peak_mb": 64,
"snapshot_id": "snap-101",
"started_at": "2025-10-18T10:30:00Z",
"completed_at": "2025-10-18T10:30:01.234Z"
}
Execute Multiple Code Blocks:
POST /api/v2/execute/batch
Content-Type: application/json
{
"code_blocks": [
{
"code": "x = 1",
"language": "python"
},
{
"code": "print(x)",
"language": "python"
}
],
"agent_id": "agent-123",
"execution_mode": "sequential" // or "parallel"
}
Response:
{
"batch_id": "batch-456",
"execution_id": "exec-789",
"results": [
{
"block_index": 0,
"status": "success",
"stdout": "",
"stderr": "",
"exit_code": 0,
"duration_ms": 45
},
{
"block_index": 1,
"status": "success",
"stdout": "1\n",
"stderr": "",
"exit_code": 0,
"duration_ms": 67
}
],
"total_duration_ms": 156,
"vm_id": "vm-456",
"snapshot_id": "snap-102"
}
7.2 VM Management Endpoints
Get VM Pool Status:
GET /api/v2/vm-pool/status
Response:
{
"pool_stats": {
"total_vms": 15,
"prewarmed_vms": 8,
"active_vms": 5,
"warming_vms": 2,
"utilization_percent": 33,
"average_boot_time_ms": 1800
},
"resource_stats": {
"memory_total_gb": 32,
"memory_used_gb": 10,
"memory_utilization_percent": 31,
"cpu_cores_total": 16,
"cpu_cores_used": 8,
"cpu_utilization_percent": 50
}
}
Allocate VM for Agent:
POST /api/v2/vms/allocate
Content-Type: application/json
{
"agent_id": "agent-123",
"vm_config": {
"memory_mb": 2048,
"vcpus": 2,
"root_fs_size_gb": 10
}
}
Response:
{
"vm_id": "vm-456",
"ip_address": "172.26.0.42",
"status": "prewarmed",
"estimated_ready_time_ms": 500,
"allocated_at": "2025-10-18T10:30:00Z"
}
7.3 Snapshot Management Endpoints
Create Snapshot:
POST /api/v2/vms/{vm_id}/snapshots
Content-Type: application/json
{
"name": "before-risky-operation",
"description": "Snapshot before executing untrusted code",
"snapshot_type": "manual"
}
Response:
{
"snapshot_id": "snap-103",
"vm_id": "vm-456",
"name": "before-risky-operation",
"created_at": "2025-10-18T10:30:00Z",
"size_mb": 256,
"creation_time_ms": 234
}
Rollback to Snapshot:
POST /api/v2/vms/{vm_id}/snapshots/{snapshot_id}/rollback
Response:
{
"vm_id": "vm-456",
"snapshot_id": "snap-103",
"rollback_time_ms": 567,
"files_affected": 12,
"processes_terminated": 3,
"rolled_back_at": "2025-10-18T10:30:05Z"
}
8. Integration with Terraphim AI
8.1 Agent Integration
Agent VM Assignment:
8.2 LLM Proxy Integration
LLM Model Configuration:
9. Implementation Roadmap
Phase 1: Core VM Execution (Weeks 1-2)
- [ ] Implement PrewarmedVmPool with basic allocation
- [ ] Add fast execution flow with snapshot support
- [ ] Integrate with existing firecracker-rust infrastructure
- [ ] Basic performance monitoring and metrics
Phase 2: Advanced Features (Weeks 3-4)
- [ ] Smart rollback system with selective restore
- [ ] Resource management with memory/CPU limits
- [ ] Enhanced security validation and scanning
- [ ] Batch execution support for multiple code blocks
Phase 3: Performance Optimization (Weeks 5-6)
- [ ] Sub-2 second boot time optimization
- [ ] VM prewarming with predictive allocation
- [ ] Advanced caching and memory management
- [ ] Load balancing across multiple host machines
Phase 4: Production Readiness (Weeks 7-8)
- [ ] Comprehensive monitoring and alerting
- [ ] Disaster recovery and backup procedures
- [ ] Performance tuning and capacity planning
- [ ] Security audit and penetration testing
10. Success Metrics
Performance Targets:
- VM allocation time: <50ms (from pool)
- Code execution setup: <200ms
- Total execution time: <2000ms (95th percentile)
- VM boot time: <1800ms (cold start)
- Snapshot creation: <100ms
- Rollback time: <500ms
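Because the total execution target is stated at the 95th percentile, checking it requires a percentile over recorded durations rather than an average. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the function names are illustrative.

```rust
/// Nearest-rank percentile over durations in milliseconds.
/// Returns None for empty input or a percentile outside 1..=100.
pub fn percentile_ms(samples: &[u64], pct: u32) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based rank = ceil(pct / 100 * n), computed in integers.
    let rank = (sorted.len() * pct as usize + 99) / 100;
    Some(sorted[rank - 1])
}

/// True when the 95th percentile is under the 2000 ms target.
pub fn meets_latency_target(samples: &[u64]) -> bool {
    matches!(percentile_ms(samples, 95), Some(p95) if p95 < 2000)
}
```

With 20 samples the rank is 19, so two requests over 2000 ms in that window are enough to miss the target while a single slow request is not.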
Reliability Targets:
- VM pool availability: 99.9%
- Execution success rate: >95%
- Snapshot success rate: >99%
- System uptime: 99.95%
Security Targets:
- Zero VM escape incidents
- 100% code validation coverage
- <1% false positive security alerts
- Complete audit trail for all executions
This design builds on the existing firecracker-rust infrastructure to provide secure, isolated code execution in terraphim-ai, adding prewarmed VM pooling, snapshot-based rollback, and monitoring to meet the performance, reliability, and security targets listed above.