Implementation Plan: PR #529 Gap Coverage
Status: Draft
Research Doc: docs/plans/pr529-gap-analysis-research.md
Author: AI Agent Analysis
Date: 2026-02-20
Estimated Effort: 16 hours
Overview
Summary
This plan addresses 15 identified gaps in PR #529, prioritized by criticality. P0 gaps must be resolved before production deployment. P1 gaps should be addressed in the immediate follow-up sprint. P2/P3 gaps are documented for future sprints.
Approach
Focus on security (token leakage), reliability (gateway testing), and operational readiness (health checks, documentation). Each gap is addressed with minimal, focused changes.
Scope
In Scope (P0/P1):
- GAP-003: Gateway outbound dispatch tests (P0 - Critical)
- GAP-001/GAP-002: Security - Remove token logging (P1)
- GAP-004: Channel error recovery (P1)
- GAP-009: Health check endpoint (P1)
- GAP-011/GAP-012: Gateway and channel documentation (P1)
Out of Scope (P2/P3):
- GAP-005: Session cleanup (deferred - P2)
- GAP-006: Rate limiting (deferred - P2)
- GAP-007: Matrix adapter (blocked by upstream - P3)
- GAP-008: ToolRegistry error handling (acceptable - P3)
- GAP-010: Config validation (nice-to-have - P3)
- GAP-013: Load testing (future work - P3)
- GAP-014: Degradation spec (docs - P2)
- GAP-015: CLI parity (intentional - P3)
Avoid At All Cost:
- Complex retry logic with exponential backoff (over-engineering)
- Distributed session storage (Redis, etc.) - premature optimization
- Web dashboard for monitoring - scope creep
- Custom metrics pipeline - use existing logging
Architecture
Component Diagram
                    +------------------+
                    |   Health Check   |
                    |     Endpoint     |
                    +--------+---------+
                             |
+------------+      +--------v---------+      +------------------+
|  Telegram  |<---->|                  |----->|   Agent Loop     |
|  Channel   |      |    MessageBus    |      |  (ToolCalling)   |
+-----^------+      |                  |<-----+------------------+
      |             +--------^---------+
      |                      |
+-----v------+      +--------v---------+      +------------------+
|  Discord   |<---->|    ChannelMgr    |----->|  SkillExecutor   |
|  Channel   |      |    (Recovery)    |      |  (with registry) |
+------------+      +------------------+      +------------------+
Data Flow
Gateway Mode:
Telegram/Discord -> MessageBus.inbound -> AgentLoop -> MessageBus.outbound -> ChannelManager.send()
Health Check:
HTTP GET /health -> HealthCheckHandler -> Status Response
Key Design Decisions
| Decision | Rationale | Alternatives Rejected |
|----------|-----------|----------------------|
| Remove token logging entirely | Security over debugging | Token hashing, partial logging |
| Test outbound dispatch via integration | Real behavior verification | Unit test with mocks |
| Health check as simple HTTP | Operational standard | Complex health aggregation |
| Channel panic isolation | Prevent gateway crashes | Process per channel (too heavy) |
Eliminated Options
| Option Rejected | Why Rejected | Risk of Including |
|-----------------|--------------|-------------------|
| Token masking/hashing | Still leaks token length and format | Complexity without security benefit |
| Full circuit breaker pattern | Over-engineering for current scale | Premature optimization |
| Distributed session storage | Not needed for single-node deployment | Infrastructure complexity |
| Prometheus metrics | Overkill for initial release | Operational burden |
Simplicity Check
What if this could be easy?
- Token logging: Just don't log it (easiest)
- Health check: Return HTTP 200 with process status (simplest)
- Error recovery: Catch panics, log error, continue (simplest)
- Gateway tests: Spawn channels, send message, verify delivery (simplest)
Senior Engineer Test: This design is minimal and direct. A senior engineer would approve.
Nothing Speculative Checklist:
- [x] No features the user didn't request
- [x] No abstractions "in case we need them later"
- [x] No flexibility "just in case"
- [x] No error handling for scenarios that cannot occur
- [x] No premature optimization
File Changes
New Files
| File | Purpose |
|------|---------|
| crates/terraphim_tinyclaw/src/health.rs | Health check endpoint handler |
| crates/terraphim_tinyclaw/tests/gateway_integration.rs | Gateway mode integration tests |
| crates/terraphim_tinyclaw/tests/channel_recovery.rs | Channel error recovery tests |
| docs/src/gateway-deployment.md | Gateway deployment guide |
| docs/src/channel-setup.md | Bot setup instructions |
Modified Files
| File | Changes |
|------|---------|
| crates/terraphim_tinyclaw/src/channels/telegram.rs | Remove token logging |
| crates/terraphim_tinyclaw/src/channels/discord.rs | Remove token logging |
| crates/terraphim_tinyclaw/src/channel.rs | Add panic recovery wrapper |
| crates/terraphim_tinyclaw/src/main.rs | Add health check endpoint, fix logging |
| crates/terraphim_tinyclaw/src/lib.rs | Export health module |
| crates/terraphim_tinyclaw/Cargo.toml | Add optional axum for health endpoint |
Deleted Files
| File | Reason |
|------|--------|
| None | All changes are additive or modifications |
API Design
Health Check Response
/// Health check response
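The response type itself is not shown in this plan, so the following is only a sketch of a shape that would satisfy "return HTTP 200 with process status". The struct name `HealthResponse`, its fields (`status`, `uptime_secs`, `channels`) and the `to_json` helper are all assumptions made for illustration. The JSON is built by hand to keep the sketch dependency-free; a real implementation would more likely derive a serializer.

```rust
/// Hypothetical health payload; the field names are assumptions, not the PR's API.
#[derive(Debug, Clone, PartialEq)]
pub struct HealthResponse {
    /// "ok" while the process is serving.
    pub status: String,
    /// Seconds since the gateway started.
    pub uptime_secs: u64,
    /// Names of the channels currently running.
    pub channels: Vec<String>,
}

impl HealthResponse {
    /// Render the payload as JSON by hand.
    /// Channel names are assumed to contain no characters that need escaping.
    pub fn to_json(&self) -> String {
        let channels: Vec<String> = self
            .channels
            .iter()
            .map(|c| format!("\"{}\"", c))
            .collect();
        format!(
            "{{\"status\":\"{}\",\"uptime_secs\":{},\"channels\":[{}]}}",
            self.status,
            self.uptime_secs,
            channels.join(",")
        )
    }
}
```

Keeping the payload this small matches the "Health check as simple HTTP" decision: a load balancer only needs the status code, and the body is there for humans.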
Error Recovery Types
/// Result of channel execution with recovery
pub type ChannelResult<T> = ;
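The right-hand side of the alias is missing from this plan. One plausible reading, offered only as an assumption, is a `Result` over a dedicated error enum that distinguishes a panic from an ordinary failure. The `ChannelError` type and its variants below are hypothetical.

```rust
use std::fmt;

/// Hypothetical error type; the variants are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum ChannelError {
    /// The channel task panicked; the panic message is kept for logging.
    Panicked(String),
    /// The channel returned an ordinary error.
    Failed(String),
}

impl fmt::Display for ChannelError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChannelError::Panicked(msg) => write!(f, "channel panicked: {}", msg),
            ChannelError::Failed(msg) => write!(f, "channel failed: {}", msg),
        }
    }
}

impl std::error::Error for ChannelError {}

/// Assumed shape of the alias: success value or a recoverable channel error.
pub type ChannelResult<T> = Result<T, ChannelError>;
```

Separating `Panicked` from `Failed` would let the channel manager log both cases differently while treating neither as fatal for the gateway.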
Test Strategy
Unit Tests
| Test | Location | Purpose |
|------|----------|---------|
| test_telegram_no_token_log | telegram.rs | Verify token not logged |
| test_discord_no_token_log | discord.rs | Verify token not logged |
| test_health_response_format | health.rs | Verify health check format |
| test_channel_panic_recovery | channel.rs | Verify panic isolation |
Integration Tests
| Test | Location | Purpose |
|------|----------|---------|
| test_gateway_message_flow | tests/gateway_integration.rs | End-to-end message flow |
| test_outbound_dispatch | tests/gateway_integration.rs | GAP-003 verification |
| test_channel_failure_isolation | tests/channel_recovery.rs | GAP-004 verification |
| test_health_endpoint | tests/gateway_integration.rs | GAP-009 verification |
Implementation Steps
Step 1: Security - Remove Token Logging
Files: crates/terraphim_tinyclaw/src/channels/telegram.rs, discord.rs
Description: Remove all token logging including partial tokens
Tests: Unit tests verifying no token in logs
Estimated: 1 hour
Priority: P1
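The unit tests for this step need some way to assert that neither the token nor a fragment of it reaches the logs. The sketch below shows one possible check. `startup_log_lines` is a hypothetical stand-in for whatever the channel actually logs at startup, and `leaks_token` is an illustrative helper, not an existing function in the crate.

```rust
/// Hypothetical stand-in for a channel's startup logging: returns the lines
/// it would emit. The token is accepted but never formatted into a message.
pub fn startup_log_lines(channel: &str, _token: &str) -> Vec<String> {
    vec![format!("{} channel starting", channel)]
}

/// True if any log line contains a fragment of the token that is `min_len`
/// characters long. Checking fragments also catches partial-token logging.
/// Returns false when `min_len` is 0 or the token is shorter than `min_len`.
pub fn leaks_token(lines: &[String], token: &str, min_len: usize) -> bool {
    let chars: Vec<char> = token.chars().collect();
    if min_len == 0 || chars.len() < min_len {
        return false;
    }
    chars.windows(min_len).any(|window| {
        let fragment: String = window.iter().collect();
        lines.iter().any(|line| line.contains(&fragment))
    })
}
```

A test would collect the real log output (for example through a capturing logger) and pass it to a check like `leaks_token` with a fragment length short enough to catch prefix logging.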
// BEFORE (remove this):
info!;
// AFTER:
info!;
Step 2: Gateway Integration Tests
Files: crates/terraphim_tinyclaw/tests/gateway_integration.rs
Description: Test that outbound messages are dispatched to channels
Tests: GAP-003 verification
Estimated: 4 hours
Priority: P0
Depends on: None
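The property under test is that a message placed on the outbound side of the bus reaches the channel it names. The sketch below models that with standard-library channels so it runs without a runtime; the real gateway is asynchronous, and `OutboundMessage` and `dispatch_outbound` are hypothetical names rather than the crate's actual types.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical outbound message: the target channel name and the text to send.
#[derive(Debug, Clone, PartialEq)]
pub struct OutboundMessage {
    pub channel: String,
    pub text: String,
}

/// Drain the outbound queue and forward each message to its channel.
/// Returns how many messages were delivered; unknown channels are skipped.
pub fn dispatch_outbound(
    outbound: &Receiver<OutboundMessage>,
    channels: &HashMap<String, Sender<String>>,
) -> usize {
    let mut delivered = 0;
    // try_iter yields only messages that are already queued, so this never blocks.
    for msg in outbound.try_iter() {
        if let Some(tx) = channels.get(&msg.channel) {
            if tx.send(msg.text).is_ok() {
                delivered += 1;
            }
        }
    }
    delivered
}
```

An integration test along these lines would register a fake channel, push one outbound message, and assert both the delivered count and the text received by the fake.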
Step 3: Channel Error Recovery
Files: crates/terraphim_tinyclaw/src/channel.rs
Description: Wrap channel operations in catch_unwind to prevent gateway crashes
Tests: Unit and integration tests
Estimated: 3 hours
Priority: P1
/// Run channel with panic recovery
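A minimal synchronous sketch of the catch_unwind idea follows. The function name `run_with_recovery` and the string error are assumptions for illustration. In async code the equivalent would more likely be `FutureExt::catch_unwind` from the futures crate, or spawning the channel as a task and checking `JoinError::is_panic`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run one channel's work and convert a panic into an error string,
/// so a failing channel cannot take the whole gateway down.
/// This only works when the binary is built with the default `panic = "unwind"`.
pub fn run_with_recovery<F, T>(name: &str, work: F) -> Result<T, String>
where
    F: FnOnce() -> T,
{
    panic::catch_unwind(AssertUnwindSafe(work)).map_err(|payload| {
        // Panic payloads are usually &str or String; fall back otherwise.
        let reason = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        };
        format!("channel {} panicked: {}", name, reason)
    })
}
```

`AssertUnwindSafe` is a promise that no shared state is left half-updated after a panic, so the wrapper should own or re-create whatever state the channel touches before restarting it.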
Step 4: Health Check Endpoint
Files: crates/terraphim_tinyclaw/src/health.rs, main.rs
Description: Add HTTP health check endpoint for load balancers
Tests: Integration test
Estimated: 3 hours
Priority: P1
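The plan wires this endpoint through axum, whose handler code is not reproduced here. The framework-free sketch below only illustrates the expected behavior at the HTTP level: 200 for GET /health and an error for anything else. The function name `health_http_response` and the exact JSON bodies are assumptions.

```rust
/// Build a raw HTTP/1.1 response for a request line such as
/// "GET /health HTTP/1.1". Only the method and path are inspected.
pub fn health_http_response(request_line: &str) -> String {
    let mut parts = request_line.split_whitespace();
    let method = parts.next().unwrap_or("");
    let path = parts.next().unwrap_or("");
    let (status, body) = match (method, path) {
        ("GET", "/health") => ("200 OK", "{\"status\":\"ok\"}"),
        (_, "/health") => (
            "405 Method Not Allowed",
            "{\"error\":\"method not allowed\"}",
        ),
        _ => ("404 Not Found", "{\"error\":\"not found\"}"),
    };
    format!(
        "HTTP/1.1 {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        status,
        body.len(),
        body
    )
}
```

The integration test for GAP-009 can assert the same three cases against the real endpoint once the listening port is confirmed.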
Step 5: Documentation
Files: docs/src/gateway-deployment.md, docs/src/channel-setup.md
Description: User-facing documentation for gateway deployment
Tests: Doc tests, manual verification
Estimated: 4 hours
Priority: P1
Depends on: Steps 1-4
Outline for gateway-deployment.md:
- Overview of gateway mode
- Prerequisites (server, tokens)
- Configuration file format
- Docker deployment
- systemd service setup
- Health check usage
- Troubleshooting
Outline for channel-setup.md:
- Creating a Telegram bot (@BotFather)
- Getting Discord bot token
- Allowlist configuration
- Testing channel connectivity
Step 6: Cargo.toml Updates
Files: crates/terraphim_tinyclaw/Cargo.toml
Description: Add optional axum dependency for health endpoint
Tests: Build verification
Estimated: 1 hour
Priority: P1
[dependencies]
# Add for health endpoint
axum = { version = "0.7", optional = true, features = ["tokio"] }
tower = { version = "0.4", optional = true }
[features]
default = ["telegram", "discord", "health"]
health = ["dep:axum", "dep:tower"]
Rollback Plan
If issues are discovered:
- Token logging removal: Revert the specific commits, but note that tokens in logs are a security risk
- Health endpoint: Can be disabled by building without the health feature (it is in the default feature set, so this requires --no-default-features)
- Channel recovery: If it causes issues, revert the wrapper and accept the crash risk
- Gateway tests: Test-only changes, safe to keep
Dependencies
New Dependencies
| Crate | Version | Justification |
|-------|---------|---------------|
| axum | 0.7 | Health check HTTP endpoint |
| tower | 0.4 | Axum middleware support |
Dependency Updates
None - all optional additions.
Performance Considerations
Expected Performance
| Metric | Target | Measurement |
|--------|--------|-------------|
| Health check latency | < 10ms | HTTP request |
| Channel recovery time | < 100ms | Panic to restart |
| Memory overhead | < 1MB | Recovery wrapper |
No Benchmarks Required
Changes are operational, not performance-critical.
Open Items
| Item | Status | Owner |
|------|--------|-------|
| Review security fix approach | Pending | Security review |
| Confirm health endpoint port | Pending | DevOps input |
| Validate documentation accuracy | Pending | Manual testing |
Approval Checklist
- [ ] Security review of token logging removal
- [ ] Technical review of panic recovery approach
- [ ] DevOps approval of health endpoint design
- [ ] Test strategy approved
- [ ] Documentation review scheduled
- [ ] Human approval received
Next Steps
- Review this implementation plan
- Approve P0 (GAP-003) test implementation
- Prioritize P1 items based on deployment timeline
- Schedule implementation sprints
- Create GitHub issues for each gap
Traceability Matrix
| Gap ID | Priority | Step | Test | Status |
|--------|----------|------|------|--------|
| GAP-001 | P1 | Step 1 | Unit test | Planned |
| GAP-002 | P1 | Step 1 | Unit test | Planned |
| GAP-003 | P0 | Step 2 | Integration test | Planned |
| GAP-004 | P1 | Step 3 | Integration test | Planned |
| GAP-009 | P1 | Step 4 | Integration test | Planned |
| GAP-011 | P1 | Step 5 | Doc review | Planned |
| GAP-012 | P1 | Step 5 | Doc review | Planned |