Phase 2 Implementation Plan: TinyClaw Enhancements
Status: Draft
Research Doc: docs/plans/tinyclaw-phase2-research.md
Author: Terraphim AI Team
Date: 2026-02-12
Estimated Effort: 4-6 weeks
Overview
Summary
Extend TinyClaw with three major capabilities: WhatsApp integration via Matrix bridge, voice message transcription using Whisper, and a skills system for reusable workflows.
Approach
Build incrementally on Phase 1 architecture:
- WhatsApp Channel - New Channel trait implementation using Matrix SDK
- Voice Transcription - New tool that downloads audio, transcribes with Whisper
- Skills System - JSON-based workflow definitions with save/load/monitor
Scope
In Scope:
- WhatsApp integration via Matrix bridge (mautrix-whatsapp)
- Voice transcription tool (Whisper base model)
- Skills system with JSON storage
- Audio format conversion (ogg/opus → wav)
- Skill CRUD operations (/skill save, /skill load, /skill list)
Out of Scope:
- Real-time voice streaming (batch processing only)
- WhatsApp Business API (approval process too long)
- Skill marketplace/cloud sync (local files only)
- Custom voice models (Whisper base only)
Avoid At All Cost:
- Native WhatsApp libraries (whatsapp-web-rs) - unreliable, frequent breakage
- Video transcription - scope creep, audio-only is sufficient
- Cloud-hosted skills - adds complexity, local files work
Architecture
Component Diagram
┌─────────────────────────────────────────────────────────────────┐
│                        TinyClaw Phase 2                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Channels                   Agent Loop         Tools             │
│ ┌─────────────┐            ┌──────────────┐   ┌──────────────┐  │
│ │ CLI         │            │              │   │ Filesystem   │  │
│ ├─────────────┤◄──────────►│    Hybrid    │   ├──────────────┤  │
│ │ Telegram    │            │     LLM      │   │ Edit         │  │
│ ├─────────────┤            │    Router    │   ├──────────────┤  │
│ │ Discord     │            │              │   │ Shell        │  │
│ ├─────────────┤            └──────────────┘   ├──────────────┤  │
│ │ WhatsApp ◄──┼──┐                            │ Web Search   │  │
│ │ (Matrix)    │  │                            ├──────────────┤  │
│ └─────────────┘  │                            │ Web Fetch    │  │
│                  │                            ├──────────────┤  │
│ Voice Messages   │                            │ Voice ◄──────┼─┘│
│ ┌─────────────┐  │                            │ Transcribe   │  │
│ │ Download    │  │                            └──────────────┘  │
│ │ Convert     │  │                                              │
│ │ Whisper     │──┘                                              │
│ └─────────────┘                                                 │
│                                                                 │
│ Skills                                                          │
│ ┌─────────────┐                                                 │
│ │ Load JSON   │                                                 │
│ │ Execute     │                                                 │
│ │ Monitor     │                                                 │
│ └─────────────┘                                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Data Flow
WhatsApp Message:
WhatsApp → mautrix-whatsapp → Matrix Server → MatrixChannel → Bus → Agent → Response
Voice Message:
WhatsApp voice → Download → Convert to WAV → Whisper → Text → Agent
Skill Execution:
/skill run analyze-repo → Load JSON → Parse steps → Execute tools → Report results
Key Design Decisions
| Decision | Rationale | Alternatives Rejected |
|----------|-----------|----------------------|
| Matrix bridge for WhatsApp | Reliable, handles reconnection, rate limits | whatsapp-web-rs (breaks frequently) |
| Whisper base model | 74M parameters, 2x real-time, good quality | Large model (1550M parameters, slower), Tiny (poor quality) |
| JSON for skills | Human-readable, standard, editable | YAML (slower), Binary (not readable) |
| Download-then-transcribe | Simpler than streaming, handles large files | Real-time streaming (complex, unnecessary) |
Eliminated Options
| Option Rejected | Why Rejected | Risk of Including |
|-----------------|--------------|-------------------|
| whatsapp-web-rs direct | Meta blocks frequently, ToS violations | Sudden breakage, user frustration |
| Real-time voice streaming | Adds complexity, not needed for messaging | Delayed delivery, audio glitches |
| Skill cloud storage | Adds auth, sync complexity | Privacy concerns, offline failure |
| Whisper large model | 1550M parameters (multi-gigabyte download), 5x slower | Slow user experience, memory pressure |
Simplicity Check
"Minimum code that solves the problem. Nothing speculative."
What if this could be easy?
- WhatsApp: Matrix bridge handles complexity, we just implement Matrix channel
- Voice: Download file, run Whisper, return text - no streaming needed
- Skills: JSON files, simple sequential execution - no DAG or parallelism yet
Senior Engineer Test: ✅ This design would pass senior review - straightforward extensions of existing patterns.
Nothing Speculative Checklist:
- [x] No features the user didn't request (no video, no streaming)
- [x] No abstractions "in case we need them later" (skills are simple sequential)
- [x] No flexibility "just in case" (WhatsApp only via Matrix for now)
- [x] No error handling for scenarios that cannot occur (Whisper will always return something)
- [x] No premature optimization (base model is fast enough)
File Changes
New Files
| File | Purpose |
|------|---------|
| src/channels/matrix.rs | Matrix channel adapter for WhatsApp bridge |
| src/tools/voice_transcribe.rs | Audio download + Whisper transcription |
| src/skills/mod.rs | Skills system module |
| src/skills/types.rs | Skill JSON types |
| src/skills/executor.rs | Skill step execution |
| src/skills/monitor.rs | Skill monitoring/logging |
Modified Files
| File | Changes |
|------|---------|
| src/channels/mod.rs | Add pub mod matrix; |
| src/tools/mod.rs | Add voice_transcribe tool to registry |
| src/main.rs | Add /skill slash commands |
| src/agent/agent_loop.rs | Handle skill execution, voice messages |
| src/bus.rs | Add media_url field to InboundMessage |
| Cargo.toml | Add matrix-sdk, whisper-rs, symphonia dependencies |
Dependencies
New Dependencies
| Crate | Version | Justification |
|-------|---------|---------------|
| matrix-sdk | 0.7 | Matrix client for WhatsApp bridge |
| whisper-rs | 0.11 | OpenAI Whisper transcription |
| ort | 2.0 | ONNX runtime for Whisper |
| symphonia | 0.5 | Audio format conversion |
| tokio-util | 0.7 | Already present, codec support |
API Design
Matrix Channel
/// Matrix channel for WhatsApp bridge
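The channel type might look roughly like the sketch below. It is a minimal, synchronous outline and not the final API: the `Channel` trait shown here is a hypothetical stand-in for the Phase 1 trait (which is probably async), and `MatrixConfig`, its fields, and the validation rules in `new` are assumptions made for illustration. No `matrix-sdk` calls are shown.

```rust
/// Hypothetical stand-in for the Phase 1 `Channel` trait (assumption:
/// the real trait has more methods and is likely async).
pub trait Channel {
    fn name(&self) -> &str;
    fn is_allowed(&self, sender: &str) -> bool;
}

/// Hypothetical configuration for the Matrix channel.
#[derive(Debug, Clone)]
pub struct MatrixConfig {
    pub homeserver_url: String,
    pub username: String,
    pub password: String,
    /// Matrix user IDs that may talk to the agent.
    pub allow_from: Vec<String>,
}

/// Matrix channel for WhatsApp bridge
pub struct MatrixChannel {
    config: MatrixConfig,
}

impl MatrixChannel {
    /// Validates the configuration before any network call is made.
    /// Both checks are illustrative policy choices, not requirements
    /// taken from the plan.
    pub fn new(config: MatrixConfig) -> Result<Self, String> {
        let url = &config.homeserver_url;
        if !(url.starts_with("https://") || url.starts_with("http://")) {
            return Err(format!("homeserver URL must be http(s): {url}"));
        }
        if config.allow_from.is_empty() {
            return Err("allow_from must list at least one Matrix user".to_string());
        }
        Ok(Self { config })
    }
}

impl Channel for MatrixChannel {
    fn name(&self) -> &str {
        "matrix"
    }

    /// Exact-match check against the configured whitelist.
    fn is_allowed(&self, sender: &str) -> bool {
        self.config.allow_from.iter().any(|id| id == sender)
    }
}
```

Validating in the constructor keeps the login/sync code free of configuration checks and makes `test_matrix_channel_new` a pure unit test.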
Voice Transcription Tool
/// Tool for transcribing voice messages
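One way to keep the tool unit-testable without loading a model is to hide Whisper behind a small trait, as sketched below. The `Transcriber` trait, `MockTranscriber`, the `max_seconds` limit, and the `execute` signature are all hypothetical names introduced for this sketch; the real tool would implement whatever `Tool` trait Phase 1 defines and would wrap `whisper-rs` in a `Transcriber` implementation.

```rust
/// Whisper expects 16 kHz mono input.
const SAMPLE_RATE: usize = 16_000;

/// Hypothetical abstraction over the speech-to-text backend.
pub trait Transcriber {
    /// Takes 16 kHz mono samples and returns the recognised text.
    fn transcribe(&self, samples: &[f32]) -> Result<String, String>;
}

/// Tool for transcribing voice messages
pub struct VoiceTranscribeTool<T: Transcriber> {
    backend: T,
    /// Clips longer than this are rejected (batch processing only).
    max_seconds: usize,
}

impl<T: Transcriber> VoiceTranscribeTool<T> {
    pub fn new(backend: T, max_seconds: usize) -> Self {
        Self { backend, max_seconds }
    }

    /// Validates the clip, runs the backend, and trims the result.
    pub fn execute(&self, samples: &[f32]) -> Result<String, String> {
        if samples.is_empty() {
            return Err("audio clip is empty".to_string());
        }
        if samples.len() > self.max_seconds * SAMPLE_RATE {
            return Err(format!("audio clip exceeds {} seconds", self.max_seconds));
        }
        let text = self.backend.transcribe(samples)?;
        Ok(text.trim().to_string())
    }
}

/// Test double that returns a fixed transcript.
pub struct MockTranscriber(pub &'static str);

impl Transcriber for MockTranscriber {
    fn transcribe(&self, _samples: &[f32]) -> Result<String, String> {
        Ok(self.0.to_string())
    }
}
```

With this split, `test_voice_transcribe_mock` can exercise the validation and trimming logic using `MockTranscriber`, while the real model is only needed in the integration test.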
Skills System
/// Skill definition from JSON
/// Individual step in a skill
/// Skill execution handle
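The three types could be shaped roughly as follows. This is a sketch under assumptions: the field names, the two `SkillStep` variants, and the cancellation flag are illustrative, and the real types would additionally derive the `serde` traits needed for JSON storage (omitted here to keep the example dependency-free).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Skill definition from JSON
#[derive(Debug, Clone, PartialEq)]
pub struct Skill {
    pub name: String,
    pub description: String,
    pub steps: Vec<SkillStep>,
}

/// Individual step in a skill
#[derive(Debug, Clone, PartialEq)]
pub enum SkillStep {
    /// Call a registered tool; `args` holds the raw JSON arguments.
    Tool { name: String, args: String },
    /// Send a prompt to the LLM.
    Prompt { text: String },
}

/// Skill execution handle
#[derive(Debug, Clone)]
pub struct SkillExecution {
    pub skill_name: String,
    /// Shared flag so that any clone of the handle can cancel the run.
    cancelled: Arc<AtomicBool>,
}

impl SkillExecution {
    pub fn new(skill_name: &str) -> Self {
        Self {
            skill_name: skill_name.to_string(),
            cancelled: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Requests cancellation; the executor is expected to check
    /// `is_cancelled` between steps.
    pub fn cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}
```

Because the flag lives behind an `Arc`, the handle given to the `/skill cancel` command and the one held by the executor observe the same state.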
Enhanced InboundMessage
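The File Changes table adds a `media_url` field to `InboundMessage`. A minimal sketch of that change is shown below; every field other than `media_url` is an assumption about the Phase 1 struct, and the two constructor helpers are hypothetical. Making the field an `Option` keeps existing text-only channels unchanged.

```rust
/// Hypothetical shape of the bus message. Only `media_url` is new in
/// Phase 2; the other fields are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct InboundMessage {
    pub channel: String,
    pub sender: String,
    pub text: String,
    /// Location of an attached media file (for example a voice note).
    pub media_url: Option<String>,
}

impl InboundMessage {
    /// Constructor for the existing text-only path.
    pub fn text_only(channel: &str, sender: &str, text: &str) -> Self {
        Self {
            channel: channel.to_string(),
            sender: sender.to_string(),
            text: text.to_string(),
            media_url: None,
        }
    }

    /// Attaches a media URL to a message.
    pub fn with_media(mut self, url: &str) -> Self {
        self.media_url = Some(url.to_string());
        self
    }

    /// True when the agent loop should consider the voice pipeline.
    pub fn has_media(&self) -> bool {
        self.media_url.is_some()
    }
}
```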
Test Strategy
Unit Tests
| Test | Location | Purpose |
|------|----------|---------|
| test_matrix_channel_new | channels/matrix.rs | Channel creation |
| test_matrix_login | channels/matrix.rs | Matrix authentication |
| test_voice_transcribe_mock | tools/voice_transcribe.rs | Tool execution |
| test_skill_load | skills/executor.rs | JSON parsing |
| test_skill_execute_steps | skills/executor.rs | Step execution |
| test_skill_cancel | skills/monitor.rs | Cancellation |
Integration Tests
| Test | Location | Purpose |
|------|----------|---------|
| test_matrix_whatsapp_flow | tests/matrix_integration.rs | Full Matrix→WhatsApp flow |
| test_voice_message_pipeline | tests/voice_integration.rs | Download→Transcribe→Text |
| test_skill_end_to_end | tests/skill_integration.rs | Save→Load→Execute |
Property Tests
// Skill JSON should always parse if valid
proptest!
Implementation Steps
Step 1: WhatsApp via Matrix (Week 1-2)
Files: src/channels/matrix.rs, src/channels/mod.rs, Cargo.toml
Description: Implement Matrix channel adapter that connects to mautrix-whatsapp bridge
Tasks:
- Add matrix-sdk dependency
- Implement MatrixChannel with login/sync
- Handle m.room.message events
- Support text and media messages
- Map Matrix users to allow_from whitelist
Tests:
- Unit: Channel creation, login
- Integration: Message flow (mock Matrix server)
Dependencies: None
Estimated: 8 hours
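The whitelist mapping task in this step could be sketched as below. The code assumes that the bridge exposes WhatsApp contacts as Matrix users of the form `@whatsapp_<number>:<server>`; that naming is configurable in the bridge, so treat the `whatsapp_` prefix as an assumption. The function names are hypothetical, and `allow_from` entries are taken to be either full Matrix IDs or phone numbers.

```rust
/// Splits a Matrix user ID of the form `@localpart:server`.
pub fn split_mxid(mxid: &str) -> Option<(&str, &str)> {
    let rest = mxid.strip_prefix('@')?;
    let (local, server) = rest.split_once(':')?;
    if local.is_empty() || server.is_empty() {
        return None;
    }
    Some((local, server))
}

/// Extracts the phone number from a bridge puppet ID.
/// Assumption: puppets are named `whatsapp_<digits>`.
pub fn whatsapp_number(mxid: &str) -> Option<&str> {
    let (local, _server) = split_mxid(mxid)?;
    let digits = local.strip_prefix("whatsapp_")?;
    if !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()) {
        Some(digits)
    } else {
        None
    }
}

/// Checks a sender against `allow_from`, which may contain full
/// Matrix IDs or phone numbers (with or without a leading `+`).
pub fn is_allowed(sender: &str, allow_from: &[&str]) -> bool {
    let number = whatsapp_number(sender);
    allow_from.iter().any(|entry| {
        *entry == sender
            || number.map_or(false, |n| entry.trim_start_matches('+') == n)
    })
}
```

Accepting phone numbers in `allow_from` lets the same whitelist format be reused across channels without exposing bridge-specific IDs in the configuration.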
Step 2: Voice Transcription Tool (Week 2-3)
Files: src/tools/voice_transcribe.rs, src/tools/mod.rs
Description: Create tool that downloads voice messages and transcribes with Whisper
Tasks:
- Add whisper-rs, ort, symphonia dependencies
- Implement audio download (reqwest)
- Implement format conversion (ogg→wav)
- Integrate Whisper base model
- Add voice_transcribe to ToolRegistry
Tests:
- Unit: Mock Whisper, test pipeline
- Integration: Real audio file transcription
Dependencies: Step 1 (for Matrix voice messages)
Estimated: 10 hours
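Whisper consumes 16 kHz mono floating-point samples, so the conversion task ends with a downmix and resample after decoding. The sketch below covers only that last part and assumes the decoder has already produced interleaved 16-bit PCM at 48 kHz; the decode step itself is not shown. Both function names are hypothetical, and the resampler is a deliberately naive averaging filter, not a production-quality one.

```rust
/// Downmixes interleaved 16-bit PCM to mono `f32` in [-1.0, 1.0).
/// Any trailing partial frame is dropped.
pub fn to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| {
            let sum: f32 = frame.iter().map(|&s| s as f32 / 32768.0).sum();
            sum / channels as f32
        })
        .collect()
}

/// Naive 48 kHz to 16 kHz conversion: averages each group of three
/// samples. Assumption: the input really is 48 kHz; a proper
/// resampler would apply a low-pass filter first.
pub fn downsample_48k_to_16k(samples: &[f32]) -> Vec<f32> {
    samples
        .chunks_exact(3)
        .map(|c| (c[0] + c[1] + c[2]) / 3.0)
        .collect()
}
```

A 30-second stereo clip at 48 kHz holds 2,880,000 interleaved samples; after both functions it is reduced to 480,000 mono samples, which is the buffer handed to the transcription backend.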
Step 3: Skills System Core (Week 3-4)
Files: src/skills/mod.rs, src/skills/types.rs, src/skills/executor.rs
Description: JSON-based skill definitions with sequential step execution
Tasks:
- Define Skill and SkillStep types
- Implement JSON serialization
- Create SkillExecutor with load/save/list
- Implement step execution (tool calls, LLM prompts)
- Add skills directory (~/.config/terraphim/skills/)
Tests:
- Unit: Load/save, step execution
- Integration: End-to-end skill execution
Dependencies: None (independent)
Estimated: 12 hours
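Sequential step execution, as described in this step, can be sketched as a plain loop that stops at the first failure. The `SkillStep` variants, `StepError`, and the `run_step` callback are hypothetical; in the real executor the callback would dispatch to the tool registry or the LLM router.

```rust
/// Hypothetical step type: either a tool call or an LLM prompt.
#[derive(Debug, Clone, PartialEq)]
pub enum SkillStep {
    Tool { name: String, args: String },
    Prompt { text: String },
}

/// Records which step failed and why.
#[derive(Debug, PartialEq)]
pub struct StepError {
    pub index: usize,
    pub message: String,
}

/// Runs the steps in order and returns their outputs. Execution stops
/// at the first failing step; later steps are not attempted.
pub fn execute_steps<F>(steps: &[SkillStep], mut run_step: F) -> Result<Vec<String>, StepError>
where
    F: FnMut(&SkillStep) -> Result<String, String>,
{
    let mut outputs = Vec::with_capacity(steps.len());
    for (index, step) in steps.iter().enumerate() {
        match run_step(step) {
            Ok(output) => outputs.push(output),
            Err(message) => return Err(StepError { index, message }),
        }
    }
    Ok(outputs)
}
```

Passing the dispatcher as a closure keeps the loop free of any dependency on the tool registry, which is what makes `test_skill_execute_steps` a unit test and not an integration test.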
Step 4: Skills Slash Commands (Week 4)
Files: src/main.rs, src/agent/agent_loop.rs
Description: CLI commands for skill management
Tasks:
- Add /skill save <name> - Save current conversation as skill
- Add /skill load <name> - Load and execute skill
- Add /skill list - Show available skills
- Add /skill cancel - Cancel running skill
- Handle skill execution in agent loop
Tests:
- Unit: Slash command parsing
- Integration: Full command flow
Dependencies: Step 3
Estimated: 6 hours
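Parsing for the four commands listed in this step could look like the sketch below. `SkillCommand` and `parse_skill_command` are hypothetical names, and the error strings are placeholders.

```rust
/// Parsed form of the `/skill` commands listed in Step 4.
#[derive(Debug, PartialEq)]
pub enum SkillCommand {
    Save(String),
    Load(String),
    List,
    Cancel,
}

/// Parses a line such as `/skill save analyze-repo`.
pub fn parse_skill_command(input: &str) -> Result<SkillCommand, String> {
    let mut parts = input.split_whitespace();
    if parts.next() != Some("/skill") {
        return Err("not a /skill command".to_string());
    }
    let sub = parts.next().ok_or("missing subcommand")?;
    let name = parts.next();
    if parts.next().is_some() {
        return Err("too many arguments".to_string());
    }
    match (sub, name) {
        ("save", Some(n)) => Ok(SkillCommand::Save(n.to_string())),
        ("load", Some(n)) => Ok(SkillCommand::Load(n.to_string())),
        ("list", None) => Ok(SkillCommand::List),
        ("cancel", None) => Ok(SkillCommand::Cancel),
        ("save" | "load", None) => Err(format!("/skill {sub} requires a name")),
        ("list" | "cancel", Some(_)) => Err(format!("/skill {sub} takes no arguments")),
        (other, _) => Err(format!("unknown subcommand: {other}")),
    }
}
```

Returning a typed enum keeps the agent loop's handling exhaustive: adding a new subcommand later produces a compile error wherever `SkillCommand` is matched.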
Step 5: Skills Monitoring (Week 5)
Files: src/skills/monitor.rs
Description: Monitor skill execution, logging, progress reporting
Tasks:
- Add execution logging
- Implement progress reporting (step X of Y)
- Add timeout handling
- Create execution report on completion
Tests:
- Unit: Monitor functionality
- Integration: Long-running skill monitoring
Dependencies: Step 4
Estimated: 4 hours
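Progress reporting and timeout handling can be kept in one small struct, sketched below. `SkillMonitor` and its methods are hypothetical; the executor would be expected to call `step_done` after each step and check `timed_out` before starting the next one.

```rust
use std::time::{Duration, Instant};

/// Tracks progress and elapsed time for one skill run.
pub struct SkillMonitor {
    total_steps: usize,
    completed: usize,
    started: Instant,
    timeout: Duration,
}

impl SkillMonitor {
    pub fn new(total_steps: usize, timeout: Duration) -> Self {
        Self {
            total_steps,
            completed: 0,
            started: Instant::now(),
            timeout,
        }
    }

    /// Marks one more step as finished (never exceeds the total).
    pub fn step_done(&mut self) {
        if self.completed < self.total_steps {
            self.completed += 1;
        }
    }

    /// Progress line in the "step X of Y" form.
    pub fn progress(&self) -> String {
        format!("step {} of {}", self.completed, self.total_steps)
    }

    /// True once the run has used up its time budget.
    pub fn timed_out(&self) -> bool {
        self.started.elapsed() >= self.timeout
    }

    /// Summary for the execution report on completion.
    pub fn report(&self) -> String {
        format!(
            "{}/{} steps completed in {:?}",
            self.completed,
            self.total_steps,
            self.started.elapsed()
        )
    }
}
```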
Step 6: Integration & Polish (Week 5-6)
Files: All modified files
Description: End-to-end testing, documentation, examples
Tasks:
- Write integration tests for all features
- Create example skills (analyze-repo, research-topic)
- Update documentation
- Performance testing
- Code review and cleanup
Tests:
- Full test suite
- UBS static analysis
- Benchmarks
Dependencies: All previous steps
Estimated: 10 hours
Rollback Plan
If issues discovered:
- WhatsApp not working: Disable Matrix feature flag, keep Telegram/Discord
- Voice too slow: Disable voice_transcribe tool, text-only mode
- Skills buggy: Disable /skill commands, core agent loop unaffected
Feature Flags:
- matrix - Enable Matrix/WhatsApp support
- voice - Enable voice transcription
- skills - Enable skill system
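In code, the rollback paths above amount to compile-time gating on these three flags. The sketch below assumes the flags are declared as Cargo features named `matrix`, `voice`, and `skills`; the two functions are hypothetical and only illustrate the pattern.

```rust
/// Lists which optional Phase 2 features were compiled in.
pub fn enabled_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(feature = "matrix") {
        features.push("matrix");
    }
    if cfg!(feature = "voice") {
        features.push("voice");
    }
    if cfg!(feature = "skills") {
        features.push("skills");
    }
    features
}

/// Rollback example: with `skills` disabled, `/skill` commands are
/// rejected up front and the rest of the loop is unaffected.
pub fn handle_slash_command(input: &str) -> String {
    if input.starts_with("/skill") && !cfg!(feature = "skills") {
        return "skills are disabled in this build".to_string();
    }
    format!("handled: {input}")
}
```

Whole modules can be gated the same way with `#[cfg(feature = "matrix")]` on the `pub mod matrix;` line, so a disabled feature also drops its dependencies from the build.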
Performance Considerations
Expected Performance
| Metric | Target | Measurement |
|--------|--------|-------------|
| Matrix message latency | < 2s | Message round-trip |
| Voice transcription (30s) | < 5s | Download + convert + Whisper |
| Skill load time | < 100ms | File I/O |
| Memory per voice model | < 200MB | Whisper base + ort |
Benchmarks to Add
Open Items
| Item | Status | Owner |
|------|--------|-------|
| Matrix server setup docs | Pending | Docs team |
| Whisper model download strategy | Pending | Dev team |
| Skill examples (3 minimum) | Pending | Dev team |
Approval
- [ ] Technical review complete
- [ ] Test strategy approved
- [ ] Performance targets agreed
- [ ] Human approval received
Design Complete: Ready for specification interview and implementation.