a digital entity named phi that roams bsky
# thread storage refactor: removing data duplication

## the problem

we're duplicating thread data that already exists on the atproto network. specifically:

```sql
-- database.py - thread_messages table
CREATE TABLE IF NOT EXISTS thread_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_uri TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    author_did TEXT NOT NULL,
    message_text TEXT NOT NULL,
    post_uri TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
```

this stores messages that are already:
- living on users' personal data servers (PDSs)
- aggregated by the bluesky AppView
- accessible on-demand via `client.get_thread(uri, depth=100)`

## why this is duplicative

### the appview already does this work

when we call `get_thread()`, the appview:
1. stitches together posts from multiple PDSs
2. resolves parent/child relationships
3. returns the complete thread structure
4. handles deletions, edits, and blocks

we're then taking this data and copying it into sqlite, where it becomes:
- stale (if posts are deleted/edited)
- disconnected from the source of truth
- an unnecessary maintenance burden

### our own scripts prove this

```python
# sandbox/view_thread.py - fetches threads without local storage
import httpx

def fetch_thread(post_uri: str):
    response = httpx.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread",
        params={"uri": post_uri, "depth": 100}
    )
    return response.json()["thread"]
```

this script demonstrates that thread data is readily available from the network. we don't need to cache it in sqlite to access it.

## what we should keep: turbopuffer

crucially, **turbopuffer is NOT duplicative**. it serves a completely different purpose:

### turbopuffer = semantic memory (essential)
- stores embeddings for semantic search
- answers: "what did we discuss about birds last week?"
- provides episodic memory across ALL conversations
- enables pattern recognition and relationship building
- core to the IIT consciousness exploration

### sqlite thread_messages = chronological cache (redundant)
- stores literal thread messages
- answers: "what was said in this specific thread?"
- duplicates data already on network
- provides no semantic search capability

the difference:
```python
# turbopuffer usage (semantic search) - KEEP THIS
memory_context = await memory.get_user_memories(
    user_handle="alice.bsky.social",
    query="birds"  # semantic search across all conversations
)

# sqlite usage (thread retrieval) - REMOVE THIS
thread_context = thread_db.get_thread_messages(thread_uri)
# ^ this is just retrieving what we could fetch from network
```

## proposed architecture

### current flow (with duplication)
```
mention received
  → fetch thread from network (get_thread)
  → store all messages in sqlite
  → read back from sqlite
  → build thread context string
  → pass to agent
```

### proposed flow (network-first)
```
mention received
  → fetch thread from network (get_thread)
  → extract messages directly
  → build thread context string
  → pass to agent
```

### with optional caching
```
mention received
  → check in-memory cache (TTL: 5 minutes)
  → if miss: fetch thread from network
  → extract messages + cache
  → build thread context string
  → pass to agent
```

## implementation plan

### phase 1: extract thread parsing logic

create a utility that converts raw atproto thread data to context:

```python
# bot/utils/thread.py (already exists, extend it)
def build_thread_context(thread_node) -> str:
    """Build conversational context from ATProto thread structure.

    Returns formatted string like:
        @alice: I love birds
        @phi: me too! what's your favorite?
        @alice: especially crows
    """
    posts = extract_posts_chronological(thread_node)

    messages = []
    for post in posts:
        handle = post.author.handle
        text = post.record.text
        messages.append(f"@{handle}: {text}")

    return "\n".join(messages)
```

### phase 2: update message handler

```python
# bot/services/message_handler.py - BEFORE
# Get thread context from database
thread_context = thread_db.get_thread_messages(thread_uri)

# bot/services/message_handler.py - AFTER
# Fetch thread from network
thread_data = await self.client.get_thread(thread_uri, depth=100)
thread_context = build_thread_context(thread_data.thread)
```

### phase 3: remove sqlite thread storage

**delete:**
- `thread_messages` table definition
- `add_message()` method
- `get_thread_messages()` method
- all calls to `thread_db.add_message()`

**keep:**
- `approval_requests` table (for future self-modification)
- database.py module structure

### phase 4: optional caching layer

if network latency becomes an issue:

```python
from datetime import datetime, timedelta

class ThreadCache:
    def __init__(self, ttl_seconds: int = 300):  # 5-minute TTL
        self._cache = {}
        self.ttl = timedelta(seconds=ttl_seconds)

    def get(self, thread_uri: str) -> str | None:
        if thread_uri in self._cache:
            context, timestamp = self._cache[thread_uri]
            if datetime.now() - timestamp < self.ttl:
                return context
        return None

    def set(self, thread_uri: str, context: str):
        self._cache[thread_uri] = (context, datetime.now())
```

## risk analysis

### risk: increased latency

**likelihood**: low
- get_thread() is fast (typically <200ms)
- we already call it for thread discovery
- public api is highly available

**mitigation**: add caching if needed

### risk: rate limiting
**likelihood**: low
- we only fetch threads when processing mentions
- mentions are relatively infrequent
- session persistence already reduces auth overhead

**mitigation**:
- implement exponential backoff
- cache frequently accessed threads

### risk: offline/network failures

**likelihood**: low
- if network is down, we can't post anyway
- existing code already handles get_thread() failures

**mitigation**:
- wrap in try/except (already doing this)
- graceful degradation (process without context)

### risk: breaking existing behavior

**likelihood**: medium
- thread discovery feature relies on storing messages
- need to ensure we don't lose context awareness

**mitigation**:
- thorough testing before/after
- evaluate thread context quality in evals

## benefits

### 1. simpler architecture
- one less database table to maintain
- no synchronization concerns
- no stale data issues

### 2. source of truth
- network data is always current
- deletions/edits reflected immediately
- no divergence between cache and reality

### 3. reduced storage
- no unbounded growth of thread_messages table
- only store what's essential (turbopuffer memories)

### 4. clearer separation of concerns
```
atproto network = thread chronology (what was said when)
turbopuffer = episodic memory (what do i remember about this person)
```

## comparison to reference projects

### void
from void_memory_system.md, void uses:
- dynamic memory blocks (persona, zeitgeist, humans, scratchpad)
- no separate thread storage table
- likely fetches context on-demand from network

### penelope (hailey's bot)
from REFERENCE_PROJECTS.md:
- custom memory system with postgresql
- stores "significant interactions"
- not clear if they cache full threads or just summaries

### marvin (slackbot)
from REFERENCE_PROJECTS.md:
- uses slack's message history API directly
- no local message storage
- demonstrates that the network-first approach works well

## migration path

### option 1: clean break (recommended)
1. deploy new code without thread_messages usage
2. keep table for 30 days (historical reference)
3. drop table after validation period

### option 2: gradual migration
1. write to both sqlite and read from network
2. compare outputs for consistency
3. stop writing to sqlite
4. eventually drop table

### option 3: hybrid approach
1. read from network by default
2. fall back to sqlite on network failures
3. eventually remove fallback

**recommendation**: option 1 (clean break)
- simpler code
- faster to implement
- network reliability is high enough

## success metrics

### before refactor
- thread_messages table exists
- messages stored on every mention
- context built from sqlite queries

### after refactor
- thread_messages table removed
- zero sqlite writes per mention
- context built from network fetches
- same quality responses in evals

## open questions

1. **should we cache at all?**
   - start without caching
   - add only if latency becomes a measurable problem

2. **what about the discovery feature?**
   - currently stores the full thread when tagged in
   - can just fetch on-demand instead
   - no need to persist

3. **do we need conversation summaries?**
   - not for thread context (fetch from network)
   - maybe for turbopuffer (semantic memory)
   - separate concern from this refactor

## conclusion

removing sqlite thread storage:
- eliminates data duplication
- simplifies architecture
- maintains all essential capabilities
- aligns with atproto's "data on the web" philosophy

turbopuffer stays because it provides semantic memory - a fundamentally different capability than chronological thread reconstruction.

the network is the source of truth. we should read from it.
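
## appendix: sketch of `extract_posts_chronological`

phase 1 calls `extract_posts_chronological` without defining it. here's a minimal sketch of what it could look like, written against the raw `getPostThread` JSON (plain dicts, the shape `sandbox/view_thread.py` returns) rather than the atproto SDK's typed objects — the helper name and dict-based approach are assumptions, not existing code. it walks the parent chain only (root → mentioned post), ignoring sibling replies:

```python
# hypothetical helper for bot/utils/thread.py, operating on the raw
# app.bsky.feed.getPostThread JSON rather than SDK objects.
def extract_posts_chronological(thread_node: dict) -> list[dict]:
    """Collect the parent chain plus the focused post, oldest first.

    thread_node is an app.bsky.feed.defs#threadViewPost dict. Deleted or
    blocked nodes have no "post" key and are skipped.
    """
    posts = []
    node = thread_node
    while node is not None:
        post = node.get("post")
        if post is not None:
            posts.append(post)
        node = node.get("parent")  # walk up toward the thread root
    posts.reverse()  # root first, mentioned post last
    return posts


def build_thread_context(thread_node: dict) -> str:
    """Dict-based variant of the phase 1 utility, for the same JSON shape."""
    return "\n".join(
        f"@{p['author']['handle']}: {p['record']['text']}"
        for p in extract_posts_chronological(thread_node)
    )
```

walking only the parent chain matches what the agent needs when replying to a mention: the conversation leading up to the post it was tagged in, not every branch of the thread.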