a digital entity named phi that roams bsky

docs: add thread storage refactor analysis

analyzes data duplication between sqlite thread_messages table and atproto network. proposes removing sqlite storage in favor of on-demand network fetching with optional caching.

key insight: turbopuffer is NOT duplicative (semantic memory) but thread_messages IS (chronological cache of network data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

+337
sandbox/THREAD_STORAGE_REFACTOR.md
# thread storage refactor: removing data duplication

## the problem

we're duplicating thread data that already exists on the atproto network. specifically:

```sql
-- database.py - thread_messages table
CREATE TABLE IF NOT EXISTS thread_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_uri TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    author_did TEXT NOT NULL,
    message_text TEXT NOT NULL,
    post_uri TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
```

this stores messages that are already:
- living on users' personal data servers (PDSs)
- aggregated by the bluesky AppView
- accessible on-demand via `client.get_thread(uri, depth=100)`

## why this is duplicative

### the appview already does this work

when we call `get_thread()`, the appview:
1. stitches together posts from multiple PDSs
2. resolves parent/child relationships
3. returns the complete thread structure
4. handles deletions, edits, and blocks

we're then taking this data and copying it into sqlite, where it becomes:
- stale (if posts are deleted/edited)
- disconnected from the source of truth
- an unnecessary maintenance burden

### our own scripts prove this

```python
# sandbox/view_thread.py - fetches threads without local storage
import httpx

def fetch_thread(post_uri: str):
    response = httpx.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread",
        params={"uri": post_uri, "depth": 100},
    )
    return response.json()["thread"]
```

this script demonstrates that thread data is readily available from the network. we don't need to cache it in sqlite to access it.
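the `thread` node returned above is nested (a parent chain plus reply subtrees), not a flat list. a minimal sketch of flattening it, assuming the dict keys (`parent`, `post`, `replies`) of the getPostThread response shape:

```python
def flatten_thread(node: dict) -> list[dict]:
    """Walk a getPostThread node: ancestors first, then the post, then replies."""
    posts: list[dict] = []
    if node.get("parent"):
        posts.extend(flatten_thread(node["parent"]))  # parent chain first
    if node.get("post"):
        posts.append(node["post"])
    for reply in node.get("replies") or []:  # replies depth-first
        posts.extend(flatten_thread(reply))
    return posts
```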
## what we should keep: turbopuffer

crucially, **turbopuffer is NOT duplicative**. it serves a completely different purpose:

### turbopuffer = semantic memory (essential)
- stores embeddings for semantic search
- answers: "what did we discuss about birds last week?"
- provides episodic memory across ALL conversations
- enables pattern recognition and relationship building
- core to the IIT consciousness exploration

### sqlite thread_messages = chronological cache (redundant)
- stores literal thread messages
- answers: "what was said in this specific thread?"
- duplicates data already on the network
- provides no semantic search capability

the difference:

```python
# turbopuffer usage (semantic search) - KEEP THIS
memory_context = await memory.get_user_memories(
    user_handle="alice.bsky.social",
    query="birds",  # semantic search across all conversations
)

# sqlite usage (thread retrieval) - REMOVE THIS
thread_context = thread_db.get_thread_messages(thread_uri)
# ^ this is just retrieving what we could fetch from the network
```

## proposed architecture

### current flow (with duplication)
```
mention received
→ fetch thread from network (get_thread)
→ store all messages in sqlite
→ read back from sqlite
→ build thread context string
→ pass to agent
```

### proposed flow (network-first)
```
mention received
→ fetch thread from network (get_thread)
→ extract messages directly
→ build thread context string
→ pass to agent
```

### with optional caching
```
mention received
→ check in-memory cache (TTL: 5 minutes)
→ if miss: fetch thread from network
→ extract messages + cache
→ build thread context string
→ pass to agent
```
## implementation plan

### phase 1: extract thread parsing logic

create a utility that converts raw atproto thread data to context:

```python
# bot/utils/thread.py (already exists, extend it)
def build_thread_context(thread_node) -> str:
    """Build conversational context from ATProto thread structure.

    Returns formatted string like:
        @alice: I love birds
        @phi: me too! what's your favorite?
        @alice: especially crows
    """
    posts = extract_posts_chronological(thread_node)

    messages = []
    for post in posts:
        handle = post.author.handle
        text = post.record.text
        messages.append(f"@{handle}: {text}")

    return "\n".join(messages)
```

### phase 2: update message handler

```python
# bot/services/message_handler.py - BEFORE
# Get thread context from database
thread_context = thread_db.get_thread_messages(thread_uri)

# bot/services/message_handler.py - AFTER
# Fetch thread from network
thread_data = await self.client.get_thread(thread_uri, depth=100)
thread_context = build_thread_context(thread_data.thread)
```

### phase 3: remove sqlite thread storage

**delete:**
- `thread_messages` table definition
- `add_message()` method
- `get_thread_messages()` method
- all calls to `thread_db.add_message()`

**keep:**
- `approval_requests` table (for future self-modification)
- database.py module structure

### phase 4: optional caching layer

if network latency becomes an issue:

```python
from datetime import datetime, timedelta

class ThreadCache:
    def __init__(self, ttl_seconds: int = 300):  # 5 minute TTL
        self._cache = {}
        self.ttl = timedelta(seconds=ttl_seconds)

    def get(self, thread_uri: str) -> str | None:
        if thread_uri in self._cache:
            context, timestamp = self._cache[thread_uri]
            if datetime.now() - timestamp < self.ttl:
                return context
        return None

    def set(self, thread_uri: str, context: str):
        self._cache[thread_uri] = (context, datetime.now())
```
## risk analysis

### risk: increased latency

**likelihood**: low
- get_thread() is fast (typically <200ms)
- we already call it for thread discovery
- public api is highly available

**mitigation**: add caching if needed

### risk: rate limiting

**likelihood**: low
- we only fetch threads when processing mentions
- mentions are relatively infrequent
- session persistence already reduces auth overhead

**mitigation**:
- implement exponential backoff
- cache frequently accessed threads

### risk: offline/network failures

**likelihood**: low
- if the network is down, we can't post anyway
- existing code already handles get_thread() failures

**mitigation**:
- wrap in try/except (already doing this)
- graceful degradation (process without context)

### risk: breaking existing behavior

**likelihood**: medium
- thread discovery feature relies on storing messages
- need to ensure we don't lose context awareness

**mitigation**:
- thorough testing before/after
- evaluate thread context quality in evals
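the exponential-backoff mitigation could look like this minimal sketch (attempt count and delays are assumptions, and `fetch` stands in for whatever performs the get_thread call):

```python
import asyncio

async def fetch_with_backoff(fetch, uri: str, attempts: int = 4, base_delay: float = 0.5):
    """Retry a thread fetch with exponentially growing delays between attempts."""
    for attempt in range(attempts):
        try:
            return await fetch(uri)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```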
## benefits

### 1. simpler architecture
- one less database table to maintain
- no synchronization concerns
- no stale data issues

### 2. source of truth
- network data is always current
- deletions/edits reflected immediately
- no divergence between cache and reality

### 3. reduced storage
- no unbounded growth of thread_messages table
- only store what's essential (turbopuffer memories)

### 4. clearer separation of concerns
```
atproto network = thread chronology (what was said when)
turbopuffer = episodic memory (what do i remember about this person)
```

## comparison to reference projects

### void
from void_memory_system.md, void uses:
- dynamic memory blocks (persona, zeitgeist, humans, scratchpad)
- no separate thread storage table
- likely fetches context on-demand from the network

### penelope (hailey's bot)
from REFERENCE_PROJECTS.md:
- custom memory system with postgresql
- stores "significant interactions"
- not clear if they cache full threads or just summaries

### marvin (slackbot)
from REFERENCE_PROJECTS.md:
- uses slack's message history API directly
- no local message storage
- demonstrates network-first approach works well

## migration path

### option 1: clean break (recommended)
1. deploy new code without thread_messages usage
2. keep table for 30 days (historical reference)
3. drop table after validation period

### option 2: gradual migration
1. write to both sqlite and read from network
2. compare outputs for consistency
3. stop writing to sqlite
4. eventually drop table

### option 3: hybrid approach
1. read from network by default
2. fall back to sqlite on network failures
3. eventually remove fallback
**recommendation**: option 1 (clean break)
- simpler code
- faster to implement
- network reliability is high enough

## success metrics

### before refactor
- thread_messages table exists
- messages stored on every mention
- context built from sqlite queries

### after refactor
- thread_messages table removed
- zero sqlite writes per mention
- context built from network fetches
- same quality responses in evals

## open questions

1. **should we cache at all?**
   - start without caching
   - add only if latency becomes a measurable problem

2. **what about the discovery feature?**
   - currently stores full thread when tagged in
   - can just fetch on-demand instead
   - no need to persist

3. **do we need conversation summaries?**
   - not for thread context (fetch from network)
   - maybe for turbopuffer (semantic memory)
   - separate concern from this refactor

## conclusion

removing sqlite thread storage:
- eliminates data duplication
- simplifies architecture
- maintains all essential capabilities
- aligns with atproto's "data on the web" philosophy

turbopuffer stays because it provides semantic memory - a fundamentally different capability than chronological thread reconstruction.

the network is the source of truth. we should read from it.