a digital entity named phi that roams bsky
# thread storage refactor: removing data duplication

## the problem

we're duplicating thread data that already exists on the atproto network. specifically:

```sql
-- database.py - thread_messages table
CREATE TABLE IF NOT EXISTS thread_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_uri TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    author_did TEXT NOT NULL,
    message_text TEXT NOT NULL,
    post_uri TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
```

this stores messages that are already:
- living on users' personal data servers (PDSs)
- aggregated by the bluesky AppView
- accessible on-demand via `client.get_thread(uri, depth=100)`

## why this is duplicative

### the appview already does this work

when we call `get_thread()`, the appview:
1. stitches together posts from multiple PDSs
2. resolves parent/child relationships
3. returns the complete thread structure
4. handles deletions, edits, and blocks

we're then taking this data and copying it into sqlite, where it becomes:
- stale (if posts are deleted/edited)
- disconnected from the source of truth
- an unnecessary maintenance burden

### our own scripts prove this

```python
# sandbox/view_thread.py - fetches threads without local storage
import httpx

def fetch_thread(post_uri: str):
    response = httpx.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread",
        params={"uri": post_uri, "depth": 100}
    )
    return response.json()["thread"]
```

this script demonstrates that thread data is readily available from the network. we don't need to cache it in sqlite to access it.

## what we should keep: turbopuffer

crucially, **turbopuffer is NOT duplicative**. it serves a completely different purpose:

### turbopuffer = semantic memory (essential)
- stores embeddings for semantic search
- answers: "what did we discuss about birds last week?"
- provides episodic memory across ALL conversations
- enables pattern recognition and relationship building
- core to the IIT consciousness exploration

### sqlite thread_messages = chronological cache (redundant)
- stores literal thread messages
- answers: "what was said in this specific thread?"
- duplicates data already on network
- provides no semantic search capability

the difference:
```python
# turbopuffer usage (semantic search) - KEEP THIS
memory_context = await memory.get_user_memories(
    user_handle="alice.bsky.social",
    query="birds"  # semantic search across all conversations
)

# sqlite usage (thread retrieval) - REMOVE THIS
thread_context = thread_db.get_thread_messages(thread_uri)
# ^ this is just retrieving what we could fetch from network
```

## proposed architecture

### current flow (with duplication)
```
mention received
  → fetch thread from network (get_thread)
  → store all messages in sqlite
  → read back from sqlite
  → build thread context string
  → pass to agent
```

### proposed flow (network-first)
```
mention received
  → fetch thread from network (get_thread)
  → extract messages directly
  → build thread context string
  → pass to agent
```

### with optional caching
```
mention received
  → check in-memory cache (TTL: 5 minutes)
  → if miss: fetch thread from network
  → extract messages + cache
  → build thread context string
  → pass to agent
```

## implementation plan

### phase 1: extract thread parsing logic

create a utility that converts raw atproto thread data to context:

```python
# bot/utils/thread.py (already exists, extend it)
def build_thread_context(thread_node) -> str:
    """Build conversational context from ATProto thread structure.

    Returns formatted string like:
        @alice: I love birds
        @phi: me too! what's your favorite?
        @alice: especially crows
    """
    posts = extract_posts_chronological(thread_node)

    messages = []
    for post in posts:
        handle = post.author.handle
        text = post.record.text
        messages.append(f"@{handle}: {text}")

    return "\n".join(messages)
```

### phase 2: update message handler

```python
# bot/services/message_handler.py - BEFORE
# Get thread context from database
thread_context = thread_db.get_thread_messages(thread_uri)

# bot/services/message_handler.py - AFTER
# Fetch thread from network
thread_data = await self.client.get_thread(thread_uri, depth=100)
thread_context = build_thread_context(thread_data.thread)
```

### phase 3: remove sqlite thread storage

**delete:**
- `thread_messages` table definition
- `add_message()` method
- `get_thread_messages()` method
- all calls to `thread_db.add_message()`

**keep:**
- `approval_requests` table (for future self-modification)
- database.py module structure

### phase 4: optional caching layer

if network latency becomes an issue:

```python
from datetime import datetime, timedelta

class ThreadCache:
    def __init__(self, ttl_seconds: int = 300):  # 5-minute TTL
        self._cache = {}
        self.ttl = timedelta(seconds=ttl_seconds)

    def get(self, thread_uri: str) -> str | None:
        if thread_uri in self._cache:
            context, timestamp = self._cache[thread_uri]
            if datetime.now() - timestamp < self.ttl:
                return context
        return None

    def set(self, thread_uri: str, context: str):
        self._cache[thread_uri] = (context, datetime.now())
```

## risk analysis

### risk: increased latency

**likelihood**: low
- get_thread() is fast (typically <200ms)
- we already call it for thread discovery
- public api is highly available

**mitigation**: add caching if needed

### risk: rate limiting
**likelihood**: low
- we only fetch threads when processing mentions
- mentions are relatively infrequent
- session persistence already reduces auth overhead

**mitigation**:
- implement exponential backoff
- cache frequently accessed threads

### risk: offline/network failures

**likelihood**: low
- if network is down, we can't post anyway
- existing code already handles get_thread() failures

**mitigation**:
- wrap in try/except (already doing this)
- graceful degradation (process without context)

### risk: breaking existing behavior

**likelihood**: medium
- thread discovery feature relies on storing messages
- need to ensure we don't lose context awareness

**mitigation**:
- thorough testing before/after
- evaluate thread context quality in evals

## benefits

### 1. simpler architecture
- one less database table to maintain
- no synchronization concerns
- no stale data issues

### 2. source of truth
- network data is always current
- deletions/edits reflected immediately
- no divergence between cache and reality

### 3. reduced storage
- no unbounded growth of thread_messages table
- only store what's essential (turbopuffer memories)

### 4. clearer separation of concerns
```
atproto network = thread chronology (what was said when)
turbopuffer = episodic memory (what do i remember about this person)
```

## comparison to reference projects

### void
from void_memory_system.md, void uses:
- dynamic memory blocks (persona, zeitgeist, humans, scratchpad)
- no separate thread storage table
- likely fetches context on-demand from network

### penelope (hailey's bot)
from REFERENCE_PROJECTS.md:
- custom memory system with postgresql
- stores "significant interactions"
- not clear if they cache full threads or just summaries

### marvin (slackbot)
from REFERENCE_PROJECTS.md:
- uses slack's message history API directly
- no local message storage
- demonstrates that the network-first approach works well

## migration path

### option 1: clean break (recommended)
1. deploy new code without thread_messages usage
2. keep table for 30 days (historical reference)
3. drop table after validation period

### option 2: gradual migration
1. write to both sqlite and read from network
2. compare outputs for consistency
3. stop writing to sqlite
4. eventually drop table

### option 3: hybrid approach
1. read from network by default
2. fall back to sqlite on network failures
3. eventually remove fallback

**recommendation**: option 1 (clean break)
- simpler code
- faster to implement
- network reliability is high enough

## success metrics

### before refactor
- thread_messages table exists
- messages stored on every mention
- context built from sqlite queries

### after refactor
- thread_messages table removed
- zero sqlite writes per mention
- context built from network fetches
- same quality responses in evals

## open questions

1. **should we cache at all?**
   - start without caching
   - add only if latency becomes a measurable problem

2. **what about the discovery feature?**
   - currently stores the full thread when tagged in
   - can just fetch on-demand instead
   - no need to persist

3. **do we need conversation summaries?**
   - not for thread context (fetch from network)
   - maybe for turbopuffer (semantic memory)
   - separate concern from this refactor

## conclusion

removing sqlite thread storage:
- eliminates data duplication
- simplifies architecture
- maintains all essential capabilities
- aligns with atproto's "data on the web" philosophy

turbopuffer stays because it provides semantic memory - a fundamentally different capability than chronological thread reconstruction.

the network is the source of truth. we should read from it.
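
## appendix: sketch of `extract_posts_chronological`

phase 1 calls `extract_posts_chronological` without defining it. here's a minimal sketch of what it could look like, written against the raw `getPostThread` JSON (plain dicts, the shape `sandbox/view_thread.py` returns) rather than the atproto SDK's typed objects — the helper name and dict-based approach are assumptions, not existing code. it walks the parent chain only (root → mentioned post), ignoring sibling replies:

```python
# hypothetical helper for bot/utils/thread.py, operating on the raw
# app.bsky.feed.getPostThread JSON rather than SDK objects.
def extract_posts_chronological(thread_node: dict) -> list[dict]:
    """Collect the parent chain plus the focused post, oldest first.

    thread_node is an app.bsky.feed.defs#threadViewPost dict. Deleted or
    blocked nodes have no "post" key and are skipped.
    """
    posts = []
    node = thread_node
    while node is not None:
        post = node.get("post")
        if post is not None:
            posts.append(post)
        node = node.get("parent")  # walk up toward the thread root
    posts.reverse()  # root first, mentioned post last
    return posts


def build_thread_context(thread_node: dict) -> str:
    """Dict-based variant of the phase 1 utility, for the same JSON shape."""
    return "\n".join(
        f"@{p['author']['handle']}: {p['record']['text']}"
        for p in extract_posts_chronological(thread_node)
    )
```

walking only the parent chain matches what the agent needs when replying to a mention: the conversation leading up to the post it was tagged in, not every branch of the thread.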