# thread storage refactor: removing data duplication

## the problem

we're duplicating thread data that already exists on the atproto network. specifically:

```sql
-- database.py - thread_messages table
CREATE TABLE IF NOT EXISTS thread_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_uri TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    author_did TEXT NOT NULL,
    message_text TEXT NOT NULL,
    post_uri TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
```

this stores messages that are already:
- living on users' personal data servers (PDSs)
- aggregated by the bluesky AppView
- accessible on-demand via `client.get_thread(uri, depth=100)`

## why this is duplicative

### the appview already does this work

when we call `get_thread()`, the appview:
1. stitches together posts from multiple PDSs
2. resolves parent/child relationships
3. returns the complete thread structure
4. handles deletions, edits, and blocks
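
the nested structure the appview returns can be sketched as follows. field names (`post`, `parent`, `record`) follow the `app.bsky.feed.getPostThread` thread view shape; the sample posts themselves are invented:

```python
# sketch: the nested shape returned by app.bsky.feed.getPostThread.
# the sample data here is invented for illustration.
sample_thread = {
    "post": {"author": {"handle": "alice.bsky.social"}, "record": {"text": "especially crows"}},
    "parent": {
        "post": {"author": {"handle": "phi.bsky.social"}, "record": {"text": "me too! what's your favorite?"}},
        "parent": {
            "post": {"author": {"handle": "alice.bsky.social"}, "record": {"text": "I love birds"}},
        },
    },
}

def walk_parents(node: dict) -> list[str]:
    """Collect post texts from the mentioned post back to the thread root."""
    texts = []
    while node is not None:
        texts.append(node["post"]["record"]["text"])
        node = node.get("parent")
    return list(reversed(texts))  # root first

print(walk_parents(sample_thread))
# ['I love birds', "me too! what's your favorite?", 'especially crows']
```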

we're then taking this data and copying it into sqlite, where it becomes:
- stale (if posts are deleted/edited)
- disconnected from the source of truth
- an unnecessary maintenance burden

### our own scripts prove this

```python
# sandbox/view_thread.py - fetches threads without local storage
import httpx

def fetch_thread(post_uri: str):
    response = httpx.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread",
        params={"uri": post_uri, "depth": 100},
    )
    response.raise_for_status()
    return response.json()["thread"]
```

this script demonstrates that thread data is readily available from the network. we don't need to cache it in sqlite to access it.

## what we should keep: turbopuffer

crucially, **turbopuffer is NOT duplicative**. it serves a completely different purpose:

### turbopuffer = semantic memory (essential)
- stores embeddings for semantic search
- answers: "what did we discuss about birds last week?"
- provides episodic memory across ALL conversations
- enables pattern recognition and relationship building
- core to the IIT consciousness exploration

### sqlite thread_messages = chronological cache (redundant)
- stores literal thread messages
- answers: "what was said in this specific thread?"
- duplicates data already on network
- provides no semantic search capability

the difference:
```python
# turbopuffer usage (semantic search) - KEEP THIS
memory_context = await memory.get_user_memories(
    user_handle="alice.bsky.social",
    query="birds",  # semantic search across all conversations
)

# sqlite usage (thread retrieval) - REMOVE THIS
thread_context = thread_db.get_thread_messages(thread_uri)
# ^ this is just retrieving what we could fetch from the network
```

## proposed architecture

### current flow (with duplication)
```
mention received
  → fetch thread from network (get_thread)
  → store all messages in sqlite
  → read back from sqlite
  → build thread context string
  → pass to agent
```

### proposed flow (network-first)
```
mention received
  → fetch thread from network (get_thread)
  → extract messages directly
  → build thread context string
  → pass to agent
```

### with optional caching
```
mention received
  → check in-memory cache (TTL: 5 minutes)
  → if miss: fetch thread from network
  → extract messages + cache
  → build thread context string
  → pass to agent
```

## implementation plan

### phase 1: extract thread parsing logic

create a utility that converts raw atproto thread data to context:

```python
# bot/utils/thread.py (already exists, extend it)
def build_thread_context(thread_node) -> str:
    """Build conversational context from ATProto thread structure.

    Returns formatted string like:
        @alice: I love birds
        @phi: me too! what's your favorite?
        @alice: especially crows
    """
    posts = extract_posts_chronological(thread_node)

    messages = []
    for post in posts:
        handle = post.author.handle
        text = post.record.text
        messages.append(f"@{handle}: {text}")

    return "\n".join(messages)
```
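
the block above assumes an `extract_posts_chronological` helper. a minimal sketch, walking only the parent chain and ignoring reply branches for simplicity:

```python
from types import SimpleNamespace

# minimal sketch of the extract_posts_chronological helper assumed above:
# walk the parent chain of a thread node and return posts oldest-first.
# reply branches are ignored here for simplicity.
def extract_posts_chronological(thread_node) -> list:
    posts = []
    node = thread_node
    while node is not None:
        post = getattr(node, "post", None)
        if post is not None:
            posts.append(post)
        node = getattr(node, "parent", None)
    posts.reverse()  # collected leaf-first, so flip to oldest-first
    return posts

# usage with stand-in objects shaped like ATProto thread nodes
root = SimpleNamespace(post=SimpleNamespace(record=SimpleNamespace(text="I love birds")), parent=None)
leaf = SimpleNamespace(post=SimpleNamespace(record=SimpleNamespace(text="especially crows")), parent=root)
texts = [p.record.text for p in extract_posts_chronological(leaf)]
assert texts == ["I love birds", "especially crows"]
```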

### phase 2: update message handler

```python
# bot/services/message_handler.py - BEFORE
# Get thread context from database
thread_context = thread_db.get_thread_messages(thread_uri)

# bot/services/message_handler.py - AFTER
# Fetch thread from network
thread_data = await self.client.get_thread(thread_uri, depth=100)
thread_context = build_thread_context(thread_data.thread)
```

### phase 3: remove sqlite thread storage

**delete:**
- `thread_messages` table definition
- `add_message()` method
- `get_thread_messages()` method
- all calls to `thread_db.add_message()`

**keep:**
- `approval_requests` table (for future self-modification)
- database.py module structure

### phase 4: optional caching layer

if network latency becomes an issue:

```python
from datetime import datetime, timedelta

class ThreadCache:
    def __init__(self, ttl_seconds: int = 300):  # 5 minute TTL
        self._cache: dict[str, tuple[str, datetime]] = {}
        self.ttl = timedelta(seconds=ttl_seconds)

    def get(self, thread_uri: str) -> str | None:
        if thread_uri in self._cache:
            context, timestamp = self._cache[thread_uri]
            if datetime.now() - timestamp < self.ttl:
                return context
            del self._cache[thread_uri]  # evict expired entries
        return None

    def set(self, thread_uri: str, context: str):
        self._cache[thread_uri] = (context, datetime.now())
```
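
wiring the cache into the handler might look like this sketch. `get_thread_context` and `fetch_fn` are hypothetical stand-ins for the real get_thread + build_thread_context pipeline, and a plain dict stands in for the cache so the snippet runs on its own:

```python
import time

# hypothetical wrapper: consult the cache first, fetch on miss.
# a plain dict of (context, stored_at) tuples stands in for ThreadCache
# so this snippet is self-contained.
def get_thread_context(thread_uri, cache, ttl_seconds, fetch_fn):
    entry = cache.get(thread_uri)
    if entry is not None:
        context, stored_at = entry
        if time.monotonic() - stored_at < ttl_seconds:
            return context  # cache hit: no network round-trip
    context = fetch_fn(thread_uri)
    cache[thread_uri] = (context, time.monotonic())
    return context

# usage: the second call within the TTL never touches the network
calls = []
def fake_fetch(uri):
    calls.append(uri)
    return "@alice: I love birds"

cache = {}
get_thread_context("at://did:plc:example/app.bsky.feed.post/1", cache, 300, fake_fetch)
context = get_thread_context("at://did:plc:example/app.bsky.feed.post/1", cache, 300, fake_fetch)
assert len(calls) == 1 and context == "@alice: I love birds"
```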

## risk analysis

### risk: increased latency

**likelihood**: low
- `get_thread()` is fast (typically <200ms)
- we already call it for thread discovery
- public api is highly available

**mitigation**: add caching if needed

### risk: rate limiting

**likelihood**: low
- we only fetch threads when processing mentions
- mentions are relatively infrequent
- session persistence already reduces auth overhead

**mitigation**:
- implement exponential backoff
- cache frequently accessed threads
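
the exponential-backoff mitigation could be sketched as follows; `fetch_fn` and the retry limits are assumptions, not existing code:

```python
import random
import time

# retry a network fetch with doubling delays plus jitter.
# fetch_fn, max_retries, and base_delay are illustrative assumptions.
def fetch_with_backoff(fetch_fn, uri, max_retries=4, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            return fetch_fn(uri)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```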

### risk: offline/network failures

**likelihood**: low
- if the network is down, we can't post anyway
- existing code already handles `get_thread()` failures

**mitigation**:
- wrap in try/except (already doing this)
- graceful degradation (process without context)
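
the graceful-degradation mitigation, sketched; `safe_thread_context` is hypothetical, and `fetch_context` stands in for the real get_thread-based pipeline:

```python
# if the network fetch fails, proceed with empty context rather than
# dropping the mention. fetch_context is a stand-in for the real
# get_thread + build_thread_context pipeline.
def safe_thread_context(fetch_context, thread_uri: str) -> str:
    try:
        return fetch_context(thread_uri)
    except Exception as exc:
        print(f"thread fetch failed for {thread_uri}: {exc}")
        return ""  # respond without thread context

# usage: a failing fetch degrades to empty context instead of crashing
def boom(uri):
    raise ConnectionError("network down")

assert safe_thread_context(boom, "at://example/1") == ""
```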

### risk: breaking existing behavior

**likelihood**: medium
- thread discovery feature relies on storing messages
- need to ensure we don't lose context awareness

**mitigation**:
- thorough testing before/after
- evaluate thread context quality in evals

## benefits

### 1. simpler architecture
- one less database table to maintain
- no synchronization concerns
- no stale data issues

### 2. source of truth
- network data is always current
- deletions/edits reflected immediately
- no divergence between cache and reality

### 3. reduced storage
- no unbounded growth of thread_messages table
- only store what's essential (turbopuffer memories)

### 4. clearer separation of concerns
```
atproto network = thread chronology (what was said when)
turbopuffer     = episodic memory (what do i remember about this person)
```

## comparison to reference projects

### void
from void_memory_system.md, void uses:
- dynamic memory blocks (persona, zeitgeist, humans, scratchpad)
- no separate thread storage table
- likely fetches context on-demand from the network

### penelope (hailey's bot)
from REFERENCE_PROJECTS.md:
- custom memory system with postgresql
- stores "significant interactions"
- not clear whether it caches full threads or just summaries

### marvin (slackbot)
from REFERENCE_PROJECTS.md:
- uses slack's message history API directly
- no local message storage
- demonstrates that a network-first approach works well

## migration path

### option 1: clean break (recommended)
1. deploy new code without thread_messages usage
2. keep table for 30 days (historical reference)
3. drop table after validation period
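
step 3 of the clean break might be as small as this sketch; the `"bot.db"` path is an assumption, not confirmed by the codebase:

```python
import sqlite3

# drop the redundant table once the validation period has passed.
# "bot.db" is an assumed database path.
def drop_thread_messages(db_path: str = "bot.db") -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("DROP TABLE IF EXISTS thread_messages")
```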

### option 2: gradual migration
1. keep writing to sqlite while reading from the network
2. compare outputs for consistency
3. stop writing to sqlite
4. eventually drop table

### option 3: hybrid approach
1. read from network by default
2. fall back to sqlite on network failures
3. eventually remove fallback

**recommendation**: option 1 (clean break)
- simpler code
- faster to implement
- network reliability is high enough

## success metrics

### before refactor
- thread_messages table exists
- messages stored on every mention
- context built from sqlite queries

### after refactor
- thread_messages table removed
- zero sqlite writes per mention
- context built from network fetches
- same quality responses in evals

## open questions

1. **should we cache at all?**
   - start without caching
   - add only if latency becomes a measurable problem

2. **what about the discovery feature?**
   - currently stores the full thread when tagged in
   - can just fetch on-demand instead
   - no need to persist

3. **do we need conversation summaries?**
   - not for thread context (fetch from network)
   - maybe for turbopuffer (semantic memory)
   - separate concern from this refactor

## conclusion

removing sqlite thread storage:
- eliminates data duplication
- simplifies architecture
- maintains all essential capabilities
- aligns with atproto's "data on the web" philosophy

turbopuffer stays because it provides semantic memory, a capability fundamentally different from chronological thread reconstruction.

the network is the source of truth. we should read from it.