# thread storage refactor: removing data duplication

## the problem

we're duplicating thread data that already exists on the atproto network. specifically:
```sql
-- database.py - thread_messages table
CREATE TABLE IF NOT EXISTS thread_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_uri TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    author_did TEXT NOT NULL,
    message_text TEXT NOT NULL,
    post_uri TEXT NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
```
this stores messages that are already:
- living on users' personal data servers (PDSs)
- aggregated by the bluesky AppView
- accessible on-demand via `client.get_thread(uri, depth=100)`
## why this is duplicative

### the appview already does this work

when we call `get_thread()`, the appview:
- stitches together posts from multiple PDSs
- resolves parent/child relationships
- returns the complete thread structure
- handles deletions, edits, and blocks
we're then taking this data and copying it into sqlite, where it becomes:
- stale (if posts are deleted/edited)
- disconnected from the source of truth
- an unnecessary maintenance burden
### our own scripts prove this

```python
# sandbox/view_thread.py - fetches threads without local storage
import httpx

def fetch_thread(post_uri: str):
    response = httpx.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread",
        params={"uri": post_uri, "depth": 100},
    )
    return response.json()["thread"]
```
this script demonstrates that thread data is readily available from the network. we don't need to cache it in sqlite to access it.
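turning that raw JSON into context lines is equally simple. a minimal sketch, assuming the standard `app.bsky.feed.getPostThread` response shape (`post`, `parent`, and `replies` keys); `thread_to_messages` is a hypothetical helper name, not an existing function:

```python
def thread_to_messages(node: dict) -> list[str]:
    """Flatten a getPostThread JSON node into '@handle: text' lines,
    ordered from the root post down to the mentioned post."""
    # climb parent links to reach the root of the thread first
    chain, cur = [], node
    while cur:
        chain.append(cur)
        cur = cur.get("parent")
    lines = []
    for n in reversed(chain):  # root first, mentioned post last
        post = n.get("post", {})
        handle = post.get("author", {}).get("handle", "?")
        text = post.get("record", {}).get("text", "")
        lines.append(f"@{handle}: {text}")
    return lines
```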
## what we should keep: turbopuffer

crucially, turbopuffer is NOT duplicative. it serves a completely different purpose:

### turbopuffer = semantic memory (essential)
- stores embeddings for semantic search
- answers: "what did we discuss about birds last week?"
- provides episodic memory across ALL conversations
- enables pattern recognition and relationship building
- core to the IIT consciousness exploration
### sqlite thread_messages = chronological cache (redundant)
- stores literal thread messages
- answers: "what was said in this specific thread?"
- duplicates data already on network
- provides no semantic search capability
the difference:

```python
# turbopuffer usage (semantic search) - KEEP THIS
memory_context = await memory.get_user_memories(
    user_handle="alice.bsky.social",
    query="birds",  # semantic search across all conversations
)

# sqlite usage (thread retrieval) - REMOVE THIS
thread_context = thread_db.get_thread_messages(thread_uri)
# ^ this is just retrieving what we could fetch from the network
```
## proposed architecture

### current flow (with duplication)

```text
mention received
→ fetch thread from network (get_thread)
→ store all messages in sqlite
→ read back from sqlite
→ build thread context string
→ pass to agent
```
### proposed flow (network-first)

```text
mention received
→ fetch thread from network (get_thread)
→ extract messages directly
→ build thread context string
→ pass to agent
```
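the network-first flow can be sketched end-to-end. `FakeThread` and `FakeClient` below are stand-ins so the sketch runs on its own; the real project's client and handler names may differ:

```python
import asyncio

class FakeThread:
    """Stand-in for the SDK's thread view; holds (handle, text) pairs."""
    def __init__(self, messages):
        self.messages = messages

class FakeClient:
    """Stand-in for the real atproto client."""
    async def get_thread(self, uri: str, depth: int = 100) -> FakeThread:
        return FakeThread([("alice.bsky.social", "I love birds")])

async def handle_mention(client, mention_uri: str) -> str:
    # proposed flow: fetch from network, build context, zero sqlite writes
    thread = await client.get_thread(mention_uri, depth=100)
    return "\n".join(f"@{h}: {t}" for h, t in thread.messages)

context = asyncio.run(handle_mention(FakeClient(), "at://did:plc:example/app.bsky.feed.post/1"))
```

the point of the sketch: the context string is built directly from the fetch result, so there is no table to keep in sync.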
### with optional caching

```text
mention received
→ check in-memory cache (TTL: 5 minutes)
→ if miss: fetch thread from network
→ extract messages + cache
→ build thread context string
→ pass to agent
```
## implementation plan

### phase 1: extract thread parsing logic

create a utility that converts raw atproto thread data to context:

```python
# bot/utils/thread.py (already exists, extend it)
def build_thread_context(thread_node) -> str:
    """Build conversational context from an ATProto thread structure.

    Returns a formatted string like:
        @alice: I love birds
        @phi: me too! what's your favorite?
        @alice: especially crows
    """
    posts = extract_posts_chronological(thread_node)
    messages = []
    for post in posts:
        handle = post.author.handle
        text = post.record.text
        messages.append(f"@{handle}: {text}")
    return "\n".join(messages)
```
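`extract_posts_chronological` is referenced but not shown. a minimal sketch, under the assumption that thread nodes expose `.post` and `.parent` attributes the way atproto thread views do; for mention handling the ancestor chain (root to mentioned post) is usually the relevant context, so replies are ignored here:

```python
def extract_posts_chronological(thread_node) -> list:
    """Walk parent links up to the root, then return posts root-first."""
    chain, cur = [], thread_node
    while cur is not None:
        chain.append(cur.post)
        cur = getattr(cur, "parent", None)
    return list(reversed(chain))
```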
### phase 2: update message handler

```python
# bot/services/message_handler.py - BEFORE
# get thread context from the database
thread_context = thread_db.get_thread_messages(thread_uri)

# bot/services/message_handler.py - AFTER
# fetch the thread from the network
thread_data = await self.client.get_thread(thread_uri, depth=100)
thread_context = build_thread_context(thread_data.thread)
```
### phase 3: remove sqlite thread storage

delete:

- the `thread_messages` table definition
- the `add_message()` method
- the `get_thread_messages()` method
- all calls to `thread_db.add_message()`

keep:

- the `approval_requests` table (for future self-modification)
- the `database.py` module structure
### phase 4: optional caching layer

if network latency becomes an issue:

```python
from datetime import datetime, timedelta

class ThreadCache:
    """Small in-memory cache mapping thread URIs to context strings."""

    def __init__(self, ttl_seconds: int = 300):  # 5 minute TTL
        self._cache = {}
        self.ttl = timedelta(seconds=ttl_seconds)

    def get(self, thread_uri: str) -> str | None:
        if thread_uri in self._cache:
            context, timestamp = self._cache[thread_uri]
            if datetime.now() - timestamp < self.ttl:
                return context
        return None

    def set(self, thread_uri: str, context: str):
        self._cache[thread_uri] = (context, datetime.now())
```
## risk analysis

### risk: increased latency
likelihood: low
- get_thread() is fast (typically <200ms)
- we already call it for thread discovery
- public api is highly available
mitigation: add caching if needed
### risk: rate limiting
likelihood: low
- we only fetch threads when processing mentions
- mentions are relatively infrequent
- session persistence already reduces auth overhead
mitigation:
- implement exponential backoff
- cache frequently accessed threads
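the exponential backoff mitigation can be a thin wrapper around the fetch. a sketch with illustrative names (`fetch_with_backoff` is not an existing function; `fetch` is any awaitable-returning callable such as the thread fetch):

```python
import asyncio
import random

async def fetch_with_backoff(fetch, uri: str, retries: int = 4, base: float = 1.0):
    """Retry an async fetch with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return await fetch(uri)
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted retries: surface the error
            # wait 1x, 2x, 4x ... the base delay, plus random jitter
            await asyncio.sleep(base * (2 ** attempt) + random.random() * base)
```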
### risk: offline/network failures
likelihood: low
- if network is down, we can't post anyway
- existing code already handles get_thread() failures
mitigation:
- wrap in try/except (already doing this)
- graceful degradation (process without context)
### risk: breaking existing behavior
likelihood: medium
- thread discovery feature relies on storing messages
- need to ensure we don't lose context awareness
mitigation:
- thorough testing before/after
- evaluate thread context quality in evals
## benefits

### 1. simpler architecture
- one less database table to maintain
- no synchronization concerns
- no stale data issues
### 2. source of truth
- network data is always current
- deletions/edits reflected immediately
- no divergence between cache and reality
### 3. reduced storage
- no unbounded growth of thread_messages table
- only store what's essential (turbopuffer memories)
### 4. clearer separation of concerns

- atproto network = thread chronology (what was said when)
- turbopuffer = episodic memory (what do i remember about this person)
## comparison to reference projects

### void
from void_memory_system.md, void uses:
- dynamic memory blocks (persona, zeitgeist, humans, scratchpad)
- no separate thread storage table
- likely fetches context on-demand from network
### penelope (hailey's bot)
from REFERENCE_PROJECTS.md:
- custom memory system with postgresql
- stores "significant interactions"
- not clear if they cache full threads or just summaries
### marvin (slackbot)
from REFERENCE_PROJECTS.md:
- uses slack's message history API directly
- no local message storage
- demonstrates network-first approach works well
## migration path

### option 1: clean break (recommended)
- deploy new code without thread_messages usage
- keep table for 30 days (historical reference)
- drop table after validation period
### option 2: gradual migration
- write to both sqlite and read from network
- compare outputs for consistency
- stop writing to sqlite
- eventually drop table
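option 2's "compare outputs for consistency" step can be a simple divergence check during the dual-read window; the function and logger names below are hypothetical:

```python
import logging

logger = logging.getLogger("thread_migration")

def contexts_match(sqlite_context: str, network_context: str) -> bool:
    """Compare the sqlite-built context with the network-built one and
    log any divergence, so we can validate before dropping the table."""
    if sqlite_context.strip() == network_context.strip():
        return True
    logger.warning(
        "thread context divergence: sqlite=%r network=%r",
        sqlite_context, network_context,
    )
    return False
```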
### option 3: hybrid approach
- read from network by default
- fall back to sqlite on network failures
- eventually remove fallback
recommendation: option 1 (clean break)
- simpler code
- faster to implement
- network reliability is high enough
## success metrics

### before refactor
- thread_messages table exists
- messages stored on every mention
- context built from sqlite queries
### after refactor
- thread_messages table removed
- zero sqlite writes per mention
- context built from network fetches
- same quality responses in evals
## open questions

1. should we cache at all?
   - start without caching
   - add caching only if latency becomes a measurable problem
2. what about the discovery feature?
   - currently stores the full thread when tagged in
   - can just fetch on-demand instead
   - no need to persist
3. do we need conversation summaries?
   - not for thread context (fetch from network)
   - maybe for turbopuffer (semantic memory)
   - a separate concern from this refactor
## conclusion
removing sqlite thread storage:
- eliminates data duplication
- simplifies architecture
- maintains all essential capabilities
- aligns with atproto's "data on the web" philosophy
turbopuffer stays because it provides semantic memory - a fundamentally different capability than chronological thread reconstruction.
the network is the source of truth. we should read from it.