a digital entity named phi that roams bsky

chore: move unproven scripts to sandbox, document graduation process

- moved view_phi_posts.py and view_thread.py to sandbox/
- removed premature justfile commands
- updated CLAUDE.md with script graduation process: sandbox -> scripts -> justfile
- scripts must prove their worth before promotion

+248 -9
+6 -2
CLAUDE.md
···
 26  26    - `templates.py` - HTML templates
 27  27
 28  28  - `tests/` - Test files
 29      -  - `scripts/` - Utility scripts (test_post.py, test_mention.py)
 30      -  - `sandbox/` - Documentation and analysis
     29  +  - `scripts/` - Curated utility scripts that have proven useful
     30  +  - `sandbox/` - Proving ground for experiments, analysis, and unproven scripts
 31  31      - Reference project analyses
 32  32      - Architecture plans
 33  33      - Implementation notes
     34  +    - Experimental scripts (graduate to scripts/ once proven useful)
 34  35  - `.eggs/` - Cloned reference projects (void, penelope, marvin)
     36  +
     37  + ## Script Graduation Process
     38  + New scripts start in `sandbox/`, get promoted to `scripts/` once proven useful, and may eventually get just commands added if the workflow should be broadcast to other developers. Not everything graduates - most things stay in sandbox.
 35  39
 36  40  ## Testing
 37  41  - Run bot: `just dev`
-7
justfile
···
 30  30
 31  31  check: lint typecheck test
 32  32
 33      - # view phi's activity
 34      - view-posts:
 35      -     uv run --with rich --with httpx python scripts/view_phi_posts.py
 36      -
 37      - view-thread URI:
 38      -     uv run --with rich --with httpx python scripts/view_thread.py {{URI}}
 39      -
 40  33  # setup reference projects
 41  34  setup:
 42  35      @mkdir -p .eggs
+236
sandbox/TESTING_STRATEGY.md
···
# testing strategy for phi

## goal
test behavior/outcomes cleanly without polluting production environments (bluesky, turbopuffer, etc.)

## principles
1. **test outcomes, not implementation** - we care that phi replies appropriately, not that it made specific HTTP calls
2. **isolated test environments** - tests should never touch production bluesky, turbopuffer, or post real content
3. **behavioral assertions** - test what phi does (reply, ignore, like) and what it says, not how it does it
4. **fixture-based mocking** - use pytest fixtures to provide test doubles that are reusable across tests

## what to test

### behavior tests (high-level)
- **mention handling**: does phi reply when mentioned? does it use thread context?
- **memory integration**: does phi retrieve and use relevant memories?
- **decision making**: does phi choose the right action (reply/ignore/like/repost)?
- **content quality**: does phi's response match its personality? (llm-as-judge)

### unit tests (low-level)
- **memory operations**: storing/retrieving memories works correctly
- **thread context**: building conversation context from thread history
- **response parsing**: structured output (Response model) is valid

## what NOT to test
- exact HTTP calls to bluesky API
- exact vector embeddings used
- implementation details of atproto client
- exact format of turbopuffer queries

## mocking strategy

### level 1: mock external services (clean boundary)
```python
@pytest.fixture
def mock_atproto_client():
    """Mock ATProto client that doesn't actually post to bluesky"""
    class MockMe:
        # minimal stand-in for the authenticated profile object
        did = "did:plc:test"
        handle = "phi.test"

    class MockPostRef:
        # minimal stand-in for the reference returned by send_post
        uri = "at://test/post/1"

    class MockClient:
        def __init__(self):
            self.posts = []  # track what would have been posted
            self.me = MockMe()

        def send_post(self, text, reply_to=None):
            self.posts.append({"text": text, "reply_to": reply_to})
            return MockPostRef()

    return MockClient()

@pytest.fixture
def mock_memory():
    """Mock memory that uses an in-memory dict instead of turbopuffer"""
    class MockMemory:
        def __init__(self):
            self.memories = {}

        async def store_user_memory(self, handle, content, memory_type):
            if handle not in self.memories:
                self.memories[handle] = []
            self.memories[handle].append(content)

        async def build_conversation_context(self, handle, include_core=False, query=None):
            # return relevant memories without hitting turbopuffer
            return "\n".join(self.memories.get(handle, []))

    return MockMemory()
```

### level 2: mock agent responses (for deterministic tests)
```python
@pytest.fixture
def mock_agent_response():
    """Return pre-determined responses instead of hitting the Claude API"""
    def _mock(mention_text: str) -> Response:
        # simple rule-based responses for testing
        if "hello" in mention_text.lower():
            return Response(action="reply", text="hi there!", reason=None)
        elif "spam" in mention_text.lower():
            return Response(action="ignore", text=None, reason="spam")
        else:
            return Response(action="reply", text="interesting point", reason=None)

    return _mock
```

### level 3: integration fixtures (compose mocks)
```python
@pytest.fixture
def test_phi_agent(mock_atproto_client, mock_memory):
    """Create a phi agent with mocked dependencies for integration tests"""
    agent = PhiAgent()
    agent.client = mock_atproto_client
    agent.memory = mock_memory
    # agent still uses real Claude for responses (can be slow but tests real behavior)
    return agent

@pytest.fixture
def fully_mocked_phi_agent(mock_atproto_client, mock_memory, mock_agent_response):
    """Create a fully mocked phi agent for fast unit tests"""
    agent = PhiAgent()
    agent.client = mock_atproto_client
    agent.memory = mock_memory
    agent._generate_response = mock_agent_response  # deterministic responses
    return agent
```

## test environments

### approach 1: environment variable switching
```python
# conftest.py
import os

@pytest.fixture(scope="session", autouse=True)
def test_environment():
    """Force test environment settings"""
    os.environ["ENVIRONMENT"] = "test"
    os.environ["TURBOPUFFER_NAMESPACE"] = "phi-test"  # separate test namespace
    # could use a different bluesky account too
    yield
    # cleanup test data after all tests
```

### approach 2: dependency injection
```python
# bot/agent.py
class PhiAgent:
    def __init__(self, client=None, memory=None, llm=None):
        self.client = client or create_production_client()
        self.memory = memory or create_production_memory()
        self.llm = llm or create_production_llm()
```

This makes testing clean:
```python
def test_mention_handling(mock_atproto_client, mock_memory):
    agent = PhiAgent(client=mock_atproto_client, memory=mock_memory)
    # test with mocked dependencies
```

## example test cases

### integration test (uses real LLM, mocked infrastructure)
```python
async def test_phi_uses_thread_context_in_response(test_phi_agent):
    """Phi should reference previous messages in the thread when replying"""

    # setup: create a thread with context
    thread_context = """
    Previous messages:
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: phi processes a new mention
    response = await test_phi_agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.test",
        thread_context=thread_context,
        thread_uri="at://test/thread/1",
    )

    # assert: phi replies and references the conversation
    assert response.action == "reply"
    assert response.text is not None
    # behavioral assertion - should show awareness of context
    assert any(word in response.text.lower() for word in ["bird", "crow", "favorite"])
```

### unit test (fully mocked, fast)
```python
async def test_phi_ignores_spam(fully_mocked_phi_agent):
    """Phi should ignore obvious spam"""

    response = await fully_mocked_phi_agent.process_mention(
        mention_text="BUY CRYPTO NOW!!! spam spam spam",
        author_handle="spammer.test",
        thread_context="No previous messages",
        thread_uri="at://test/thread/2",
    )

    assert response.action == "ignore"
    assert response.reason is not None
```

### memory test
```python
async def test_memory_stores_user_interactions(mock_memory):
    """Memory should persist user interactions"""

    await mock_memory.store_user_memory(
        "alice.test",
        "Alice mentioned she loves birds",
        MemoryType.USER_FACT,
    )

    context = await mock_memory.build_conversation_context("alice.test")

    assert "birds" in context.lower()
```

## fixture organization

```
tests/
├── conftest.py                 # shared fixtures
│   ├── settings                # test settings
│   ├── mock_atproto_client     # mock atproto client
│   ├── mock_memory             # mock turbopuffer
│   └── test_phi_agent          # composed test agent
├── unit/
│   ├── test_memory.py          # memory operations
│   └── test_response.py        # response generation
└── integration/
    ├── test_mentions.py        # full mention handling flow
    └── test_threads.py         # thread context handling
```

## key challenges

1. **mocking MCP tools** - phi uses the atproto MCP server for posting
   - solution: mock the entire MCP transport or provide fake tool implementations

2. **testing non-deterministic LLM responses** - claude's responses vary
   - solution: use llm-as-judge for behavioral assertions instead of exact text matching
   - alternative: mock agent responses for unit tests, use the real LLM for integration tests

3. **async testing** - everything is async
   - solution: use pytest-asyncio (already doing this)

4. **test data cleanup** - don't leave garbage in test environments
   - solution: use separate test namespaces, clean up in fixture teardown

## next steps

1. create mock implementations of key dependencies (client, memory)
2. add dependency injection to PhiAgent for easier testing
3. write a few example tests to validate the approach
4. decide on integration vs unit test balance
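The llm-as-judge technique above is named but not sketched. One way to keep it testable is to inject the judge callable, so unit tests use a deterministic stub while integration tests wrap a real Claude call. All names here are illustrative, not from the phi codebase:

```python
# Minimal llm-as-judge sketch. The `judge` callable is injected so fast
# tests can substitute a deterministic stub; in integration tests it
# would wrap a real model call. Names are hypothetical.

def judge_response(response_text: str, criterion: str, judge) -> bool:
    """Ask a judge model whether a reply satisfies a behavioral criterion."""
    prompt = (
        f"criterion: {criterion}\n"
        f"reply: {response_text}\n"
        "Does the reply satisfy the criterion? Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")


def stub_judge(prompt: str) -> str:
    """Deterministic stand-in for a real judge model (keyword match)."""
    return "YES" if "crow" in prompt.lower() else "NO"


# behavioral assertion without exact text matching
assert judge_response("i adore crows too", "shows awareness of the bird thread", stub_judge)
```

A test would assert on the boolean verdict rather than on phi's exact wording, which is the point of the technique.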
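The "fake tool implementations" option for the MCP challenge above could look like registering plain Python callables under the tool names phi would call, then asserting on recorded invocations instead of HTTP traffic. Tool names and return shapes here are assumptions for illustration, not the real atproto MCP API:

```python
# Sketch of fake MCP tool implementations: an in-process registry that
# records every invocation so tests can make behavioral assertions.
# Tool names and return values are hypothetical.

class FakeToolRegistry:
    """In-process stand-in for an MCP server's tool set."""

    def __init__(self):
        self.calls = []  # record every invocation for assertions
        self._tools = {
            "send_post": self._send_post,
            "like_post": self._like_post,
        }

    def call(self, name, **kwargs):
        self.calls.append((name, kwargs))
        return self._tools[name](**kwargs)

    def _send_post(self, text, reply_to=None):
        return {"uri": "at://test/post/1", "text": text}

    def _like_post(self, uri):
        return {"liked": uri}


tools = FakeToolRegistry()
tools.call("send_post", text="hi there!")
assert tools.calls[0][0] == "send_post"  # assert on behavior, not transport
```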
+6
sandbox/fetch_blog.py
···
import trafilatura

url = "https://overreacted.io/open-social/"
downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded, include_comments=False, include_tables=True)
print(text)
scripts/view_phi_posts.py → sandbox/view_phi_posts.py
scripts/view_thread.py → sandbox/view_thread.py