a digital entity named phi that roams bsky
# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

```python
async def test_thread_awareness():
    """phi should reference thread context in replies"""

    # arrange: create thread context
    thread_context = """
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: process new mention
    response = await agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.bsky.social",
        thread_context=thread_context
    )

    # assert: behavioral check
    assert response.action == "reply"
    assert any(word in response.text.lower()
               for word in ["bird", "crow", "favorite"])
```

## llm-as-judge

for subjective qualities (tone, relevance, personality):

```python
async def test_personality_consistency():
    """phi should maintain grounded, honest tone"""

    response = await agent.process_mention(...)

    # use claude opus to evaluate
    evaluation = await judge_response(
        response=response.text,
        criteria=[
            "grounded (not overly philosophical)",
            "honest about capabilities",
            "concise for bluesky's 300 char limit"
        ]
    )

    assert evaluation.passes_criteria
```

## what we test

### unit tests
- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests
- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)

**keep agent logic real** - we want to test actual decision making.

## running tests

```bash
just test   # unit tests
just evals  # behavioral tests with llm-as-judge
just check  # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed

see `sandbox/TESTING_STRATEGY.md` for detailed approach.
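
a minimal sketch of the "in-memory dict instead of network calls" idea from the mocking strategy. the class name `InMemoryMemoryStore`, its `store`/`retrieve` methods, and the `test-alice` namespace are all hypothetical and chosen for illustration; phi's real memory interface may look different. substring matching stands in for vector search so that results stay deterministic.

```python
import asyncio


class InMemoryMemoryStore:
    """hypothetical stand-in for a TurboPuffer-backed store (not phi's real interface)"""

    def __init__(self) -> None:
        # namespace -> list of stored texts; nothing leaves the process
        self._data: dict[str, list[str]] = {}

    async def store(self, namespace: str, text: str) -> None:
        self._data.setdefault(namespace, []).append(text)

    async def retrieve(self, namespace: str, query: str, limit: int = 5) -> list[str]:
        # naive case-insensitive substring match replaces vector search,
        # which keeps test results deterministic
        hits = [
            text
            for text in self._data.get(namespace, [])
            if query.lower() in text.lower()
        ]
        return hits[:limit]


async def demo() -> list[str]:
    memory = InMemoryMemoryStore()
    # assumed convention: test namespaces carry a "test-" prefix
    await memory.store("test-alice", "alice loves crows")
    await memory.store("test-alice", "alice lives in portland")
    return await memory.retrieve("test-alice", "crows")


result = asyncio.run(demo())
```

a fake like this can be handed to the agent in place of the real client, so the agent's own decision logic still runs while memory calls stay local and namespaced.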