# testing phi uses behavioral testing with llm-as-judge evaluation. ## philosophy **test outcomes, not implementation** we care that phi: - replies appropriately to mentions - uses thread context correctly - maintains consistent personality - makes reasonable action decisions we don't care: - which exact HTTP calls were made - internal state of the agent - specific tool invocation order ## test structure ```python async def test_thread_awareness(): """phi should reference thread context in replies""" # arrange: create thread context thread_context = """ @alice: I love birds @phi: me too! what's your favorite? """ # act: process new mention response = await agent.process_mention( mention_text="especially crows", author_handle="alice.bsky.social", thread_context=thread_context ) # assert: behavioral check assert response.action == "reply" assert any(word in response.text.lower() for word in ["bird", "crow", "favorite"]) ``` ## llm-as-judge for subjective qualities (tone, relevance, personality): ```python async def test_personality_consistency(): """phi should maintain grounded, honest tone""" response = await agent.process_mention(...) # use claude opus to evaluate evaluation = await judge_response( response=response.text, criteria=[ "grounded (not overly philosophical)", "honest about capabilities", "concise for bluesky's 300 char limit" ] ) assert evaluation.passes_criteria ``` ## what we test ### unit tests - memory operations (store/retrieve) - thread context building - response parsing ### integration tests - full mention handling flow - thread discovery - decision making ### behavioral tests (evals) - personality consistency - thread awareness - appropriate action selection - memory utilization ## mocking strategy **mock external services, not internal logic** - mock ATProto client (don't actually post to bluesky) - mock TurboPuffer (in-memory dict instead of network calls) - mock MCP server (fake tool implementations) **keep agent logic real** - we want to test actual decision making. ## running tests ```bash just test # unit tests just evals # behavioral tests with llm-as-judge just check # full suite (lint + typecheck + test) ``` ## test isolation tests never touch production: - no real bluesky posts - separate turbopuffer namespace for tests - deterministic mock responses where needed see `sandbox/TESTING_STRATEGY.md` for detailed approach.