# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

```python
async def test_thread_awareness():
    """phi should reference thread context in replies"""

    # arrange: create thread context
    thread_context = """
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: process new mention
    response = await agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.bsky.social",
        thread_context=thread_context
    )

    # assert: behavioral check
    assert response.action == "reply"
    assert any(word in response.text.lower()
              for word in ["bird", "crow", "favorite"])
```

## llm-as-judge

for subjective qualities (tone, relevance, personality):

```python
async def test_personality_consistency():
    """phi should maintain grounded, honest tone"""

    response = await agent.process_mention(...)

    # use claude opus to evaluate
    evaluation = await judge_response(
        response=response.text,
        criteria=[
            "grounded (not overly philosophical)",
            "honest about capabilities",
            "concise for bluesky's 300 char limit"
        ]
    )

    assert evaluation.passes_criteria
```

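a minimal sketch of what a helper like `judge_response` could look like, since the test above only shows the call site. everything here is an assumption for illustration: the `Evaluation` fields beyond `passes_criteria`, the prompt wording, the JSON reply format, and the extra `model` parameter, which is added so the sketch runs without a network call (the real helper presumably binds its own client).

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Awaitable, Callable, List

# hypothetical: any async callable that sends a prompt to a model and returns its text
JudgeModel = Callable[[str], Awaitable[str]]


@dataclass
class Evaluation:
    passes_criteria: bool
    failures: List[str]  # criteria the judge marked as not met


def build_judge_prompt(response: str, criteria: List[str]) -> str:
    """number the criteria so the judge can refer to them by index"""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "grade this reply against each criterion.\n\n"
        f"reply:\n{response}\n\n"
        f"criteria:\n{numbered}\n\n"
        'answer with JSON only, e.g. {"failed": [2]} '
        "listing the numbers of criteria that are NOT met."
    )


async def judge_response(
    response: str, criteria: List[str], model: JudgeModel
) -> Evaluation:
    raw = await model(build_judge_prompt(response, criteria))
    failed = json.loads(raw)["failed"]
    # ignore out-of-range numbers rather than crash on a sloppy judge
    failures = [criteria[i - 1] for i in failed if 1 <= i <= len(criteria)]
    return Evaluation(passes_criteria=not failures, failures=failures)


async def stub_judge(prompt: str) -> str:
    # stand-in for a real model call; always reports criterion 3 as failed
    return '{"failed": [3]}'


evaluation = asyncio.run(
    judge_response(
        response="i can't see images yet, but crows are great",
        criteria=["grounded", "honest about capabilities", "concise"],
        model=stub_judge,
    )
)
print(evaluation.failures)  # ['concise']
```

asking the judge for structured output keeps the final assertion a plain boolean check, and recording which criteria failed makes a red eval easier to read than a bare `False`.
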
## what we test

### unit tests
- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests
- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)

**keep agent logic real** - we want to test actual decision making.

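as a rough illustration of the first two bullets, the fakes might look something like this. the class names, method names (`post`, `store`, `retrieve`), and record shapes are hypothetical; they are not the atproto SDK's or TurboPuffer's real interfaces, and the substring match only stands in for whatever search the real store performs.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakePostClient:
    """records posts in memory instead of sending them to bluesky"""

    posts: List[dict] = field(default_factory=list)

    async def post(self, text: str, reply_to: Optional[str] = None) -> dict:
        record = {"text": text, "reply_to": reply_to}
        self.posts.append(record)
        return record


@dataclass
class FakeMemoryStore:
    """dict-backed store; substring match stands in for vector search"""

    data: Dict[str, List[str]] = field(default_factory=dict)

    async def store(self, namespace: str, text: str) -> None:
        self.data.setdefault(namespace, []).append(text)

    async def retrieve(self, namespace: str, query: str) -> List[str]:
        stored = self.data.get(namespace, [])
        return [t for t in stored if query.lower() in t.lower()]


async def demo() -> List[dict]:
    client, memory = FakePostClient(), FakeMemoryStore()
    await memory.store("test-alice", "alice loves crows")
    hits = await memory.retrieve("test-alice", "crows")
    await client.post(f"i remember: {hits[0]}")
    return client.posts


posts = asyncio.run(demo())
```

because the fakes only record what happened, a test can inspect `client.posts` after the agent runs and still leave the decision making itself untouched.
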
## running tests

```bash
just test        # unit tests
just evals       # behavioral tests with llm-as-judge
just check       # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed

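one way the separate-namespace rule could be enforced in code is a small guard that every test-side store call goes through. this is only a sketch: the `test-` prefix, the helper names, and the `phi-memory` base name are assumptions, not conventions taken from the repo.

```python
import uuid

TEST_PREFIX = "test-"  # hypothetical naming convention, not taken from the repo


def isolated_namespace(base: str, run_id: str = "") -> str:
    """build a per-run namespace that always carries the test prefix"""
    run_id = run_id or uuid.uuid4().hex[:8]
    return f"{TEST_PREFIX}{base}-{run_id}"


def require_test_namespace(namespace: str) -> str:
    """fail fast if a test is about to read or write a non-test namespace"""
    if not namespace.startswith(TEST_PREFIX):
        raise RuntimeError(f"refusing to use non-test namespace: {namespace!r}")
    return namespace


namespace = require_test_namespace(isolated_namespace("phi-memory", run_id="abc123"))
```
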
see `sandbox/TESTING_STRATEGY.md` for the detailed approach.