a digital entity named phi that roams bsky
# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care about:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

```python
async def test_thread_awareness():
    """phi should reference thread context in replies"""

    # arrange: create thread context
    thread_context = """
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: process new mention
    response = await agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.bsky.social",
        thread_context=thread_context
    )

    # assert: behavioral check
    assert response.action == "reply"
    assert any(word in response.text.lower()
               for word in ["bird", "crow", "favorite"])
```
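
the keyword check in the final assertion tends to repeat across behavioral tests, so it can be pulled into a small helper. a minimal sketch, assuming a hypothetical `mentions_any` helper (the name is ours for illustration, not something from phi's codebase):

```python
def mentions_any(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword occurs in text, ignoring case.

    Hypothetical helper for illustration only. It does plain substring
    matching, so "crow" also matches "crows" and "scarecrow".
    """
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

with this in place the assertion would read `assert mentions_any(response.text, ["bird", "crow", "favorite"])`. substring matching is deliberately loose: the test should pass for any reply that stays on topic, not for one exact wording.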

## llm-as-judge

for subjective qualities (tone, relevance, personality):

```python
async def test_personality_consistency():
    """phi should maintain grounded, honest tone"""

    response = await agent.process_mention(...)

    # use claude opus to evaluate
    evaluation = await judge_response(
        response=response.text,
        criteria=[
            "grounded (not overly philosophical)",
            "honest about capabilities",
            "concise for bluesky's 300 char limit"
        ]
    )

    assert evaluation.passes_criteria
```
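
one way a judge like this could be structured is sketched below. this is an assumption-heavy illustration, not phi's actual `judge_response`: the extra `ask_llm` parameter, the `Evaluation.failed` field, the prompt wording, and the PASS/FAIL protocol are all hypothetical. only the `response` and `criteria` arguments and the `passes_criteria` attribute come from the test above.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

# Hypothetical: any async callable that takes a prompt and returns model text.
AskLLM = Callable[[str], Awaitable[str]]


@dataclass
class Evaluation:
    passes_criteria: bool
    failed: list[str] = field(default_factory=list)  # hypothetical extra field


async def judge_response(
    response: str, criteria: list[str], ask_llm: AskLLM
) -> Evaluation:
    """Ask the judge about each criterion separately and collect failures."""
    failed = []
    for criterion in criteria:
        prompt = (
            "Answer PASS or FAIL only.\n"
            f"Criterion: {criterion}\n"
            f"Response: {response}"
        )
        verdict = await ask_llm(prompt)
        # Anything that does not clearly start with PASS counts as a failure.
        if not verdict.strip().upper().startswith("PASS"):
            failed.append(criterion)
    return Evaluation(passes_criteria=not failed, failed=failed)


async def always_pass(prompt: str) -> str:
    """Stand-in judge for demonstration; a real one would call the model."""
    return "PASS"


evaluation = asyncio.run(
    judge_response("i can't browse the web", ["honest about capabilities"], always_pass)
)
```

passing the model call in as a parameter means the judging logic itself can be unit tested with a stub, while the evals plug in the real model. judging one criterion per call also makes it easier to see which criterion failed.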

## what we test

### unit tests
- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests
- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)

**keep agent logic real** - we want to test actual decision making.
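
the first two fakes above might look roughly like this. a minimal sketch under stated assumptions: the class names and the `store` / `retrieve` / `post` methods are hypothetical and do not mirror the real TurboPuffer or ATProto client APIs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMemoryStore:
    """In-memory stand-in for the vector store: a dict keyed by namespace."""

    data: dict[str, list[str]] = field(default_factory=dict)

    def store(self, namespace: str, text: str) -> None:
        self.data.setdefault(namespace, []).append(text)

    def retrieve(self, namespace: str, query: str) -> list[str]:
        # Naive case-insensitive substring match instead of vector similarity.
        needle = query.lower()
        return [t for t in self.data.get(namespace, []) if needle in t.lower()]


@dataclass
class FakeBlueskyClient:
    """Records posts in a list instead of sending them anywhere."""

    posts: list[str] = field(default_factory=list)

    def post(self, text: str) -> None:
        self.posts.append(text)
```

a test would hand these to the agent in place of the real clients, then assert on `client.posts` or `store.data` afterwards. the agent's own decision logic runs unchanged; only the network edges are swapped out.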

## running tests

```bash
just test   # unit tests
just evals  # behavioral tests with llm-as-judge
just check  # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed
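
the namespace separation can also be enforced in code, so a misconfigured environment fails loudly instead of writing to production. a minimal sketch, assuming a hypothetical `PHI_TEST_NAMESPACE` variable and a `test-` prefix convention (neither is confirmed by phi's codebase):

```python
from typing import Mapping

TEST_PREFIX = "test-"          # hypothetical naming convention
DEFAULT_TEST_NAMESPACE = "test-phi"  # hypothetical default


def resolve_test_namespace(env: Mapping[str, str]) -> str:
    """Pick the namespace tests should use, refusing anything not test-prefixed."""
    name = env.get("PHI_TEST_NAMESPACE", DEFAULT_TEST_NAMESPACE)
    if not name.startswith(TEST_PREFIX):
        raise ValueError(
            f"refusing to run tests against non-test namespace: {name!r}"
        )
    return name
```

a test fixture could call `resolve_test_namespace(os.environ)` once at startup and pass the result to the store.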

see `sandbox/TESTING_STRATEGY.md` for the detailed approach.