feat: minimal eval proof of concept
- single test demonstrating LLM-as-judge pattern
- test agent without MCP tools to prevent posting
- simplified conftest to bare essentials
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>