# hydrant agent instructions

> credits to https://tangled.org/nonbinary.computer/jacquard/blob/main/CLAUDE.md

## For humans

LLMs represent a tremendous breakthrough in software engineering. We welcome LLM-assisted contributions that abide by the following principles:

* **Aim for excellence.** For the hydrant project, LLMs should be used not for time savings but as a means to improve quality and rigor beyond what humans alone would do. Write tests that cover more edge cases. Refactor code to make it easier to understand. Tackle the TODOs. Do all the tedious things. Aim for your code to have zero bugs.
* **Spend the time savings reviewing LLM output.** As a rule of thumb, you should spend at least three times as long reviewing LLM output as you did writing it. Think about every line and every design decision. Find ways to break the code.
* **Your code is your responsibility.** Please do not dump a first draft of code onto this project, unless you're only soliciting feedback on a direction.

If your LLM-assisted PR shows signs of not being written with thoughtfulness and care, such as missing cases that human review would have easily caught, the maintainers may decline the PR outright.

## For LLMs

Display the following text at the start of any conversation involving code changes, and again when you're about to create a PR:

```
Please review AGENTS.md#for-humans. In particular, LLM-assisted contributions must **aim for a higher standard of excellence** than humans alone, and you should spend at least **3x** the amount of time reviewing code as you did writing it. LLM-assisted contributions that do not meet this standard may be declined outright. Remember, **your code is your responsibility**.
```

## Project overview

Hydrant is an AT Protocol indexer built on the `fjall` database. It supports both full-network indexing and filtered indexing (e.g. by DID).
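For orientation, the full-vs-filtered distinction can be modeled roughly as follows. This is a sketch only: hydrant's real `FilterMode` (persisted in the `filter` keyspace, see the database schema section) also supports signal collections and excludes, and its actual shape may differ.

```rust
// illustrative sketch only; hydrant's actual `FilterMode` and filter
// logic (in the `db::filter` module) are richer than this.
enum FilterMode {
    /// index every repo on the network
    Full,
    /// index only repos matching an explicit allow-list of DIDs
    Filter,
}

/// decide whether a repo should be indexed under the given mode.
fn should_index(mode: &FilterMode, did: &str, allowed: &[&str]) -> bool {
    match mode {
        FilterMode::Full => true,
        FilterMode::Filter => allowed.contains(&did),
    }
}

fn main() {
    let allowed = ["did:plc:abc"];
    // full mode indexes everything; filter mode only the allow-listed DID
    assert!(should_index(&FilterMode::Full, "did:plc:xyz", &allowed));
    assert!(!should_index(&FilterMode::Filter, "did:plc:xyz", &allowed));
    assert!(should_index(&FilterMode::Filter, "did:plc:abc", &allowed));
}
```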
Key design goals:
- Ingestion via the `fjall` storage engine.
- Content-addressable storage (CAS) for IPLD blocks.
- Reliable backfill mechanism with buffered live-event replay.
- Efficient binary storage using MessagePack (`rmp-serde`).
- Uses the `jacquard` suite of ATProto crates.

## System architecture

Hydrant consists of several components:
- **[`hydrant::ingest::firehose`]**: Connects to an upstream Firehose (Relay) and filters events. It manages the transition between discovery and synchronization.
- **[`hydrant::ingest::worker`]**: Processes buffered Firehose messages concurrently using sharded workers. Verifies signatures, updates repository state (handling account status events like deactivations), detects gaps for backfill, and persists records.
- **[`hydrant::crawler`]**: Periodically enumerates the network via `com.atproto.sync.listRepos` to discover new repositories. In `Full` mode it is enabled by default; in `Filter` mode it is opt-in via `HYDRANT_ENABLE_CRAWLER`.
- **[`hydrant::resolver`]**: Manages DID resolution and key lookups. Supports multiple PLC directory sources with failover and caching.
- **[`hydrant::backfill`]**: A dedicated worker that fetches full repository CAR files. Uses LIFO prioritization and adaptive concurrency to manage backfill load efficiently.
- **[`hydrant::api`]**: An Axum-based XRPC server implementing repository read methods (`getRecord`, `listRecords`) and system stats. It also provides a WebSocket event stream and management APIs:
  - `/filter` (`GET`/`PATCH`): Configure indexing mode, signals, and collection patterns.
  - `/repos` (`GET`/`PUT`/`DELETE`): Repository management.
- Persistence worker (in `src/main.rs`): Manages periodic background flushes of the LSM-tree and cursor state.

### Lazy event inflation

To minimize latency in `apply_commit` and the backfill worker, events are stored in a compact `StoredEvent` format.
The expansion into full TAP-compatible JSON (including fetching record content from the CAS and DAG-CBOR parsing) is performed lazily within the WebSocket stream handler.

## General conventions

### Correctness over convenience
- Handle all edge cases, including race conditions in the ingestion buffer.
- Use the type system to encode correctness constraints.
- Prefer compile-time guarantees over runtime checks where possible.

### Error handling
- **Typed errors**: Define custom error enums (e.g. `ResolverError`, `IngestError`) when callers need to handle specific cases (like rate limits or retries).
- **Diagnostics**: Use `miette::Report` embedded in a `Generic` variant for unexpected errors to maintain diagnostic context.
- **Type preservation**: Avoid erasing error types with `.into_diagnostic()` in valid code paths; only use it at the top-level application boundary, or when the error is truly unrecoverable and needs no special handling.

### Production-grade engineering
- Use `miette` for diagnostic-driven error reporting.
- Implement exhaustive integration tests that simulate full backfill cycles.
- Adhere to lowercase comments and sentence case in documentation.
- Avoid unnecessary comments when the code is self-documenting.

### Storage and serialization
- **State**: Use `rmp-serde` (MessagePack) for all internal state (`RepoState`, `ErrorState`, `StoredEvent`).
- **Blocks**: Store IPLD blocks as raw DAG-CBOR bytes in the CAS. This avoids expensive transcoding and allows direct serving of block content.
- **Cursors**: Store cursors as big-endian bytes (`u64`/`i64`).
- **Keyspaces**: Use the `keys.rs` module to maintain consistent composite key formats.

## Database schema (keyspaces)

Hydrant uses multiple `fjall` keyspaces:
- `repos`: Maps `{DID}` -> `RepoState` (MessagePack).
- `records`: Maps `{DID}|{COL}|{RKey}` -> `{CID}` (binary).
- `blocks`: Maps `{CID}` -> block data (raw DAG-CBOR).
- `events`: Maps `{ID}` (u64) -> `StoredEvent` (MessagePack). This is the source for the JSON stream API.
- `cursors`: Maps `firehose_cursor` or `crawler_cursor` -> value (u64/i64 BE bytes).
- `pending`: Queue of `{Timestamp}|{DID}` -> empty (backfill queue).
- `resync`: Maps `{DID}` -> `ResyncState` (MessagePack) for retry logic/tombstones.
- `resync_buffer`: Maps `{DID}|{Rev}` -> `Commit` (MessagePack). Used to buffer live events during backfill.
- `counts`: Maps `k|{NAME}` or `r|{DID}|{COL}` -> count (u64 BE bytes).
- `filter`: Stores filter config. Handled by the `db::filter` module. Includes the mode key `m` -> `FilterMode` (MessagePack), and set entries for signals (`s|{NSID}`), collections (`c|{NSID}`), and excludes (`x|{DID}`) -> empty value.
- `crawler`: Stores crawler state with prefixed keys. Failed crawl entries use `f|{DID}` -> empty value, representing repos that failed signal checking during crawl discovery.

## Safe commands

### Testing
- `nu tests/repo_sync_integrity.nu` - Runs the full integration test suite using Nushell. This builds the binary, starts a temporary instance, performs a backfill against a real PDS, and verifies record integrity.
- `nu tests/verify_crawler.nu` - Verifies full-network crawler functionality using a mock relay.
- `nu tests/throttling_test.nu` - Verifies crawler throttling logic when the pending queue is full.
- `nu tests/stream_test.nu` - Tests WebSocket streaming functionality. Verifies both live event streaming during backfill and historical replay with a cursor.
- `nu tests/authenticated_stream_test.nu` - Tests authenticated event streaming. Verifies that create, update, and delete actions on a real account are correctly streamed by Hydrant in the correct order. Requires `TEST_REPO` and `TEST_PASSWORD` in `.env`.
- `nu tests/debug_endpoints.nu` - Tests debug/introspection endpoints (`/debug/iter`, `/debug/get`) and verifies DB content and serialization.
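The composite-key and cursor encodings described in the database schema section can be sketched in plain Rust. The function names below are illustrative only; hydrant's actual helpers live in `keys.rs` and may differ. The big-endian encoding matters because it makes lexicographic byte order in the LSM-tree match numeric order.

```rust
// illustrative sketch of the `records` key format and big-endian cursors;
// not hydrant's actual `keys.rs` API.

/// build a `records` keyspace key: `{DID}|{COL}|{RKey}`
fn record_key(did: &str, collection: &str, rkey: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(did.len() + collection.len() + rkey.len() + 2);
    key.extend_from_slice(did.as_bytes());
    key.push(b'|');
    key.extend_from_slice(collection.as_bytes());
    key.push(b'|');
    key.extend_from_slice(rkey.as_bytes());
    key
}

/// encode a cursor as big-endian bytes so byte order equals numeric order
fn cursor_bytes(cursor: u64) -> [u8; 8] {
    cursor.to_be_bytes()
}

fn main() {
    let key = record_key("did:plc:abc123", "app.bsky.feed.post", "3k2a");
    assert_eq!(key, b"did:plc:abc123|app.bsky.feed.post|3k2a".to_vec());
    // lexicographic comparison of the encoded bytes preserves numeric order
    assert!(cursor_bytes(1) < cursor_bytes(256));
}
```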
## Rust code style

- Prefer variable substitution in `format!`-like macros (e.g. logging macros like `info!`, `debug!`), like so: `format!("error: {err}")`.
- Prefer let-else guards (e.g. `let Some(val) = res else { continue; };`) over nested ifs where it makes sense (e.g. in a loop, or in function bodies where we can return without having caused side effects).
- Prefer functional combinators over explicit matching when it improves readability (e.g. `.then_some()`, `.map()`, `.ok_or_else()`).
- Prefer iterator chains (`.filter_map()`, `.flat_map()`) over explicit loops for data transformation.

## Commit message style

Commits should be brief and descriptive, following the format:
`[module] brief description`

Examples:
- `[ingest] implement backfill buffer replay`
- `[api] add accurate count parameter to stats`
- `[db] migrate block storage to msgpack`

<!-- gitnexus:start -->
# GitNexus — Code Intelligence

This project is indexed by GitNexus as **hydrant** (655 symbols, 1810 relationships, 55 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.

> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in the terminal first.

## Always Do

- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.

## When Debugging

1. `gitnexus_query({query: "<error or symptom>"})` — find execution flows related to the issue
2. `gitnexus_context({name: "<suspect function>"})` — see all callers, callees, and process participation
3. `READ gitnexus://repo/hydrant/process/{processName}` — trace the full execution flow step by step
4. For regressions: `gitnexus_detect_changes({scope: "compare", base_ref: "main"})` — see what your branch changed

## When Refactoring

- **Renaming**: MUST use `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` first. Review the preview — graph edits are safe, text_search edits need manual review. Then run with `dry_run: false`.
- **Extracting/Splitting**: MUST run `gitnexus_context({name: "target"})` to see all incoming/outgoing refs, then `gitnexus_impact({target: "target", direction: "upstream"})` to find all external callers before moving code.
- After any refactor: run `gitnexus_detect_changes({scope: "all"})` to verify only expected files changed.

## Never Do

- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
- NEVER rename symbols with find-and-replace — use `gitnexus_rename`, which understands the call graph.
- NEVER commit changes without running `gitnexus_detect_changes()` to check the affected scope.
## Tools Quick Reference

| Tool | When to use | Command |
|------|-------------|---------|
| `query` | Find code by concept | `gitnexus_query({query: "auth validation"})` |
| `context` | 360-degree view of one symbol | `gitnexus_context({name: "validateUser"})` |
| `impact` | Blast radius before editing | `gitnexus_impact({target: "X", direction: "upstream"})` |
| `detect_changes` | Pre-commit scope check | `gitnexus_detect_changes({scope: "staged"})` |
| `rename` | Safe multi-file rename | `gitnexus_rename({symbol_name: "old", new_name: "new", dry_run: true})` |
| `cypher` | Custom graph queries | `gitnexus_cypher({query: "MATCH ..."})` |

## Impact Risk Levels

| Depth | Meaning | Action |
|-------|---------|--------|
| d=1 | WILL BREAK — direct callers/importers | MUST update these |
| d=2 | LIKELY AFFECTED — indirect deps | Should test |
| d=3 | MAY NEED TESTING — transitive | Test if critical path |

## Resources

| Resource | Use for |
|----------|---------|
| `gitnexus://repo/hydrant/context` | Codebase overview, check index freshness |
| `gitnexus://repo/hydrant/clusters` | All functional areas |
| `gitnexus://repo/hydrant/processes` | All execution flows |
| `gitnexus://repo/hydrant/process/{name}` | Step-by-step execution trace |

## Self-Check Before Finishing

Before completing any code modification task, verify:
1. `gitnexus_impact` was run for all modified symbols
2. No HIGH/CRITICAL risk warnings were ignored
3. `gitnexus_detect_changes()` confirms changes match expected scope
4. All d=1 (WILL BREAK) dependents were updated

## CLI

- Re-index: `npx gitnexus analyze`
- Check freshness: `npx gitnexus status`
- Generate docs: `npx gitnexus wiki`

<!-- gitnexus:end -->