# hydrant

`hydrant` is an AT Protocol indexer built on the `fjall` database. it is meant to be a flexible indexer, supporting both full-network and filtered indexing (e.g. by DID), exposing records via XRPC queries, and providing an ordered event stream with cursor support. see [random.wisp.place](https://tangled.org/did:plc:dfl62fgb7wtjj3fcbb72naae/random.wisp.place) for an example of how to use hydrant.

## vs `tap`

while [`tap`](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) is designed as a firehose consumer that simply propagates events while handling sync, `hydrant` is more flexible: it lets you query the database directly for records, and it provides an ordered view of events, so you can use a cursor to fetch events from a specific point in time.

### stream behavior

the `WS /stream` (hydrant) and `WS /channel` (tap) endpoints have different designs:

| aspect | `tap` (`/channel`) | `hydrant` (`/stream`) |
| :--- | :--- | :--- |
| distribution | sharded work queue: events are load-balanced across connected clients. if 5 clients connect, each receives ~20% of events. | broadcast: every connected client receives a full copy of the event stream. if 5 clients connect, all 5 receive 100% of events. |
| cursors | server-managed: clients ACK messages. the server tracks progress and redelivers unacked messages. | client-managed: the client provides `?cursor=123` and the server streams from that point. |
| persistence | events are stored in an outbox and removed once delivered, so they cannot be replayed after being acked. | `record` events are replayable. `identity`/`account` events are ephemeral; use `GET /repos/{did}` to query identity/account info (handle, PDS, signing key, etc.). |
| backfill | backfill events are mixed into the live queue and prioritized (per-repo, acting as a synchronization barrier) by the server. | backfill simply inserts historical events (`live: false`) into the global event log.
| event types | `record`, `identity` (includes status) | `record`, `identity` (handle), `account` (status) |

for hydrant, streaming is just reading this log sequentially; synchronization is signalled the same way as in tap, via `live: true` vs `live: false`.

## configuration

`hydrant` is configured via environment variables. all variables are prefixed with `HYDRANT_`.

| variable | default | description |
| :--- | :--- | :--- |
| `DATABASE_PATH` | `./hydrant.db` | path to the database folder. |
| `LOG_LEVEL` | `info` | log level (e.g. `debug`, `info`, `warn`, `error`). |
| `RELAY_HOST` | `wss://relay.fire.hose.cam` | websocket URL of the upstream firehose relay. |
| `PLC_URL` | `https://plc.wtf` | base URL(s) of the PLC directory (comma-separated for multiple). |
| `FULL_NETWORK` | `false` | if `true`, discovers and indexes all repositories in the network. |
| `FILTER_SIGNALS` | | comma-separated list of NSID patterns to use as the signals filter on startup (e.g. `app.bsky.feed.post,app.bsky.graph.*`). |
| `FILTER_COLLECTIONS` | | comma-separated list of NSID patterns to use as the collections filter on startup. |
| `FILTER_EXCLUDES` | | comma-separated list of DIDs to exclude from indexing on startup. |
| `FIREHOSE_WORKERS` | `8` (`32` if full network) | number of concurrent workers for firehose events. |
| `BACKFILL_CONCURRENCY_LIMIT` | `128` | maximum number of concurrent backfill tasks. |
| `VERIFY_SIGNATURES` | `full` | signature verification level: `full`, `backfill-only`, or `none`. |
| `CURSOR_SAVE_INTERVAL` | `5` | interval (in seconds) at which to save the firehose cursor. |
| `REPO_FETCH_TIMEOUT` | `300` | timeout (in seconds) for fetching repositories. |
| `CACHE_SIZE` | `256` | size of the database cache in MB. |
| `IDENTITY_CACHE_SIZE` | `1000000` | number of identity entries to cache. |
| `API_PORT` | `3000` | port for the API server. |
| `ENABLE_DEBUG` | `false` | enable debug endpoints. |
| `DEBUG_PORT` | `3001` | port for debug endpoints (if enabled).
| `NO_LZ4_COMPRESSION` | `false` | disable lz4 compression for storage. |
| `ENABLE_FIREHOSE` | `true` | whether to ingest relay subscriptions. |
| `ENABLE_BACKFILL` | `true` | whether to backfill from PDS instances. |
| `ENABLE_CRAWLER` | `false` (filter mode), `true` (full mode) | whether to actively query the network for unknown repositories. |
| `CRAWLER_MAX_PENDING_REPOS` | `2000` | maximum number of pending repos for the crawler. |
| `CRAWLER_RESUME_PENDING_REPOS` | `1000` | threshold of pending repos at which the crawler resumes. |

## api

### management

- `GET /filter`: get the current filter configuration.
- `PATCH /filter`: update the filter configuration.

#### filter mode

the `mode` field controls what gets indexed:

| mode | behaviour |
| :--- | :--- |
| `filter` | auto-discovers and backfills any account whose firehose commit touches a collection matching one of the `signals` patterns. you can also explicitly track individual repositories via the `/repos` endpoint, regardless of matching signals. |
| `full` | index the entire network. `signals` are ignored for discovery, but `excludes` and `collections` still apply. |

#### fields

| field | type | description |
| :--- | :--- | :--- |
| `mode` | `"filter"` \| `"full"` | indexing mode (see above). |
| `signals` | set update | NSID patterns (e.g. `app.bsky.feed.post` or `app.bsky.*`) that trigger auto-discovery in `filter` mode. |
| `collections` | set update | NSID patterns used to filter which records are stored. if empty, all collections are stored. applies in all modes. |
| `excludes` | set update | set of DIDs to always skip, regardless of mode. checked before any other filter logic.
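as an illustration, a `PATCH /filter` body that switches to `filter` mode and updates two of the set fields might be built like this (a sketch: the DIDs and NSID patterns are made up, and the two set-update forms are described in the next section):

```python
import json

# hypothetical PATCH /filter payload; DIDs and patterns are illustrative.
payload = {
    "mode": "filter",
    # "replace" form of a set update: the array becomes the entire set
    "signals": ["app.bsky.feed.post", "app.bsky.graph.*"],
    # "patch" form: map items to true (add) or false (remove)
    "excludes": {"did:plc:abc": True, "did:web:example.org": False},
}
body = json.dumps(payload)
print(body)

# applying it, assuming the default API_PORT:
#   curl -X PATCH localhost:3000/filter \
#     -H 'Content-Type: application/json' -d "$body"
```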
#### set updates

each set field accepts one of two forms:

- **replace**: an array replaces the entire set — `["did:plc:abc", "did:web:example.org"]`
- **patch**: an object maps items to `true` (add) or `false` (remove) — `{"did:plc:abc": true, "did:web:example.org": false}`

#### NSID patterns

`signals` and `collections` support an optional `.*` suffix to match an entire namespace:

- `app.bsky.feed.post` — exact match only
- `app.bsky.feed.*` — matches any collection under `app.bsky.feed`

### repository management

- `GET /repos`: get an NDJSON stream of repositories and their sync status. supports pagination and filtering:
  - `limit`: max results (default 100, max 1000)
  - `cursor`: opaque key for pagination.
  - `partition`: `all` (default), `pending` (backfill queue), or `resync` (retries)
- `GET /repos/{did}`: get the sync status and metadata of a specific repository. also returns the handle, PDS URL, and atproto signing key (these are not available until the repo has been backfilled at least once).
- `PUT /repos`: explicitly track repositories. accepts an NDJSON body of `{"did": "..."}` lines (or a JSON array of the same).
- `DELETE /repos`: untrack repositories. accepts an NDJSON body of `{"did": "..."}` lines (or a JSON array of the same). optionally include `"deleteData": true` to also purge the repository from the database.

### data access (xrpc)

`hydrant` implements the following XRPC endpoints under `/xrpc/`:

#### com.atproto.*

the following are currently implemented:

- `com.atproto.repo.getRecord`
- `com.atproto.repo.listRecords`

#### systems.gaze.hydrant.*

these are some non-standard XRPCs that might be useful.

##### systems.gaze.hydrant.countRecords

returns the total number of stored records in a collection.

| param | required | description |
| :--- | :--- | :--- |
| `identifier` | yes | DID or handle of the repository. |
| `collection` | yes | NSID of the collection. |

returns `{ count }`.

### event stream

- `GET /stream`: subscribe to the event stream.
  - query parameters:
    - `cursor` (optional): start streaming from a specific event ID.

### stats

- `GET /stats`: get aggregate counts of repos, records, events, and errors.