···
 Hydrant consists of several components:
 - **[`hydrant::ingest::firehose`]**: Connects to an upstream Firehose (Relay) and filters events. It manages the transition between discovery and synchronization.
 - **[`hydrant::ingest::worker`]**: Processes buffered Firehose messages concurrently using sharded workers. Verifies signatures, updates repository state (handling account status events like deactivations), detects gaps for backfill, and persists records.
-- **[`hydrant::crawler`]**: Periodically enumerates the network via `com.atproto.sync.listRepos` to discover new repositories when in full-network mode.
+- **[`hydrant::crawler`]**: Periodically enumerates the network via `com.atproto.sync.listRepos` to discover new repositories. In `Full` mode it is enabled by default; in `Filter` mode it is opt-in via `HYDRANT_ENABLE_CRAWLER`.
 - **[`hydrant::resolver`]**: Manages DID resolution and key lookups. Supports multiple PLC directory sources with failover and caching.
 - **[`hydrant::backfill`]**: A dedicated worker that fetches full repository CAR files. Uses LIFO prioritization and adaptive concurrency to manage backfill load efficiently.
-- **[`hydrant::api`]**: An Axum-based XRPC server implementing repository read methods (`getRecord`, `listRecords`) and system stats. It also provides a WebSocket event stream and a filter management API (`GET`/`PATCH /filter`) for configuring indexing mode, DID lists, signals, and collection patterns.
-- **Persistence worker** (in `src/main.rs`): Manages periodic background flushes of the LSM-tree and cursor state.
+- **[`hydrant::api`]**: An Axum-based XRPC server implementing repository read methods (`getRecord`, `listRecords`) and system stats. It also provides a WebSocket event stream and management APIs:
+  - `/filter` (`GET`/`PATCH`): Configure indexing mode, signals, and collection patterns.
+  - `/repos` (`GET`/`PUT`/`DELETE`): Bulk repository management using NDJSON or JSON arrays.
+- **Persistence worker** (in `src/main.rs`): Manages periodic background flushes of the LSM-tree and cursor state.

 ### Lazy event inflation
···
 - `resync`: Maps `{DID}` -> `ResyncState` (MessagePack) for retry logic/tombstones.
 - `resync_buffer`: Maps `{DID}|{Rev}` -> `Commit` (MessagePack). Used to buffer live events during backfill.
 - `counts`: Maps `k|{NAME}` or `r|{DID}|{COL}` -> `Count` (u64 BE bytes).
-- `filter`: Stores filter config: mode key `m` -> `FilterMode` (MessagePack), and set entries for DIDs (`d|{DID}`), signals (`s|{NSID}`), collections (`c|{NSID}`), and excludes (`x|{DID}`) -> empty value.
+- `filter`: Stores filter config. Handled by the `db::filter` module. Includes mode key `m` -> `FilterMode` (MessagePack), and set entries for signals (`s|{NSID}`), collections (`c|{NSID}`), and excludes (`x|{DID}`) -> empty value.
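The key layout above can be sketched as follows. This mirrors the `filter_key` helper that now lives in `db::filter`; the constants here are illustrative local copies, not imports from the crate.

```rust
// Illustrative sketch of the filter-partition key layout described above.
// The prefixes and separator mirror db::filter, but these are local copies.
const SIGNAL_PREFIX: u8 = b's';
const EXCLUDE_PREFIX: u8 = b'x';
const SEP: u8 = b'|';

fn filter_key(prefix: u8, val: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(2 + val.len());
    key.push(prefix);
    key.push(SEP);
    key.extend_from_slice(val.as_bytes());
    key
}

fn main() {
    // a signal entry for `app.bsky.feed.post` lives under `s|app.bsky.feed.post`
    assert_eq!(filter_key(SIGNAL_PREFIX, "app.bsky.feed.post"), b"s|app.bsky.feed.post");
    // an exclude entry for a DID lives under `x|{DID}`
    assert_eq!(filter_key(EXCLUDE_PREFIX, "did:plc:example"), b"x|did:plc:example");
}
```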

 ## Safe commands

README.md  (+15 -10)
···
 | `ENABLE_DEBUG` | `false` | enable debug endpoints. |
 | `DEBUG_PORT` | `3001` | port for debug endpoints (if enabled). |
 | `NO_LZ4_COMPRESSION` | `false` | disable lz4 compression for storage. |
-| `DISABLE_FIREHOSE` | `false` | disable firehose ingestion. |
-| `DISABLE_BACKFILL` | `false` | disable backfill processing. |
+| `ENABLE_FIREHOSE` | `true` | whether to ingest relay subscriptions. |
+| `ENABLE_BACKFILL` | `true` | whether to backfill from PDS instances. |
+| `ENABLE_CRAWLER` | `false` (if Filter), `true` (if Full) | whether to actively query the network for unknown repositories. |
 | `DB_WORKER_THREADS` | `4` (`8` if full network) | database worker threads. |
 | `DB_MAX_JOURNALING_SIZE_MB` | `512` (`1024` if full network) | max database journaling size in MB. |
 | `DB_PENDING_MEMTABLE_SIZE_MB` | `64` (`192` if full network) | pending memtable size in MB. |
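The mode-dependent default for `ENABLE_CRAWLER` can be sketched as below: an explicit setting always wins, otherwise the crawler runs only in full-network mode. `crawler_enabled` is an illustrative name, not hydrant's API.

```rust
// Sketch of the ENABLE_CRAWLER defaulting rule from the table above.
#[derive(Clone, Copy, PartialEq)]
enum FilterMode {
    Filter,
    Full,
}

fn crawler_enabled(enable_crawler: Option<bool>, mode: FilterMode) -> bool {
    match enable_crawler {
        Some(explicit) => explicit, // HYDRANT_ENABLE_CRAWLER was set explicitly
        None => mode == FilterMode::Full, // default: on only in Full mode
    }
}

fn main() {
    assert!(crawler_enabled(None, FilterMode::Full));
    assert!(!crawler_enabled(None, FilterMode::Filter));
    assert!(crawler_enabled(Some(true), FilterMode::Filter));
    assert!(!crawler_enabled(Some(false), FilterMode::Full));
}
```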
···

 | mode | behaviour |
 | :--- | :--- |
-| `dids` | only index repositories explicitly listed in `dids`. new accounts seen on the firehose are ignored unless they are in the list. |
-| `signal` | like `dids`, but also auto-discovers and backfills any account whose firehose commit touches a collection matching one of the `signals` patterns. |
-| `full` | index the entire network. `dids` and `signals` are ignored for discovery, but `excludes` and `collections` still apply. |
+| `filter` | auto-discovers and backfills any account whose firehose commit touches a collection matching one of the `signals` patterns. individual repositories can also be tracked explicitly via the `/repos` endpoint, regardless of signal matching. |
+| `full` | index the entire network. `signals` are ignored for discovery, but `excludes` and `collections` still apply. |

 #### fields

 | field | type | description |
 | :--- | :--- | :--- |
-| `mode` | `"dids"` \| `"signal"` \| `"full"` | indexing mode (see above). |
-| `dids` | set update | set of DIDs to explicitly track. in `dids` and `signal` modes, always processed regardless of signal matching. adding an untracked DID enqueues a backfill. |
-| `signals` | set update | NSID patterns (e.g. `app.bsky.feed.post` or `app.bsky.*`) that trigger auto-discovery in `signal` mode. |
+| `mode` | `"filter"` \| `"full"` | indexing mode (see above). |
+| `signals` | set update | NSID patterns (e.g. `app.bsky.feed.post` or `app.bsky.*`) that trigger auto-discovery in `filter` mode. |
 | `collections` | set update | NSID patterns used to filter which records are stored. if empty, all collections are stored. applies in all modes. |
 | `excludes` | set update | set of DIDs to always skip, regardless of mode. checked before any other filter logic. |
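The "set update" shape used by these fields can be sketched as follows. The enum mirrors `SetUpdate` in hydrant's filter module; the `apply` helper is illustrative, not the server's actual function.

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the two accepted shapes of a "set update" field:
// a full replacement list, or a value -> bool add/remove patch.
enum SetUpdate {
    /// replace the entire set with this list
    Set(Vec<String>),
    /// patch: true = add, false = remove
    Patch(HashMap<String, bool>),
}

fn apply(current: &mut HashSet<String>, update: SetUpdate) {
    match update {
        SetUpdate::Set(values) => *current = values.into_iter().collect(),
        SetUpdate::Patch(changes) => {
            for (value, add) in changes {
                if add {
                    current.insert(value);
                } else {
                    current.remove(&value);
                }
            }
        }
    }
}

fn main() {
    let mut signals: HashSet<String> = HashSet::new();
    // patch form: {"app.bsky.feed.post": true}
    apply(
        &mut signals,
        SetUpdate::Patch(HashMap::from([("app.bsky.feed.post".to_string(), true)])),
    );
    assert!(signals.contains("app.bsky.feed.post"));
    // set form replaces the whole set
    apply(&mut signals, SetUpdate::Set(vec!["app.bsky.graph.*".to_string()]));
    assert_eq!(signals.len(), 1);
    assert!(signals.contains("app.bsky.graph.*"));
}
```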
···
 - `app.bsky.feed.post` — exact match only
 - `app.bsky.feed.*` — matches any collection under `app.bsky.feed`
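The pattern semantics above can be sketched like this. `matches` is illustrative; the crate's actual matcher may differ on edge cases (for example whether `app.bsky.feed.*` matches `app.bsky.feed` itself).

```rust
// Sketch of NSID pattern matching: an exact NSID, or a trailing `.*`
// wildcard matching anything nested under the prefix.
fn matches(pattern: &str, collection: &str) -> bool {
    match pattern.strip_suffix(".*") {
        // wildcard: the collection must continue with `.` after the prefix
        Some(prefix) => collection
            .strip_prefix(prefix)
            .is_some_and(|rest| rest.starts_with('.')),
        // otherwise: exact match only
        None => collection == pattern,
    }
}

fn main() {
    assert!(matches("app.bsky.feed.post", "app.bsky.feed.post"));
    assert!(matches("app.bsky.feed.*", "app.bsky.feed.like"));
    assert!(!matches("app.bsky.feed.*", "app.bsky.graph.follow"));
    // `feedextra` is not nested under `feed`, so the wildcard must not match
    assert!(!matches("app.bsky.feed.*", "app.bsky.feedextra.post"));
}
```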

+### repository management
+
+- `GET /repos`: stream all repositories and their sync status as NDJSON.
+- `PUT /repos`: explicitly track repositories. accepts an NDJSON body of `{"did": "..."}` objects (or a JSON array of the same).
+- `DELETE /repos`: untrack repositories. accepts an NDJSON body of `{"did": "..."}` objects (or a JSON array of the same). optionally include `"deleteData": true` to also purge the repository from the database.
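As a sketch of the NDJSON body format accepted by `PUT /repos`: one `{"did": "..."}` object per line. The helper below is illustrative, not part of hydrant, and assumes DIDs contain no characters that need JSON escaping.

```rust
// Illustrative helper composing a PUT /repos body in NDJSON form.
fn repos_ndjson_body(dids: &[&str]) -> String {
    dids.iter()
        .map(|did| format!("{{\"did\": \"{did}\"}}"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let body = repos_ndjson_body(&["did:plc:alice", "did:plc:bob"]);
    assert_eq!(body, "{\"did\": \"did:plc:alice\"}\n{\"did\": \"did:plc:bob\"}");
    // could be sent with e.g.
    //   curl -X PUT --data-binary "$body" http://localhost:3000/repos
    println!("{body}");
}
```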
+
 ### data access (xrpc)

 `hydrant` implements the following XRPC endpoints under `/xrpc/`:

 #### `com.atproto.repo.getRecord`

-retrieve a single record by its AT-URI components.
+retrieve a single record by its AT URI components.

 | param | required | description |
 | :--- | :--- | :--- |
···
 | `collection` | yes | NSID of the collection. |
 | `rkey` | yes | record key. |

-returns the record value, its CID, and its AT-URI. responds with `RecordNotFound` if not present.
+returns the record value, its CID, and its AT URI. responds with `RecordNotFound` if not present.
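The three required params are the components of an AT URI, roughly `at://{repo}/{collection}/{rkey}`; a hypothetical helper composing one:

```rust
// Illustrative AT URI composition from getRecord's params.
fn at_uri(repo: &str, collection: &str, rkey: &str) -> String {
    format!("at://{repo}/{collection}/{rkey}")
}

fn main() {
    assert_eq!(
        at_uri("did:plc:example", "app.bsky.feed.post", "3kabc"),
        "at://did:plc:example/app.bsky.feed.post/3kabc"
    );
}
```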

 #### `com.atproto.repo.listRecords`

···
 mod debug;
 pub mod filter;
+pub mod repos;
 pub mod stats;
 mod stream;
 pub mod xrpc;
···
         .route("/stream", get(stream::handle_stream))
         .merge(xrpc::router())
         .merge(filter::router())
+        .merge(repos::router())
         .with_state(state)
         .layer(TraceLayer::new_for_http())
         .layer(CorsLayer::permissive());
···
+use serde::{Deserialize, Serialize};
+use smol_str::SmolStr;
 use std::sync::Arc;

-use arc_swap::ArcSwap;
-use fjall::Keyspace;
-use miette::{IntoDiagnostic, Result};
-use serde::{Deserialize, Serialize};
-use smol_str::SmolStr;
+pub type FilterHandle = Arc<arc_swap::ArcSwap<FilterConfig>>;
+
+pub fn new_handle(config: FilterConfig) -> FilterHandle {
+    Arc::new(arc_swap::ArcSwap::new(Arc::new(config)))
+}

-pub const MODE_KEY: &[u8] = b"m";
-pub const DID_PREFIX: u8 = b'd';
-pub const SIGNAL_PREFIX: u8 = b's';
-pub const COLLECTION_PREFIX: u8 = b'c';
-pub const EXCLUDE_PREFIX: u8 = b'x';
-pub const SEP: u8 = b'|';
+/// apply a bool patch or set replacement for a single set update.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(untagged)]
+pub enum SetUpdate {
+    /// replace the entire set with this list
+    Set(Vec<String>),
+    /// patch: true = add, false = remove
+    Patch(std::collections::HashMap<String, bool>),
+}

 #[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
 #[serde(rename_all = "snake_case")]
 pub enum FilterMode {
-    Dids = 0,
-    Signal = 1,
+    Filter = 0,
     Full = 2,
 }
···
         }
     }

-    pub fn load(ks: &Keyspace) -> Result<Self> {
-        let mode = ks
-            .get(MODE_KEY)
-            .into_diagnostic()?
-            .map(|v| rmp_serde::from_slice(&v).into_diagnostic())
-            .transpose()?
-            .unwrap_or(FilterMode::Dids);
-
-        let mut config = Self::new(mode);
-
-        let signal_prefix = [SIGNAL_PREFIX, SEP];
-        for guard in ks.prefix(signal_prefix) {
-            let (k, _) = guard.into_inner().into_diagnostic()?;
-            let val = std::str::from_utf8(&k[signal_prefix.len()..]).into_diagnostic()?;
-            config.signals.push(SmolStr::new(val));
-        }
-
-        let col_prefix = [COLLECTION_PREFIX, SEP];
-        for guard in ks.prefix(col_prefix) {
-            let (k, _) = guard.into_inner().into_diagnostic()?;
-            let val = std::str::from_utf8(&k[col_prefix.len()..]).into_diagnostic()?;
-            config.collections.push(SmolStr::new(val));
-        }
-
-        Ok(config)
-    }
-
     /// returns true if the collection matches the content filter.
     /// if collections is empty, all collections match.
     pub fn matches_collection(&self, collection: &str) -> bool {
···
         collection == pattern
     }
 }
-
-pub type FilterHandle = Arc<ArcSwap<FilterConfig>>;
-
-pub fn new_handle(config: FilterConfig) -> FilterHandle {
-    Arc::new(ArcSwap::new(Arc::new(config)))
-}
-
-/// apply a bool patch or set replacement for a single set update.
-#[derive(Debug, Deserialize)]
-#[serde(untagged)]
-pub enum SetUpdate {
-    /// replace the entire set with this list
-    Set(Vec<String>),
-    /// patch: true = add, false = remove
-    Patch(std::collections::HashMap<String, bool>),
-}
-
-pub fn filter_key(prefix: u8, val: &str) -> Vec<u8> {
-    let mut key = Vec::with_capacity(2 + val.len());
-    key.push(prefix);
-    key.push(SEP);
-    key.extend_from_slice(val.as_bytes());
-    key
-}
src/ingest/firehose.rs  (+22 -22)
···
-use crate::db::{self, Db, keys};
+use crate::db;
 use crate::filter::{FilterHandle, FilterMode};
 use crate::ingest::{BufferTx, IngestMessage};
 use crate::state::AppState;
···
     async fn should_process(&self, did: &Did<'_>) -> Result<bool> {
         let filter = self.filter.load();

-        let excl_key = crate::filter::filter_key(crate::filter::EXCLUDE_PREFIX, did.as_str());
+        let excl_key =
+            crate::db::filter::filter_key(crate::db::filter::EXCLUDE_PREFIX, did.as_str());
         if self
             .state
             .db
···

         match filter.mode {
             FilterMode::Full => Ok(true),
-            FilterMode::Dids | FilterMode::Signal => {
-                let did_key = crate::filter::filter_key(crate::filter::DID_PREFIX, did.as_str());
-                if self
-                    .state
-                    .db
-                    .filter
-                    .contains_key(&did_key)
-                    .into_diagnostic()?
-                {
-                    debug!("{did} is in DID allowlist, processing");
-                    return Ok(true);
+            FilterMode::Filter => {
+                let repo_key = crate::db::keys::repo_key(did);
+                if let Some(state_bytes) = self.state.db.repos.get(&repo_key).into_diagnostic()? {
+                    let repo_state: crate::types::RepoState =
+                        rmp_serde::from_slice(&state_bytes).into_diagnostic()?;
+
+                    if repo_state.tracked {
+                        debug!("{did} is a tracked repo, processing");
+                        return Ok(true);
+                    } else {
+                        debug!("{did} is known but explicitly untracked, skipping");
+                        return Ok(false);
+                    }
                 }
-                let known =
-                    Db::contains_key(self.state.db.repos.clone(), keys::repo_key(did)).await?;
-                if known {
-                    debug!("{did} is a known repo, processing");
+
+                if !filter.signals.is_empty() {
+                    debug!("{did} is unknown — passing to worker for signal check");
+                    Ok(true)
                 } else {
-                    debug!(
-                        "{did} is unknown — passing to worker for signal check (mode={:?})",
-                        filter.mode
-                    );
+                    debug!("{did} is unknown and no signals configured, skipping");
+                    Ok(false)
                 }
-                Ok(known || filter.mode == FilterMode::Signal)
             }
         }
     }
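The decision implemented by `should_process` above can be condensed into a dependency-free sketch, with the storage lookups replaced by plain parameters. Names here are illustrative.

```rust
// Condensed sketch of the firehose should_process decision:
// excludes first, then mode, then tracked flag / signal check.
#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Filter,
    Full,
}

fn should_process(mode: Mode, excluded: bool, known: Option<bool>, has_signals: bool) -> bool {
    if excluded {
        return false; // excludes are checked before anything else
    }
    match mode {
        Mode::Full => true,
        Mode::Filter => match known {
            // known repos follow their `tracked` flag
            Some(tracked) => tracked,
            // unknown repos only pass through for a signal check
            None => has_signals,
        },
    }
}

fn main() {
    assert!(should_process(Mode::Full, false, None, false));
    assert!(!should_process(Mode::Full, true, None, false));
    assert!(should_process(Mode::Filter, false, Some(true), false));
    assert!(!should_process(Mode::Filter, false, Some(false), true));
    assert!(should_process(Mode::Filter, false, None, true));
    assert!(!should_process(Mode::Filter, false, None, false));
}
```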
src/ingest/worker.rs  (+7 -2)
···
     match &account.status {
         Some(AccountStatus::Deleted) => {
             debug!("account {did} deleted, wiping data");
-            ops::delete_repo(ctx.batch, &ctx.state.db, did, repo_state)?;
+            crate::ops::delete_repo(ctx.batch, &ctx.state.db, did, &repo_state)?;
             return Ok(RepoProcessResult::Deleted);
         }
         status => {
···
     let Some(state_bytes) = ctx.state.db.repos.get(&repo_key).into_diagnostic()? else {
         let filter = ctx.state.filter.load();

-        if filter.mode == FilterMode::Signal {
+        if filter.mode == FilterMode::Filter && !filter.signals.is_empty() {
             let commit = match msg {
                 SubscribeReposMessage::Commit(c) => c,
                 _ => return Ok(RepoProcessResult::Syncing(None)),
···
         return Ok(RepoProcessResult::Syncing(None));
     };
     let mut repo_state = crate::db::deser_repo_state(&state_bytes)?.into_static();
+
+    if !repo_state.tracked {
+        debug!("ignoring active status for {did} as it is explicitly untracked");
+        return Ok(RepoProcessResult::Syncing(None));
+    }

     // if we are backfilling or it is new, DON'T mark it as synced yet
     // the backfill worker will do that when it finishes
src/main.rs  (+32 -20)
···
-use futures::{FutureExt, TryFutureExt, future::BoxFuture};
+use futures::{FutureExt, future::BoxFuture};
 use hydrant::config::{Config, SignatureVerification};
-use hydrant::crawler::Crawler;
 use hydrant::db::{self, set_firehose_cursor};
 use hydrant::ingest::firehose::FirehoseIngestor;
 use hydrant::state::AppState;
···
     let filter_ks = state.db.filter.clone();
     let inner = state.db.inner.clone();
     tokio::task::spawn_blocking(move || {
-        use hydrant::filter::{FilterMode, MODE_KEY};
+        use hydrant::db::filter::MODE_KEY;
+        use hydrant::filter::FilterMode;
         let mut batch = inner.batch();
         batch.insert(
             &filter_ks,
···
     let (buffer_tx, buffer_rx) = mpsc::unbounded_channel();
     let state = Arc::new(state);

-    if !cfg.disable_backfill {
+    if cfg.enable_backfill {
         tokio::spawn({
             let state = state.clone();
             let timeout = cfg.repo_fetch_timeout;
···
         }
     });

-    if let hydrant::filter::FilterMode::Full | hydrant::filter::FilterMode::Signal =
-        state.filter.load().mode
-    {
-        tokio::spawn(
-            Crawler::new(
-                state.clone(),
-                cfg.relay_host.clone(),
-                cfg.crawler_max_pending_repos,
-                cfg.crawler_resume_pending_repos,
-            )
-            .run()
-            .inspect_err(|e| {
-                error!("crawler died: {e}");
+    // an explicit HYDRANT_ENABLE_CRAWLER setting wins; otherwise the
+    // crawler runs only in full-network mode
+    let should_run_crawler = match cfg.enable_crawler {
+        Some(explicit) => explicit,
+        None => state.filter.load().mode == hydrant::filter::FilterMode::Full,
+    };
+
+    if should_run_crawler {
+        info!("starting crawler ({:?})", state.filter.load().mode);
+        let state_clone = state.clone();
+        let relay_host_clone = cfg.relay_host.clone();
+        let crawler_max_pending = cfg.crawler_max_pending_repos;
+        let crawler_resume_pending = cfg.crawler_resume_pending_repos;
+        tokio::spawn(async move {
+            // the crawler is responsible for finding new repos
+            let crawler = hydrant::crawler::Crawler::new(
+                state_clone,
+                relay_host_clone,
+                crawler_max_pending,
+                crawler_resume_pending,
+            );
+            if let Err(e) = crawler.run().await {
+                error!("crawler error: {e}");
                 db::check_poisoned_report(&e);
-            }),
-        );
+            }
+        });
+    } else {
+        info!("crawler disabled by config or filter mode");
     }

-    let mut tasks = if !cfg.disable_firehose {
+    let mut tasks = if cfg.enable_firehose {
         let firehose_worker = std::thread::spawn({
             let state = state.clone();
             let handle = tokio::runtime::Handle::current();
···
 # build the hydrant binary
 export def build-hydrant [] {
     print "building hydrant..."
-    cargo build --release --quiet
+    cargo build --release
     "target/release/hydrant"
 }

···
     let log_file = $"($db_path)/hydrant.log"
     print $"starting hydrant - logs at ($log_file)..."

-    let pid = (
-        with-env {
-            HYDRANT_DATABASE_PATH: ($db_path),
-            HYDRANT_FULL_NETWORK: "false",
-            HYDRANT_API_PORT: ($port | into string),
-            HYDRANT_ENABLE_DEBUG: "true",
-            HYDRANT_DEBUG_PORT: ($port + 1 | into string),
-            HYDRANT_LOG_LEVEL: "debug"
-        } {
-            sh -c $"($binary) >($log_file) 2>&1 & echo $!" | str trim | into int
-        }
-    )
+    # inherit any HYDRANT_* variables already set by the caller so tests
+    # can override the defaults below
+    let hydrant_vars = ($env | transpose k v | where k =~ "HYDRANT_" | reduce -f {} { |it, acc| $acc | upsert $it.k $it.v })
+    let env_vars = {
+        HYDRANT_DATABASE_PATH: ($db_path),
+        HYDRANT_FULL_NETWORK: "false",
+        HYDRANT_API_PORT: ($port | into string),
+        HYDRANT_ENABLE_DEBUG: "true",
+        HYDRANT_DEBUG_PORT: ($port + 1 | into string),
+        HYDRANT_LOG_LEVEL: "debug"
+    } | merge $hydrant_vars
+
+    let pid = (with-env $env_vars {
+        sh -c $"($binary) >($log_file) 2>&1 & echo $!" | str trim | into int
+    })

     print $"hydrant started with pid: ($pid)"
     { pid: $pid, log: $log_file }
tests/debug_endpoints.nu  (+3 -2)
···
     if (wait-for-api $url) {
         # Trigger backfill to populate some data
         print $"adding repo ($did) to tracking..."
-        http patch -t application/json $"($url)/filter" { dids: { ($did): true } }
+        http put -t application/json $"($url)/repos" [ { did: ($did) } ]

         if (wait-for-backfill $url) {
             print "backfill complete, testing debug endpoints"
···

         # 2. Test /debug/get with that key (sent as string)
         print "testing /debug/get"
-        let get_res = http get $"($debug_url)/debug/get?partition=records&key=($key_str)"
+        let encoded_key = ($key_str | url encode)
+        let get_res = http get $"($debug_url)/debug/get?partition=records&key=($encoded_key)"

         if $get_res.value != $value_cid {
             print $"FAILED: /debug/get returned different value. expected: ($value_cid), got: ($get_res.value)"
tests/repo_sync_integrity.nu  (+1 -1)
···
     if (wait-for-api $url) {
         # track the repo via API
         print $"adding repo ($did) to tracking..."
-        http patch -t application/json $"($url)/filter" { dids: { ($did): true } }
+        http put -t application/json $"($url)/repos" [ { did: ($did) } ]

         if (wait-for-backfill $url) {
             # Run both consistency checks
tests/signal_filter_test.nu  (+19 -15)
···
         exit 1
     }

-    let port = 3007
+    let port = 3011
     let url = $"http://localhost:($port)"
     let db_path = (mktemp -d -t hydrant_signal_test.XXXXXX)
-    let collection = "app.bsky.feed.post"
+
+    let random_str = (random chars -l 6)
+    let collection = $"systems.hydrant.test.($random_str)"

     print $"database path: ($db_path)"

···
     print "authenticated"

     let binary = build-hydrant
+    $env.HYDRANT_RELAY_HOST = "wss://bsky.network/"
     let instance = start-hydrant $binary $db_path $port

     mut test_passed = false

     if (wait-for-api $url) {
-        # configure signal mode: index app.bsky.feed.post from anyone on the network
-        print "configuring signal mode..."
+        # configure filter mode: index the test collection from anyone on the network
+        print "configuring filter mode..."
         http patch -t application/json $"($url)/filter" {
-            mode: "signal",
+            mode: "filter",
             signals: [$collection]
         }

···
         let filter = (http get $"($url)/filter")
         print $"filter state: ($filter | to json)"

-        if $filter.mode != "signal" {
-            print "FAILED: mode was not set to signal"
+        if $filter.mode != "filter" {
+            print "FAILED: mode was not set to filter"
         } else if not ($filter.signals | any { |s| $s == $collection }) {
             print $"FAILED: ($collection) not in signals"
         } else {
···
             # wait a moment for the firehose to connect and the filter to take effect
             sleep 3sec

-            let timestamp = (date now | format date "%Y-%m-%dT%H:%M:%SZ")
-            let record_data = {
-                "$type": $collection,
-                text: $"hydrant signal filter test ($timestamp)",
-                createdAt: $timestamp
-            }
+            let timestamp = (date now | format date "%Y-%m-%dT%H:%M:%SZ")
+            let record_data = {
+                "$type": $collection,
+                text: $"hydrant signal filter test ($timestamp) - bsky.network relay",
+                createdAt: $timestamp
+            }

             print "creating post..."
             let create_res = (http post -t application/json -H ["Authorization" $"Bearer ($jwt)"] $"($pds_url)/xrpc/com.atproto.repo.createRecord" {
                 repo: $did,
                 collection: $collection,
+                validate: false,
                 record: $record_data
             })
             let rkey = ($create_res.uri | split row "/" | last)
             print $"created: ($create_res.uri)"

-            # give hydrant time to receive and process the firehose event
-            sleep 5sec
+            # give hydrant time to receive and process the firehose event and backfill
+            sleep 10sec

             # verify the record was indexed
             print "checking indexed record..."