A Rust implementation of skywatch-phash
# Processor Module

## Purpose

This module is the core logic engine of the service. It is responsible for downloading images, computing their perceptual hashes (phashes), and matching those hashes against a list of known problematic images.

## Key Components

### `phash.rs`

This file contains the fundamental perceptual hashing logic.

- **`compute_phash(image_bytes)`**: Takes raw image bytes, decodes them, and computes a 64-bit "average hash" (aHash) using the `image_hasher` crate. It returns the hash as a 16-character hex string.
- **`hamming_distance(hash1, hash2)`**: Counts the differing bits between two hashes. A distance of `0` means the hashes are identical, while a small distance (e.g., 1-5) indicates that the images are visually similar.
- **`PhashError`**: An error enum built with `thiserror` that covers failures such as image decoding errors and invalid hash formats.

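The relationship between the 64-bit hash, its 16-character hex form, and the bit-level distance can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `hash_to_hex`, `parse_hash`, and `HashParseError` are hypothetical names, and the real `PhashError` (built with `thiserror`) may be shaped differently.

```rust
/// Hypothetical error type for this sketch; the real module uses `PhashError`.
#[derive(Debug, PartialEq)]
pub enum HashParseError {
    InvalidLength(usize),
    InvalidHex,
}

/// Formats a 64-bit hash as a zero-padded, 16-character lowercase hex string.
pub fn hash_to_hex(hash: u64) -> String {
    format!("{:016x}", hash)
}

/// Parses a 16-character hex string back into its 64-bit value.
pub fn parse_hash(hex: &str) -> Result<u64, HashParseError> {
    if hex.len() != 16 {
        return Err(HashParseError::InvalidLength(hex.len()));
    }
    // Reject anything that is not a hex digit (`from_str_radix` alone accepts a leading '+').
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(HashParseError::InvalidHex);
    }
    u64::from_str_radix(hex, 16).map_err(|_| HashParseError::InvalidHex)
}

/// Counts the bits that differ between two hex-encoded 64-bit hashes.
pub fn hamming_distance(a: &str, b: &str) -> Result<u32, HashParseError> {
    // XOR leaves a 1 in every position where the hashes disagree.
    Ok((parse_hash(a)? ^ parse_hash(b)?).count_ones())
}
```

With this sketch, `hamming_distance("00000000000000ff", "00000000000000fe")` yields `Ok(1)`, because the two values differ only in the lowest bit.
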
### `matcher.rs`

This file orchestrates the entire processing workflow for a given image blob.

- **`load_blob_checks(path)`**: Loads the rule set (a `Vec<BlobCheck>`) from the `rules/blobs.json` file at startup.
- **`download_blob(..., did, cid)`**: Fetches the image data using a **CDN-first** strategy. It first tries to download the image from `cdn.bsky.app` (trying multiple formats such as `jpeg` and `png`) and falls back to a direct `com.atproto.sync.getBlob` call on the PDS only if the CDN fails. This improves performance and reduces PDS load.
- **`match_phash(phash, checks, ...)`**: Compares a computed phash against all loaded `BlobCheck` rules. It iterates through each rule, calculates the hamming distance, and returns a `MatchResult` if the distance is within the rule's specified threshold.

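The threshold comparison in `match_phash` can be sketched as follows. The struct fields (`label`, `phashes`, `hamming_threshold`, `matched_phash`, `distance`) are assumptions made for illustration, as is returning the first rule that matches; the real `BlobCheck` and `MatchResult` types, and the extra arguments elided in the signature above, may differ.

```rust
/// Hypothetical, trimmed-down rule; the real `BlobCheck` may carry more fields.
#[derive(Debug, Clone)]
pub struct BlobCheck {
    pub label: String,
    pub phashes: Vec<String>,
    pub hamming_threshold: u32,
}

/// Hypothetical result type describing which rule matched and how closely.
#[derive(Debug, Clone, PartialEq)]
pub struct MatchResult {
    pub label: String,
    pub matched_phash: String,
    pub distance: u32,
}

/// Parses a 16-character hex hash, returning `None` for malformed input.
fn parse_hash(hex: &str) -> Option<u64> {
    if hex.len() == 16 && hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        u64::from_str_radix(hex, 16).ok()
    } else {
        None
    }
}

/// Returns the first rule whose known hash lies within that rule's threshold.
pub fn match_phash(phash: &str, checks: &[BlobCheck]) -> Option<MatchResult> {
    let target = parse_hash(phash)?;
    for check in checks {
        for known in &check.phashes {
            // Malformed rule entries are skipped rather than aborting the scan.
            if let Some(value) = parse_hash(known) {
                let distance = (target ^ value).count_ones();
                if distance <= check.hamming_threshold {
                    return Some(MatchResult {
                        label: check.label.clone(),
                        matched_phash: known.clone(),
                        distance,
                    });
                }
            }
        }
    }
    None
}
```

Because each rule carries its own threshold, a strict rule (threshold `0` or `1`) can sit alongside a looser one without affecting it. The scan is linear in the number of known hashes, which is usually adequate for rule sets of modest size.
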
## Processing Workflow

Processing follows a fixed sequence, orchestrated primarily by the `queue::worker` module, which calls into this processor:

1. **Check Cache**: The worker first checks the `cache` module for a pre-computed phash for the blob's CID.
2. **Download (if needed)**: On a cache miss, the worker calls `processor::matcher::download_blob` to fetch the image bytes.
3. **Compute (if needed)**: The downloaded bytes are passed to `processor::phash::compute_phash` to generate the 64-bit hash string.
4. **Store in Cache**: The newly computed hash is stored in the `cache` so the work is not repeated for the same CID.
5. **Match**: The hash is passed to `processor::matcher::match_phash`. This function iterates through all rules and uses `hamming_distance` to check for similarity.
6. **Return Result**: If a rule matches within its hamming distance threshold, a `MatchResult` struct is returned to the worker, which then triggers the `moderation` module.
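
Steps 1 through 4 amount to a cache-aside lookup, which can be sketched as below. Everything here is a stand-in: `PhashCache` is a hypothetical in-memory substitute for the `cache` module, `phash_for_blob` is an illustrative name, and the `download` and `compute` closures take the place of `download_blob` and `compute_phash`, whose real signatures are not shown in this document. Matching (steps 5 and 6) is left out.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `cache` module, keyed by blob CID.
#[derive(Default)]
pub struct PhashCache {
    entries: HashMap<String, String>,
}

impl PhashCache {
    pub fn get(&self, cid: &str) -> Option<String> {
        self.entries.get(cid).cloned()
    }

    pub fn put(&mut self, cid: &str, phash: &str) {
        self.entries.insert(cid.to_string(), phash.to_string());
    }
}

/// Returns the cached hash for `cid`, or downloads, computes, and stores it.
pub fn phash_for_blob<D, C>(
    cache: &mut PhashCache,
    cid: &str,
    download: D,
    compute: C,
) -> Result<String, String>
where
    D: FnOnce(&str) -> Result<Vec<u8>, String>,
    C: FnOnce(&[u8]) -> Result<String, String>,
{
    // Step 1: a cache hit skips both the download and the hash computation.
    if let Some(hit) = cache.get(cid) {
        return Ok(hit);
    }
    // Step 2: fetch the image bytes only on a cache miss.
    let bytes = download(cid)?;
    // Step 3: compute the hash from the downloaded bytes.
    let phash = compute(&bytes)?;
    // Step 4: store the result so later lookups for this CID are hits.
    cache.put(cid, &phash);
    Ok(phash)
}
```

In this sketch a failed download or computation returns early, so nothing is cached for that CID and a later attempt starts from scratch.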