A Rust implementation of skywatch-phash
# Processor Module

## Purpose

This module is the core logic engine of the service. It is responsible for downloading images, computing their perceptual hashes (phashes), and matching those hashes against a list of known problematic images.

## Key Components

### `phash.rs`

This file contains the fundamental perceptual hashing logic.

- **`compute_phash(image_bytes)`**: Takes raw image bytes, decodes them, and computes a 64-bit "average hash" (aHash) using the `image_hasher` crate. It returns the hash as a 16-character hex string.
- **`hamming_distance(hash1, hash2)`**: A highly efficient function that calculates the number of differing bits between two hashes. A distance of `0` means the hashes are identical, while a small distance (e.g., 1-5) indicates the images are visually similar.
- **`PhashError`**: A well-defined error enum using `thiserror` for handling issues like image decoding failures or invalid hash formats.

### `matcher.rs`

This file orchestrates the entire processing workflow for a given image blob.

- **`load_blob_checks(path)`**: Loads the rule set (a `Vec<BlobCheck>`) from the `rules/blobs.json` file at startup.
- **`download_blob(..., did, cid)`**: Fetches the image data. It implements a **CDN-first** strategy, first attempting to download the image from `cdn.bsky.app` (trying multiple formats like `jpeg`, `png`) before falling back to a direct `com.atproto.sync.getBlob` call on the PDS if the CDN fails. This improves performance and reduces PDS load.
- **`match_phash(phash, checks, ...)`**: Compares a computed phash against all loaded `BlobCheck` rules. It iterates through each rule, calculates the hamming distance, and returns a `MatchResult` if the distance is within the rule's specified threshold.

## Processing Workflow

The logic follows a clear sequence, primarily orchestrated by the `queue::worker` module which calls into this processor:

1. **Check Cache**: The worker first checks the `cache` module for a pre-computed phash for the blob's CID.
2. **Download (if needed)**: If it's a cache miss, the worker calls `processor::matcher::download_blob` to fetch the image bytes.
3. **Compute (if needed)**: The downloaded bytes are passed to `processor::phash::compute_phash` to generate the 64-bit hash string.
4. **Store in Cache**: The newly computed hash is stored in the `cache` to avoid this work in the future.
5. **Match**: The hash is passed to `processor::matcher::match_phash`. This function iterates through all rules and uses `hamming_distance` to check for similarity.
6. **Return Result**: If a match is found that is within the allowed hamming distance threshold, a `MatchResult` struct is returned to the worker, which then triggers the `moderation` module.
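
The cache, compute, and match steps can be illustrated with a minimal, standard-library-only sketch. It is not the module's actual code. The struct fields (`label`, `phash`, `hamming_threshold`, `distance`), the helper `phash_for_cid`, the in-memory `HashMap` cache, and the use of `ParseIntError` in place of `PhashError` are all assumptions made for illustration. The real `match_phash` takes additional parameters that are not shown here, and image decoding and downloading are left out entirely.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;

/// Hypothetical, simplified stand-in for one rule from `rules/blobs.json`.
#[derive(Debug, Clone, PartialEq)]
pub struct BlobCheck {
    pub label: String,
    /// Reference hash as a 16-character hex string.
    pub phash: String,
    /// Maximum number of differing bits that still counts as a match.
    pub hamming_threshold: u32,
}

/// Hypothetical, simplified match result.
#[derive(Debug, Clone, PartialEq)]
pub struct MatchResult {
    pub label: String,
    pub distance: u32,
}

/// Counts the differing bits between two 64-bit hashes given as hex strings.
/// XOR leaves a 1 in every position where the inputs differ, and
/// `count_ones` counts those positions.
pub fn hamming_distance(hash1: &str, hash2: &str) -> Result<u32, ParseIntError> {
    let a = u64::from_str_radix(hash1, 16)?;
    let b = u64::from_str_radix(hash2, 16)?;
    Ok((a ^ b).count_ones())
}

/// Returns the first rule whose distance to `phash` is within its threshold.
/// Rules whose stored hash fails to parse are skipped in this sketch.
pub fn match_phash(phash: &str, checks: &[BlobCheck]) -> Option<MatchResult> {
    checks.iter().find_map(|check| {
        let distance = hamming_distance(phash, &check.phash).ok()?;
        (distance <= check.hamming_threshold).then(|| MatchResult {
            label: check.label.clone(),
            distance,
        })
    })
}

/// Cache-first lookup: `compute` (standing in for download + `compute_phash`)
/// only runs when the CID has no cached hash yet.
pub fn phash_for_cid(
    cache: &mut HashMap<String, String>,
    cid: &str,
    compute: impl FnOnce() -> String,
) -> String {
    cache.entry(cid.to_string()).or_insert_with(compute).clone()
}
```

In this sketch, a rule with hash `00000000000000ff` and a threshold of 3 matches the input `00000000000000fe` (one differing bit) but not `0000000000000000` (eight differing bits). Returning the first rule within its threshold, rather than the closest one, is a design choice of the sketch. The source does not say which behavior the real matcher uses.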