A Rust implementation of skywatch-phash
Processor Module
Purpose
This module is the core logic engine of the service. It is responsible for downloading images, computing their perceptual hashes (phashes), and matching those hashes against a list of known problematic images.
Key Components
phash.rs
This file contains the fundamental perceptual hashing logic.
- `compute_phash(image_bytes)`: Takes raw image bytes, decodes them, and computes a 64-bit "average hash" (aHash) using the `image_hasher` crate. It returns the hash as a 16-character hex string.
- `hamming_distance(hash1, hash2)`: Calculates the number of differing bits between two hashes. A distance of 0 means the hashes are identical, while a small distance (e.g., 1-5) indicates the images are visually similar.
- `PhashError`: An error enum built with `thiserror` that covers failures such as image decoding errors and invalid hash formats.
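The bit-counting step behind `hamming_distance` can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the `HashParseError` type and the `parse_hash` helper are hypothetical names introduced here, and the real function may report errors through `PhashError` instead.

```rust
/// Hypothetical error type for this sketch; the real module uses `PhashError`,
/// whose variants are not shown in this document.
#[derive(Debug, PartialEq)]
pub enum HashParseError {
    /// The string was not exactly 16 characters long.
    BadLength(usize),
    /// The string contained a non-hexadecimal character.
    BadHex,
}

/// Parse a 16-character hex string into a 64-bit hash (hypothetical helper).
pub fn parse_hash(hex: &str) -> Result<u64, HashParseError> {
    if hex.len() != 16 {
        return Err(HashParseError::BadLength(hex.len()));
    }
    // `from_str_radix` would accept a leading '+', so check the digits first.
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(HashParseError::BadHex);
    }
    u64::from_str_radix(hex, 16).map_err(|_| HashParseError::BadHex)
}

/// Count the bits that differ between two hex-encoded 64-bit hashes.
pub fn hamming_distance(hash1: &str, hash2: &str) -> Result<u32, HashParseError> {
    let a = parse_hash(hash1)?;
    let b = parse_hash(hash2)?;
    // XOR leaves a 1 wherever the two hashes disagree; count those bits.
    Ok((a ^ b).count_ones())
}
```

Under these assumptions, two identical hashes give a distance of 0, and a pair that differs only in its last hex digit (`f` versus `0`) gives a distance of 4.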
matcher.rs
This file orchestrates the entire processing workflow for a given image blob.
- `load_blob_checks(path)`: Loads the rule set (a `Vec<BlobCheck>`) from the `rules/blobs.json` file at startup.
- `download_blob(..., did, cid)`: Fetches the image data using a CDN-first strategy. It first attempts to download the image from `cdn.bsky.app` (trying multiple formats such as `jpeg` and `png`), and falls back to a direct `com.atproto.sync.getBlob` call on the PDS only if the CDN fails. This improves performance and reduces PDS load.
- `match_phash(phash, checks, ...)`: Compares a computed phash against all loaded `BlobCheck` rules. It iterates through each rule, calculates the hamming distance, and returns a `MatchResult` if the distance is within the rule's specified threshold.
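To make the matching step concrete, here is a rough sketch of how a rule scan like `match_phash` might look. The field names on `BlobCheck` and `MatchResult` (`label`, `phashes`, `threshold`, `matched_phash`, `distance`) are assumptions for illustration, as are the `bit_distance` helper and the first-match-wins behavior. The real structs and signature (which takes further arguments) are not shown in this document.

```rust
/// Hypothetical rule shape; the real `BlobCheck` fields may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct BlobCheck {
    pub label: String,
    pub phashes: Vec<String>,
    /// Maximum hamming distance that still counts as a match.
    pub threshold: u32,
}

/// Hypothetical result shape; the real `MatchResult` fields may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct MatchResult {
    pub label: String,
    pub matched_phash: String,
    pub distance: u32,
}

/// Differing bits between two hex-encoded hashes, or `None` if either fails to parse.
/// Length validation is omitted here to keep the sketch short.
fn bit_distance(a: &str, b: &str) -> Option<u32> {
    let a = u64::from_str_radix(a, 16).ok()?;
    let b = u64::from_str_radix(b, 16).ok()?;
    Some((a ^ b).count_ones())
}

/// Return the first rule whose known hash is within its threshold of `phash`.
pub fn match_phash(phash: &str, checks: &[BlobCheck]) -> Option<MatchResult> {
    for check in checks {
        for known in &check.phashes {
            // Hashes that fail to parse are skipped rather than treated as matches.
            if let Some(distance) = bit_distance(phash, known) {
                if distance <= check.threshold {
                    return Some(MatchResult {
                        label: check.label.clone(),
                        matched_phash: known.clone(),
                        distance,
                    });
                }
            }
        }
    }
    None
}
```

Returning on the first hit keeps the scan cheap. A variant that collects every matching rule, or picks the smallest distance, would be equally plausible if several rules can apply to one image.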
Processing Workflow
Processing follows a fixed sequence, orchestrated primarily by the `queue::worker` module, which calls into this processor:
- Check Cache: The worker first checks the `cache` module for a pre-computed phash for the blob's CID.
- Download (if needed): If it's a cache miss, the worker calls `processor::matcher::download_blob` to fetch the image bytes.
- Compute (if needed): The downloaded bytes are passed to `processor::phash::compute_phash` to generate the 64-bit hash string.
- Store in Cache: The newly computed hash is stored in the `cache` to avoid this work in the future.
- Match: The hash is passed to `processor::matcher::match_phash`. This function iterates through all rules and uses `hamming_distance` to check for similarity.
- Return Result: If a match is found that is within the allowed hamming distance threshold, a `MatchResult` struct is returned to the worker, which then triggers the `moderation` module.
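The cache-then-compute portion of the workflow above can be sketched as follows. This is only an illustration under stated assumptions: the function name `phash_for_cid` is hypothetical, an in-memory `HashMap` stands in for the `cache` module, and the `download` and `compute` closures stand in for `download_blob` and `compute_phash`, whose real signatures are not shown here.

```rust
use std::collections::HashMap;

/// Return the phash for `cid`, downloading and hashing only on a cache miss.
///
/// `download` and `compute` are stand-ins (assumptions) for the real
/// `download_blob` and `compute_phash` functions.
pub fn phash_for_cid<D, C, E>(
    cache: &mut HashMap<String, String>,
    cid: &str,
    download: D,
    compute: C,
) -> Result<String, E>
where
    D: FnOnce(&str) -> Result<Vec<u8>, E>,
    C: FnOnce(&[u8]) -> Result<String, E>,
{
    // Step 1: check the cache for a previously computed hash.
    if let Some(hit) = cache.get(cid) {
        return Ok(hit.clone());
    }
    // Step 2: on a miss, fetch the image bytes.
    let bytes = download(cid)?;
    // Step 3: compute the 64-bit hash as a hex string.
    let phash = compute(&bytes)?;
    // Step 4: store the result so later lookups skip the download.
    cache.insert(cid.to_string(), phash.clone());
    Ok(phash)
}
```

The returned hash would then be handed to `match_phash`. Because the closures are only invoked on a miss, a second lookup for the same CID performs no download and no hashing.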