A Rust implementation of skywatch-phash

Processor Module

Purpose

This module is the core logic engine of the service. It is responsible for downloading images, computing their perceptual hashes (phashes), and matching those hashes against a list of known problematic images.

Key Components

phash.rs

This file contains the fundamental perceptual hashing logic.

  • compute_phash(image_bytes): Takes raw image bytes, decodes them, and computes a 64-bit "average hash" (aHash) using the image_hasher crate. It returns the hash as a 16-character hex string.
  • hamming_distance(hash1, hash2): Counts the bits that differ between two hashes. A distance of 0 means the hashes are identical, while a small distance (e.g., 1-5) indicates that the images are visually similar.
  • PhashError: An error enum built with thiserror that covers failures such as image decoding errors and invalid hash formats.
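The bit-counting idea behind hamming_distance can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: it assumes both hashes arrive as hex strings, reports parse failures with std's ParseIntError rather than the PhashError enum, and does not enforce the 16-character length. The parse_hash helper is hypothetical.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a hex string into a 64-bit hash value.
fn parse_hash(hex: &str) -> Result<u64, ParseIntError> {
    u64::from_str_radix(hex, 16)
}

/// Count the bits that differ between two hex-encoded 64-bit hashes.
fn hamming_distance(hash1: &str, hash2: &str) -> Result<u32, ParseIntError> {
    let a = parse_hash(hash1)?;
    let b = parse_hash(hash2)?;
    // XOR leaves a 1 in every bit position where the two hashes disagree,
    // so counting the set bits gives the Hamming distance.
    Ok((a ^ b).count_ones())
}
```

Under these assumptions, two identical hashes yield 0, and hashes that differ only in their lowest hex digit (for example 0 versus f) yield 4.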

matcher.rs

This file orchestrates the entire processing workflow for a given image blob.

  • load_blob_checks(path): Loads the rule set (a Vec<BlobCheck>) from the rules/blobs.json file at startup.
  • download_blob(..., did, cid): Fetches the image data using a CDN-first strategy. It first tries cdn.bsky.app (attempting multiple formats such as jpeg and png) and falls back to a direct com.atproto.sync.getBlob call on the PDS only if the CDN fails. This improves performance and reduces PDS load.
  • match_phash(phash, checks, ...): Compares a computed phash against all loaded BlobCheck rules. It iterates through each rule, calculates the hamming distance, and returns a MatchResult if the distance is within the rule's specified threshold.
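The threshold comparison in match_phash might look roughly like the sketch below. The field names on BlobCheck and MatchResult (label, phashes, hamming_threshold, matched_phash, distance) are assumptions made for illustration, since the real structs are not shown here, and the additional arguments elided in the signature above are left out. The sketch returns the first rule that matches and skips any stored hash that fails to parse.

```rust
/// Hypothetical rule shape; the real BlobCheck may carry different fields.
struct BlobCheck {
    label: String,
    phashes: Vec<String>,
    hamming_threshold: u32,
}

/// Hypothetical result shape describing which rule matched and how closely.
struct MatchResult {
    label: String,
    matched_phash: String,
    distance: u32,
}

/// Return the first rule whose stored hash is within its threshold of `phash`.
fn match_phash(phash: &str, checks: &[BlobCheck]) -> Option<MatchResult> {
    // An unparseable input hash cannot match anything.
    let target = u64::from_str_radix(phash, 16).ok()?;
    for check in checks {
        for known in &check.phashes {
            // Skip malformed rule entries instead of aborting the whole scan.
            let Ok(known_bits) = u64::from_str_radix(known, 16) else {
                continue;
            };
            let distance = (target ^ known_bits).count_ones();
            if distance <= check.hamming_threshold {
                return Some(MatchResult {
                    label: check.label.clone(),
                    matched_phash: known.clone(),
                    distance,
                });
            }
        }
    }
    None
}
```

Returning on the first hit keeps the scan short. A variant that collects every matching rule, or picks the smallest distance, would be an equally reasonable design if several rules can overlap.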

Processing Workflow

The logic follows a clear sequence, primarily orchestrated by the queue::worker module, which calls into this processor:

  1. Check Cache: The worker first checks the cache module for a pre-computed phash for the blob's CID.
  2. Download (if needed): If it's a cache miss, the worker calls processor::matcher::download_blob to fetch the image bytes.
  3. Compute (if needed): The downloaded bytes are passed to processor::phash::compute_phash to generate the 64-bit hash as a 16-character hex string.
  4. Store in Cache: The newly computed hash is stored in the cache to avoid this work in the future.
  5. Match: The hash is passed to processor::matcher::match_phash. This function iterates through all rules and uses hamming_distance to check for similarity.
  6. Return Result: If a match is found that is within the allowed hamming distance threshold, a MatchResult struct is returned to the worker, which then triggers the moderation module.
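Steps 1 through 4 of the sequence above amount to a cache-aside lookup, which can be sketched as follows. This is an illustration under stated assumptions rather than the worker's real code: a plain HashMap stands in for the cache module, the download and compute steps are passed in as closures so the example needs no network or image crates, and the function name phash_for_blob is hypothetical. Matching (steps 5 and 6) is left out.

```rust
use std::collections::HashMap;

/// Return the phash for `cid`, downloading and hashing only on a cache miss.
/// `cache` is a stand-in for the cache module, keyed by CID.
fn phash_for_blob<D, C, E>(
    cache: &mut HashMap<String, String>,
    cid: &str,
    download: D,
    compute: C,
) -> Result<String, E>
where
    D: FnOnce(&str) -> Result<Vec<u8>, E>,
    C: FnOnce(&[u8]) -> Result<String, E>,
{
    // Step 1: a cache hit skips both the download and the hash computation.
    if let Some(hit) = cache.get(cid) {
        return Ok(hit.clone());
    }
    // Step 2: fetch the image bytes (download_blob in the real service).
    let bytes = download(cid)?;
    // Step 3: hash the bytes (compute_phash in the real service).
    let phash = compute(&bytes)?;
    // Step 4: store the result so later lookups for this CID are hits.
    cache.insert(cid.to_string(), phash.clone());
    Ok(phash)
}
```

Because a failed download or computation returns early through `?`, nothing is written to the cache on error, so a later attempt for the same CID retries the work.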