High-performance implementation of plcbundle written in Rust

Bundle File Format#

Overview#

plcbundle uses an efficient multi-frame zstd format with embedded metadata for fast random access to individual operations.

Format Structure#

┌─────────────────────────────────────────┐
│  Skippable Frame (Metadata)             │  ← Magic: 0x184D2A50
│  - Format version                       │
│  - Bundle info                          │
│  - Frame offsets (relative)             │
├─────────────────────────────────────────┤
│  Data Frame 0 (Operations 0-99)         │  ← Zstd compressed
├─────────────────────────────────────────┤
│  Data Frame 1 (Operations 100-199)      │  ← Zstd compressed
├─────────────────────────────────────────┤
│  Data Frame 2 (Operations 200-299)      │  ← Zstd compressed
│  ...                                    │
├─────────────────────────────────────────┤
│  Data Frame 99 (Operations 9900-9999)   │  ← Zstd compressed
└─────────────────────────────────────────┘

Skippable Frame#

Zstd supports "skippable frames" (magic range 0x184D2A50 - 0x184D2A5F) that are ignored during normal decompression but can be read for metadata.

Frame Structure#

[4 bytes] Magic number (0x184D2A50, little-endian)
[4 bytes] Frame size in bytes (little-endian)
[N bytes] Payload data (JSON metadata)

Metadata JSON#

{
  "format": "plcbundle-v1",
  "bundle_number": 42,
  "origin": "https://plc.directory",
  "content_hash": "sha256...",
  "parent_hash": "sha256...",
  "operation_count": 10000,
  "did_count": 5234,
  "start_time": "2024-01-01T00:00:00Z",
  "end_time": "2024-01-01T23:59:59Z",
  "created_at": "2024-01-02T00:00:00Z",
  "created_by": "plcbundle/0.9.0",
  "frame_count": 100,
  "frame_size": 100,
  "frame_offsets": [0, 12450, 24800, ...]
}

Important: frame_offsets are RELATIVE to the first data frame (after metadata).

Data Frames#

Each data frame contains exactly 100 operations (except possibly the last frame) compressed with zstd level 1.

Frame Content#

Each frame is standard JSONL (newline-delimited JSON):

{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
...

Random Access Algorithm#

To access operation at position N:

  1. Calculate frame: frame_index = N / 100
  2. Position in frame: line_in_frame = N % 100
  3. Get frame offset:
    • Read metadata skippable frame
    • Convert relative offset to absolute: absolute = metadata_frame_size + relative_offsets[frame_index]
  4. Read frame:
    • Seek to absolute offset
    • Read offsets[frame_index+1] - offsets[frame_index] bytes
  5. Decompress: Use zstd to decompress frame data
  6. Extract line: Scan to line line_in_frame in decompressed data

Performance#

Access Pattern Old Format New Format Speedup
Position 0 54µs ~10µs 5.4x
Position 5000 ~2ms ~10µs 200x
Position 9999 ~3.5ms ~10µs 350x

The new format provides constant-time access regardless of position.

Implementation#

Reading Operations#

use plcbundle::bundle_format;

// Extract metadata (no decompression)
let metadata = bundle_format::extract_metadata_from_file(path)?;

// Access single operation
let json = bundle_format::load_operation_at_position(
    &mut file,
    position,
    &metadata.frame_offsets,
    metadata_frame_size,
)?;

BundleManager Integration#

The BundleManager::get_operation_raw() method automatically:

  1. Tries frame-based access (new format)
  2. Falls back to legacy sequential scan (old format)
  3. Transparent to caller
let manager = BundleManager::new(dir)?;

// Works with both old and new format bundles
let json = manager.get_operation_raw(bundle_num, position)?;

Format Versions#

Legacy Format (< v1)#

  • Single zstd frame containing all operations
  • No metadata
  • Sequential access only
  • Supported for backward compatibility

Current Format (v1)#

  • Metadata in skippable frame
  • Multiple data frames (100 ops each)
  • Frame offsets for random access
  • Default for new bundles

Migration#

Existing bundles in legacy format continue to work via automatic fallback. To upgrade:

// Re-save bundle with new format (future feature)
let operations = manager.load_bundle(num, LoadOptions::default())?;
manager.save_bundle_with_frames(num, operations)?;

Zstd Frame Details#

Skippable Frame Magic Numbers#

The zstd spec reserves 0x184D2A50 through 0x184D2A5F for user-defined skippable frames. We use 0x184D2A50 for metadata.

Why Multiple Frames?#

  1. Random Access: Jump to any operation in ~10µs
  2. Partial Decompression: Only decompress needed 100-operation chunk
  3. Memory Efficiency: Don't need to hold entire bundle in memory
  4. Streaming: Can process frame-by-frame

Frame Size Choice#

100 operations per frame balances:

  • Smaller frames = faster random access but more overhead
  • Larger frames = better compression but slower random access

100 operations (~50KB compressed) provides:

  • <10µs random access
  • ~10ms full bundle scan (100 frame decompressions)
  • Reasonable compression ratio (still ~90% compression)

Compatibility#

Reading#

  • New code reads both old and new formats
  • Automatic fallback to legacy mode
  • No user intervention needed

Writing#

  • New bundles use frame format by default
  • Old bundles remain in legacy format until re-written
  • Mixed format repositories work fine

Implementation Status#

✅ Completed#

  • Skippable frame reading/writing
  • Metadata struct and serialization
  • Frame-based operation loading
  • BundleManager integration
  • Legacy format fallback

🚧 In Progress#

  • Bundle writing with frame format
  • Frame offset calculation
  • Multi-frame compression

📋 TODO#

  • Migration tool (legacy → framed)
  • Benchmark suite
  • Format validation tool

See Also#