Bundle File Format#

Overview#

plcbundle uses an efficient multi-frame zstd format with embedded metadata for fast random access to individual operations.

Format Structure#

┌─────────────────────────────────────────┐
│  Skippable Frame (Metadata)             │  ← Magic: 0x184D2A50
│  - Format version                       │
│  - Bundle info                          │
│  - Frame offsets (relative)             │
├─────────────────────────────────────────┤
│  Data Frame 0 (Operations 0-99)         │  ← Zstd compressed
├─────────────────────────────────────────┤
│  Data Frame 1 (Operations 100-199)      │  ← Zstd compressed
├─────────────────────────────────────────┤
│  Data Frame 2 (Operations 200-299)      │  ← Zstd compressed
│  ...                                    │
├─────────────────────────────────────────┤
│  Data Frame 99 (Operations 9900-9999)   │  ← Zstd compressed
└─────────────────────────────────────────┘

Skippable Frame#

Zstd supports "skippable frames" (magic range 0x184D2A50 - 0x184D2A5F) that are ignored during normal decompression but can be read for metadata.

Frame Structure#

[4 bytes] Magic number (0x184D2A50, little-endian)
[4 bytes] Payload size N in bytes (little-endian, excludes this 8-byte header)
[N bytes] Payload data (JSON metadata)
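
As an illustration of the layout above, the following sketch reads and writes a skippable frame using only the Rust standard library. The helper names read_skippable_frame and write_skippable_frame are hypothetical and are not taken from the plcbundle API; only the magic range and the little-endian field layout come from the format description.

```rust
use std::io::{self, Read};

/// Lowest and highest magic numbers zstd reserves for skippable frames.
const SKIPPABLE_MAGIC_MIN: u32 = 0x184D_2A50;
const SKIPPABLE_MAGIC_MAX: u32 = 0x184D_2A5F;

/// Reads one skippable frame from `reader` and returns its payload bytes.
/// Hypothetical helper for illustration only.
fn read_skippable_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    // The header is 4 bytes of magic followed by 4 bytes of payload size.
    let mut header = [0u8; 8];
    reader.read_exact(&mut header)?;

    let magic = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    if !(SKIPPABLE_MAGIC_MIN..=SKIPPABLE_MAGIC_MAX).contains(&magic) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("not a skippable frame: magic {magic:#010x}"),
        ));
    }

    // The size field counts the payload only, not the 8-byte header.
    let size = u32::from_le_bytes([header[4], header[5], header[6], header[7]]) as usize;
    let mut payload = vec![0u8; size];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}

/// Wraps `payload` in a skippable frame with magic 0x184D2A50.
/// Hypothetical helper; assumes the payload fits in a u32 length.
fn write_skippable_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&SKIPPABLE_MAGIC_MIN.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}
```

Because the payload is plain JSON, the bytes returned by read_skippable_frame can be handed to any JSON parser without touching the zstd decoder.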

Metadata JSON#

{
  "format": "plcbundle-v1",
  "bundle_number": 42,
  "origin": "https://plc.directory",
  "content_hash": "sha256...",
  "parent_hash": "sha256...",
  "operation_count": 10000,
  "did_count": 5234,
  "start_time": "2024-01-01T00:00:00Z",
  "end_time": "2024-01-01T23:59:59Z",
  "created_at": "2024-01-02T00:00:00Z",
  "created_by": "plcbundle/0.9.0",
  "frame_count": 100,
  "frame_size": 100,
  "frame_offsets": [0, 12450, 24800, ...]
}

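Important: frame_offsets are relative to the start of the first data frame (the byte immediately after the metadata frame), so offset 0 is the start of Data Frame 0, not the start of the file.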
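
A writer could derive these offsets as a running sum of the compressed frame lengths. The sketch below is a minimal illustration, not the plcbundle implementation: relative_frame_offsets is a hypothetical helper, and it assumes the array carries one trailing entry marking the end of the last frame, which the read-length calculation in the random access steps appears to rely on.

```rust
/// Builds relative frame offsets from the compressed length of each data frame.
/// Hypothetical helper. Assumption: the result has one more entry than there
/// are frames, so the last entry marks the end of the final frame.
fn relative_frame_offsets(compressed_lens: &[u64]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(compressed_lens.len() + 1);
    let mut cursor = 0u64;
    // The first data frame always starts at relative offset 0.
    offsets.push(cursor);
    for len in compressed_lens {
        cursor += len;
        offsets.push(cursor);
    }
    offsets
}
```

Keeping the offsets relative means the metadata JSON can be serialized after the data frames are compressed without its own size feeding back into the values it stores.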

Data Frames#

Each data frame contains exactly 100 operations (except possibly the last frame) compressed with zstd level 1.

Frame Content#

Each frame is standard JSONL (newline-delimited JSON):

{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
...
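
To make the framing concrete, here is a small sketch of how serialized operations could be grouped into JSONL payloads before compression. build_frame_payloads is a hypothetical name, and terminating every line (including the last) with a newline is an assumption; compression itself is left out because it needs an external zstd crate.

```rust
/// Groups already-serialized operations into JSONL payloads, one per frame.
/// Hypothetical helper. `frame_size` must be non-zero (slice::chunks panics
/// on a chunk size of 0). The last payload may hold fewer lines.
fn build_frame_payloads(operations: &[String], frame_size: usize) -> Vec<String> {
    operations
        .chunks(frame_size)
        .map(|chunk| {
            // One JSON document per line, each line newline-terminated.
            let mut payload = chunk.join("\n");
            payload.push('\n');
            payload
        })
        .collect()
}
```

Each returned string would then be compressed as its own zstd frame, so that any single frame can be decompressed independently of the others.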

Random Access Algorithm#

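To access the operation at position N (0-based within the bundle):

1. Calculate frame: frame_index = N / 100 (integer division)
2. Position in frame: line_in_frame = N % 100
3. Get frame offset:
   - Read the metadata skippable frame
   - Convert the relative offset to absolute: absolute = metadata_frame_size + frame_offsets[frame_index]. For this sum to be a file offset, metadata_frame_size must cover the whole skippable frame (8-byte header plus payload).
4. Read frame:
   - Seek to the absolute offset
   - Read frame_offsets[frame_index+1] - frame_offsets[frame_index] bytes (for the last frame this needs a trailing end offset in frame_offsets, or a read to end of file)
5. Decompress: use zstd to decompress the frame data
6. Extract line: scan to line line_in_frame (0-based) in the decompressed data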
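
The arithmetic in these steps can be sketched as a pure function. This is an illustration under stated assumptions rather than the plcbundle code: FrameLocation, locate_operation, and extract_line are hypothetical names, metadata_frame_size is assumed to include the 8-byte skippable frame header, and frame_offsets is assumed to end with the offset just past the last frame. Seeking and zstd decompression are omitted because they need file I/O and an external crate.

```rust
/// Operations per data frame, per the format description.
const FRAME_SIZE: usize = 100;

/// Where one operation lives inside a bundle file (hypothetical type).
#[derive(Debug, PartialEq)]
struct FrameLocation {
    frame_index: usize,
    line_in_frame: usize,
    absolute_offset: u64,
    compressed_len: u64,
}

/// Maps an operation position to the frame that holds it.
/// Returns None if the position lies beyond the offsets provided.
fn locate_operation(
    position: usize,
    frame_offsets: &[u64],
    metadata_frame_size: u64,
) -> Option<FrameLocation> {
    let frame_index = position / FRAME_SIZE;
    let line_in_frame = position % FRAME_SIZE;
    let start = *frame_offsets.get(frame_index)?;
    // Assumption: a trailing entry gives the end of the last frame.
    let end = *frame_offsets.get(frame_index + 1)?;
    Some(FrameLocation {
        frame_index,
        line_in_frame,
        absolute_offset: metadata_frame_size + start,
        compressed_len: end.checked_sub(start)?,
    })
}

/// Picks one line out of a decompressed JSONL frame.
fn extract_line(decompressed: &str, line_in_frame: usize) -> Option<&str> {
    decompressed.lines().nth(line_in_frame)
}
```

A caller would seek to absolute_offset, read compressed_len bytes, decompress them, and pass the result to extract_line together with line_in_frame.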

Performance#

Access Pattern	Old Format	New Format	Speedup
Position 0	54µs	~10µs	5.4x
Position 5000	~2ms	~10µs	200x
Position 9999	~3.5ms	~10µs	350x

The new format provides near-constant-time access regardless of position: every lookup decompresses a single frame and scans at most 100 lines.

Implementation#

Reading Operations#

use plcbundle::bundle_format;

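// Extract metadata by reading only the skippable frame (no decompression)
let metadata = bundle_format::extract_metadata_from_file(path)?;

// Access a single operation. `file` is an open handle to the same bundle,
// `position` is the 0-based operation index, and `metadata_frame_size` is the
// byte length of the metadata frame that precedes the data frames.
let json = bundle_format::load_operation_at_position(
    &mut file,
    position,
    &metadata.frame_offsets,
    metadata_frame_size,
)?;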

BundleManager Integration#

The BundleManager::get_operation_raw() method handles both formats transparently to the caller. It automatically:

Tries frame-based access first (new format)
Falls back to a legacy sequential scan (old format)

let manager = BundleManager::new(dir)?;

// Works with both old and new format bundles
let json = manager.get_operation_raw(bundle_num, position)?;

Format Versions#

Legacy Format (< v1)#

Single zstd frame containing all operations
No metadata
Sequential access only
Supported for backward compatibility

Current Format (v1)#

Metadata in skippable frame
Multiple data frames (100 ops each)
Frame offsets for random access
Default for new bundles
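
One way to tell the two versions apart is to inspect the first four bytes of the file. The sketch below assumes a v1 bundle starts with the 0x184D2A50 metadata frame and a legacy bundle starts directly with a regular zstd frame (magic 0xFD2FB528). BundleFormat and detect_format are hypothetical names; how BundleManager actually chooses between the two paths is not specified here.

```rust
/// Magic number of the skippable frame that plcbundle v1 uses for metadata.
const METADATA_MAGIC: u32 = 0x184D_2A50;
/// Magic number that starts every regular zstd frame.
const ZSTD_MAGIC: u32 = 0xFD2F_B528;

/// Hypothetical classification of a bundle file.
#[derive(Debug, PartialEq)]
enum BundleFormat {
    Framed,
    Legacy,
    Unknown,
}

/// Classifies a bundle from the first bytes of the file.
fn detect_format(header: &[u8]) -> BundleFormat {
    if header.len() < 4 {
        return BundleFormat::Unknown;
    }
    // Both magic numbers are stored little-endian on disk.
    let magic = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    match magic {
        METADATA_MAGIC => BundleFormat::Framed,
        ZSTD_MAGIC => BundleFormat::Legacy,
        _ => BundleFormat::Unknown,
    }
}
```

A check like this costs a single 4-byte read, so it could run before every lookup without affecting the access times quoted above.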

Migration#

Existing bundles in legacy format continue to work via automatic fallback. Re-saving a bundle in the new format is a planned feature; the intended usage is:

// Re-save bundle with new format (future feature)
let operations = manager.load_bundle(num, LoadOptions::default())?;
manager.save_bundle_with_frames(num, operations)?;

Zstd Frame Details#

Skippable Frame Magic Numbers#

The zstd spec reserves 0x184D2A50 through 0x184D2A5F for user-defined skippable frames. We use 0x184D2A50 for metadata.

Why Multiple Frames?#

Random Access: Jump to any operation in ~10µs
Partial Decompression: Only decompress needed 100-operation chunk
Memory Efficiency: Don't need to hold entire bundle in memory
Streaming: Can process frame-by-frame

Frame Size Choice#

100 operations per frame balances:

Smaller frames = faster random access but more overhead
Larger frames = better compression but slower random access

100 operations (~50KB compressed) provides:

~10µs random access
~10ms full bundle scan (100 frame decompressions)
Reasonable compression (still ~90% size reduction)

Compatibility#

Reading#

New code reads both old and new formats
Automatic fallback to legacy mode
No user intervention needed

Writing#

New bundles use frame format by default
Old bundles remain in legacy format until re-written
Mixed format repositories work fine

Implementation Status#

✅ Completed#

Skippable frame reading/writing
Metadata struct and serialization
Frame-based operation loading
BundleManager integration
Legacy format fallback

🚧 In Progress#

Bundle writing with frame format
Frame offset calculation
Multi-frame compression

📋 TODO#

Migration tool (legacy → framed)
Benchmark suite
Format validation tool