Bundle File Format#
Overview#
plcbundle uses an efficient multi-frame zstd format with embedded metadata for fast random access to individual operations.
Format Structure#
┌─────────────────────────────────────────┐
│ Skippable Frame (Metadata) │ ← Magic: 0x184D2A50
│ - Format version │
│ - Bundle info │
│ - Frame offsets (relative) │
├─────────────────────────────────────────┤
│ Data Frame 0 (Operations 0-99) │ ← Zstd compressed
├─────────────────────────────────────────┤
│ Data Frame 1 (Operations 100-199) │ ← Zstd compressed
├─────────────────────────────────────────┤
│ Data Frame 2 (Operations 200-299) │ ← Zstd compressed
│ ... │
├─────────────────────────────────────────┤
│ Data Frame 99 (Operations 9900-9999) │ ← Zstd compressed
└─────────────────────────────────────────┘
Skippable Frame#
Zstd supports "skippable frames" (magic range 0x184D2A50 - 0x184D2A5F) that are ignored during normal decompression but can be read for metadata.
Frame Structure#
[4 bytes] Magic number (0x184D2A50, little-endian)
[4 bytes] Frame size in bytes (little-endian)
[N bytes] Payload data (JSON metadata)
Metadata JSON#
{
"format": "plcbundle-v1",
"bundle_number": 42,
"origin": "https://plc.directory",
"content_hash": "sha256...",
"parent_hash": "sha256...",
"operation_count": 10000,
"did_count": 5234,
"start_time": "2024-01-01T00:00:00Z",
"end_time": "2024-01-01T23:59:59Z",
"created_at": "2024-01-02T00:00:00Z",
"created_by": "plcbundle/0.9.0",
"frame_count": 100,
"frame_size": 100,
"frame_offsets": [0, 12450, 24800, ...]
}
Important: frame_offsets are RELATIVE to the first data frame (after metadata).
Data Frames#
Each data frame contains exactly 100 operations (except possibly the last frame) compressed with zstd level 1.
Frame Content#
Each frame is standard JSONL (newline-delimited JSON):
{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
{"did":"...","operation":{...},"cid":"...","nullified":false,"createdAt":"..."}
...
Random Access Algorithm#
To access operation at position N:
- Calculate frame:
frame_index = N / 100 - Position in frame:
line_in_frame = N % 100 - Get frame offset:
- Read metadata skippable frame
- Convert relative offset to absolute:
absolute = metadata_frame_size + relative_offsets[frame_index]
- Read frame:
- Seek to
absoluteoffset - Read
offsets[frame_index+1] - offsets[frame_index]bytes
- Seek to
- Decompress: Use zstd to decompress frame data
- Extract line: Scan to line
line_in_framein decompressed data
Performance#
| Access Pattern | Old Format | New Format | Speedup |
|---|---|---|---|
| Position 0 | 54µs | ~10µs | 5.4x |
| Position 5000 | ~2ms | ~10µs | 200x |
| Position 9999 | ~3.5ms | ~10µs | 350x |
The new format provides constant-time access regardless of position.
Implementation#
Reading Operations#
use plcbundle::bundle_format;
// Extract metadata (no decompression)
let metadata = bundle_format::extract_metadata_from_file(path)?;
// Access single operation
let json = bundle_format::load_operation_at_position(
&mut file,
position,
&metadata.frame_offsets,
metadata_frame_size,
)?;
BundleManager Integration#
The BundleManager::get_operation_raw() method automatically:
- Tries frame-based access (new format)
- Falls back to legacy sequential scan (old format)
- Transparent to caller
let manager = BundleManager::new(dir)?;
// Works with both old and new format bundles
let json = manager.get_operation_raw(bundle_num, position)?;
Format Versions#
Legacy Format (< v1)#
- Single zstd frame containing all operations
- No metadata
- Sequential access only
- Supported for backward compatibility
Current Format (v1)#
- Metadata in skippable frame
- Multiple data frames (100 ops each)
- Frame offsets for random access
- Default for new bundles
Migration#
Existing bundles in legacy format continue to work via automatic fallback. To upgrade:
// Re-save bundle with new format (future feature)
let operations = manager.load_bundle(num, LoadOptions::default())?;
manager.save_bundle_with_frames(num, operations)?;
Zstd Frame Details#
Skippable Frame Magic Numbers#
The zstd spec reserves 0x184D2A50 through 0x184D2A5F for user-defined skippable frames.
We use 0x184D2A50 for metadata.
Why Multiple Frames?#
- Random Access: Jump to any operation in ~10µs
- Partial Decompression: Only decompress needed 100-operation chunk
- Memory Efficiency: Don't need to hold entire bundle in memory
- Streaming: Can process frame-by-frame
Frame Size Choice#
100 operations per frame balances:
- Smaller frames = faster random access but more overhead
- Larger frames = better compression but slower random access
100 operations (~50KB compressed) provides:
- <10µs random access
- ~10ms full bundle scan (100 frame decompressions)
- Reasonable compression ratio (still ~90% compression)
Compatibility#
Reading#
- New code reads both old and new formats
- Automatic fallback to legacy mode
- No user intervention needed
Writing#
- New bundles use frame format by default
- Old bundles remain in legacy format until re-written
- Mixed format repositories work fine
Implementation Status#
✅ Completed#
- Skippable frame reading/writing
- Metadata struct and serialization
- Frame-based operation loading
- BundleManager integration
- Legacy format fallback
🚧 In Progress#
- Bundle writing with frame format
- Frame offset calculation
- Multi-frame compression
📋 TODO#
- Migration tool (legacy → framed)
- Benchmark suite
- Format validation tool
See Also#
- Zstd Skippable Frames Spec
- Go implementation:
/Users/tree/Projects/atscan/plcbundle/internal/storage/