commits
- COLLECTION_BUNDLE now writes to ing.dasl.masl instead of
  systems.witchcraft.archive.bundle
- record format: name + resources at top level (MASL required),
  archive metadata under systems.witchcraft.archive namespace
- blob refs inlined into resources[path].src (tiles spec compliant)
- simplified list/search/verify to use single format
- added migrate_bundles.py for migrating old bundle records
- fixed site_archive.py auth (pass env vars to subprocess)
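
A rough sketch of what migrating one old bundle record could involve. This is not the contents of migrate_bundles.py. It assumes old records carry a `masl` field, a `blobs` map keyed by resource path, and `url`/`title`/`capturedAt` at the top level, and that the archive metadata ends up under a key named after the namespace. The function name `migrate_record` is hypothetical.

```python
from copy import deepcopy

NEW_COLLECTION = "ing.dasl.masl"
ARCHIVE_NS = "systems.witchcraft.archive"  # ASSUMPTION: namespace used as a record key


def migrate_record(old: dict) -> dict:
    """Convert an old-style bundle record into the single new format."""
    masl = old.get("masl", {})
    blobs = old.get("blobs", {})  # ASSUMPTION: keyed by resource path

    resources = {}
    for path, entry in masl.get("resources", {}).items():
        new_entry = deepcopy(entry)
        # Inline the blob ref in place of the CID string when one is known;
        # otherwise keep the original src untouched.
        if path in blobs:
            new_entry["src"] = deepcopy(blobs[path])
        resources[path] = new_entry

    # Archive metadata moves from the top level into its own namespace.
    archive_meta = {k: old[k] for k in ("url", "title", "capturedAt") if k in old}

    return {
        "$type": NEW_COLLECTION,
        "name": masl.get("name") or old.get("title", ""),
        "resources": resources,
        ARCHIVE_NS: archive_meta,
    }
```

The input record is left unmodified, so a migration run can compare old and new records before writing anything.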
BFS-crawls a site, archives each page as a bundle via web_archive.py,
and creates a site manifest record (systems.witchcraft.archive.site)
linking all pages together. Internal links are rewritten to point at
sibling captures.
features:
- configurable depth and max pages
- dry-run mode to preview crawl
- polite crawl delay between requests
- skips binary files (images, fonts, etc.)
- site manifest with page list and link map
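
A minimal sketch of the crawl loop described above, using only the standard library. It is not the actual site_archive.py. The names `crawl`, `get_links`, and `archive`, the extension list, and the default values are all assumptions made for illustration. Fetching and link extraction are passed in as a callable so the loop can be run without network access.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# ASSUMPTION: illustrative list; the real tool may use a different set.
BINARY_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
                     ".woff", ".woff2", ".ttf", ".pdf", ".zip")


def is_binary(url):
    """True if the URL path ends in a known binary file extension."""
    return urlparse(url).path.lower().endswith(BINARY_EXTENSIONS)


def crawl(start_url, get_links, max_depth=2, max_pages=50,
          delay=1.0, dry_run=False, archive=None):
    """Breadth-first crawl of a single host. Returns (pages, link_map)."""
    start_url = urldefrag(start_url).url
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages, link_map = [], {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if pages and delay > 0:
            time.sleep(delay)  # polite delay between requests

        # Keep only same-host, non-binary links, with fragments removed.
        links = []
        for href in get_links(url):
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and not is_binary(link):
                links.append(link)
        links = list(dict.fromkeys(links))  # de-duplicate, keep order

        pages.append(url)
        link_map[url] = links
        if not dry_run and archive is not None:
            archive(url)  # e.g. hand the page to the bundle archiver

        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages, link_map
```

In this sketch a dry run still calls `get_links` to discover pages but never calls `archive`, and the returned `pages` and `link_map` correspond to the page list and link map a site manifest would need.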
- new collection: systems.witchcraft.archive.bundle (was ing.dasl.masl)
- MASL data in .masl field with CID strings in src (spec-conformant)
- ATProto blob refs in separate .blobs map for content retrieval
- archive metadata (url, title, capturedAt) at record top level
- backwards-compatible: list/search/verify handle legacy records
- thanks to nel.pet for the spec feedback!
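
For illustration, a record in the shape this commit describes might be assembled as below. This is a sketch under assumptions, not the code in web_archive.py: the helper name `build_bundle_record`, the `content-type` key inside each resource, and keying `.blobs` by resource path are all guesses. The blob ref layout follows the usual ATProto JSON blob form.

```python
BUNDLE_COLLECTION = "systems.witchcraft.archive.bundle"


def build_bundle_record(url, title, captured_at, name, files):
    """Build a bundle record.

    files maps resource path -> (cid, mime_type, size_in_bytes).
    """
    resources, blobs = {}, {}
    for path, (cid, mime_type, size) in files.items():
        # MASL side: src is a plain CID string.
        resources[path] = {"src": cid, "content-type": mime_type}
        # ATProto side: blob ref kept separately for content retrieval.
        # ASSUMPTION: the blobs map is keyed by the same resource path.
        blobs[path] = {
            "$type": "blob",
            "ref": {"$link": cid},
            "mimeType": mime_type,
            "size": size,
        }
    return {
        "$type": BUNDLE_COLLECTION,
        # Archive metadata sits at the record top level.
        "url": url,
        "title": title,
        "capturedAt": captured_at,
        "masl": {"name": name, "resources": resources},
        "blobs": blobs,
    }
```

Keeping CID strings in `masl` and blob refs in `blobs` means the MASL part can be read on its own, while the blob refs stay available for retrieving content.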