Abstract#
This proposal presents a comprehensive plan to evolve Tangled into a sovereign, unified AI development platform by combining:
- GitHub-style code collaboration
- Hugging Face ecosystem compatibility
- Petabyte-scale model and dataset hosting
- AI inference and training capabilities
- Oxen-based model and dataset version control
- PB-scale storage with Xet-style deduplication on S3/MinIO
The platform is fully open source, built entirely on Hugging Face components and Oxen for efficient versioning, ensuring interoperability and developer sovereignty.
Motivation#
Modern AI development is fragmented:
| Layer | Platform |
|---|---|
| Source code | GitHub |
| Models & datasets | Hugging Face Hub |
| Inference & demos | Cloud vendors |
Problems:
- Ecosystem fragmentation and silos
- Platform lock-in
- Inefficient large model storage
- Lack of self-hosted, sovereign infrastructure
Tangled’s decentralized architecture and Knot nodes make it well suited to unifying these layers while maintaining:
- HF SDK compatibility
- Git-native workflow
- Self-hosted, secure infrastructure
Design Principles#
- Strict Upstream Dependency: All AI workflows rely on Hugging Face open-source components:
  - Transformers
  - Diffusers
  - Datasets
  - huggingface_hub
  - SafeTensors
  - Tokenizers
  - Text Generation Inference
  - Accelerate
  - PEFT
  - TRL
  - Gradio
  - Evaluate
- HF API Compatibility: Implement a Hub-compatible REST API layer so that HF SDKs can interact with Tangled nodes without modification.
- Infrastructure Separation:
| Layer | Responsibility |
|---|---|
| Tangled | Git repos, identity, storage |
| HF Components | AI models, datasets, inference, evaluation |
| Runtime | Inference & application deployment |
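As a sketch of the Hub-compatibility principle, a Knot node could expose the same route shape as the public Hub API (`GET /api/models/{repo_id}`). The in-memory registry, the repo name `acme/tiny-llm`, and the response fields below are illustrative assumptions, not Tangled's actual implementation:

```python
# Minimal sketch of a Hub-compatible REST endpoint on a Knot node.
# Route shape mirrors the public HF Hub API (GET /api/models/{repo_id});
# the repo registry and response fields here are illustrative only.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REPOS = {  # hypothetical in-memory repo index
    "acme/tiny-llm": {"siblings": ["config.json", "model.safetensors"]},
}

class HubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        prefix = "/api/models/"
        repo_id = self.path[len(prefix):] if self.path.startswith(prefix) else None
        if repo_id in REPOS:
            body = json.dumps({"id": repo_id, **REPOS[repo_id]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve_once(port=0):
    """Start the sketch server on a free port in a background thread."""
    server = HTTPServer(("127.0.0.1", port), HubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    srv = serve_once()
    url = f"http://127.0.0.1:{srv.server_address[1]}/api/models/acme/tiny-llm"
    info = json.loads(urllib.request.urlopen(url).read())
    print(info["id"])  # acme/tiny-llm
    srv.shutdown()
```

Because stock HF SDKs resolve repos through exactly these routes, a node serving them can in principle be targeted by unmodified clients via a custom endpoint.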
Core Platform Architecture#
Developer
│
HF SDK / Git Clients
│
HF Compatibility Layer (REST API)
│
Knot Node
┌─────────────┬───────────────┬──────────────┐
│ │ │ │
Model Hub Dataset Hub Code Repos AI Runtime
│ │ │ │
Transformers Datasets Git TGI / Candle
│ │ │ │
SafeTensors Apache Arrow CI/CD Gradio Apps
Unified Repository Model#
| Repo Type | Content |
|---|---|
| Code Repo | Source code, scripts, workflows |
| Model Repo | HF models (.safetensors), config, tokenizer |
| Dataset Repo | HF datasets (.arrow / streaming) |
Example repository structure:
repo/
├── model/
│ ├── config.json
│ ├── tokenizer.json
│ └── model.safetensors
│
├── dataset/
│ └── data.arrow
│
├── code/
│ └── training scripts
│
└── README.md
Model Version Control (Oxen-based)#
AI models require advanced version control for large binaries and delta tracking.
- Integrate Oxen for version control of large model and dataset files.
- Oxen enables:
  - Binary delta tracking
  - Dataset diffs
  - Efficient cloning
  - Fine-tuning version management (LoRA / QLoRA)
Example workflow:
Base model
│
commit v1
│
Fine-tuned model
│
commit v2
Only modified chunks are stored, drastically reducing storage footprint.
PB-Scale Model Storage Architecture#
Content-Addressable Storage (CAS)#
- All models and datasets are chunked and stored using hash-based identifiers
- Deduplication is applied to shared layers and dataset blocks
model file
│
chunked
│
hash storage
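The chunk-and-hash flow above can be sketched in a few lines. This minimal store uses fixed-size chunks for brevity (the CDC variant is covered in the next section); the `CASStore` class and all sizes are illustrative assumptions:

```python
# Sketch of content-addressable chunk storage with deduplication.
# Fixed-size chunking keeps the example short; real systems would
# use content-defined chunking as described in the proposal.
import hashlib
import random

CHUNK_SIZE = 4096

def chunk(data: bytes):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

class CASStore:
    def __init__(self):
        self.blobs = {}  # sha256 hex -> chunk bytes

    def put(self, data: bytes):
        """Store a file; return its manifest (ordered list of chunk hashes)."""
        manifest = []
        for c in chunk(data):
            h = hashlib.sha256(c).hexdigest()
            self.blobs.setdefault(h, c)  # dedup: identical chunks stored once
            manifest.append(h)
        return manifest

    def get(self, manifest):
        """Reassemble a file from its manifest."""
        return b"".join(self.blobs[h] for h in manifest)

store = CASStore()
random.seed(1)
base = bytes(random.randrange(256) for _ in range(16_384))   # 4-chunk "model"
variant = base[:8192] + bytes(4096) + base[12_288:]          # one chunk changed
m1 = store.put(base)
m2 = store.put(variant)
assert store.get(m1) == base and store.get(m2) == variant
# Only the modified chunk adds new storage:
print(len(set(m2) - set(m1)))  # 1
```

Repo metadata then only needs to reference manifests; uploading a variant costs storage proportional to what actually changed.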
Xet-style Deduplication#
- Use Content-Defined Chunking (CDC)
- Shared model layers across variants are stored once
- LoRA or quantized variants reuse base layers
Base LLM: 120 GB
Fine-tuned variant: +2 GB delta
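The layer-reuse claim rests on content-defined chunking: because boundaries are derived from the bytes themselves, a small insertion (such as a fine-tuning delta) disturbs only nearby chunks and the rest re-align. A minimal gear-hash sketch, with illustrative parameters:

```python
# Sketch of content-defined chunking (CDC): chunk boundaries depend on
# content, so variants of a model reuse most of the base model's chunks.
# Mask/size parameters are illustrative, not Xet's actual values.
import hashlib
import random

# Deterministic 256-entry "gear" table for the rolling hash.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]

def cdc_chunks(data, mask=0x3FF, min_size=256, max_size=4096):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the rolling hash hits the mask (avg ~1 KiB chunks),
        # bounded by min/max chunk sizes.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
base = bytes(random.randrange(256) for _ in range(50_000))
variant = base[:10_000] + b"small-delta" + base[10_000:]  # insertion

h1 = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(base)}
h2 = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(variant)}
# Most chunks survive the insertion and can be deduplicated:
print(len(h1 & h2) / len(h1) > 0.7)  # True
```

With fixed-size chunking the same insertion would shift every later boundary and invalidate all downstream chunks; CDC is what makes the "+2 GB delta" picture above achievable.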
Object Storage Backend#
- S3-compatible storage (MinIO)
- Horizontal scaling, high throughput
- Distributed replication
Storage Layout:
Object Storage (S3 / MinIO)
models/
chunk_hash_1
chunk_hash_2
datasets/
arrow_blocks
metadata/
repo_index
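A single-node MinIO backend matching this layout can be sketched with Docker Compose; the image and server flags are MinIO's standard ones, while the credentials and volume name are placeholders:

```yaml
# Sketch: single-node MinIO as the S3-compatible chunk store.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: tangled          # placeholder credentials
      MINIO_ROOT_PASSWORD: change-me
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - minio-data:/data
volumes:
  minio-data:
```

A production deployment would instead run MinIO in distributed mode across several hosts to get the replication and horizontal scaling listed above.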
Knot Node AI Extension Architecture#
Each Knot node supports modular AI extensions:
- Model Service: stores, versions, and serves HF models
- Dataset Service: streams and versions datasets
- Inference Service: LLM and diffusion inference (TGI / Candle)
- Application Service: hosts Gradio/Streamlit demos
- Training Service: orchestrates distributed training via Accelerate / PEFT / TRL
The node architecture mirrors the core platform diagram in the Core Platform Architecture section above, with each of these services attached as a modular extension.
Phased Implementation Roadmap#
Phase 1 — Hugging Face Compatibility#
- HF Hub API layer
- Model repositories
- Dataset repositories
- Basic inference
Phase 2 — AI Development Platform#
- Distributed inference (TGI / Candle)
- AI application hosting (Gradio)
- Training pipelines (Accelerate / PEFT / TRL)
- Oxen-based model & dataset versioning
Phase 3 — Global AI Infrastructure#
- Federated Knot nodes
- Global dataset distribution
- Petabyte-scale storage with Xet deduplication
- Enterprise-scale deployment
Expected Impact#
Tangled becomes a sovereign, fully integrated AI platform:
- GitHub-style collaboration
- HF ecosystem compatibility
- PB-scale model storage with deduplication
- Oxen-based advanced version control
- AI inference, demo hosting, and distributed training
- Federation and global scaling
Conclusion#
This proposal establishes Tangled as a next-generation AI infrastructure platform, bridging the best of:
- GitHub (developer workflow)
- Hugging Face Hub (AI models, datasets, training, inference)
All while maintaining open-source fidelity, sovereignty, and scalable storage/compute infrastructure.