Monorepo for Tangled (tangled.org)

RFC: Transform Tangled into a Unified AI Infrastructure Platform (GitHub + Hugging Face Ecosystem) #440

Open · opened by offsec.tngl.sh · edited

Abstract#

This proposal presents a comprehensive plan to evolve Tangled into a sovereign, unified AI development platform by combining:

  • GitHub-style code collaboration
  • Hugging Face ecosystem compatibility
  • Petabyte-scale model and dataset hosting
  • AI inference and training capabilities
  • Oxen-based model and dataset version control
  • PB-scale storage with Xet-style deduplication on S3/MinIO

The platform is fully open source, building exclusively on Hugging Face components and on Oxen for efficient versioning, which ensures interoperability and developer sovereignty.


Motivation#

Modern AI development is fragmented:

| Layer | Platform |
| --- | --- |
| Source code | GitHub |
| Models & datasets | Hugging Face Hub |
| Inference & demos | Cloud vendors |

Problems:

  • Ecosystem fragmentation and silos
  • Platform lock-in
  • Inefficient large model storage
  • Lack of self-hosted, sovereign infrastructure

Tangled’s decentralized architecture and Knot nodes are well suited to unifying these layers while maintaining:

  • HF SDK compatibility
  • Git-native workflow
  • Self-hosted, secure infrastructure

Design Principles#

  1. Strict Upstream Dependency: all AI workflows rely on Hugging Face open-source components:
     • Transformers
     • Diffusers
     • Datasets
     • huggingface_hub
     • SafeTensors
     • Tokenizers
     • Text Generation Inference
     • Accelerate
     • PEFT
     • TRL
     • Gradio
     • Evaluate

  2. HF API Compatibility: implement a Hub-compatible REST API layer so that HF SDKs can interact with Tangled nodes without modification.

  3. Infrastructure Separation:

| Layer | Responsibility |
| --- | --- |
| Tangled | Git repos, identity, storage |
| HF Components | AI models, datasets, inference, evaluation |
| Runtime | Inference & application deployment |
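The compatibility principle above can be sketched concretely: the `huggingface_hub` client honours the `HF_ENDPOINT` environment variable, so a Tangled node only needs to serve the same route shapes as the public Hub. A minimal sketch of the file-resolve URL scheme, assuming a hypothetical node host (`knot.example.tangled.org` is an invented name):

```python
# Sketch of the Hub-compatible URL scheme a Tangled node would expose so that
# unmodified HF SDKs (which honour the HF_ENDPOINT environment variable) can
# resolve files against it. The /resolve route mirrors the public Hub API;
# the host below is a hypothetical example, not a real deployment.
import os

def hub_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the file-resolve URL that huggingface_hub clients request."""
    endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"

# Point SDKs at a (hypothetical) self-hosted Knot node:
os.environ["HF_ENDPOINT"] = "https://knot.example.tangled.org"
print(hub_file_url("acme/llama-ft", "model.safetensors"))
# https://knot.example.tangled.org/acme/llama-ft/resolve/main/model.safetensors
```

In practice the same redirection works for `from_pretrained` calls in Transformers and Datasets, since they route downloads through `huggingface_hub`.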

Core Platform Architecture#

Developer
   │
HF SDK / Git Clients
   │
HF Compatibility Layer (REST API)
   │
Knot Node
 ┌─────────────┬───────────────┬──────────────┐
 │             │               │              │
Model Hub   Dataset Hub    Code Repos      AI Runtime
 │             │               │              │
Transformers  Datasets        Git           TGI / Candle
 │             │               │              │
SafeTensors   Apache Arrow   CI/CD        Gradio Apps

Unified Repository Model#

| Repo Type | Content |
| --- | --- |
| Code Repo | Source code, scripts, workflows |
| Model Repo | HF models (.safetensors), config, tokenizer |
| Dataset Repo | HF datasets (.arrow / streaming) |

Example repository structure:

repo/
 ├── model/
 │   ├── config.json
 │   ├── tokenizer.json
 │   └── model.safetensors
 │
 ├── dataset/
 │   └── data.arrow
 │
 ├── code/
 │   └── training scripts
 │
 └── README.md
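A repo layout like the one above is easy to validate mechanically. A small sketch (a hypothetical helper, not part of Tangled) that checks the `model/` directory for the files HF loaders expect:

```python
# Hypothetical validation helper for the unified repo layout shown above:
# reports which of the expected model files are missing from repo/model/.
from pathlib import Path

REQUIRED_MODEL_FILES = ["config.json", "tokenizer.json", "model.safetensors"]

def missing_model_files(repo_root: str) -> list[str]:
    """Return the expected model files that are absent under <repo>/model."""
    model_dir = Path(repo_root) / "model"
    return [f for f in REQUIRED_MODEL_FILES if not (model_dir / f).is_file()]
```

An empty return value means the model repo is complete enough for a standard `from_pretrained`-style load.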

Model Version Control (Oxen-based)#

AI models require advanced version control for large binaries and delta tracking.

  • Integrate Oxen for large-binary version control.

  • Oxen enables:

    • Binary delta tracking
    • Dataset diffs
    • Efficient cloning
    • Fine-tuning version management (LoRA / QLoRA)

Example workflow:

Base model
   │
commit v1
   │
Fine-tuned model
   │
commit v2

Only modified chunks are stored, drastically reducing storage footprint.
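The chunk-reuse idea behind this workflow can be shown with a toy sketch (this is an illustration of the principle, not Oxen's actual implementation): each commit is just a list of chunk hashes, the store keeps one copy per hash, so an unchanged base layer costs nothing in a fine-tune commit.

```python
# Toy sketch of chunk-level commits: a commit records only chunk hashes,
# and the store keeps one blob per hash, so a fine-tune that changes one
# chunk adds one new blob rather than a full copy of the model.
import hashlib

CHUNK = 4  # tiny chunk size for illustration only

store: dict[str, bytes] = {}  # chunk hash -> bytes, stored once

def commit(data: bytes) -> list[str]:
    """Chunk the payload and return its manifest of chunk hashes."""
    manifest = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # dedup: only new chunks are stored
        manifest.append(h)
    return manifest

v1 = commit(b"AAAABBBBCCCC")  # "base model": 3 chunks
v2 = commit(b"AAAABBBBDDDD")  # "fine-tune": only the last chunk differs
# store now holds 4 unique chunks instead of 6; the shared prefix is reused
```

The same principle scales from 12-byte strings to multi-gigabyte safetensors files: storage grows with the delta, not the artifact size.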


PB-Scale Model Storage Architecture#

Content-Addressable Storage (CAS)#

  • All models and datasets are chunked and stored using hash-based identifiers
  • Deduplication is applied to shared layers and dataset blocks

model file
   │
chunked
   │
hash storage
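The chunk → hash → storage pipeline above amounts to a put/get round trip against a content-addressable store. A minimal in-memory sketch (illustrative only; a real backend would persist blobs to S3/MinIO):

```python
# Minimal content-addressable store: put() splits a payload into chunks
# keyed by SHA-256 and returns a manifest; get() reassembles the payload
# from that manifest. Identical chunks are stored exactly once.
import hashlib

class CAS:
    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes, chunk_size: int = 8) -> list[str]:
        manifest = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.blobs[h] = chunk
            manifest.append(h)
        return manifest

    def get(self, manifest: list[str]) -> bytes:
        return b"".join(self.blobs[h] for h in manifest)
```

Because a file is only a manifest of hashes, two repos referencing the same model layer share the underlying blobs automatically.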

Xet-style Deduplication#

  • Use Content Defined Chunking (CDC)
  • Shared model layers across variants are stored once
  • LoRA or quantized variants reuse base layers

Base LLM
 120GB

Fine-tuned variant
 +2GB delta
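The CDC step can be sketched with a toy rolling condition (a deliberately simplified stand-in for production algorithms such as FastCDC): a chunk boundary is declared wherever the low bits of a rolling value over recent bytes are zero, so boundaries depend on content rather than on fixed offsets.

```python
# Toy Content-Defined Chunking: cut a chunk wherever the low bits of a
# rolling value (a shift-xor over recent bytes, not a production hash)
# are all zero, subject to a minimum chunk size. Content-derived cut
# points are what let shifted data re-align on the same chunk hashes.
def cdc_chunks(data: bytes, min_size: int = 8, mask: int = 0x3F) -> list[bytes]:
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
        if i - start + 1 >= min_size and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])  # boundary hit: emit chunk
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

With `mask = 0x3F` the expected chunk size is on the order of 64 bytes; real systems use kilobyte-to-megabyte targets and a windowed hash such as Gear or Rabin.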

Object Storage Backend#

  • S3-compatible storage (MinIO)
  • Horizontal scaling, high throughput
  • Distributed replication

Storage Layout:

Object Storage (S3 / MinIO)

models/
   chunk_hash_1
   chunk_hash_2

datasets/
   arrow_blocks

metadata/
   repo_index
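One common way to realise the layout above on S3-compatible storage is to fan chunk hashes out into short prefixes, which keeps listings and key distribution manageable. A hypothetical key-naming helper (the exact scheme is an assumption, not a Tangled decision):

```python
# Hypothetical object-key layout for the S3/MinIO backend sketched above:
# chunk hashes are fanned out into two-level hex prefixes, a widely used
# pattern for spreading keys across prefixes in object stores.
def chunk_key(kind: str, hex_hash: str) -> str:
    """Map a chunk hash to its object key, e.g. models/ab/cd/abcd..."""
    if kind not in {"models", "datasets"}:
        raise ValueError(f"unknown storage kind: {kind}")
    return f"{kind}/{hex_hash[:2]}/{hex_hash[2:4]}/{hex_hash}"

print(chunk_key("models", "abcdef0123"))  # models/ab/cd/abcdef0123
```

The `metadata/repo_index` entries would then map each repo revision to the manifest of chunk keys it references.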

Knot Node AI Extension Architecture#

Each Knot node supports modular AI extensions:

  • Model Service: stores, versions, and serves HF models
  • Dataset Service: streams and versions datasets
  • Inference Service: LLM and diffusion inference (TGI / Candle)
  • Application Service: hosts Gradio/Streamlit demos
  • Training Service: orchestrates distributed training via Accelerate / PEFT / TRL
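The modular-extension idea above can be sketched as a small service registry (a hypothetical API for illustration; the real Knot extension interface is not specified here): each service registers under a name and the node dispatches requests to it.

```python
# Sketch of a per-node extension registry: services from the list above
# register under a name, and the node dispatches requests by service name.
# KnotNode and its methods are hypothetical illustrations, not Tangled APIs.
from typing import Callable

class KnotNode:
    def __init__(self) -> None:
        self.services: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        self.services[name] = handler

    def dispatch(self, name: str, request: str) -> str:
        if name not in self.services:
            raise KeyError(f"no such service: {name}")
        return self.services[name](request)

node = KnotNode()
node.register("model", lambda repo: f"serving {repo}")
node.dispatch("model", "acme/llama-ft")  # -> "serving acme/llama-ft"
```

A node operator could then enable only the services their hardware supports, e.g. skipping the training service on storage-only nodes.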


Phased Implementation Roadmap#

Phase 1 — Hugging Face Compatibility#

  • HF Hub API layer
  • Model repositories
  • Dataset repositories
  • Basic inference

Phase 2 — AI Development Platform#

  • Distributed inference (TGI / Candle)
  • AI application hosting (Gradio)
  • Training pipelines (Accelerate / PEFT / TRL)
  • Oxen-based model & dataset versioning

Phase 3 — Global AI Infrastructure#

  • Federated Knot nodes
  • Global dataset distribution
  • Petabyte-scale storage with Xet deduplication
  • Enterprise-scale deployment

Expected Impact#

Tangled becomes a sovereign, fully integrated AI platform:

  • GitHub-style collaboration
  • HF ecosystem compatibility
  • PB-scale model storage with deduplication
  • Oxen-based advanced version control
  • AI inference, demo hosting, and distributed training
  • Federation and global scaling

Conclusion#

This RFC establishes Tangled as a next-generation AI infrastructure platform, bridging the best of:

  • GitHub (developer workflow)
  • Hugging Face Hub (AI models, datasets, training, inference)

All while maintaining open-source fidelity, sovereignty, and scalable storage/compute infrastructure.

S-tier slop. Thanks.

Huge fan of this writing, nice to have an SSS+ tier example of what not to do.

AT URI
at://did:plc:xv6q6sr6ellw65fsts4s4ys6/sh.tangled.repo.issue/3mgum4lcz2u22