A better Rust ATProto crate
# Lexicon Codegen Plan

## Goal
Generate idiomatic Rust types from AT Protocol lexicon schemas with minimal nesting/indirection.

## Existing Infrastructure

### Already Implemented
- **lexicon.rs**: Complete lexicon parsing types (`LexiconDoc`, `LexUserType`, `LexObject`, etc)
- **fs.rs**: Directory walking for finding `.json` lexicon files
- **schema.rs**: `find_ref_unions()` - collects union fields from a single lexicon
- **output.rs**: Partial - has string type mapping and doc comment generation

### Attribute Macros
- `#[lexicon]` - adds `extra_data` field to structs
- `#[open_union]` - adds `Unknown(Data<'s>)` variant to enums

## Design Decisions

### Module/File Structure
- NSID `app.bsky.feed.post` → `app_bsky/feed/post.rs`
- Flat module names (no `app::bsky`, just `app_bsky`)
- Parent modules: `app_bsky/feed.rs` with `pub mod post;`

### Type Naming
- **Main def**: Use last segment of NSID
  - `app.bsky.feed.post#main` → `Post`
- **Other defs**: Pascal-case the def name
  - `replyRef` → `ReplyRef`
- **Union variants**: Use last segment of ref NSID
  - `app.bsky.embed.images` → `Images`
  - Collisions resolved by module path, not type name
- **No proliferation of `Main` types** like atrium has

### Type Generation

#### Records (lexRecord)
```rust
#[lexicon]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(rename_all = "camelCase")]
pub struct Post<'s> {
    /// Client-declared timestamp...
    pub created_at: Datetime,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embed: Option<RecordEmbed<'s>>,
    pub text: CowStr<'s>,
}
```

#### Objects (lexObject)
Same as records but without `#[lexicon]` if inline/not a top-level def.
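
Stepping back to the Module/File Structure and Type Naming rules above, the mapping from an NSID or ref to a file path and type name can be sketched with plain string handling. This is only an illustration under assumptions: the helper names `nsid_to_path`, `pascal_case`, and `type_name` are hypothetical and not part of the existing crate, and the sketch does not handle snake-casing of camelCase segments or characters that are invalid in Rust identifiers.

```rust
/// Hypothetical helper: map an NSID such as "app.bsky.feed.post" to the
/// generated file path "app_bsky/feed/post.rs".
/// Returns None when the NSID has fewer than three segments or an empty one.
fn nsid_to_path(nsid: &str) -> Option<String> {
    let segments: Vec<&str> = nsid.split('.').collect();
    if segments.len() < 3 || segments.iter().any(|s| s.is_empty()) {
        return None;
    }
    // The first two segments collapse into one flat module name (app_bsky).
    let root = format!("{}_{}", segments[0], segments[1]);
    let rest = segments[2..].join("/");
    Some(format!("{root}/{rest}.rs"))
}

/// Hypothetical helper: upper-case the first character,
/// e.g. "replyRef" -> "ReplyRef", "post" -> "Post".
fn pascal_case(name: &str) -> String {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

/// Hypothetical helper: type name for a ref with or without a fragment.
/// A missing fragment is treated the same as "#main".
fn type_name(reference: &str) -> String {
    let (nsid, def) = match reference.split_once('#') {
        Some((nsid, def)) => (nsid, def),
        None => (reference, "main"),
    };
    if def == "main" {
        // Main def: use the last NSID segment, so no `Main` type is emitted.
        pascal_case(nsid.rsplit('.').next().unwrap_or(nsid))
    } else {
        pascal_case(def)
    }
}

fn main() {
    println!("{:?}", nsid_to_path("app.bsky.feed.post"));
    println!("{}", type_name("app.bsky.feed.post#main"));
    println!("{}", type_name("app.bsky.feed.post#replyRef"));
    println!("{}", type_name("app.bsky.embed.images"));
}
```

Treating a ref without a fragment as `#main` keeps record names and union variant names on the same code path, which is one way to get `Images` from `app.bsky.embed.images` without a separate rule.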

#### Unions (lexRefUnion)
```rust
#[open_union]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(tag = "$type")]
pub enum RecordEmbed<'s> {
    #[serde(rename = "app.bsky.embed.images")]
    Images(Box<jacquard_api::app_bsky::embed::Images<'s>>),
    #[serde(rename = "app.bsky.embed.video")]
    Video(Box<jacquard_api::app_bsky::embed::Video<'s>>),
}
```

- Use `Box<T>` for all variants (handles circular refs)
- `#[open_union]` adds `Unknown(Data<'s>)` catch-all

#### Queries (lexXrpcQuery)
```rust
pub struct GetAuthorFeedParams<'s> {
    pub actor: AtIdentifier<'s>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'s>>,
}

pub struct GetAuthorFeedOutput<'s> {
    pub cursor: Option<CowStr<'s>>,
    pub feed: Vec<FeedViewPost<'s>>,
}
```

- Flat params/output structs
- No nesting like `Input { params: {...} }`

#### Procedures (lexXrpcProcedure)
Same as queries but with both `Input` and `Output` structs.

### Field Handling

#### Optional Fields
- Fields not in `required: []` → `Option<T>`
- Add `#[serde(skip_serializing_if = "Option::is_none")]`

#### Lifetimes
- All types have `'a` lifetime for borrowing from input
- `#[serde(borrow)]` where needed for zero-copy

#### Type Mapping
- `LexString` with format → specific types (`Datetime`, `Did`, etc)
- `LexString` without format → `CowStr<'a>`
- `LexInteger` → `i64`
- `LexBoolean` → `bool`
- `LexBytes` → `Bytes`
- `LexCidLink` → `CidLink<'a>`
- `LexBlob` → `Blob<'a>`
- `LexRef` → resolve to actual type path
- `LexRefUnion` → generate enum
- `LexArray` → `Vec<T>`
- `LexUnknown` → `Data<'a>`

### Reference Resolution

#### Known Refs
- Check corpus for ref existence
- `#ref: "app.bsky.embed.images"` → `jacquard_api::app_bsky::embed::Images<'a>`
- Handle fragments: `#ref: "com.example.foo#bar"` → `jacquard_api::com_example::foo::Bar<'a>`

#### Unknown Refs
- **In struct fields**: use `Data<'a>` as fallback type
- **In union variants**: handled by `Unknown(Data<'a>)` variant from `#[open_union]`
- Optional: log warnings for missing refs

## Implementation Phases

### Phase 1: Corpus Loading & Registry
**Goal**: Load all lexicons into memory for ref resolution

**Tasks**:
1. Create `LexiconCorpus` struct
   - `HashMap<SmolStr, LexiconDoc<'static>>` - NSID → doc
   - Methods: `load_from_dir()`, `get()`, `resolve_ref()`
2. Load all `.json` files from lexicon directory
3. Parse into `LexiconDoc` and insert into registry
4. Handle fragments in refs (`nsid#def`)

**Output**: Corpus registry that can resolve any ref

### Phase 2: Ref Analysis & Union Collection
**Goal**: Build complete picture of what refs exist and what unions need

**Tasks**:
1. Extend `find_ref_unions()` to work across entire corpus
2. For each union, collect all refs and check existence
3. Build `UnionRegistry`:
   - Union name → list of (known refs, unknown refs)
4. Detect circular refs (optional - or just Box everything)

**Output**: Complete list of unions to generate with their variants

### Phase 3: Code Generation - Core Types
**Goal**: Generate Rust code for individual types

**Tasks**:
1. Implement type generators:
   - `generate_struct()` for records/objects
   - `generate_enum()` for unions
   - `generate_field()` for object properties
   - `generate_type()` for primitives/refs
2. Handle optional fields (`required` list)
3. Add doc comments from `description`
4. Apply `#[lexicon]` / `#[open_union]` macros
5. Add serde attributes

**Output**: `TokenStream` for each type

### Phase 4: Module Organization
**Goal**: Organize generated types into module hierarchy

**Tasks**:
1. Parse NSID into components: `["app", "bsky", "feed", "post"]`
2. Determine file paths: `app_bsky/feed/post.rs`
3. Generate module files: `app_bsky/feed.rs` with `pub mod post;`
4. Generate root module: `app_bsky.rs`
5. Handle re-exports if needed

**Output**: File path → generated code mapping

### Phase 5: File Writing
**Goal**: Write generated code to filesystem

**Tasks**:
1. Format code with `prettyplease`
2. Create directory structure
3. Write module files
4. Write type files
5. Optional: run `rustfmt`

**Output**: Generated code on disk

### Phase 6: Testing & Validation
**Goal**: Ensure generated code compiles and works

**Tasks**:
1. Generate code for test lexicons
2. Compile generated code
3. Test serialization/deserialization
4. Test union variant matching
5. Test extra_data capture

## Edge Cases & Considerations

### Circular References
- **Simple approach**: Union variants always use `Box<T>` → handles all circular refs
- **Alternative**: DFS cycle detection to only Box when needed
  - Track visited refs and recursion stack
  - If ref appears in rec_stack → cycle detected
  - Algorithm:
    ```rust
    use std::collections::HashSet;

    // Sketch only: `resolve_ref()` is the Phase 1 corpus lookup, and
    // `collect_refs_from_def()` is assumed to return the refs (as `String`s)
    // that a def mentions.
    fn has_cycle(
        corpus: &LexiconCorpus,
        start_ref: &str,
        visited: &mut HashSet<String>,
        rec_stack: &mut HashSet<String>,
    ) -> bool {
        visited.insert(start_ref.to_owned());
        rec_stack.insert(start_ref.to_owned());

        for child_ref in collect_refs_from_def(corpus.resolve_ref(start_ref)) {
            if !visited.contains(&child_ref) {
                if has_cycle(corpus, &child_ref, visited, rec_stack) {
                    return true;
                }
            } else if rec_stack.contains(&child_ref) {
                return true; // back edge = cycle
            }
        }

        rec_stack.remove(start_ref);
        false
    }
    ```
  - Only box variants that participate in cycles
- **Recommendation**: Start with simple (always Box), optimize later if needed

### Name Collisions
- Multiple types with same name in different lexicons
- Module path disambiguates: `app_bsky::feed::Post` vs `com_example::feed::Post`

### Unknown Refs
- Fallback to `Data<'s>` in struct fields
- Caught by `Unknown` variant in unions
- Warn during generation

### Inline Defs
- Nested objects/unions in same lexicon
- Generate as separate types in same file
- Keep names scoped to parent (e.g., `PostReplyRef`)

### Arrays
- `Vec<T>` for arrays
- Handle nested unions in arrays

### Tokens
- Simple marker types
- Generate as unit structs or type aliases?

## Traits for Generated Types

### Collection Trait (Records)
Records implement the existing `Collection` trait from jacquard-common:

```rust
pub struct Post<'a> {
    // ... fields
}

impl Collection for Post<'_> {
    const NSID: &'static str = "app.bsky.feed.post";
    type Record = Post<'static>;
}
```

### XrpcRequest Trait (Queries/Procedures)
New trait for XRPC endpoints:

```rust
pub trait XrpcRequest<'x> {
    /// The NSID for this XRPC method
    const NSID: &'static str;

    /// HTTP method (GET for queries, POST for procedures)
    const METHOD: XrpcMethod;

    /// Input encoding (MIME type, e.g., "application/json")
    /// None for queries (no body)
    const INPUT_ENCODING: Option<&'static str>;

    /// Output encoding (MIME type)
    const OUTPUT_ENCODING: &'static str;

    /// Request parameters type (query params or body)
    type Params: Serialize;

    /// Response output type
    type Output: Deserialize<'x>;
}

pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}
```

**Generated implementation:**
```rust
pub struct GetAuthorFeedParams<'a> {
    pub actor: AtIdentifier<'a>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'a>>,
}

pub struct GetAuthorFeedOutput<'a> {
    pub cursor: Option<CowStr<'a>>,
    pub feed: Vec<FeedViewPost<'a>>,
}

// The trait's lifetime parameter has to be named in the impl;
// 'static matches the owned `Output` type used here.
impl XrpcRequest<'static> for GetAuthorFeedParams<'_> {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";

    type Params = Self;
    type Output = GetAuthorFeedOutput<'static>;
}
```

**Encoding variations:**
- Most procedures: `"application/json"` for input/output
- Blob uploads: `"*/*"` or specific MIME type for input
- CAR files: `"application/vnd.ipld.car"` for repo operations
- Read from lexicon's `input.encoding` and `output.encoding` fields

**Trait benefits:**
- Allows monomorphization (static dispatch) for performance
- Client code can be generic over `impl XrpcRequest`
- Caveat: as written the trait cannot be used as `dyn XrpcRequest`, because a trait with associated consts is not object-safe; dynamic dispatch would need methods instead of consts, or a separate object-safe trait

### Subscriptions
WebSocket streams - defer for now. Will need separate trait with message types.

## Open Questions

1. **Validation**: Generate runtime validation (min/max length, regex, etc)?
2. **Tokens**: How to represent token types?
3. **Errors**: How to handle codegen errors (missing refs, invalid schemas)?
4. **Incremental**: Support incremental codegen (only changed lexicons)?
5. **Formatting**: Always run rustfmt or rely on prettyplease?
6. **XrpcRequest location**: Should trait live in jacquard-common or separate jacquard-xrpc crate?

## Success Criteria

- [ ] Generates code for all official AT Protocol lexicons
- [ ] Generated code compiles without errors
- [ ] No `Main` proliferation
- [ ] Union variants have readable names
- [ ] Unknown refs handled gracefully
- [ ] `#[lexicon]` and `#[open_union]` applied correctly
- [ ] Serialization round-trips correctly
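
## Appendix: Generic Client Sketch

As an illustration of the "generic over `impl XrpcRequest`" idea from the XrpcRequest Trait section, here is a std-only sketch of client code that reads everything it needs from the trait's associated consts. It rests on several assumptions: the trait is deliberately simplified (no lifetime parameter, no serde-bound `Params`/`Output` types) so that it compiles without dependencies, the structs have their fields elided, and the helper names `request_line` and `content_type` are hypothetical. The `/xrpc/<NSID>` path follows the XRPC HTTP convention.

```rust
/// HTTP verb implied by the lexicon type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}

impl XrpcMethod {
    pub const fn http_verb(self) -> &'static str {
        match self {
            XrpcMethod::Query => "GET",
            XrpcMethod::Procedure => "POST",
        }
    }
}

/// Simplified stand-in for the planned trait (assumption: serde bounds,
/// associated types, and the lifetime parameter are left out here).
pub trait XrpcRequest {
    const NSID: &'static str;
    const METHOD: XrpcMethod;
    const INPUT_ENCODING: Option<&'static str>;
    const OUTPUT_ENCODING: &'static str;
}

pub struct GetAuthorFeedParams; // fields elided

impl XrpcRequest for GetAuthorFeedParams {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";
}

pub struct UploadBlobInput; // fields elided

impl XrpcRequest for UploadBlobInput {
    const NSID: &'static str = "com.atproto.repo.uploadBlob";
    const METHOD: XrpcMethod = XrpcMethod::Procedure;
    const INPUT_ENCODING: Option<&'static str> = Some("*/*"); // raw blob bytes
    const OUTPUT_ENCODING: &'static str = "application/json";
}

/// Hypothetical helper: verb and path for any request type.
/// Monomorphized per `R`, so there is no dynamic dispatch.
pub fn request_line<R: XrpcRequest>() -> String {
    format!("{} /xrpc/{}", R::METHOD.http_verb(), R::NSID)
}

/// Hypothetical helper: Content-Type to send, if the request has a body.
pub fn content_type<R: XrpcRequest>() -> Option<&'static str> {
    R::INPUT_ENCODING
}

fn main() {
    println!("{}", request_line::<GetAuthorFeedParams>());
    println!("{}", request_line::<UploadBlobInput>());
    println!("{:?}", content_type::<UploadBlobInput>());
}
```

Because the NSID, method, and encodings are consts, a client written this way needs no per-endpoint code; the generated `impl` blocks carry all endpoint-specific information.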