Lexicon

Lexicon is a schema definition language used to describe atproto records, HTTP
endpoints (XRPC), and event stream messages. It builds on top of the atproto
Data Model.

The schema language is similar to JSON Schema and OpenAPI, but includes some
atproto-specific features and semantics.

This specification describes version 1 of the Lexicon definition language.

Overview of Types

| Lexicon Type | Data Model Type | Category  |
| ------------ | --------------- | --------- |
| null         | Null            | concrete  |
| boolean      | Boolean         | concrete  |
| integer      | Integer         | concrete  |
| string       | String          | concrete  |
| bytes        | Bytes           | concrete  |
| cid-link     | Link            | concrete  |
| blob         | Blob            | concrete  |
| array        | Array           | container |
| object       | Object          | container |
| params       |                 | container |
| token        |                 | meta      |
| ref          |                 | meta      |
| union        |                 | meta      |
| unknown      |                 | meta      |
| record       |                 | primary   |
| query        |                 | primary   |
| procedure    |                 | primary   |
| subscription |                 | primary   |

Lexicon Files

Lexicons are JSON files associated with a single NSID. A file contains one or
more definitions, each with a distinct short name. A definition with the name
main optionally describes the "primary" definition for the entire file. A
Lexicon with zero definitions is invalid.

A Lexicon JSON file is an object with the following fields:

- lexicon (integer, required): indicates Lexicon language version. In this
  version, a fixed value of 1
- id (string, required): the NSID of the Lexicon
- description (string, optional): short overview of the Lexicon, usually one or
  two sentences
- defs (map of strings-to-objects, required): set of definitions, each with a
  distinct name (key)

Schema definitions under defs all have a type field to distinguish their type.
A file can have at most one definition with one of the "primary" types. Primary
types should always have the name main. It is possible for main to describe a
non-primary type.

References to specific definitions within a Lexicon use fragment syntax, like
com.example.defs#someView. If a main definition exists, it can be referenced
without a fragment, just using the NSID.
For references in the $type fields in data objects themselves (eg, records or
contents of a union), this is a "must" (use of a #main suffix is invalid). For
example, com.example.record not com.example.record#main.

Related Lexicons are often grouped together in the NSID hierarchy. As a
convention, any definitions used by multiple Lexicons are defined in a dedicated
*.defs Lexicon (eg, com.atproto.server.defs) within the group. A *.defs Lexicon
should generally not include a definition named main, though it is not strictly
invalid to do so.

Primary Type Definitions

The primary types are:

- query: describes an XRPC Query (HTTP GET)
- procedure: describes an XRPC Procedure (HTTP POST)
- subscription: Event Stream (WebSocket)
- record: describes an object that can be stored in a repository record

Each primary definition schema object includes these fields:

- type (string, required): the type value (eg, record for records)
- description (string, optional): short, usually only a sentence or two

Record

Type-specific fields:

- key (string, required): specifies the Record Key type
- record (object, required): a schema definition with type object, which
  specifies this type of record

Query and Procedure (HTTP API)

Type-specific fields:

- parameters (object, optional): a schema definition with type params,
  describing the HTTP query parameters for this endpoint
- output (object, optional): describes the HTTP response body
  - description (string, optional): short description
  - encoding (string, required): MIME type for body contents. Use
    application/json for JSON responses.
  - schema (object, optional): schema definition, either an object, a ref, or a
    union of refs. Used to describe JSON encoded responses, though schema is
    optional even for JSON responses.
- input (object, optional, only for procedure): describes HTTP request body
  schema, with the same format as the output field
- errors (array of objects, optional): set of string error codes which might be
  returned
  - name (string, required): short name for the error type, with no whitespace
  - description (string, optional): short description, one or two sentences

Subscription (Event Stream)

Type-specific fields:

- parameters (object, optional): same as Query and Procedure
- message (object, optional): specifies what messages can be sent
  - description (string, optional): short description
  - schema (object, required): schema definition, which must be a union of refs
- errors (array of objects, optional): same as Query and Procedure

Subscription schemas (referenced by the schema field under message) must be a
union of refs, not an object type.

Field Type Definitions

As with the primary definitions, every schema object includes these fields:

- type (string, required): fixed value for each type
- description (string, optional): short, usually only a sentence or two

null

No additional fields.

boolean

Type-specific fields:

- default (boolean, optional): a default value for this field
- const (boolean, optional): a fixed (constant) value for this field

When included as an HTTP query parameter, should be rendered as true or false
(no quotes).

integer

A signed integer number.

Type-specific fields:

- minimum (integer, optional): minimum acceptable value
- maximum (integer, optional): maximum acceptable value
- enum (array of integers, optional): a closed set of allowed values
- default (integer, optional): a default value for this field
- const (integer, optional): a fixed (constant) value for this field

string

Type-specific fields:

- format (string, optional): string format restriction
- maxLength (integer, optional): maximum length of value, in UTF-8 bytes
- minLength (integer, optional): minimum length of value, in UTF-8 bytes
- maxGraphemes (integer, optional): maximum length of value, counted as Unicode
  Grapheme Clusters
- minGraphemes (integer, optional): minimum length of value, counted as Unicode
  Grapheme Clusters
- knownValues (array of strings, optional): a set of suggested or common values
  for this field. Values are not limited to this set (aka, not a closed enum).
- enum (array of strings, optional): a closed set of allowed values
- default (string, optional): a default value for this field
- const (string, optional): a fixed (constant) value for this field

Strings are Unicode. For non-Unicode encodings, use bytes instead. The basic
minLength/maxLength validation constraints are counted as UTF-8 bytes. Note that
JavaScript stores strings with UTF-16 by default, and it is necessary to
re-encode to count accurately. The minGraphemes/maxGraphemes validation
constraints work with Grapheme Clusters, which have a complex technical and
linguistic definition, but loosely correspond to "distinct visual characters"
like Latin letters, CJK characters, punctuation, digits, or emoji (which might
comprise multiple Unicode codepoints and many UTF-8 bytes).

format constrains the string format and provides additional semantic context.
Refer to the Data Model specification for the available format types and their
definitions.

const and default are mutually exclusive.
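
The gap between the UTF-8 byte counts used by minLength/maxLength and the
UTF-16 lengths that JavaScript reports can be illustrated with a minimal Python
sketch. The function names (utf8_length, utf16_length, check_string_length) are
hypothetical and not part of any atproto library. Grapheme counting is left out
because Python's standard library has no grapheme cluster segmentation.

```python
def utf8_length(value: str) -> int:
    """Length in UTF-8 bytes, the unit used by minLength/maxLength."""
    return len(value.encode("utf-8"))


def utf16_length(value: str) -> int:
    """Length in UTF-16 code units, which is what a JavaScript string length reports."""
    return len(value.encode("utf-16-le")) // 2


def check_string_length(value: str, schema: dict) -> list[str]:
    """Return a list of minLength/maxLength violations (empty if there are none).

    `schema` is assumed to be a parsed Lexicon string definition (a dict).
    """
    problems = []
    size = utf8_length(value)
    if "maxLength" in schema and size > schema["maxLength"]:
        problems.append(f"too long: {size} > {schema['maxLength']} bytes")
    if "minLength" in schema and size < schema["minLength"]:
        problems.append(f"too short: {size} < {schema['minLength']} bytes")
    return problems
```

A string of three thumbs-up emoji is 6 UTF-16 code units but 12 UTF-8 bytes, so
it fails a maxLength of 10 even though a naive JavaScript length check would
pass it.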

bytes

Type-specific fields:

- minLength (integer, optional): minimum size of value, as raw bytes with no
  encoding
- maxLength (integer, optional): maximum size of value, as raw bytes with no
  encoding

cid-link

No type-specific fields.

See Data Model spec for CID restrictions.

array

Type-specific fields:

- items (object, required): describes the schema elements of this array
- minLength (integer, optional): minimum count of elements in array
- maxLength (integer, optional): maximum count of elements in array

In theory arrays have homogeneous types (meaning every element has the same
type). However, with union types this restriction is meaningless, so
implementations can not assume that all the elements have the same type.

object

A generic object schema which can be nested inside other definitions by
reference.

Type-specific fields:

- properties (map of strings-to-objects, required): defines the properties
  (fields) by name, each with their own schema
- required (array of strings, optional): indicates which properties are required
- nullable (array of strings, optional): indicates which properties can have
  null as a value

As described in the data model specification, there is a semantic difference in
data between omitting a field; including the field with the value null; and
including the field with a "false-y" value (false, 0, empty array, etc).

blob

Type-specific fields:

- accept (array of strings, optional): list of acceptable MIME types. Each may
  end in * as a glob pattern (eg, image/*). Use */* to indicate that any MIME
  type is accepted.
- maxSize (integer, optional): maximum size in bytes

params

This is a limited-scope type which is only ever used for the parameters field on
query, procedure, and subscription primary types. These map to HTTP query
parameters.

Type-specific fields:

- required (array of strings, optional): same semantics as field on object
- properties: similar to properties under object, but can only include the types
  boolean, integer, string, and unknown; or an array of one of these types

Note that unlike object, there is no nullable field on params.

token

Tokens are empty data values which exist only to be referenced by name. They are
used to define a set of values with specific meanings. The description field
should clarify the meaning of the token. Tokens encode as string data, with the
string being the fully-qualified reference to the token itself (NSID followed by
an optional fragment).

Tokens are similar to the concept of a "symbol" in some programming languages,
distinct from strings, variables, built-in keywords, or other identifiers.

For example, tokens could be defined to represent the state of an entity (in a
state machine), or to enumerate a list of categories.

No type-specific fields.

ref

Type-specific fields:

- ref (string, required): reference to another schema definition

Refs are a mechanism for re-using a schema definition in multiple places. The
ref string can be a global reference to a Lexicon type definition (an NSID,
optionally with a #-delimited name indicating a definition other than main), or
can indicate a local definition within the same Lexicon file (a # followed by a
name).

union

Type-specific fields:

- refs (array of strings, required): references to schema definitions
- closed (boolean, optional): indicates if a union is "open" or "closed".
  Defaults to false (open union)

Unions represent that multiple possible types could be present at this location
in the schema. The references follow the same syntax as ref, allowing references
to both global and local schema definitions.
Actual data will validate against a single specific type: the union does not
combine fields from multiple schemas, or define a new hybrid data type. The
different types are referred to as variants.

By default unions are "open", meaning that future revisions of the schema could
add more types to the list of refs (though can not remove types). This means
that implementations should be permissive when validating, in case they do not
have the most recent version of the Lexicon. The closed flag (boolean) can
indicate that the set of types is fixed and can not be extended in the future.

A union schema definition with no refs is allowed and similar to unknown, as
long as the closed flag is false (the default). The main difference is that the
data would be required to have the $type field. An empty refs list with closed
set to true is an invalid schema.

The schema definitions pointed to by a union are objects or types with a clear
mapping to an object, like a record. All the variants must be represented by a
CBOR map (or JSON Object) and must include a $type field indicating the variant
type. Because the data must be an object, unions can not reference token (which
would correspond to string data).

unknown

Indicates that any data object could appear at this location, with no specific
validation. The top-level data must be an object (not a string, boolean, etc).
As with all other data types, the value null is not allowed unless the field is
specifically marked as nullable.

The data object may contain a $type field indicating the schema of the data, but
this is not currently required. The top-level data object must not have the
structure of a compound data type, like blob ($type: blob) or CID link ($link).

The (nested) contents of the data object must still be valid under the atproto
data model. For example, it should not contain floats.
Nested compound types like blobs and CID links should be validated and
transformed as expected.

Lexicon designers are strongly recommended to not use unknown fields in record
objects for now.

No type-specific fields.

String Formats

Strings can optionally be constrained to one of the following format types:

- at-identifier: either a Handle or a DID, details described below
- at-uri: AT-URI
- cid: CID in string format, details specified in Data Model
- datetime: timestamp, details specified below
- did: generic DID Identifier
- handle: Handle Identifier
- nsid: Namespaced Identifier
- tid: Timestamp Identifier (TID)
- record-key: Record Key, matching the general syntax ("any")
- uri: generic URI, details specified below
- language: language code, details specified below

For the various identifier formats, when doing Lexicon schema validation the
most expansive identifier syntax format should be permitted. Problems with
identifiers which do pass basic syntax validation should be reported as
application errors, not lexicon data validation errors. For example, data with
any kind of DID in a did format string field should pass Lexicon validation,
with unsupported DID methods being raised separately as an application error.

at-identifier

A string type which is either a DID (type: did) or a handle (handle). Mostly
used in XRPC query parameters. It is unambiguous whether an at-identifier is a
handle or a DID because a DID always starts with did:, and the colon character
(:) is not allowed in handles.

datetime

Full-precision date and time, with timezone information.

This format is intended for use with computer-generated timestamps in the modern
computing era (eg, after the UNIX epoch). If you need to represent historical or
ancient events, ambiguity, or far-future times, a different format is probably
more appropriate.
Datetimes before the Current Era (year zero) are specifically disallowed.

Datetime format standards are notoriously flexible and overlapping. Datetime
strings in atproto should meet the intersecting requirements of the RFC 3339,
ISO 8601, and WHATWG HTML datetime standards.

The character separating "date" and "time" parts must be an upper-case T.

Timezone specification is required. It is strongly preferred to use the UTC
timezone, and to represent the timezone with a simple capital Z suffix
(lower-case is not allowed). While hour/minute suffix syntax (like +01:00 or
-10:30) is supported, "negative zero" (-00:00) is specifically disallowed (by
ISO 8601).

Whole seconds precision is required, and arbitrary fractional precision digits
are allowed. Best practice is to use at least millisecond precision, and to pad
with zeros to the generated precision (eg, trailing :12.340Z instead of
:12.34Z). Not all datetime formatting libraries support trailing zero
formatting. Both millisecond and microsecond precision have reasonable
cross-language support; nanosecond precision does not.

Implementations should be aware when round-tripping records containing datetimes
of two ambiguities: loss-of-precision, and ambiguity with trailing fractional
second zeros. If de-serializing Lexicon records into native types, and then
re-serializing, the string representation may not be the same, which could
result in broken hash references, sanity check failures, or repository update
churn. A safer thing to do is to deserialize the datetime as a simple string,
which ensures round-trip re-serialization.

Implementations "should" validate that the semantics of the datetime are valid.
For example, a month or day 00 is invalid.
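
The syntax and semantic rules above can be sketched as a small validator. This
is an illustrative Python sketch, not a reference implementation:
is_valid_datetime is a hypothetical name, and the range check on timezone
offsets is an assumption. Because it leans on Python's datetime type for the
semantic check, it also rejects year 0000 and leap seconds (:60), which may be
stricter than intended.

```python
import re
from datetime import datetime

# Syntax only: four-digit year, upper-case "T", whole seconds, optional
# fractional digits of any length, and a mandatory "Z" or +/-HH:MM suffix.
_DATETIME_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})"
    r"(?:\.[0-9]+)?"
    r"(Z|[+-][0-9]{2}:[0-9]{2})"
)


def is_valid_datetime(value: str) -> bool:
    """Check a string against the datetime rules described above."""
    match = _DATETIME_RE.fullmatch(value)
    if match is None:
        return False
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zone = match.group(7)
    if zone == "-00:00":
        return False  # "negative zero" offset is specifically disallowed
    if zone != "Z":
        # Assumption: offsets must themselves be a plausible HH:MM value.
        if int(zone[1:3]) > 23 or int(zone[4:6]) > 59:
            return False
    try:
        # Semantic check: rejects month or day 00, minute 99, and so on.
        datetime(year, month, day, hour, minute, second)
    except ValueError:
        return False
    return True
```

The value is only checked here, never parsed into a native type and
re-serialized, which matches the advice to keep the original string for
round-tripping.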

Valid examples:

    # preferred
    1985-04-12T23:20:50.123Z
    1985-04-12T23:20:50.123456Z
    1985-04-12T23:20:50.120Z
    1985-04-12T23:20:50.120000Z

    # supported
    1985-04-12T23:20:50.12345678912345Z
    1985-04-12T23:20:50Z
    1985-04-12T23:20:50.0Z
    1985-04-12T23:20:50.123+00:00
    1985-04-12T23:20:50.123-07:00

Invalid examples:

    1985-04-12
    1985-04-12T23:20Z
    1985-04-12T23:20:5Z
    1985-04-12T23:20:50.123
    +001985-04-12T23:20:50.123Z
    23:20:50.123Z
    -1985-04-12T23:20:50.123Z
    1985-4-12T23:20:50.123Z
    01985-04-12T23:20:50.123Z
    1985-04-12T23:20:50.123+00
    1985-04-12T23:20:50.123+0000

    # ISO-8601 strict capitalization
    1985-04-12t23:20:50.123Z
    1985-04-12T23:20:50.123z

    # RFC-3339, but not ISO-8601
    1985-04-12T23:20:50.123-00:00
    1985-04-12 23:20:50.123Z

    # timezone is required
    1985-04-12T23:20:50.123

    # syntax looks ok, but datetime is not valid
    1985-04-12T23:99:50.123Z
    1985-00-12T23:20:50.123Z

uri

Flexible to any URI scheme, following the generic RFC-3986 on URIs. This
includes, but isn’t limited to: did, https, wss, ipfs (for CIDs), dns, and of
course at. Maximum length in Lexicons is 8 KBytes.

language

An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646 ("Tags
for Identifying Languages"). This is the same standard used to identify
languages in HTTP, HTML, and other web standards. The Lexicon string must
validate as a "well-formed" language tag, as defined in the RFC. Clients should
ignore language strings which are "well-formed" but not "valid" according to the
RFC.

As specified in the RFC, ISO 639 two-character and three-character language
codes can be used on their own, lower-cased, such as ja (Japanese) or ban
(Balinese). Regional sub-tags can be added, like pt-BR (Brazilian Portuguese).
Additional subtags can also be added, such as hy-Latn-IT-arevela.

Language codes generally need to be parsed, normalized, and matched
semantically, not simply string-compared. For example, a search engine might
simplify language tags to ISO 639 codes for indexing and filtering, while a
client application (user agent) would retain the full language code for
presentation (text rendering) locally.

When to use $type

Data objects sometimes include a $type field which indicates their Lexicon type.
The general principle is that this field needs to be included any time there
could be ambiguity about the content type when validating data.

The specific rules are:

- record objects must always include $type. While the type is often known from
  context (eg, the collection part of the path for records stored in a
  repository), record objects can also be passed around outside of repositories
  and need to be self-describing
- union variants must always include $type, except at the top level of
  subscription messages

Note that blob objects always include $type, which allows generic processing.

As a reminder, main types must be referenced in $type fields as just the NSID,
not including a #main suffix.

Lexicon Evolution

Lexicons are allowed to change over time, within some bounds to ensure both
forwards and backwards compatibility. The basic principle is that all old data
must still be valid under the updated Lexicon, and new data must be valid under
the old Lexicon.

- Any new fields must be optional
- Non-optional fields can not be removed. A best practice is to retain all
  fields in the Lexicon and mark them as deprecated if they are no longer used.
- Types can not change
- Fields can not be renamed

If larger breaking changes are necessary, a new Lexicon name must be used.

It can be ambiguous when a Lexicon has been published and becomes "set in
stone".
At a minimum, public adoption and implementation by a third party, even without
explicit permission, indicates that the Lexicon has been released and should not
break compatibility. A best practice is to clearly indicate in the Lexicon type
name any experimental or development status. Eg,
com.corp.experimental.newRecord.

Authority and Control

The authority for a Lexicon is determined by the NSID, and rooted in DNS control
of the domain authority. That authority has ultimate control over the Lexicon
definition, and responsibility for maintenance and distribution of Lexicon
schema definitions.

In a crisis, such as unintentional loss of DNS control to a bad actor, the
protocol ecosystem could decide to disregard this chain of authority. This
should only be done in exceptional circumstances, and not as a mechanism to
subvert an active authority. The primary mechanism for resolving protocol
disputes is to fork Lexicons into a new namespace.

Protocol implementations should generally consider data which fails to validate
against the Lexicon to be entirely invalid, and should not try to repair or do
partial processing on the individual piece of data.

Unexpected fields in data which otherwise conforms to the Lexicon should be
ignored. When doing schema validation, they should be treated at worst as
warnings. This is necessary to allow evolution of the schema by the controlling
authority, and to be robust in the case of out-of-date Lexicons.

Third parties can technically insert any additional fields they want into data.
This is not the recommended way to extend applications, but it is not
specifically disallowed. One danger with this is that the Lexicon may be updated
to include fields with the same field names but different types, which would
make existing data invalid.
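
The split between hard validation failures and tolerated unexpected fields can
be sketched as follows. This is a deliberately partial Python sketch under
stated assumptions: check_object is a hypothetical helper, schema is assumed to
be a parsed object definition (a dict with properties, required, and nullable),
and the values of known fields are not type-checked at all.

```python
def check_object(data: dict, schema: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one data object.

    Errors mean the data should be treated as entirely invalid.
    Warnings cover unexpected fields, which are otherwise ignored.
    """
    errors: list[str] = []
    warnings: list[str] = []
    properties = schema.get("properties", {})
    nullable = set(schema.get("nullable", []))

    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")

    for name, value in data.items():
        if name == "$type":
            continue  # type marker, not a schema property
        if name not in properties:
            warnings.append(f"unexpected field: {name}")
        elif value is None and name not in nullable:
            errors.append(f"null not allowed for field: {name}")

    return errors, warnings
```

A caller would reject the data when errors is non-empty, and at most log the
warnings, so that data written against a newer revision of the Lexicon still
passes.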

Lexicon Publication and Resolution

Lexicon schemas are published publicly as records in atproto repositories, using
the com.atproto.lexicon.schema type. The domain name authority for NSIDs is
linked to a specific atproto repository (identified by DID) by a DNS TXT record
(_lexicon), similar to but distinct from the handle resolution system.

The com.atproto.lexicon.schema Lexicon itself is very minimal: it only requires
the lexicon integer field, which must be 1 for this version of the Lexicon
language. In practice, the same fields as Lexicon Files should be included,
along with $type. The record key is the NSID of the schema.

A summary of record fields:

- $type: must be com.atproto.lexicon.schema (as with all atproto records)
- lexicon: integer, indicates the overall version of the Lexicon (currently 1)
- id: the NSID of this Lexicon. Must be a simple NSID (no fragment), and must
  match the record key
- defs: the schema definitions themselves, as a map-of-objects. Names should not
  include a # prefix.
- description: optional description of the overall schema; though descriptions
  are best included on individual defs, not the overall schema.

The com.atproto.lexicon.schema meta-schema is somewhat unlike other Lexicons, in
that it is defined and governed as part of the protocol. Future versions of the
language and protocol might not follow the evolution rules. It is an intentional
decision to not express the Lexicon schema language itself recursively, using
the schema language.

Authority for NSID namespaces is handled at the "group" level, meaning that all
NSIDs which differ only by the final "name" part are published in the same
repository. Lexicon resolution of NSIDs is not hierarchical: DNS TXT records
must be created for each authority section, and resolvers should not recurse up
or down the DNS hierarchy looking for TXT records.

As an example, the NSID edu.university.dept.lab.blogging.getBlogPost has a
"name" getBlogPost. Removing the name and reversing the rest of the NSID gives
an "authority domain name" of blogging.lab.dept.university.edu. To link the
authority to a specific DID (say did:plc:ewvi7nxzyoun6zhxrhs64oiz), a DNS TXT
record with the name _lexicon.blogging.lab.dept.university.edu and value
did=did:plc:ewvi7nxzyoun6zhxrhs64oiz (note the did= prefix) would be created.
Then a record with collection com.atproto.lexicon.schema and record-key
edu.university.dept.lab.blogging.getBlogPost would be created in that account's
repository.

A resolving service would start with the NSID
(edu.university.dept.lab.blogging.getBlogPost) and do a DNS TXT resolution for
_lexicon.blogging.lab.dept.university.edu. Finding the DID, it would proceed
with atproto DID resolution, look for a PDS, and then fetch the relevant record.
The overall AT-URI for the record would be
at://did:plc:ewvi7nxzyoun6zhxrhs64oiz/com.atproto.lexicon.schema/edu.university.dept.lab.blogging.getBlogPost.

If the DNS TXT resolution for _lexicon.blogging.lab.dept.university.edu failed,
the resolving service would NOT try _lexicon.lab.dept.university.edu or
_lexicon.getBlogPost.blogging.lab.dept.university.edu or
_lexicon.university.edu, or any other domain name. The Lexicon resolution would
simply fail.

If another NSID edu.university.dept.lab.blogging.getBlogComments was created, it
would have the same authority name, and must be published in the same atproto
repository (with a different record key). If a Lexicon for
edu.university.dept.lab.gallery.photo was published, a new DNS TXT record would
be required (_lexicon.gallery.lab.dept.university.edu); it could point at the
same repository (DID), or a different repository.

As a simpler example, an NSID app.toy.record would resolve via _lexicon.toy.app.
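
The name manipulation in these examples can be sketched in a few lines of
Python. The helper names (lexicon_dns_name, parse_lexicon_txt,
lexicon_record_uri) are hypothetical, the three-segment minimum for NSIDs is an
assumption taken from the NSID syntax rather than from this text, and no DNS
query or DID resolution is actually performed.

```python
def lexicon_dns_name(nsid: str) -> str:
    """Drop the final "name" segment, reverse the rest, and prefix _lexicon."""
    segments = nsid.split(".")
    if len(segments) < 3 or not all(segments):
        raise ValueError(f"not a plausible NSID: {nsid!r}")
    authority = segments[:-1]
    return "_lexicon." + ".".join(reversed(authority))


def parse_lexicon_txt(value: str) -> str:
    """Extract the DID from a TXT record value of the form did=<DID>."""
    prefix = "did="
    if not value.startswith(prefix):
        raise ValueError("TXT value is missing the did= prefix")
    return value[len(prefix):]


def lexicon_record_uri(did: str, nsid: str) -> str:
    """AT-URI of the schema record; the record key is the NSID itself."""
    return f"at://{did}/com.atproto.lexicon.schema/{nsid}"
```

Since the DNS name is derived from the NSID alone and there is no fallback, a
resolver built on this would issue exactly one TXT lookup per NSID and fail if
that lookup returns nothing.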

A single repository can host Lexicons for multiple authority domains, possibly
across multiple registered domains and TLDs. Resolution DNS records can change
over time, moving schema resolution to different repositories, though it may
take time for DNS and cache changes to propagate.

Note that Lexicon record operations are broadcast over repository event streams
("firehose"), but DNS resolution changes are not (unlike handle changes).
Resolving services should not cache DNS resolution results for long time
periods.

Usage and Implementation Guidelines

It should be possible to translate Lexicon schemas to JSON Schema or OpenAPI and
use tools and libraries from those ecosystems to work with atproto data in JSON
format.

Implementations which serialize and deserialize data from JSON or CBOR into
structures derived from specific Lexicons should be aware of the risk of
"clobbering" unexpected fields. For example, if a Lexicon is updated to add a
new (optional) field, old implementations would not be aware of that field, and
might accidentally strip the data when de-serializing and then re-serializing.
Depending on the context, one way to avoid this problem is to retain any "extra"
fields, or to pass-through the original data object instead of re-serializing
it.

Possible Future Changes

The validation rules for unexpected additional fields may change. For example, a
mechanism for Lexicons to indicate that the schema is "closed" and unexpected
fields are not allowed, or a convention around field name prefixes (x-) to
indicate unofficial extension.