context/lexicons_spec.md at main · slices.network/slices
Lexicon

Lexicon is a schema definition language used to describe atproto
records, HTTP endpoints (XRPC), and event stream messages. It builds on top of
the atproto Data Model.

The schema language is similar to JSON Schema and OpenAPI, but includes some
atproto-specific features and semantics.

This specification describes version 1 of the Lexicon definition language.

Overview of Types

Lexicon Type    Data Model Type    Category
null            Null               concrete
boolean         Boolean            concrete
integer         Integer            concrete
string          String             concrete
bytes           Bytes              concrete
cid-link        Link               concrete
blob            Blob               concrete
array           Array              container
object          Object             container
params                             container
token                              meta
ref                                meta
union                              meta
unknown                            meta
record                             primary
query                              primary
procedure                          primary
subscription                       primary

Lexicon Files

Lexicons are JSON files associated with a single NSID. A file contains one or
more definitions, each with a distinct short name. A definition with the name
main optionally describes the "primary" definition for the entire file. A
Lexicon with zero definitions is invalid.

A Lexicon JSON file is an object with the following fields:

- lexicon (integer, required): indicates Lexicon language version. In this
  version, a fixed value of 1
- id (string, required): the NSID of the Lexicon
- description (string, optional): short overview of the Lexicon, usually one or
  two sentences
- defs (map of strings-to-objects, required): set of definitions, each with a
  distinct name (key)

Schema definitions under defs all have a type field to distinguish their type.
A file can have at most one definition with one of the "primary" types. Primary
types should always have the name main. It is possible for main to describe a
non-primary type.
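
The file-level rules above can be checked mechanically. The following is a minimal sketch, not a full validator: the function name `check_lexicon_file`, the example NSID `com.example.post`, and the `tid` record key value are illustrative assumptions, and only the top-level structure is inspected.

```python
# Primary types, as listed in the type overview.
PRIMARY_TYPES = {"query", "procedure", "subscription", "record"}


def check_lexicon_file(doc: dict) -> list:
    """Return a list of problems with the top-level structure (empty if none)."""
    problems = []
    version = doc.get("lexicon")
    # type() check so that True (a bool) is not accepted as the integer 1.
    if type(version) is not int or version != 1:
        problems.append("lexicon must be the integer 1")
    if not isinstance(doc.get("id"), str):
        problems.append("id must be a string (the NSID)")
    defs = doc.get("defs")
    if not isinstance(defs, dict) or not defs:
        problems.append("defs must be a non-empty map")
        return problems
    primary = [
        name for name, definition in defs.items()
        if isinstance(definition, dict) and definition.get("type") in PRIMARY_TYPES
    ]
    if len(primary) > 1:
        problems.append("at most one primary definition is allowed")
    if any(name != "main" for name in primary):
        problems.append("primary definitions should be named main")
    return problems


# Hypothetical Lexicon file, written as a Python dict for illustration.
example = {
    "lexicon": 1,
    "id": "com.example.post",
    "defs": {
        "main": {
            "type": "record",
            "key": "tid",
            "record": {"type": "object", "properties": {"text": {"type": "string"}}},
        }
    },
}

print(check_lexicon_file(example))
```

NSID syntax and the contents of each definition are deliberately left unchecked here.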

References to specific definitions within a Lexicon use fragment syntax, like
com.example.defs#someView. If a main definition exists, it can be referenced
without a fragment, just using the NSID. For references in the $type fields in
data objects themselves (eg, records or contents of a union), this is a "must"
(use of a #main suffix is invalid). For example, com.example.record not
com.example.record#main.
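
As a small illustration of the fragment rules, the sketch below splits a reference into its NSID and definition name and renders the form required in a $type field. The helper names `split_ref` and `type_value` are hypothetical, and no NSID syntax validation is attempted.

```python
def split_ref(ref: str) -> tuple:
    """Split 'nsid#name' into (nsid, name); a missing fragment means 'main'."""
    nsid, _, name = ref.partition("#")
    return nsid, name or "main"


def type_value(ref: str) -> str:
    """Render a reference as it must appear in a $type field (never with #main)."""
    nsid, name = split_ref(ref)
    return nsid if name == "main" else f"{nsid}#{name}"


print(split_ref("com.example.defs#someView"))
print(type_value("com.example.record#main"))
```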

Related Lexicons are often grouped together in the NSID hierarchy. As a
convention, any definitions used by multiple Lexicons are defined in a dedicated
*.defs Lexicon (eg, com.atproto.server.defs) within the group. A *.defs Lexicon
should generally not include a definition named main, though it is not strictly
invalid to do so.

Primary Type Definitions

The primary types are:

- query: describes an XRPC Query (HTTP GET)
- procedure: describes an XRPC Procedure (HTTP POST)
- subscription: Event Stream (WebSocket)
- record: describes an object that can be stored in a repository record

Each primary definition schema object includes these fields:

- type (string, required): the type value (eg, record for records)
- description (string, optional): short, usually only a sentence or two

Record

Type-specific fields:

- key (string, required): specifies the Record Key type
- record (object, required): a schema definition with type object, which
  specifies this type of record

Query and Procedure (HTTP API)

Type-specific fields:

- parameters (object, optional): a schema definition with type params,
  describing the HTTP query parameters for this endpoint
- output (object, optional): describes the HTTP response body
  - description (string, optional): short description
  - encoding (string, required): MIME type for body contents. Use
    application/json for JSON responses.
  - schema (object, optional): schema definition, either an object, a ref, or a
    union of refs. Used to describe JSON encoded responses, though schema is
    optional even for JSON responses.
- input (object, optional, only for procedure): describes HTTP request body
  schema, with the same format as the output field
- errors (array of objects, optional): set of string error codes which might be
  returned
  - name (string, required): short name for the error type, with no whitespace
  - description (string, optional): short description, one or two sentences

Subscription (Event Stream)

Type-specific fields:

- parameters (object, optional): same as Query and Procedure
- message (object, optional): specifies what messages can be sent
  - description (string, optional): short description
  - schema (object, required): schema definition, which must be a union of refs
- errors (array of objects, optional): same as Query and Procedure

Subscription schemas (referenced by the schema field under message) must be a
union of refs, not an object type.

Field Type Definitions

As with the primary definitions, every schema object includes these fields:

- type (string, required): fixed value for each type
- description (string, optional): short, usually only a sentence or two

null

No additional fields.

boolean

Type-specific fields:

- default (boolean, optional): a default value for this field
- const (boolean, optional): a fixed (constant) value for this field

When included as an HTTP query parameter, should be rendered as true or false
(no quotes).

integer

A signed integer number.

Type-specific fields:

- minimum (integer, optional): minimum acceptable value
- maximum (integer, optional): maximum acceptable value
- enum (array of integers, optional): a closed set of allowed values
- default (integer, optional): a default value for this field
- const (integer, optional): a fixed (constant) value for this field

string

Type-specific fields:

- format (string, optional): string format restriction
- maxLength (integer, optional): maximum length of value, in UTF-8 bytes
- minLength (integer, optional): minimum length of value, in UTF-8 bytes
- maxGraphemes (integer, optional): maximum length of value, counted as Unicode
  Grapheme Clusters
- minGraphemes (integer, optional): minimum length of value, counted as Unicode
  Grapheme Clusters
- knownValues (array of strings, optional): a set of suggested or common values
  for this field. Values are not limited to this set (aka, not a closed enum).
- enum (array of strings, optional): a closed set of allowed values
- default (string, optional): a default value for this field
- const (string, optional): a fixed (constant) value for this field

Strings are Unicode. For non-Unicode encodings, use bytes instead. The basic
minLength/maxLength validation constraints are counted as UTF-8 bytes. Note
that Javascript stores strings with UTF-16 by default, and it is necessary to
re-encode to count accurately. The minGraphemes/maxGraphemes validation
constraints work with Grapheme Clusters, which have a complex technical and
linguistic definition, but loosely correspond to "distinct visual characters"
like Latin letters, CJK characters, punctuation, digits, or emoji (which might
comprise multiple Unicode codepoints and many UTF-8 bytes).
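
To make the difference between these units concrete, here is a short sketch that measures one string three ways. The helper names are illustrative. Grapheme cluster counting is not shown because the Python standard library has no segmentation routine for it; a third-party library would be needed for minGraphemes/maxGraphemes.

```python
def utf8_length(value: str) -> int:
    """Length in UTF-8 bytes, the unit used by minLength/maxLength."""
    return len(value.encode("utf-8"))


def utf16_units(value: str) -> int:
    """Length in UTF-16 code units, which is what JavaScript's .length reports."""
    return len(value.encode("utf-16-le")) // 2


# A "family" emoji: three emoji joined by two zero-width joiners (U+200D).
# It renders as a single visual character (one grapheme cluster).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(family))          # 5 Unicode codepoints
print(utf16_units(family))  # 8 UTF-16 code units
print(utf8_length(family))  # 18 UTF-8 bytes
```

A maxLength of 10 would therefore reject this single visible character, while a maxGraphemes of 1 would accept it.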

format constrains the string format and provides additional semantic context.
Refer to the Data Model specification for the available format types and their
definitions.

const and default are mutually exclusive.

bytes

Type-specific fields:

- minLength (integer, optional): minimum size of value, as raw bytes with no
  encoding
- maxLength (integer, optional): maximum size of value, as raw bytes with no
  encoding

cid-link

No type-specific fields.

See Data Model spec for CID restrictions.

array

Type-specific fields:

- items (object, required): describes the schema elements of this array
- minLength (integer, optional): minimum count of elements in array
- maxLength (integer, optional): maximum count of elements in array

In theory arrays have homogeneous types (meaning every element has the same
type). However, with union types this restriction is meaningless, so
implementations can not assume that all the elements have the same type.

object

A generic object schema which can be nested inside other definitions by
reference.

Type-specific fields:

- properties (map of strings-to-objects, required): defines the properties
  (fields) by name, each with their own schema
- required (array of strings, optional): indicates which properties are required
- nullable (array of strings, optional): indicates which properties can have
  null as a value

As described in the data model specification, there is a semantic difference in
data between omitting a field; including the field with the value null; and
including the field with a "false-y" value (false, 0, empty array, etc).
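
The three cases can be told apart with a sentinel value. This is a minimal sketch of how required and nullable might be applied to one property; the function name `check_property` and the error strings are illustrative, and validation of the value against the property's own schema is omitted.

```python
from typing import Optional

_MISSING = object()  # sentinel: distinguishes "field omitted" from "field is null"


def check_property(data: dict, name: str, required: set, nullable: set) -> Optional[str]:
    """Return an error message for one property, or None if presence is acceptable."""
    value = data.get(name, _MISSING)
    if value is _MISSING:
        return f"{name} is required" if name in required else None
    if value is None:
        return None if name in nullable else f"{name} must not be null"
    # Present with a non-null value, including false-y values like False, 0, or [].
    return None


print(check_property({}, "text", {"text"}, set()))
print(check_property({"text": None}, "text", {"text"}, set()))
print(check_property({"count": 0}, "count", {"count"}, set()))
```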

blob

Type-specific fields:

- accept (array of strings, optional): list of acceptable MIME types. Each may
  end in * as a glob pattern (eg, image/*). Use */* to indicate that any MIME
  type is accepted.
- maxSize (integer, optional): maximum size in bytes

params

This is a limited-scope type which is only ever used for the parameters field on
query, procedure, and subscription primary types. These map to HTTP query
parameters.

Type-specific fields:

- required (array of strings, optional): same semantics as field on object
- properties: similar to properties under object, but can only include the types
  boolean, integer, string, and unknown; or an array of one of these types

Note that unlike object, there is no nullable field on params.
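
A rough sketch of how params values might be rendered into an HTTP query string follows. The boolean rendering (true/false, no quotes) comes from the boolean section of this document. Encoding array values as repeated keys is an assumption made here for illustration, as are the function and parameter names.

```python
from urllib.parse import urlencode


def encode_param(value) -> str:
    """Render one scalar parameter value as a query-string value."""
    # bool must be tested before any int handling: bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_params(params: dict) -> str:
    """Build a query string; list values become repeated keys (assumed convention)."""
    pairs = []
    for name, value in params.items():
        values = value if isinstance(value, list) else [value]
        pairs.extend((name, encode_param(item)) for item in values)
    return urlencode(pairs)


print(encode_params({"limit": 50, "includeReplies": True, "tags": ["a", "b"]}))
```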

token

Tokens are empty data values which exist only to be referenced by name. They are
used to define a set of values with specific meanings. The description field
should clarify the meaning of the token. Tokens encode as string data, with the
string being the fully-qualified reference to the token itself (NSID followed by
an optional fragment).

Tokens are similar to the concept of a "symbol" in some programming languages,
distinct from strings, variables, built-in keywords, or other identifiers.

For example, tokens could be defined to represent the state of an entity (in a
state machine), or to enumerate a list of categories.

No type-specific fields.

ref

Type-specific fields:

- ref (string, required): reference to another schema definition

Refs are a mechanism for re-using a schema definition in multiple places. The
ref string can be a global reference to a Lexicon type definition (an NSID,
optionally with a #-delimited name indicating a definition other than main), or
can indicate a local definition within the same Lexicon file (a # followed by a
name).
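
Resolving a ref therefore needs to know which Lexicon the ref appears in. The sketch below assumes a hypothetical in-memory registry (`lexicons`, keyed by NSID) and hypothetical helper names; how schemas are actually loaded is out of scope.

```python
def resolve_ref(ref: str, current_nsid: str) -> str:
    """Expand a local ref ('#name') into a global one; pass global refs through."""
    if ref.startswith("#"):
        return current_nsid + ref
    return ref


def find_definition(ref: str, current_nsid: str, lexicons: dict) -> dict:
    """Look up the definition a ref points at; no fragment means 'main'."""
    nsid, _, name = resolve_ref(ref, current_nsid).partition("#")
    return lexicons[nsid]["defs"][name or "main"]


# Hypothetical registry of already-loaded Lexicon files.
lexicons = {
    "com.example.defs": {"defs": {"someView": {"type": "object", "properties": {}}}},
    "com.example.record": {"defs": {"main": {"type": "record"}}},
}

print(resolve_ref("#someView", "com.example.defs"))
print(find_definition("com.example.record", "com.example.defs", lexicons))
```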

union

Type-specific fields:

- refs (array of strings, required): references to schema definitions
- closed (boolean, optional): indicates if a union is "open" or "closed".
  defaults to false (open union)

Unions represent that multiple possible types could be present at this location
in the schema. The references follow the same syntax as ref, allowing references
to both global or local schema definitions. Actual data will validate against a
single specific type: the union does not combine fields from multiple schemas,
or define a new hybrid data type. The different types are referred to as
variants.

By default unions are "open", meaning that future revisions of the schema could
add more types to the list of refs (though can not remove types). This means
that implementations should be permissive when validating, in case they do not
have the most recent version of the Lexicon. The closed flag (boolean) can
indicate that the set of types is fixed and can not be extended in the future.

A union schema definition with no refs is allowed and similar to unknown, as
long as the closed flag is false (the default). The main difference is that the
data would be required to have the $type field. An empty refs list with closed
set to true is an invalid schema.

The schema definitions pointed to by a union are objects or types with a clear
mapping to an object, like a record. All the variants must be represented by a
CBOR map (or JSON Object) and must include a $type field indicating the variant
type. Because the data must be an object, unions can not reference token (which
would correspond to string data).
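
The open versus closed behaviour can be sketched as a small classifier. This is an illustration under simplifying assumptions: refs are taken to be already global (local #name refs expanded beforehand), the function names and return strings are invented here, and validating a known variant against its own schema is left out.

```python
def strip_main(ref: str) -> str:
    """Drop a trailing '#main', since $type values never carry it."""
    return ref[: -len("#main")] if ref.endswith("#main") else ref


def classify_union_value(value, refs: list, closed: bool = False) -> str:
    """Classify data at a union location as 'known', 'unlisted', or 'invalid: ...'."""
    if closed and not refs:
        raise ValueError("a closed union with no refs is an invalid schema")
    if not isinstance(value, dict):
        return "invalid: union data must be an object"
    variant = value.get("$type")
    if not isinstance(variant, str):
        return "invalid: union variants must include $type"
    if variant in {strip_main(ref) for ref in refs}:
        return "known"
    # Open unions stay permissive: the variant may come from a newer schema revision.
    return "invalid: not in closed union" if closed else "unlisted"


refs = ["com.example.embed.images", "com.example.embed.external#main"]
print(classify_union_value({"$type": "com.example.embed.video"}, refs))
print(classify_union_value({"$type": "com.example.embed.video"}, refs, closed=True))
```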

unknown

Indicates that any data object could appear at this location, with no specific
validation. The top-level data must be an object (not a string, boolean, etc).
As with all other data types, the value null is not allowed unless the field is
specifically marked as nullable.

The data object may contain a $type field indicating the schema of the data, but
this is not currently required. The top-level data object must not have the
structure of a compound data type, like blob ($type: blob) or CID link ($link).

The (nested) contents of the data object must still be valid under the atproto
data model. For example, it should not contain floats. Nested compound types
like blobs and CID links should be validated and transformed as expected.

Lexicon designers are strongly recommended to not use unknown fields in record
objects for now.

No type-specific fields.

String Formats

Strings can optionally be constrained to one of the following format types:

- at-identifier: either a Handle or a DID, details described below
- at-uri: AT-URI
- cid: CID in string format, details specified in Data Model
- datetime: timestamp, details specified below
- did: generic DID Identifier
- handle: Handle Identifier
- nsid: Namespaced Identifier
- tid: Timestamp Identifier (TID)
- record-key: Record Key, matching the general syntax ("any")
- uri: generic URI, details specified below
- language: language code, details specified below

For the various identifier formats, when doing Lexicon schema validation the
most expansive identifier syntax format should be permitted. Problems with
identifiers which do pass basic syntax validation should be reported as
application errors, not lexicon data validation errors. For example, data with
any kind of DID in a did format string field should pass Lexicon validation,
with unsupported DID methods being raised separately as an application error.

at-identifier

A string type which is either a DID (type: did) or a handle (handle). Mostly
used in XRPC query parameters. It is unambiguous whether an at-identifier is a
handle or a DID because a DID always starts with did:, and the colon character
(:) is not allowed in handles.
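
Because of that prefix rule, telling the two apart is a one-line check. The sketch below only classifies; it does not validate DID or handle syntax, and the function name and example values are illustrative.

```python
def at_identifier_kind(value: str) -> str:
    """Classify an at-identifier as 'did' or 'handle' by its prefix alone."""
    return "did" if value.startswith("did:") else "handle"


print(at_identifier_kind("did:plc:ewvi7nxzyoun6zhxrhs64oiz"))
print(at_identifier_kind("alice.example.com"))
```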

datetime

Full-precision date and time, with timezone information.

This format is intended for use with computer-generated timestamps in the modern
computing era (eg, after the UNIX epoch). If you need to represent historical or
ancient events, ambiguity, or far-future times, a different format is probably
more appropriate. Datetimes before the Current Era (year zero) are specifically
disallowed.

Datetime format standards are notoriously flexible and overlapping. Datetime
strings in atproto should meet the intersecting requirements of the RFC 3339,
ISO 8601, and WHATWG HTML datetime standards.

The character separating "date" and "time" parts must be an upper-case T.

Timezone specification is required. It is strongly preferred to use the UTC
timezone, and to represent the timezone with a simple capital Z suffix
(lower-case is not allowed). While hour/minute suffix syntax (like +01:00 or
-10:30) is supported, "negative zero" (-00:00) is specifically disallowed (by
ISO 8601).

Whole seconds precision is required, and arbitrary fractional precision digits
are allowed. Best practice is to use at least millisecond precision, and to pad
with zeros to the generated precision (eg, trailing :12.340Z instead of
:12.34Z). Not all datetime formatting libraries support trailing zero
formatting. Both millisecond and microsecond precision have reasonable
cross-language support; nanosecond precision does not.

Implementations should be aware when round-tripping records containing datetimes
of two ambiguities: loss-of-precision, and ambiguity with trailing fractional
second zeros. If de-serializing Lexicon records into native types, and then
re-serializing, the string representation may not be the same, which could
result in broken hash references, sanity check failures, or repository update
churn. A safer thing to do is to deserialize the datetime as a simple string,
which ensures round-trip re-serialization.

Implementations "should" validate that the semantics of the datetime are valid.
For example, a month or day 00 is invalid.

Valid examples:

    # preferred
    1985-04-12T23:20:50.123Z
    1985-04-12T23:20:50.123456Z
    1985-04-12T23:20:50.120Z
    1985-04-12T23:20:50.120000Z

    # supported
    1985-04-12T23:20:50.12345678912345Z
    1985-04-12T23:20:50Z
    1985-04-12T23:20:50.0Z
    1985-04-12T23:20:50.123+00:00
    1985-04-12T23:20:50.123-07:00

Invalid examples:

    1985-04-12
    1985-04-12T23:20Z
    1985-04-12T23:20:5Z
    1985-04-12T23:20:50.123
    +001985-04-12T23:20:50.123Z
    23:20:50.123Z
    -1985-04-12T23:20:50.123Z
    1985-4-12T23:20:50.123Z
    01985-04-12T23:20:50.123Z
    1985-04-12T23:20:50.123+00
    1985-04-12T23:20:50.123+0000

    # ISO-8601 strict capitalization
    1985-04-12t23:20:50.123Z
    1985-04-12T23:20:50.123z

    # RFC-3339, but not ISO-8601
    1985-04-12T23:20:50.123-00:00
    1985-04-12 23:20:50.123Z

    # timezone is required
    1985-04-12T23:20:50.123

    # syntax looks ok, but datetime is not valid
    1985-04-12T23:99:50.123Z
    1985-00-12T23:20:50.123Z
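
The syntax rules and examples above can be approximated with a regular expression plus a semantic check. This is a sketch under stated assumptions, not a reference validator: the pattern and function name are written for this illustration, the calendar check relies on Python's `datetime` constructor (which also rejects year 0000 and a seconds value of 60), and the offset range check is an added assumption.

```python
import re
from datetime import datetime

_DATETIME_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"   # date: YYYY-MM-DD
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})"  # upper-case T, then HH:MM:SS
    r"(?:\.[0-9]+)?"                      # optional fractional seconds, any precision
    r"(?:Z|[+-]([0-9]{2}):([0-9]{2}))"    # capital Z, or +HH:MM / -HH:MM
)


def is_valid_datetime(value: str) -> bool:
    """Check the syntax described above, then that the date and time exist."""
    match = _DATETIME_RE.fullmatch(value)
    if match is None:
        return False
    if value.endswith("-00:00"):  # "negative zero" offset is disallowed
        return False
    offset_hour, offset_minute = match.group(7), match.group(8)
    if offset_hour is not None and (int(offset_hour) > 23 or int(offset_minute) > 59):
        return False
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    try:
        datetime(year, month, day, hour, minute, second)
    except ValueError:  # eg month 00 or minute 99
        return False
    return True


print(is_valid_datetime("1985-04-12T23:20:50.123Z"))
print(is_valid_datetime("1985-04-12T23:99:50.123Z"))
```

In line with the round-tripping advice, a validator like this can leave the value as a string rather than converting it to a native datetime type.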

uri

Flexible to any URI scheme, following the generic RFC-3986 on URIs. This
includes, but isn’t limited to: did, https, wss, ipfs (for CIDs), dns, and of
course at. Maximum length in Lexicons is 8 KBytes.

language

An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646 ("Tags
for Identifying Languages"). This is the same standard used to identify
languages in HTTP, HTML, and other web standards. The Lexicon string must
validate as a "well-formed" language tag, as defined in the RFC. Clients should
ignore language strings which are "well-formed" but not "valid" according to the
RFC.

As specified in the RFC, ISO 639 two-character and three-character language
codes can be used on their own, lower-cased, such as ja (Japanese) or ban
(Balinese). Regional sub-tags can be added, like pt-BR (Brazilian Portuguese).
Additional subtags can also be added, such as hy-Latn-IT-arevela.

Language codes generally need to be parsed, normalized, and matched
semantically, not simply string-compared. For example, a search engine might
simplify language tags to ISO 639 codes for indexing and filtering, while a
client application (user agent) would retain the full language code for
presentation (text rendering) locally.
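
The search-engine style simplification could look roughly like the sketch below, which reduces a tag to its first subtag. It assumes the tag has already been checked as well-formed and ignores special cases such as private-use (x-...) or grandfathered tags; the function names are hypothetical.

```python
def primary_language(tag: str) -> str:
    """Reduce a language tag to its lower-cased primary subtag (eg 'pt-BR' -> 'pt')."""
    return tag.split("-", 1)[0].lower()


def same_language(tag_a: str, tag_b: str) -> bool:
    """Coarse match for indexing or filtering: compare primary subtags only."""
    return primary_language(tag_a) == primary_language(tag_b)


print(primary_language("hy-Latn-IT-arevela"))
print(same_language("pt-BR", "pt"))
```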

When to use $type

Data objects sometimes include a $type field which indicates their Lexicon type.
The general principle is that this field needs to be included any time there
could be ambiguity about the content type when validating data.

The specific rules are:

- record objects must always include $type. While the type is often known from
  context (eg, the collection part of the path for records stored in a
  repository), record objects can also be passed around outside of repositories
  and need to be self-describing
- union variants must always include $type, except at the top level of
  subscription messages

Note that blob objects always include $type, which allows generic processing.

As a reminder, main types must be referenced in $type fields as just the NSID,
not including a #main suffix.

Lexicon Evolution

Lexicons are allowed to change over time, within some bounds to ensure both
forwards and backwards compatibility. The basic principle is that all old data
must still be valid under the updated Lexicon, and new data must be valid under
the old Lexicon.

- Any new fields must be optional
- Non-optional fields can not be removed. A best practice is to retain all
  fields in the Lexicon and mark them as deprecated if they are no longer used.
- Types can not change
- Fields can not be renamed

If larger breaking changes are necessary, a new Lexicon name must be used.
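
Some of these rules can be checked by comparing two revisions of an object schema. The following is a rough sketch with invented names that looks only at top-level properties, required, and each property's type. A rename cannot be detected directly; it shows up as a removal plus an addition.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List rule violations between two revisions of an object schema definition."""
    problems = []
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))
    for name in sorted(new_required - old_required):
        problems.append(f"{name}: required now, but was not required before")
    for name in sorted(old_required - set(new_props)):
        problems.append(f"{name}: non-optional field was removed")
    for name in sorted(set(old_props) & set(new_props)):
        if old_props[name].get("type") != new_props[name].get("type"):
            problems.append(f"{name}: type changed")
    return problems


# Hypothetical revisions of a record's object schema.
old = {
    "type": "object",
    "required": ["text"],
    "properties": {"text": {"type": "string"}, "count": {"type": "integer"}},
}
compatible = {
    "type": "object",
    "required": ["text"],
    "properties": {**old["properties"], "lang": {"type": "string"}},
}

print(breaking_changes(old, compatible))
```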

It can be ambiguous when a Lexicon has been published and becomes "set in
stone". At a minimum, public adoption and implementation by a third party, even
without explicit permission, indicates that the Lexicon has been released and
should not break compatibility. A best practice is to clearly indicate in the
Lexicon type name any experimental or development status. Eg,
com.corp.experimental.newRecord.

Authority and Control

The authority for a Lexicon is determined by the NSID, and rooted in DNS control
of the domain authority. That authority has ultimate control over the Lexicon
definition, and responsibility for maintenance and distribution of Lexicon
schema definitions.

In a crisis, such as unintentional loss of DNS control to a bad actor, the
protocol ecosystem could decide to disregard this chain of authority. This
should only be done in exceptional circumstances, and not as a mechanism to
subvert an active authority. The primary mechanism for resolving protocol
disputes is to fork Lexicons into a new namespace.

Protocol implementations should generally consider data which fails to validate
against the Lexicon to be entirely invalid, and should not try to repair or do
partial processing on the individual piece of data.

Unexpected fields in data which otherwise conforms to the Lexicon should be
ignored. When doing schema validation, they should be treated at worst as
warnings. This is necessary to allow evolution of the schema by the controlling
authority, and to be robust in the case of out-of-date Lexicons.

Third parties can technically insert any additional fields they want into data.
This is not the recommended way to extend applications, but it is not
specifically disallowed. One danger with this is that the Lexicon may be updated
to include fields with the same field names but different types, which would
make existing data invalid.

Lexicon Publication and Resolution

Lexicon schemas are published publicly as records in atproto repositories, using
the com.atproto.lexicon.schema type. The domain name authority for an NSID is
linked to a specific atproto repository (identified by DID) by a DNS TXT record
(_lexicon), similar to but distinct from the handle resolution system.

The com.atproto.lexicon.schema Lexicon itself is very minimal: it only requires
the lexicon integer field, which must be 1 for this version of the Lexicon
language. In practice, the same fields as in Lexicon Files should be included,
along with $type. The record key is the NSID of the schema.

A summary of record fields:

- $type: must be com.atproto.lexicon.schema (as with all atproto records)
- lexicon: integer, indicates the overall version of the Lexicon (currently 1)
- id: the NSID of this Lexicon. Must be a simple NSID (no fragment), and must
  match the record key
- defs: the schema definitions themselves, as a map-of-objects. Names should not
  include a # prefix.
- description: optional description of the overall schema; though descriptions
  are best included on individual defs, not the overall schema.

The com.atproto.lexicon.schema meta-schema is somewhat unlike other Lexicons, in
that it is defined and governed as part of the protocol. Future versions of the
language and protocol might not follow the evolution rules. It is an intentional
decision to not express the Lexicon schema language itself recursively, using
the schema language.

Authority for NSID namespaces is done at the "group" level, meaning that all
NSIDs which differ only by the final "name" part are all published in the same
repository. Lexicon resolution of NSIDs is not hierarchical: DNS TXT records
must be created for each authority section, and resolvers should not recurse up
or down the DNS hierarchy looking for TXT records.

As an example, the NSID edu.university.dept.lab.blogging.getBlogPost has a
"name" getBlogPost. Removing the name and reversing the rest of the NSID gives
an "authority domain name" of blogging.lab.dept.university.edu. To link the
authority to a specific DID (say did:plc:ewvi7nxzyoun6zhxrhs64oiz), a DNS TXT
record with the name _lexicon.blogging.lab.dept.university.edu and value
did=did:plc:ewvi7nxzyoun6zhxrhs64oiz (note the did= prefix) would be created.
Then a record with collection com.atproto.lexicon.schema and record-key
edu.university.dept.lab.blogging.getBlogPost would be created in that account's
repository.

A resolving service would start with the NSID
(edu.university.dept.lab.blogging.getBlogPost) and do a DNS TXT resolution for
_lexicon.blogging.lab.dept.university.edu. Finding the DID, it would proceed
with atproto DID resolution, look for a PDS, and then fetch the relevant record.
The overall AT-URI for the record would be
at://did:plc:ewvi7nxzyoun6zhxrhs64oiz/com.atproto.lexicon.schema/edu.university.dept.lab.blogging.getBlogPost.

If the DNS TXT resolution for _lexicon.blogging.lab.dept.university.edu failed,
the resolving service would NOT try _lexicon.lab.dept.university.edu or
_lexicon.getBlogPost.blogging.lab.dept.university.edu or
_lexicon.university.edu, or any other domain name. The Lexicon resolution would
simply fail.

If another NSID edu.university.dept.lab.blogging.getBlogComments was created, it
would have the same authority name, and must be published in the same atproto
repository (with a different record key). If a Lexicon for
edu.university.dept.lab.gallery.photo was published, a new DNS TXT record would
be required (_lexicon.gallery.lab.dept.university.edu); it could point at the
same repository (DID), or a different repository.

As a simpler example, an NSID app.toy.record would resolve via _lexicon.toy.app.
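
The name manipulation in these examples can be sketched as pure string functions. The function names are hypothetical, no DNS query or DID resolution is performed, and NSID syntax is not validated beyond a segment count.

```python
def authority_domain(nsid: str) -> str:
    """Drop the final 'name' segment and reverse the rest of the NSID."""
    parts = nsid.split(".")
    if len(parts) < 3:
        raise ValueError("expected an NSID with at least three segments")
    return ".".join(reversed(parts[:-1]))


def lexicon_txt_name(nsid: str) -> str:
    """DNS name to query for the TXT record; no other names are tried on failure."""
    return "_lexicon." + authority_domain(nsid)


def did_from_txt_value(value: str) -> str:
    """Extract the DID from a TXT record value of the form 'did=<DID>'."""
    prefix = "did="
    if not value.startswith(prefix):
        raise ValueError("TXT value must start with did=")
    return value[len(prefix):]


def schema_at_uri(did: str, nsid: str) -> str:
    """AT-URI of the schema record: the record key is the NSID itself."""
    return f"at://{did}/com.atproto.lexicon.schema/{nsid}"


nsid = "edu.university.dept.lab.blogging.getBlogPost"
did = did_from_txt_value("did=did:plc:ewvi7nxzyoun6zhxrhs64oiz")
print(lexicon_txt_name(nsid))
print(schema_at_uri(did, nsid))
```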

A single repository can host Lexicons for multiple authority domains, possibly
across multiple registered domains and TLDs. Resolution DNS records can change
over time, moving schema resolution to different repositories, though it may
take time for DNS and cache changes to propagate.

Note that Lexicon record operations are broadcast over repository event streams
("firehose"), but that DNS resolution changes do not (unlike handle changes).
Resolving services should not cache DNS resolution results for long time
periods.

Usage and Implementation Guidelines

It should be possible to translate Lexicon schemas to JSON Schema or OpenAPI and
use tools and libraries from those ecosystems to work with atproto data in JSON
format.

Implementations which serialize and deserialize data from JSON or CBOR into
structures derived from specific Lexicons should be aware of the risk of
"clobbering" unexpected fields. For example, if a Lexicon is updated to add a
new (optional) field, old implementations would not be aware of that field, and
might accidentally strip the data when de-serializing and then re-serializing.
Depending on the context, one way to avoid this problem is to retain any "extra"
fields, or to pass-through the original data object instead of re-serializing
it.
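
One way to retain "extra" fields is to keep everything the typed structure does not know about in a side dictionary and merge it back on output. The `Post` class and its field names below are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Typed view of a hypothetical record that only knows about 'text'."""
    text: str
    extra: dict = field(default_factory=dict)  # fields this code does not understand

    @classmethod
    def from_json(cls, data: dict) -> "Post":
        known = {"text"}
        extra = {key: value for key, value in data.items() if key not in known}
        return cls(text=data["text"], extra=extra)

    def to_json(self) -> dict:
        # Merge the unknown fields back so they survive a round trip.
        return {**self.extra, "text": self.text}


original = {"$type": "com.example.post", "text": "hi", "newField": {"a": 1}}
print(Post.from_json(original).to_json() == original)
```

This keeps field values intact, but passing through the original data object remains the simpler option when nothing needs to be modified.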

Possible Future Changes

The validation rules for unexpected additional fields may change. For example,
there could be a mechanism for Lexicons to indicate that the schema is "closed"
and unexpected fields are not allowed, or a convention around field name
prefixes (x-) to indicate unofficial extension.