# data

atproto's data model: each user is a signed database.

## repos

a repository is a user's data store. it contains all their records - posts, likes, follows, whatever the applications define.

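repos are merkle trees (specifically, merkle search trees keyed by collection and record key). every commit is signed with the account's signing key and can be verified by anyone who resolves the DID to its public key. this is what enables authenticated data gossip - you don't need to trust the messenger, you verify the signature.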
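
to see why one signature over the root can vouch for every record, here is a toy sketch using a plain binary merkle tree. this is not atproto's actual merkle search tree layout, and `merkle_root` is a hypothetical helper; it only shows that editing any record changes the root hash that a commit would sign.

```python
import hashlib


def merkle_root(leaves: list[bytes]) -> bytes:
    """Toy binary merkle root over raw leaf bytes (NOT the atproto MST format)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    # hash every leaf, then repeatedly hash adjacent pairs until one node remains
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


original = merkle_root([b"post-1", b"post-2", b"like-1"])
tampered = merkle_root([b"post-1", b"post-2-edited", b"like-1"])
print(original != tampered)  # any edit to any record changes the root
```

a verifier that trusts the signature on the root can then check any individual record against it without trusting whoever relayed the data.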

## records

records are JSON documents stored in collections:

```
at://did:plc:xyz/app.bsky.feed.post/3jui7akfj2k2a
     └── DID ──┘ └─ collection ───┘ └─── rkey ──┘
```

- **DID**: whose repo
- **collection**: the record type (lexicon NSID)
- **rkey**: record key within the collection

record keys are typically TIDs (timestamp-based IDs) in collections where a user has many records (posts, likes). singleton records like profiles use the literal key `self`.
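
a minimal sketch of how a TID is put together, assuming the layout described in the atproto record key spec: a 64-bit integer with the top bit zero, 53 bits of microseconds since the unix epoch, and a 10-bit clock identifier, encoded as 13 characters of sortable base32. `encode_tid` and `decode_tid` are hypothetical helper names, not a library API.

```python
import time

# sortable base32 alphabet: ASCII order matches numeric order
B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"


def encode_tid(timestamp_us: int, clock_id: int) -> str:
    """Pack a microsecond timestamp and a clock id into a 13-character TID."""
    if not 0 <= timestamp_us < (1 << 53):
        raise ValueError("timestamp must fit in 53 bits")
    if not 0 <= clock_id < (1 << 10):
        raise ValueError("clock id must fit in 10 bits")
    value = (timestamp_us << 10) | clock_id
    chars = []
    for _ in range(13):  # emit 5 bits per character, lowest bits first
        chars.append(B32_SORTABLE[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def decode_tid(tid: str) -> tuple[int, int]:
    """Return (timestamp_us, clock_id) for a TID string."""
    value = 0
    for ch in tid:
        value = (value << 5) | B32_SORTABLE.index(ch)
    return value >> 10, value & 0x3FF


now_us = time.time_ns() // 1_000
print(encode_tid(now_us, clock_id=1))
```

because the timestamp sits in the high bits and the alphabet is in ASCII order, sorting TIDs as strings sorts records by creation time.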

## AT-URIs

the `at://` URI scheme identifies records:

```
at://did:plc:xyz/fm.plyr.track/3jui7akfj2k2a
at://zzstoatzz.io/app.bsky.feed.post/3jui7akfj2k2a  # handle also works
```

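the DID form is a stable reference: it uniquely identifies a record across the network and survives handle changes. the handle form is convenient for humans but breaks if the handle changes, so resolve it to a DID before storing it.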
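
a minimal sketch of splitting an AT-URI into its parts. `AtUri` and `parse_at_uri` are hypothetical names, and the validation here is deliberately loose (it does not check DID, handle, NSID, or rkey syntax).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AtUri:
    authority: str             # DID or handle
    collection: Optional[str]  # lexicon NSID, if present
    rkey: Optional[str]        # record key, if present

    @property
    def is_durable(self) -> bool:
        # only DID authorities survive a handle change
        return self.authority.startswith("did:")


def parse_at_uri(uri: str) -> AtUri:
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"malformed AT-URI: {uri!r}")
    # pad so repo-only and collection-only URIs also parse
    authority, collection, rkey = (parts + [None, None])[:3]
    return AtUri(authority, collection, rkey)


uri = parse_at_uri("at://did:plc:xyz/fm.plyr.track/3jui7akfj2k2a")
print(uri.collection, uri.rkey, uri.is_durable)
```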

## CIDs

a CID (Content Identifier) is a hash of a specific version of a record:

```
bafyreig2fjxi3qbp5jvyqx2i4djxfkp...
```

URIs identify *what*, CIDs identify *which version*. when you reference another record and care about the exact content, you include both.
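
a rough sketch of what a record CID string contains, assuming the usual atproto choices: CIDv1, the dag-cbor codec, and a sha2-256 multihash, rendered as lowercase base32 with a `b` multibase prefix. `cid_for_dag_cbor` is a hypothetical helper, and it takes bytes that are already DAG-CBOR encoded; the encoding step itself is out of scope here.

```python
import base64
import hashlib


def cid_for_dag_cbor(block: bytes) -> str:
    """Build a CIDv1 string for bytes that are ALREADY DAG-CBOR encoded."""
    digest = hashlib.sha256(block).digest()
    header = bytes([
        0x01,  # CID version 1
        0x71,  # multicodec: dag-cbor
        0x12,  # multihash function: sha2-256
        0x20,  # digest length: 32 bytes
    ])
    body = base64.b32encode(header + digest).decode("ascii").lower().rstrip("=")
    return "b" + body  # 'b' is the multibase prefix for lowercase base32


# b"\xa0" is the CBOR encoding of an empty map, used here as a stand-in record
print(cid_for_dag_cbor(b"\xa0"))
```

the four fixed header bytes are why record CIDs start with `bafyrei`. change a single byte of the record and the digest, and therefore the CID, changes.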

## strongRef

the standard pattern for cross-record references:

```json
{
  "subject": {
    "uri": "at://did:plc:xyz/fm.plyr.track/abc123",
    "cid": "bafyreig..."
  }
}
```

used in likes (referencing tracks), comments (referencing tracks), and lists (referencing any record). the CID pins the reference to a specific version: if the target is later edited, its CID changes and the mismatch is detectable.

from [plyr.fm lexicons](https://github.com/zzstoatzz/plyr.fm/tree/main/lexicons) - likes, comments, and lists all use strongRef.
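
a small sketch of building and checking a strongRef. the `fm.plyr.like` collection name and the `subject` / `createdAt` field layout are assumptions modeled on `app.bsky.feed.like`, not copied from the plyr.fm lexicons; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def strong_ref(uri: str, cid: str) -> dict:
    """A strongRef is just the pair (uri, cid)."""
    return {"uri": uri, "cid": cid}


def like_record(subject_uri: str, subject_cid: str) -> dict:
    # assumed record shape: $type, subject (strongRef), createdAt
    created_at = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "$type": "fm.plyr.like",  # assumed NSID
        "subject": strong_ref(subject_uri, subject_cid),
        "createdAt": created_at.replace("+00:00", "Z"),
    }


def is_stale(ref: dict, current_cid: str) -> bool:
    """True if the referenced record has changed since the ref was made."""
    return ref["cid"] != current_cid


like = like_record("at://did:plc:xyz/fm.plyr.track/abc123", "bafyreig-old")  # placeholder CID
print(is_stale(like["subject"], "bafyreig-new"))  # placeholder CID for the edited track
```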

## collections

records are grouped into collections by type:

```
repo/
├── app.bsky.feed.post/
│   ├── 3jui7akfj2k2a
│   └── 3jui8bklg3l3b
├── app.bsky.feed.like/
│   └── ...
└── fm.plyr.track/
    └── ...
```

each collection corresponds to a lexicon. applications read and write to collections they understand.
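
reading one collection from a repo goes through the `com.atproto.repo.listRecords` XRPC query on the user's PDS. a minimal sketch of building that request URL follows; the PDS host is a placeholder, `list_records_url` is a hypothetical helper, and the HTTP call itself is left out.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    pds_host: str,
    repo: str,
    collection: str,
    limit: int = 50,
    cursor: Optional[str] = None,
) -> str:
    """Build a com.atproto.repo.listRecords URL for one collection in one repo."""
    params = {"repo": repo, "collection": collection, "limit": limit}
    if cursor is not None:
        params["cursor"] = cursor  # pass the previous response's cursor to page
    return f"{pds_host}/xrpc/com.atproto.repo.listRecords?{urlencode(params)}"


# "https://pds.example.com" is a placeholder host
print(list_records_url("https://pds.example.com", "did:plc:xyz", "fm.plyr.track"))
```

the response is a JSON object with a `records` array (each entry has `uri`, `cid`, and `value`) and an optional `cursor` for the next page.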

## local indexing

querying across PDSes is slow. applications maintain local indexes:

```sql
-- plyr.fm indexes fm.plyr.track records
CREATE TABLE tracks (
    id SERIAL PRIMARY KEY,
    did TEXT NOT NULL,
    rkey TEXT NOT NULL,
    uri TEXT NOT NULL,
    cid TEXT,
    title TEXT NOT NULL,
    artist TEXT NOT NULL,
    -- ... application-specific fields
    UNIQUE(did, rkey)
);
```

when a user logs in, the app syncs their records from their PDS into the local database. background jobs keep the indexes fresh.

from [plyr.fm](https://github.com/zzstoatzz/plyr.fm) - indexes tracks, likes, comments, playlists locally.
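
a sketch of the sync step, using an in-memory SQLite database in place of Postgres so it runs standalone. the input shape (`uri`, `cid`, `value`) follows `com.atproto.repo.listRecords` output; the `title` and `artist` fields inside `value` are assumed to match the table columns, and `index_track` is a hypothetical helper rather than plyr.fm's actual code.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tracks (
    id INTEGER PRIMARY KEY,
    did TEXT NOT NULL,
    rkey TEXT NOT NULL,
    uri TEXT NOT NULL,
    cid TEXT,
    title TEXT NOT NULL,
    artist TEXT NOT NULL,
    UNIQUE(did, rkey)
)
"""

UPSERT = """
INSERT INTO tracks (did, rkey, uri, cid, title, artist)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(did, rkey) DO UPDATE SET
    uri = excluded.uri,
    cid = excluded.cid,
    title = excluded.title,
    artist = excluded.artist
"""


def index_track(conn: sqlite3.Connection, record: dict) -> None:
    """Insert or refresh one track row; (did, rkey) identifies the record."""
    did, _collection, rkey = record["uri"][len("at://"):].split("/")
    value = record["value"]
    conn.execute(
        UPSERT,
        (did, rkey, record["uri"], record["cid"], value["title"], value["artist"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

record = {
    "uri": "at://did:plc:xyz/fm.plyr.track/3jui7akfj2k2a",
    "cid": "bafyreig-v1",  # placeholder CID
    "value": {"title": "demo", "artist": "someone"},
}
index_track(conn, record)
# the user edits the track: same URI, new CID, new content
index_track(conn, {**record, "cid": "bafyreig-v2", "value": {"title": "demo (remaster)", "artist": "someone"}})
conn.commit()
print(conn.execute("SELECT rkey, cid, title FROM tracks").fetchall())
```

keying the upsert on `(did, rkey)` means re-syncing is idempotent: an edited record overwrites its row instead of creating a duplicate.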

## why this matters

the "each user is one database" model is the foundation of **atmospheric computing**:

- **portability**: your "personal cloud" is yours. if a host fails, you move your data elsewhere.
- **verification**: trust is cryptographic. you verify the data signature, not the provider.
- **aggregation**: applications weave together data from millions of personal clouds into a cohesive "atmosphere."
- **interop**: apps share schemas, so my music player can read your social graph.