📚 Personal bits of knowledge
1# Unified Schema Design
2
3- The goal is to create a single, [unified schema across datasets](http://web.archive.org/web/20240720064056/https://docs.cybersyn.com/getting-started/concepts/unified-schema). This schema aims to strike a balance between flexibility to accommodate arbitrarily shaped data along with consistency in core tables.
4- Datasets are built around two concepts: entities and timeseries.
5 - Entities are concrete things or objects (a geography, a company, a mortgage application).
6 - Timeseries are abstract measures (ie. statistics) related to an entity and a date.
7- The core tables are:
8 - `entities`: Contains the entities that are being tracked. For example, Spain, Madrid, etc.
9 - Should be something like `province_index` or `weather_station_index` to be able to join with the timeseries.
10 - This table contains permanent characteristics describing an entity. E.g: for Provinces, the name, the region.
11 - Each row represents a distinct entity. The table is wide, in that immutable characteristics are expressed in their own fields.
12 - `attributes`: Attributes are descriptors of a timeseries. An attribute is the equivalent of a characteristic except for the abstract timeseries rather than the concrete entity.
13 - Columns:
14 - `variable_id`: Unique identifier for the attribute
15 - `name`: Name of the attribute
16 - `description`: Description of the attribute
17 - `unit`: Unit of the attribute
18 - `source`: Source of the attribute
19 - `frequency`: Frequency of the attribute (daily, monthly, etc.)
20 - `measurement_type`: Type of measurement (e.g. nominal, ordinal, interval, ratio, percentage)
21 - Metadata columns;
22 - `category`: Category of the attribute
23 - `namespace`: Namespace of the attribute
24 - `tags`: JSON with tags of the attribute?
25 - `aggregation_function`: Aggregation function to use when aggregating the attribute
26 - `timeseries`: Timeseries are abstract measures (ie. statistics, metrics) related to an entity and a date. Timeseries are temporal statistics or measures centered around an entity and timestamp. For example, GDP of Spain, population of Madrid, etc. Timeseries are abstract concepts (ie. a measure) rather than a concrete thing.
27 - Could be something like `weather_timeseries` to be able to join with the entities.
28 - Columns:
29 - `variable_id`: Unique identifier for the attribute
30 - `geography_id`: Unique identifier for the geography
31 - `date`: Date of the metric
32 - `value`: Value of the metric
33 - `relationships`: Contains the relationships between entities. For example, Spain is composed of provinces, Madrid is a province, etc.
34 - Relationships can also be temporal – valid for an interval defined by specific start and end dates.
35 - `characteristics`: Descriptors of an entity that are temporal. They have a start date and end date.