📚 Personal bits of knowledge
at main 35 lines 2.9 kB view raw view rendered
1# Unified Schema Design 2 3- The goal is to create a single, [unified schema across datasets](http://web.archive.org/web/20240720064056/https://docs.cybersyn.com/getting-started/concepts/unified-schema). This schema aims to strike a balance between flexibility to accommodate arbitrarily shaped data along with consistency in core tables. 4- Datasets are built around two concepts: entities and timeseries. 5 - Entities are concrete things or objects (a geography, a company, a mortgage application). 6 - Timeseries are abstract measures (ie. statistics) related to an entity and a date. 7- The core tables are: 8 - `entities`: Contains the entities that are being tracked. For example, Spain, Madrid, etc. 9 - Should be something like `province_index` or `weather_station_index` to be able to join with the timeseries. 10 - This table contains permanent characteristics describing an entity. E.g: for Provinces, the name, the region. 11 - Each row represents a distinct entity. The table is wide, in that immutable characteristics are expressed in their own fields. 12 - `attributes`: Attributes are descriptors of a timeseries. An attribute is the equivalent of a characteristic except for the abstract timeseries rather than the concrete entity. 13 - Columns: 14 - `variable_id`: Unique identifier for the attribute 15 - `name`: Name of the attribute 16 - `description`: Description of the attribute 17 - `unit`: Unit of the attribute 18 - `source`: Source of the attribute 19 - `frequency`: Frequency of the attribute (daily, monthly, etc.) 20 - `measurement_type`: Type of measurement (e.g. nominal, ordinal, interval, ratio, percentage) 21 - Metadata columns; 22 - `category`: Category of the attribute 23 - `namespace`: Namespace of the attribute 24 - `tags`: JSON with tags of the attribute? 25 - `aggregation_function`: Aggregation function to use when aggregating the attribute 26 - `timeseries`: Timeseries are abstract measures (ie. statistics, metrics) related to an entity and a date. Timeseries are temporal statistics or measures centered around an entity and timestamp. For example, GDP of Spain, population of Madrid, etc. Timeseries are abstract concepts (ie. a measure) rather than a concrete thing. 27 - Could be something like `weather_timeseries` to be able to join with the entities. 28 - Columns: 29 - `variable_id`: Unique identifier for the attribute 30 - `geography_id`: Unique identifier for the geography 31 - `date`: Date of the metric 32 - `value`: Value of the metric 33 - `relationships`: Contains the relationships between entities. For example, Spain is composed of provinces, Madrid is a province, etc. 34 - Relationships can also be temporal – valid for an interval defined by specific start and end dates. 35 - `characteristics`: Descriptors of an entity that are temporal. They have a start date and end date.