2023-04-26 Implementation Plan: Document all media properties in the catalog¶

Reviewers¶

[x] @AetherUnbound
[x] @stacimc

Project links¶

Overview¶

A proposal to create a documentation page with a description of the media properties collected by the Catalog, and to keep it updated using CI.

Background¶

The catalog collects and transforms data from many different sources. The kind of data collected, transformations, and the expected data shape have changed over time. This plan will outline the creation of the documentation page about the media properties in the catalog.

User stories and user questions¶

As a maintainer/contributor, I need to know the list of available properties for each media type in the catalog database so that I can better understand how different media types are represented within the catalog and identify any gaps or inconsistencies in their properties.
As a contributor, I need to know the expected data type of each property so that I can correctly save the data in the catalog or use the data in the API.
As a contributor, I need to know how each property is transformed and validated so that I can ensure that I am collecting the correct data and can identify potential improvements.
As a contributor writing a provider API script, I need to know what information to collect from providers for each property so that the script collects data that is complete, valid for attribution, and improves search quality.
As a contributor writing a provider script, I need to know which format to convert a media property to so that I can ensure that the data is valid and can be saved in the database.
As a maintainer, I need to see whether a discrepancy between the data the API returns for a property and what is described in the table (e.g., a null value for a non-nullable property) is already documented so that I can open a new issue if necessary.
As an API user, I need to know exactly what each media property is and which data points from the providers are used for it (e.g., does the “creator” refer to the person who created the object or took the picture?) so that I can use the data in my application.
As a provider of openly-licensed media, I want to know what media properties Openverse collects so that I can change my API responses.

The information on data types, transformations, and validations will be extracted from the code and presented as a table. Other information, such as selection criteria and data discrepancies, will be written in a separate Markdown file, which is easier to read, write, and lint.

Output¶

The result of this project will be a markdown page with a table of media properties and a long-form description of each property.

The page will be published on https://docs.openverse.org/meta/media_properties.html with the title of “Media Properties”.

The source will be located at documentation/meta/media_properties.md.

See sample at media_properties_documentation.md.

This is a Markdown file generated by the just script. It contains the preamble, the tables with the data properties, and the long-form description of the properties from media_properties.md.

This file is used in the CI to compare the newly-generated file to the existing one. If there are differences, the CI will fail and the maintainers will need to update the documentation page.
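The comparison step could be sketched roughly as follows, using only the standard library. The function name and the file labels in the diff header are hypothetical; the real check may be implemented differently (for example, directly in the just recipe).

```python
import difflib


def docs_are_current(existing: str, generated: str) -> tuple[bool, str]:
    """Compare the committed page with a freshly generated one.

    Returns a flag saying whether they match and a unified diff
    that CI could print to show what is out of date.
    """
    diff = "".join(
        difflib.unified_diff(
            existing.splitlines(keepends=True),
            generated.splitlines(keepends=True),
            fromfile="committed",
            tofile="generated",
        )
    )
    # unified_diff yields nothing when the inputs are identical.
    return diff == "", diff
```

A CI job would read both files, call this function, print the diff, and exit with a non-zero status when the flag is false.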

Prior art¶

The catalog has a DAG documentation generator that extracts the docstrings from the DAGs and creates a page with the documentation. We can use the same approach to document media properties. The DAG doc generator runs the code inside Docker to extract DAG information because it requires the project dependencies (i.e., Airflow) in order to parse the DAG. In this project, we can simplify and use the ast module to parse static code and extract the docstrings, which makes the checks faster. This will, however, mean that some settings that are only computed at runtime will not be picked up and will have to be filled in during generation. One example of this is the nullable field in the Column class: it falls back to the value of not required during object initialization.

Inputs¶

This section describes where the data for the documentation will be coming from.

Existing Files¶

SQL DDL files¶

Located at docker/upstream_db/: 0003_openledger_image_schema.sql, 0004_openledger_image_view.sql, 0006_openledger_audio_schema.sql, 0007_openledger_audio_view.sql.

These files define the database schema for the catalog. They will be parsed by the Python script and converted into a Markdown table with the data properties. Initially, we will use the materialized views (image_view and audio_view) to get the list of properties, including some that are not available in the main tables (popularity). After Popularity calculation optimizations (Matview refresh) is implemented and we drop the materialized views, only the main tables (image and audio) will be used.

A Python function will extract the information from the SQL files into a media_properties dictionary keyed by the name of the media property, with the following values: SQL data type and SQL constraints. More data will be added to this dictionary in later steps. It will also be possible to add API information to this dictionary in the API portion of this project.

media_properties = {
    "title": {
        "sql_properties": {
            "type": "character varying",
            "sql_constraints": "(5000)",
        },
    }
}
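One way to build this dictionary is a small regex-based parser. The sketch below assumes pg_dump-style DDL with one column per line, lowercase type names, and uppercase constraint keywords. The sample DDL and the function name are made up for illustration and are not the actual schema files.

```python
import re

# Hypothetical DDL in the style of the schema files.
SAMPLE_DDL = """
CREATE TABLE public.image (
    identifier uuid PRIMARY KEY DEFAULT public.uuid_generate_v4(),
    title character varying(5000),
    provider character varying(80) NOT NULL,
    width integer
);
"""

# Captures the table name and everything between the outer parentheses.
TABLE_RE = re.compile(r"CREATE TABLE\s+(?:\w+\.)?(\w+)\s*\((.*?)\n\);", re.DOTALL)
# name, lowercase type words, then anything starting with "(" or an uppercase keyword.
COLUMN_RE = re.compile(
    r"^(?P<name>[a-z_]+)\s+(?P<type>[a-z ]+?)\s*(?P<constraints>[(A-Z].*)?$"
)


def parse_ddl(ddl: str) -> dict[str, dict]:
    """Return {column name: {"sql_properties": {...}}} for every table in the DDL."""
    media_properties = {}
    for _table, body in TABLE_RE.findall(ddl):
        for line in body.splitlines():
            match = COLUMN_RE.match(line.strip().rstrip(","))
            if match is None:
                continue  # blank lines and anything that is not a column definition
            media_properties[match["name"]] = {
                "sql_properties": {
                    "type": match["type"],
                    "sql_constraints": match["constraints"] or "",
                }
            }
    return media_properties
```

For the sample above, `parse_ddl(SAMPLE_DDL)["title"]` has the same shape as the dictionary shown earlier. A regex keeps the dependency list empty, but it will miss multi-line column definitions, so a real implementation may need something more robust.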

`columns.py`¶

Located in catalog/dags/common/storage/columns.py.

This file contains the Column class that defines the validation and transformation rules for the media properties. This file will be parsed by the Python script and the information will be added to the “Python properties” column in the table with the data properties.

The Column class and its child classes describe the validations we use to write the data collected by provider scripts. The add_item methods of the media store classes have docstrings with short descriptions of what each property is. The information from these sources will be added to the media_properties dictionary created from the SQL DDL files.

media_properties = {
    "title": {
        "sql_properties": {...},  # SQL properties extracted from the DDL files
        "python_properties": {
            "column": "[StringColumn](https://github.com/WordPress/openverse/blob/b4adc87c4e3cd7c9bdc879affda17fa21791c9ad/catalog/dags/common/storage/columns.py#L361-L401)",
            "required": False,
            "nullable": True,
            "truncate": True,
            "size": 5000,
        },
    }
}

`media.py`, `image.py` and `audio.py`¶

Located in catalog/dags/common/storage/.

These files contain the MediaStore, ImageStore and AudioStore classes that validate and transform the data. They are not easy to parse, so they will only be used for the CI checks. When a PR adds changes to these files, the CI will show a warning that the documentation needs to be updated.

New files¶

This project will create the following documents that will be used to generate the final documentation page posted on https://docs.openverse.org/meta/media_properties.html.

`media_properties.md`¶

Located at catalog/utilities/media_docs_gen/media_properties.md.

See sample at media_properties.md.

This is a markdown file with the description of the media properties. This file is manually written by the maintainers and contains the detailed information about the shape and kind of data we expect to have for the property, how to select it from the provider, and any inconsistencies between the data we have in the database and the data we expect to have. This file is parsed in step 3 and the information is added to the media_properties dictionary. This dictionary will be used to write the final page on the documentation site.

Information to include in `media_properties.md` for each property¶

Media types: Image, Audio or both
Short description
Names used in the provider scripts, if different from the database name
Shape of the object (if it is not a simple type)
Selection criteria
Normalization and validation performed in the MediaStore class and the relevant Column
Inconsistencies in the database data. See sample for identifier and tags in media_properties.md.
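To merge these descriptions into the media_properties dictionary, the script first has to split the file into one section per property. The sketch below assumes each property is introduced by a level-3 heading with the property name. That heading convention, the function name, and the placeholder text in the sample are assumptions, since the plan does not fix the file layout.

```python
import re

# Placeholder content; the real descriptions are written by maintainers.
SAMPLE = """\
### identifier

_Media types_: Image, Audio

Placeholder description of the identifier.

### title

_Media types_: Image, Audio

Placeholder description of the title.
"""


def split_descriptions(markdown: str, level: int = 3) -> dict[str, str]:
    """Map each heading of the given level to the text that follows it."""
    heading = re.compile(rf"^{'#' * level} +(.+?)[ \t]*$", re.MULTILINE)
    matches = list(heading.finditer(markdown))
    sections = {}
    for index, match in enumerate(matches):
        # A section runs until the next heading of the same level or the end of the file.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        sections[match.group(1)] = markdown[match.end():end].strip()
    return sections
```

Each value can then be stored under a key such as "description" next to the SQL and Python properties of the same name.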

`preamble.md` and `postamble.md`¶

Located at catalog/utilities/media_docs_gen/preamble.md and catalog/utilities/media_docs_gen/postamble.md.

See samples at preamble.md and postamble.md. These are Markdown files with the preamble and the postamble (i.e., the general notes that apply to more than one media property) for the documentation page.

`generate_media_properties_documentation.py`¶

Located at catalog/utilities/media_docs_gen/generate_media_properties_documentation.py.

This is a Python script that parses the SQL DDL files, Python files, and the media_properties.md file to generate the media_properties_documentation.md file. This script is run by the just script.
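The final assembly step might look roughly like this. The table columns, the heading level, and both function names are illustrative assumptions; the actual layout of the generated page is shown in the samples linked from this plan.

```python
def render_table(media_properties: dict[str, dict]) -> str:
    """Render one Markdown table row per media property."""
    lines = [
        "| Name | DB Field | Python Column |",
        "| --- | --- | --- |",
    ]
    for name, props in media_properties.items():
        sql = props.get("sql_properties", {})
        python = props.get("python_properties", {})
        db_field = f"{sql.get('type', '')} {sql.get('sql_constraints', '')}".strip()
        python_cell = ", ".join(f"{key}={value}" for key, value in python.items())
        lines.append(f"| `{name}` | {db_field} | {python_cell} |")
    return "\n".join(lines)


def build_page(preamble: str, table: str, descriptions: dict[str, str], postamble: str) -> str:
    """Join the preamble, the table, the long-form descriptions, and the postamble."""
    sections = [f"### {name}\n\n{text}" for name, text in descriptions.items()]
    parts = [preamble.strip(), table, *sections, postamble.strip()]
    return "\n\n".join(parts) + "\n"
```

Keeping rendering separate from parsing means the same dictionary could later feed a different output, such as the API documentation mentioned above.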

Outlined Steps (parallelizable workstreams)¶

1. Create a Python script that parses the DDL files, Python files and media_properties.md file to generate the media_properties_documentation.md file.
2. Create a just script that runs the Python script from step 1 and posts a note about how to update the documentation if there are differences.
3. Create markdown files: general.md and media_properties.md with the information about the media properties. This will require going through the issues and data in the database to collect the information about any inconsistencies.
4. Create checks for changes in MediaStore files that will post a note about how to update the documentation if there are differences (in pre-commit).
5. Add the new checks to the Openverse repository's CI workflows.

Dependencies¶

We can rely on the Python standard library as much as possible for this project.

Alternative (or additional) Approaches¶

Another option for documenting the media properties is to pass a dataclass instead of the named parameters to the add_item method of the ImageStore and AudioStore. This would bring the documentation closer to the code, making it easier to update, and would have the added benefit of allowing us to check for required properties sooner. However, it would require a lot of refactoring and, if we decide to pursue it, would need to be a separate project.