# 2023-04-06 Project Proposal: Popularity Calculation Optimizations
Author: @stacimc
## Reviewers
- [x] @AetherUnbound
- [x] @obulat
## Project summary
Reduce the time required for a data refresh by optimizing the popularity calculation steps, which are currently both time-consuming and a required part of the refresh. These optimizations will enable the data refresh to run on a regular, automated schedule.
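For context, the standardized popularity score these steps compute can be sketched roughly as follows. This is a simplified illustration, not the catalog's actual SQL: the function names are invented, and the 0.85 percentile target is an assumption.

```python
def popularity_constant(percentile_value: float, percentile: float = 0.85) -> float:
    """Pick a constant so that a raw value at the chosen percentile maps
    to a standardized score equal to that percentile (e.g. 0.85)."""
    return percentile_value * (1 - percentile) / percentile


def standardized_popularity(raw_value: float, constant: float) -> float:
    """Squash a raw provider metric (views, downloads, ...) into [0, 1)."""
    return raw_value / (raw_value + constant)


# If the 85th-percentile view count for a provider is 200, an image with
# 1,000 views scores ~0.97 and one with 10 views scores ~0.22.
c = popularity_constant(200)
print(standardized_popularity(1_000, c), standardized_popularity(10, c))
```

Recomputing the constants requires scanning each provider's raw metrics for the percentile value, which is part of what makes these steps expensive today.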
## Goals
Yearly goal: Data Inertia
## Requirements
Any changes to the data refresh and popularity calculations must meet the following criteria:
- A full data refresh should reliably complete in less than one week for each media type.
- During a data refresh, all updates made to existing records since the previous refresh should propagate to the API DB and Elasticsearch.
- During a data refresh, all new records ingested since the previous refresh should propagate to the API DB and Elasticsearch.
- Records for providers that support popularity metrics should have standardized popularity scores as soon as the records become available in Elasticsearch.
- Updating popularity constants and standardized scores should not affect the data refresh runtime.
- There must be no ‘down-time’ in the Catalog during which writes to the media tables are locked (i.e., ingestion workflows should be able to continue as normal at all times).
- Query time in the API must not increase.
- Ideally, there should be no regression in how regularly popularity constants are recalculated and standardized scores refreshed; currently this happens monthly. However, as long as the stated requirements are met, recalculating constants and scores quarterly is acceptable.
## Success
For this project to be considered a success, we must be able to turn on the data refresh DAGs with an automated weekly schedule. This requires that the data refresh complete in under a week.
The requirements in the section above should also be met to prevent regressions in data quality and availability.
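As a concrete illustration of the end state, turning the refresh on might look like the Airflow sketch below. The DAG id and task body are hypothetical placeholders; the point is the weekly schedule and the under-a-week timeout.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    dag_id="image_data_refresh",       # hypothetical id
    schedule="@weekly",
    start_date=datetime(2023, 4, 1),
    catchup=False,
    dagrun_timeout=timedelta(days=7),  # a run exceeding a week fails loudly
)
def image_data_refresh():
    @task
    def trigger_data_refresh():
        ...  # placeholder: kick off the refresh on the ingestion server

    trigger_data_refresh()


image_data_refresh()
```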
## Participants and stakeholders
## Infrastructure
This project will require:

- changes to the Catalog database schema
- creation and modification of Airflow DAGs related to the data refresh
- minor changes to the ingestion server
Specifics will be detailed in the implementation plan. For testing, this will require connecting Airflow with the staging ingestion server and API database.
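Airflow resolves external services through named connections, so the staging hookup can be expressed as connection URIs. A minimal sketch, assuming hypothetical connection IDs and hosts (in practice these would be set in the deployment environment, not in code):

```python
import os

# Airflow reads connections from AIRFLOW_CONN_<ID> environment variables.
# Both the IDs and the hosts below are hypothetical placeholders for the
# staging ingestion server and API database.
os.environ["AIRFLOW_CONN_STAGING_INGESTION_SERVER"] = (
    "http://staging-ingestion.example.internal:8001"
)
os.environ["AIRFLOW_CONN_STAGING_API_DB"] = (
    "postgres://deploy:secret@staging-api-db.example.internal:5432/openledger"
)
```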
I do not anticipate changes to the Elasticsearch mapping or queries. However, if that changes, we may need access to the Search Relevancy Sandbox in order to test.
## Accessibility
This project should not involve any frontend-facing changes. However, it will touch many complex parts of ingestion, popularity calculation, and the data refresh, so there must be a significant effort at each stage to document both previously undocumented processes and the additions made here.
## Marketing
This change is largely internal and may present few marketing opportunities. However, since it will enable us to get automated data refreshes running consistently, there may be a significant increase in records in Elasticsearch after the project completes. We should check the record count difference and make an announcement if it is significant.
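Checking that difference is straightforward with the Elasticsearch client; a quick sketch, assuming index names and a reachable cluster URL:

```python
from elasticsearch import Elasticsearch

# Assumed cluster URL and index names; adjust for the real environment.
es = Elasticsearch("http://localhost:9200")
for index in ("image", "audio"):
    print(index, es.count(index=index)["count"])
```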
## Required implementation plans
### Removing the popularity steps from the data refresh DAG
This plan will describe changes needed to separate the popularity calculations from the data refresh, including necessary changes to:

- The data structure
- New and existing DAGs
- The ingestion server
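To make the separation concrete, one possible shape is a standalone DAG that recalculates popularity data on its own schedule, independent of the data refresh. This is illustrative only; the implementation plan will settle the real design, and all identifiers below are hypothetical.

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.postgres.operators.postgres import PostgresOperator


@dag(
    dag_id="recalculate_image_popularity",  # hypothetical id
    schedule="@monthly",                    # matches the current cadence
    start_date=datetime(2023, 4, 1),
    catchup=False,
)
def recalculate_image_popularity():
    # Refreshing a materialized view concurrently is one way to recompute
    # constants and scores without locking writes to the media tables; the
    # view name and connection id are placeholders.
    PostgresOperator(
        task_id="refresh_popularity_constants",
        postgres_conn_id="postgres_openledger_upstream",
        sql="REFRESH MATERIALIZED VIEW CONCURRENTLY image_popularity_constants;",
    )


recalculate_image_popularity()
```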