2023-03-31 Project Proposal: Search relevancy sandbox

Author: @aetherunbound

Reviewers

Project summary

Develop mechanisms and workflows for sustainably creating & updating our staging API database & Elasticsearch cluster, in order to enable testing changes to Elasticsearch indices. The staging setup will reflect production in two ways, using two different indices: one proportional to production by provider and one at production-level data volume.

Goals

Yearly goal: Result Relevancy

Requirements

The following should be available once this project is complete:

  1. A mechanism by which maintainers can easily update the staging API database with recent data.

  2. A mechanism by which maintainers can create/update a small but proportional-to-production-by-provider index.

  3. A mechanism by which maintainers can create/update a production-data-volume sized index.

  4. A mechanism by which maintainers can easily deploy & iterate on new Elasticsearch indexing configurations.

  5. A mechanism by which maintainers can easily point the staging Elasticsearch index aliases (currently image and audio) to one of the above.

Many of these mechanisms will likely be built in Airflow.
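
As a loose illustration of the shape such a mechanism could take, below is a minimal, manually triggered Airflow DAG skeleton. The DAG id and task are hypothetical placeholders (no such DAG exists yet), and it assumes Airflow 2.4+ for the `schedule` argument.

```python
# Minimal sketch (not an existing DAG): how one of the mechanisms above might
# be exposed as a manually triggered Airflow DAG.
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="update_staging_api_database",  # hypothetical, for requirement 1
    schedule=None,  # no schedule; maintainers trigger it manually
    start_date=datetime(2023, 3, 1),
    catchup=False,
    tags=["staging"],
)
def update_staging_api_database():
    @task
    def refresh_staging_db():
        # Placeholder: update the staging API database with recent data; the
        # actual mechanism is deferred to the implementation plan.
        ...

    refresh_staging_db()


update_staging_api_database()
```

A real DAG for any of the requirements would replace the placeholder task with whatever logic the corresponding implementation plan settles on.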

Proportional by provider

The proportional-by-provider index will be a subset of the production data, constructed so that each provider's share of records roughly matches its share in production. This index will allow us to iterate rapidly on the following:

  • Changes affecting Elasticsearch index configuration (e.g. shard sizes, slicing, etc.)

  • Changes to the ingestion server’s configuration which require running a data refresh end-to-end

  • Measuring index performance

  • Integration testing dead link & thumbnail checks across all providers

This could be done by the following steps (a rough sketch follows the list):

  1. Querying the /stats endpoint of the production API (e.g. https://api.openverse.engineering/v1/images/stats/).

  2. Calculating percentage-per-provider.

  3. Computing the number of results per provider from these percentages, subject to the condition that the smallest provider has at least N results (e.g. a minimum of 10 results).

  4. Pulling the computed number of results per provider into either a Postgres table or an Elasticsearch index (the specifics of this step will be left for the implementation plan).
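
A minimal sketch of steps 1–3, assuming the production `/stats` endpoint returns a JSON list of objects with `source_name` and `media_count` fields (the field names are an assumption about the response shape):

```python
# Sketch of steps 1-3: derive per-provider record counts for the proportional
# index. `source_name` and `media_count` are assumed field names.
import math

import requests

STATS_URL = "https://api.openverse.engineering/v1/images/stats/"


def proportional_counts(minimum: int = 10) -> dict[str, int]:
    stats = requests.get(STATS_URL, timeout=30).json()
    totals = {entry["source_name"]: entry["media_count"] for entry in stats}
    # Scale every provider by the same factor so proportions are preserved
    # and the smallest provider ends up with at least `minimum` records.
    scale = minimum / min(totals.values())
    return {name: math.ceil(count * scale) for name, count in totals.items()}
```

The per-provider counts produced here would then drive step 4, whether the records are sampled into a Postgres table or an Elasticsearch index.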

This index will allow us to assess how changes to the Elasticsearch index or the API will affect results using a smaller total result count. It will also allow us to test API results from all currently available providers.

Production data volume

The production-data-volume index will be an index of similar size (i.e. on the same order of magnitude in document count) to production. This is how the data refresh is currently configured, so no additional setup work is required.

This index will allow us to test how queries and indexing operations perform at a data volume consistent with production.

Success

This project can be considered a success once any maintainer can make adjustments to the staging API database & Elasticsearch index based on the requirements described above.

Participants and stakeholders

Infrastructure

This project may require infrastructure work to support rapid iteration on the ingestion server. Further details will be determined in the implementation plan.

The project will also likely involve the creation or modification of Airflow DAGs related to the data refresh, as well as infrastructure modifications. This will require interfacing Airflow with the staging ingestion server, API database, and Elasticsearch cluster.

Accessibility

Not every member of the maintainer team is intimately familiar with Airflow, so we will need to provide clear instructions for maintainers on how to use Airflow to run any of the above mechanisms.

Marketing

No marketing is required as this is an improvement internal to the team.

Required implementation plans

In the order they should be completed:

  1. Staging API DB update procedure

    • This plan will describe a DAG which can be triggered to update the staging API database.

  2. Rapid iteration on ingestion server index configuration

    • This plan will describe how rapid iteration of Elasticsearch index configurations can happen. This can be done via Airflow interacting directly with Elasticsearch rather than having to process requests through the ingestion server. Airflow would only need to be aware of the existing index settings and how to augment/adjust them for the new index. (See this comment for further inspiration)

    • It should be assessed as part of this implementation plan whether it would be easier to convert the ingestion server to ECS or provide a mechanism on the existing EC2 infrastructure to update the Elasticsearch index configuration without issuing a deployment (similar to the dag-sync script for the catalog).

  3. Staging Elasticsearch reindex DAGs for both potential index types (these will be subsets of the full data refresh)

    • This plan will describe the DAG or DAGs which will be used to create/update both the proportional-by-provider and production-data-volume indices.

    • It will also describe the mechanism by which maintainers can rapidly switch which index the staging API uses. This could be done in two separate ways: a DAG which allows changing the primary index alias, or a set of changes to the API which would allow queries to specify which index they use. The implementation plan should explore and describe both options. A rough sketch covering this plan and the previous one follows this list.
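
As a hypothetical sketch of how the second and third plans could fit together if Airflow talks to Elasticsearch directly, the snippet below copies an existing staging index's settings and mappings into a new index with one adjusted setting, reindexes into it, and then repoints the `image` alias (the first of the two options above). The host, index names, and the adjusted setting are placeholders, and the 8.x `elasticsearch` Python client is assumed.

```python
# Hypothetical sketch of plans 2 and 3 combined, assuming Airflow can reach
# the staging cluster directly. Host, index names, and the adjusted setting
# are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://staging-elasticsearch:9200")  # placeholder host

source_index = "image-000"       # existing staging index (placeholder name)
new_index = "image-new-config"   # index with the configuration under test

# Start from the existing index's configuration and adjust only the setting
# being tested, so Airflow never needs to know the full configuration.
settings = es.indices.get_settings(index=source_index)[source_index]["settings"]
mappings = es.indices.get_mapping(index=source_index)[source_index]["mappings"]
for read_only in ("uuid", "provided_name", "creation_date", "version"):
    settings["index"].pop(read_only, None)
settings["index"]["number_of_shards"] = 1  # example adjustment under test

es.indices.create(index=new_index, settings=settings, mappings=mappings)

# Copy documents into the new index (a production-sized index would do this
# asynchronously rather than blocking on completion).
es.reindex(
    source={"index": source_index},
    dest={"index": new_index},
    wait_for_completion=True,
)

# Plan 3, first option: atomically repoint the staging `image` alias, assuming
# it currently points at `source_index`.
es.indices.update_aliases(
    actions=[
        {"remove": {"index": source_index, "alias": "image"}},
        {"add": {"index": new_index, "alias": "image"}},
    ]
)
```

The second option from plan 3 (letting API queries name an index directly) would not need the alias step at all; weighing the two approaches is left to the implementation plan.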