Catalog Deployments#

The Openverse Catalog consists of two core components: a Postgres instance and an Airflow instance. This document describes the environments & deployment procedures for both components.



Airflow has two different deployment mechanisms: service deployments and DAG deployments. Service deployments occur when the Airflow service itself or any dependencies needs to be updated. DAG deployments occur when DAG definitions or operational code has changed.

Service deployments#

The Catalog’s Airflow instance is hosted via AWS’s Elastic Compute Cloud (EC2) service. The EC2 instance is managed by Terraform. If a change is made to the service initialization bash script, Terraform will recognize this change and can deploy a new EC2 instance with the updated configuration. This change does not occur automatically and will need to be manually initiated.

Currently the webserver, scheduler, and worker(s) are all run within a single docker container on the EC2 instance as defined by the Airflow Dockerfile. The docker-compose.yml is used to spin up Airflow in production.

Note: Service deployments are only necessary in the following conditions:

  • The Docker image has been modified

  • An environment variable needs to be added/removed/changed

  • A change to the infrastructure has been made (e.g. networking/routing changes)

For DAG updates and code-only changes, see the DAG deployments section.

Deployment workflow#

The “🔒” icon below notes that the action must be performed by a core maintainer with infrastructure access.

  1. Cut a new release on GitHub.

  2. After the release GitHub action has run, check that the new Docker tag corresponding to the release exists on

  3. Ensure that there are no actively running DAGs.

  4. 🔒 Update the catalog version in the infrastructure repo by running just bump prod catalog-airflow

  5. 🔒 Initiate the deployment via just plan prod catalog-airflow (and subsequently just apply prod catalog-airflow if the plan looks correct).

DAG deployments#

Due to the nature of Airflow, changes made to DAG definitions or operational code (e.g. any files that are used by the DAG) will be immediately picked up on the next task run for a DAG. This means that we can update the python code in-place and the next DAG run or task in a currently running DAG will use the updated code. In these cases, a new EC2 instance does not need to be deployed.

The script is used in production to regularly update the repository (and thus the DAG files) on the running EC2 instance.

Deployment workflow#

Since this process happens automatically, the only precursor to a DAG deployment is merging the relevant PR into main.

Note: New DAGs are not automatically enabled and will need to be turned on once deployed.


The Catalog is also backed by a Postgres instance, which serves as the primary data warehouse for all ingested media data.

This instance is hosted on AWS using RDS. The instance can be accessed using pgcli or a similar client with the jumphost defined in the infrastructure repository. For more information on this process see the documentation on connecting to the jumphost (note that you will need additional access to view this document, please reach out to the maintainers if you’re interested).


Any migrations to the Catalog database must either be performed by hand or as part of a DAG’s normal operation (see: iNaturalist).