The Openverse Catalog consists of two core components: a Postgres instance and an Airflow instance. This document describes the environments & deployment procedures for both components.
Airflow has two different deployment mechanisms: service deployments and DAG deployments. Service deployments occur when the Airflow service itself or any dependencies needs to be updated. DAG deployments occur when DAG definitions or operational code has changed.
The Catalog’s Airflow instance is hosted via AWS’s Elastic Compute Cloud (EC2) service. The EC2 instance is managed by Terraform. If a change is made to the service initialization bash script, Terraform will recognize this change and can deploy a new EC2 instance with the updated configuration. This change does not occur automatically and will need to be manually initiated.
Currently the webserver, scheduler, and worker(s) are all run within a single
docker container on the EC2 instance as defined by
is used to spin up Airflow in production.
Note: Service deployments are only necessary in the following conditions:
The Docker image has been modified
An environment variable needs to be added/removed/changed
A change to the infrastructure has been made (e.g. networking/routing changes)
For DAG updates and code-only changes, see the DAG deployments section.
The “🔒” icon below notes that the action must be performed by a core maintainer with infrastructure access.
After the release GitHub action has run, check that the new Docker tag corresponding to the release exists on ghcr.io.
Ensure that there are no actively running DAGs.
🔒 Update the catalog version in the infrastructure repo by running
just bump prod catalog-airflow
🔒 Initiate the deployment via
just plan prod catalog-airflow(and subsequently
just apply prod catalog-airflowif the plan looks correct).
Due to the nature of Airflow, changes made to DAG definitions or operational code (e.g. any files that are used by the DAG) will be immediately picked up on the next task run for a DAG. This means that we can update the python code in-place and the next DAG run or task in a currently running DAG will use the updated code. In these cases, a new EC2 instance does not need to be deployed.
script is used in production to regularly update the repository (and thus the
DAG files) on the running EC2 instance.
Since this process happens automatically, the only precursor to a DAG deployment
is merging the relevant PR into
Note: New DAGs are not automatically enabled and will need to be turned on once deployed.
The Catalog is also backed by a Postgres instance, which serves as the primary data warehouse for all ingested media data.
This instance is hosted on AWS using RDS. The
instance can be accessed using
pgcli or a similar
client with the jumphost defined in the infrastructure repository. For more
information on this process see the
documentation on connecting to the jumphost
(note that you will need additional access to view this document, please reach
out to the maintainers if you’re interested).
Any migrations to the Catalog database must either be performed by hand or as part of a DAG’s normal operation (see: iNaturalist).