2023-04-06 Implementation Plan: Update Staging Database#
This document describes how we will implement a mechanism for updating the staging database with the latest data from the production database.
There are two primary products of this plan. The first will be a DAG (scheduled
`@monthly`) which will recreate the staging database from the most recent
snapshot of the production API database. The second will be modifications to the
Django Admin UI notifying maintainers of the next scheduled staging database recreation.
This will be accomplished by:
Use a new `SKIP_STAGING_DATABASE_RESTORE` Airflow Variable to control whether the DAG should run or not. This will allow us to skip the DAG in the case where we have made diverging changes to the staging database for testing a new feature that we do not want to be overwritten. If this Variable is set to `True`, the DAG should issue a Slack message notifying the maintainers of the skipped run and raise an exception to mark the run as skipped.
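The check above can be sketched as follows. Note that Airflow stores Variable values as strings, so the value needs to be parsed before it can be treated as a boolean; the function name here is an assumption, not part of the plan.

```python
# Values we treat as "skip the restore"; Airflow Variables are strings.
TRUTHY_VALUES = {"true", "1", "yes"}


def should_skip_restore(variable_value) -> bool:
    """Return True if SKIP_STAGING_DATABASE_RESTORE is set to a truthy value."""
    if variable_value is None:
        # Variable not set at all: run the restore as normal.
        return False
    return str(variable_value).strip().lower() in TRUTHY_VALUES
```

Within the DAG, a task could read the value with `Variable.get("SKIP_STAGING_DATABASE_RESTORE", default_var=None)`, send the Slack message, and then raise `AirflowSkipException`; the exact mechanism is left to implementation.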
Determine the most recent automated production snapshot using boto3's `describe_db_snapshots` function (an Airflow operator does not exist for this operation). This step should also check the `Status` value of the response for the most recent snapshot to ensure that it is `available`. In the case where the snapshot is not yet available, wait for the status of the most recent snapshot using Airflow's `RdsSnapshotExistenceSensor`.
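A minimal sketch of the snapshot lookup, assuming `rds_client` is a boto3 RDS client; only its `describe_db_snapshots` method is used, so a stub can stand in for testing. The function name is an assumption.

```python
def latest_automated_snapshot(rds_client, db_identifier):
    """Return the most recent automated snapshot for ``db_identifier``."""
    response = rds_client.describe_db_snapshots(
        DBInstanceIdentifier=db_identifier,
        SnapshotType="automated",
    )
    snapshots = response.get("DBSnapshots", [])
    if not snapshots:
        raise ValueError(f"No automated snapshots found for {db_identifier}")
    # The API does not guarantee ordering, so pick the newest by creation time.
    return max(snapshots, key=lambda s: s["SnapshotCreateTime"])
```

The caller would then check `snapshot["Status"] == "available"` and, if not, defer to the sensor described above.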
In parallel, gather the attributes of the staging database using boto3's `describe_db_instances` function (an Airflow operator does not exist for this operation). Namely, the following attributes should be gathered and set to ensure they match:

- Availability should be a single DB instance (not multi-AZ)
- It should be attached to our default Virtual Private Cloud (VPC)
- It should not be publicly accessible
- It will need access to the same VPC security groups as the existing staging database
- It should match the staging database's current instance size
- It should have 3000 GB of allocated gp2 storage
- It should use password authentication (this may not need to be changed from the default, as the snapshot should contain the same connection information)
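Gathering these attributes could look like the sketch below. The keys follow boto3's `describe_db_instances` response shape; the client is injected so a stub can stand in for a real boto3 RDS client, and the function name is an assumption.

```python
def staging_restore_config(rds_client, db_identifier):
    """Gather the attributes listed above from the existing staging instance."""
    instance = rds_client.describe_db_instances(
        DBInstanceIdentifier=db_identifier
    )["DBInstances"][0]
    return {
        "MultiAZ": instance["MultiAZ"],  # expected: False (single instance)
        "PubliclyAccessible": instance["PubliclyAccessible"],  # expected: False
        "DBInstanceClass": instance["DBInstanceClass"],
        "AllocatedStorage": instance["AllocatedStorage"],  # expected: 3000 (GB)
        "StorageType": instance["StorageType"],  # expected: "gp2"
        "VpcSecurityGroupIds": [
            group["VpcSecurityGroupId"]
            for group in instance["VpcSecurityGroups"]
        ],
    }
```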
Create a new database from this snapshot using boto3's `restore_db_instance_from_db_snapshot` function (an operator does not exist for this operation). This database will be named in a way that does not conflict with the existing staging database name, e.g. `dev-next-openverse-db`. The database configuration information from the previous step will be used to ensure the new database matches the old database's configuration exactly.
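The restore call could be sketched as below; the parameter set shown is an illustrative subset (storage size and engine settings are inherited from the snapshot itself), and the function name is an assumption.

```python
def restore_staging_copy(rds_client, snapshot_id, config):
    """Restore the snapshot to dev-next-openverse-db, reusing the
    configuration gathered from the existing staging instance."""
    return rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="dev-next-openverse-db",
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceClass=config["DBInstanceClass"],
        MultiAZ=config["MultiAZ"],
        PubliclyAccessible=config["PubliclyAccessible"],
        VpcSecurityGroupIds=config["VpcSecurityGroupIds"],
    )
```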
Rename the old staging database to `dev-old-openverse-db` using boto3's `modify_db_instance` function (an operator does not exist for this operation). We must set the `ApplyImmediately` option here to `True` to ensure this operation happens immediately rather than waiting for the next maintenance window.
Wait for the old database rename to be complete with the `RdsDbSensor` (we may need retries on this step, since the database may not be initially available under the new name when the sensor first starts). Note: this will cause a temporary outage of the staging API; see the alternatives section for why this is the case. A Slack notification should be sent out at the start of this step to alert the team of the outage, and again once the outage is resolved.
Rename the new database to the original staging database name using `modify_db_instance`. (Noting that `ApplyImmediately` should be set to `True` here as well.)
Wait for the new database rename to be complete with the `RdsDbSensor` (we may need retries on this step, since the database may not be initially available under the new name when the sensor first starts).
If the previous steps fail, rename `dev-old-openverse-db` back to the original staging database name to restore service. Otherwise, `dev-old-openverse-db` can be deleted using the `RdsDeleteDbInstanceOperator`. We should use the following configuration options:

- `wait_for_completion` should be left as `True` here so this step will hang until the database removal is complete.
- `SkipFinalSnapshot` should be set to `True` to avoid creating a final snapshot of the database before deletion.
Once the database rename is complete, the following actions will also need to occur (they can occur simultaneously in the DAG):
The staging API must be deployed to apply any necessary migrations to the new copy of the database. In order to do this, we must augment the existing GitHub API class with additional functions for getting the latest GHCR image tag and triggering the deployment workflow. Once those functions exist, the steps for this piece would be:

Get the latest image tag for the staging API image.
Trigger the deployment workflow in a similar manner to the existing CI/CD workflow, using the tag from the previous step.
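The trigger half of this could use GitHub's documented `POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches` endpoint. Below is a sketch that builds only the request URL and payload (the repository and workflow file names passed by callers are placeholders; the real values would live in the GitHub API class).

```python
def workflow_dispatch_request(repo, workflow_file, ref, inputs):
    """Build the URL and JSON payload for a workflow_dispatch event."""
    url = (
        f"https://api.github.com/repos/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The dispatch endpoint accepts a git ref and a mapping of workflow inputs.
    payload = {"ref": ref, "inputs": inputs}
    return url, payload
```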
Update the Elasticsearch indices corresponding to each media type. This can be done by using the `UPDATE_INDEX` action on the ingestion server. The steps for this piece would be:

Get the date for the last successful run of the update DAG (see the `_month_check` function from the data refresh DAG for a similar example).
For each media type, issue the `UPDATE_INDEX` action with that date.
Truncate the OAuth tables to prevent production API applications from working in staging. Each of these truncate operations can be run simultaneously using `TRUNCATE TABLE <table_name> RESTART IDENTITY CASCADE;` in a `PGExecuteQueryOperator`. The tables that need to be truncated are:
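Whatever the final table list, the statement each operator task runs can be sketched as follows; the table names below are placeholders, not the real OAuth table names.

```python
# Placeholder table names for illustration only.
OAUTH_TABLES = ["example_oauth_application", "example_oauth_token"]


def truncate_statement(table):
    """Build the SQL run by one PGExecuteQueryOperator task.

    RESTART IDENTITY resets sequences; CASCADE clears dependent rows.
    """
    return f"TRUNCATE TABLE {table} RESTART IDENTITY CASCADE;"


# One statement (and thus one parallel task) per table.
statements = [truncate_statement(table) for table in OAUTH_TABLES]
```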
Report the success or failure of the DAG run to Slack.
Django Admin UI changes#
The above section describes the process for performing the staging update. A
side effect of this method of updating is that the data in staging will be
destroyed in favor of the data from the production database. Maintainers can use
the `SKIP_STAGING_DATABASE_RESTORE` Airflow Variable described above to prevent
the deletion of the staging database. We will also add a notice to the top of the
Django Admin UI denoting when the next scheduled database recreation will occur.
This can be done by
overriding the base Django Admin template
and adding a notice to the top of the page. Since the DAG is intended to be run
`@monthly`, a countdown to the next run can be calculated and displayed for
maintainers. We will also want to provide instructions, or a link to instructions,
for how to disable the staging database recreation using the aforementioned Airflow Variable.
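Since Airflow's `@monthly` preset runs at midnight on the first of each month, the countdown is simple date arithmetic; a minimal sketch (function names are assumptions):

```python
from datetime import date


def next_monthly_run(today):
    """Date of the next @monthly run: midnight on the first of next month."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def days_until_next_run(today):
    """Countdown value to display in the Django Admin banner."""
    return (next_monthly_run(today) - today).days
```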
Tools & packages#
We already have the amazon provider (and thus `boto3`) installed on Airflow.
We should not need any additional dependencies.
Other projects or work#
This does not depend on any existing projects.
This project directly affects our infrastructure, but should not require any additional dependencies or changes. After this DAG is run, the infrastructure should mirror what existed before (albeit with an updated staging database).
The plan described above will incur a short (<10 minute) outage of the staging API. This is because the database name is changed in the middle of the process. If we want to avoid this outage, we could instead create a new database with a different name, and then switch the staging API and any other dependent services (such as the ingestion server) over to the new database. This would eliminate the outage, but would instead require that we deploy all staging services which depend on the database as part of the DAG workflow. We do not yet have an automated way for Airflow to trigger a deployment itself, so this would require additional work to integrate Airflow with GitHub Actions. Furthermore, changes to the services' environment variables would require changes to the infrastructure repository and an application of the updated Terraform configuration.
With this in mind, it seems much easier to handle the outage for staging rather than try to avoid it.
Per-table update policy#
We could also update the database in a more granular fashion, updating each table individually. This would allow us to avoid the outage, would enable further granularity with respect to which tables we update when, and could be used to reduce the delay between when the production database is updated and when the staging database receives the same data. This was discussed heavily in the implementation plan PR, and several potential mechanisms for this method of updating were discussed, namely:
- A Foreign Data Wrapper to point staging media tables to production
- Postgres' "logical replication" feature to keep production and staging media tables in sync
- A Foreign Data Wrapper to insert all records from production media tables into staging
We opted to defer the pursuit of this alternate approach for the time being for the following reasons:
The FDW/replication approach(es) described may have numerous pitfalls, unknowns, or drawbacks which the maintainers are not privy to due to unfamiliarity with the functionality.
The proposed alternative solution essentially describes an ETL framework for production -> staging replication. While such a system could be significantly more flexible, it would by necessity also be more complex, and would warrant its own project process to flesh out.
At the time of writing this is the first of three implementation plans this project requires, the other two being “Rapid iteration on ingestion server index configuration” and “Staging Elasticsearch reindex DAGs for both potential index types”. Any elongations to the project timeline at this step could also affect the timeline for drafting and implementing those plans as well.
The DAG described here does not make any permanent, irreversible changes; it can be disabled and replaced at any time in favor of a more thorough update approach down the line.
This alternate approach is intended to be explored further in another project in the future.
This DAG will need to have clear instructions for how to run it for those unfamiliar with Airflow.
Based on the above steps, the following work can be parallelized:
Alterations to the GitHub API class to include new methods for pulling the latest GHCR image tag and for triggering a workflow.
The addition of the notification banner to the Django Admin UI.
After the above work is complete, the primary DAG can be written and tested.
A rollback for this work would only require deleting the created DAG.
Because this DAG modifies our existing infrastructure directly, there is a risk
that we may adversely affect production databases with this DAG. Therefore, care
should be taken in asserting that only staging is affected. One way to
accomplish this would be to provide a wrapper around all the functions we will
need for interacting with `boto3` that checks that `prod` is not in any of the
references made to database assets (save for the initial snapshot acquisition).
If `prod` is found in any of the references, the DAG should fail.
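The guard could be a small check applied to every database identifier before a `boto3` call is made; a sketch, with the function name as an assumption:

```python
def assert_not_prod(**identifiers):
    """Fail the task if any database asset reference points at production.

    Intended to wrap the boto3 calls above; the initial snapshot lookup is
    the one place a production identifier is legitimately used, so it would
    be exempted by the caller.
    """
    for name, value in identifiers.items():
        if "prod" in str(value).lower():
            raise ValueError(
                f"Refusing to operate on production asset: {name}={value!r}"
            )
```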
Additionally, I have enabled deletion protection on the production API database.
This does not affect rename operations, and thus does not fully mitigate a
production outage if the database were renamed, but it does prevent a full
deletion of the database.
Internal discussions regarding restoring from snapshots in the past (these links are not publicly available but may serve as useful references for maintainers):