2023-04-06 Implementation Plan: Update Staging Database

Author: @aetherunbound

Reviewers

Overview

This document describes how we will implement a mechanism for updating the staging database with the latest data from the production database.

Outlined Steps

There are two primary products of this plan. The first will be a DAG (scheduled for @monthly) which will recreate the staging database from the most recent snapshot of the production API database. The second will be modifications to the Django Admin UI notifying maintainers of the next scheduled staging database update.

DAG

This will be accomplished by the following steps (a rough sketch of the core boto3 calls as Airflow tasks follows this list):

  1. Use a new SKIP_STAGING_DATABASE_RESTORE Airflow Variable to control whether the DAG should run or not. This will allow us to skip the DAG in cases where we have made diverging changes to the staging database for testing a new feature and do not want them to be overwritten. If this Variable is set to True, the DAG should issue a Slack message notifying the maintainers of the skipped run and raise an AirflowSkipException.

  2. Determine the most recent automated production snapshot using boto3’s describe_db_snapshots function (an Airflow operator does not exist for this operation). This step should also check the Status value of the response for the most recent snapshot to ensure that it is available.

  3. In the case where the snapshot is not yet available, wait for the status of the most recent snapshot using the RdsSnapshotExistenceSensor.

  4. In parallel, gather the attributes of the staging database using boto3’s describe_db_instances function (an Airflow operator does not exist for this operation). Namely, the following attributes should be gathered and set to ensure they match:

    • Availability should be a single DB instance (not multi-AZ)

    • It should be attached to our default Virtual Private Cloud (VPC)

    • It should not be publicly accessible

    • It will need access to the following VPC security groups: default, openverse-vpc-db-access, staging-dev-rds-sg

    • It should match the staging database’s instance size (currently m5.xlarge)

    • It should have 3000 GB of allocated gp2 storage

    • It should use password authentication (this may not need to be changed from the default, as the snapshot should contain the same connection information)

  5. Create a new database from this snapshot using boto3’s restore_db_instance_from_db_snapshot function (an operator does not exist for this operation). This database will be named in a way that does not conflict with the existing staging database name, e.g. dev-next-openverse-db. The database configuration information from the previous step will be used to ensure the new database matches the old database’s configuration exactly.

  6. Wait for the new database to be ready using the RdsDbSensor (example).

  7. Rename the old staging database to dev-old-openverse-db using boto3’s modify_db_instance function (an operator does not exist for this operation). We must set the ApplyImmediately option here to True to ensure this operation happens immediately rather than waiting for the next maintenance window.

  8. Wait for the old database rename to be complete with the RdsDbSensor (we may need retries on this step, since the database may not be initially available/named when the sensor first starts). Note: this will cause a temporary outage of the staging API, see the alternatives section for why this is the case. A Slack notification should be sent out at the start of this step to alert the team of the outage, and again once the outage is resolved.

  9. Rename the new database to dev-openverse-db using modify_db_instance. (Noting that ApplyImmediately should be set to True here as well.)

  10. Wait for the new database rename to be complete with the RdsDbSensor (we may need retries on this step, since the database may not be initially available/named when the sensor first starts).

  11. If the previous steps fail, rename dev-old-openverse-db back to dev-openverse-db. Otherwise, dev-old-openverse-db can be deleted using the RdsDeleteDbInstanceOperator. We should use the following configuration options:

    • wait_for_completion should be left as True here so this step will hang until the database removal is complete.

    • SkipFinalSnapshot should be set to True to avoid creating a final snapshot of the database before deletion.

  12. Once the database rename is complete, the following actions will also need to occur (they can occur simultaneously in the DAG):

    1. The staging API must be deployed to apply any necessary migrations to the new copy of the database. In order to do this we must augment the existing GitHub API class with additional functions for getting the latest GHCR image tag and triggering the deployment workflow (a rough sketch of these functions also appears after this list). Once those functions exist, the steps for this piece would be:

      1. Get the latest image tag for the openverse-api package.

      2. Trigger the deployment workflow in a similar manner to the existing CI/CD workflow using the tag from the previous step and the openverse-bot actor.

    2. Update the Elasticsearch indices corresponding to each media type. This can be done by using the UPDATE_INDEX action on the ingestion server. The steps for this piece would be:

      1. Get the date for the last successful run of the update DAG (see the _month_check function from the data refresh DAG for a similar example).

      2. For each media type:

        1. Get the current index (see the get_current_index task on the data refresh DAG)

        2. Initiate the UPDATE_INDEX action using the date and index suffix retrieved above

        3. Wait for the index update to complete

    3. Truncate the OAuth tables to prevent production API applications from working in staging. Each of these truncate operations can be run simultaneously using TRUNCATE TABLE <table_name> RESTART IDENTITY CASCADE; in a PGExecuteQueryOperator. The tables that need to be truncated are:

      • api_throttledapplication

      • api_oauth2registration

      • api_oauth2verification

      • oauth2_provider_accesstoken

      • oauth2_provider_grant

      • oauth2_provider_idtoken

      • oauth2_provider_refreshtoken

  13. Report the success or failure of the DAG run to Slack.
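
To make the boto3 interactions above concrete, the following is a minimal sketch of the core DAG tasks (the skip check from step 1, the snapshot lookup from step 2, the restore from step 5, and the renames from steps 7 and 9). The task names, the aws_default connection ID, and the shape of the staging configuration value are assumptions for illustration only; the real DAG will also need the sensors, retries, Slack notifications, and cleanup described above.

```python
# Minimal sketch of the core tasks; function names and the "aws_default"
# connection are assumptions, not final implementation details.
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.rds import RdsHook


@task
def skip_restore_if_requested():
    # Step 1: bail out early when maintainers want to keep the current staging
    # database (e.g. it contains diverging changes under test).
    if Variable.get("SKIP_STAGING_DATABASE_RESTORE", default_var="false").lower() == "true":
        # A Slack notification would also be issued here.
        raise AirflowSkipException("Staging database restore skipped via Airflow Variable")


@task
def get_latest_snapshot_id(prod_db_identifier: str) -> str:
    # Step 2: find the most recent automated snapshot of the production database
    # and confirm that it is available.
    rds = RdsHook(aws_conn_id="aws_default").conn
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier=prod_db_identifier,
        SnapshotType="automated",
    )["DBSnapshots"]
    latest = max(snapshots, key=lambda snapshot: snapshot["SnapshotCreateTime"])
    if latest["Status"] != "available":
        raise ValueError(f"Snapshot {latest['DBSnapshotIdentifier']} is not yet available")
    return latest["DBSnapshotIdentifier"]


@task
def restore_staging_from_snapshot(snapshot_id: str, staging_config: dict):
    # Step 5: create the temporary instance from the snapshot, reusing the
    # configuration gathered from the existing staging instance in step 4.
    rds = RdsHook(aws_conn_id="aws_default").conn
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="dev-next-openverse-db",
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceClass=staging_config["DBInstanceClass"],
        MultiAZ=False,
        PubliclyAccessible=False,
        VpcSecurityGroupIds=staging_config["VpcSecurityGroupIds"],
    )


@task
def rename_db_instance(current_name: str, new_name: str):
    # Steps 7 and 9: rename an instance, applying the change immediately rather
    # than waiting for the next maintenance window.
    rds = RdsHook(aws_conn_id="aws_default").conn
    rds.modify_db_instance(
        DBInstanceIdentifier=current_name,
        NewDBInstanceIdentifier=new_name,
        ApplyImmediately=True,
    )
```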
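
Similarly, for step 12.1, the two new GitHub functions could look roughly like the sketch below, written against the GitHub REST API directly for illustration (in practice they would be added to the existing GitHub API class, and the openverse-bot actor would be supplied via the token used). The organization, repository, workflow file name, and workflow inputs shown are assumptions, not confirmed values.

```python
# Rough sketch of the two new GitHub interactions; the organization, repository,
# workflow file name, and workflow inputs are assumptions for illustration only.
import requests

API_ROOT = "https://api.github.com"


def get_latest_image_tag(token: str, org: str = "WordPress", package: str = "openverse-api") -> str:
    # List the container package's versions and return the newest tagged version's tag.
    resp = requests.get(
        f"{API_ROOT}/orgs/{org}/packages/container/{package}/versions",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    tagged = [v for v in resp.json() if v["metadata"]["container"]["tags"]]
    latest = max(tagged, key=lambda version: version["created_at"])
    return latest["metadata"]["container"]["tags"][0]


def trigger_staging_deployment(token: str, repo: str, tag: str):
    # Dispatch the (assumed) deployment workflow with the image tag as an input,
    # in a similar manner to the existing CI/CD workflow.
    resp = requests.post(
        f"{API_ROOT}/repos/{repo}/actions/workflows/deploy-staging-api.yml/dispatches",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
        json={"ref": "main", "inputs": {"tag": tag}},
    )
    resp.raise_for_status()
```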

Django Admin UI changes

The above section describes the process for performing the staging update. A side effect of this method of updating is that the data in staging will be destroyed in favor of the data from the production database. Maintainers can use the SKIP_STAGING_DATABASE_RESTORE Airflow Variable described above to prevent the deletion of the staging database. We will also add a notice to the top of the Django Admin UI denoting when the next scheduled database recreation will occur. This can be done by overriding the base Django Admin template and adding a notice to the top of the page. Since the DAG is intended to be run @monthly, a countdown to the next run can be calculated and displayed for maintainers. We will also want to provide instructions, or a link to instructions, for how to disable the staging database recreation using the aforementioned Airflow Variable.
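
As a rough illustration (assuming the @monthly schedule fires at midnight UTC on the first day of each month, and using hypothetical helper names), the countdown could be computed as follows:

```python
# Hypothetical helper for the admin notice: compute the next @monthly run,
# assuming the DAG fires at midnight UTC on the first day of each month.
from datetime import datetime, timezone


def next_staging_recreation(now: datetime | None = None) -> datetime:
    now = now or datetime.now(timezone.utc)
    # Roll over to the first day of the following month.
    if now.month == 12:
        return datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    return datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)


def days_until_next_recreation(now: datetime | None = None) -> int:
    now = now or datetime.now(timezone.utc)
    return (next_staging_recreation(now) - now).days
```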

Dependencies

Tools & packages

We already have the amazon provider (and thus boto3) installed on Airflow. We should not need any additional dependencies.

Other projects or work

This does not depend on any existing projects.

Infrastructure

This project directly affects our infrastructure, but should not require any additional dependencies or changes. After this DAG is run, the infrastructure should mirror what existed before (albeit with an updated staging database).

Alternatives

Database cut-over

The plan described above will incur a short (<10 minute) outage of the staging API, because the database name changes in the middle of the process. If we want to avoid this outage, we could instead create a new database with a different name and then switch the staging API and any other dependent services (such as the ingestion server) over to it. This would eliminate the outage, but would instead require that we deploy all staging services which depend on the database as part of the DAG workflow. We do not yet have an automated way for Airflow to trigger a deployment itself, so this would require additional work to integrate Airflow with GitHub Actions. Furthermore, changes to the services' environment variables would require changes to the infrastructure repository and an application of the updated Terraform configuration.

With this in mind, it seems much easier to accept the short staging outage than to try to avoid it.

Per-table update policy

We could also update the database in a more granular fashion, updating each table individually. This would allow us to avoid the outage, would enable further granularity with respect to which tables we update when, and could be used to reduce the delay between when the production database is updated and when the staging database receives the same data. This was discussed heavily in the implementation plan PR, and several potential mechanisms for this method of updating were discussed, namely:

  1. Foreign Data Wrapper to point staging media tables to production

  2. Postgres’ “logical replication” feature to keep production and staging media tables in sync

  3. Foreign Data Wrapper to insert all records from production media tables into staging

We opted to defer the pursuit of this alternate approach for the time being for the following reasons:

  • The FDW/replication approach(es) described may have numerous pitfalls, unknowns, or drawbacks that the maintainers are not aware of, given our unfamiliarity with the functionality.

  • The proposed alternative solution essentially describes an ETL framework for production -> staging replication. While such a system could be significantly more flexible, it would by necessity also be more complex, and would warrant its own project process to flesh out.

  • At the time of writing this is the first of three implementation plans this project requires, the other two being “Rapid iteration on ingestion server index configuration” and “Staging Elasticsearch reindex DAGs for both potential index types”. Any delay at this step could also push back the drafting and implementation of those plans.

  • The DAG described here does not make any permanent or irreversible changes; it can be disabled and replaced at any time in favor of a more thorough update approach down the line.

This approach is intended to be explored further in a future project.

Accessibility

This DAG will need to have clear instructions for how to run it for those unfamiliar with Airflow.

Parallelizable streams

Based on the above steps, the following work can be parallelized:

  1. Alterations to the GitHub API class to include new methods for pulling the latest GHCR image tag and for triggering a workflow.

  2. The addition of the notification banner to the Django Admin UI.

After the above work is complete, the primary DAG can be written and tested.

Rollback

A rollback for this work would only require deleting the created DAG.

Risks

Because this DAG modifies our existing infrastructure directly, there is risk that we may adversely affect production databases with this DAG. Therefore, care should be taken in asserting that only staging is affected. One way to accomplish this would be to provide a wrapper around all the functions we will need for interacting with boto3 that checks that prod is not in any of the references made to database assets (save for the initial snapshot acquisition step). If prod is found in any of the references, the DAG should fail.
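
One possible shape for that guard is a decorator applied to each of the functions that wrap our boto3 calls, sketched below with a hypothetical name; steps that legitimately reference the production snapshot would simply not use it.

```python
# Hypothetical guard around the functions that wrap boto3 calls: fail fast if any
# argument references a production database asset.
import functools


def reject_prod_references(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        values = [str(value) for value in (*args, *kwargs.values())]
        if any("prod" in value for value in values):
            raise ValueError(
                f"{func.__name__} received a reference to a production asset: {values}"
            )
        return func(*args, **kwargs)

    return wrapper
```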

Additionally, I have enabled deletion protection on the prod-openverse-db. This does not affect rename operations and thus does not fully mitigate a production outage if the database was renamed, but it does prevent a full deletion of the database.

Prior art