2024-05-30 Implementation Plan: Augment the catalog database with suitable Rekognition tags¶
Author: @AetherUnbound
Reviewers¶
[x] @sarayourfriend
[x] @stacimc
Project links¶
Overview¶
Note
References throughout this document to “the database” refer exclusively to the catalog database. The API database is named explicitly where referenced.
Note
The terms “tags” and “labels” are often used interchangeably in this document. Broadly, “labels” refer to the actual names of the tags used, and “tags” refer to the blobs of data available in the catalog database which include those labels.
This implementation plan describes the technical process we intend to use for incorporating Rekognition data in the catalog database, and the criteria we will use when filtering tags as they make their way into the API database. This includes defining criteria for the following:
Which tags should be included/excluded in the API
What minimum accuracy value is required for inclusion
Since there already exist machine-generated tags which may not conform to the above criteria, a plan is provided for handling those existing tags as well.
Note
This document operates under the understanding that the catalog database is Openverse’s data warehouse and should store as much as possible. It’s the responsibility of the data refresh process to dictate what data should be surfaced in the API, and filter where necessary (see #4541 and #4524 for more details).
Expected Outcomes¶
At the end of the implementation of this project, we should have the following:
Clear criteria for the kinds of tags we will filter when presenting machine-generated tags in the API
A clear minimum accuracy value for machine generated tags
All available Rekognition tags will be added to the catalog
An approach for filtering the new Rekognition tags based on the above criteria
An approach for filtering the existing Clarifai tags until further analysis can be performed on the kinds of tags it provides
Label criteria¶
This section describes the criteria used for determining which machine-generated tags we should exclude when adding any new tags to the database, and what the minimum accuracy cutoff for those tags should be.
Label selection¶
Machine-generated tags that are the product of AI image labeling models have been shown repeatedly and consistently to perpetuate certain cultural, structural, and institutional biases[1][2][3]. This includes analysis done on AWS Rekognition, specifically[4][5][6].
Certain demographic axes seem the most likely to result in an incorrect or insensitive label (e.g. gender assumption of an individual in a photo). For the reasons described in the above cited works, we should exclude labels that have a demographic context in the following categories:
Age
Gender
Sexual orientation
Nationality
Race
Marital status
There are other categories which might be useful for search relevancy and are less likely to be applied in an insensitive manner. Inclusion or exclusion of labels that match these categories should be considered on a case-by-case basis depending on the source of the labeling. Labels within each category that are otherwise gendered (e.g. “stewardess”, “actress”, etc.) should be excluded by default. Some examples include:
Occupation
Health and disability status
Political affiliation or preference
Religious affiliation or preference
Accuracy selection¶
We already filter out existing tags from the catalog when copying data into the API database during the data refresh’s cleanup step[7]. The minimum accuracy value used for this step is 0.9 (or 90%). AWS’s own advice on what value to use is that it depends entirely on the use case of the application.
I took a small sample of the labels we have available (~100MB out of the 196GB dataset, about 45k images with labels) and performed some exploratory analysis on the data. I found the following pieces of information:
Total images: 45,059
Total labels: 555,718
Average confidence across all labels: 79.927835
Median confidence across all labels: 81.379463
Average confidence per image: 81.073921
Median confidence per image: 82.564148
Number of labels with confidence higher than 90: 210,341
Percentage of labels with confidence higher than 90: 37.85031%
Average number of labels per image higher than 90: 4.6629
For a full explanation on this exploration, see: Analysis explanation
Based on the number of labels we would still be receiving with a confidence higher than 90, and that 0.9 is already our existing minimum standard, we should retain 0.9 or 90% as our minimum label accuracy value for inclusion in the API.
This necessarily means that we will not be surfacing a projected 62% of the labels which are available in the Rekognition dataset. Accuracy, as it directly relates to search relevancy, is more desirable here than completeness. We will retain all Rekognition tags in the catalog regardless, and so if we decide to allow a lower accuracy threshold, we can always adjust the threshold value and run a new data refresh to surface those tags.
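The cutoff described above can be sketched as follows. This is a minimal illustration, assuming Rekognition's 0–100 confidence values are rescaled to the 0.0–1.0 "accuracy" range used by existing catalog tags; the function name is hypothetical.

```python
# Minimal sketch of the accuracy cutoff, assuming Rekognition's 0-100
# confidence scores are rescaled to the 0.0-1.0 accuracy range.
MINIMUM_ACCURACY = 0.9

def passes_accuracy_cutoff(confidence: float) -> bool:
    """Return True if a raw Rekognition confidence (0-100) meets the cutoff."""
    accuracy = confidence / 100
    return accuracy >= MINIMUM_ACCURACY

labels = [("Person", 99.83), ("Crowd", 93.41), ("Game", 68.61)]
kept = [name for name, confidence in labels if passes_accuracy_cutoff(confidence)]
# kept == ["Person", "Crowd"]
```

Because the full tag data remains in the catalog, lowering `MINIMUM_ACCURACY` later only requires a new data refresh, not a re-ingestion.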
Step-by-step plan¶
In order to accomplish the goals of this plan, the following steps will need to be performed:
Step details¶
Note
Some of the steps listed below have some cross-over with functionality defined in/required by the data normalization project (#430) and the ingestion server removal project (#3925). Where possible, existing issues will be referenced and possible duplicated effort will be identified.
Determine excluded labels¶
This will involve a manual process of looking through each of the available labels for Rekognition and seeing if they match any of the criteria to be filtered. This process should be completed by two maintainers, and their list of exclusions discussed & combined. The excluded labels should then be saved in an accessible location, either on S3 or within the sensitive terms repository as a new file. Consent & approval should be sought from two other maintainers on the accuracy of the exclusion list prior to publishing.
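The two-maintainer review could be reconciled with simple set operations. This is a hypothetical sketch; the label values are illustrative, and the final agreed list would come from the discussion step, not from the intersection alone.

```python
# Hypothetical sketch of combining two maintainers' exclusion lists.
# Labels flagged by only one reviewer are surfaced for discussion; the
# published list is whatever both agree on after that discussion.
reviewer_a = {"Female", "Male", "Senior Citizen", "Bride"}
reviewer_b = {"Female", "Male", "Bride", "Groom"}

agreed = reviewer_a & reviewer_b            # flagged by both reviewers
needs_discussion = reviewer_a ^ reviewer_b  # flagged by only one
# agreed == {"Female", "Male", "Bride"}
# needs_discussion == {"Senior Citizen", "Groom"}
```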
Preemptively filter Rekognition tags¶
Before inserting the Rekognition tags, we want to make sure they are appropriately filtered during the data refresh. Ideally this filtering would apply the more complete set of exclusions described above, covering both the labels themselves and their accuracy. This, however, depends on the completion of #4541 and the ingestion server removal project in general (#3925).
In order to work on this effort in parallel with #3925, we can add a check to
the existing tag filtering step which will exclude all tags
with the provider rekognition. That way we can add all of the tags to the
catalog with impunity, and allow those tags to be exposed when #3925 is finished
and turned on.
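The temporary blanket filter might look like the following sketch. The function and constant names are assumptions for illustration, not the actual names used in the filtering step.

```python
# Hypothetical sketch of the temporary blanket filter: drop any tag whose
# provider is "rekognition" before tags reach the API database.
FILTERED_TAG_PROVIDERS = {"rekognition"}

def filter_tags(tags: list[dict]) -> list[dict]:
    return [t for t in tags if t.get("provider") not in FILTERED_TAG_PROVIDERS]

tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"},
]
filter_tags(tags)  # only the flickr-provided tag remains
```

Because this check is provider-scoped, lifting it later is a one-line change that exposes all Rekognition tags at once.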
Insert new Rekognition tags¶
The below steps describe a thorough, testable, and reproducible way to generate and incorporate the new Rekognition tags. It would be possible to short-cut many of these steps by running them as one-off commands or scripts locally (see Alternatives). Since we may need to incorporate machine-labels in bulk in a similar manner in the future, having a clear and repeatable process for doing so will make those operations easier down the line. It also allows us to test the insertion process locally, which feels crucial for such a significant addition of data.
Note
This step will add all of the available Rekognition tags to the catalog. The tags with an insufficient confidence value or that were determined to be excluded will be filtered out of the API database during the data refresh.
Context¶
The Rekognition dataset we have available is a JSON lines file where each line is a JSON object with (roughly) the following shape:
{
"image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
"response": {
"Labels": [
{
"Name": "Human",
"Confidence": 99.82632446289062,
"Instances": [],
"Parents": []
},
{
"Name": "Person",
"Confidence": 99.82632446289062,
"Instances": [
{
"BoundingBox": {
"Width": 0.219997838139534,
"Height": 0.46728312969207764,
"Left": 0.6179072856903076,
"Top": 0.39997851848602295
},
"Confidence": 99.82632446289062
},
...
],
"Parents": []
},
{
"Name": "Crowd",
"Confidence": 93.41161346435547,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "People",
"Confidence": 86.95382690429688,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "Game",
"Confidence": 68.61305236816406,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "Chess",
"Confidence": 68.61305236816406,
"Instances": [
{
"BoundingBox": {
"Width": 0.8339029550552368,
"Height": 0.7898563742637634,
"Left": 0.08363451808691025,
"Top": 0.1719469130039215
},
"Confidence": 68.61305236816406
}
],
"Parents": [
{
"Name": "Game"
},
{
"Name": "Person"
}
]
},
{
"Name": "Coat",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Clothing"
}
]
},
{
"Name": "Suit",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Overcoat"
},
{
"Name": "Coat"
},
{
"Name": "Clothing"
}
]
},
{
"Name": "Apparel",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": []
},
{
"Name": "Clothing",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": []
},
{
"Name": "Overcoat",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Coat"
},
{
"Name": "Clothing"
}
]
},
{
"Name": "Meal",
"Confidence": 62.59776306152344,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
},
{
"Name": "Food",
"Confidence": 62.59776306152344,
"Instances": [],
"Parents": []
},
{
"Name": "Furniture",
"Confidence": 58.1875,
"Instances": [],
"Parents": []
},
{
"Name": "Tablecloth",
"Confidence": 57.604129791259766,
"Instances": [],
"Parents": []
},
{
"Name": "Party",
"Confidence": 57.07652282714844,
"Instances": [],
"Parents": []
},
{
"Name": "Dinner",
"Confidence": 56.07081985473633,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
},
{
"Name": "Supper",
"Confidence": 56.07081985473633,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
}
],
"LabelModelVersion": "2.0",
"ResponseMetadata": {
"RequestId": "60c4b6f5-3b73-466e-8fa5-e40037661253",
"HTTPStatusCode": 200,
"HTTPHeaders": {
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 29 Oct 2020 19:46:02 GMT",
"x-amzn-requestid": "60c4b6f5-3b73-466e-8fa5-e40037661253",
"content-length": "3526",
"connection": "keep-alive"
},
"RetryAttempts": 0
}
}
}
This file is about 200GB in total. For more information about the data, see Analysis Explanation.
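Given the shape above, extracting the top-level labels from a single line could look like the following sketch. The function name is hypothetical, and records without a `Labels` key are handled by returning an empty list.

```python
import json

# Sketch of extracting the top-level labels from one line of the
# Rekognition JSON lines file; records lacking a "Labels" key simply
# yield no labels.
def extract_labels(line: str) -> tuple[str, list[tuple[str, float]]]:
    record = json.loads(line)
    labels = record.get("response", {}).get("Labels", [])
    return record["image_uuid"], [(l["Name"], l["Confidence"]) for l in labels]

line = '{"image_uuid": "960b59e6", "response": {"Labels": [{"Name": "Human", "Confidence": 99.8}]}}'
extract_labels(line)  # ("960b59e6", [("Human", 99.8)])
```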
DAG¶
Attention
A snapshot of the catalog database should be created prior to running this step in production.
We will create a DAG (add_rekognition_labels) which will perform the following
steps:
1. Create a temporary table in the catalog for storing the tag data. This table will have two columns: `identifier` and `tags` (with data types matching the existing catalog columns).
2. Iterate over the large Rekognition dataset in a chunked manner using `smart_open`. `smart_open` provides options for tuning buffer size so larger chunks can be read into memory.
3. For each line, read in the JSON object and pull out the top-level labels & confidence values. Note: some records may not have any labels.
4. Construct a `tags` JSON object similar to the existing tags data for that image, including accuracy and provider. Ensure that the casing of the labels is preserved and that the confidence value is scaled to between 0.0 and 1.0 (e.g. `[{"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]`).
5. At regular intervals, insert batches of constructed `identifier`/`tags` pairs into the temporary table.
6. Launch a batched update run which merges the existing tags and the new tags from the temporary table for each identifier[8]. Note: the batched update DAG may need to be augmented in order to reference data from an existing table, similar to #3415.
7. Delete the temporary table.
For local testing, a small sample of the Rekognition data could be made available in the local S3 server similar to the iNaturalist sample data.
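The tag construction and merge semantics of the DAG steps above can be sketched in Python. In the DAG itself the merge happens in SQL via the batched update; this sketch (with hypothetical function names) only illustrates the intended behavior.

```python
# Sketch of constructing the new tags objects and merging them with a
# record's existing tags. The real merge runs as a batched SQL update;
# this models the intended semantics only.
def build_tags(labels: list[tuple[str, float]]) -> list[dict]:
    return [
        {"name": name, "accuracy": round(confidence / 100, 4), "provider": "rekognition"}
        for name, confidence in labels
    ]

def merge_tags(existing: list[dict], new: list[dict]) -> list[dict]:
    seen = {(t["name"], t.get("provider")) for t in existing}
    return existing + [t for t in new if (t["name"], t.get("provider")) not in seen]

existing = [{"name": "cat", "provider": "flickr"}]
new = build_tags([("Cat", 99.83)])
merge_tags(existing, new)
# [{"name": "cat", "provider": "flickr"},
#  {"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"}]
```

Keying the dedupe on the (name, provider) pair preserves label casing while preventing the same provider's label from being inserted twice on re-runs.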
Filter Rekognition tags¶
Once the new Rekognition tags have been inserted into the catalog, we will want to remove the blanket filter for all Rekognition tags that was put in place in the preemptive filtering step. We will also need to add logic for filtering out the tags that were marked for exclusion in the determine excluded labels step. This filtering is done in service of appropriately improving search relevancy.
For all machine-generated labels, we will employ an inclusion-based filtering
process. This means that we will only surface labels that match the list of
approved labels, which prevents labels that are unreviewed from appearing in the
downstream dataset. This can be added to the alter_data step of the data
refresh (see #4684) and would only be applied to tags where the provider was
not the record’s provider.
The comparison between labels on the record and labels in the list should be case-insensitive, given that the semantic content of the labels is generally case-insensitive too. Similar to the sensitive terms list, both the inclusion and reviewed lists will be applied to all tag sources (that is, we will not maintain provider-specific lists).
For any orthographic corrections we’ve made to the labels, we will have the corrected label present in the inclusion list and the original label in the reviewed list. This will ensure that the corrected label is surfaced in the API, but the original label gets blocked in the cases where it may be added by another provider.
We will also add a step for recording if a label was not in the inclusion list and if it did not exist in a full list of all reviewed labels from the provider. These “unreviewed” labels should be surfaced as part of the data refresh, so maintainers can review them and decide if they should be included in the inclusion list.
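The inclusion-based filtering and “unreviewed” label capture described above can be sketched as follows. The list contents and function name are illustrative assumptions, not the actual lists.

```python
# Sketch of inclusion-based, case-insensitive filtering with "unreviewed"
# label capture; list contents here are purely illustrative.
INCLUDED = {"cat", "dog"}         # approved labels (lowercased)
REVIEWED = {"cat", "dog", "kat"}  # every reviewed label, approved or not

def filter_machine_tags(tags: list[dict]) -> tuple[list[dict], set[str]]:
    kept, unreviewed = [], set()
    for tag in tags:
        label = tag["name"].lower()
        if label in INCLUDED:
            kept.append(tag)
        elif label not in REVIEWED:
            unreviewed.add(tag["name"])  # surfaced for maintainer review
    return kept, unreviewed

tags = [{"name": "Cat"}, {"name": "Kat"}, {"name": "Zebra"}]
filter_machine_tags(tags)  # ([{"name": "Cat"}], {"Zebra"})
```

Note how "Kat" (reviewed but not approved) is silently dropped, while "Zebra" (never reviewed) is captured for the maintainer review pass.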
Filter Clarifai tags¶
While this project seeks to add new machine-generated labels to the database, we already have around 10 million records which include labels from the Clarifai image labeling service. It is unclear how these labels were applied, or what the exhaustive label set is. Thus, it’s prudent for us to perform some analysis on these tags to determine which labels from this dataset should also be filtered from the API.
Note
We will not be removing any existing tags from the catalog.
Similar to the preemptive Rekognition filtering, we will want to filter the existing Clarifai tags until we can perform the same analysis on the set of available tags as will be done for the Rekognition ones. This can be done using the same steps described for the Rekognition filtering, based on the status of this project and #3925. An alternative approach could be to use the “unreviewed” label approach described in the filter Rekognition tags section to surface the Clarifai tags for review.
Once the filtering is in place, we can construct an exhaustive set of Clarifai
labels and determine exclusions for that provider using the approach
described above. Then the Clarifai label inclusions can be
added to the alter_data step (#4541) in the same way Rekognition’s are added
and the blanket exclusion for all tags from that provider can be lifted. These
exclusion lists could be combined into a single filtering step, or we could have
individual filter lists based on the label provider. My preference is the former,
since that way the single list serves as a more exhaustive exclusion list.
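Constructing the exhaustive Clarifai label set could look like the following sketch, assuming each record's tags column is a list of tag objects with a `provider` field; the function name is hypothetical.

```python
# Hypothetical sketch of building the exhaustive Clarifai label set from
# catalog records' tags columns, for the review described above.
def clarifai_labels(rows: list[list[dict]]) -> set[str]:
    return {
        tag["name"]
        for tags in rows
        for tag in tags
        if tag.get("provider") == "clarifai"
    }

rows = [
    [{"name": "people", "provider": "clarifai"}, {"name": "cat", "provider": "flickr"}],
    [{"name": "portrait", "provider": "clarifai"}],
]
clarifai_labels(rows)  # {"people", "portrait"}
```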
Dependencies¶
Infrastructure¶
No infrastructure changes will be necessary for this work.
Tools & packages¶
The smart_open package will need to be installed as a dependency
within Airflow, in order for it to be available for this DAG.
Other projects or work¶
This project intersects with the ingestion server removal project (#3925), but steps can be taken to circumvent this dependency for the time being. See preemptively filter Rekognition tags for more details.
This project is also related to, but not necessarily dependent on, the data normalization project. See the note in Step Details.
Alternatives¶
Although the above plan is thorough and may require more investment up-front, we could opt to incorporate this data as soon as possible by performing all of the steps of the DAG by hand. We would need to record what exact set of steps were taken, as there would likely be some iteration on scripts and SQL as part of figuring out the exact commands necessary. The entire Rekognition file[9] could be downloaded by a maintainer locally and all data manipulation could be performed on their machine. A new TSV could be generated matching the table pattern described in DAG step 1, the file could be uploaded to S3, and a table in Postgres could be created from it directly. The final batched update step would then be kicked off by hand.
While I would personally prefer to take these actions by hand to get the data in quicker, I think it’s prudent for us to have a more formal process for accomplishing this. It’s possible that we might receive more machine-generated labels down the line, and having a rubric for how to add them will serve us much better than a handful of scripts and instructions.
We could also skip processing the Rekognition file in Python and insert it
directly into Postgres. We’d then need to perform the label extraction and
filtering from the JSON objects using SQL instead, which
does seem possible. This would obviate
the need to use smart_open and install a new package on Airflow. I think this
route will be much harder based on my own experience crafting queries involving
Postgres’s JSON access/manipulation methods, and I think the resulting query
would not be as much of a benefit as the time it might take to craft it.
Blockers¶
No blockers; this work can begin immediately (though some steps may conflict with the data normalization and ingestion server removal projects; see the note in dependencies).
Rollback¶
Rollback for this project looks different for each label source:
Clarifai: If we decide to roll back any filters for Clarifai that we instated, we could simply remove those filters and re-surface the data in the API. We’re not removing any data from the catalog as part of this project, so this would return the Clarifai tags to their currently fully-visible state.
Rekognition: If we decide not to surface any Rekognition tags in the API, we could simply retain the blanket provider-wide filter for all Rekognition tags.
Risks¶
We are only adding new data to the catalog as part of this effort; we do not intend to remove any existing data. We have full control over what data we filter when constructing the API database during the data refresh, and so we could opt to filter out all of the machine-generated labels that exist in the database even after the new ones are inserted. As such, this project poses little risk beyond increased database storage size.
Adding this new data will affect search relevancy. Discussion around that risk can be found in the project proposal.
Prior art¶
Previous examples for tag manipulation using the batched update DAG are shared throughout[8].
Analysis explanation¶
I downloaded the first 100MB of the file using the following command:
aws s3api get-object --bucket migrated-cccatalog-archives --key kafka/image_analysis_labels-2020-12-17.txt --range bytes=0-100000000 ~/misc/aws_rekognition_subset.txt
The S3 file referenced here is a JSON lines file where each line is a record for an image. I had to delete the last line because a byte selection couldn’t guarantee that the entire line would be read in completely, and it might not parse as valid JSON.
Then I used pandas and ipython for further
exploration. Below is the script I used to ingest the data and compute the
values referenced in the accuracy selection section:
import json
import pandas as pd
# Read the file in as JSON lines
df = pd.read_json("/home/madison/misc/aws_rekognition_subset.txt", lines=True)
# Extract the labels from each row into mini-dataframes
recs = []
for _, row in df.iterrows():
iid = row.image_uuid
try:
# Normalize the labels into a table, then get only the name and confidence values
# Skip the record if it doesn't have labels
tags = pd.json_normalize(row.response["Labels"])[["Name", "Confidence"]]
except KeyError:
continue
# Add the image ID as an index
tags["image_uuid"] = iid
recs.append(tags)
# Concatenate all dataframes together
# This results in the columns: image_uuid, name, confidence
xdf = pd.concat(recs)
# Compute the total number of labels
len(xdf)
# Get average statistics for the dataframe, namely confidence mean
xdf.describe()
# Average confidence by image
xdf.groupby("image_uuid")["Confidence"].mean().mean()
# Global median confidence
xdf.Confidence.median()
# Median confidence by image
xdf.groupby("image_uuid")["Confidence"].median().median()
# Number of labels w/ confidence higher than 90
(xdf.Confidence > 90).sum()
# Percent of total labels w/ confidence higher than 90
(xdf.Confidence > 90).sum() / len(xdf)
# Average number of tags per item w/ confidence higher than 90
(xdf.Confidence > 90).sum() / len(df)
Changelog¶
2024-07-25 - (#4662) Clarified policy around initially included demographic labels to ensure they’re reviewed on a case-by-case basis.
2024-08-19 - (#4784) Added a note about the inclusion-based filtering process and the “unreviewed” label capture, and a note to insert all available Rekognition data and filter it afterwards.