2024-03-20 Project Proposal: Incorporate Rekognition data into the Catalog¶
Author: @AetherUnbound
Reviewers¶
Project summary¶
AWS Rekognition data in the form of object labels was collected by Creative Commons several years ago for roughly 100 million image records in the Openverse catalog. This project intends to augment the existing tags on those records with the machine-generated labels in order to improve search result relevancy.
Goals¶
Improve Search Relevancy
Requirements¶
This project will be accomplished in two major pieces:
Determining how machine-generated tags will be displayed/conveyed in the API and the frontend
Augmenting the catalog database with the tags we deem suitable
Focusing on the frontend first may seem like putting the cart before the horse, but it is prudent to decide how the new data will show up in both the frontend and the API before we add it. While both of the above will be expanded on in their respective implementation plans, below is a short description of each piece.
Augmenting the catalog¶
Once we have a clear sense of how the labels will be shared downstream, we can incorporate the labels themselves into the catalog database. This can be broken down into three steps:
Determine which labels to use (see label determination)
Determine an accuracy cutoff value
Upsert the filtered labels into the database
Once step 3 is complete, the next data refresh will make the tags available in the API and the frontend. The specifics of each step will be determined in the implementation plan for this piece. Note that once introduced, the tags will not be removed by subsequent updates to the catalog data, so any adjustment or removal of the tags will also need to happen in the catalog.
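As a rough illustration of the second and third steps, the sketch below applies an accuracy cutoff and then merges the surviving labels into a record's existing tags without removing anything already there. Everything in it is an assumption made for illustration: the cutoff value, the `rekognition` provider name, the tag shape (`name`, `accuracy`, `provider`), and the function names are hypothetical, and the real upsert would run against the catalog database rather than in-memory lists.

```python
# Hypothetical values; the real cutoff and provider name are left to the
# implementation plan.
ACCURACY_CUTOFF = 0.90
PROVIDER = "rekognition"


def filter_by_accuracy(labels, cutoff=ACCURACY_CUTOFF):
    """Keep labels at or above the cutoff and shape them as tag objects.

    The tag shape (name/accuracy/provider) is an assumption for this sketch.
    """
    return [
        {"name": label["name"], "accuracy": label["accuracy"], "provider": PROVIDER}
        for label in labels
        if label["accuracy"] >= cutoff
    ]


def merge_tags(existing, new):
    """Append new tags to existing ones, skipping (name, provider) duplicates.

    Existing tags are never removed or modified, and running the merge twice
    with the same input yields the same result.
    """
    seen = {(tag["name"].casefold(), tag.get("provider")) for tag in existing}
    merged = list(existing)
    for tag in new:
        key = (tag["name"].casefold(), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


existing_tags = [{"name": "cat", "provider": "flickr"}]
raw_labels = [
    {"name": "Cat", "accuracy": 0.99},
    {"name": "Sofa", "accuracy": 0.50},  # dropped: below the cutoff
    {"name": "Pet", "accuracy": 0.95},
]
updated_tags = merge_tags(existing_tags, filter_by_accuracy(raw_labels))
```

Keying duplicates on both name and provider keeps a machine-generated "Cat" distinct from a provider-supplied "cat", which would let downstream consumers treat the two sources differently.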
Label determination¶
The full list of AWS Rekognition labels can be downloaded here: AWS Rekognition Labels. While this list is already fairly demographically neutral, I believe we should exclude labels that have a demographic context in the following categories:
Age
Gender
Sexual orientation
Nationality
Race
These seem the most likely to result in an incorrect or insensitive label (e.g. gender assumption of an individual in a photo). There are other categories which might be useful for search relevancy and are less likely to be applied in an insensitive manner. Some examples include:
Occupation
Marital status
Health and disability status
Political affiliation or preference
Religious affiliation or preference
Specifics for how this will be tackled regarding the Rekognition data will be outlined in the associated implementation plan.
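One possible shape for this filtering is sketched below, assuming a hand-curated mapping from label name to demographic categories. The mapping entries are illustrative placeholders rather than a reviewed classification of the AWS list, and `LABEL_CATEGORIES` and `is_label_allowed` are hypothetical names; the actual approach belongs to the implementation plan.

```python
# Categories proposed for exclusion above.
EXCLUDED_CATEGORIES = frozenset(
    {"age", "gender", "sexual orientation", "nationality", "race"}
)

# Hypothetical, hand-curated mapping from label name to the demographic
# categories it carries. These entries are illustrative only.
LABEL_CATEGORIES = {
    "Woman": frozenset({"gender"}),
    "Boy": frozenset({"age", "gender"}),
    "Nurse": frozenset({"occupation"}),
    "Bride": frozenset({"marital status"}),
}


def is_label_allowed(name, categories=LABEL_CATEGORIES, excluded=EXCLUDED_CATEGORIES):
    """Return True unless the label carries at least one excluded category.

    Labels absent from the mapping are treated as having no demographic
    context and are allowed.
    """
    return not (categories.get(name, frozenset()) & excluded)


candidate_labels = ["Dog", "Woman", "Nurse", "Boy", "Bride"]
allowed_labels = [name for name in candidate_labels if is_label_allowed(name)]
```

Mapping each label to a set of categories, rather than a single one, handles labels that fall under more than one category at once.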
Success¶
This project can be marked as a success once the machine-generated tags from Rekognition are available in both the API and the frontend.
If the labels are observed to have a negative impact on search relevancy, we will need a mechanism or plan in the API for suppressing or deboosting the machine-generated tags without having to remove them entirely (NB: we may be able to leverage some of the DAGs created as part of the search relevancy sandbox project for this rollback). We do not currently have the capacity to accurately and definitively assess result relevancy, though we plan to build those tools out in #421. We still feel that this project has value now, much as the introduction of iNaturalist data did, even though that effort carried the same risks.
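To make the difference between deboosting and suppressing concrete, here is a toy scoring sketch in which machine-generated tags contribute a reduced weight to a match score instead of being deleted. It is not how the API computes relevancy; the `rekognition` provider name, the tag shape, and `tag_match_score` are assumptions made purely for illustration.

```python
# Hypothetical set of providers whose tags are machine-generated.
MACHINE_PROVIDERS = frozenset({"rekognition"})


def tag_match_score(query, tags, machine_weight=0.5):
    """Score a record by counting tags whose name equals the query.

    Tags from other providers count 1.0 each; machine-generated tags count
    `machine_weight` each. A weight below 1.0 deboosts them, and a weight of
    0.0 suppresses them entirely while leaving the stored data untouched.
    """
    wanted = query.casefold()
    score = 0.0
    for tag in tags:
        if tag["name"].casefold() == wanted:
            is_machine = tag.get("provider") in MACHINE_PROVIDERS
            score += machine_weight if is_machine else 1.0
    return score


record_tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "Cat", "provider": "rekognition", "accuracy": 0.99},
]
deboosted = tag_match_score("cat", record_tags)
suppressed = tag_match_score("cat", record_tags, machine_weight=0.0)
```

Because the weight is a query-time parameter in this sketch, rolling back a relevancy regression would not require touching the catalog.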
The S3 bucket containing the Rekognition data will be retained indefinitely after this project’s completion, though it can be moved to an infrequent access storage class once the initial data import is complete. This will allow us to perform additional extractions on the data in the future if desired.
Participants and stakeholders¶
Lead: @AetherUnbound
Design: @fcoveram (if any frontend design is deemed necessary)
Implementation: Implementation may be necessary for the frontend, API, and catalog; all developers working on those aspects of the project could be involved.
Infrastructure¶
The Rekognition data presently exists in an S3 bucket that was previously accessible to @zackkrida. We will need to ensure that the bucket is accessible to whatever resources are chosen to process the data. This was previously done by manually instantiating an EC2 instance to run a Python script which generated a labels CSV. We may instead wish to either run any pre-processing locally or set up an Airflow DAG which would perform the processing for us.
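For a sense of what such pre-processing might involve, the sketch below flattens Rekognition label responses into CSV rows. It assumes each stored record is the JSON body of a `DetectLabels` response, which has a `Labels` list whose entries carry `Name` and a percentage `Confidence`; how the data is actually laid out in the bucket is not described here, and the CSV columns and function names are hypothetical. Reading from S3 is left out so the example stays self-contained.

```python
import csv
import io
import json


def labels_to_rows(image_id, response):
    """Yield (image_id, label, accuracy) tuples for one parsed response."""
    for label in response.get("Labels", []):
        # Rekognition reports Confidence as a percentage (0-100); convert it
        # to a 0-1 accuracy value.
        yield (image_id, label["Name"], round(label["Confidence"] / 100, 4))


def write_labels_csv(records, out):
    """Write a labels CSV from (image_id, raw_json) pairs; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["image_id", "label", "accuracy"])
    count = 0
    for image_id, raw in records:
        for row in labels_to_rows(image_id, json.loads(raw)):
            writer.writerow(row)
            count += 1
    return count


# Usage with an in-memory buffer and a made-up record.
sample_records = [
    ("abc", '{"Labels": [{"Name": "Dog", "Confidence": 98.5}]}'),
]
buffer = io.StringIO()
rows_written = write_labels_csv(sample_records, buffer)
```

The same two functions could be called from a local script or wrapped in an Airflow task, since they only depend on an iterable of records and a writable file-like object.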
Accessibility¶
The greatest accessibility concern is ensuring that whatever mechanism we use to convey the machine-generated nature and accuracy values of tags in the frontend is also conveyed in a suitable manner to screen readers.
Marketing¶
We should share the addition of the new machine-generated tags publicly once they are present in both the API and the frontend.
Required implementation plans¶
The requisite implementation plans reflect the primary pieces of the project described above:
Determine and design how machine-generated tags will be displayed/conveyed in the API
Determine and design how machine-generated tags will be displayed/conveyed in the frontend
Augment the catalog database with the suitable tags
The most important, blocking aspect of this work is determining how the labels will be surfaced in API results. Once that is determined, the frontend can be modified to hide those values while the designs and implementation are carried out. All work after that point can occur simultaneously.