Openverse Media Properties
_This document is auto-generated from the source code in utilities/media_props_gen/media_props_generation.py._
This is a list of the media properties, with the descriptions of corresponding
database columns and Python objects that are used to store and retrieve media
data. The order of the properties corresponds to their order in the image_view
materialized view.
image_view properties
DB Field | DB Nullable | DB Type | Python Column | Description |
---|---|---|---|---|
identifier | True | uuid | UUIDColumn (nullable=True, required=False, upsert_strategy=newest_non_null) | UUID generated by Postgres |
title | True | string | StringColumn (nullable=True, required=False, upsert_strategy=newest_non_null) | Media title provided by the creator |
width | True | integer | IntegerColumn (nullable=True, required=False, upsert_strategy=newest_non_null) | Image width in pixels |
tags | True | jsonb | JSONColumn (nullable=True, required=False, upsert_strategy=merge_jsonb_arrays) | A list of tag objects |
Media Properties
identifier
Media type: image, audio
Description
UUID generated by Postgres; it is the primary key of the image and audio tables.
The Openverse identifier is generated during the ingestion process, when the image is inserted into the image table for the first time.
Title
Media type: image, audio
The title of the work, as provided by the creator. If the title is blank, the attribution uses “This work” instead (for example, “This work by creator is licensed with CC BY”).
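The fallback described above can be sketched as a small helper. This is only an illustration: the function name build_attribution and its parameters are hypothetical, and treating whitespace-only titles as blank is an assumption rather than documented Openverse behavior.

```python
def build_attribution(title, creator, license_name):
    """Build an attribution sentence, falling back to "This work" for blank titles.

    Hypothetical helper for illustration; not the actual Openverse code.
    """
    # Assumption: None, "" and whitespace-only titles all count as blank.
    shown_title = title.strip() if title and title.strip() else "This work"
    return f"{shown_title} by {creator} is licensed with {license_name}"


print(build_attribution("", "creator", "CC BY"))
print(build_attribution("Sunset", "Ana", "CC BY"))
```

The first call exercises the blank-title fallback and the second one the regular path.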
Shape of the data and Selection criteria
We select the default title returned by the provider. It can be blank. Blank values (whether None or an empty string “”) are saved as an empty string in the database (TODO: check if this is true).
Existing data problems
Some media items have incorrectly encoded titles [^1 - Link to a description of Unicode encoding problem in the “postamble”]. This is compensated for in the frontend (link to the code that fixes title encoding). The problem has been fixed for items that were reingested after some time in 2020, but it might still persist for items that have not been updated since then. Link to issues for fixing the encoding in the catalog/api/frontend.
Some Wikimedia titles have the shape “FILE:xxx.svg”. The provider script removes them now, but this is still a problem for items that were ingested earlier. For those items, the “FILE” and “.extension” parts are removed in the frontend (link to the code). Link to the issue to fix it in the API.
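As an illustration of the Wikimedia title cleanup, the following is a minimal Python sketch. It is not the frontend code the text refers to: the function name clean_wikimedia_title, the two regular expressions, and the rule of only touching titles that carry the prefix are all assumptions made for this example.

```python
import re

# Leading "File:" prefix, matched case-insensitively so "FILE:" is covered too.
_FILE_PREFIX = re.compile(r"^file:", re.IGNORECASE)
# Trailing extension of 2 to 4 alphanumeric characters, such as ".svg" or ".jpeg".
_EXTENSION = re.compile(r"\.[A-Za-z0-9]{2,4}$")


def clean_wikimedia_title(title):
    """Strip a "FILE:" prefix and a trailing file extension from a title.

    Hypothetical helper: titles without the prefix are returned unchanged,
    so ordinary titles that happen to contain a dot are left alone.
    """
    match = _FILE_PREFIX.match(title)
    if match is None:
        return title
    stripped = title[match.end():].strip()
    return _EXTENSION.sub("", stripped)


print(clean_wikimedia_title("FILE:Example.svg"))
print(clean_wikimedia_title("Sunset over the bay"))
```

Restricting the extension removal to prefixed titles is a deliberate, conservative choice for the sketch; a real cleanup might use different rules.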
width
Media type: image
Description
Image width in pixels. The width is used to calculate the aspect ratio of the image on the frontend to prevent content shifting.
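To make the aspect-ratio use concrete, here is a small Python sketch of the arithmetic. The frontend itself is not written in Python, and the function names and the height argument are assumptions introduced for this example (height is not covered in this section).

```python
def aspect_ratio(width, height):
    """Return width / height, or None when either dimension is missing or zero."""
    if not width or not height:
        return None
    return width / height


def reserved_height(container_width, width, height):
    """Height to reserve for an image scaled to container_width.

    Reserving this space before the image loads is what prevents content
    shifting. Returns None when the ratio cannot be computed.
    """
    ratio = aspect_ratio(width, height)
    if ratio is None:
        return None
    return container_width / ratio


print(reserved_height(400, 1024, 512))
```

When the width is null in the database, the ratio cannot be computed up front, which is why the sketch returns None instead of guessing.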
Shape of the data collected by provider scripts
If the provider API returns the width, it is collected as an integer. If the
provider API does not return the width, it can be requested using a head request
to the direct image URL; however, the result is not always accurate.
What happens for images where we don’t use the largest image?
Common issues
Incorrectly escaped Unicode characters
In: title, tags, creator
Sometimes, Unicode characters were encoded incorrectly. Symbols like \u<xxxx>
were saved as either \\u<xxxx> or u<xxxx>. As a result, the UI shows gibberish
instead of the non-ASCII symbols. This is addressed in the Nuxt frontend. When a
data refresh is run, the incorrect values can be replaced if the upsert_strategy
for the column is newest_non_null. However, strategies that update the value
instead of replacing it lead to duplicated values. Thus, some items have
duplicated tags: the correctly-encoded tags cannot replace the
incorrectly-encoded tags because the two are different values.
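The difference between the two kinds of upsert strategy can be sketched in Python. The functions below are simplified stand-ins written for this example, not the actual implementations of newest_non_null or merge_jsonb_arrays, and the tag shape ({"name": ...}) is an assumption. The decode_escaped helper only handles the variant that kept a literal backslash; the u<xxxx> variant is ambiguous and is left untouched, as are surrogate pairs.

```python
import re

# One or more literal backslashes, then "u" and four hex digits.
_ESCAPED = re.compile(r"\\+u([0-9a-fA-F]{4})")


def decode_escaped(text):
    """Replace literal \\uXXXX sequences with the characters they stand for."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


def newest_non_null(old, new):
    """Simplified stand-in: the new value replaces the old one unless it is None."""
    return new if new is not None else old


def merge_arrays(old_items, new_items):
    """Simplified stand-in: append new items that are not already present."""
    merged = list(old_items)
    for item in new_items:
        if item not in merged:
            merged.append(item)
    return merged


bad_title = "caf\\u00e9"   # stored with a literal backslash
good_title = "café"        # what the provider returns after the fix

# Replacing strategy: the bad value is overwritten.
print(newest_non_null(bad_title, good_title))

# Merging strategy: both spellings survive, because they are different values.
print(merge_arrays([{"name": bad_title}], [{"name": good_title}]))

# Decoding the stored value recovers the intended text.
print(decode_escaped(bad_title) == good_title)
```

Because the merge compares values for equality, cleaning up the duplicates would require normalizing the stored tags (for example with something like decode_escaped) before merging.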