Catalog Media Properties¶
This document is auto-generated from the source code in /catalog/utilities/media_props_gen/generate_media_properties.py.
This is a list of the media properties, with the descriptions of corresponding
database columns and Python objects that are used to store and retrieve media
data. The order of the properties corresponds to their order in the image
table. Property names typically match those of the database columns, except when
noted otherwise in the Python column’s name property.
Image Properties¶
| Name | DB Field | Python Column |
|---|---|---|
| | uuid, nullable | UUIDColumn ( |
| | timestamp with time zone, non-nullable | TimestampColumn ( |
| | timestamp with time zone, non-nullable | TimestampColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | text, nullable | StringColumn ( |
| | text, nullable | URLColumn ( |
| | text, non-nullable | URLColumn ( |
| | text, nullable | URLColumn ( |
| | integer, nullable | IntegerColumn ( |
| | integer, nullable | IntegerColumn ( |
| | integer, nullable | IntegerColumn ( |
| | character varying (50), non-nullable | StringColumn ( |
| | character varying (25), nullable | StringColumn ( |
| | text, nullable | StringColumn ( |
| | text, nullable | URLColumn ( |
| | text, nullable | StringColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | boolean, nullable | BooleanColumn ( |
| | timestamp with time zone, nullable | TimestampColumn ( |
| | boolean, non-nullable | BooleanColumn ( |
| | character varying (5), nullable | StringColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | double precision, nullable | CalculatedColumn ( |
Audio Properties¶
| Name | DB Field | Python Column |
|---|---|---|
| | uuid, nullable | UUIDColumn ( |
| | timestamp with time zone, non-nullable | TimestampColumn ( |
| | timestamp with time zone, non-nullable | TimestampColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | text, nullable | StringColumn ( |
| | text, nullable | URLColumn ( |
| | text, non-nullable | URLColumn ( |
| | text, nullable | URLColumn ( |
| | character varying (5), nullable | StringColumn ( |
| | integer, nullable | IntegerColumn ( |
| | integer, nullable | IntegerColumn ( |
| | integer, nullable | IntegerColumn ( |
| | character varying (80), nullable | StringColumn ( |
| | array of character varying (80), nullable | ArrayColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | integer, nullable | IntegerColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | integer, nullable | IntegerColumn ( |
| | character varying (50), non-nullable | StringColumn ( |
| | character varying (25), nullable | StringColumn ( |
| | text, nullable | StringColumn ( |
| | text, nullable | URLColumn ( |
| | text, nullable | StringColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | jsonb, nullable | JSONColumn ( |
| | boolean, nullable | BooleanColumn ( |
| | timestamp with time zone, nullable | TimestampColumn ( |
| | boolean, non-nullable | BooleanColumn ( |
| | double precision, nullable | CalculatedColumn ( |
| | text, nullable | StringColumn ( |
Media Property Descriptions¶
identifier¶
Media Types: audio, image
DB Column Type: uuid, nullable
Description¶
The unique UUID identifier for the media item. The identifier is generated
when the item is first inserted into the main table.
Object Shape¶
created_on¶
Media Types: audio, image
DB Column Type: timestamp with time zone, non-nullable
Description¶
The timestamp of when the media item was first added to the Openverse catalog. This timestamp is generated when the item is first inserted into the main table.
Note
This is not the date when the item was first published on the source site.
updated_on¶
Media Types: audio, image
DB Column Type: timestamp with time zone, non-nullable
Description¶
The timestamp of the last time any change was made to the media item. Unlike
last_synced_with_source, this can also be a change from a data cleaning step,
e.g. updating license URL in the meta_data, or fixing the URL using the
batched_update DAG.
ingestion_type¶
Media Types: audio, image
DB Column Type: character varying (80), nullable
Description¶
The way the media item was ingested into the Openverse catalog.
- common_crawl: data extracted from the Common Crawl dataset, when the Creative Commons search API was first created.
- provider_api: data is extracted from various CC media provider APIs.
- sql_bulk_load: data is extracted from the SQL data dumps.
provider¶
Media Types: audio, image
DB Column Type: character varying (80), nullable
Description¶
The name of the provider of the media metadata. This is usually, but not always, the website that hosts the media item.
Object Shape¶
This is a keyword for the provider, a string in a “snake_case” form.
source¶
Media Types: audio, image
DB Column Type: character varying (80), nullable
Description¶
The name of the source of the media item. It can be a collection on a provider site, or the provider itself.
Object Shape¶
This is a keyword for the source, a string in a “snake_case” form.
foreign_identifier¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The unique identifier for the media item on the source site.
foreign_landing_url¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The URL of the landing page for the media item (not a direct link to the media file). This should be unique for each media item. This value will be used on the frontend as the URL to direct users to for downloading a media item from the upstream provider.
url¶
Media Types: audio, image
DB Column Type: text, non-nullable
Description¶
The direct URL to the media file, from which the media file can be downloaded. This should be unique for each media item.
thumbnail¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The URL of a smaller thumbnail image for the media item.
Selection Criteria¶
The smallest acceptable size for a thumbnail is 600px at the longest edge. See comment in issue #675.
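The size rule can be written as a one-line check. This is only a minimal sketch: the constant and function names are hypothetical and are not taken from the catalog code.

```python
# Hypothetical names; the 600px threshold comes from the selection criteria.
MIN_THUMBNAIL_LONGEST_EDGE = 600  # pixels


def is_acceptable_thumbnail(width: int, height: int) -> bool:
    """Return True if the longest edge is at least 600px."""
    return max(width, height) >= MIN_THUMBNAIL_LONGEST_EDGE
```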
width¶
Media Types: image
DB Column Type: integer, nullable
Description¶
The width of the main image in pixels.
height¶
Media Types: image
DB Column Type: integer, nullable
Description¶
The height of the main image in pixels.
Selection Criteria¶
If the provider does not provide the height and width of the image, it is possible to send a HEAD request to the direct URL to extract this data.
filesize¶
Media Types: audio, image
DB Column Type: integer, nullable
Description¶
The size of the main media file in bytes. If not available in the API response, it can be extracted from a HEAD request response to the media file URL.
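One way to read the size from a HEAD response is to parse its Content-Length header. The sketch below assumes the response headers are available as a plain mapping. The function name is hypothetical, and this is not the catalog's own implementation.

```python
from typing import Mapping, Optional


def filesize_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Parse the Content-Length header of a HEAD response into bytes."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("content-length")
    if raw is None or not raw.strip().isdigit():
        # Missing or malformed header: leave the nullable column empty.
        return None
    return int(raw)
```

Returning None rather than 0 keeps "size unknown" distinct from an empty file, which matches the nullable column type.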
license¶
Media Types: audio, image
DB Column Type: character varying (50), non-nullable
Description¶
The slug of the license under which the media item is licensed. For the list of available license slugs, see the openverse-attribution package.
license_version¶
Media Types: audio, image
DB Column Type: character varying (25), nullable
Description¶
The string representing the version of the license. PublicDomain has no version, which is denoted as “N/A”.
creator¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The name of the creator of the media item. Some providers use “Unknown” or similar for unknown creators; see issue #1326.
creator_url¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The URL of the creator’s page, usually on the source site.
title¶
Media Types: audio, image
DB Column Type: text, nullable
Description¶
The title of the media item.
Existing Data Inconsistencies¶
Provider scripts may include HTML tags in record titles; see issue #1441. Some Wikimedia titles in the database still include the “FILE:” prefix and an unnecessary file extension, which is hot-fixed in the frontend. Some titles were incorrectly encoded, for which there is also a hot-fix in the frontend.
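The Wikimedia prefix and extension problem could be cleaned up roughly as follows. This is an illustrative sketch, not the frontend's actual hot-fix: the function name and the list of extensions are assumptions.

```python
import re

# Assumed list of image extensions; the real hot-fix may cover others.
_EXTENSION = re.compile(r"\.(jpe?g|png|gif|svg|tiff?)$", re.IGNORECASE)
_FILE_PREFIX = re.compile(r"^file:", re.IGNORECASE)


def clean_wikimedia_title(title: str) -> str:
    """Drop a leading "FILE:" prefix and a trailing file extension."""
    title = _FILE_PREFIX.sub("", title)
    title = _EXTENSION.sub("", title)
    return title.strip()
```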
meta_data¶
Media Types: audio, image
DB Column Type: jsonb, nullable
Description¶
A JSONB object containing additional metadata about the media item. This must
contain the license_url (automatically added by the MediaStore class from
the License object).
Selection Criteria¶
Relevant information that is not covered by other fields should be added here. This includes items such as the dates of creation and publication, geographical data, descriptions, and popularity data.
watermarked¶
Media Types: audio, image
DB Column Type: boolean, nullable
Description¶
Whether the image has a discernible watermark. If this field is null or false, it does not mean the image doesn’t have a watermark. This field was set to true for some CommonCrawl providers (McCord Museum, 500px, FloraOn, IHA). Currently, no provider script or SQL ingestion sets this field value.
last_synced_with_source¶
Media Types: audio, image
DB Column Type: timestamp with time zone, nullable
Description¶
For new items, this is the same timestamp as created_on. For items that were
updated during re-ingestion, it is the timestamp of the re-ingestion.
removed_from_source¶
Media Types: audio, image
DB Column Type: boolean, non-nullable
Description¶
Set to True for items that were not updated during a re-run of the provider
script. Items that have True in removed_from_source are not added to the ES
index during the data refresh process.
Selection Criteria¶
The expire_old_images DAG, added in “Expiration of outdated images in the
database”, was used to set removed_from_source to True for images that were
last updated longer ago than the OLDEST_PER_PROVIDER value.
filetype¶
Media Types: audio, image
DB Column Type: character varying (5), nullable
Description¶
The filetype (extension) of the main media file (not the MIME type). If the filetype is not available in the API response, it can be extracted from the URL extension or from the HEAD response from the media direct URL.
Normalization and Validation¶
The extract_filetype function in catalog/dags/common/extensions.py is used to
get the file extension from a URL. The function returns the file extension in
lowercase. Equivalent image file types are normalized to a single file type,
see FILETYPE_EQUIVALENTS.
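The behavior described above (take the extension from the URL, lowercase it, collapse equivalent types) can be sketched as below. This is not the real extract_filetype: the function name is hypothetical, and the mapping is an illustrative stand-in for FILETYPE_EQUIVALENTS, whose actual contents may differ.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Illustrative mapping only; the real FILETYPE_EQUIVALENTS may differ.
EXAMPLE_EQUIVALENTS = {"jpeg": "jpg", "tif": "tiff"}


def sketch_extract_filetype(url: str) -> Optional[str]:
    """Return the lowercase, normalized extension of the URL path, if any."""
    # Use only the path so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix  # e.g. ".JPEG", or "" if absent
    if not suffix:
        return None
    extension = suffix[1:].lower()
    return EXAMPLE_EQUIVALENTS.get(extension, extension)
```

Unlike a production version, this sketch does not check the result against a list of known file types or the 5-character column limit.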
category¶
Media Types: audio, image
DB Column Type: character varying (80), nullable
Description¶
One of the media category Enum values: ImageCategory and AudioCategory.
Selection Criteria¶
Category is assigned heuristically based on the extension and default categories per provider.
standardized_popularity¶
Media Types: audio, image
DB Column Type: double precision, nullable
Description¶
Normalized popularity, a calculated column. Only available for providers that have popularity data.
Normalization and Validation¶
The value is updated monthly during the data refresh process.
duration¶
Media Types: audio
DB Column Type: integer, nullable
Description¶
The duration of the main audio file in milliseconds.
bit_rate¶
Media Types: audio
DB Column Type: integer, nullable
Description¶
The bit rate of the main audio file.
sample_rate¶
Media Types: audio
DB Column Type: integer, nullable
Description¶
The sample rate of the main audio file.
genres¶
Media Types: audio
DB Column Type: array of character varying (80), nullable
Description¶
List of genres associated with the audio.
alt_files¶
Media Types: audio
DB Column Type: jsonb, nullable
Description¶
The list of alternative file details for the audio (different formats or quality levels).
Object Shape¶
JSONB array of dictionaries:
[
  {
    "url": "http://example.com/audio.mp3",
    "filesize": 123456,
    "filetype": "mp3",
    "bit_rate": 128,
    "sample_rate": 44100
  }
]
- url (string): the direct URL of the alternative file.
- filesize (integer): the size of the alternative file in bytes.
- filetype (string): the file type (extension) of the alternative file.
- bit_rate (integer): the bit rate of the alternative file.
- sample_rate (integer): the sample rate of the alternative file.
- duration (integer, optional): the duration of the alternative file in milliseconds.
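A provider script could check an alt_files value against this shape before saving it. The sketch below is a hypothetical helper, not part of the catalog code, and it assumes that every field not marked optional is required.

```python
from typing import Any, Dict, List

# Assumption: every field not marked optional in the shape is required.
REQUIRED_FIELDS = {
    "url": str,
    "filesize": int,
    "filetype": str,
    "bit_rate": int,
    "sample_rate": int,
}
OPTIONAL_FIELDS = {"duration": int}


def alt_file_problems(alt_files: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in an alt_files list."""
    problems = []
    for index, entry in enumerate(alt_files):
        for name, expected in REQUIRED_FIELDS.items():
            if name not in entry:
                problems.append(f"alt_files[{index}]: missing '{name}'")
            elif not isinstance(entry[name], expected):
                problems.append(
                    f"alt_files[{index}]: '{name}' is not {expected.__name__}"
                )
        for name, expected in OPTIONAL_FIELDS.items():
            if name in entry and not isinstance(entry[name], expected):
                problems.append(
                    f"alt_files[{index}]: '{name}' is not {expected.__name__}"
                )
    return problems
```

An empty result means the list matches the documented shape under that assumption.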
audio_set¶
Media Types: audio
DB Column Type: jsonb, nullable
Description¶
The information about the audio set (collection, album) that the audio belongs to.
Object Shape¶
JSONB object:
{
  "title": "Audio Set Title",
  "foreign_landing_url": "http://example.com",
  "thumbnail": "http://example.com/thumbnail.jpg",
  "creator": "Creator Name",
  "creator_url": "http://example.com/creator",
  "foreign_identifier": "123456"
}
- title (string): the title of the audio set.
- foreign_landing_url (string): the URL of the audio set on the source site.
- thumbnail (string): the URL of the thumbnail image for the audio set.
- creator (string): the name of the creator of the audio set.
- creator_url (string): the URL of the creator’s page, usually on the source site.
- foreign_identifier (string): the unique identifier for the audio set on the source site. This identifier is saved in the audio_set_foreign_identifier field of the TSV and catalog audio table.
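The relationship between the audio_set object and the separate audio_set_foreign_identifier column can be illustrated as follows. This is a sketch under the assumption that the identifier is simply copied out of the object. The helper name is hypothetical and does not come from the catalog code.

```python
from typing import Any, Dict, Optional


def audio_set_columns(audio_set: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the two audio-set column values from one audio_set object."""
    if not audio_set:
        # No set information: both nullable columns stay empty.
        return {"audio_set": None, "audio_set_foreign_identifier": None}
    return {
        "audio_set": audio_set,
        # Assumed to be a plain copy of the object's foreign_identifier.
        "audio_set_foreign_identifier": audio_set.get("foreign_identifier"),
    }
```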
audio_set_foreign_identifier¶
Media Types: audio
DB Column Type: text, nullable
Description¶
Unique identifier for the audio set on the source site.
set_position¶
Media Types: audio
DB Column Type: integer, nullable
Description¶
The position of the audio in the audio set.
Encoding problems¶
At the beginning of the project, some items were saved to the database with encoding problems. Non-ASCII symbols were incorrectly saved in three ways:
- escaped with double backslashes instead of a single backslash, e.g. ä->\\u00e4
- escaped without a backslash, e.g. ä->u00e4
- x-escaped with double backslashes, e.g. ä->\\xe4
With subsequent data re-ingestions, most titles were fixed. This problem still
exists for titles of items that were not re-ingested, and for fields that are
not simply replaced during re-ingestion, such as tags and
meta_data.description. The frontend uses a hotfix to replace these encoding
problems with the correct characters in title, tags and creator.
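The three broken forms can be repaired with regular expressions along the following lines. This is a rough sketch and not the frontend's actual hotfix: the function and pattern names are hypothetical, and the x-escaped form is assumed to carry a single two-digit code point.

```python
import re

# One or more backslashes, then "u" and four hex digits, e.g. \\u00e4.
ESCAPED_UNICODE = re.compile(r"\\+u([0-9a-fA-F]{4})")
# One or more backslashes, then "x" and two hex digits.
ESCAPED_HEX = re.compile(r"\\+x([0-9a-fA-F]{2})")
# "u" and four hex digits with no backslash, e.g. u00e4. Deliberately loose.
BARE_UNICODE = re.compile(r"u([0-9a-fA-F]{4})")


def _to_char(match: re.Match) -> str:
    """Convert the captured hex digits into the character they encode."""
    return chr(int(match.group(1), 16))


def repair_encoding(text: str) -> str:
    """Replace the three broken escape forms with the intended characters."""
    # Handle the backslash forms first so the bare pattern only sees leftovers.
    for pattern in (ESCAPED_UNICODE, ESCAPED_HEX, BARE_UNICODE):
        text = pattern.sub(_to_char, text)
    return text
```

The bare pattern can misfire on legitimate text, because any "u" followed by four hex digits matches (for example inside "menu2024"). A real cleanup would need a narrower rule for that case.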