2023-07-20 Implementation Plan: Additional Search Views

Author: @obulat

Reviewers

Note

The original version of this plan was significantly revised on 2024-03-01 due to problems discovered during the implementation work. See Plan revisions section for summary of changes, and Original plan for the original version.

Expected Outcomes

API returns all media with the selected tag, from the selected source or by the selected creator within a given source, sorted by date added to Openverse, when collection parameter is set in the search endpoints. Sample API URLs:

  • /v1/<images|audio>/?collection=tag&tag=cat

  • /v1/<images|audio>/?collection=source&source=flickr

  • /v1/<images|audio>/?collection=creator&source=flickr&creator=Photographer

Frontend allows to browse media items by a selected creator, source, or with a selected tag on /collection page. Sample frontend URLs:

  • /<image|audio>/collection?tag=cat,

  • /<image|audio>/collection?source=flickr and

  • /<image|audio>/collection?source=flickr&creator=Photographer.

These pages are indexed by search engines, have relevant SEO properties and can be shared with an attractive thumbnail and relevant data.

The single result pages are updated to add the links to source, creator and tag collections.

Step-by-step plan

  1. API changes: Add API collection ES query builder for exact matching of the tag, source and creator fields to the search controller. Update the request serializer to validate collection parameters.

  2. Add a switchable “additional_search_views” feature flag.

  3. Update the store and utils used to construct the API query to allow for retrieving the collections by tag, creator or source.

  4. Create a page for collections that handles parameter validation, media fetching and setting relevant SEO properties.

  5. Create the new components: VCollectionHeader, VCollectionLink and VTag.

  6. Update the single result pages: tags area, the “creator” and “source” area under the main media item.

  7. Add the Analytics event VISIT_SOURCE_LINK and update where VISIT_CREATOR_LINK is sent. Also update the SELECT_SEARCH_RESULT, REACH_RESULT_END and LOAD_MORE events to include the new views.

  8. Cleanup after the feature flag is removed:

    • Remove conditional rendering (single result pages, sources page).

    • Remove the additional_search_views feature flag and VMediaTag component.

    • Stabilize the collection query parameters in the API (remove unstable__ prefix)

Step details

1. API changes

Currently, when filtering the search results, the API matches some query parameters in a fuzzy way: an item matches the query if the field value contains the query string as a separate word. When indexing the items, we “analyze” them, which means that we split the field values by whitespace and stem them. We also do the same to the query string. This means that the query string “bike” will match the field value “bikes”, “biking”, or “bike shop”.

For these pages, however, we need an exact match.

One alternative implementation considered when writing this plan was to use the database instead of the Elasticsearch to get the results. This would make it easy to get the exact matches. However, there are some problems with using the database rather than ES to access anything:

  • The database does not cache queries in the same way that ES does. Repeated queries will not necessarily be as efficient as from ES.

  • The database does not score documents at all, so the order will different dramatically to the way that ES would order the documents. That’s an issue with respect to popularity data today already, but will become even more of an issue if we start to score documents based on other metrics as theorised by our search relevancy discussions.

  • creator is not indexed in the API database, so a query against it will be very slow.

Search controller updates

To enable exact matching, we don’t need any changes in Elasticsearch index because we already have the .keyword fields for creator, source and tags. Using these fields in the term query will allow for exact matching of the values (e.g. bike will not match bikes or biking), and will probably make the search more performant since the fields and the query won’t need to be analyzed and/or stemmed.

The search controller’s search method should be refactored to be smaller and allow for more flexibility when creating the search query. The current implementation of query building consists of 3 steps.

We first apply the filters: if the query string has any parameters other than q, we use them for exact matches that must be in the search results, or must be excluded from the search results (if the parameter starts with exclude_).

Then, if q parameter is present, we apply the q parameter, which is a full-text search within tags, title and description fields. This is a fuzzy search, which means that the query string is stemmed and analyzed, and the field values are stemmed and analyzed, and the documents are scored based on the relevance of the match. If q is not present, but one of the creator/source/tags parameter is present, we search within those fields for fuzzy matches.

Finally, we apply the ranking and sorting parameters, and “highlight” the fields that were matched in the results.

The search controller needs to be updated to extract the first 2 steps into a build_search_query method. A new build_collection_query should be added and used when the collection parameter is present. This method should create a filter query for the relevant field and value.

The pagination and dead link cleanup should be the same for additional search queries as for the default search queries.

Search request serializer updates

The initial version of this plan proposed to add new API endpoints and use path parameters for the values (e.g., /image/tag/cat) to make the URLs more readable, easier to share, will be easier to cache or perform cache invalidation required by #1969. However, since tags and creator names contain characters that are special for path segments (/, ? and &) that cannot be properly encoded, this approach turned out to be not feasible. See PR#3793 for attempts at making the path parameters work

Note

The new query parameters (collection and tag) will have unstable__ prefix until this project is launched. In the text below, for brevity, the prefix is omitted.

The collections will use the search endpoint with collection query parameter set to tag, creator or source, and the following additional query parameters:

  • collection=tag will require tag parameter to be set

  • collection=source will require source parameter to be set

  • collection=creator will require both source and creator parameter to be set.

This means that the existing source and creator parameters will be reused. For the tag collection, the new singular tag parameter should be used, rather than the existing plural tags, since we are only presenting a single tag.

MediaSearchRequestSerializer should be updated to validate the collection parameter. This validator will check that the request contains the necessary additional parameters: source for collection=source, creator and source for collection=creator, and tag for collection=tag. It will only validate that the parameters have values, but not the values themselves.

The MediaSearchRequestSourceSerializer currently splits the source parameter value by , and validates that each value is a valid source. This validator will be updated:

  • If the collection parameter is not present, it will behave as before.

  • If the collection parameter is set to source or creator, it will take the value of source as is, and check if this value is present in the list of valid sources, without splitting it by ,. If the value is not valid, it will return a 400 error, showing the invalid value, as well as a list of valid sources for the media type.

  • If the collection parameter is set to tag, it will return the value as-is,

  • because it will be ignored in the search controller.

The documentation for the source parameter should be updated to reflect that it accepts a single source when there is a collection parameter set to source or creator, and a comma-separated list of sources when there is no collection parameter. Similarly, the documentation for creator parameter should say that it accepts a comma-separated list of values when there is no q parameter, and a single URI-encoded value when the collection=creator parameter is present. The creator serializer does not need any changes as the parsing (splitting by comma for regular searches and URI-decoding for creator collections) is only done in the search controller.

The additional documentation for the search parameters should be added as a draft so that it’s not published on the API documentation site until we launch this project and remove the unstable__ prefix.

2. Add the additional_search_views feature flag

The flag should be switchable, and off by default. This will mean that the changes are visible on staging when the flag is switched on using the preferences page. When the features are stable, we will turn the flag on to test the new views in production. After we conclude that the project is successful, we will remove the flag and the conditional rendering.

3. Nuxt store and API request changes

We can reuse the search store as is for these pages.

Previously, the frontend search store had a searchBy filter that allowed to search within the creator field. When searchBy value was set, the API q parameter was replaced with the <searchBy>=<searchTerm> API query parameter. This filter was removed because searchBy is not strictly a filter that can be toggled on or off

Add strategy and collectionParams to search store

The strategy parameter will be used to determine the API request query parameters. If it is set to "search", then the API query will be created using the current approach. If it is set to "collection", the query will be constructed using the new method (buildCollectionQuery) to set collection and other relevant parameters using the new collectionParams object in the store. It will not be setting the filter parameters such as license or category, and will ignore such unsupported query parameters.

4. Create collection pages

We should add the following pages:

  • /image/collection.vue

  • /audio/collection.vue

Validation of the collection query parameters

This page will use the collection middleware to validate the collection query parameters:

  • if collection is set to creator, there should also exist a source parameter

  • source parameter should be validated to be an existing source using the provider store

  • creator and tag parameters should be decoded using decodeURIComponent.

The middleware validates that the values of parameters is not empty, and that the source parameter is an existing source in the provider store. If the values are invalid, the 404 yellow error page is shown.

Fetching the media

This page should also update the state (searchType, collectionParams and strategy) in the search store and handle fetching using mediaStore’s fetchMedia method in the useFetch hook.

The media collections should be updated (not as part of this project) to move the load more methods to the page that fetches the media (i.e., search.vue or collection.vue), and the mediaStore should be updated to remove the load more methods.

SEO

This page should also have relevant SEO properties, and should be indexed by the search engines.

The following titles should be used for the pages:

  • “Images by Olga at Flickr” for the creator page

  • “Audio from Wikimedia” for the source page

  • “Images with the cat tag” for the tag page

There are i18n consideration for these titles that we will work on during the implementation. It is important to make the titles translatable, which is difficult if the non-translatable dynamic names are used inside the sentence due to different sentence structures.

The generic Openverse thumbnail will be used. We could also generate a thumbnail for the collection pages in the future, but this is not in scope for this project.

5. Update the single result pages

All of these changes should be conditional on whether the additional_search_views feature flag is enabled.

The Figma links for new designs:

Update the tags area on the single result page

The tags area should be collapsible to make long lists of tags collapsible: https://github.com/WordPress/openverse/issues/2589

6. New and updated components

Extract the VAudioCollection component

Currently, it is not possible to reuse the audio collection from the audio search result page because it is a part of the audio.vue page. We should extract the part that shows the loading skeleton, the column of VAudioTrack rows and the Load more section into VAudioCollection component. This component will be reused in the audio search page and on the Additional search views.

Add a VCollectionHeader component

The header should have an icon (tag, creator or source) and the name of the tag/creator/source. For source and the creator, there should be an external link button if it’s available (not all creators have urls).

The header should also display the number of results, “251 audio files with the selected tag”, “604 images provided by this source”, “37 images by this creator in Hirshhorn Museum and Sculpture Garden”.

Note

There are sources that only have works by one creator. In this case, we should probably still have two separate pages for the source and the creator, but we might want to add a note that the creator is the only one associated with this source.

Figma links: creator desktop and mobile, source desktop and mobile, tag desktop and mobile.

7. Additional analytics events

Some existing events will already track the new views events. The views can be tracked as page views, so no separate event is necessary. The only way to access the pages is directly or via links on the single results, which will all be captured by standard page visits. Clicking on the items will be tracked as SELECT_SEARCH_RESULT events. These events can be narrowed by pathname (/search or /collection, for example) to determine where the event occurred.

Analytics events should be added or updated:

  • The clicks on external creator link in the VCollectionHeader should be tracked as VISIT_CREATOR_LINK events.

  • We should also a special event for visiting source VISIT_SOURCE_LINK, similar to VISIT_CREATOR_LINK.

  • The REACH_RESULT_END, LOAD_MORE and SELECT_SEARCH_RESULT events should add strategy (search/tag/creator/source) and set query to tag name, source name or source/creator pair: cat, flickr or flickr/Olga.

8. Cleanup after the feature flag is enabled in production

After the feature flag is enabled in production, we should remove the conditional rendering on the single result pages and remove the additional_search_views feature flag and (old) VMediaTag component.`

Remove the unstable__ prefix from the collection and tag query parameters in the API.

Tests

We should add visual-regression tests for the new views. To minimize flakiness due to slow loading of the images, we should probably use the { filter: brightness(0%); } trick for the images on the page.

The search store tests should be updated to reflect the changes to the filters.

Dependencies

Infrastructure

These views potentially might cause more load on our infrastructure due to increase in scraping activity.

Tools & packages

No new tools or packages are necessary.

Other projects or work

Not applicable.

Design

Figma designs in the dev mode

Parallelizable streams

The API changes can be done independently of the frontend changes, although they should be finished before the final testing of the frontend changes.

Adding the new components (step 3), Nuxt store update (step 4) and the additional_search_views feature flag (step 5) can be done in parallel, and are not dependent on anything.

The work on the single result pages (step 7) can be done in parallel with the work on the collection pages (step 6), but should follow the previous steps.

The new frontend views can use the existing query parameters (source, creator and tags - instead of tag) until the API changes are implemented since this will return some relevant results, and changing the query parameter names is easy.

Blockers

The main blocker could be the maintainer capacity.

Accessibility

We should make sure that the search titles are accessible, and the pages clearly indicate the change of context.

Rollback

To roll back the changes, we would need to set the feature flag to OFF.

Risks

The biggest risk I see is that this project might be seen as an “invitation” to scraping. Hopefully, frontend rate limiting and the work on providing the dataset would minimize such risks.

Plan revisions

This plan was significantly revised on 2024-03-01 due to problems discovered during the implementation of the path parameters. The API changes were updated to use query parameters instead of path parameters, and to describe the new request serializer. The frontend store and API request changes were updated to use the new query parameters. The collection pages descriptions were updated to reflect the new query parameters. The steps were reordered to reflect the new implementation plan. More details on SEO was added in the collection pages section. The analytics event parameters were updated.