# 2023-07-20 Implementation Plan: Additional Search Views
Author: @obulat
## Reviewers

- [x] @zackkrida
- [ ] @sarayourfriend
## Project links

## Expected Outcomes
- API endpoints return all media with the selected tag, from the selected source, or by the selected creator, sorted by date added to Openverse.
- The frontend allows browsing media items by a selected creator, by a selected source, or with a selected tag.
- The single result pages link to these collection views; the external links are also updated to clearly show that they are external.

## Step-by-step plan
1. Update the Elasticsearch index to enable exact matching of the `tag`, `source` and `creator` fields (both the query analyzer and the index analyzer). This will require reindexing.
2. Add API endpoints for exact matching of the `tag`, `source` and `creator` fields.
3. Create the new components: `VCollectionHeader`, `VCollectionLink` and `VTag`.
4. Update the store and utils used to construct the API query to allow for searching by `tag`, `creator` or `source`, in addition to the current search by title/description/tags combination.
5. Add a switchable `additional_search_views` feature flag.
6. Create a page for `tag`/`creator`/`source` collections. The page should handle fetching and updating the search store state.
7. Update the single result pages: tags area, the “creator” and “source” area under the main media item.
8. Add the analytics event `VISIT_SOURCE_LINK` and update where `VISIT_CREATOR_LINK` is sent.
9. Cleanup after the feature flag is removed:
   - Remove conditional rendering on the single result pages.
   - Remove the `additional_search_views` feature flag and the `VMediaTag` component.
## Step details

### 1. Search controller updates
Currently, when filtering the search results, the API matches some query parameters in a fuzzy way: an item matches the query if the field value contains the query string as a separate word. When indexing the items, we “analyze” them, which means that we split the field values by whitespace and stem them. We also do the same to the query string. This means that the query string “bike” will match the field value “bikes”, “biking”, or “bike shop”.
For these pages, however, we need an exact match.
One alternative implementation considered when writing this plan was to use the database instead of Elasticsearch to get the results. This would make it easy to get exact matches. However, there are several problems with using the database rather than ES for these queries:

- The database does not cache queries in the same way that ES does, so repeated queries will not necessarily be as efficient as they are with ES.
- The database does not score documents at all, so the order would differ dramatically from the way ES orders documents. That is already an issue with respect to popularity data today, and it will become even more of an issue if we start to score documents based on other metrics, as theorised in our search relevancy discussions.
- `creator` is not indexed in the API database, so a query against it will be very slow.
To enable exact matching, we don’t need any changes in the Elasticsearch index because we already have the `.keyword` fields for `creator`, `source` and `tags`. We just need to use them in the query. This will allow for exact matching of the values (e.g. `bike` will not match `bikes` or `biking`), and will probably make the search more performant since the fields and the query won’t need to be analyzed and/or stemmed.
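For illustration, here is a minimal sketch of the difference, expressed as TypeScript constants holding Elasticsearch query-DSL bodies (the API builds its queries in Python; the exact field paths shown are assumptions based on the description above):

```ts
// Current behaviour: analyzed, stemmed matching on the text field.
const fuzzyCreatorQuery = { match: { creator: "bike shop" } } // also matches "bike", "bikes", "shop", …

// Exact matching on the existing `.keyword` subfield: only the full, unanalyzed value matches.
const exactCreatorQuery = { term: { "creator.keyword": "bike shop" } }
```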
The search controller’s `search` method should be refactored to be smaller and allow for more flexibility when creating the search query. The current implementation of query building consists of 3 steps.

We first apply the filters: if the query string has any parameters other than `q`, we use them for exact matches that must be in the search results, or must be excluded from the search results (if the parameter starts with `exclude_`).

Then, if the `q` parameter is present, we apply it as a full-text search within the `tags`, `title` and `description` fields. This is a fuzzy search, which means that the query string is stemmed and analyzed, the field values are stemmed and analyzed, and the documents are scored based on the relevance of the match. If `q` is not present, but one of the `creator`/`source`/`tags` parameters is present, we search within those fields for fuzzy matches.

Finally, we apply the ranking and sorting parameters, and “highlight” the fields that were matched in the results.
The new search controller should allow for using different filters for the first step and to not use the full-text search. We should also create a new serializer for collection search requests. It should include the common parameters for `list` requests, such as `page` and `page_size`, and the parameters for the exact matches: `tag`, `creator` and `source`.
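The serializer itself will live in the Python API, but the parameter set it accepts can be sketched as a type for clarity (the names follow the description above; this is an illustration, not the actual implementation):

```ts
// Hypothetical shape of a collection search request.
interface CollectionRequestParams {
  // Common parameters shared with other `list` requests.
  page?: number
  page_size?: number
  // Exact-match parameters for the collection views.
  tag?: string
  creator?: string
  source?: string
}
```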
### 2. New API endpoints
The new routes should use path parameters instead of query parameters for the `tag`, `creator` and `source` values. This will make the URLs more readable and easier to share, and will make it easier to cache them or perform the cache invalidation required by #1969. The path parameters should be URL encoded to preserve special characters and spaces.
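As a hedged sketch of what this looks like from the client’s side (the media-type segment and the example values below are assumptions):

```ts
// Building encoded collection paths; spaces and special characters survive the round trip.
const tagPath = `/images/tag/${encodeURIComponent("net art")}`
// => "/images/tag/net%20art"

const creatorPath = `/images/source/flickr/creator/${encodeURIComponent("Example Creator")}`
// => "/images/source/flickr/creator/Example%20Creator"
```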
Instead of using query strings, we can describe the resource via the path: `/<media type>/source/<source>/creator/<creator>` is very clean, easy to read and understand, and very easy to manage the cache for because it is a static path. The source page can use the same route by leaving off the creator. This removes the need to manage specific query params as well and would allow us to add querying within these routes more easily in the future behind the regular `q` parameter if we wanted.

For the tag route, the singular `tag` rather than plural `tags` should be used for legibility since we are presenting a single tag.
The new views should use the same pagination and dead link cleanup as the search views.
### 3. New and updated components

#### Extract the `VAudioCollection` component
Currently, it is not possible to reuse the audio collection from the audio search result page because it is part of the `audio.vue` page. We should extract the part that shows the loading skeleton, the column of `VAudioTrack` rows and the Load more section into a `VAudioCollection` component. This component will be reused on the audio search page and on the additional search views.
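As a rough sketch, the extracted component might accept props along these lines (the prop names and the `AudioDetail`/`FetchState` types are assumptions, not the final API):

```ts
// Hypothetical props for the extracted VAudioCollection component.
interface VAudioCollectionProps {
  results: AudioDetail[] // the tracks rendered as a column of VAudioTrack rows
  fetchState: FetchState // drives the loading skeleton and the Load more section
}
```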
#### Add a `VCollectionHeader` component
The header should have an icon (tag, creator or source) and the name of the tag/creator/source. For the source and creator views, there should be an external link button if a URL is available (not all creators have URLs).

The header should also display the number of results, e.g. “251 audio files with the selected tag”, “604 images provided by this source”, or “37 images by this creator in Hirshhorn Museum and Sculpture Garden”.
Note: There are sources that only have works by one creator. In this case, we should probably still have two separate pages for the source and the creator, but we might want to add a note that the creator is the only one associated with this source.
Figma links: creator desktop and mobile, source desktop and mobile, tag desktop and mobile.
#### Add a `VCollectionLink` component

This component should be a `VButton` with `as="VLink"`, should have an icon, and should accept a localized link to the creator or source page.
Figma link: creator and source buttons
#### Update links in the “information” section
The creator links in the Information section of the image and audio single result pages should have an “external link” icon.
The audio creator link should also be updated to match the image creator link. It should be a conditional component: `VLink` if the `creator_url` is truthy and `span` if the `creator_url` is falsy.
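A minimal sketch of the conditional component, assuming the page has access to the `audio` object and renders the result with a dynamic `<Component :is="…">` tag:

```ts
import { computed } from "vue"

// Render a VLink when a creator URL exists, and a plain span otherwise.
const creatorTag = computed(() => (audio.creator_url ? "VLink" : "span"))
```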
Currently, the `foreign_landing_url` is linked to the “source” on the image page and to the “provider” on the audio page. The audio page should be updated to match the image page: the `foreign_landing_url` link should be added to the “source”, not the provider.
### 4. Nuxt store and API request changes
We can reuse the search store as is for these pages. Currently, the frontend can perform searches by the `source` parameter. If the `searchBy` value is set, then the `q` parameter is replaced with the `<searchBy>=<searchTerm>` query parameter. For this project, we should add new values to the `searchBy` filter.

The API request URL is constructed from the search store state in the `prepare-search-query-params` method. We will need to update this method to use the `searchBy` filter value to construct the API request path, as described in the “New API endpoints” section above.
#### Update the `searchBy` filter
The `searchBy` filter will be used to determine the shape of the API request. While these parameters are currently mutually exclusive (we can only search by one of them), we might want to allow searching by multiple parameters in the future.

For other filters, we only use the `toggle` method to update the value. However, for `searchBy`, we need to be able to check one of the `searchBy` parameters and uncheck the others. To enable that, we should add a new `search` store method.
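A minimal sketch of such a method, assuming `searchBy` is stored as a list of checkable filter items (the method name and the filter shape are assumptions):

```ts
// Check the selected searchBy parameter and uncheck all of the others.
function setSearchBy(param: "tag" | "creator" | "source") {
  for (const item of this.filters.searchBy) {
    item.checked = item.code === param
  }
}
```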
If `searchBy` is set to `tag`, `creator` or `source`, then the media store should create a search path instead of the search query. So, instead of calling `prepareSearchQuery` to create the query parameters, it should call `prepareSearchPath` to create the path:
```ts
// `searchBy`, `searchTerm` and `prepareSearchQuery` come from the existing
// search store and utils.
const searchPathOrQuery = searchBy
  ? prepareSearchPath(searchParams, mediaType)
  : prepareSearchQuery(searchParams, mediaType)

const prepareSearchPath = (
  searchParams: Record<string, string>,
  mediaType: SupportedMediaType
) => {
  let path
  if (searchBy === "tag") {
    path = `${mediaType}/tag/${searchTerm}`
  } else {
    // Both the source and the creator collections start from the source segment.
    path = `${mediaType}/source/${searchParams[`${mediaType}Provider`]}`
    if (searchBy === "creator") {
      path += `/creator/${searchTerm}`
    }
  }
  return path
}
```
### 5. Add the `additional_search_views` feature flag

The flag should be switchable, and off by default.

### 6. Create a page for `tag`/`creator`/`source` collections
Nuxt allows creating nested dynamic routes like `/pages/_collection/_mediaType/_term`.

We should add the following pages:

- `/pages/_mediaType/tag/_tag`
- `/pages/_mediaType/source/_source`
- `/pages/_mediaType/source/_source/creator/_creator` (this page might not be needed as it might be handled by the source page)
To make sure that the `mediaType`, `source` and `creator` parameters are valid, this page should use the `validate` method to check them and show an error page if necessary:
```ts
function validate({ params, $pinia }): boolean {
  const { collection, mediaType, term } = params
  // Check that `collection` is one of ["tag", "creator", "source"],
  // and `mediaType` is one of `supportedMediaTypes`.
  // `$pinia` is available if the source needs to be checked against the provider store.
  const isValidCollection = ["tag", "creator", "source"].includes(collection)
  const isValidMediaType = supportedMediaTypes.includes(mediaType)
  // Check that `term` is present and correctly escaped (simplified to a presence check here).
  const isValidTerm = Boolean(term)
  // If the params are not valid, return `false` to show the error page.
  return isValidCollection && isValidMediaType && isValidTerm
}
```
This page should also update the state (`searchType`, `searchTerm` and the `searchBy` and provider filters) in the `search` store, and handle fetching using the `mediaStore`’s `fetchMedia` method in the `useFetch` hook.
Since it is not possible to change the path or query parameters from this page client-side, fetching can be much simpler than on the current search page (that has to watch for changes in the route and fetch if necessary).
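A hedged sketch of what the page setup could look like, assuming the existing `useSearchStore` and `useMediaStore` composables, the hypothetical `setSearchBy` setter from step 4, and the Nuxt composition-API import path:

```ts
import { useFetch } from "@nuxtjs/composition-api"

const searchStore = useSearchStore()
const mediaStore = useMediaStore()

useFetch(async () => {
  // `collection`, `mediaType` and `term` come from the validated route params.
  searchStore.setSearchType(mediaType)
  searchStore.setSearchBy(collection, term)
  await mediaStore.fetchMedia()
})
```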
This page should use `VCollectionHeader` and the image grid or the audio collection.
### 7. Update the single result pages
All of these changes should be conditional on whether the `additional_search_views` feature flag is enabled.
The Figma links for new designs:
#### Update the `VCollectionLink` area on the single result page
The content info line under the main item on the single result page should be replaced with a section that has two buttons: one for the creator link and one for the source link. This section should be horizontally scrollable on mobile and should implement scroll-snap (example: https://play.tailwindcss.com/AbfA33Za50).
#### Add the information popover next to source and provider links
An information popover that explains the difference between the source and the provider should be added next to the source and provider links.
### 8. Additional analytics events
Some existing events will already cover the new views. The views can be tracked as page views, so no separate event is necessary. The only way to access the pages is directly or via links on the single results, which will all be captured by standard page visits. Clicking on the items will be tracked as `SELECT_SEARCH_RESULT` events. These events can be narrowed by `pathname` (`/search` or `/tag*`, for example) to determine where the event occurred.
Two analytics events should be added or updated:

- The clicks on the external creator link in the `VCollectionHeader` should be tracked as `VISIT_CREATOR_LINK` events.
- We should also add a special event for visiting a source, `VISIT_SOURCE_LINK`, similar to `VISIT_CREATOR_LINK`.
### 9. Cleanup after the feature flag is enabled in production
After the feature flag is enabled in production, we should remove the conditional rendering on the single result pages, and remove the `additional_search_views` feature flag and the (old) `VMediaTag` component.
## Tests
We should add visual-regression tests for the new views. To minimize flakiness due to slow loading of the images, we should probably use the `{ filter: brightness(0%); }` trick for the images on the page.
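For example, a Playwright visual-regression test could apply the trick with an injected style tag before taking the snapshot (the route and snapshot name below are placeholders):

```ts
import { expect, test } from "@playwright/test"

test("tag collection page", async ({ page }) => {
  await page.goto("/image/tag/cat") // placeholder collection route
  // Black out all images so slow or flaky image loading cannot change the snapshot.
  await page.addStyleTag({ content: "img { filter: brightness(0%); }" })
  await expect(page).toHaveScreenshot("tag-collection.png")
})
```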
The search store tests should be updated to reflect the changes to the filters.
## Dependencies

### Infrastructure
These views might put more load on our infrastructure due to an increase in scraping activity.
### Tools & packages
No new tools or packages are necessary.
### Other projects or work
Not applicable.
## Design

## Parallelizable streams
The API changes can be done independently of the frontend changes, although they should be finished before the final testing of the frontend changes.
Adding the new components (step 3), the Nuxt store update (step 4) and the `additional_search_views` feature flag (step 5) can be done in parallel, and are not dependent on anything.
The work on the single result pages (step 7) can be done in parallel with the work on the collection pages (step 6), but should follow the previous steps.
## Blockers

The main blocker could be maintainer capacity.
## Accessibility
We should make sure that the search titles are accessible, and the pages clearly indicate the change of context.
## Rollback

To roll back the changes, we would need to set the feature flag to `OFF`.
## Risks
The biggest risk I see is that this project might be seen as an “invitation” to scraping. Hopefully, frontend rate limiting and the work on providing the dataset would minimize such risks.