bryce-solr/solr/example/films/README.md at f4f046263c9b5e160a32ada2a1f1a55bfd5d81ea

Bryce f4f046263c commit

2026-05-15 22:19:14 -07:00

1.8 KiB

The movie data set is provided in JSON, Solr XML, and CSV formats. All three formats contain the same data, so you can index the documents into Solr using any one of them.
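As a rough illustration of indexing from any of the three formats, the sketch below builds (but does not send) a POST request to Solr's `/update` handler with a content type matching the chosen format. The base URL, the collection name `films`, and the sample document are assumptions made for this example, not values taken from this README.

```python
import urllib.request

# Assumptions for illustration: a local Solr and a collection named "films".
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "films"

# Content types accepted by Solr's update handler for each data format.
CONTENT_TYPES = {
    "json": "application/json",
    "xml": "application/xml",
    "csv": "text/csv",
}


def build_update_request(data: bytes, fmt: str) -> urllib.request.Request:
    """Build (without sending) a POST to the collection's /update handler."""
    url = f"{SOLR_URL}/{COLLECTION}/update?commit=true"
    headers = {"Content-Type": CONTENT_TYPES[fmt]}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# Hypothetical single document; a real run would read one of the data files.
sample = b'[{"id": "example-1", "name": "Example Film"}]'
request = build_update_request(sample, "json")
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

Only the payload bytes and the format key change between JSON, XML, and CSV; the target URL stays the same.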

This example uses the _default configset that ships with Solr, plus some custom fields added via the Schema API. It demonstrates the use of ParamSets in conjunction with the Request Parameters API.
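A ParamSet is a named group of request parameters stored through the Request Parameters API and referenced at query time. The sketch below only builds the JSON body of a `set` command; the ParamSet name `example_params` and the parameter values are hypothetical and do not come from this README.

```python
import json


def build_paramset_command(name: str, params: dict) -> str:
    """Serialize a "set" command for the Request Parameters API."""
    return json.dumps({"set": {name: params}})


# Hypothetical ParamSet: the name and the values are illustrative only.
body = build_paramset_command(
    "example_params",
    {"defType": "edismax", "qf": "name genre"},
)
# The body would be POSTed to /solr/<collection>/config/params and then
# referenced in a query with useParams=example_params.
```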

The original data was fetched from Freebase, and the data license is in the films-LICENSE.txt file. Google shut down Freebase in 2016.

This data consists of the following fields:

id - Unique identifier for the movie
name - Name of the movie
directed_by - The person(s) who directed the making of the film
initial_release_date - The earliest official initial film screening date in any country
genre - The genre(s) that the movie belongs to
film_vector - The 10-dimensional vector representing the film, according to a toy example embedding model

The name and initial_release_date fields are created via the Schema API, and the genre and directed_by fields are created by an Update Request Processor Chain called add-unknown-fields-to-the-schema.
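To make the Schema API step concrete, here is a minimal sketch of the `add-field` commands that could define the two explicitly created fields. The field types (`text_general`, `pdate`) and the `multiValued` and `stored` settings are assumptions chosen for illustration; this README does not state them.

```python
import json


def add_field_command(name: str, field_type: str,
                      multi_valued: bool = False, stored: bool = True) -> dict:
    """Build one Schema API "add-field" command as a plain dict."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "multiValued": multi_valued,
            "stored": stored,
        }
    }


# Assumed field types; adjust to match the actual schema.
name_field = add_field_command("name", "text_general")
date_field = add_field_command("initial_release_date", "pdate")

# Each body would be POSTed to /solr/<collection>/schema as JSON.
bodies = [json.dumps(name_field), json.dumps(date_field)]
```

Fields such as genre and directed_by need no command like this, because the update processor chain adds them to the schema when it first sees them in a document.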

The film_vector is a 10-dimensional embedding vector that represents the movie. It is produced by a pre-trained BERT model, followed by a dimension reduction technique that reduces the embeddings from 768 to 10 dimensions. Similar movies are expected to be close to each other, but this model is just a "toy example", so it is not guaranteed to be a good representation of the movies. The Python scripts used to create the model and calculate the film vectors are in the vectors directory.
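One way to read "close to each other" is cosine similarity between two film vectors. The sketch below computes it in plain Python and also formats a query string for Solr's `knn` query parser. The vectors are made up, and using `knn` assumes film_vector is indexed as a dense vector field, which this README does not confirm.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_query(field: str, vector: list[float], top_k: int = 10) -> str:
    """Format a query string for Solr's knn query parser."""
    values = ", ".join(str(v) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


# Made-up 10-dimensional vectors, not taken from the data set.
film_a = [0.1, 0.3, -0.2, 0.5, 0.0, 0.4, -0.1, 0.2, 0.3, -0.4]
film_b = [0.2, 0.1, -0.3, 0.4, 0.1, 0.5, -0.2, 0.1, 0.2, -0.3]

similarity = cosine_similarity(film_a, film_b)
query = knn_query("film_vector", film_a, top_k=5)
```

Values near 1.0 mean the two vectors point in nearly the same direction. With a toy model, a high score is only a weak hint that two films are actually alike.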
