bryce-solr

Bryce f4f046263c commit

2026-05-15 22:19:14 -07:00

README.md

This directory contains the Python scripts that were used to create the film_vector field for the films dataset.

films.py: defines general-purpose functions to read, save, and process the films dataset.
create_model.py: creates an embedding model to represent the films.
create_dataset.py: uses the embedding model to calculate the film vectors and creates the new dataset with the extra film_vector field.

To replicate the example, run the create_model.py script first, followed by create_dataset.py. Each of these scripts is described below.

Setup

pip install sentence-transformers

Creating the Model (`create_model.py`)

There are several approaches to creating vectors (embeddings) that represent documents. For this example we chose a textual approach, where the text of the document is the input for calculating its vector.

To create the "sentence" that serves as the textual input for a movie, we take its title followed by its genres, separated by commas. For example, the "8 Mile" movie has this sentence:

8 Mile

Musical, Hip hop film, Drama, Musical Drama
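The sentence construction described above could be sketched as follows. This is only an illustration, not the code in films.py: the field names `name` and `genre`, the function name `film_sentence`, and the separator placed between the title and the genres are all assumptions.

```python
def film_sentence(film, separator="\n"):
    """Build the text input for one film: its title, then its genres joined by commas."""
    # "name" and "genre" are assumed field names; adjust them to the dataset's schema.
    genres = ", ".join(film["genre"])
    return f"{film['name']}{separator}{genres}"


film = {
    "name": "8 Mile",
    "genre": ["Musical", "Hip hop film", "Drama", "Musical Drama"],
}
print(film_sentence(film))
```

The separator is a parameter because the exact layout of the sentence matters only in that it must be the same when the model is created and when vectors are calculated later.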

We use a pretrained model from the SentenceTransformers framework (all-mpnet-base-v2) as the base for a new, reduced model tailored to this dataset. We calculate the 768-dimensional vectors for the sentences of all the movies in the dataset, then run PCA to keep the 10 most important dimensions. From the PCA result we create a new model that produces vectors of size 10. The number of dimensions is a compromise between performance and quality; we chose 10 here to keep the example small and compact. Generally, more dimensions give higher quality, but they also increase memory consumption and the computation time needed to manipulate the vectors.
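To make the PCA step concrete, here is a minimal sketch of reducing 768-dimensional vectors to 10 dimensions with NumPy. It is not the code in create_model.py: the function names `fit_pca` and `reduce_vectors` are hypothetical, and random numbers stand in for the real embeddings, which would presumably come from something like `SentenceTransformer("all-mpnet-base-v2").encode(sentences)`. How the script packages the projection into a new model is not shown here.

```python
import numpy as np


def fit_pca(embeddings, n_components=10):
    """Return (mean, components); components has shape (n_components, original_dim)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def reduce_vectors(embeddings, mean, components):
    """Project vectors onto the principal directions found by fit_pca."""
    return (embeddings - mean) @ components.T


# Stand-in data (an assumption): 200 random vectors instead of real film embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))

mean, components = fit_pca(embeddings, n_components=10)
reduced = reduce_vectors(embeddings, mean, components)
print(reduced.shape)  # (200, 10)
```

The same `mean` and `components` must be reused for every film indexed later, so that all vectors live in the same 10-dimensional space.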

This model is a small example meant to demonstrate the vector features of Solr, so it is just one of many possible ways to create vectors for documents. For example, it is possible to fine-tune a pretrained model using textual data from our context. Another possibility is to train a model that does not rely on text at all, but uses the co-occurrence of documents or items, like item2vec.

Calculating Vectors (`create_dataset.py`)

Once we have the model created and stored we can use it to calculate the vectors of the documents.

First we load the model (reading it from disk into RAM). Then we read the films dataset and create the sentences (as described in the previous section). Finally, we use the model to calculate the film vector for each sentence. Once the film_vector field has been added to the dataset, we export and store it in three formats (JSON, XML, and CSV).

So, if we have new movies to index in the collection, we only need to repeat the steps above: (1) load the model, (2) create the film sentence, (3) calculate the film vector from its sentence.
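The three steps above can be sketched as follows. This is an illustration under assumptions, not the code in create_dataset.py: `film_sentence`, `add_film_vectors`, and the field names `id`, `name`, and `genre` are hypothetical, and `fake_encode` is a placeholder that returns meaningless numbers so the sketch runs without the model. With the real model loaded, its `encode` method would be passed in instead.

```python
import json


def film_sentence(film):
    # Title followed by comma-separated genres ("name" and "genre" are assumed field names).
    return f"{film['name']}\n{', '.join(film['genre'])}"


def add_film_vectors(films, encode):
    """Return copies of the films with a film_vector field computed from each sentence."""
    sentences = [film_sentence(f) for f in films]
    vectors = encode(sentences)  # expected to return one vector per sentence
    return [
        {**film, "film_vector": [float(x) for x in vector]}
        for film, vector in zip(films, vectors)
    ]


def fake_encode(sentences, dim=10):
    # Placeholder only: deterministic numbers derived from the text length, NOT embeddings.
    return [[((len(s) + i) % 7) / 7.0 for i in range(dim)] for s in sentences]


films = [{"id": "film-1", "name": "8 Mile", "genre": ["Musical", "Drama"]}]
enriched = add_film_vectors(films, fake_encode)
print(json.dumps(enriched, indent=2))
```

Only the JSON export is shown; writing the XML and CSV versions would follow the same pattern once the film_vector field is present on each film.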
