commit

2026-05-15 22:19:14 -07:00
commit f4f046263c
2058 changed files with 236159 additions and 0 deletions
--- a/solr/example/films/README.md
+++ b/solr/example/films/README.md
@@ -0,0 +1,18 @@
+We have a movie data set in JSON, Solr XML, and CSV formats.  All 3 formats contain the same data.  You can use any one format to index documents to Solr.
+
+This example uses the `_default` configset that ships with Solr plus some custom fields added via Schema API.  It demonstrates the use of ParamSets in conjunction with the [Request Parameters API](https://solr.apache.org/guide/solr/latest/configuration-guide/request-parameters-api.html).
+
+The original data was fetched from Freebase and the data license is present in the films-LICENSE.txt file.  Freebase was shut down by Google in 2016.
+
+This data consists of the following fields:
+ * `id` - unique identifier for the movie
+ * `name` - Name of the movie
+ * `directed_by` - The person(s) who directed the making of the film
+ * `initial_release_date` - The earliest official initial film screening date in any country
+ * `genre` - The genre(s) that the movie belongs to
+ * `film_vector` - The 10-dimensional vector representing the film, according to a toy example embedding model
+
+ The `name` and `initial_release_date` fields are created via the Schema API, and the `genre` and `directed_by` fields
+ are created by an Update Request Processor Chain called `add-unknown-fields-to-the-schema`.
+
+ The `film_vector` is a 10-dimensional embedding vector created to represent the movie. The vector is created from a BERT pre-trained model, followed by a dimensionality reduction technique that reduces the embeddings from 768 to 10 dimensions. Although similar movies are expected to be close to each other, this model is just a "toy example", so it is not guaranteed to be a good representation of the movies. The Python scripts used to create the model and calculate the film vectors are in the [vectors directory](./vectors).
--- a/solr/example/films/films-LICENSE.txt
+++ b/solr/example/films/films-LICENSE.txt
@@ -0,0 +1,3 @@
+The films data (films.json/.xml/.csv) is licensed under the Creative Commons Attribution 2.5 Generic License.
+To view a copy of this license, visit http://creativecommons.org/licenses/by/2.5/
+or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.
--- a/solr/example/films/films.csv
+++ b/solr/example/films/films.csv
--- a/solr/example/films/films.json
+++ b/solr/example/films/films.json
--- a/solr/example/films/films.xml
+++ b/solr/example/films/films.xml
--- a/solr/example/films/vectors/README.md
+++ b/solr/example/films/vectors/README.md
@@ -0,0 +1,53 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+This directory contains the Python scripts that were used to create the `film_vector` field for the films dataset.
+
+ - [films.py](./films.py): defines some general-purpose functions to read, save and process the films dataset.
+ - [create_model.py](./create_model.py): creates an embedding model to represent the films.
+ - [create_dataset.py](./create_dataset.py): uses the embedding model to calculate the vectors of the films and create the new dataset with the extra `film_vector` field.
+
+To replicate the example you have to run the `create_model.py` script first, followed by `create_dataset.py`. We will describe and discuss each of these scripts below.
+
+## Setup
+
+```
+pip install sentence-transformers lxml
+```
+
+## Creating the Model (`create_model.py`)
+
+There are several approaches that one could use to create vectors (embeddings) to represent documents. For this example we use a _textual_ approach, where the text of the document is the input for calculating its vector.
+
+To create the "sentence" that serves as the textual input for a movie, we take its title followed by a blank line and its genres separated by commas. For example, the "8 Mile" movie will have this sentence:
+```
+8 Mile
+
+Musical, Hip hop film, Drama, Musical Drama
+```
+
+We use a pretrained model from the [SentenceTransformers](https://www.sbert.net/) framework (`all-mpnet-base-v2`) as the base for a new, reduced model tailored to this dataset. We calculate the 768-dimensional vectors for the sentences of all the movies in the dataset, then run a PCA to extract the 10 most important dimensions. With the PCA result we create a new model that produces vectors of size 10. The number of dimensions is a compromise between quality and cost, and we chose 10 here just to keep the example small and compact. Generally, more dimensions give higher quality but also increase memory consumption and the computational time needed to manipulate the vectors.
+
+This model is created to serve as a small example that demonstrates the vector features of Solr, so it is just one among many possible ways to create vectors for documents. For example, it is possible to _fine-tune_ a pre-trained model using textual data from our context. Another possibility is to train a model that does not rely on text at all, but uses co-occurrence of documents or items, like item2vec.
+
+## Calculating Vectors (`create_dataset.py`)
+
+Once we have the model created and stored we can use it to calculate the vectors of the documents.
+
+First we load the model (reading it from disk into RAM). Then we read the films dataset and create the sentences (as described in the previous section). Finally, we use the model to calculate the film vector for each sentence. After adding the `film_vector` field to the dataset, we export and store it in the 3 formats (JSON, XML and CSV).
+
+So, if we have new movies to index in the collection, we just have to repeat the steps above: (1) load the model, (2) create the film sentence, (3) calculate the film vector from its sentence.
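Note on the three steps above: a minimal sketch of step (2), building the sentence for a new movie. The function mirrors `get_film_sentence` from `films.py`; the `new_film` dictionary is introduced only for illustration and reuses the "8 Mile" title and genres quoted in the README. Steps (1) and (3) need the saved model, so they are shown only as comments.

```python
def get_film_sentence(film):
    # Title, a blank line, then the genres joined by ", " (as in films.py).
    return f"{film['name']}\n\n{', '.join(film['genre'])}"


# Hypothetical new document; only the fields used by the sentence are shown.
new_film = {
    "name": "8 Mile",
    "genre": ["Musical", "Hip hop film", "Drama", "Musical Drama"],
}

sentence = get_film_sentence(new_film)
print(sentence)

# With sentence-transformers installed and the reduced model already saved by
# create_model.py, steps (1) and (3) would look roughly like this:
#   model = SentenceTransformer(films.PATH_FILMS_MODEL)
#   film_vector = model.encode(sentence)
```

The printed text matches the "8 Mile" sentence shown in the README, including the blank line between title and genres.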
--- a/solr/example/films/vectors/create_dataset.py
+++ b/solr/example/films/vectors/create_dataset.py
@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script will use the reduced model created by the `create_model` 
+# script to add a new field in the films dataset, which will store the 
+# film vector according to the embedding model.
+
+import json
+
+from sentence_transformers import SentenceTransformer, util
+import torch
+
+import films
+
+#### Load the 10-dimensional model
+model = SentenceTransformer(films.PATH_FILMS_MODEL)
+
+#### Load the original films dataset
+films_dataset = films.load_films_dataset()
+
+#### Use the embedding model to calculate vectors for all movies
+films_vectors = films.calculate_films_vectors(model, films_dataset)
+
+#### Visual evaluation of some specific movies
+
+def most_similar_movie(target_idx, top_k=5):
+    film = films_dataset[target_idx]
+    film_vector = films_vectors[target_idx]
+    
+    cos_scores = util.cos_sim(film_vector, films_vectors)[0]
+    top_results = torch.topk(cos_scores, k=top_k)
+    
+    print("\n======================\n")
+    print("Film:", films.get_film_sentence(film).replace("\n", " - "))
+    print(f"\nTop {top_k} most similar films in corpus:")
+
+    for score, idx in zip(top_results[0], top_results[1]):
+        movie_str = films.get_film_sentence(films_dataset[idx]).replace("\n", " - ")
+        print(f"  - [{idx}] {movie_str} (Score: {score:.4f})")
+        
+most_similar_movie(200)
+most_similar_movie(100)
+most_similar_movie(500)
+most_similar_movie(911)
+
+
+#### Create the new films dataset by creating a new field with the embedding vector
+for idx in range(len(films_dataset)):
+    films_dataset[idx]["film_vector"] = list(films_vectors[idx].astype("float64"))
+
+#### Export the new films dataset for all formats
+films.export_films_json(films_dataset)
+films.export_films_xml(films_dataset)
+films.export_films_csv(films_dataset)
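Note on `most_similar_movie`: the script ranks films by cosine similarity using `util.cos_sim` and `torch.topk`. The sketch below shows the same idea in plain Python so the ranking logic is easy to follow. The helper names and the 3-dimensional `toy_vectors` are assumptions made for this illustration, not part of the commit, and the sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(target_idx, vectors, top_k=5):
    # Score every vector against the target, then keep the best top_k.
    scored = [
        (cosine_similarity(vectors[target_idx], vector), idx)
        for idx, vector in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy 3-dimensional vectors standing in for the 10-dimensional film vectors.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]

ranking = top_k_similar(0, toy_vectors, top_k=3)
for idx, score in ranking:
    print(f"[{idx}] score={score:.4f}")
```

As in the script, the target is compared against the whole corpus, so it appears first in its own ranking with a score of 1.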
--- a/solr/example/films/vectors/create_model.py
+++ b/solr/example/films/vectors/create_model.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# In this example, we reduce the dimensionality of the embeddings of
+# the SBERT pre-trained model 'all-mpnet-base-v2' from 768 to 10 dimensions. 
+#
+# The code is derived from the SBERT documentation and corresponding example code:
+#  - https://www.sbert.net/examples/training/distillation/README.html
+#  - https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/distillation/dimensionality_reduction.py
+
+from sklearn.decomposition import PCA
+from sentence_transformers import SentenceTransformer, LoggingHandler, util, evaluation, models, InputExample
+import logging
+import os
+import pathlib
+import gzip
+import csv
+import random
+import numpy as np
+import torch
+
+import films
+
+#### Just some code to print debug information to stdout
+logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()])
+logger = logging.getLogger(__name__)
+
+#### Create folders structure
+pathlib.Path("./data/").mkdir(parents=True, exist_ok=True)
+pathlib.Path("./models/").mkdir(parents=True, exist_ok=True)
+
+
+######## Load full model ########
+
+# Model for which we apply dimensionality reduction
+model = SentenceTransformer("all-mpnet-base-v2")
+
+# New size for the embeddings
+new_dimension = 10
+
+
+######## Evaluate performance of full model ########
+
+# We use the STS benchmark dataset to see how much performance we lose by using the dimensionality reduction
+sts_dataset_path = "./data/stsbenchmark.tsv.gz"
+if not os.path.exists(sts_dataset_path):
+    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)
+
+# We measure the performance of the original model
+# and later we will measure the performance with the reduced dimension size
+logger.info("Read STSbenchmark test dataset")
+eval_examples = []
+with gzip.open(sts_dataset_path, "rt", encoding="utf8") as fIn:
+    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
+    for row in reader:
+        if row["split"] == "test":
+            score = float(row["score"]) / 5.0 #Normalize score to range 0 ... 1
+            eval_examples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=score))
+
+# Evaluate the original model on the STS benchmark dataset
+stsb_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(eval_examples, name="sts-benchmark-test")
+
+logger.info("Original model performance:")
+stsb_evaluator(model)
+
+
+######## Reduce the embedding dimensions ########
+
+# We load the films dataset and create a list of unique sentences from the movie titles and genres
+films_dataset = films.load_films_dataset()
+films_sentences = list(set(films.get_films_sentences(films_dataset)))
+random.shuffle(films_sentences)
+
+# To determine the PCA matrix, we need some example sentence embeddings.
+# Here, we compute the embeddings for all the movies in the films dataset. 
+pca_train_sentences = films_sentences
+train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True)
+
+# Compute PCA on the train embeddings matrix
+pca = PCA(n_components=new_dimension)
+pca.fit(train_embeddings)
+pca_comp = np.asarray(pca.components_)
+
+# We add a dense layer to the model, so that it directly produces embeddings with the new size
+dense = models.Dense(in_features=model.get_sentence_embedding_dimension(), out_features=new_dimension, bias=False, activation_function=torch.nn.Identity())
+dense.linear.weight = torch.nn.Parameter(torch.tensor(pca_comp))
+model.add_module("dense", dense)
+
+
+######## Evaluate the model with the reduced embedding size
+logger.info("Model with {} dimensions:".format(new_dimension))
+stsb_evaluator(model)
+
+
+######## Store the reduced model on disk
+model.save(films.PATH_FILMS_MODEL)
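Note on the dense layer: it is created with `bias=False` and an identity activation, and its weight is set to `pca.components_`, so it multiplies each embedding by the PCA component matrix. The sketch below shows that matrix-vector product in plain Python. The `project` helper and the tiny 2 x 4 matrix are assumptions for illustration, standing in for the 10 x 768 matrix used in the script.

```python
def project(components, embedding):
    # One output value per row of `components`: its dot product with the input.
    return [
        sum(weight * value for weight, value in zip(row, embedding))
        for row in components
    ]


# Hypothetical 2 x 4 matrix standing in for the 10 x 768 `pca.components_`.
toy_components = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
]

# Hypothetical 4-dimensional embedding standing in for a 768-dimensional one.
toy_embedding = [2.0, 4.0, 6.0, 8.0]

reduced = project(toy_components, toy_embedding)
print(reduced)  # [2.0, 5.0]
```

Because the layer has no bias term, this is a plain projection. No mean is subtracted first, unlike scikit-learn's `PCA.transform`, which centers the data before projecting.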
--- a/solr/example/films/vectors/films.py
+++ b/solr/example/films/vectors/films.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import csv
+from lxml import etree
+from sentence_transformers import SentenceTransformer
+
+PATH_FILMS_DATASET      = "../films.json"
+PATH_FILMS_MODEL        = "./models/films-model-size_10"
+PATH_FILMS_VECTORS_JSON = "./data/films-vectors.json"
+PATH_FILMS_VECTORS_XML  = "./data/films-vectors.xml"
+PATH_FILMS_VECTORS_CSV  = "./data/films-vectors.csv"
+
+def load_films_dataset():
+    with open(PATH_FILMS_DATASET, "r") as infile:
+        films_dataset = json.load(infile)
+    return films_dataset
+
+def get_film_sentence(film):
+    return f"{film['name']}\n\n{', '.join(film['genre'])}"
+
+def get_films_sentences(films_dataset):
+    return [get_film_sentence(film) for film in films_dataset]
+
+def load_films_embedding_model():
+    return SentenceTransformer(PATH_FILMS_MODEL)
+
+def calculate_film_vector(model, film):
+    film_sentence = get_film_sentence(film)
+    return model.encode(film_sentence)
+
+def calculate_films_vectors(model, films_dataset):
+    films_sentences = get_films_sentences(films_dataset)
+    return model.encode(films_sentences)
+
+def export_films_json(films_dataset):
+    with open(PATH_FILMS_VECTORS_JSON, "w") as outfile:
+        json.dump(films_dataset, outfile, indent=2)
+
+
+def export_films_xml(films_dataset):
+
+    films_xml = etree.Element("add")
+    for film in films_dataset:
+
+        film_xml = etree.Element("doc")
+
+        for field_name, field_value in film.items():
+
+            field_value = film[field_name]
+            if not isinstance(field_value, list):
+                field_value = [field_value]
+            
+            for value in field_value:
+                child = etree.Element("field", attrib={"name": field_name})
+                child.text = str(value)
+                film_xml.append(child)
+
+        films_xml.append(film_xml)
+
+    etree.ElementTree(films_xml).write(
+        PATH_FILMS_VECTORS_XML,
+        pretty_print=True,
+        xml_declaration=True,
+        encoding="utf-8"
+    )
+
+
+def export_films_csv(films_dataset):
+    with open(PATH_FILMS_VECTORS_CSV, "w", newline="") as outfile:
+        csvw = csv.DictWriter(outfile, ["name","directed_by","genre","type","id","initial_release_date","film_vector"])
+        csvw.writeheader()
+        for film in films_dataset:
+            row = dict(film)  # copy, so the caller's dataset is not modified
+            for key in ("directed_by", "genre", "film_vector"):
+                row[key] = "|".join(map(str, film[key]))
+            csvw.writerow(row)
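Note on `export_films_csv`: multi-valued fields are flattened into a single CSV cell by joining their values with `|`. The sketch below shows that encoding on one row, writing to an in-memory buffer instead of a file. The `film_to_csv_row` helper, the reduced column list, and every value in `sample_film` are made up for this illustration.

```python
import csv
import io

MULTI_VALUED = ("directed_by", "genre", "film_vector")


def film_to_csv_row(film):
    # Work on a copy so the original document keeps its lists.
    row = dict(film)
    for key in MULTI_VALUED:
        row[key] = "|".join(map(str, film[key]))
    return row


# Hypothetical document; the values are invented for the example.
sample_film = {
    "name": "Example Film",
    "directed_by": ["Director One", "Director Two"],
    "genre": ["Drama", "Comedy"],
    "film_vector": [0.1, -0.2, 0.3],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, ["name", "directed_by", "genre", "film_vector"])
writer.writeheader()
writer.writerow(film_to_csv_row(sample_film))

csv_text = buffer.getvalue()
print(csv_text)
```

Anything that reads this CSV has to split those cells on `|` to recover the individual values.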