[Bug]: Error if ChromaBm25EmbeddingFunction is used concurrently #5969

@kylediaz

Description

What happened?

I used ChromaBm25EmbeddingFunction as my sparse embedding function and tried upserting my docs to Chroma like so:

import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

batch_size = 300
total_batches = (len(arxiv_ds) + batch_size - 1) // batch_size
batches = list(itertools.batched(arxiv_ds, batch_size))

with tqdm(total=total_batches, desc="Processing batches") as pbar:
    for batch_group in itertools.batched(batches, 5):
        with ThreadPoolExecutor(max_workers=5) as executor:
            futures = [executor.submit(upsert_batch, batch) for batch in batch_group]
            for future in as_completed(futures):
                future.result()
                pbar.update(1)

This raises an error (see log output below). The same EF works fine when I upsert serially. The cause is likely that snowballstemmer is not thread-safe: the stemmer keeps the word being processed as instance state (`self.current` in the traceback), so concurrent calls on a shared instance corrupt each other.
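One caller-side workaround sketch, assuming the diagnosis above is right: give each worker thread its own stemmer via threading.local, so no instance is ever shared. To keep the example dependency-free, a hypothetical ToyStemmer stands in for snowballstemmer; it mimics the real library's pattern of stashing per-call state on the instance. In real code you would create snowballstemmer.stemmer("english") inside get_stemmer() instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyStemmer:
    """Stand-in for snowballstemmer: keeps the current word as instance
    state, which is exactly what makes sharing one instance across
    threads unsafe."""
    def stemWord(self, word):
        self.current = word               # mutable instance state, like basestemmer
        return self.current.rstrip("s")   # toy "stemming" rule

_local = threading.local()

def get_stemmer():
    # Lazily create one stemmer per thread; no instance is ever shared.
    if not hasattr(_local, "stemmer"):
        _local.stemmer = ToyStemmer()
    return _local.stemmer

def stem_text(text):
    stemmer = get_stemmer()
    return " ".join(stemmer.stemWord(w) for w in text.split())

with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(stem_text, ["cats dogs", "waves bursts"] * 100))
print(results[0])  # "cat dog"
```

This keeps the upsert threads fully parallel at the cost of one stemmer per thread, which is cheap relative to network-bound upserts.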

"""
Demonstration of snowballstemmer thread-safety bug.
This script shows that snowballstemmer works fine serially but fails when used concurrently.
"""

import snowballstemmer
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

# Sample text to stem (similar to what we'd see in arxiv papers)
# Using longer, more complex text to increase the chance of race conditions
sample_texts = [
    """The gravitational wave background from massive black hole binaries emit bursts of 
    gravitational waves at periapse. Such events may be directly resolvable in the Galactic 
    centre. However, if the star does not spiral in, the emitted GWs are not resolvable for 
    extra-galactic MBHs, but constitute a source of background noise. We estimate the power 
    spectrum of this extreme mass ratio burst background.""",
    
    """Dynamics of planets in exoplanetary systems with multiple stars showing how the 
    gravitational interactions between the stars and planets affect the orbital stability 
    and long-term evolution of the planetary system architectures.""",
    
    """Diurnal Thermal Tides in a Non-rotating atmosphere with realistic heating profiles 
    and temperature gradients that demonstrate the complex interplay between radiation 
    and atmospheric dynamics in planetary atmospheres.""",
    
    """Intermittent turbulence, noise and waves in stellar atmospheres create complex 
    patterns of energy transport and momentum deposition that influence the structure 
    and evolution of stellar interiors and surfaces.""",
    
    """Superconductivity in quantum materials and condensed matter physics systems 
    exhibiting novel quantum phenomena including topological phases, strongly correlated 
    electron systems, and exotic superconducting pairing mechanisms.""",
] * 50  # 250 texts total

# Create a single stemmer instance (like BM25 does)
stemmer = snowballstemmer.stemmer('english')

def stem_text(text):
    """Stem all words in the text using the shared stemmer."""
    words = text.lower().replace('.', ' ').replace(',', ' ').split()
    stemmed = []
    for word in words:
        if word.strip():
            try:
                stemmed.append(stemmer.stemWord(word))
            except Exception as e:
                print(f"Error stemming '{word}': {e}")
                raise
    return " ".join(stemmed)

print("="*80)
print("PART 1: Running SERIALLY (should work fine)")
print("="*80)

try:
    start = time.time()
    results_serial = []
    for i, text in enumerate(sample_texts):
        result = stem_text(text)
        results_serial.append(result)
        if i % 50 == 0:
            print(f"Processed {i}/{len(sample_texts)} texts serially...")
    
    elapsed = time.time() - start
    print(f"✓ Serial processing completed successfully in {elapsed:.2f}s")
    print(f"  Processed {len(results_serial)} texts")
except Exception as e:
    print(f"✗ Serial processing failed: {e}")

print("\n" + "="*80)
print("PART 2: Running CONCURRENTLY with 10 workers (should fail)")
print("="*80)

try:
    start = time.time()
    results_concurrent = []
    
    # Use more workers and process all at once to maximize contention
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(stem_text, text) for text in sample_texts]
        
        completed = 0
        for future in as_completed(futures):
            result = future.result()  # This will raise the exception
            results_concurrent.append(result)
            completed += 1
            if completed % 50 == 0:
                print(f"Processed {completed}/{len(sample_texts)} texts concurrently...")
    
    elapsed = time.time() - start
    print(f"✓ Concurrent processing completed successfully in {elapsed:.2f}s")
    print(f"  Processed {len(results_concurrent)} texts")
    print("\n  NOTE: Bug didn't trigger this time (race conditions are probabilistic)")
except Exception as e:
    print(f"✗ Concurrent processing failed: {type(e).__name__}: {e}")
    print("\n  *** THIS IS THE BUG ***")
    print("  snowballstemmer is NOT thread-safe!")
    print("  Multiple threads corrupted the shared stemmer's internal state.")
================================================================================
PART 1: Running SERIALLY (should work fine)
================================================================================
Processed 0/250 texts serially...
Processed 50/250 texts serially...
Processed 100/250 texts serially...
Processed 150/250 texts serially...
Processed 200/250 texts serially...
✓ Serial processing completed successfully in 0.10s
  Processed 250 texts

================================================================================
PART 2: Running CONCURRENTLY with 10 workers (should fail)
================================================================================
Processed 50/250 texts concurrently...
Processed 100/250 texts concurrently...
Error stemming 'interactions': string index out of range
✗ Concurrent processing failed: IndexError: string index out of range

  *** THIS IS THE BUG ***
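A coarser alternative workaround, sketched below: serialize access to the single shared stemmer with a lock. Only the stemming step loses concurrency; the network I/O in the parallel upserts still overlaps. The LockedStemmer wrapper and ToyStemmer stand-in are illustrative, not Chroma or snowballstemmer API; swap ToyStemmer() for snowballstemmer.stemmer("english") in real use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class LockedStemmer:
    """Serialize access to a stemmer-like object that keeps per-call
    state on the instance (as snowballstemmer's basestemmer does)."""
    def __init__(self, stemmer):
        self._stemmer = stemmer
        self._lock = threading.Lock()

    def stemWord(self, word):
        with self._lock:  # one caller at a time touches the shared state
            return self._stemmer.stemWord(word)

class ToyStemmer:
    """Dependency-free stand-in for snowballstemmer."""
    def stemWord(self, word):
        self.current = word               # mutable instance state
        return self.current.rstrip("s")

safe = LockedStemmer(ToyStemmer())
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(safe.stemWord, ["bursts"] * 200))
print(results[0])  # "burst"
```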

Versions

Latest Chroma version on main (e481836)

Python 3.12.11
snowballstemmer==3.0.1

Relevant log output

Processing batches:   0%|                                                                                                                     | 0/8499 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/Users/kylediaz/Repos/chroma-core/funcs/ingest.py", line 75, in <module>
    future.result()
  File "/Users/kylediaz/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/kylediaz/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/funcs/ingest.py", line 59, in upsert_batch
    collection.upsert(
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/models/Collection.py", line 449, in upsert
    upsert_request = self._validate_and_prepare_upsert_request(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/models/CollectionCommon.py", line 103, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/models/CollectionCommon.py", line 452, in _validate_and_prepare_upsert_request
    upsert_metadatas = self._apply_sparse_embeddings_to_metadatas(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/models/CollectionCommon.py", line 651, in _apply_sparse_embeddings_to_metadatas
    sparse_embeddings = self._sparse_embed(
                        ^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/models/CollectionCommon.py", line 797, in _sparse_embed
    return sparse_embedding_function(input=input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/api/types.py", line 1420, in __call__
    result = call(self, input)
             ^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py", line 120, in __call__
    sparse_vectors.append(self._encode(document))
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py", line 83, in _encode
    tokens = self._tokenizer.tokenize(text)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/utils/embedding_functions/schemas/bm25_tokenizer.py", line 256, in tokenize
    stemmed = self._stemmer.stem(token).strip()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/chroma/chromadb/utils/embedding_functions/schemas/bm25_tokenizer.py", line 213, in stem
    return cast(str, self._stemmer.stemWord(token))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/funcs/.venv/lib/python3.12/site-packages/snowballstemmer/basestemmer.py", line 285, in stemWord
    self._stem()
  File "/Users/kylediaz/Repos/chroma-core/funcs/.venv/lib/python3.12/site-packages/snowballstemmer/english_stemmer.py", line 577, in _stem
    self.__r_Step_2()
  File "/Users/kylediaz/Repos/chroma-core/funcs/.venv/lib/python3.12/site-packages/snowballstemmer/english_stemmer.py", line 322, in __r_Step_2
    among_var = self.find_among_b(EnglishStemmer.a_7)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylediaz/Repos/chroma-core/funcs/.venv/lib/python3.12/site-packages/snowballstemmer/basestemmer.py", line 187, in find_among_b
    diff = ord(self.current[c - 1 - common]) - ord(w.s[i2])
               ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
IndexError: string index out of range

Labels

bug