Formalize the multithreaded indexing method #10

It would be helpful to have a consistent indexing method within this package. 

Currently, the benchmarking suite has a multithreaded indexer, but it's specific to LoTTE. We should abstract it so that it accepts arbitrary input files.

See: https://github.com/DeployQL/LintDB/blob/main/benchmarks/lotte/multiprocess_indexing.py

That code also doesn't make it clear how many centroids to use. ColBERT recommends basing the count on the square root of the total number of embeddings that will be stored.
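
As a rough illustration of the square-root rule, here is a minimal sketch. The function name `estimate_num_centroids` is hypothetical and not part of LintDB or ColBERT, and it uses the plain square root described above; ColBERT's own code may scale or round the value differently.

```python
import math


def estimate_num_centroids(total_embeddings: int) -> int:
    """Square-root heuristic: about sqrt(N) centroids for N stored embeddings.

    Hypothetical helper for illustration only.
    """
    if total_embeddings <= 0:
        raise ValueError("total_embeddings must be positive")
    # math.isqrt returns the floor of the exact integer square root.
    return max(1, math.isqrt(total_embeddings))
```

For example, a dataset that will store 1,000,000 token embeddings would get 1,000 centroids under this rule.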

What I'd like the indexing code to be able to do:
1. Determine the total number of tokens in the dataset.
2. Calculate the number of centroids.
3. Train the centroids (k-means clustering).
4. Index the data.
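
The four steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: `Document`, `count_tokens`, `run_indexing`, and the `train` / `index_shard` callbacks are all hypothetical names, not LintDB APIs. It assumes one embedding per token, that the dataset is already split into shards, and that the caller supplies the actual training and indexing functions.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence

# Hypothetical type: a document is a list of tokens (one embedding per token).
Document = List[str]


def count_tokens(docs: Iterable[Document]) -> int:
    """Count the tokens in one shard of documents."""
    return sum(len(doc) for doc in docs)


def run_indexing(
    shards: Sequence[Sequence[Document]],
    train: Callable[[int], None],
    index_shard: Callable[[Sequence[Document]], None],
    max_workers: int = 4,
) -> int:
    """Run the four steps and return the number of centroids that was used.

    `train` and `index_shard` are caller-supplied placeholders for the real
    training and indexing calls; they are assumptions, not library functions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Step 1: total number of tokens, counted per shard in parallel.
        total_tokens = sum(pool.map(count_tokens, shards))
        # Step 2: square-root rule for the number of centroids.
        num_centroids = max(1, math.isqrt(total_tokens))
        # Step 3: train once, before any shard is indexed.
        train(num_centroids)
        # Step 4: index shards in parallel; list() surfaces worker exceptions.
        list(pool.map(index_shard, shards))
    return num_centroids
```

Passing the training and indexing steps in as callbacks keeps the sketch independent of any particular dataset, which is the point of moving this logic out of the LoTTE benchmark. A real implementation would likely read shards from files instead of holding them in memory.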

We should create a library that enables this functionality instead of hiding it in the benchmark code.

