WIP✨(backend) refactor indexation pipeline #26

joehybird · 2025-11-24T13:59:43Z

Add support for deferred loading, preprocessing & embedding of documents.

TODO

Throttle indexation tasks or create commands + cron
Parser for PDF files with Albert
Download directly from storage APIs (S3)
Use base64 + "encoding=base64" argument for small binary files sent directly through index/ endpoint
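A minimal sketch of the base64 item above, assuming the index/ endpoint accepts a JSON body. The field names (`title`, `mimetype`, `content`) and both helper functions are hypothetical; only the `encoding=base64` argument comes from the TODO.

```python
import base64
import json


def build_index_payload(title: str, raw: bytes, mimetype: str) -> dict:
    """Build a JSON-serializable payload carrying a small binary file inline.

    Hypothetical payload shape: only the "encoding" argument is named in the TODO.
    """
    return {
        "title": title,
        "mimetype": mimetype,
        "encoding": "base64",
        # base64 output is pure ASCII, so it survives JSON transport unchanged.
        "content": base64.b64encode(raw).decode("ascii"),
    }


def read_payload_content(payload: dict) -> bytes:
    """Server-side counterpart: return the original bytes of the content."""
    if payload.get("encoding") == "base64":
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(payload["content"], validate=True)
    # Without the encoding argument, the content is treated as plain text.
    return payload["content"].encode("utf-8")


payload = build_index_payload("report", b"%PDF-1.4\x00\xff", "application/pdf")
body = json.dumps(payload)  # safe to send as a JSON request body
```

Base64 inflates the content by about a third, which is why the TODO limits this path to small files; larger ones would go through the deferred download instead.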

Wishlist

Add async support for download (use asyncio loop)
Parser for all formats with a dedicated service (docling, unstructured, ...)
Use hash to prevent indexing the same content multiple times
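The hash item above could look roughly like this. It is a sketch under assumptions: SHA-256 as the digest and an in-memory set as the store are illustrative choices, and `content_digest` and `SeenContent` are hypothetical names that do not exist in the codebase.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return a stable hex digest identifying the document content."""
    return hashlib.sha256(content).hexdigest()


class SeenContent:
    """Remember digests of indexed content (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._digests: set[str] = set()

    def should_index(self, content: bytes) -> bool:
        """Return True the first time a given content is seen, False afterwards."""
        digest = content_digest(content)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


seen = SeenContent()
first = seen.should_index(b"same bytes")   # new content, index it
second = seen.should_index(b"same bytes")  # identical content, skip it
```

In a real pipeline the digest would more likely be stored with the indexed document, so the check survives restarts and works across workers.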

docs/env.md

src/backend/core/schemas.py

src/backend/core/factories.py

mascarpon3 · 2025-11-25T09:31:31Z

src/backend/core/factories.py

+    size = factory.LazyFunction(lambda: fake.random_int(min=0, max=1024**2))
+    users = factory.LazyFunction(lambda: [str(uuid4()) for _ in range(3)])
+    groups = factory.LazyFunction(lambda: [slugify(fake.word()) for _ in range(3)])
+    reach = factory.Iterator(list(enums.ReachEnum))


Shouldn't we have the embedding here?

mascarpon3 · 2025-11-25T09:34:35Z

src/backend/core/models.py

+# pylint: disable=too-many-instance-attributes
+@dataclass
+class IndexDocument:
+    """Represents the _source data of opensearch entry"""


I do not think it represents the _source.

In this example you can see _source:

assert fox_response["_source"] == {
    "depth": 1,
    "numchild": 0,
    "path": fox_document["path"],
    "size": fox_document["size"],
    "created_at": fox_document["created_at"].isoformat(),
    "updated_at": fox_document["updated_at"].isoformat(),
    "reach": fox_document["reach"],
    "title": fox_document["title"],
}

I think this represents the mapping. _source contains the keywords of the mapping.
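For context on the distinction being discussed: in OpenSearch the mapping declares field names and types for an index, while `_source` is the JSON body stored with each document and returned in hits. The sketch below uses plain dictionaries; the index name, field selection, and values are made up for illustration and are not taken from this project.

```python
# Hypothetical mapping: declares the fields of the index and their types.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "reach": {"type": "keyword"},
        "size": {"type": "long"},
    }
}

# Hypothetical search hit: "_source" holds the document body as it was indexed.
hit = {
    "_index": "demo",
    "_id": "1",
    "_source": {"title": "The quick fox", "reach": "public", "size": 42},
}

# Fields present in the stored document but not declared in the mapping.
undeclared = set(hit["_source"]) - set(mapping["properties"])
```

Under this reading, a dataclass listing every indexable field sits closer to the mapping, and a given `_source` may carry only a subset of those fields.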

mascarpon3 · 2025-11-25T09:37:42Z

src/backend/core/models.py

+    @property
+    def is_loaded(self):
+        """Returns true if in loaded status"""
+        return self.content_status == enums.ContentStatusEnum.LOADED


Isn't "indexing_status" more descriptive than "content_status"?

Hmm... it covers the loading and the preprocessing of the content, but also the embedding, so you may be right.
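To make the naming question concrete, here is a rough sketch of a status that spans loading, preprocessing, and embedding. Only `ContentStatusEnum.LOADED`, `IndexDocument`, `content_status`, and `is_loaded` appear in the diff; the other members and their string values are assumptions, not the PR's actual enum.

```python
from dataclasses import dataclass
from enum import Enum


class ContentStatusEnum(str, Enum):
    """Stages of the pipeline; every member except LOADED is hypothetical."""

    PENDING = "pending"            # hypothetical: content not downloaded yet
    LOADED = "loaded"              # from the diff: content has been loaded
    PREPROCESSED = "preprocessed"  # hypothetical: processors have run
    EMBEDDED = "embedded"          # hypothetical: embedding has been computed


@dataclass
class IndexDocument:
    """Reduced stand-in for the dataclass in the diff (one field only)."""

    content_status: ContentStatusEnum = ContentStatusEnum.PENDING

    @property
    def is_loaded(self) -> bool:
        """Return True only when the status is exactly LOADED."""
        return self.content_status == ContentStatusEnum.LOADED


doc = IndexDocument(content_status=ContentStatusEnum.LOADED)
```

If the status does track stages beyond the content itself, as in this sketch, that supports the point that "indexing_status" may describe it better.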


joehybird requested review from lunika, mascarpon3 and qbey November 24, 2025 13:59

joehybird changed the title ~~✨(backend) refactor indexation pipeline~~ WIP✨(backend) refactor indexation pipeline Nov 24, 2025

joehybird force-pushed the index-pipeline branch from 1c7e5ae to d65ecd4 Compare November 24, 2025 14:47

qbey reviewed Nov 24, 2025

docs/env.md (outdated, resolved)

src/backend/core/schemas.py (resolved)

mascarpon3 reviewed Nov 25, 2025

src/backend/core/factories.py (resolved)

mascarpon3 reviewed Nov 25, 2025

src/backend/core/schemas.py (outdated, resolved)

joehybird force-pushed the index-pipeline branch 5 times, most recently from 55ccc2b to 06cb52f Compare November 26, 2025 17:14

joehybird added 3 commits November 27, 2025 06:49

✨(backend) refactor indexation pipeline

7e5fc7c

Add support for deferred loading, preprocessing & embedding of documents.
Add mimetype, language, content_status & content_uri fields in document schema.

Signed-off-by: Fabre Florian <[email protected]>

✨(backend) albert AI client & pdf conversion

56f0fb8

Add AlbertAI client to wrap embedding & conversion API calls.
Implement working pdf to markdown converter using Albert.

Signed-off-by: Fabre Florian <[email protected]>

✨(backend) fix service index prefixed name in create_demo

55d9779

Use service.index_name instead of service.name in create_demo command.

Signed-off-by: Fabre Florian <[email protected]>

joehybird force-pushed the index-pipeline branch 2 times, most recently from 5c65736 to 3e822c5 Compare November 27, 2025 14:00

✨(backend) add preprocess support for indexer

a2d3b08

New processors mechanism in IndexerTaskService: after the loading & conversion steps, a list of functions can be chained to transform the document content (like Django middleware).

Signed-off-by: Fabre Florian <[email protected]>
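The chaining described in this commit could be sketched as follows. This is an illustration under assumptions, not the code in `IndexerTaskService`: the `Processor` signature (text in, text out), `run_processors`, and both example processors are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

# Assumed signature: a processor takes the document content and returns new content.
Processor = Callable[[str], str]


def strip_whitespace(content: str) -> str:
    """Remove leading and trailing whitespace from the whole content."""
    return content.strip()


def collapse_blank_lines(content: str) -> str:
    """Reduce each run of blank lines to a single blank line."""
    out: list[str] = []
    for line in content.splitlines():
        is_blank = line.strip() == ""
        if is_blank and out and out[-1] == "":
            continue  # skip a blank line that follows another blank line
        out.append("" if is_blank else line.rstrip())
    return "\n".join(out)


def run_processors(content: str, processors: Iterable[Processor]) -> str:
    """Apply each processor in order, feeding it the previous one's output."""
    return reduce(lambda acc, proc: proc(acc), processors, content)


cleaned = run_processors("  a\n\n\n\nb  ", [strip_whitespace, collapse_blank_lines])
```

As with middleware, order matters: each function sees the output of the one before it, so a setting listing the processors would also define the transformation order.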

joehybird force-pushed the index-pipeline branch from 3e822c5 to a2d3b08 Compare November 28, 2025 05:54
