@joehybird joehybird commented Nov 24, 2025

Add support for deferred loading, preprocessing & embedding of documents.

TODO

  • Throttle indexation tasks or create commands + cron
  • Parser for pdf files with albert
  • Download directly from storage APIs (S3)
  • Use base64 + "encoding=base64" argument for small binary files sent directly through index/ endpoint
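A minimal sketch of the last TODO item, assuming the file bytes travel inside a JSON body. The field names (`content`, `encoding`, `mimetype`) and both helper functions are hypothetical, not the actual index/ endpoint contract:

```python
import base64
import json


def build_index_payload(title: str, raw: bytes, mimetype: str) -> str:
    """Serialize a small binary file into a JSON body (hypothetical field names)."""
    payload = {
        "title": title,
        "mimetype": mimetype,
        # Binary content is not valid JSON, so it is sent as base64 text.
        "content": base64.b64encode(raw).decode("ascii"),
        # Assumed flag telling the receiver how to decode "content".
        "encoding": "base64",
    }
    return json.dumps(payload)


def read_content(body: str) -> bytes:
    """Receiver side: decode the content according to the declared encoding."""
    data = json.loads(body)
    if data.get("encoding") == "base64":
        return base64.b64decode(data["content"])
    # Without the flag, the content is treated as plain text.
    return data["content"].encode("utf-8")
```

Base64 grows the payload by roughly a third, which is why this only seems reasonable for small files.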

Wishlist

  • Add async support for download (use asyncio loop)
  • Parser for all formats with a dedicated service (docling, unstructured, ...)
  • Use hash to prevent indexing the same content multiple times
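The hash idea in the last wishlist item could look roughly like this. It is a sketch only: `SeenContent` is a hypothetical in-memory registry, and a real implementation would presumably store the digest alongside the indexed document instead:

```python
import hashlib


def content_fingerprint(content: bytes) -> str:
    """Return the hex SHA-256 digest of the raw document content."""
    return hashlib.sha256(content).hexdigest()


class SeenContent:
    """Hypothetical in-memory registry of already indexed fingerprints."""

    def __init__(self):
        self._seen = set()

    def should_index(self, content: bytes) -> bool:
        """Return True the first time a content is seen, False afterwards."""
        digest = content_fingerprint(content)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Two documents with identical bytes get the same digest, so the second one can be skipped before any loading, conversion or embedding work is done.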

@joehybird joehybird changed the title from "✨(backend) refactor indexation pipeline" to "WIP✨(backend) refactor indexation pipeline" Nov 24, 2025
size = factory.LazyFunction(lambda: fake.random_int(min=0, max=1024**2))
users = factory.LazyFunction(lambda: [str(uuid4()) for _ in range(3)])
groups = factory.LazyFunction(lambda: [slugify(fake.word()) for _ in range(3)])
reach = factory.Iterator(list(enums.ReachEnum))
Contributor

Shouldn't we have the embedding here?

# pylint: disable=too-many-instance-attributes
@dataclass
class IndexDocument:
    """Represents the _source data of opensearch entry"""
Contributor

@mascarpon3 mascarpon3 Nov 25, 2025

I do not think it represents the _source.

In this example you can see _source:

    assert fox_response["_source"] == {
        "depth": 1,
        "numchild": 0,
        "path": fox_document["path"],
        "size": fox_document["size"],
        "created_at": fox_document["created_at"].isoformat(),
        "updated_at": fox_document["updated_at"].isoformat(),
        "reach": fox_document["reach"],
        "title": fox_document["title"],
    }

I think this represents the mapping. The _source keys are the field names of the mapping.
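To make the distinction concrete, here is a rough sketch with plain dicts. The field types in `mapping`, the sample values in `hit` and the `unmapped_fields` helper are all assumptions for illustration, not taken from the actual index definition:

```python
# Index mapping: declares field names and their types (types are assumed here).
mapping = {
    "properties": {
        "title": {"type": "text"},
        "path": {"type": "keyword"},
        "depth": {"type": "integer"},
        "numchild": {"type": "integer"},
        "size": {"type": "long"},
        "reach": {"type": "keyword"},
        "created_at": {"type": "date"},
        "updated_at": {"type": "date"},
    }
}

# One search hit: `_source` carries the stored values for a single document.
hit = {
    "_id": "hypothetical-id",
    "_source": {
        "title": "The quick brown fox",
        "path": "0001",
        "depth": 1,
        "numchild": 0,
        "size": 1024,
        "reach": "public",
        "created_at": "2025-11-25T10:00:00",
        "updated_at": "2025-11-25T10:00:00",
    },
}


def unmapped_fields(hit: dict, mapping: dict) -> list:
    """Return the _source keys that are not declared in the mapping."""
    return sorted(set(hit["_source"]) - set(mapping["properties"]))
```

In this sketch every `_source` key is declared in the mapping, so `unmapped_fields(hit, mapping)` returns an empty list.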

@property
def is_loaded(self):
    """Returns true if in loaded status"""
    return self.content_status == enums.ContentStatusEnum.LOADED
Contributor

@mascarpon3 mascarpon3 Nov 25, 2025

Isn't "indexing_status" more descriptive than "content_status"?

Contributor Author

Hmm... it covers the loading and the preprocessing of the content, but also the embedding, so you may be right.

@joehybird joehybird force-pushed the index-pipeline branch 5 times, most recently from 55ccc2b to 06cb52f Compare November 26, 2025 17:14
Add support for deferred loading, preprocessing & embedding of documents.
Add mimetype, language, content_status & content_uri fields in document schema.

Signed-off-by: Fabre Florian <[email protected]>
Add AlbertAI client to wrap embedding & conversion API calls
Implement working pdf to markdown converter using Albert

Signed-off-by: Fabre Florian <[email protected]>
Use service.index_name instead of service.name in create_demo
command.

Signed-off-by: Fabre Florian <[email protected]>
@joehybird joehybird force-pushed the index-pipeline branch 2 times, most recently from 5c65736 to 3e822c5 Compare November 27, 2025 14:00
New processors mechanism in IndexerTaskService: after the loading & conversion
steps, a list of functions can be chained to transform the document content (like
Django middleware)

Signed-off-by: Fabre Florian <[email protected]>
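A minimal sketch of what such a processor chain might look like. `Doc`, the two sample processors and `run_processors` are hypothetical names, not the actual IndexerTaskService API, and this is a plain left-to-right pipeline, not the wrapping call style Django middleware uses:

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Iterable


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for the document being indexed."""

    title: str
    content: str


# A processor takes a document and returns a transformed copy.
Processor = Callable[[Doc], Doc]


def strip_whitespace(doc: Doc) -> Doc:
    """Remove leading and trailing whitespace from the content."""
    return replace(doc, content=doc.content.strip())


def drop_blank_lines(doc: Doc) -> Doc:
    """Remove lines that are empty or contain only whitespace."""
    lines = [line for line in doc.content.splitlines() if line.strip()]
    return replace(doc, content="\n".join(lines))


def run_processors(doc: Doc, processors: Iterable[Processor]) -> Doc:
    """Apply each processor in order, feeding its output to the next one."""
    return reduce(lambda current, proc: proc(current), processors, doc)
```

Keeping processors as plain functions over an immutable document means each one can be tested alone, and the order of the list is the only configuration needed.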

4 participants