WIP✨(backend) refactor indexation pipeline #26
base: main
1c7e5ae to
d65ecd4
Compare
size = factory.LazyFunction(lambda: fake.random_int(min=0, max=1024**2))
users = factory.LazyFunction(lambda: [str(uuid4()) for _ in range(3)])
groups = factory.LazyFunction(lambda: [slugify(fake.word()) for _ in range(3)])
reach = factory.Iterator(list(enums.ReachEnum))
Shouldn't we have the embedding here?
# pylint: disable=too-many-instance-attributes
@dataclass
class IndexDocument:
    """Represents the _source data of opensearch entry"""
I do not think it represents the _source.
In this example you can see the _source:
assert fox_response["_source"] == {
    "depth": 1,
    "numchild": 0,
    "path": fox_document["path"],
    "size": fox_document["size"],
    "created_at": fox_document["created_at"].isoformat(),
    "updated_at": fox_document["updated_at"].isoformat(),
    "reach": fox_document["reach"],
    "title": fox_document["title"],
}
I think this class represents the mapping. The _source holds the keys of the mapping.
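To make the distinction concrete, here is a minimal sketch using plain dictionaries, with no OpenSearch client involved. The field types, the `embedding` dimension and the sample values are assumptions for illustration and are not taken from this PR. It only shows that a mapping declares field types, while a `_source` is the JSON body stored for one document and may omit declared fields.

```python
# Hypothetical mapping: declares the type of each field (types are assumed).
mapping = {
    "properties": {
        "title": {"type": "text"},
        "depth": {"type": "integer"},
        "reach": {"type": "keyword"},
        "embedding": {"type": "knn_vector", "dimension": 4},
    }
}

# Hypothetical _source: the stored body of one document (values are made up).
source = {"title": "The quick brown fox", "depth": 1, "reach": "public"}

# Keys present in the document but not declared in the mapping.
undeclared = sorted(set(source) - set(mapping["properties"]))

# Fields declared in the mapping but absent from this document.
missing = sorted(set(mapping["properties"]) - set(source))
```

In this sketch every `_source` key is declared in the mapping, but the mapping also declares `embedding`, which this particular document does not carry.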
@property
def is_loaded(self):
    """Returns true if in loaded status"""
    return self.content_status == enums.ContentStatusEnum.LOADED
Isn't "indexing_status" more descriptive than "content_status"?
Hmm... it covers the loading and preprocessing of the content, but also the embedding, so you may be right.
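For readers following the thread, here is a rough sketch of how a single status field could track deferred loading. Only `LOADED` and the `is_loaded` property appear in the reviewed diff. The `WAIT` and `READY` members, the `Doc` class and the `load` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ContentStatus(str, Enum):
    """Hypothetical statuses; only LOADED is confirmed by the diff."""

    WAIT = "wait"      # content not fetched yet (assumed name)
    LOADED = "loaded"  # content fetched, not yet preprocessed or embedded
    READY = "ready"    # preprocessing and embedding done (assumed name)


@dataclass
class Doc:
    """Stand-in for the indexed document, not the PR's IndexDocument."""

    title: str
    content_uri: str = ""
    content: str = ""
    content_status: ContentStatus = ContentStatus.WAIT

    @property
    def is_loaded(self) -> bool:
        """Return True when the content has been fetched."""
        return self.content_status == ContentStatus.LOADED


def load(doc: Doc, fetch: Callable[[str], str]) -> Doc:
    """Fetch the deferred content and move the document to LOADED."""
    doc.content = fetch(doc.content_uri)
    doc.content_status = ContentStatus.LOADED
    return doc


doc = Doc(title="Fox", content_uri="file:///tmp/fox.txt")
was_loaded_before = doc.is_loaded
load(doc, lambda uri: "The quick brown fox")
```

Because the same field would advance through loading, preprocessing and embedding, a name covering the whole pipeline may fit better than one tied to the content alone.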
55ccc2b to
06cb52f
Compare
Add support for deferred loading, preprocessing & embedding of documents. Add mimetype, language, content_status & content_uri fields in document schema. Signed-off-by: Fabre Florian <[email protected]>
Add AlbertAI client to wrap embedding & conversion API calls. Implement a working PDF-to-Markdown converter using Albert. Signed-off-by: Fabre Florian <[email protected]>
Use service.index_name instead of service.name in create_demo command. Signed-off-by: Fabre Florian <[email protected]>
5c65736 to
3e822c5
Compare
New processors mechanism in IndexerTaskService: after the loading & conversion steps, a list of functions can be chained to transform the document content (like Django middlewares). Signed-off-by: Fabre Florian <[email protected]>
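The chaining idea described in this commit message could look roughly like the sketch below. It is not the PR's implementation: `Doc`, `run_processors` and the two sample processors are hypothetical names, and the real IndexerTaskService may wire processors differently.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Iterable


@dataclass(frozen=True)
class Doc:
    """Stand-in for the indexed document, not the PR's IndexDocument."""

    title: str
    content: str


# A processor takes a document and returns a transformed copy.
Processor = Callable[[Doc], Doc]


def strip_whitespace(doc: Doc) -> Doc:
    """Remove leading and trailing whitespace from the content."""
    return replace(doc, content=doc.content.strip())


def drop_blank_lines(doc: Doc) -> Doc:
    """Remove lines that contain only whitespace."""
    lines = [line for line in doc.content.splitlines() if line.strip()]
    return replace(doc, content="\n".join(lines))


def run_processors(doc: Doc, processors: Iterable[Processor]) -> Doc:
    """Apply each processor in order, feeding its output to the next one."""
    return reduce(lambda current, processor: processor(current), processors, doc)


result = run_processors(
    Doc(title="Fox", content="  quick\n\n brown fox  \n"),
    [strip_whitespace, drop_blank_lines],
)
```

Keeping each processor a plain function from document to document means the list can be reordered or extended through configuration without touching the loading and conversion steps.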
3e822c5 to
a2d3b08
Compare
Add support for deferred loading, preprocessing & embedding of documents.
TODO
index/endpoint
Wishlist