fix: improve embedding diagnostics and handle invalid db #423
Summary of Changes
Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the robustness and diagnostic capabilities of the application's embedding system. It introduces improved logging for embedding vector states when loading persisted databases, ensuring that issues with missing or empty embeddings are clearly identified. Crucially, it adds a fallback mechanism to automatically rebuild embeddings if a loaded database is found to be unusable. Furthermore, the changes harden the parsing of Google embedder responses to handle diverse data structures and sanitize logging to protect sensitive information, ultimately leading to a more reliable and transparent embedding pipeline.
Code Review
This pull request introduces valuable improvements to embedding diagnostics and robustness. When loading a persisted database, it now checks for empty embeddings and can trigger a rebuild, preventing the application from starting in a broken state. The logging for this process is also much more informative. On the Google embedder client side, the response parsing is now significantly more robust, able to handle various response structures. Additionally, logging of embedding API calls has been sanitized to prevent leaking sensitive content, which is a great security improvement. I've added one suggestion to enhance debuggability in the new parsing logic. Overall, this is a solid pull request that improves the reliability and security of the data pipeline.
    except Exception:
        pass
The broad except Exception can hide underlying issues that might be useful for debugging why embedding parsing is failing. While the goal is robustness, completely silencing exceptions can make it difficult to diagnose problems with new or unexpected response object structures. Consider adding a debug level log to record the exception. This will keep the logs clean in normal operation but provide valuable information when debugging.
Suggested change:
-    except Exception:
-        pass
+    except Exception as e:
+        log.debug("Failed to get embedding via '%s': %s", method_name, e)
+        pass
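For context, the suggestion above could sit inside a fallback loop like the following minimal sketch. The function name `extract_embedding`, the extractor labels, and the `embedding` key/attribute are hypothetical stand-ins chosen for illustration; they are not taken from the PR's actual code.

```python
import logging
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

log = logging.getLogger(__name__)

# Hypothetical accessors: one for dict-like responses, one for attribute-based ones.
EXTRACTORS: Dict[str, Callable[[Any], Any]] = {
    "dict key": lambda resp: resp["embedding"],
    "attribute": lambda resp: resp.embedding,
}


def extract_embedding(response: Any) -> List[float]:
    """Try each accessor in turn; log failures at debug level instead of hiding them."""
    for method_name, extractor in EXTRACTORS.items():
        try:
            value = extractor(response)
            if value is not None and len(value) > 0:
                return list(value)
        except Exception as e:
            # Quiet in normal operation, visible when debug logging is enabled.
            log.debug("Failed to get embedding via '%s': %s", method_name, e)
    return []


# Usage: both response shapes yield the same list.
dict_style = {"embedding": [0.1, 0.2]}
attr_style = SimpleNamespace(embedding=[0.1, 0.2])
```

With this shape, an unexpected response structure still returns an empty list, but each failed accessor leaves a debug record naming the method that failed.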
Summary
Improve embedding diagnostics and make startup more robust when a persisted LocalDB contains missing/empty embedding vectors. Also sanitize Google embedder logging and harden parsing to avoid silently producing empty embeddings.
Problem
When loading an existing persisted database (.pkl), the application can end up with documents whose vector embeddings are missing/empty. This leads to:
Document X has empty embedding vector, skipping
No valid embeddings found in any documents
Additionally, Google embedding responses can vary in structure depending on client/version; if parsing is too strict, embeddings may silently end up empty.
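The failure mode described here can be detected with a check along these lines. This is a minimal sketch, not the PR's implementation: the `Doc` class and the helper names `count_valid_embeddings` and `needs_rebuild` are hypothetical, and only the `vector` field name follows the description above.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a document loaded from the persisted database."""
    id: str
    vector: Sequence[float] = field(default_factory=list)


def count_valid_embeddings(docs: List[Doc]) -> int:
    """Count documents that carry a non-empty embedding vector."""
    # len() is used instead of truthiness so array-like vectors also work.
    return sum(1 for d in docs if d.vector is not None and len(d.vector) > 0)


def needs_rebuild(docs: List[Doc]) -> bool:
    """True when the loaded database has no usable embeddings at all."""
    return count_valid_embeddings(docs) == 0
```

A caller could run `needs_rebuild` right after loading the .pkl file and fall back to regenerating embeddings when it returns True, rather than continuing with an unusable index.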
Changes
Database load diagnostics (api/data_pipeline.py)
Google embedder logging + parsing robustness (api/google_embedder_client.py)
Log sanitized kwargs for embedding calls (no raw content; only lengths/counts).
Harden parse_embedding_response to extract embeddings from dict-like and attribute-based response objects.
Log parsed embedding count + dimension when successful.
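As an illustration of logging only lengths and counts, a sanitizer might look like the sketch below. The helper name `summarize_kwargs` and the example keyword names are hypothetical and are not taken from api/google_embedder_client.py.

```python
from typing import Any, Dict, Mapping


def summarize_kwargs(kwargs: Mapping[str, Any]) -> Dict[str, Any]:
    """Describe each argument by type and size, never by its raw content."""
    summary: Dict[str, Any] = {}
    for key, value in kwargs.items():
        if isinstance(value, str):
            # Record only the string length, not the text itself.
            summary[key] = {"type": "str", "length": len(value)}
        elif isinstance(value, (list, tuple)):
            # Record only how many items were passed.
            summary[key] = {"type": type(value).__name__, "count": len(value)}
        else:
            summary[key] = {"type": type(value).__name__}
    return summary
```

Passing the result of `summarize_kwargs(api_kwargs)` to the logger instead of `api_kwargs` itself keeps document text out of the logs while still showing how much data each call carried.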
Testing
Index built ..., FAISS retriever created successfully
Parsed 1 embedding(s) (dim=...)
Notes / Security