@danielfrey63
Contributor

Summary

Improve embedding diagnostics and make startup more robust when a persisted LocalDB contains missing/empty embedding vectors. Also sanitize Google embedder logging and harden parsing to avoid silently producing empty embeddings.

Problem

When loading an existing persisted database (.pkl), the application can end up with documents whose embedding vectors are missing or empty. This leads to:

  • Many "Document X has empty embedding vector, skipping" log messages
  • No valid embeddings found in any documents
  • Retriever cannot be created

Additionally, Google embedding responses can vary in structure depending on client/version; if parsing is too strict, embeddings may silently end up empty.

Changes

  • Database load diagnostics (api/data_pipeline.py)

    • When loading an existing LocalDB, log:
      • number of documents
      • count of non-empty vs empty embeddings
      • sample embedding dimensions
    • If all embeddings are empty, automatically fall back to rebuilding embeddings instead of returning unusable docs.
  • Google embedder logging + parsing robustness (api/google_embedder_client.py)

    • Log sanitized kwargs for embedding calls (no raw content; only lengths/counts).

    • Harden parse_embedding_response to extract embeddings from dict-like and attribute-based response objects.

    • Log parsed embedding count + dimension when successful.
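
The database load diagnostics described above could be sketched roughly as follows. This is an illustration, not the code in api/data_pipeline.py: the names EmbeddingStats and embedding_stats are hypothetical, and it assumes each document exposes its embedding as a `vector` attribute.

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence


@dataclass
class EmbeddingStats:
    """Counts gathered while scanning documents loaded from a persisted DB."""
    total: int = 0
    non_empty: int = 0
    empty: int = 0
    sample_dims: List[int] = field(default_factory=list)

    @property
    def all_empty(self) -> bool:
        # True only when there are documents and none has a usable vector.
        return self.total > 0 and self.non_empty == 0


def embedding_stats(docs: Sequence[Any], max_samples: int = 3) -> EmbeddingStats:
    """Count empty vs non-empty embeddings and record a few sample dimensions."""
    stats = EmbeddingStats(total=len(docs))
    for doc in docs:
        vector = getattr(doc, "vector", None)  # assumed attribute name
        try:
            dim = len(vector) if vector is not None else 0
        except TypeError:
            dim = 0  # not a sized object, treat as missing
        if dim > 0:
            stats.non_empty += 1
            if len(stats.sample_dims) < max_samples:
                stats.sample_dims.append(dim)
        else:
            stats.empty += 1
    return stats
```

A caller would log `total`, `non_empty`, `empty`, and `sample_dims`, then trigger a rebuild when `all_empty` is true. Treating an empty document list as "not all empty" is a choice made for this sketch only.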

Testing

  • Start the backend, open a repo (e.g. via the WebSocket flow), and observe:
    • DB load log includes embedding stats and sample dims
    • Retriever builds successfully (Index built ..., FAISS retriever created successfully)
    • Query embeddings are generated (Parsed 1 embedding(s) (dim=...))

Notes / Security

  • Logging intentionally avoids printing raw text input or API keys (sanitized logs only).
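
One way to log only lengths and counts is sketched below. The helper name sanitize_kwargs and the key sets are assumptions made for illustration; the actual keyword names and log format in api/google_embedder_client.py may differ.

```python
from typing import Any, Dict

# Hypothetical key sets; adjust to the keyword arguments actually in use.
SENSITIVE_KEYS = {"api_key", "key", "token", "authorization"}
VERBATIM_KEYS = {"model", "task_type"}


def sanitize_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a log-safe view of kwargs: no raw text, no secrets."""
    safe: Dict[str, Any] = {}
    for name, value in kwargs.items():
        if name.lower() in SENSITIVE_KEYS:
            safe[name] = "<redacted>"
        elif name in VERBATIM_KEYS:
            safe[name] = value  # short identifiers, assumed safe to log
        elif isinstance(value, str):
            safe[name] = f"<str len={len(value)}>"
        elif isinstance(value, (list, tuple)):
            # Report the item count and combined length of string items only.
            total = sum(len(v) for v in value if isinstance(v, str))
            safe[name] = f"<{type(value).__name__} count={len(value)} total_chars={total}>"
        elif isinstance(value, (int, float, bool)) or value is None:
            safe[name] = value
        else:
            safe[name] = f"<{type(value).__name__}>"
    return safe
```

The result can be passed to a logger in place of the raw kwargs, so the text being embedded never reaches the log output.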

@gemini-code-assist
Contributor

Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and diagnostic capabilities of the application's embedding system. It introduces improved logging for embedding vector states when loading persisted databases, ensuring that issues with missing or empty embeddings are clearly identified. Crucially, it adds a fallback mechanism to automatically rebuild embeddings if a loaded database is found to be unusable. Furthermore, the changes harden the parsing of Google embedder responses to handle diverse data structures and sanitize logging to protect sensitive information, ultimately leading to a more reliable and transparent embedding pipeline.

Highlights

  • Improved Embedding Diagnostics: Enhanced logging during LocalDB loading to show counts of non-empty/empty embeddings and sample dimensions.
  • Robust LocalDB Loading: Automatically rebuilds embeddings if a loaded LocalDB contains no usable vectors, preventing application failures.
  • Hardened Google Embedder Parsing: Made the Google embedder more resilient to variations in API response structures, ensuring embeddings are correctly extracted.
  • Sanitized Google Embedder Logging: Modified logging for Google embedding calls to prevent sensitive content from being logged, instead showing aggregated stats like content length.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces valuable improvements to embedding diagnostics and robustness. When loading a persisted database, it now checks for empty embeddings and can trigger a rebuild, preventing the application from starting in a broken state. The logging for this process is also much more informative. On the Google embedder client side, the response parsing is now significantly more robust, able to handle various response structures. Additionally, logging of embedding API calls has been sanitized to prevent leaking sensitive content, which is a great security improvement. I've added one suggestion to enhance debuggability in the new parsing logic. Overall, this is a solid pull request that improves the reliability and security of the data pipeline.

Comment on lines +113 to +114
    except Exception:
        pass

medium

The broad except Exception can hide underlying issues that might be useful for debugging why embedding parsing is failing. While the goal is robustness, completely silencing exceptions can make it difficult to diagnose problems with new or unexpected response object structures. Consider adding a debug level log to record the exception. This will keep the logs clean in normal operation but provide valuable information when debugging.

Suggested change
-    except Exception:
-        pass
+    except Exception as e:
+        log.debug("Failed to get embedding via '%s': %s", method_name, e)
+        pass
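
For context, the suggestion could sit inside a loop like the one sketched below, which tries a dict-style lookup and then an attribute lookup. This is a simplified illustration under assumptions, not the merged parse_embedding_response: the helper names, the "embedding" field name, and the extractor list are hypothetical.

```python
import logging
from typing import Any, List

log = logging.getLogger(__name__)

Vector = List[float]


def _normalize(value: Any) -> List[Vector]:
    """Turn one vector or a batch of vectors into a list of non-empty vectors."""
    if not isinstance(value, (list, tuple)) or len(value) == 0:
        return []
    if isinstance(value[0], (int, float)):
        return [list(value)]  # a single flat vector
    return [list(v) for v in value if isinstance(v, (list, tuple)) and len(v) > 0]


def _via_mapping(response: Any) -> Any:
    return response["embedding"]  # raises KeyError/TypeError if not dict-like


def _via_attribute(response: Any) -> Any:
    return response.embedding  # raises AttributeError if the field is absent


# Assumed access paths, tried in order.
EXTRACTORS = [("mapping", _via_mapping), ("attribute", _via_attribute)]


def extract_embeddings(response: Any) -> List[Vector]:
    """Try each access path; log failures at debug level instead of hiding them."""
    for method_name, extractor in EXTRACTORS:
        try:
            vectors = _normalize(extractor(response))
        except Exception as e:
            log.debug("Failed to get embedding via '%s': %s", method_name, e)
            continue
        if vectors:
            log.info("Parsed %d embedding(s) (dim=%d)", len(vectors), len(vectors[0]))
            return vectors
    return []
```

With debug logging enabled, an unexpected response shape leaves a trace per failed access path, while normal runs only show the "Parsed ..." line.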

@sng-asyncfunc sng-asyncfunc merged commit 79dac6e into AsyncFuncAI:main Dec 27, 2025
2 checks passed