fix: improve embedding diagnostics and handle invalid db #423

danielfrey63 · 2025-12-23T20:21:15Z

Summary

Improve embedding diagnostics and make startup more robust when a persisted LocalDB contains missing/empty embedding vectors. Also sanitize Google embedder logging and harden parsing to avoid silently producing empty embeddings.

Problem

When loading an existing persisted database (.pkl), the application can end up with documents whose embedding vectors are missing or empty. This leads to:

- Many "Document X has empty embedding vector, skipping" log messages
- "No valid embeddings found in any documents"
- The retriever cannot be created

Additionally, Google embedding responses can vary in structure depending on client/version; if parsing is too strict, embeddings may silently end up empty.

Changes

Database load diagnostics (api/data_pipeline.py)
- When loading an existing LocalDB, log:
  - number of documents
  - count of non-empty vs empty embeddings
  - sample embedding dimensions
- If all embeddings are empty, automatically fall back to rebuilding embeddings instead of returning unusable docs.
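
The load-time check described above can be sketched roughly as follows. This is a minimal illustration, not the code from api/data_pipeline.py: the `Doc` class, its `vector` attribute, and the `summarize_embeddings` and `load_or_rebuild` helpers are hypothetical names chosen for the example, and the decision to rebuild only when documents exist but none has a vector is an assumption.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a persisted document with an embedding."""
    text: str
    vector: List[float] = field(default_factory=list)


def summarize_embeddings(docs: Sequence[Any], sample_size: int = 3) -> Dict[str, Any]:
    """Count empty vs non-empty vectors and collect a few sample dimensions."""
    sizes = []
    for doc in docs:
        vec = getattr(doc, "vector", None)
        # len() avoids truthiness checks, which are ambiguous for array types.
        sizes.append(len(vec) if vec is not None else 0)
    non_empty = [s for s in sizes if s > 0]
    return {
        "total": len(sizes),
        "non_empty": len(non_empty),
        "empty": len(sizes) - len(non_empty),
        "sample_dims": non_empty[:sample_size],
    }


def load_or_rebuild(docs: Sequence[Any], rebuild: Callable[[], Sequence[Any]]) -> Sequence[Any]:
    """Return the loaded docs, or call rebuild() if none has a usable vector."""
    stats = summarize_embeddings(docs)
    logger.info(
        "Loaded %d documents: %d non-empty, %d empty embeddings, sample dims=%s",
        stats["total"], stats["non_empty"], stats["empty"], stats["sample_dims"],
    )
    # Assumption: only rebuild when there are documents and all vectors are empty.
    if stats["total"] > 0 and stats["non_empty"] == 0:
        logger.warning("All embeddings are empty; rebuilding instead of reusing the DB")
        return rebuild()
    return docs
```

Passing the rebuild step in as a callable keeps the check itself free of I/O, so it can be exercised without a real database file.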

Google embedder logging + parsing robustness (api/google_embedder_client.py)
- Log sanitized kwargs for embedding calls (no raw content; only lengths/counts).
- Harden parse_embedding_response to extract embeddings from dict-like and attribute-based response objects.
- Log parsed embedding count + dimension when successful.
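
As a rough illustration of what "dict-like and attribute-based" extraction can look like, here is a minimal sketch. It is not the PR's parse_embedding_response: the helper names `_get`, `_as_vector`, and `extract_embeddings` are hypothetical, and the response shapes it accepts (an `embedding` or `embeddings` field holding a flat vector, a list of vectors, or items with a `values` field) are assumptions made for the example.

```python
from collections.abc import Mapping
from types import SimpleNamespace
from typing import Any, List, Optional


def _get(obj: Any, key: str) -> Any:
    """Read a field from either a mapping or an attribute-based object."""
    if isinstance(obj, Mapping):
        return obj.get(key)
    return getattr(obj, key, None)


def _as_vector(item: Any) -> Optional[List[float]]:
    """Return item as a flat list of floats, or None if it is not a vector."""
    values = _get(item, "values")
    if values is None:
        values = _get(item, "embedding")
    if values is None:
        values = item
    if (
        isinstance(values, (list, tuple))
        and values
        and all(isinstance(v, (int, float)) for v in values)
    ):
        return [float(v) for v in values]
    return None


def extract_embeddings(response: Any) -> List[List[float]]:
    """Collect every embedding vector found in the response (may be empty)."""
    raw = _get(response, "embeddings")
    if raw is None:
        raw = _get(response, "embedding")
    if raw is None:
        return []
    single = _as_vector(raw)  # one flat vector, or one object with values
    if single is not None:
        return [single]
    if isinstance(raw, (list, tuple)):
        vectors = [_as_vector(item) for item in raw]
        return [v for v in vectors if v is not None]
    return []
```

A caller could then log `len(result)` and `len(result[0])` on success, and treat an empty result as a parsing failure worth reporting rather than passing it on silently.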

Testing

Start backend, open a repo (e.g. via WebSocket flow) and observe:
- DB load log includes embedding stats and sample dims
- Retriever builds successfully (Index built ..., FAISS retriever created successfully)
- Query embeddings are generated (Parsed 1 embedding(s) (dim=...))

Notes / Security

Logging intentionally avoids printing raw text input or API keys (sanitized logs only).
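
One way to keep such logs free of raw content is to reduce each argument to a length or count before logging. The sketch below is illustrative only: `sanitize_kwargs`, `SAFE_KEYS`, and `SENSITIVE_KEYS` are hypothetical names, and the choice of which keys may pass through unchanged is an assumption, not the behavior of api/google_embedder_client.py.

```python
from typing import Any, Dict

# Assumption: these keys hold non-sensitive configuration and may be logged as-is.
SAFE_KEYS = {"model", "task_type"}
# Assumption: these keys must never appear in logs, not even as a length.
SENSITIVE_KEYS = {"api_key"}


def sanitize_kwargs(api_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a log-safe view of the call arguments (lengths and counts only)."""
    safe: Dict[str, Any] = {}
    for key, value in api_kwargs.items():
        if key in SENSITIVE_KEYS:
            safe[key] = "<redacted>"
        elif key in SAFE_KEYS:
            safe[key] = value
        elif isinstance(value, str):
            safe[key] = f"<str len={len(value)}>"
        elif isinstance(value, (list, tuple)):
            total = sum(len(v) for v in value if isinstance(v, str))
            safe[key] = f"<{type(value).__name__} count={len(value)} total_len={total}>"
        else:
            # Unknown types are reduced to their type name to avoid leaking content.
            safe[key] = f"<{type(value).__name__}>"
    return safe
```

Defaulting unknown keys to a type name means a newly added argument stays out of the logs until someone explicitly allows it.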

gemini-code-assist · 2025-12-23T20:21:32Z

Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and diagnostic capabilities of the application's embedding system. It introduces improved logging for embedding vector states when loading persisted databases, ensuring that issues with missing or empty embeddings are clearly identified. Crucially, it adds a fallback mechanism to automatically rebuild embeddings if a loaded database is found to be unusable. Furthermore, the changes harden the parsing of Google embedder responses to handle diverse data structures and sanitize logging to protect sensitive information, ultimately leading to a more reliable and transparent embedding pipeline.

Highlights

Improved Embedding Diagnostics: Enhanced logging during LocalDB loading to show counts of non-empty/empty embeddings and sample dimensions.
Robust LocalDB Loading: Automatically rebuilds embeddings if a loaded LocalDB contains no usable vectors, preventing application failures.
Hardened Google Embedder Parsing: Made the Google embedder more resilient to variations in API response structures, ensuring embeddings are correctly extracted.
Sanitized Google Embedder Logging: Modified logging for Google embedding calls to prevent sensitive content from being logged, instead showing aggregated stats like content length.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces valuable improvements to embedding diagnostics and robustness. When loading a persisted database, it now checks for empty embeddings and can trigger a rebuild, preventing the application from starting in a broken state. The logging for this process is also much more informative. On the Google embedder client side, the response parsing is now significantly more robust, able to handle various response structures. Additionally, logging of embedding API calls has been sanitized to prevent leaking sensitive content, which is a great security improvement. I've added one suggestion to enhance debuggability in the new parsing logic. Overall, this is a solid pull request that improves the reliability and security of the data pipeline.

gemini-code-assist · 2025-12-23T20:23:26Z

api/google_embedder_client.py

+                        except Exception:
+                            pass


The broad except Exception can hide underlying issues that would help explain why embedding parsing is failing. While the goal is robustness, completely silencing exceptions makes it difficult to diagnose problems with new or unexpected response object structures. Consider adding a debug-level log to record the exception. This keeps the logs clean in normal operation but provides valuable information when debugging.

Suggested change

-                        except Exception:
-                            pass
+                        except Exception as e:
+                            log.debug("Failed to get embedding via '%s': %s", method_name, e)
+                            pass
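
To show the suggested pattern in a runnable context, here is a small sketch of a loop that tries several accessor methods and logs each failure at debug level. Only the log message and the `method_name` variable come from the suggestion; the surrounding function `first_successful`, the accessor names in the usage comment, and the loop structure are hypothetical and do not reproduce the code in api/google_embedder_client.py.

```python
import logging
from typing import Any, Iterable, Optional

log = logging.getLogger(__name__)


def first_successful(obj: Any, method_names: Iterable[str]) -> Optional[Any]:
    """Call each named zero-argument method in turn; return the first non-empty result."""
    for method_name in method_names:
        try:
            result = getattr(obj, method_name)()
        except Exception as e:
            # Keep going, but leave a trace for anyone debugging odd response objects.
            log.debug("Failed to get embedding via '%s': %s", method_name, e)
            continue
        if result:
            return result
    return None


# Usage (hypothetical accessor names):
#   data = first_successful(response, ["to_dict", "model_dump"])
```

Once the handler contains a logging call, a trailing `pass` has no effect and can be dropped; the sketch uses `continue` only to make the control flow explicit.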

fix: improve embedding diagnostics and handle invalid db

adb68e4

gemini-code-assist bot reviewed Dec 23, 2025

sng-asyncfunc merged commit 79dac6e into AsyncFuncAI:main Dec 27, 2025
2 checks passed
