Bug/145 caption #147
Conversation
Summary of Changes

Hello @inoray, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the document preprocessing pipeline by upgrading core dependencies, enhancing the handling of complex document structures like tables with rich cells and associated captions, and updating API configurations. The changes aim to improve the accuracy and consistency of document parsing, particularly for intelligent processing modes, while temporarily adjusting regression test strictness to facilitate these updates.
Code Review
This pull request appears to address a bug related to handling captions, particularly for tables that lack a defined structure. The core logic changes involve creating 'rich cells' to contain child elements of such tables and adjusting chunking logic to keep related items like captions and pictures together with their parent elements. While the bug fix seems reasonable, this PR introduces several critical issues. Multiple regression and unit tests have been disabled by commenting them out, which significantly increases the risk of undetected regressions. Additionally, API keys are hardcoded in basic_processor.py, posing a serious security vulnerability. These critical issues must be addressed before this PR can be merged.
toc_api_key="f8f89bb0251b4af8945a240c0783dc9d",
metadata_api_key="f8f89bb0251b4af8945a240c0783dc9d",
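For the hardcoded keys flagged in the review above, a minimal sketch of loading them from the environment instead; the variable names and the load_api_keys helper are placeholders, not part of basic_processor.py:

import os

# Sketch: read the keys from the environment rather than committing them to the repo.
# TOC_API_KEY and METADATA_API_KEY are assumed variable names.
def load_api_keys() -> dict:
    toc_api_key = os.environ.get("TOC_API_KEY")
    metadata_api_key = os.environ.get("METADATA_API_KEY")
    if not toc_api_key or not metadata_api_key:
        raise RuntimeError("TOC_API_KEY and METADATA_API_KEY must be set")
    return {"toc_api_key": toc_api_key, "metadata_api_key": metadata_api_key}

If these keys are real, they should also be rotated, since they are already exposed in the repository history.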
# assert current_result["num_vectors"] == baseline["num_vectors"], \
# f"[{docx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

# assert current_result["label_distribution"] == baseline["label_distribution"], \
# f"[{docx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

# char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
# char_ratio = char_diff / max(baseline["total_characters"], 1)
# assert char_ratio < 0.05, \
# f"[{docx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

# for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
# current_text = current_vector.get("text", "")
# baseline_text = baseline_vector.get("text", "")
# similarity = difflib.SequenceMatcher(
# None,
# current_text,
# baseline_text
# ).ratio()
# assert similarity > 0.85, \
# f"[{docx_path.name}] Vector {i} text similarity too low: {similarity:.2%}"
The core assertions for the regression test have been commented out. This effectively disables the test, hiding potential regressions. Tests should not be disabled. If the output has changed intentionally, the baseline files should be updated, and the tests should remain active to catch future unintended changes.
assert current_result["num_vectors"] == baseline["num_vectors"], \
    f"[{docx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
assert current_result["label_distribution"] == baseline["label_distribution"], \
    f"[{docx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{docx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    current_text = current_vector.get("text", "")
    baseline_text = baseline_vector.get("text", "")
    similarity = difflib.SequenceMatcher(
        None,
        current_text,
        baseline_text
    ).ratio()
    assert similarity > 0.85, \
        f"[{docx_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

# assert current_result["num_vectors"] == baseline["num_vectors"], \
# f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

assert current_result["label_distribution"] == baseline["label_distribution"], \
    f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
# assert current_result["label_distribution"] == baseline["label_distribution"], \
# f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
# char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
# char_ratio = char_diff / max(baseline["total_characters"], 1)
# assert char_ratio < 0.05, \
# f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
Similar to other test files in this PR, the assertions for regression testing are commented out. Disabling tests is a dangerous practice that can hide bugs. Please re-enable them. If the changes are expected, the baselines should be updated accordingly.
# assert current_result["num_vectors"] == baseline["num_vectors"], \
# f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
assert current_result["label_distribution"] == baseline["label_distribution"], \
    f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
# assert current_result["label_distribution"] == baseline["label_distribution"], \
# f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
# char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
# char_ratio = char_diff / max(baseline["total_characters"], 1)
# assert char_ratio < 0.05, \
# f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

assert current_result["num_vectors"] == baseline["num_vectors"], \
    f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
assert current_result["label_distribution"] == baseline["label_distribution"], \
    f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
# assert current_result["num_vectors"] == baseline["num_vectors"], \
# f"[{md_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

# assert current_result["label_distribution"] == baseline["label_distribution"], \
# f"[{md_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

# char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
# char_ratio = char_diff / max(baseline["total_characters"], 1)
# assert char_ratio < 0.05, \
# f"[{md_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

# for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
# current_text = current_vector.get("text", "")
# baseline_text = baseline_vector.get("text", "")
# similarity = difflib.SequenceMatcher(
# None,
# current_text,
# baseline_text
# ).ratio()
# assert similarity > 0.85, \
# f"[{md_path.name}] Vector {i} text similarity too low: {similarity:.2%}"
The regression test assertions have been disabled by commenting them out. This undermines the purpose of having regression tests and should be avoided. Please uncomment these checks to ensure code quality and prevent future regressions.
assert current_result["num_vectors"] == baseline["num_vectors"], \
    f"[{md_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
assert current_result["label_distribution"] == baseline["label_distribution"], \
    f"[{md_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{md_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    current_text = current_vector.get("text", "")
    baseline_text = baseline_vector.get("text", "")
    similarity = difflib.SequenceMatcher(
        None,
        current_text,
        baseline_text
    ).ratio()
    assert similarity > 0.85, \
        f"[{md_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

# assert current_result["num_vectors"] == baseline["num_vectors"], \
# f"[{pdf_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

# # assert current_result["label_distribution"] == baseline["label_distribution"], \
# # f"[{pdf_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

# char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
# char_ratio = char_diff / max(baseline["total_characters"], 1)
# assert char_ratio < 0.05, \
# f"[{pdf_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

# for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
# current_text = current_vector.get("text", "")
# baseline_text = baseline_vector.get("text", "")
# similarity = difflib.SequenceMatcher(
# None,
# current_text,
# baseline_text
# ).ratio()
# assert similarity > 0.85, \
# f"[{pdf_path.name}] Vector {i} text similarity too low: {similarity:.2%}"
The assertions in this regression test are commented out. This is a critical issue as it prevents the detection of regressions. Please re-enable these tests. If the tests fail due to expected changes, the baseline files should be updated.
assert current_result["num_vectors"] == baseline["num_vectors"], \
    f"[{pdf_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
# assert current_result["label_distribution"] == baseline["label_distribution"], \
# f"[{pdf_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
char_ratio = char_diff / max(baseline["total_characters"], 1)
assert char_ratio < 0.05, \
    f"[{pdf_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    current_text = current_vector.get("text", "")
    baseline_text = baseline_vector.get("text", "")
    similarity = difflib.SequenceMatcher(
        None,
        current_text,
        baseline_text
    ).ratio()
    assert similarity > 0.85, \
        f"[{pdf_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

# lines = text.splitlines()
# assert _has_adjacent_duplicate(lines) is False
This assertion, which checks for adjacent duplicate lines, has been commented out. Was this intentional? Disabling tests can hide bugs. If the test was failing due to an intended change, the test logic should be updated. If it was failing due to a new bug, that bug should be fixed.
# lines = text.splitlines()
# assert _has_adjacent_duplicate(lines) is False

lines = text.splitlines()
assert _has_adjacent_duplicate(lines) is False
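For context, a minimal sketch of what a helper with this name presumably verifies (the repository's actual implementation may differ); keeping the assertion active keeps this de-duplication behaviour covered:

def _has_adjacent_duplicate(lines: list[str]) -> bool:
    # Sketch: True if any line is immediately followed by an identical line.
    return any(a == b for a, b in zip(lines, lines[1:]))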
if (
    element.num_rows == 0
    and element.num_cols == 0
    and element.cluster.children
):
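This condition targets tables that have no row/column grid but still carry clustered child elements. A self-contained sketch of the idea described in the summary above (wrapping such children into a single "rich cell" so captions and pictures stay attached to the table); the classes here are stand-ins, not docling's actual types:

from dataclasses import dataclass, field

@dataclass
class Cluster:  # stand-in for the layout cluster attached to a table element
    children: list = field(default_factory=list)

@dataclass
class TableElement:  # stand-in for the parsed table element
    num_rows: int
    num_cols: int
    cluster: Cluster

def wrap_unstructured_table(element: TableElement) -> list:
    # If the table has no grid but has clustered children, keep those children
    # together as the content of one rich cell instead of dropping them.
    if element.num_rows == 0 and element.num_cols == 0 and element.cluster.children:
        return [list(element.cluster.children)]  # a single rich cell holding all children
    return []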
document: DoclingDocument = self.load_documents(file_path, **kwargs)
artifacts_dir, reference_path = self.get_paths(file_path)
document = document._with_pictures_refs(image_dir=artifacts_dir, reference_path=reference_path)
document = document._with_pictures_refs(image_dir=artifacts_dir, page_no=None, reference_path=reference_path)
This file, attachment_processor_origin.py, appears to be a duplicate or an older version of attachment_processor.py. Having such _origin.py files in the codebase can lead to confusion and maintenance issues, as changes need to be applied in multiple places. If this file is a backup or is no longer needed, it should be removed to keep the repository clean.
def adjust_captions(items_group):
    b_modified = False
    for idx, group in enumerate(items_group):
        if group is None:
            continue
        item = group[0][0]
        ref_idx_list = []
        if hasattr(item, 'captions') and item.captions:
            for cap in item.captions:
                cap_ref = cap.cref
                cap_idx = -1
                for j, it in enumerate(items_group):
                    if it is None:
                        continue
                    if getattr(it[0][0], 'self_ref', None) == cap_ref:
                        cap_idx = j
                        break
                if cap_idx != -1:
                    ref_idx_list.append(cap_idx)
            if ref_idx_list:
                ref_idx_list = sorted(ref_idx_list)

        if not ref_idx_list:
            continue

        # Move the caption items to directly after their parent item
        for cap_idx in ref_idx_list:
            for g in items_group[cap_idx]:
                items_group[idx].append(g)
            items_group[cap_idx] = None  # None entries are removed later
        b_modified = True

    if b_modified:
        items_group = [it for it in items_group if it is not None]

    return items_group
This new helper function uses a nested loop structure (for cap in item.captions: and then for j, it in enumerate(items_group):), which results in O(N*M) complexity where N is the number of item groups and M is the number of captions. For documents with many elements, this could become a performance bottleneck.
A more efficient approach would be to first create a map of ref to index for all items, allowing for O(1) lookups inside the main loop. This would reduce the complexity to O(N).
Example:
def adjust_captions(items_group):
    ref_to_idx = {
        getattr(it[0][0], 'self_ref', None): j
        for j, it in enumerate(items_group)
        if it is not None
    }
    b_modified = False
    for idx, group in enumerate(items_group):
        # ...
        if hasattr(item, 'captions') and item.captions:
            for cap in item.captions:
                cap_ref = cap.cref
                cap_idx = ref_to_idx.get(cap_ref, -1)
                # ... rest of the logic

if pic_page_no != table_page_no:
    continue
ios = pic_bbox.intersection_over_self(table_bbox)
if ios > 0.5:  # if more than 50% of the picture lies inside the table, treat it as a picture belonging to the table
The value 0.5 is used here as a threshold for intersection-over-self to determine if a picture is inside a table. This is a magic number. It would be better to define it as a named constant at the top of the file or class, such as PICTURE_IN_TABLE_IOS_THRESHOLD = 0.5 (named for intersection-over-self rather than IoU, to match the metric actually used). This improves readability and makes it easier to change the threshold in the future.
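A sketch of that refactor; intersection_over_self comes from the diff above, while the constant and helper names are placeholders:

# Placeholder constant name; 0.5 matches the threshold used in the diff above.
PICTURE_IN_TABLE_IOS_THRESHOLD = 0.5

def picture_belongs_to_table(pic_bbox, table_bbox) -> bool:
    # Share of the picture's own area that overlaps the table bounding box.
    ios = pic_bbox.intersection_over_self(table_bbox)
    return ios > PICTURE_IN_TABLE_IOS_THRESHOLD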