
Conversation

@inoray (Collaborator) commented Jan 13, 2026

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@gemini-code-assist

Summary of Changes

Hello @inoray, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the document preprocessing pipeline by upgrading core dependencies, enhancing the handling of complex document structures like tables with rich cells and associated captions, and updating API configurations. The changes aim to improve the accuracy and consistency of document parsing, particularly for intelligent processing modes, while temporarily adjusting regression test strictness to facilitate these updates.

Highlights

  • Dependency Updates: Updated docling-core to 2.58.0 and docling-ibm-models to 3.10.3 across multiple pyproject.toml files. Several tree-sitter-related dependencies were also added.
  • Table and Caption Processing Enhancements: Introduced new logic in docling/models/readingorder_model.py to handle rich table cells, creating a 1x1 rich cell for tables without structural prediction but with children. New functions adjust_captions and adjust_pictures_in_tables were added in intelligent_processor_law.py to improve the grouping of captions with their parent items and pictures within tables during document splitting.
  • API Endpoint and Key Updates: Modified API base URLs and keys for TOC and metadata enrichment in basic_processor.py and 다우기술.py, transitioning to new https://genos.genon.ai endpoints.
  • Type Hint Modernization: Updated Python type hints in docling/models/readingorder_model.py from typing.List and typing.Dict to the more modern built-in list and dict.
  • Regression Test Adjustments: Several assertions related to vector count, label distribution, character count, and text similarity have been commented out in regression tests for DOCX, HWPX, MD, and PDF files, likely to accommodate changes from the dependency updates or new processing logic.


@gemini-code-assist (bot) left a comment

Code Review

This pull request appears to address a bug related to handling captions, particularly for tables that lack a defined structure. The core logic changes involve creating 'rich cells' to contain child elements of such tables and adjusting chunking logic to keep related items like captions and pictures together with their parent elements. While the bug fix seems reasonable, this PR introduces several critical issues. Multiple regression and unit tests have been disabled by commenting them out, which significantly increases the risk of undetected regressions. Additionally, API keys are hardcoded in basic_processor.py, posing a serious security vulnerability. These critical issues must be addressed before this PR can be merged.

Comment on lines +1092 to +1093
toc_api_key="f8f89bb0251b4af8945a240c0783dc9d",
metadata_api_key="f8f89bb0251b4af8945a240c0783dc9d",

security-critical critical

These API keys are hardcoded directly in the source code. This is a significant security risk. Secrets like API keys should be managed outside of the codebase, for example, using environment variables or a secret management service, and loaded at runtime.
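
A minimal sketch of loading these keys from the environment instead (the variable names TOC_API_KEY and METADATA_API_KEY are illustrative, not part of the project's configuration):

import os

# Read the secrets at runtime; fail fast with a clear error if they are missing.
toc_api_key = os.environ.get("TOC_API_KEY")
metadata_api_key = os.environ.get("METADATA_API_KEY")
if not toc_api_key or not metadata_api_key:
    raise RuntimeError("TOC_API_KEY and METADATA_API_KEY must be set in the environment")

The values can then be supplied by the deployment environment or a secret manager rather than committed to the repository.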

Comment on lines +57 to +77
    # assert current_result["num_vectors"] == baseline["num_vectors"], \
    #     f"[{docx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    # assert current_result["label_distribution"] == baseline["label_distribution"], \
    #     f"[{docx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    # char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    # char_ratio = char_diff / max(baseline["total_characters"], 1)
    # assert char_ratio < 0.05, \
    #     f"[{docx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    # for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    #     current_text = current_vector.get("text", "")
    #     baseline_text = baseline_vector.get("text", "")
    #     similarity = difflib.SequenceMatcher(
    #         None,
    #         current_text,
    #         baseline_text
    #     ).ratio()
    #     assert similarity > 0.85, \
    #         f"[{docx_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

critical

The core assertions for the regression test have been commented out. This effectively disables the test, hiding potential regressions. Tests should not be disabled. If the output has changed intentionally, the baseline files should be updated, and the tests should remain active to catch future unintended changes.

    assert current_result["num_vectors"] == baseline["num_vectors"], \
        f"[{docx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    assert current_result["label_distribution"] == baseline["label_distribution"], \
        f"[{docx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{docx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
        current_text = current_vector.get("text", "")
        baseline_text = baseline_vector.get("text", "")
        similarity = difflib.SequenceMatcher(
            None,
            current_text,
            baseline_text
        ).ratio()
        assert similarity > 0.85, \
            f"[{docx_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

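If the new output is intentional, one way to keep these assertions active is to make refreshing the baselines an explicit, opt-in step rather than disabling checks. The helper below is only a sketch — the UPDATE_BASELINES environment variable, the maybe_update_baseline name, and the assumption that baseline_path is a pathlib.Path pointing at a JSON baseline are illustrative, not part of this repository:

import json
import os

def maybe_update_baseline(baseline_path, current_result):
    """Rewrite the baseline JSON when UPDATE_BASELINES=1 is set; otherwise leave it untouched."""
    if os.environ.get("UPDATE_BASELINES") == "1":
        baseline_path.write_text(json.dumps(current_result, ensure_ascii=False, indent=2))
        return True  # the caller may skip the assertions for this refresh run
    return False

Run with UPDATE_BASELINES=1 only when a change in output has been reviewed and accepted; normal runs keep every assertion enabled.
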
Comment on lines +57 to +66
    # assert current_result["num_vectors"] == baseline["num_vectors"], \
    #     f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    assert current_result["label_distribution"] == baseline["label_distribution"], \
        f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
    # assert current_result["label_distribution"] == baseline["label_distribution"], \
    #     f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
    # char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    # char_ratio = char_diff / max(baseline["total_characters"], 1)
    # assert char_ratio < 0.05, \
    #     f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

critical

Similar to other test files in this PR, the assertions for regression testing are commented out. Disabling tests is a dangerous practice that can hide bugs. Please re-enable them. If the changes are expected, the baselines should be updated accordingly.

Suggested change
    # assert current_result["num_vectors"] == baseline["num_vectors"], \
    #     f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
    assert current_result["label_distribution"] == baseline["label_distribution"], \
        f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
    # assert current_result["label_distribution"] == baseline["label_distribution"], \
    #     f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"
    # char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    # char_ratio = char_diff / max(baseline["total_characters"], 1)
    # assert char_ratio < 0.05, \
    #     f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    assert current_result["num_vectors"] == baseline["num_vectors"], \
        f"[{hwpx_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"
    assert current_result["label_distribution"] == baseline["label_distribution"], \
        f"[{hwpx_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"
    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{hwpx_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

Comment on lines +57 to +77
    # assert current_result["num_vectors"] == baseline["num_vectors"], \
    #     f"[{md_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    # assert current_result["label_distribution"] == baseline["label_distribution"], \
    #     f"[{md_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    # char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    # char_ratio = char_diff / max(baseline["total_characters"], 1)
    # assert char_ratio < 0.05, \
    #     f"[{md_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    # for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    #     current_text = current_vector.get("text", "")
    #     baseline_text = baseline_vector.get("text", "")
    #     similarity = difflib.SequenceMatcher(
    #         None,
    #         current_text,
    #         baseline_text
    #     ).ratio()
    #     assert similarity > 0.85, \
    #         f"[{md_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

critical

The regression test assertions have been disabled by commenting them out. This undermines the purpose of having regression tests and should be avoided. Please uncomment these checks to ensure code quality and prevent future regressions.

    assert current_result["num_vectors"] == baseline["num_vectors"], \
        f"[{md_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    assert current_result["label_distribution"] == baseline["label_distribution"], \
        f"[{md_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{md_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
        current_text = current_vector.get("text", "")
        baseline_text = baseline_vector.get("text", "")
        similarity = difflib.SequenceMatcher(
            None,
            current_text,
            baseline_text
        ).ratio()
        assert similarity > 0.85, \
            f"[{md_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

Comment on lines +70 to +90
    # assert current_result["num_vectors"] == baseline["num_vectors"], \
    #     f"[{pdf_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    # # assert current_result["label_distribution"] == baseline["label_distribution"], \
    # #     f"[{pdf_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    # char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    # char_ratio = char_diff / max(baseline["total_characters"], 1)
    # assert char_ratio < 0.05, \
    #     f"[{pdf_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    # for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
    #     current_text = current_vector.get("text", "")
    #     baseline_text = baseline_vector.get("text", "")
    #     similarity = difflib.SequenceMatcher(
    #         None,
    #         current_text,
    #         baseline_text
    #     ).ratio()
    #     assert similarity > 0.85, \
    #         f"[{pdf_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

critical

The assertions in this regression test are commented out. This is a critical issue as it prevents the detection of regressions. Please re-enable these tests. If the tests fail due to expected changes, the baseline files should be updated.

    assert current_result["num_vectors"] == baseline["num_vectors"], \
        f"[{pdf_path.name}] Vector count mismatch: {current_result['num_vectors']} != {baseline['num_vectors']}"

    # assert current_result["label_distribution"] == baseline["label_distribution"], \
    #     f"[{pdf_path.name}] Label distribution mismatch:\nCurrent: {current_result['label_distribution']}\nBaseline: {baseline['label_distribution']}"

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, \
        f"[{pdf_path.name}] Character count difference too large: {char_diff} chars ({char_ratio:.1%} change)"

    for i, (current_vector, baseline_vector) in enumerate(zip(current_result["vectors"], baseline["vectors"])):
        current_text = current_vector.get("text", "")
        baseline_text = baseline_vector.get("text", "")
        similarity = difflib.SequenceMatcher(
            None,
            current_text,
            baseline_text
        ).ratio()
        assert similarity > 0.85, \
            f"[{pdf_path.name}] Vector {i} text similarity too low: {similarity:.2%}"

Comment on lines +63 to +64
    # lines = text.splitlines()
    # assert _has_adjacent_duplicate(lines) is False

high

This assertion, which checks for adjacent duplicate lines, has been commented out. Was this intentional? Disabling tests can hide bugs. If the test was failing due to an intended change, the test logic should be updated. If it was failing due to a new bug, that bug should be fixed.

Suggested change
    # lines = text.splitlines()
    # assert _has_adjacent_duplicate(lines) is False
    lines = text.splitlines()
    assert _has_adjacent_duplicate(lines) is False

Comment on lines +261 to +265
if (
    element.num_rows == 0
    and element.num_cols == 0
    and element.cluster.children
):

medium

This condition is partially re-evaluated from the check on line 217. For clarity and to avoid redundant checks, you could define a boolean variable table_has_no_structure at the beginning of the elif block and reuse it here and in the earlier check.
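
A minimal sketch of that refactor (the variable name and placement are illustrative; element comes from the surrounding readingorder_model.py code):

# Evaluate once at the start of the elif block and reuse it for both checks.
table_has_no_structure = element.num_rows == 0 and element.num_cols == 0

# ... earlier check that previously tested num_rows/num_cols directly ...

if table_has_no_structure and element.cluster.children:
    # build the 1x1 rich cell for a table without structural prediction
    ...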

document: DoclingDocument = self.load_documents(file_path, **kwargs)
artifacts_dir, reference_path = self.get_paths(file_path)
document = document._with_pictures_refs(image_dir=artifacts_dir, reference_path=reference_path)
document = document._with_pictures_refs(image_dir=artifacts_dir, page_no=None, reference_path=reference_path)

medium

This file, attachment_processor_origin.py, appears to be a duplicate or an older version of attachment_processor.py. Having such _origin.py files in the codebase can lead to confusion and maintenance issues, as changes need to be applied in multiple places. If this file is a backup or is no longer needed, it should be removed to keep the repository clean.

Comment on lines +574 to +610
def adjust_captions(items_group):

    b_modified = False
    for idx, group in enumerate(items_group):
        if group is None:
            continue
        item = group[0][0]
        ref_idx_list = []
        if hasattr(item, 'captions') and item.captions:
            for cap in item.captions:
                cap_ref = cap.cref
                cap_idx = -1
                for j, it in enumerate(items_group):
                    if it is None:
                        continue
                    if getattr(it[0][0], 'self_ref', None) == cap_ref:
                        cap_idx = j
                        break
                if cap_idx != -1:
                    ref_idx_list.append(cap_idx)
            if ref_idx_list:
                ref_idx_list = sorted(ref_idx_list)

        if not ref_idx_list:
            continue

        # Move the caption items to directly after their parent item
        for cap_idx in ref_idx_list:
            for g in items_group[cap_idx]:
                items_group[idx].append(g)
            items_group[cap_idx] = None  # None entries are removed later
            b_modified = True

    if b_modified:
        items_group = [it for it in items_group if it is not None]

    return items_group

medium

This new helper function uses a nested loop structure (for cap in item.captions: and then for j, it in enumerate(items_group):), which results in O(N*M) complexity where N is the number of item groups and M is the number of captions. For documents with many elements, this could become a performance bottleneck.

A more efficient approach would be to first create a map of ref to index for all items, allowing for O(1) lookups inside the main loop. This would reduce the complexity to O(N).

Example:

def adjust_captions(items_group):
    ref_to_idx = {
        getattr(it[0][0], 'self_ref', None): j
        for j, it in enumerate(items_group)
        if it is not None
    }

    b_modified = False
    for idx, group in enumerate(items_group):
        # ...
        if hasattr(item, 'captions') and item.captions:
            for cap in item.captions:
                cap_ref = cap.cref
                cap_idx = ref_to_idx.get(cap_ref, -1)
                # ... rest of the logic

if pic_page_no != table_page_no:
    continue
ios = pic_bbox.intersection_over_self(table_bbox)
if ios > 0.5:  # if at least 50% of the picture lies inside the table, treat it as a picture belonging to the table

medium

The value 0.5 is used here as a threshold for intersection-over-self to determine whether a picture is inside a table. This is a 'magic number'. It would be better to define it as a named constant at the top of the file or class, such as PICTURE_IN_TABLE_IOS_THRESHOLD = 0.5 (named after intersection-over-self, the metric actually used). This improves readability and makes it easier to adjust the threshold later.
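
A minimal sketch (the constant name is only a suggestion; intersection_over_self is the existing bounding-box call quoted above):

# Fraction of the picture's own area that must fall inside the table's bbox
# for the picture to be treated as belonging to the table.
PICTURE_IN_TABLE_IOS_THRESHOLD = 0.5

# ... inside adjust_pictures_in_tables:
ios = pic_bbox.intersection_over_self(table_bbox)
if ios > PICTURE_IN_TABLE_IOS_THRESHOLD:
    ...  # existing handling for pictures inside tables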

@inoray inoray requested a review from JaeseungYang January 14, 2026 07:01
@JaeseungYang JaeseungYang merged commit 96be577 into develop Jan 14, 2026
2 checks passed