Conversation

@rad-pat
Contributor

@rad-pat rad-pat commented Dec 9, 2025

Why I'm doing:

When importing CSV through FILES(), column types are inferred from a sample rather than from the whole file, so rows outside the sample can trigger type errors on insert. The existing workarounds, such as setting the conflicting data to NULL, are not ideal.

What I'm doing:

This PR lets the user disable type auto-detection so that all columns are returned as strings, which can then be converted to the desired data types with SQL functions.
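To make the failure mode concrete, here is a minimal Java sketch of prefix-sampled type inference. It is not StarRocks code: the class `SampledInference`, its methods, and the two type names it returns are hypothetical and exist only for illustration. It shows how a column can look numeric in the sampled rows while a later row does not fit, and how reporting every column as a string sidesteps the conflict.

```java
import java.util.List;

// Hypothetical illustration only; this is not StarRocks code.
final class SampledInference {

    // Infers a column type from the first sampleRows values only.
    // When autoDetectTypes is false, the column is always reported as STRING.
    static String inferType(List<String> column, int sampleRows, boolean autoDetectTypes) {
        if (!autoDetectTypes) {
            return "STRING";
        }
        int limit = Math.min(sampleRows, column.size());
        for (int i = 0; i < limit; i++) {
            if (!isInteger(column.get(i))) {
                return "STRING";
            }
        }
        return "BIGINT";
    }

    // Returns true when the value parses as a 64-bit integer.
    static boolean isInteger(String value) {
        try {
            Long.parseLong(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Counts the values in the whole column that do not fit the inferred type.
    static long countConflicts(List<String> column, String inferredType) {
        if (!"BIGINT".equals(inferredType)) {
            return 0; // every value fits a string column
        }
        return column.stream().filter(v -> !isInteger(v)).count();
    }

    public static void main(String[] args) {
        // The last row lies outside a 3-row sample and is not numeric.
        List<String> column = List.of("1", "2", "3", "N/A");
        String sampled = inferType(column, 3, true);
        String disabled = inferType(column, 3, false);
        System.out.println(sampled + " conflicts: " + countConflicts(column, sampled));
        System.out.println(disabled + " conflicts: " + countConflicts(column, disabled));
    }
}
```

With detection disabled, the conversion to a numeric type becomes an explicit step in the query, where the user can decide how to treat values such as `N/A`.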

Fixes #66473

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Introduces auto_detect_types to FILES() CSV schema detection, allowing all columns to default to STRING when disabled, with FE/BE support, tests, and docs.

  • FILES() schema detection (CSV):
    • Add auto_detect_types property (default true) to control type inference; when false, all sampled columns are STRING.
  • Backend:
    • CSVScanner: get_type_desc now respects _scan_range.params.schema_sample_types; schema sampling returns VARCHAR when disabled.
  • Frontend:
    • TableFunctionTable: parse auto_detect_types, validate boolean, and pass via TBrokerScanRangeParams.schema_sample_types.
  • Thrift:
    • Add TBrokerScanRangeParams.schema_sample_types (default true).
  • Tests:
    • BE: add CSV schema detection cases with/without type inference; add type_sniff.csv.
    • FE: unit tests for auto_detect_types parsing and validation.
  • Docs:
    • Update EN/JA/ZH files.md to document auto_detect_types and behavior when disabled.
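The flow summarized in the list above (a user property parsed in the frontend, carried in the scan parameters, and consulted by the scanner) can be sketched roughly as follows. This is a simplified Java model under stated assumptions, not the actual implementation: `SchemaSampleFlow`, `ScanParams`, `buildParams`, and `typeDesc` are hypothetical names, the real backend is C++, and the integer check stands in for whatever inference the scanner really performs.

```java
import java.util.Map;

// Hypothetical stand-ins for the FE / Thrift / BE pieces named above; not StarRocks code.
final class SchemaSampleFlow {

    // Models the idea of TBrokerScanRangeParams.schema_sample_types (default true).
    static final class ScanParams {
        boolean schemaSampleTypes = true;
    }

    // "Frontend" step: copy the user property into the scan parameters.
    // Validation of the raw value is deliberately left out of this sketch.
    static ScanParams buildParams(Map<String, String> properties) {
        ScanParams params = new ScanParams();
        String raw = properties.get("auto_detect_types");
        if (raw != null) {
            params.schemaSampleTypes = raw.equalsIgnoreCase("true");
        }
        return params;
    }

    // "Backend" step: report VARCHAR for every sampled value when inference is off.
    static String typeDesc(String sampledValue, ScanParams params) {
        if (!params.schemaSampleTypes) {
            return "VARCHAR";
        }
        // Crude placeholder for real type inference: digits only means BIGINT.
        return sampledValue.matches("-?\\d+") ? "BIGINT" : "VARCHAR";
    }

    public static void main(String[] args) {
        ScanParams defaults = buildParams(Map.of());
        ScanParams disabled = buildParams(Map.of("auto_detect_types", "false"));
        System.out.println(typeDesc("42", defaults));
        System.out.println(typeDesc("42", disabled));
    }
}
```

Defaulting the flag to true keeps existing FILES() queries unchanged when the property is absent.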

Written by Cursor Bugbot for commit 7d6ebcb. This will update automatically on new commits.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 9, 2025
@wanpengfei-git wanpengfei-git requested a review from a team December 9, 2025 09:52
@mergify mergify bot assigned rad-pat Dec 9, 2025
@mergify
Contributor

mergify bot commented Dec 9, 2025

🧪 CI Insights

Here's what we observed from your CI run for 7d6ebcb.

🟢 All jobs passed!

But CI Insights is watching 👀

@rad-pat rad-pat marked this pull request as ready for review December 9, 2025 10:21
@rad-pat rad-pat requested review from a team as code owners December 9, 2025 10:21
@alvin-celerdata
Contributor

@cursor review

@rad-pat rad-pat force-pushed the csv-type-detect-disable branch 4 times, most recently from dcb05a8 to 89c9fc9 December 10, 2025 10:58
@github-actions

[FE Incremental Coverage Report]

fail : 3 / 4 (75.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/catalog/TableFunctionTable.java | 3 | 4 | 75.00% | [511] |

@rad-pat rad-pat force-pushed the csv-type-detect-disable branch 3 times, most recently from 8585f60 to 6bbc22a December 10, 2025 17:01
@alvin-celerdata
Copy link
Contributor

@cursor review

ExceptionChecker.expectThrowsWithMsg(SemanticException.class,
        "Illegal value of auto_detect_types: notaboolean, only true/false allowed",
        () -> new TableFunctionTable(new ArrayList<>(), properties, new SessionVariable())
);

Bug: Test expects wrong exception type for validation error

The test expects SemanticException but the implementation throws DdlException when auto_detect_types has an invalid value. The similar validation for csv.trim_space in the same class also throws DdlException, confirming the implementation follows the existing pattern. The test will fail at runtime because the wrong exception type is caught.

Additional Locations (1)
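For reference, the validation pattern this comment describes can be sketched as below. This is only an illustration under assumptions: `AutoDetectTypesValidation` and `parseAutoDetectTypes` are hypothetical names, the nested `DdlException` is a local stand-in for the real StarRocks class, and accepting `true`/`false` case-insensitively is an assumption. Only the error message format is taken from the test quoted above.

```java
import java.util.Map;

// Hypothetical illustration of strict boolean property validation; not StarRocks code.
final class AutoDetectTypesValidation {

    // Local stand-in for the checked exception the review says the implementation throws.
    static final class DdlException extends Exception {
        DdlException(String message) {
            super(message);
        }
    }

    // Returns the flag value, defaulting to true when the property is absent.
    // Anything other than true/false (case-insensitive here, by assumption) is rejected.
    static boolean parseAutoDetectTypes(Map<String, String> properties) throws DdlException {
        String raw = properties.get("auto_detect_types");
        if (raw == null || raw.equalsIgnoreCase("true")) {
            return true;
        }
        if (raw.equalsIgnoreCase("false")) {
            return false;
        }
        throw new DdlException(
                "Illegal value of auto_detect_types: " + raw + ", only true/false allowed");
    }

    public static void main(String[] args) {
        try {
            parseAutoDetectTypes(Map.of("auto_detect_types", "notaboolean"));
        } catch (DdlException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A test for such a method has to expect the exception type the method actually throws; expecting an unrelated type fails even when the message matches.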


Assertions.assertDoesNotThrow(() -> {
    TableFunctionTable table = new TableFunctionTable(new ArrayList<>(), properties, new SessionVariable());
    Assertions.assertTrue((Boolean) field.get(table));
});

Bug: Test uses wrong constructor that doesn't parse the property

The testAutoDetectTypes test uses the unload constructor TableFunctionTable(List, Map, SessionVariable) which calls parsePropertiesForUnload, but the auto_detect_types property is only parsed in parsePropertiesForLoad. This means the property is never actually read, and autoDetectTypes stays at its default value of true. Test Case 2 (expecting false) will fail because the field remains true. Test Case 4 (expecting SemanticException for invalid value) will fail because no validation occurs and no exception is thrown.

Additional Locations (2)
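The pitfall described in this comment can be reproduced in miniature. The sketch below is a hypothetical model, not the real `TableFunctionTable`: the class `TableFunctionSketch` and the factory methods `forLoad` and `forUnload` are invented for illustration, and only the two parse-method names come from the comment. It shows why a property that is read on one construction path silently keeps its default on the other.

```java
import java.util.Map;

// Hypothetical model of a class with separate load and unload construction paths.
final class TableFunctionSketch {

    private boolean autoDetectTypes = true; // default when the property is never parsed

    private TableFunctionSketch() {
    }

    // Load path: reads auto_detect_types from the properties.
    static TableFunctionSketch forLoad(Map<String, String> properties) {
        TableFunctionSketch table = new TableFunctionSketch();
        table.parsePropertiesForLoad(properties);
        return table;
    }

    // Unload path: never looks at auto_detect_types.
    static TableFunctionSketch forUnload(Map<String, String> properties) {
        TableFunctionSketch table = new TableFunctionSketch();
        table.parsePropertiesForUnload(properties);
        return table;
    }

    private void parsePropertiesForLoad(Map<String, String> properties) {
        String raw = properties.get("auto_detect_types");
        if (raw != null) {
            // Validation is omitted here to keep the sketch short.
            autoDetectTypes = Boolean.parseBoolean(raw);
        }
    }

    private void parsePropertiesForUnload(Map<String, String> properties) {
        // Intentionally ignores auto_detect_types, so the field keeps its default.
    }

    boolean isAutoDetectTypes() {
        return autoDetectTypes;
    }

    public static void main(String[] args) {
        Map<String, String> properties = Map.of("auto_detect_types", "false");
        System.out.println(forLoad(properties).isAutoDetectTypes());
        System.out.println(forUnload(properties).isAutoDetectTypes());
    }
}
```

A test that builds the object through the unload path therefore cannot observe `false` or a validation error, whatever value the property holds.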


@rad-pat rad-pat force-pushed the csv-type-detect-disable branch from 6bbc22a to 7d6ebcb December 10, 2025 20:11
@sonarqubecloud

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[BE Incremental Coverage Report]

pass : 7 / 7 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/exec/file_scanner/csv_scanner.cpp | 7 | 7 | 100.00% | [] |

@alvin-celerdata
Contributor

@cursor review

Labels

3.3, 3.4, 3.5, 4.0, documentation, PROTO-REVIEW

Development

Successfully merging this pull request may close these issues.

CSV Import - Option to prevent data type guessing and provide String for all columns

3 participants