Enable array of struct arrow parsing for chat history #213
Conversation
Walkthrough
Converts DuckDB STRUCT columns to JSON after JSON-like indexing, preserves complex Series when modifying rows, adds Struct/StructArray → JSON serialization and exposes the converter, refactors GET handlers to prefer staged-DB reads with commit-tree fallback, and updates get_slice to accept a ParsedResource.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Controller as File/DataFrame Controller
participant StagedDB
participant Repo
participant CommitTree
Client->>Controller: GET resource/path
alt workspace present
Controller->>StagedDB: query by workspace + path
alt found
StagedDB-->>Controller: return staged file entry
else not found
Controller->>Repo: resolve workspace.commit
Repo-->>Controller: commit
Controller->>CommitTree: read file at commit/path
CommitTree-->>Controller: file entry or NotFound
end
else no workspace
Controller->>Repo: resolve resource.commit
Repo-->>Controller: commit
Controller->>CommitTree: read file at commit/path
CommitTree-->>Controller: file entry or NotFound
end
Controller->>Client: stream file or return NotFound
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
…_completion_history
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
oxen-rust/src/lib/src/core/df/tabular.rs (1)
892-898: Confirm explicit JSON casting for list/struct values in DuckDB bindings.
`value_to_tosql` converts list/struct data to JSON strings via `json_array.to_string()` and binds them as VARCHAR parameters without an explicit `CAST(? AS JSON)`. While DuckDB does parse VARCHAR→JSON implicitly, this approach carries risk: DuckDB's auto-casting has known issues with surplus quotes when inserting JSON strings. For the update/insert paths in rows.rs and df_db.rs, either confirm that target columns are strongly typed as `JSON` (which would ensure safe parsing), or add an explicit `CAST(? AS JSON)` to the SQL to avoid unexpected quote-escaping behavior.
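A minimal sketch of the explicit-cast option, using a duckdb-rs style connection; the table and column names are hypothetical, not taken from the PR:

```rust
use duckdb::{params, Connection, Result};

// Hypothetical table/column names; the point is the explicit CAST(? AS JSON)
// so DuckDB parses the bound string instead of relying on implicit
// VARCHAR -> JSON casting.
fn insert_messages(conn: &Connection, id: i64, messages_json: &str) -> Result<()> {
    conn.execute(
        "INSERT INTO chat_history (id, messages) VALUES (?, CAST(? AS JSON))",
        params![id, messages_json],
    )?;
    Ok(())
}
```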
🧹 Nitpick comments (4)
oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2)
99-107: Use `col` directly instead of re-looking up `df.column(col_name)` (simpler + cheaper).
You already have the series as `col`; `df.column(col_name)?` is redundant and can be replaced by `col` (clone as needed).

```diff
-for col in df.get_columns() {
-    // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
-    let col_name = col.name();
-    let col_series = df.column(col_name)?;
+for col in df.get_columns() {
+    // Copy the entire Series to preserve complex types (lists, structs, etc.)
+    let col_name = col.name();
+    let col_series = col;
     if let Some(col_idx) = new_row.get_column_index(col_name) {
         new_row.replace_column(col_idx, col_series.clone())?;
     } else {
         new_row.with_column(col_series.clone())?;
     }
 }
```
167-176: Same "use `col` directly" refactor applies in `modify_rows`.
Mirrors the same redundant lookup pattern as `modify_row`.
oxen-rust/src/server/src/controllers/file.rs (1)
123-127: `base_repo` is always `&repo` here; can simplify.
`(staged_repo, base_repo)` currently assigns `base_repo` to `&repo` in both cases, so you can drop the tuple and just keep `staged_repo`.
oxen-rust/src/lib/src/core/df/tabular.rs (1)
824-844: Consider avoiding `DataFrame` materialization for `StructArray` conversion. The current code converts `StructArray` → `DataFrame` → manual row iteration, which creates unnecessary intermediate allocations. Since `_fields` is already available, you can iterate `struct_array.fields()` and `struct_array.values()` directly (or use `struct_array.iter()` for row-wise access), similar to the pattern already used for `AnyValue::StructOwned` just above this function. This would be more efficient and idiomatic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1 hunks)
- oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2 hunks)
- oxen-rust/src/lib/src/core/df/tabular.rs (3 hunks)
- oxen-rust/src/server/src/controllers/file.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-10T22:08:33.965Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 142
File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364
Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.
Applied to files:
oxen-rust/src/server/src/controllers/file.rs
📚 Learning: 2025-09-18T19:11:42.443Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 146
File: oxen-rust/src/benches/oxen/download.rs:176-179
Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.
Applied to files:
oxen-rust/src/server/src/controllers/file.rs
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/file.rs (6)
oxen-rust/src/lib/src/error.rs (2)
  - basic_str (158-160)
  - path_does_not_exist (237-239)
oxen-rust/src/lib/src/repositories/commits/commit_writer.rs (1)
  - commit (76-79)
oxen-rust/src/server/src/controllers/workspaces.rs (1)
  - commit (489-540)
oxen-rust/src/lib/src/core/v_latest/commits.rs (1)
  - commit (22-24)
oxen-rust/src/lib/src/core/v_latest/workspaces/commit.rs (1)
  - commit (27-104)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  - get_file_by_path (247-259)
🪛 GitHub Actions: 🐂 Continuous integration
oxen-rust/src/lib/src/core/df/tabular.rs
[error] 817-817: returning the result of a let binding from a block
🔇 Additional comments (3)
oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1)
673-688: No action required; the code pattern is correct for DuckDB 1.1.1. DuckDB supports implicit casting from STRUCT to JSON, so `ALTER COLUMN ... TYPE JSON` without a `USING` clause will work for STRUCT columns and nested types (including `STRUCT[]`). The implementation is sound.
oxen-rust/src/lib/src/core/df/tabular.rs (1)
775-793: The List(Struct) JSON conversion logic is correct for Polars 0.49. The code properly:
- Calls `l.struct_()` on the inner Series (which has dtype Struct)
- Handles null fields via `unwrap_or(AnyValue::Null)` without panicking
- Supports nested structures through recursive `any_val_to_json` calls on field values
oxen-rust/src/server/src/controllers/file.rs (1)
129-147: Staged-first fallback looks good; recommend normalizing paths in staged DB reads for consistency. Files are normalized via `util::fs::path_relative_to_dir` when staged (lines 50, 182, 209, 224 of `workspaces/files.rs`), but `read_from_staged_db` at line 148 uses the path directly from `parse_resource`. While this is safe in practice (the staged DB only contains normalized keys, so un-normalized queries simply won't match), consider normalizing the path before the staged DB lookup to maintain consistency with the staging flow and improve robustness if code paths change in the future.
⛔ Skipped due to learnings
Learnt from: Lilianxr, Repo: Oxen-AI/Oxen, PR: 142, File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364, Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.
Learnt from: Lilianxr, Repo: Oxen-AI/Oxen, PR: 146, File: oxen-rust/src/benches/oxen/download.rs:176-179, Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.
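A rough sketch of the suggested normalization step; `normalize_lookup_path` is a simplified stand-in for the `util::fs::path_relative_to_dir` helper named above, shown only to illustrate the intent:

```rust
use std::path::{Path, PathBuf};

// Stand-in for util::fs::path_relative_to_dir: strip the workspace prefix so
// the staged-DB lookup uses the same normalized key the staging flow writes.
fn normalize_lookup_path(workspace_dir: &Path, path: &Path) -> PathBuf {
    path.strip_prefix(workspace_dir)
        .map(Path::to_path_buf)
        .unwrap_or_else(|_| path.to_path_buf())
}

fn main() {
    let key = normalize_lookup_path(
        Path::new("/repo/workspace"),
        Path::new("/repo/workspace/data/train.parquet"),
    );
    assert_eq!(key, PathBuf::from("data/train.parquet"));
}
```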
```diff
 for col in df.get_columns() {
-    // Replace that column in the existing df if it exists
+    // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
     let col_name = col.name();
-    let new_val = df.column(col_name)?.get(0)?;
-    new_row.with_column(Series::new(PlSmallStr::from_str(col_name), vec![new_val]))?;
+    let col_series = df.column(col_name)?;
+    if let Some(col_idx) = new_row.get_column_index(col_name) {
+        new_row.replace_column(col_idx, col_series.clone())?;
+    } else {
+        new_row.with_column(col_series.clone())?;
+    }
 }
```
Guard “row not found” before merging columns (avoid odd errors later).
If df_db::select returns 0 rows, the subsequent merge + get_hash_and_status_for_modification will likely return a generic error instead of a clean resource_not_found(uuid). Consider an early check right after maybe_db_data is loaded.
🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/db/data_frames/rows.rs around lines 98 to 108, add
an explicit guard immediately after loading maybe_db_data (the result of
df_db::select) to check if it contains zero rows and, if so, return a
resource_not_found(uuid) error instead of proceeding; place the check before the
loop that merges columns so you bail out early with the appropriate error,
avoiding later generic failures from merge/get_hash_and_status_for_modification.
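A self-contained sketch of the guard this prompt describes; the error type here is a plain `String` stand-in for `OxenError::resource_not_found`:

```rust
use polars::prelude::DataFrame;

// Bail out before merging columns when the select returned no rows, so the
// caller sees "resource not found" rather than a generic downstream error.
fn guard_row_exists(maybe_db_data: &DataFrame, uuid: &str) -> Result<(), String> {
    if maybe_db_data.height() == 0 {
        return Err(format!("Resource not found: {uuid}"));
    }
    Ok(())
}
```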
```rust
        AnyValue::Struct(_len, struct_array, fields) => {
            let json_value = struct_array_to_json(struct_array, fields);
            json_value
        }

        other => panic!("Unsupported dtype in JSON conversion: {other:?}"),
    }
}
```
```rust
fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    _fields: &[polars::prelude::Field],
) -> Value {
    // Convert StructArray -> DataFrame, then to JSON
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(_) => Value::String(format!("{struct_array:?}")),
    }
}
```
🧩 Analysis chain
🌐 Web query:
Polars Rust AnyValue::Struct vs StructOwned single struct JSON representation
💡 Result:
Short answer — difference and JSON behaviour
-
AnyValue::Struct (enum variant) is the runtime AnyValue representation for a struct value often backed by an Arrow/Polars StructArray (it carries references into the array: index + child arrays + field names). AnyValue::Struct is a view/borrowed-style variant used when iterating rows or working with values from existing arrays. [1][2]
-
AnyValue::StructOwned (sometimes named StructOwned / Struct payload in Polars internals) is the owned form that contains an owned Vec/Box of AnyValue values (the fields) so the struct value can be stored independently of an array backing. It is used when you need an owned scalar (e.g. building buffers, creating rows, or serializing). The AV-buffer implementation shows the code paths that expect AnyValue::StructOwned when adding owned physical values. [3]
-
JSON representation: when you write a DataFrame containing Struct columns with JsonWriter (Polars uses arrow2::io::json under the hood), each struct row is emitted as a JSON object (field-name: value) — i.e. the struct is serialized as a single JSON object per row (or per JSON-lines record), not as a positional array. If you convert struct fields to strings first you can control how non-primitive children are rendered (arrow2/Polars formats non-primitives using Display by default). [4][1]
References
- Polars user guide — Structs (explanation of struct dtype and how struct rows look/are unnestable). [Polars docs - Structs]
- AnyValue enum documentation (Polars Rust API) — AnyValue struct variant details. [docs.rs AnyValue]
- Polars source glimpses (av_buffer / builder code) showing AnyValue::Struct vs AnyValue::StructOwned handling when building/adding values. [polars av_buffer.rs]
- JsonWriter docs — Polars JSON writer uses arrow2::io::json and serializes struct columns as JSON objects; notes about formatting non-primitives. [JsonWriter docs]
If you want, I can paste the exact AnyValue variant signatures from the crate source and show a tiny Rust example producing the JSON output.
🏁 Script executed:
# Search for StructOwned handling in the same file
rg "StructOwned" oxen-rust/src/lib/src/core/df/tabular.rs -B 2 -A 5Repository: Oxen-AI/Oxen
Length of output: 804
🏁 Script executed:
# Look at the broader context of the any_val_to_json function
cat -n oxen-rust/src/lib/src/core/df/tabular.rs | sed -n '780,850p'
Repository: Oxen-AI/Oxen
Length of output: 3077
🏁 Script executed:
# Search for usages of AnyValue::Struct to understand when it's created
rg "AnyValue::Struct\(" oxen-rust/src/lib/src/core/df/tabular.rs -B 3 -A 3Repository: Oxen-AI/Oxen
Length of output: 221
🏁 Script executed:
# Look for tests related to struct JSON conversion
rg "struct_array_to_json|StructOwned.*json" oxen-rust/src/lib/src/core/df/ -lRepository: Oxen-AI/Oxen
Length of output: 96
🏁 Script executed:
# Check if there are any comments or context about why struct_array is wrapped in an array
git log -p --all -S "struct_array_to_json" -- oxen-rust/src/lib/src/core/df/tabular.rs | head -100
Repository: Oxen-AI/Oxen
Length of output: 38
Struct JSON serialization inconsistency: returns array instead of object
AnyValue::StructOwned correctly returns a Value::Object (line 798–813), but AnyValue::Struct delegates to struct_array_to_json which wraps the result in Value::Array (line 839). Per Polars JSON convention, struct values should serialize as JSON objects, not arrays. This breaks the expected shape if downstream code expects an object for a single struct cell.
The fix is straightforward: in struct_array_to_json, if struct_array.len() == 1, return the single object directly; otherwise, return the array only if there are multiple rows.
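A small sketch of that shape fix, assuming the per-row objects have already been collected into a `Vec<Value>` as in `struct_array_to_json`:

```rust
use serde_json::Value;

// Collapse a single-row result to a bare object; keep the array form only
// when the StructArray actually has multiple rows.
fn finalize_struct_json(mut rows: Vec<Value>) -> Value {
    if rows.len() == 1 {
        rows.remove(0)
    } else {
        Value::Array(rows)
    }
}
```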
🧰 Tools
🪛 GitHub Actions: 🐂 Continuous integration
[error] 817-817: returning the result of a let binding from a block
🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/df/tabular.rs around lines 815 to 844,
struct_array_to_json currently always returns Value::Array even for a single-row
StructArray which breaks Polars convention; update the function so that after
building the Vec<Value::Object> rows it returns the single Object directly when
df.height() == 1 (i.e., Value::Object(rows[0].clone() or move the map out), and
only returns Value::Array(rows) when there are multiple rows; keep the existing
Err fallback unchanged.
Actionable comments posted: 2
🧹 Nitpick comments (1)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)
44-46: Consider conditional normalization to avoid unnecessary cloning. The data is cloned unconditionally before normalization, even when the structure doesn't contain messages or array content. For non-message data, this adds unnecessary allocation overhead.
Consider checking if normalization is needed before cloning:
```diff
-    let mut normalized_data = data.clone();
-    maybe_normalize_message_content(&mut normalized_data);
-    let df = tabular::parse_json_to_df(&normalized_data)?;
+    let df = if data.get("messages").is_some() {
+        let mut normalized_data = data.clone();
+        maybe_normalize_message_content(&mut normalized_data);
+        tabular::parse_json_to_df(&normalized_data)?
+    } else {
+        tabular::parse_json_to_df(data)?
+    };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- oxen-rust/src/lib/src/core/df/tabular.rs (5 hunks)
- oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (6)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)
7-7: LGTM: Import for JSON macro. The `json!` macro import is needed for the normalization logic below.
oxen-rust/src/lib/src/core/df/tabular.rs (5)
308-320: LGTM: Clean refactor to explicit reader. The extraction of the reader creation makes the code slightly more readable without changing behavior.
723-723: Public API change: `any_val_to_json` now exposed. This function is now part of the module's public API. Ensure this is intentional and that the function's behavior is stable enough for external use.
779-797: LGTM: Proper List(Struct) to JSON array conversion. The logic correctly iterates through struct rows and builds JSON objects with recursive value conversion.
819-819: Verified: Clippy issue addressed. The let-then-return pattern flagged in the previous review has been resolved. The match arm now directly returns the result from `struct_array_to_json`.
932-932: LGTM: Consistent Struct handling in SQL conversion. The addition properly routes `List(Struct)` to JSON conversion, consistent with the pattern used for other complex types.
```rust
fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    fields: &[polars::prelude::Field],
) -> Value {
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(e) => {
            log::warn!("Failed to convert StructArray to DataFrame: {e:?}");
            let mut map = serde_json::Map::new();
            for (i, field) in fields.iter().enumerate() {
                if let Some(values) = struct_array.values().get(i) {
                    let len = values.len();
                    if len > 0 {
                        let val_str = format!("{values:?}");
                        // Look for actual URL or string content
                        if let Some(start) = val_str.find("https://") {
                            // Extract URL starting from https://
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        } else if let Some(start) = val_str.find("http://") {
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        }
                    }
                }
                map.insert(field.name().to_string(), Value::Null);
            }
            if map.is_empty() {
                Value::Null
            } else {
                Value::Object(map)
            }
        }
    }
}
```
🧩 Analysis chain
🏁 Script executed:
# First, find and read the file at the specified lines
fd -t f "tabular.rs" --path "*core/df*" | head -5Repository: Oxen-AI/Oxen
Length of output: 289
🏁 Script executed:
# Find the tabular.rs file
find . -name "tabular.rs" -type f | grep -E "core/df"Repository: Oxen-AI/Oxen
Length of output: 98
🏁 Script executed:
# Read lines 825-884 from the file
sed -n '825,884p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 2663
🏁 Script executed:
# Search for tests and usages of struct_array_to_json
rg -n "struct_array_to_json" ./oxen-rust --type rustRepository: Oxen-AI/Oxen
Length of output: 274
🏁 Script executed:
# Read more context around the function call to understand the larger any_val_to_json function
sed -n '800,825p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 854
🏁 Script executed:
# Search for tests related to any_val_to_json or JSON conversion
rg -n "any_val_to_json|struct_array_to_json" ./oxen-rust --type rust -A 2 -B 2 | head -50Repository: Oxen-AI/Oxen
Length of output: 4038
🏁 Script executed:
# Search for tests related to df conversion and JSON
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f | xargs grep -l "json\|struct" | head -10
Repository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Search for test files in the core/df directory
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f
Repository: Oxen-AI/Oxen
Length of output: 81
🏁 Script executed:
# Search for tests related to struct_array, JSON, or tabular
rg -n "struct_array|any_val_to_json|tabular" ./oxen-rust/tests/test.rsRepository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Look for where any_val_to_json is called in the codebase
rg -n "any_val_to_json" ./oxen-rust/src/lib --type rust -B 3 -A 3Repository: Oxen-AI/Oxen
Length of output: 11356
🏁 Script executed:
# Read the struct test more thoroughly
sed -n '2025,2060p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 1177
🏁 Script executed:
# Search for git history or comments about array vs object issues
rg -n "array.*object|object.*array" ./oxen-rust/src/lib/src/core/df/tabular.rs | head -20Repository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Check if there are any tests for the error path of struct_array_to_json
rg -n "struct_array_to_json|DataFrame::try_from.*fail|Err.*StructArray" ./oxen-rust --type rust -A 5Repository: Oxen-AI/Oxen
Length of output: 1005
🏁 Script executed:
# Look at the broader context to understand what AnyValue::Struct represents
rg -n "AnyValue::Struct" ./oxen-rust/src/lib/src/core/df/tabular.rs -B 5 -A 5Repository: Oxen-AI/Oxen
Length of output: 1559
🏁 Script executed:
# Let me understand the context better - check how many rows are expected in a StructArray
sed -n '829,845p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 798
Brittle fallback parsing debug strings—only extracts URLs and silently loses other field values.
The error fallback (lines 842-883) uses format!("{values:?}") to parse debug format output for extracting URLs. This approach is fragile and will break if Polars changes its debug formatting.
Issues:
- Debug string parsing: Relies on unstable debug format output, not a public API
- Limited extraction logic: Only handles http/https URLs; all other field types are converted to `Value::Null`
- Silent data loss: Non-URL fields lose their actual values without recovery
Recommendations:
- Fix the root cause of `DataFrame::try_from()` failures instead of relying on a brittle fallback
- If a fallback is necessary, use proper Arrow/Polars APIs to extract field values
- Preserve field values for non-URL types rather than defaulting to null
```rust
// Instead of parsing debug strings, consider:
Err(e) => {
    log::error!("Failed to convert StructArray to DataFrame: {e:?}");
    return Value::Null; // or handle appropriately without parsing
}
```

```rust
fn maybe_normalize_message_content(value: &mut serde_json::Value) {
    let Some(messages) = value.get_mut("messages").and_then(|m| m.as_array_mut()) else {
        return;
    };

    for msg in messages {
        if !msg.is_object() {
            continue;
        }

        let Some(msg_obj) = msg.as_object_mut() else {
            continue;
        };

        if let Some(content) = msg_obj.get_mut("content") {
            // Only process if content is an array (user messages with images/objects)
            if content.is_array() {
                // Remove null image_url fields before stringifying
                if let Some(content_array) = content.as_array_mut() {
                    for item in content_array.iter_mut() {
                        if let Some(item_obj) = item.as_object_mut() {
                            if let Some(image_url_val) = item_obj.get("image_url") {
                                if image_url_val.is_null() {
                                    item_obj.remove("image_url");
                                }
                            }
                        }
                    }
                }

                // Stringify the content array
                if let Ok(content_str) = serde_json::to_string(content) {
                    *content = json!(content_str);
                }
            }
            // String content stays as-is (simple text messages)
        }
    }
}
```
Add error handling and documentation for domain-specific normalization.
This function silently ignores serialization errors and lacks documentation explaining why this chat-specific normalization is needed.
Issues:
- Line 399: `serde_json::to_string` errors are silently ignored; if serialization fails, the content remains as an array instead of being stringified, potentially causing downstream parsing issues.
- The function is tightly coupled to a specific chat message schema (messages → content → image_url) without documentation explaining the business logic or why array content must be stringified.
Consider:
- Log serialization errors or propagate them
- Add a doc comment explaining why this normalization is necessary for chat history
```diff
+/// Normalizes chat message content arrays by removing null image_url fields
+/// and stringifying array content for DuckDB compatibility.
 fn maybe_normalize_message_content(value: &mut serde_json::Value) {
     // ... existing code ...
     // Stringify the content array
-    if let Ok(content_str) = serde_json::to_string(content) {
+    match serde_json::to_string(content) {
+        Ok(content_str) => {
             *content = json!(content_str);
+        }
+        Err(e) => {
+            log::warn!("Failed to stringify message content: {}", e);
+        }
     }
```
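A standalone sketch of the same idea; `eprintln!` stands in for `log::warn!` so the snippet runs on its own:

```rust
use serde_json::{json, Value};

// Stringify the content array, surfacing serialization failures instead of
// silently leaving the array in place.
fn stringify_content(content: &mut Value) {
    match serde_json::to_string(&*content) {
        Ok(content_str) => *content = json!(content_str),
        Err(e) => eprintln!("Failed to stringify message content: {e}"),
    }
}
```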
@EloyMartinez How can I test this? Do you have a specific DF with a problem that I can use?
Actionable comments posted: 0
🧹 Nitpick comments (2)
oxen-rust/src/server/src/controllers/data_frames.rs (1)
86-87: Unnecessary clone: `resource` can be passed by reference directly. Since `get_slice` takes `&ParsedResource`, the `.clone()` is redundant.

```diff
 let data_frame_slice =
-    repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+    repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
         .await?;
```

oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
40-68: Consider clarifying the error message for the get_slice context. The error message "Only single file download is supported" on line 47 appears to be copied from a download handler. In the context of `get_slice`, a message like "Expected a file node" would be clearer.

```diff
 let file_node = match staged_node.node.node {
     EMerkleTreeNode::File(f) => Ok(f),
     _ => Err(OxenError::basic_str(
-        "Only single file download is supported",
+        "Expected a file node for data frame slice",
     )),
 }?;
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- oxen-rust/src/lib/src/core/v_latest/data_frames.rs (3 hunks)
- oxen-rust/src/lib/src/repositories/data_frames.rs (2 hunks)
- oxen-rust/src/server/src/controllers/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
oxen-rust/src/lib/src/repositories/data_frames.rs (1)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
get_slice(20-130)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (5)
oxen-rust/src/lib/src/core/staged/staged_db_manager.rs (1)
  - with_staged_db_manager (65-118)
oxen-rust/src/lib/src/model/merkle_tree/node/file_node.rs (3)
  - node (118-122)
  - metadata (46-46)
  - metadata (184-186)
oxen-rust/src/lib/src/model/merkle_tree/node.rs (1)
  - metadata (62-67)
oxen-rust/src/lib/src/error.rs (2)
  - basic_str (161-163)
  - path_does_not_exist (244-246)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  - get_file_by_path (247-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (4)
oxen-rust/src/server/src/controllers/data_frames.rs (1)
51-55: LGTM! The commit resolution logic correctly prioritizes the workspace commit when available, with appropriate fallback to resource.commit and NotFound error handling.
oxen-rust/src/lib/src/repositories/data_frames.rs (1)
12-21: LGTM! Clean signature update propagating the resource-based API change through the version dispatch layer.
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (2)
26-38: LGTM! The commit derivation and repository selection logic correctly handles both workspace and non-workspace scenarios, establishing the proper context for downstream operations.
92-92: LGTM! The `handle_sql_querying` call correctly uses the derived commit reference.
…n-AI/Oxen into feat/chat_completion_history
Actionable comments posted: 0
🧹 Nitpick comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)
116-157: Redundant clone on line 127.
`&resource.clone()` creates an unnecessary copy. Since `resource` is only borrowed here and for `resource.path` access, a simple `&resource` suffices.
🔎 Proposed fix

```diff
 let data_frame_slice =
-    repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+    repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
         .await?;
```
Overall approach looks good. The non-indexed path now correctly fetches a slice from the committed version (via `workspace: None` in the `ParsedResource`) and returns actual data instead of `None`. The count derivation logic and schema fallback are sound.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
  - get_slice (20-130)
oxen-rust/src/lib/src/model/data_frame/schema.rs (1)
  - from_polars (61-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)
13-13: LGTM! Import addition is appropriate for the new `ParsedResource` usage below.
```rust
        manager.with_conn(|conn| rows::append_row(conn, &df))
    })?;

    println!("this is the result {result:?}");
```
@EloyMartinez can you remove this?
done!!!!
```rust
    let mut df = tabular::parse_json_to_df(data)?;

    log::debug!("update() df: {df:?}");
```
@EloyMartinez and this too.
this too!!!
In order to modify a row that contains an array of structs, we need to manually handle the polars -> duckdb case.
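For illustration only, a hypothetical chat-history row of the kind this PR targets: the `messages` column is an array of structs, serialized to a JSON string so it can be bound as a single DuckDB parameter. The field names are assumptions, not taken from the PR:

```rust
use serde_json::json;

fn main() {
    // Hypothetical chat-history row: `messages` is an array of structs.
    let messages = json!([
        { "role": "user", "content": "What is in this image?" },
        { "role": "assistant", "content": "A table with nested struct columns." }
    ]);

    // Serialize the nested value to a JSON string before binding it to DuckDB.
    let bound_value = serde_json::to_string(&messages).expect("serialize messages");
    println!("{bound_value}");
}
```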
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Chores