
Conversation

@EloyMartinez (Collaborator) commented Dec 12, 2025

In order to modify a row that contains an array of structs, we need to manually handle the Polars -> DuckDB conversion.

Summary by CodeRabbit

  • Bug Fixes

    • STRUCT/list fields from JSON-like uploads are now stored as JSON to avoid binding issues.
    • Preserve complex nested types during row updates so lists/structs remain intact.
    • Improved JSON serialization for struct/list values to ensure nested records convert correctly.
  • Refactor

    • Prefer staged data when resolving workspace files, with commit-tree fallback.
  • New Features

    • Non-indexed dataframe requests now return dataframe views and include an is_indexed flag.
  • Chores

    • Added minor runtime diagnostics and extra update logging.


coderabbitai bot commented Dec 12, 2025

Walkthrough

Converts DuckDB STRUCT columns to JSON after JSON-like indexing, preserves complex Series when modifying rows, adds Struct/StructArray → JSON serialization and exposes the converter, refactors GET handlers to prefer staged-DB reads with commit-tree fallback, and updates get_slice to accept a ParsedResource.

Changes

Cohort / File(s) Summary
DataFrame indexing
oxen-rust/src/lib/src/core/db/data_frames/df_db.rs
After creating tables for jsonl/json/ndjson (including index_file_with_id), discover columns whose data_type begins with STRUCT and run ALTER TABLE ... ALTER COLUMN ... TYPE JSON per column so struct columns are stored as JSON (a sketch of this pass follows the table).
Row modification / preservation
oxen-rust/src/lib/src/core/db/data_frames/rows.rs
modify_row / modify_rows now copy incoming column Series into the output row using replace_column or with_column instead of constructing single-value Series, preserving lists, structs, and other complex types.
Tabular / JSON serialization
oxen-rust/src/lib/src/core/df/tabular.rs
any_val_to_json made pub and extended to handle AnyValue::Struct / StructArray and List(Struct) via struct_array_to_json; struct-containing lists route through JSON conversion for SQL insertion.
get_slice API + staged DB logic
oxen-rust/src/lib/src/core/v_latest/data_frames.rs, oxen-rust/src/lib/src/repositories/data_frames.rs, oxen-rust/src/server/src/controllers/data_frames.rs
get_slice signature changed from &Commit to &ParsedResource; callers updated. Logic derives commit/workspace from the resource and supports reading from the staged DB first with fallback to the commit-tree.
File GET handler refactor
oxen-rust/src/server/src/controllers/file.rs
GET flow now extracts workspace/path, attempts staged-DB lookup first when workspace present, and falls back to commit-tree reads using workspace commit or base repo; streaming/response handling preserved.
Workspace DF parsing & diagnostics
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs
parse_json_to_df now receives &data (by ref), and small debug/diagnostic prints are added/removed in the add/update paths.
Workspace controller: non-indexed DF views
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs, liboxen/view/json_data_frame_view.rs
Non-indexed branch now builds DF views from a slice and sets data_frame: Some(...); response struct gained new public field is_indexed: bool.
Minor logging
oxen-rust/src/server/src/controllers/workspaces/data_frames/rows.rs
Adds debug logging to include JSON data in update(...) paths.
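
To make the df_db.rs change concrete, here is a minimal sketch of the STRUCT-to-JSON pass using the duckdb-rs crate. The function name, table-name parameter, and information_schema query are illustrative assumptions, not the actual implementation:

    use duckdb::{Connection, Result};

    // Hedged sketch: find columns whose data_type begins with STRUCT and alter
    // each one to JSON. Names and the exact query are assumptions.
    fn convert_struct_columns_to_json(conn: &Connection, table: &str) -> Result<()> {
        let mut stmt = conn.prepare(
            "SELECT column_name FROM information_schema.columns \
             WHERE table_name = ? AND data_type LIKE 'STRUCT%'",
        )?;
        let cols: Vec<String> = stmt
            .query_map([table], |row| row.get::<_, String>(0))?
            .collect::<Result<Vec<_>>>()?;
        for col in cols {
            // Quote identifiers; the review checklist below flags identifier
            // quoting for these ALTERs as a risk area.
            let sql = format!(r#"ALTER TABLE "{table}" ALTER COLUMN "{col}" TYPE JSON"#);
            conn.execute(&sql, [])?;
        }
        Ok(())
    }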

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Controller as File/DataFrame Controller
    participant StagedDB
    participant Repo
    participant CommitTree

    Client->>Controller: GET resource/path
    alt workspace present
        Controller->>StagedDB: query by workspace + path
        alt found
            StagedDB-->>Controller: return staged file entry
        else not found
            Controller->>Repo: resolve workspace.commit
            Repo-->>Controller: commit
            Controller->>CommitTree: read file at commit/path
            CommitTree-->>Controller: file entry or NotFound
        end
    else no workspace
        Controller->>Repo: resolve resource.commit
        Repo-->>Controller: commit
        Controller->>CommitTree: read file at commit/path
        CommitTree-->>Controller: file entry or NotFound
    end
    Controller->>Client: stream file or return NotFound

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to SQL generation and identifier quoting for STRUCT→JSON ALTERs (df_db.rs).
  • Validate series preservation and ordering/null handling after replace_column/with_column changes (rows.rs).
  • Test struct_array_to_json with nested/null values and lists-of-structs (tabular.rs).
  • Verify get_slice signature propagation and staged-vs-commit resolution across controllers/repositories and the new is_indexed response field.

Possibly related PRs

  • Check merkle tree when fetching files from workspace #169 — similar staged-DB-first then commit-tree fallback logic in file retrieval controllers.
  • Oxen-AI/Oxen#540 — related Struct/List → JSON serialization and indexing changes in tabular.rs.
  • Oxen-AI/Oxen#485 — related modifications to df_db.rs indexing logic and ALTER/struct handling.

Suggested reviewers

  • jcelliott
  • Eric-Laurence

Poem

🐇 I hop through rows and tuck structs into JSON light,
I copy whole Series so nested types sleep tight.
I peek at staged patches, then follow back to the tree,
I alter columns gently so DuckDB agrees with me.
A rabbit's tiny patch — tidy, nimble, and free.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed. The PR title 'Enable array of struct arrow parsing for chat history' accurately describes the main change: adding support for parsing Arrow data containing arrays of structs to enable row modifications in chat history. The title is specific, concise, and directly related to the primary technical objective.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c05d72b and 67150f0.

📒 Files selected for processing (1)
  • oxen-rust/src/server/src/controllers/file.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • oxen-rust/src/server/src/controllers/file.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff


coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
oxen-rust/src/lib/src/core/df/tabular.rs (1)

892-898: Confirm explicit JSON casting for list/struct values in DuckDB bindings.
value_to_tosql converts list/struct data to JSON strings via json_array.to_string() and binds them as VARCHAR parameters without explicit CAST(? AS JSON). While DuckDB does parse VARCHAR→JSON implicitly, this approach carries risk: DuckDB's auto-casting has known issues with surplus quotes when inserting JSON strings. For the update/insert paths in rows.rs and df_db.rs, either confirm that target columns are strongly typed as JSON (which would ensure safe parsing), or add explicit CAST(? AS JSON) to the SQL to avoid unexpected quote-escaping behavior.
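
For illustration, a hedged sketch of the explicit cast being suggested; the table and column names here are hypothetical, not taken from the PR:

    use duckdb::{params, Connection, Result};

    // Hedged sketch: cast the bound VARCHAR parameter to JSON explicitly so
    // DuckDB parses it, rather than relying on implicit VARCHAR -> JSON casting.
    fn update_json_column(conn: &Connection, json_str: &str, row_id: &str) -> Result<()> {
        let sql = r#"UPDATE "df" SET "messages" = CAST(? AS JSON) WHERE "id" = ?"#;
        conn.execute(sql, params![json_str, row_id])?;
        Ok(())
    }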

🧹 Nitpick comments (4)
oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2)

99-107: Use col directly instead of re-looking up df.column(col_name) (simpler + cheaper).
You already have the series as col; df.column(col_name)? is redundant and can be replaced by col (clone as needed).

-    for col in df.get_columns() {
-        // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
-        let col_name = col.name();
-        let col_series = df.column(col_name)?;
+    for col in df.get_columns() {
+        // Copy the entire Series to preserve complex types (lists, structs, etc.)
+        let col_name = col.name();
+        let col_series = col;
         if let Some(col_idx) = new_row.get_column_index(col_name) {
             new_row.replace_column(col_idx, col_series.clone())?;
         } else {
             new_row.with_column(col_series.clone())?;
         }
     }

167-176: Same “use col directly” refactor applies in modify_rows.
Mirrors the same redundant lookup pattern as modify_row.

oxen-rust/src/server/src/controllers/file.rs (1)

123-127: base_repo is always &repo here; can simplify.
(staged_repo, base_repo) currently assigns base_repo to &repo in both cases, so you can drop the tuple and just keep staged_repo.

oxen-rust/src/lib/src/core/df/tabular.rs (1)

824-844: Consider avoiding DataFrame materialization for StructArray conversion. The current code converts StructArray → DataFrame → manual row iteration, which creates unnecessary intermediate allocations. Since _fields is already available, you can iterate struct_array.fields() and struct_array.values() directly (or use struct_array.iter() for row-wise access), similar to the pattern already used for AnyValue::StructOwned just above this function. This would be more efficient and idiomatic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c3e54a and 428053a.

📒 Files selected for processing (4)
  • oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1 hunks)
  • oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2 hunks)
  • oxen-rust/src/lib/src/core/df/tabular.rs (3 hunks)
  • oxen-rust/src/server/src/controllers/file.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-10T22:08:33.965Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 142
File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364
Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.

Applied to files:

  • oxen-rust/src/server/src/controllers/file.rs
📚 Learning: 2025-09-18T19:11:42.443Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 146
File: oxen-rust/src/benches/oxen/download.rs:176-179
Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.

Applied to files:

  • oxen-rust/src/server/src/controllers/file.rs
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/file.rs (6)
oxen-rust/src/lib/src/error.rs (2)
  • basic_str (158-160)
  • path_does_not_exist (237-239)
oxen-rust/src/lib/src/repositories/commits/commit_writer.rs (1)
  • commit (76-79)
oxen-rust/src/server/src/controllers/workspaces.rs (1)
  • commit (489-540)
oxen-rust/src/lib/src/core/v_latest/commits.rs (1)
  • commit (22-24)
oxen-rust/src/lib/src/core/v_latest/workspaces/commit.rs (1)
  • commit (27-104)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  • get_file_by_path (247-259)
🪛 GitHub Actions: 🐂 Continuous integration
oxen-rust/src/lib/src/core/df/tabular.rs

[error] 817-817: returning the result of a let binding from a block

🔇 Additional comments (3)
oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1)

673-688: No action required; the code pattern is correct for DuckDB 1.1.1.

DuckDB supports implicit casting from STRUCT to JSON, so ALTER COLUMN ... TYPE JSON without a USING clause will work for STRUCT columns and nested types (including STRUCT[]). The implementation is sound.

oxen-rust/src/lib/src/core/df/tabular.rs (1)

775-793: The List(Struct) JSON conversion logic is correct for Polars 0.49. The code properly:

  • Calls l.struct_() on the inner Series (which has dtype Struct)
  • Handles null fields via unwrap_or(AnyValue::Null) without panicking
  • Supports nested structures through recursive any_val_to_json calls on field values
oxen-rust/src/server/src/controllers/file.rs (1)

129-147: Staged-first fallback looks good; recommend normalizing paths in staged DB reads for consistency.

Files are normalized via util::fs::path_relative_to_dir when staged (lines 50, 182, 209, 224 of workspaces/files.rs), but read_from_staged_db at line 148 uses the path directly from parse_resource. While this is safe in practice—since the staged DB only contains normalized keys, un-normalized queries simply won't match—consider normalizing the path before the staged DB lookup to maintain consistency with the staging flow and improve robustness if code paths change in the future.
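
A minimal sketch of the idea, using a std-only stand-in; the real code would call util::fs::path_relative_to_dir, whose exact signature is not shown in this review:

    use std::path::{Path, PathBuf};

    // Hedged sketch: strip the workspace root prefix so the staged DB is
    // queried with the same normalized key form it was written with.
    // This stand-in uses std only; the actual helper is path_relative_to_dir.
    fn normalize_for_staged_lookup(path: &Path, workspace_root: &Path) -> PathBuf {
        path.strip_prefix(workspace_root)
            .map(Path::to_path_buf)
            .unwrap_or_else(|_| path.to_path_buf())
    }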

⛔ Skipped due to learnings
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 142
File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364
Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 146
File: oxen-rust/src/benches/oxen/download.rs:176-179
Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.

Comment on lines +98 to 108

 for col in df.get_columns() {
-    // Replace that column in the existing df if it exists
+    // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
     let col_name = col.name();
-    let new_val = df.column(col_name)?.get(0)?;
-    new_row.with_column(Series::new(PlSmallStr::from_str(col_name), vec![new_val]))?;
+    let col_series = df.column(col_name)?;
+    if let Some(col_idx) = new_row.get_column_index(col_name) {
+        new_row.replace_column(col_idx, col_series.clone())?;
+    } else {
+        new_row.with_column(col_series.clone())?;
+    }
 }

⚠️ Potential issue | 🟠 Major

Guard “row not found” before merging columns (avoid odd errors later).
If df_db::select returns 0 rows, the subsequent merge + get_hash_and_status_for_modification will likely return a generic error instead of a clean resource_not_found(uuid). Consider an early check right after maybe_db_data is loaded.

🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/db/data_frames/rows.rs around lines 98 to 108, add
an explicit guard immediately after loading maybe_db_data (the result of
df_db::select) to check if it contains zero rows and, if so, return a
resource_not_found(uuid) error instead of proceeding; place the check before the
loop that merges columns so you bail out early with the appropriate error,
avoiding later generic failures from merge/get_hash_and_status_for_modification.
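
A hedged sketch of that early guard. The review text names df_db::select and resource_not_found(uuid); the String error below is a stand-in for the crate's OxenError, whose exact constructor signature is an assumption:

    use polars::prelude::DataFrame;

    // Hedged sketch: bail out with a clean "not found" error before merging
    // columns, instead of failing later with a generic error.
    fn ensure_row_found(db_data: &DataFrame, uuid: &str) -> Result<(), String> {
        if db_data.height() == 0 {
            // In the real code this would be OxenError::resource_not_found(uuid).
            return Err(format!("resource not found: {uuid}"));
        }
        Ok(())
    }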

Comment on lines 815 to 844
        AnyValue::Struct(_len, struct_array, fields) => {
            let json_value = struct_array_to_json(struct_array, fields);
            json_value
        }

        other => panic!("Unsupported dtype in JSON conversion: {other:?}"),
    }
}

fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    _fields: &[polars::prelude::Field],
) -> Value {
    // Convert StructArray -> DataFrame, then to JSON
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(_) => Value::String(format!("{struct_array:?}")),
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Polars Rust AnyValue::Struct vs StructOwned single struct JSON representation

💡 Result:

Short answer — difference and JSON behaviour

  • AnyValue::Struct (enum variant) is the runtime AnyValue representation for a struct value often backed by an Arrow/Polars StructArray (it carries references into the array: index + child arrays + field names). AnyValue::Struct is a view/borrowed-style variant used when iterating rows or working with values from existing arrays. [1][2]

  • AnyValue::StructOwned (sometimes named StructOwned / Struct payload in Polars internals) is the owned form that contains an owned Vec/Box of AnyValue values (the fields) so the struct value can be stored independently of an array backing. It is used when you need an owned scalar (e.g. building buffers, creating rows, or serializing). The AV-buffer implementation shows the code paths that expect AnyValue::StructOwned when adding owned physical values. [3]

  • JSON representation: when you write a DataFrame containing Struct columns with JsonWriter (Polars uses arrow2::io::json under the hood), each struct row is emitted as a JSON object (field-name: value) — i.e. the struct is serialized as a single JSON object per row (or per JSON-lines record), not as a positional array. If you convert struct fields to strings first you can control how non-primitive children are rendered (arrow2/Polars formats non-primitives using Display by default). [4][1]

References

  1. Polars user guide — Structs (explanation of struct dtype and how struct rows look/are unnestable). [Polars docs - Structs]
  2. AnyValue enum documentation (Polars Rust API) — AnyValue struct variant details. [docs.rs AnyValue]
  3. Polars source glimpses (av_buffer / builder code) showing AnyValue::Struct vs AnyValue::StructOwned handling when building/adding values. [polars av_buffer.rs]
  4. JsonWriter docs — Polars JSON writer uses arrow2::io::json and serializes struct columns as JSON objects; notes about formatting non-primitives. [JsonWriter docs]

If you want, I can paste the exact AnyValue variant signatures from the crate source and show a tiny Rust example producing the JSON output.


🏁 Script executed:

# Search for StructOwned handling in the same file
rg "StructOwned" oxen-rust/src/lib/src/core/df/tabular.rs -B 2 -A 5

Repository: Oxen-AI/Oxen

Length of output: 804


🏁 Script executed:

# Look at the broader context of the any_val_to_json function
cat -n oxen-rust/src/lib/src/core/df/tabular.rs | sed -n '780,850p'

Repository: Oxen-AI/Oxen

Length of output: 3077


🏁 Script executed:

# Search for usages of AnyValue::Struct to understand when it's created
rg "AnyValue::Struct\(" oxen-rust/src/lib/src/core/df/tabular.rs -B 3 -A 3

Repository: Oxen-AI/Oxen

Length of output: 221


🏁 Script executed:

# Look for tests related to struct JSON conversion
rg "struct_array_to_json|StructOwned.*json" oxen-rust/src/lib/src/core/df/ -l

Repository: Oxen-AI/Oxen

Length of output: 96


🏁 Script executed:

# Check if there are any comments or context about why struct_array is wrapped in an array
git log -p --all -S "struct_array_to_json" -- oxen-rust/src/lib/src/core/df/tabular.rs | head -100

Repository: Oxen-AI/Oxen

Length of output: 38


Struct JSON serialization inconsistency: returns array instead of object

AnyValue::StructOwned correctly returns a Value::Object (line 798–813), but AnyValue::Struct delegates to struct_array_to_json which wraps the result in Value::Array (line 839). Per Polars JSON convention, struct values should serialize as JSON objects, not arrays. This breaks the expected shape if downstream code expects an object for a single struct cell.

The fix is straightforward: in struct_array_to_json, if struct_array.len() == 1, return the single object directly; otherwise, return the array only if there are multiple rows.

🧰 Tools
🪛 GitHub Actions: 🐂 Continuous integration

[error] 817-817: returning the result of a let binding from a block

🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/df/tabular.rs around lines 815 to 844,
struct_array_to_json currently always returns Value::Array even for a single-row
StructArray which breaks Polars convention; update the function so that after
building the Vec<Value::Object> rows it returns the single Object directly when
df.height() == 1 (i.e., Value::Object(rows[0].clone() or move the map out), and
only returns Value::Array(rows) when there are multiple rows; keep the existing
Err fallback unchanged.
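
A hedged sketch of the suggested shape fix, written as a free-standing helper (the name is illustrative, not from the codebase):

    use serde_json::Value;

    // Hedged sketch: return one JSON object for a single-row StructArray and
    // keep the array shape only when there are multiple rows, matching the
    // object shape produced for AnyValue::StructOwned.
    fn rows_to_struct_json(mut rows: Vec<Value>) -> Value {
        if rows.len() == 1 {
            rows.pop().unwrap() // single struct cell -> one JSON object
        } else {
            Value::Array(rows)
        }
    }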

coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)

44-46: Consider conditional normalization to avoid unnecessary cloning.

The data is cloned unconditionally before normalization, even when the structure doesn't contain messages or array content. For non-message data, this adds unnecessary allocation overhead.

Consider checking if normalization is needed before cloning:

-    let mut normalized_data = data.clone();
-    maybe_normalize_message_content(&mut normalized_data);
-    let df = tabular::parse_json_to_df(&normalized_data)?;
+    let df = if data.get("messages").is_some() {
+        let mut normalized_data = data.clone();
+        maybe_normalize_message_content(&mut normalized_data);
+        tabular::parse_json_to_df(&normalized_data)?
+    } else {
+        tabular::parse_json_to_df(data)?
+    };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94e7738 and 78bb851.

📒 Files selected for processing (2)
  • oxen-rust/src/lib/src/core/df/tabular.rs (5 hunks)
  • oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (6)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)

7-7: LGTM: Import for JSON macro.

The json! macro import is needed for the normalization logic below.

oxen-rust/src/lib/src/core/df/tabular.rs (5)

308-320: LGTM: Clean refactor to explicit reader.

The extraction of the reader creation makes the code slightly more readable without changing behavior.


723-723: Public API change: any_val_to_json now exposed.

This function is now part of the module's public API. Ensure this is intentional and that the function's behavior is stable enough for external use.


779-797: LGTM: Proper List(Struct) to JSON array conversion.

The logic correctly iterates through struct rows and builds JSON objects with recursive value conversion.


819-819: Verified: Clippy issue addressed.

The let-then-return pattern flagged in the previous review has been resolved. The match arm now directly returns the result from struct_array_to_json.


932-932: LGTM: Consistent Struct handling in SQL conversion.

The addition properly routes List(Struct) to JSON conversion, consistent with the pattern used for other complex types.

Comment on lines +825 to +884
fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    fields: &[polars::prelude::Field],
) -> Value {
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(e) => {
            log::warn!("Failed to convert StructArray to DataFrame: {e:?}");
            let mut map = serde_json::Map::new();
            for (i, field) in fields.iter().enumerate() {
                if let Some(values) = struct_array.values().get(i) {
                    let len = values.len();
                    if len > 0 {
                        let val_str = format!("{values:?}");
                        // Look for actual URL or string content
                        if let Some(start) = val_str.find("https://") {
                            // Extract URL starting from https://
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        } else if let Some(start) = val_str.find("http://") {
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        }
                    }
                }
                map.insert(field.name().to_string(), Value::Null);
            }
            if map.is_empty() {
                Value::Null
            } else {
                Value::Object(map)
            }
        }
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and read the file at the specified lines
fd -t f "tabular.rs" --path "*core/df*" | head -5

Repository: Oxen-AI/Oxen

Length of output: 289


🏁 Script executed:

# Find the tabular.rs file
find . -name "tabular.rs" -type f | grep -E "core/df"

Repository: Oxen-AI/Oxen

Length of output: 98


🏁 Script executed:

# Read lines 825-884 from the file
sed -n '825,884p' ./oxen-rust/src/lib/src/core/df/tabular.rs

Repository: Oxen-AI/Oxen

Length of output: 2663


🏁 Script executed:

# Search for tests and usages of struct_array_to_json
rg -n "struct_array_to_json" ./oxen-rust --type rust

Repository: Oxen-AI/Oxen

Length of output: 274


🏁 Script executed:

# Read more context around the function call to understand the larger any_val_to_json function
sed -n '800,825p' ./oxen-rust/src/lib/src/core/df/tabular.rs

Repository: Oxen-AI/Oxen

Length of output: 854


🏁 Script executed:

# Search for tests related to any_val_to_json or JSON conversion
rg -n "any_val_to_json|struct_array_to_json" ./oxen-rust --type rust -A 2 -B 2 | head -50

Repository: Oxen-AI/Oxen

Length of output: 4038


🏁 Script executed:

# Search for tests related to df conversion and JSON
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f | xargs grep -l "json\|struct" | head -10

Repository: Oxen-AI/Oxen

Length of output: 38


🏁 Script executed:

# Search for test files in the core/df directory
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f

Repository: Oxen-AI/Oxen

Length of output: 81


🏁 Script executed:

# Search for tests related to struct_array, JSON, or tabular
rg -n "struct_array|any_val_to_json|tabular" ./oxen-rust/tests/test.rs

Repository: Oxen-AI/Oxen

Length of output: 38


🏁 Script executed:

# Look for where any_val_to_json is called in the codebase
rg -n "any_val_to_json" ./oxen-rust/src/lib --type rust -B 3 -A 3

Repository: Oxen-AI/Oxen

Length of output: 11356


🏁 Script executed:

# Read the struct test more thoroughly
sed -n '2025,2060p' ./oxen-rust/src/lib/src/core/df/tabular.rs

Repository: Oxen-AI/Oxen

Length of output: 1177


🏁 Script executed:

# Search for git history or comments about array vs object issues
rg -n "array.*object|object.*array" ./oxen-rust/src/lib/src/core/df/tabular.rs | head -20

Repository: Oxen-AI/Oxen

Length of output: 38


🏁 Script executed:

# Check if there are any tests for the error path of struct_array_to_json
rg -n "struct_array_to_json|DataFrame::try_from.*fail|Err.*StructArray" ./oxen-rust --type rust -A 5

Repository: Oxen-AI/Oxen

Length of output: 1005


🏁 Script executed:

# Look at the broader context to understand what AnyValue::Struct represents
rg -n "AnyValue::Struct" ./oxen-rust/src/lib/src/core/df/tabular.rs -B 5 -A 5

Repository: Oxen-AI/Oxen

Length of output: 1559


🏁 Script executed:

# Let me understand the context better - check how many rows are expected in a StructArray
sed -n '829,845p' ./oxen-rust/src/lib/src/core/df/tabular.rs

Repository: Oxen-AI/Oxen

Length of output: 798


Brittle fallback parsing debug strings—only extracts URLs and silently loses other field values.

The error fallback (lines 842-883) uses format!("{values:?}") to parse debug format output for extracting URLs. This approach is fragile and will break if Polars changes its debug formatting.

Issues:

  1. Debug string parsing: Relies on unstable debug format output, not a public API
  2. Limited extraction logic: Only handles http/https URLs; all other field types are converted to Value::Null
  3. Silent data loss: Non-URL fields lose their actual values without recovery

Recommendations:

  1. Fix the root cause of DataFrame::try_from() failures instead of relying on a brittle fallback
  2. If a fallback is necessary, use proper Arrow/Polars APIs to extract field values
  3. Preserve field values for non-URL types rather than defaulting to null
// Instead of parsing debug strings, consider:
Err(e) => {
    log::error!("Failed to convert StructArray to DataFrame: {e:?}");
    return Value::Null; // or handle appropriately without parsing
}

Comment on lines 368 to 406
fn maybe_normalize_message_content(value: &mut serde_json::Value) {
    let Some(messages) = value.get_mut("messages").and_then(|m| m.as_array_mut()) else {
        return;
    };

    for msg in messages {
        if !msg.is_object() {
            continue;
        }

        let Some(msg_obj) = msg.as_object_mut() else {
            continue;
        };

        if let Some(content) = msg_obj.get_mut("content") {
            // Only process if content is an array (user messages with images/objects)
            if content.is_array() {
                // Remove null image_url fields before stringifying
                if let Some(content_array) = content.as_array_mut() {
                    for item in content_array.iter_mut() {
                        if let Some(item_obj) = item.as_object_mut() {
                            if let Some(image_url_val) = item_obj.get("image_url") {
                                if image_url_val.is_null() {
                                    item_obj.remove("image_url");
                                }
                            }
                        }
                    }
                }

                // Stringify the content array
                if let Ok(content_str) = serde_json::to_string(content) {
                    *content = json!(content_str);
                }
            }
            // String content stays as-is (simple text messages)
        }
    }
}

⚠️ Potential issue | 🟡 Minor

Add error handling and documentation for domain-specific normalization.

This function silently ignores serialization errors and lacks documentation explaining why this chat-specific normalization is needed.

Issues:

  1. Line 399: serde_json::to_string errors are silently ignored - if serialization fails, the content remains as an array instead of being stringified, potentially causing downstream parsing issues.
  2. The function is tightly coupled to a specific chat message schema (messages → content → image_url) without documentation explaining the business logic or why array content must be stringified.

Consider:

  1. Log serialization errors or propagate them
  2. Add a doc comment explaining why this normalization is necessary for chat history
+/// Normalizes chat message content arrays by removing null image_url fields
+/// and stringifying array content for DuckDB compatibility.
 fn maybe_normalize_message_content(value: &mut serde_json::Value) {
     // ... existing code ...

                 // Stringify the content array
-                if let Ok(content_str) = serde_json::to_string(content) {
-                    *content = json!(content_str);
-                }
+                match serde_json::to_string(content) {
+                    Ok(content_str) => {
+                        *content = json!(content_str);
+                    }
+                    Err(e) => {
+                        log::warn!("Failed to stringify message content: {}", e);
+                    }
+                }
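
For illustration, a hypothetical before/after of this normalization, paired with the maybe_normalize_message_content function shown above; the exact key order in the stringified content depends on serde_json's map type:

    #[test]
    fn normalizes_array_content() {
        use serde_json::json;

        // Hypothetical input: a user message whose content array carries a null image_url.
        let mut data = json!({
            "messages": [{
                "role": "user",
                "content": [{"type": "text", "text": "hi", "image_url": null}]
            }]
        });
        maybe_normalize_message_content(&mut data);
        // The null image_url is removed and the content array is stringified, e.g.:
        // {"messages":[{"role":"user","content":"[{\"text\":\"hi\",\"type\":\"text\"}]"}]}
        assert!(data["messages"][0]["content"].is_string());
    }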

@jcelliott (Collaborator) commented:

@EloyMartinez How can I test this? Do you have a specific DF with a problem that I can use?

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
oxen-rust/src/server/src/controllers/data_frames.rs (1)

86-87: Unnecessary clone: resource can be passed by reference directly.

Since get_slice takes &ParsedResource, the .clone() is redundant.

     let data_frame_slice =
-        repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+        repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
             .await?;
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)

40-68: Consider clarifying the error message for the get_slice context.

The error message "Only single file download is supported" on line 47 appears to be copied from a download handler. In the context of get_slice, a message like "Expected a file node" would be clearer.

                 let file_node = match staged_node.node.node {
                     EMerkleTreeNode::File(f) => Ok(f),
                     _ => Err(OxenError::basic_str(
-                        "Only single file download is supported",
+                        "Expected a file node for data frame slice",
                     )),
                 }?;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78bb851 and 7d81ece.

📒 Files selected for processing (3)
  • oxen-rust/src/lib/src/core/v_latest/data_frames.rs (3 hunks)
  • oxen-rust/src/lib/src/repositories/data_frames.rs (2 hunks)
  • oxen-rust/src/server/src/controllers/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
oxen-rust/src/lib/src/repositories/data_frames.rs (1)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
  • get_slice (20-130)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (5)
oxen-rust/src/lib/src/core/staged/staged_db_manager.rs (1)
  • with_staged_db_manager (65-118)
oxen-rust/src/lib/src/model/merkle_tree/node/file_node.rs (3)
  • node (118-122)
  • metadata (46-46)
  • metadata (184-186)
oxen-rust/src/lib/src/model/merkle_tree/node.rs (1)
  • metadata (62-67)
oxen-rust/src/lib/src/error.rs (2)
  • basic_str (161-163)
  • path_does_not_exist (244-246)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  • get_file_by_path (247-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (4)
oxen-rust/src/server/src/controllers/data_frames.rs (1)

51-55: LGTM!

The commit resolution logic correctly prioritizes the workspace commit when available, with appropriate fallback to resource.commit and NotFound error handling.

oxen-rust/src/lib/src/repositories/data_frames.rs (1)

12-21: LGTM!

Clean signature update propagating the resource-based API change through the version dispatch layer.

oxen-rust/src/lib/src/core/v_latest/data_frames.rs (2)

26-38: LGTM!

The commit derivation and repository selection logic correctly handles both workspace and non-workspace scenarios, establishing the proper context for downstream operations.


92-92: LGTM!

The handle_sql_querying call correctly uses the derived commit reference.

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)

116-157: Redundant clone on line 127.

&resource.clone() creates an unnecessary copy. Since resource is only borrowed here and for resource.path access, a simple &resource suffices.

🔎 Proposed fix
         let data_frame_slice =
-            repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+            repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
                 .await?;

Overall approach looks good. The non-indexed path now correctly fetches a slice from the committed version (via workspace: None in the ParsedResource) and returns actual data instead of None. The count derivation logic and schema fallback are sound.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0169c1 and 95c9839.

📒 Files selected for processing (1)
  • oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
  • get_slice (20-130)
oxen-rust/src/lib/src/model/data_frame/schema.rs (1)
  • from_polars (61-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)

13-13: LGTM!

Import addition is appropriate for the new ParsedResource usage below.

gschoeni merged commit 08a9cc0 into main Dec 19, 2025
5 checks passed
gschoeni deleted the feat/chat_completion_history branch December 19, 2025 23:43
manager.with_conn(|conn| rows::append_row(conn, &df))
})?;

println!("this is the result {result:?}");
Collaborator:

@EloyMartinez can you remove this?

Collaborator Author:

done!!!!


let mut df = tabular::parse_json_to_df(data)?;

log::debug!("update() df: {df:?}");
Collaborator:

@EloyMartinez and this too.

Collaborator Author:

this too!!!
