Enable array of struct arrow parsing for chat history #213
Conversation
Walkthrough
Converts DuckDB STRUCT columns to JSON after JSON-like indexing, preserves complex Series when modifying rows, adds Struct/StructArray → JSON serialization and exposes the converter, refactors GET handlers to prefer staged-DB reads with commit-tree fallback, and updates get_slice to accept a ParsedResource.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Controller as File/DataFrame Controller
participant StagedDB
participant Repo
participant CommitTree
Client->>Controller: GET resource/path
alt workspace present
Controller->>StagedDB: query by workspace + path
alt found
StagedDB-->>Controller: return staged file entry
else not found
Controller->>Repo: resolve workspace.commit
Repo-->>Controller: commit
Controller->>CommitTree: read file at commit/path
CommitTree-->>Controller: file entry or NotFound
end
else no workspace
Controller->>Repo: resolve resource.commit
Repo-->>Controller: commit
Controller->>CommitTree: read file at commit/path
CommitTree-->>Controller: file entry or NotFound
end
Controller->>Client: stream file or return NotFound
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
…_completion_history
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
oxen-rust/src/lib/src/core/df/tabular.rs (1)
892-898: Confirm explicit JSON casting for list/struct values in DuckDB bindings.
`value_to_tosql` converts list/struct data to JSON strings via `json_array.to_string()` and binds them as VARCHAR parameters without an explicit `CAST(? AS JSON)`. While DuckDB does parse VARCHAR→JSON implicitly, this approach carries risk: DuckDB's auto-casting has known issues with surplus quotes when inserting JSON strings. For the update/insert paths in rows.rs and df_db.rs, either confirm that target columns are strongly typed as `JSON` (which would ensure safe parsing), or add an explicit `CAST(? AS JSON)` to the SQL to avoid unexpected quote-escaping behavior.
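A minimal sketch of the explicit-cast option, using a duckdb-rs style connection; the table and column names are hypothetical, not taken from the PR:

```rust
use duckdb::{params, Connection, Result};

// Hypothetical table/column names; the point is the explicit CAST(? AS JSON)
// so DuckDB parses the bound string instead of relying on implicit
// VARCHAR -> JSON casting.
fn insert_messages(conn: &Connection, id: i64, messages_json: &str) -> Result<()> {
    conn.execute(
        "INSERT INTO chat_history (id, messages) VALUES (?, CAST(? AS JSON))",
        params![id, messages_json],
    )?;
    Ok(())
}
```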
🧹 Nitpick comments (4)
oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2)
99-107: Use `col` directly instead of re-looking up `df.column(col_name)` (simpler + cheaper).
You already have the series as `col`; `df.column(col_name)?` is redundant and can be replaced by `col` (clone as needed).

```diff
-for col in df.get_columns() {
-    // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
-    let col_name = col.name();
-    let col_series = df.column(col_name)?;
+for col in df.get_columns() {
+    // Copy the entire Series to preserve complex types (lists, structs, etc.)
+    let col_name = col.name();
+    let col_series = col;
     if let Some(col_idx) = new_row.get_column_index(col_name) {
         new_row.replace_column(col_idx, col_series.clone())?;
     } else {
         new_row.with_column(col_series.clone())?;
     }
 }
```
167-176: Same "use `col` directly" refactor applies in `modify_rows`.
Mirrors the same redundant lookup pattern as `modify_row`.
oxen-rust/src/server/src/controllers/file.rs (1)
123-127: `base_repo` is always `&repo` here; can simplify.
`(staged_repo, base_repo)` currently assigns `base_repo` to `&repo` in both cases, so you can drop the tuple and just keep `staged_repo`.
oxen-rust/src/lib/src/core/df/tabular.rs (1)
824-844: Consider avoiding `DataFrame` materialization for `StructArray` conversion. The current code converts `StructArray` → `DataFrame` → manual row iteration, which creates unnecessary intermediate allocations. Since `_fields` is already available, you can iterate `struct_array.fields()` and `struct_array.values()` directly (or use `struct_array.iter()` for row-wise access), similar to the pattern already used for `AnyValue::StructOwned` just above this function. This would be more efficient and idiomatic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1 hunks)
- oxen-rust/src/lib/src/core/db/data_frames/rows.rs (2 hunks)
- oxen-rust/src/lib/src/core/df/tabular.rs (3 hunks)
- oxen-rust/src/server/src/controllers/file.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-10T22:08:33.965Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 142
File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364
Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.
Applied to files:
oxen-rust/src/server/src/controllers/file.rs
📚 Learning: 2025-09-18T19:11:42.443Z
Learnt from: Lilianxr
Repo: Oxen-AI/Oxen PR: 146
File: oxen-rust/src/benches/oxen/download.rs:176-179
Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.
Applied to files:
oxen-rust/src/server/src/controllers/file.rs
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/file.rs (6)
oxen-rust/src/lib/src/error.rs (2)
  - basic_str (158-160)
  - path_does_not_exist (237-239)
oxen-rust/src/lib/src/repositories/commits/commit_writer.rs (1)
  - commit (76-79)
oxen-rust/src/server/src/controllers/workspaces.rs (1)
  - commit (489-540)
oxen-rust/src/lib/src/core/v_latest/commits.rs (1)
  - commit (22-24)
oxen-rust/src/lib/src/core/v_latest/workspaces/commit.rs (1)
  - commit (27-104)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  - get_file_by_path (247-259)
🪛 GitHub Actions: 🐂 Continuous integration
oxen-rust/src/lib/src/core/df/tabular.rs
[error] 817-817: returning the result of a let binding from a block
🔇 Additional comments (3)
oxen-rust/src/lib/src/core/db/data_frames/df_db.rs (1)
673-688: No action required; the code pattern is correct for DuckDB 1.1.1. DuckDB supports implicit casting from STRUCT to JSON, so `ALTER COLUMN ... TYPE JSON` without a `USING` clause will work for STRUCT columns and nested types (including `STRUCT[]`). The implementation is sound.
oxen-rust/src/lib/src/core/df/tabular.rs (1)
775-793: The List(Struct) JSON conversion logic is correct for Polars 0.49. The code properly:
- Calls `l.struct_()` on the inner Series (which has dtype Struct)
- Handles null fields via `unwrap_or(AnyValue::Null)` without panicking
- Supports nested structures through recursive `any_val_to_json` calls on field values
oxen-rust/src/server/src/controllers/file.rs (1)
129-147: Staged-first fallback looks good; recommend normalizing paths in staged DB reads for consistency. Files are normalized via `util::fs::path_relative_to_dir` when staged (lines 50, 182, 209, 224 of `workspaces/files.rs`), but `read_from_staged_db` at line 148 uses the path directly from `parse_resource`. While this is safe in practice (the staged DB only contains normalized keys, so un-normalized queries simply won't match), consider normalizing the path before the staged DB lookup to maintain consistency with the staging flow and improve robustness if code paths change in the future.
⛔ Skipped due to learnings
Learnt from: Lilianxr, Repo: Oxen-AI/Oxen, PR: 142, File: oxen-rust/src/lib/src/core/v_latest/workspaces/files.rs:348-364, Timestamp: 2025-09-10T22:08:33.965Z
Learning: In the Oxen codebase, the `util::fs::path_relative_to_dir` function properly handles path sanitization to ensure paths stay within the specified workspace boundary, making additional canonicalization checks unnecessary for workspace-scoped operations.
Learnt from: Lilianxr, Repo: Oxen-AI/Oxen, PR: 146, File: oxen-rust/src/benches/oxen/download.rs:176-179, Timestamp: 2025-09-18T19:11:42.443Z
Learning: In oxen-rust/src/benches/oxen/download.rs, the setup_repo_for_download_benchmark function always stages files under a "files" directory structure in the repository, regardless of whether data_path is provided or not. The hardcoded Path::new("files") in the repositories::download call is correct because of this consistent setup.
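A rough sketch of the suggested normalization step; `normalize_lookup_path` is a simplified stand-in for the `util::fs::path_relative_to_dir` helper named above, shown only to illustrate the intent:

```rust
use std::path::{Path, PathBuf};

// Stand-in for util::fs::path_relative_to_dir: strip the workspace prefix so
// the staged-DB lookup uses the same normalized key the staging flow writes.
fn normalize_lookup_path(workspace_dir: &Path, path: &Path) -> PathBuf {
    path.strip_prefix(workspace_dir)
        .map(Path::to_path_buf)
        .unwrap_or_else(|_| path.to_path_buf())
}

fn main() {
    let key = normalize_lookup_path(
        Path::new("/repo/workspace"),
        Path::new("/repo/workspace/data/train.parquet"),
    );
    assert_eq!(key, PathBuf::from("data/train.parquet"));
}
```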
```diff
 for col in df.get_columns() {
-    // Replace that column in the existing df if it exists
+    // Replace that column - copy the entire Series to preserve complex types (lists, structs, etc.)
     let col_name = col.name();
-    let new_val = df.column(col_name)?.get(0)?;
-    new_row.with_column(Series::new(PlSmallStr::from_str(col_name), vec![new_val]))?;
+    let col_series = df.column(col_name)?;
+    if let Some(col_idx) = new_row.get_column_index(col_name) {
+        new_row.replace_column(col_idx, col_series.clone())?;
+    } else {
+        new_row.with_column(col_series.clone())?;
+    }
 }
```
Guard “row not found” before merging columns (avoid odd errors later).
If df_db::select returns 0 rows, the subsequent merge + get_hash_and_status_for_modification will likely return a generic error instead of a clean resource_not_found(uuid). Consider an early check right after maybe_db_data is loaded.
🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/db/data_frames/rows.rs around lines 98 to 108, add
an explicit guard immediately after loading maybe_db_data (the result of
df_db::select) to check if it contains zero rows and, if so, return a
resource_not_found(uuid) error instead of proceeding; place the check before the
loop that merges columns so you bail out early with the appropriate error,
avoiding later generic failures from merge/get_hash_and_status_for_modification.
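A self-contained sketch of the guard this prompt describes; the error type here is a plain `String` stand-in for `OxenError::resource_not_found`:

```rust
use polars::prelude::DataFrame;

// Bail out before merging columns when the select returned no rows, so the
// caller sees "resource not found" rather than a generic downstream error.
fn guard_row_exists(maybe_db_data: &DataFrame, uuid: &str) -> Result<(), String> {
    if maybe_db_data.height() == 0 {
        return Err(format!("Resource not found: {uuid}"));
    }
    Ok(())
}
```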
```rust
        AnyValue::Struct(_len, struct_array, fields) => {
            let json_value = struct_array_to_json(struct_array, fields);
            json_value
        }

        other => panic!("Unsupported dtype in JSON conversion: {other:?}"),
    }
}
```
```rust
fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    _fields: &[polars::prelude::Field],
) -> Value {
    // Convert StructArray -> DataFrame, then to JSON
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(_) => Value::String(format!("{struct_array:?}")),
    }
}
```
🧩 Analysis chain
🌐 Web query:
Polars Rust AnyValue::Struct vs StructOwned single struct JSON representation
💡 Result:
Short answer — difference and JSON behaviour
-
AnyValue::Struct (enum variant) is the runtime AnyValue representation for a struct value often backed by an Arrow/Polars StructArray (it carries references into the array: index + child arrays + field names). AnyValue::Struct is a view/borrowed-style variant used when iterating rows or working with values from existing arrays. [1][2]
-
AnyValue::StructOwned (sometimes named StructOwned / Struct payload in Polars internals) is the owned form that contains an owned Vec/Box of AnyValue values (the fields) so the struct value can be stored independently of an array backing. It is used when you need an owned scalar (e.g. building buffers, creating rows, or serializing). The AV-buffer implementation shows the code paths that expect AnyValue::StructOwned when adding owned physical values. [3]
-
JSON representation: when you write a DataFrame containing Struct columns with JsonWriter (Polars uses arrow2::io::json under the hood), each struct row is emitted as a JSON object (field-name: value) — i.e. the struct is serialized as a single JSON object per row (or per JSON-lines record), not as a positional array. If you convert struct fields to strings first you can control how non-primitive children are rendered (arrow2/Polars formats non-primitives using Display by default). [4][1]
References
- Polars user guide — Structs (explanation of struct dtype and how struct rows look/are unnestable). [Polars docs - Structs]
- AnyValue enum documentation (Polars Rust API) — AnyValue struct variant details. [docs.rs AnyValue]
- Polars source glimpses (av_buffer / builder code) showing AnyValue::Struct vs AnyValue::StructOwned handling when building/adding values. [polars av_buffer.rs]
- JsonWriter docs — Polars JSON writer uses arrow2::io::json and serializes struct columns as JSON objects; notes about formatting non-primitives. [JsonWriter docs]
If you want, I can paste the exact AnyValue variant signatures from the crate source and show a tiny Rust example producing the JSON output.
🏁 Script executed:
# Search for StructOwned handling in the same file
rg "StructOwned" oxen-rust/src/lib/src/core/df/tabular.rs -B 2 -A 5Repository: Oxen-AI/Oxen
Length of output: 804
🏁 Script executed:
# Look at the broader context of the any_val_to_json function
cat -n oxen-rust/src/lib/src/core/df/tabular.rs | sed -n '780,850p'
Repository: Oxen-AI/Oxen
Length of output: 3077
🏁 Script executed:
# Search for usages of AnyValue::Struct to understand when it's created
rg "AnyValue::Struct\(" oxen-rust/src/lib/src/core/df/tabular.rs -B 3 -A 3Repository: Oxen-AI/Oxen
Length of output: 221
🏁 Script executed:
# Look for tests related to struct JSON conversion
rg "struct_array_to_json|StructOwned.*json" oxen-rust/src/lib/src/core/df/ -lRepository: Oxen-AI/Oxen
Length of output: 96
🏁 Script executed:
# Check if there are any comments or context about why struct_array is wrapped in an array
git log -p --all -S "struct_array_to_json" -- oxen-rust/src/lib/src/core/df/tabular.rs | head -100
Repository: Oxen-AI/Oxen
Length of output: 38
Struct JSON serialization inconsistency: returns array instead of object
AnyValue::StructOwned correctly returns a Value::Object (line 798–813), but AnyValue::Struct delegates to struct_array_to_json which wraps the result in Value::Array (line 839). Per Polars JSON convention, struct values should serialize as JSON objects, not arrays. This breaks the expected shape if downstream code expects an object for a single struct cell.
The fix is straightforward: in struct_array_to_json, if struct_array.len() == 1, return the single object directly; otherwise, return the array only if there are multiple rows.
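A small sketch of that shape fix, assuming the per-row objects have already been collected into a `Vec<Value>` as in `struct_array_to_json`:

```rust
use serde_json::Value;

// Collapse a single-row result to a bare object; keep the array form only
// when the StructArray actually has multiple rows.
fn finalize_struct_json(mut rows: Vec<Value>) -> Value {
    if rows.len() == 1 {
        rows.remove(0)
    } else {
        Value::Array(rows)
    }
}
```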
🧰 Tools
🪛 GitHub Actions: 🐂 Continuous integration
[error] 817-817: returning the result of a let binding from a block
🤖 Prompt for AI Agents
In oxen-rust/src/lib/src/core/df/tabular.rs around lines 815 to 844,
struct_array_to_json currently always returns Value::Array even for a single-row
StructArray which breaks Polars convention; update the function so that after
building the Vec<Value::Object> rows it returns the single Object directly when
df.height() == 1 (i.e., Value::Object(rows[0].clone() or move the map out), and
only returns Value::Array(rows) when there are multiple rows; keep the existing
Err fallback unchanged.
Actionable comments posted: 2
🧹 Nitpick comments (1)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)
44-46: Consider conditional normalization to avoid unnecessary cloning. The data is cloned unconditionally before normalization, even when the structure doesn't contain messages or array content. For non-message data, this adds unnecessary allocation overhead.
Consider checking if normalization is needed before cloning:
```diff
-    let mut normalized_data = data.clone();
-    maybe_normalize_message_content(&mut normalized_data);
-    let df = tabular::parse_json_to_df(&normalized_data)?;
+    let df = if data.get("messages").is_some() {
+        let mut normalized_data = data.clone();
+        maybe_normalize_message_content(&mut normalized_data);
+        tabular::parse_json_to_df(&normalized_data)?
+    } else {
+        tabular::parse_json_to_df(data)?
+    };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- oxen-rust/src/lib/src/core/df/tabular.rs (5 hunks)
- oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (6)
oxen-rust/src/lib/src/core/v_latest/workspaces/data_frames/rows.rs (1)
7-7: LGTM: Import for JSON macro. The `json!` macro import is needed for the normalization logic below.
oxen-rust/src/lib/src/core/df/tabular.rs (5)
308-320: LGTM: Clean refactor to explicit reader. The extraction of the reader creation makes the code slightly more readable without changing behavior.
723-723: Public API change: `any_val_to_json` now exposed. This function is now part of the module's public API. Ensure this is intentional and that the function's behavior is stable enough for external use.
779-797: LGTM: Proper List(Struct) to JSON array conversion. The logic correctly iterates through struct rows and builds JSON objects with recursive value conversion.
819-819: Verified: Clippy issue addressed. The let-then-return pattern flagged in the previous review has been resolved. The match arm now directly returns the result from `struct_array_to_json`.
932-932: LGTM: Consistent Struct handling in SQL conversion. The addition properly routes `List(Struct)` to JSON conversion, consistent with the pattern used for other complex types.
```rust
fn struct_array_to_json(
    struct_array: &polars::prelude::StructArray,
    fields: &[polars::prelude::Field],
) -> Value {
    match DataFrame::try_from(struct_array.clone()) {
        Ok(df) => {
            let mut rows = Vec::with_capacity(df.height());
            for i in 0..df.height() {
                let mut map = serde_json::Map::new();
                for col in df.get_columns() {
                    let val = col.get(i).unwrap_or(AnyValue::Null);
                    map.insert(col.name().to_string(), any_val_to_json(val));
                }
                rows.push(Value::Object(map));
            }
            Value::Array(rows)
        }
        Err(e) => {
            log::warn!("Failed to convert StructArray to DataFrame: {e:?}");
            let mut map = serde_json::Map::new();
            for (i, field) in fields.iter().enumerate() {
                if let Some(values) = struct_array.values().get(i) {
                    let len = values.len();
                    if len > 0 {
                        let val_str = format!("{values:?}");
                        // Look for actual URL or string content
                        if let Some(start) = val_str.find("https://") {
                            // Extract URL starting from https://
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        } else if let Some(start) = val_str.find("http://") {
                            let rest = &val_str[start..];
                            if let Some(end) = rest.find('"').or(rest.find(']')) {
                                let url = &rest[..end];
                                map.insert(
                                    field.name().to_string(),
                                    Value::String(url.to_string()),
                                );
                                continue;
                            }
                        }
                    }
                }
                map.insert(field.name().to_string(), Value::Null);
            }
            if map.is_empty() {
                Value::Null
            } else {
                Value::Object(map)
            }
        }
    }
}
```
🧩 Analysis chain
🏁 Script executed:
# First, find and read the file at the specified lines
fd -t f "tabular.rs" --path "*core/df*" | head -5Repository: Oxen-AI/Oxen
Length of output: 289
🏁 Script executed:
# Find the tabular.rs file
find . -name "tabular.rs" -type f | grep -E "core/df"Repository: Oxen-AI/Oxen
Length of output: 98
🏁 Script executed:
# Read lines 825-884 from the file
sed -n '825,884p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 2663
🏁 Script executed:
# Search for tests and usages of struct_array_to_json
rg -n "struct_array_to_json" ./oxen-rust --type rustRepository: Oxen-AI/Oxen
Length of output: 274
🏁 Script executed:
# Read more context around the function call to understand the larger any_val_to_json function
sed -n '800,825p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 854
🏁 Script executed:
# Search for tests related to any_val_to_json or JSON conversion
rg -n "any_val_to_json|struct_array_to_json" ./oxen-rust --type rust -A 2 -B 2 | head -50Repository: Oxen-AI/Oxen
Length of output: 4038
🏁 Script executed:
# Search for tests related to df conversion and JSON
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f | xargs grep -l "json\|struct" | head -10
Repository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Search for test files in the core/df directory
find ./oxen-rust -path "*/tests/*" -name "*.rs" -type f
Repository: Oxen-AI/Oxen
Length of output: 81
🏁 Script executed:
# Search for tests related to struct_array, JSON, or tabular
rg -n "struct_array|any_val_to_json|tabular" ./oxen-rust/tests/test.rsRepository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Look for where any_val_to_json is called in the codebase
rg -n "any_val_to_json" ./oxen-rust/src/lib --type rust -B 3 -A 3Repository: Oxen-AI/Oxen
Length of output: 11356
🏁 Script executed:
# Read the struct test more thoroughly
sed -n '2025,2060p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 1177
🏁 Script executed:
# Search for git history or comments about array vs object issues
rg -n "array.*object|object.*array" ./oxen-rust/src/lib/src/core/df/tabular.rs | head -20Repository: Oxen-AI/Oxen
Length of output: 38
🏁 Script executed:
# Check if there are any tests for the error path of struct_array_to_json
rg -n "struct_array_to_json|DataFrame::try_from.*fail|Err.*StructArray" ./oxen-rust --type rust -A 5Repository: Oxen-AI/Oxen
Length of output: 1005
🏁 Script executed:
# Look at the broader context to understand what AnyValue::Struct represents
rg -n "AnyValue::Struct" ./oxen-rust/src/lib/src/core/df/tabular.rs -B 5 -A 5Repository: Oxen-AI/Oxen
Length of output: 1559
🏁 Script executed:
# Let me understand the context better - check how many rows are expected in a StructArray
sed -n '829,845p' ./oxen-rust/src/lib/src/core/df/tabular.rs
Repository: Oxen-AI/Oxen
Length of output: 798
Brittle fallback parsing debug strings—only extracts URLs and silently loses other field values.
The error fallback (lines 842-883) uses format!("{values:?}") to parse debug format output for extracting URLs. This approach is fragile and will break if Polars changes its debug formatting.
Issues:
- Debug string parsing: Relies on unstable debug format output, not a public API
- Limited extraction logic: Only handles http/https URLs; all other field types are converted to `Value::Null`
- Silent data loss: Non-URL fields lose their actual values without recovery
Recommendations:
- Fix the root cause of `DataFrame::try_from()` failures instead of relying on a brittle fallback
- If a fallback is necessary, use proper Arrow/Polars APIs to extract field values
- Preserve field values for non-URL types rather than defaulting to null
```rust
// Instead of parsing debug strings, consider:
Err(e) => {
    log::error!("Failed to convert StructArray to DataFrame: {e:?}");
    return Value::Null; // or handle appropriately without parsing
}
```

```rust
fn maybe_normalize_message_content(value: &mut serde_json::Value) {
    let Some(messages) = value.get_mut("messages").and_then(|m| m.as_array_mut()) else {
        return;
    };

    for msg in messages {
        if !msg.is_object() {
            continue;
        }

        let Some(msg_obj) = msg.as_object_mut() else {
            continue;
        };

        if let Some(content) = msg_obj.get_mut("content") {
            // Only process if content is an array (user messages with images/objects)
            if content.is_array() {
                // Remove null image_url fields before stringifying
                if let Some(content_array) = content.as_array_mut() {
                    for item in content_array.iter_mut() {
                        if let Some(item_obj) = item.as_object_mut() {
                            if let Some(image_url_val) = item_obj.get("image_url") {
                                if image_url_val.is_null() {
                                    item_obj.remove("image_url");
                                }
                            }
                        }
                    }
                }

                // Stringify the content array
                if let Ok(content_str) = serde_json::to_string(content) {
                    *content = json!(content_str);
                }
            }
            // String content stays as-is (simple text messages)
        }
    }
}
```
Add error handling and documentation for domain-specific normalization.
This function silently ignores serialization errors and lacks documentation explaining why this chat-specific normalization is needed.
Issues:
- Line 399: `serde_json::to_string` errors are silently ignored; if serialization fails, the content remains as an array instead of being stringified, potentially causing downstream parsing issues.
- The function is tightly coupled to a specific chat message schema (messages → content → image_url) without documentation explaining the business logic or why array content must be stringified.
Consider:
- Log serialization errors or propagate them
- Add a doc comment explaining why this normalization is necessary for chat history
```diff
+/// Normalizes chat message content arrays by removing null image_url fields
+/// and stringifying array content for DuckDB compatibility.
 fn maybe_normalize_message_content(value: &mut serde_json::Value) {
     // ... existing code ...
     // Stringify the content array
-    if let Ok(content_str) = serde_json::to_string(content) {
+    match serde_json::to_string(content) {
+        Ok(content_str) => {
             *content = json!(content_str);
+        }
+        Err(e) => {
+            log::warn!("Failed to stringify message content: {}", e);
+        }
     }
```
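A standalone sketch of the same idea; `eprintln!` stands in for `log::warn!` so the snippet runs on its own:

```rust
use serde_json::{json, Value};

// Stringify the content array, surfacing serialization failures instead of
// silently leaving the array in place.
fn stringify_content(content: &mut Value) {
    match serde_json::to_string(&*content) {
        Ok(content_str) => *content = json!(content_str),
        Err(e) => eprintln!("Failed to stringify message content: {e}"),
    }
}
```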
@EloyMartinez How can I test this? Do you have a specific DF with a problem that I can use?
Actionable comments posted: 0
🧹 Nitpick comments (2)
oxen-rust/src/server/src/controllers/data_frames.rs (1)
86-87: Unnecessary clone: `resource` can be passed by reference directly. Since `get_slice` takes `&ParsedResource`, the `.clone()` is redundant.

```diff
 let data_frame_slice =
-    repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+    repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
         .await?;
```

oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
40-68: Consider clarifying the error message for the get_slice context. The error message "Only single file download is supported" on line 47 appears to be copied from a download handler. In the context of `get_slice`, a message like "Expected a file node" would be clearer.

```diff
 let file_node = match staged_node.node.node {
     EMerkleTreeNode::File(f) => Ok(f),
     _ => Err(OxenError::basic_str(
-        "Only single file download is supported",
+        "Expected a file node for data frame slice",
     )),
 }?;
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- oxen-rust/src/lib/src/core/v_latest/data_frames.rs (3 hunks)
- oxen-rust/src/lib/src/repositories/data_frames.rs (2 hunks)
- oxen-rust/src/server/src/controllers/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
oxen-rust/src/lib/src/repositories/data_frames.rs (1)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
get_slice(20-130)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (5)
oxen-rust/src/lib/src/core/staged/staged_db_manager.rs (1)
  - with_staged_db_manager (65-118)
oxen-rust/src/lib/src/model/merkle_tree/node/file_node.rs (3)
  - node (118-122)
  - metadata (46-46)
  - metadata (184-186)
oxen-rust/src/lib/src/model/merkle_tree/node.rs (1)
  - metadata (62-67)
oxen-rust/src/lib/src/error.rs (2)
  - basic_str (161-163)
  - path_does_not_exist (244-246)
oxen-rust/src/lib/src/repositories/tree.rs (1)
  - get_file_by_path (247-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (4)
oxen-rust/src/server/src/controllers/data_frames.rs (1)
51-55: LGTM! The commit resolution logic correctly prioritizes the workspace commit when available, with appropriate fallback to resource.commit and NotFound error handling.
oxen-rust/src/lib/src/repositories/data_frames.rs (1)
12-21: LGTM! Clean signature update propagating the resource-based API change through the version dispatch layer.
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (2)
26-38: LGTM! The commit derivation and repository selection logic correctly handles both workspace and non-workspace scenarios, establishing the proper context for downstream operations.
92-92: LGTM! The `handle_sql_querying` call correctly uses the derived commit reference.
…n-AI/Oxen into feat/chat_completion_history
Actionable comments posted: 0
🧹 Nitpick comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)
116-157: Redundant clone on line 127.
`&resource.clone()` creates an unnecessary copy. Since `resource` is only borrowed here and for `resource.path` access, a simple `&resource` suffices.
🔎 Proposed fix

```diff
 let data_frame_slice =
-    repositories::data_frames::get_slice(&repo, &resource.clone(), &resource.path, &opts)
+    repositories::data_frames::get_slice(&repo, &resource, &resource.path, &opts)
         .await?;
```
Overall approach looks good. The non-indexed path now correctly fetches a slice from the committed version (via `workspace: None` in the `ParsedResource`) and returns actual data instead of `None`. The count derivation logic and schema fallback are sound.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (2)
oxen-rust/src/lib/src/core/v_latest/data_frames.rs (1)
  - get_slice (20-130)
oxen-rust/src/lib/src/model/data_frame/schema.rs (1)
  - from_polars (61-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (1)
oxen-rust/src/server/src/controllers/workspaces/data_frames.rs (1)
13-13: LGTM! Import addition is appropriate for the new `ParsedResource` usage below.
```rust
        manager.with_conn(|conn| rows::append_row(conn, &df))
    })?;

    println!("this is the result {result:?}");
```
@EloyMartinez can you remove this?
done!!!!
```rust
    let mut df = tabular::parse_json_to_df(data)?;

    log::debug!("update() df: {df:?}");
```
@EloyMartinez and this too.
this too!!!
In order to modify a row that contains an array of structs, we need to manually handle the polars -> duckdb case.
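For illustration only, a hypothetical chat-history row of the kind this PR targets: the `messages` column is an array of structs, serialized to a JSON string so it can be bound as a single DuckDB parameter. The field names are assumptions, not taken from the PR:

```rust
use serde_json::json;

fn main() {
    // Hypothetical chat-history row: `messages` is an array of structs.
    let messages = json!([
        { "role": "user", "content": "What is in this image?" },
        { "role": "assistant", "content": "A table with nested struct columns." }
    ]);

    // Serialize the nested value to a JSON string before binding it to DuckDB.
    let bound_value = serde_json::to_string(&messages).expect("serialize messages");
    println!("{bound_value}");
}
```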
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Chores