
Conversation


@rkritika1508 rkritika1508 commented Jan 19, 2026

Summary

Target issue is #7.
Explain the motivation for making this change. What existing problem does the pull request solve?
This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. It makes it straightforward to benchmark validators such as the PII Remover and Lexical Slur Detection.

  1. Offline evaluation scripts
    Added standalone evaluation runners under app/eval/ for:
  • Lexical slur validator
  • PII remover validator
    Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
  2. Metrics & profiling utilities
  • Binary classification metrics (tp, fp, fn, tn, precision, recall, F1).
  • Entity-level precision/recall/F1 for PII redaction (placeholder-based evaluation).
  • Lightweight profiling: per-sample latency (mean / p95 / max) and peak memory usage (via tracemalloc). A minimal sketch of this approach is shown after this list.
  3. Downloadable evaluation artifacts
    Each validator produces:
  • predictions.csv – row-level outputs for debugging and analysis
  • metrics.json – aggregated accuracy + performance metrics

Standardized output structure:

app/eval/outputs/
  lexical_slur/
    predictions.csv
    metrics.json
  pii_remover/
    predictions.csv
    metrics.json
  4. Testing
  • Evaluation scripts tested locally on existing lexical slur and PII datasets.
  • Outputs validated manually (predictions.csv, metrics.json) for correctness.
  • No changes to production runtime paths or APIs.
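
To give reviewers a feel for the profiling approach without opening the diff, here is a minimal, self-contained sketch in the spirit of app/eval/common/profiling.py. It is illustrative only: the class shape (Profiler with latencies, peak_memory_mb, and a record helper) mirrors how the eval runners use it, but the exact implementation in the PR may differ.

import time
import tracemalloc
from statistics import mean


class Profiler:
    # Illustrative sketch only; the real app/eval/common/profiling.py may differ.
    def __init__(self):
        self.latencies = []        # per-sample latency in milliseconds
        self.peak_memory_mb = 0.0  # peak allocation observed while profiling

    def __enter__(self):
        tracemalloc.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        _, peak = tracemalloc.get_traced_memory()
        self.peak_memory_mb = peak / (1024 * 1024)
        tracemalloc.stop()

    def record(self, fn, *args, **kwargs):
        # Time a single call and return its result so predictions can be collected.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append((time.perf_counter() - start) * 1000)
        return result


# Usage: profile a trivial "validator" over a few samples.
if __name__ == "__main__":
    with Profiler() as p:
        for text in ["hello", "world", "foo"]:
            p.record(str.lower, text)
    print(f"mean latency: {mean(p.latencies):.3f} ms, peak memory: {p.peak_memory_mb:.2f} MB")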

How to test

  • Lexical Slur Validator Evaluation
    Run the offline evaluation script: python app/eval/lexical_slur/run.py

Expected outputs:

app/eval/outputs/lexical_slur/
├── predictions.csv
└── metrics.json

predictions.csv contains row-level inputs, predictions, and labels.
metrics.json contains binary classification metrics and performance stats
(latency + peak memory).

  • PII Remover Evaluation
    Run the PII evaluation script: python app/eval/pii/run.py

Expected outputs:

app/eval/outputs/pii_remover/
├── predictions.csv
└── metrics.json

predictions.csv contains the original text, anonymized output, and ground-truth masked text.
metrics.json contains entity-level precision, recall, and F1 per PII type (a rough sketch of this entity-level scoring follows).
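
For context on the placeholder-based entity evaluation, the rough idea is sketched below: count entity placeholders (e.g. <PERSON>) in the ground-truth masked text versus the anonymized output, and derive per-type precision/recall/F1 from the overlap. The placeholder pattern and the function name entity_prf are assumptions for illustration; the actual logic lives in backend/app/eval/pii/entity_metrics.py and may differ.

import re
from collections import Counter, defaultdict

# Assumed placeholder style, e.g. <PERSON>, <EMAIL_ADDRESS>; the real dataset may use another format.
PLACEHOLDER_RE = re.compile(r"<([A-Z_]+)>")


def entity_prf(gold_texts, pred_texts):
    # Count placeholder occurrences per entity type in gold vs. predicted masked text,
    # then derive precision/recall/F1 per type (occurrence counts, not exact spans).
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in zip(gold_texts, pred_texts, strict=True):
        gold_counts = Counter(PLACEHOLDER_RE.findall(gold))
        pred_counts = Counter(PLACEHOLDER_RE.findall(pred))
        for label in set(gold_counts) | set(pred_counts):
            overlap = min(gold_counts[label], pred_counts[label])
            stats[label]["tp"] += overlap
            stats[label]["fp"] += pred_counts[label] - overlap
            stats[label]["fn"] += gold_counts[label] - overlap

    report = {}
    for label, s in stats.items():
        precision = s["tp"] / (s["tp"] + s["fp"]) if (s["tp"] + s["fp"]) else 0.0
        recall = s["tp"] / (s["tp"] + s["fn"]) if (s["tp"] + s["fn"]) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[label] = {**s, "precision": precision, "recall": recall, "f1": f1}
    return report


# Usage: a missed phone-number redaction shows up as recall 0.0 for PHONE_NUMBER.
if __name__ == "__main__":
    gold = ["Call <PERSON> at <PHONE_NUMBER>"]
    pred = ["Call <PERSON> at 555-0100"]
    print(entity_prf(gold, pred))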

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information the reviewer may need here.

Summary by CodeRabbit

  • New Features

    • Added evaluation framework for validators with comprehensive metrics computation, latency profiling, and performance reporting capabilities.
  • Chores

    • Added scikit-learn dependency for scientific computing.
    • Improved test execution by filtering slow and redteam tests for faster feedback cycles.



coderabbitai bot commented Jan 19, 2026

📝 Walkthrough


This PR introduces a new evaluation framework for testing validators with performance profiling and metrics computation, adds utility helpers for I/O operations, refactors test fixtures to use monkeypatch, and includes dependency and test configuration updates.

Changes

Cohort / File(s) Summary
Configuration & Dependencies
.gitignore, backend/pyproject.toml, backend/scripts/test.sh
Added output file entries to .gitignore, added scikit-learn dependency, configured pytest to exclude slow and redteam test markers
Evaluation Infrastructure Utilities
backend/app/eval/common/io.py, backend/app/eval/common/metrics.py, backend/app/eval/common/profiling.py
New utility modules providing CSV/JSON writers, binary metrics computation (tp/tn/fp/fn, precision, recall, F1), and a Profiler context manager for latency and peak memory tracking
Validator Evaluation Runners
backend/app/eval/lexical_slur/run.py, backend/app/eval/pii/run.py
Two new evaluation scripts that load datasets, instantiate validators, profile execution, compute classification metrics, and persist per-sample predictions and aggregated metrics to JSON/CSV outputs
PII Entity Metrics
backend/app/eval/pii/entity_metrics.py
New module for per-entity metrics computation from masked text, extracting entity labels via regex and computing TP/FP/FN per entity with precision/recall/F1 derivation
Guardrails API Robustness
backend/app/api/routes/guardrails.py
Modified add_validator_logs to use safe attribute access via getattr for nested fields (history, last, iterations, outputs, validator_logs), preventing exceptions on missing attributes (a small illustrative sketch of this pattern follows the table)
Test Infrastructure Refactoring
backend/app/tests/conftest.py, backend/app/tests/test_guardrails_api.py, backend/app/tests/test_validate_with_guard.py
Refactored fixtures to use monkeypatch for CRUD mocking, introduced separate mocks for RequestLogCrud and ValidatorLogCrud, updated error message expectations from "PII detected" to "Validation failed", added validator_log_crud parameter threading through validation tests
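
To illustrate the safe-attribute-access pattern described for add_validator_logs, here is a small hypothetical sketch of getattr chaining with defaults; the function and variable names are illustrative and not taken from backend/app/api/routes/guardrails.py.

# Hypothetical illustration of defensive getattr chaining; not the actual implementation.
def extract_validator_logs(result):
    history = getattr(result, "history", None)
    last = getattr(history, "last", None) if history is not None else None
    iterations = getattr(last, "iterations", None) if last is not None else None
    if not iterations:
        return []
    logs = []
    for iteration in iterations:
        outputs = getattr(iteration, "outputs", None)
        logs.extend(getattr(outputs, "validator_logs", None) or [])
    return logs

Each missing attribute resolves to None (or an empty list), so a request that produced no validation history simply yields no validator logs instead of raising AttributeError.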

Sequence Diagram

sequenceDiagram
    participant Dataset as Dataset<br/>(CSV)
    participant Validator as Validator<br/>(LexicalSlur/PII)
    participant Profiler as Profiler<br/>(Context Mgr)
    participant Metrics as Metrics<br/>Computation
    participant Output as Output<br/>(CSV/JSON)
    
    Dataset->>Validator: Load samples (commentText)
    loop Per sample
        Profiler->>Profiler: __enter__() start tracemalloc
        Profiler->>Validator: record(validate, text)
        Validator->>Validator: Execute validation
        Profiler->>Profiler: Measure latency (ms)
        Profiler->>Profiler: Store in latencies[]
    end
    Profiler->>Profiler: __exit__() capture peak_memory_mb
    
    Validator->>Metrics: y_true, y_pred (FailResult → 1/0)
    Metrics->>Metrics: compute_binary_metrics()
    Metrics->>Metrics: Calculate tp/tn/fp/fn, precision, recall, F1
    
    Metrics->>Output: Write predictions.csv (samples + preds)
    Metrics->>Output: Write metrics.json (stats + latency + memory)

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 New metrics bloom where validators run,
With profilers measuring time and memory done,
From CSV to JSON, results neatly penned,
Guardrails now safer, test fixtures refined—
Evaluation infrastructure, perfectly aligned! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.73%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Added testing setup' is too vague and does not clearly convey the main purpose of the PR, which is implementing an offline evaluation framework with metrics, profiling utilities, and evaluation scripts for validators. Resolution: consider a more descriptive title like 'Add offline evaluation framework with metrics and profiling utilities'.

✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped: CodeRabbit’s high-level summary is enabled.


@rkritika1508 rkritika1508 linked an issue Jan 27, 2026 that may be closed by this pull request
@rkritika1508 rkritika1508 marked this pull request as ready for review January 27, 2026 11:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.

In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.

In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.

In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.

In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)

6-6: Make the marker filter configurable for full-suite runs.

Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.

♻️ Proposed tweak
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"

Comment on lines +1 to +5
def compute_binary_metrics(y_true, y_pred):
    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -E "(backend/app|setup|pyproject|Pipfile)" | head -20

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 727


🏁 Script executed:

cat backend/app/eval/common/metrics.py 2>/dev/null || echo "File not found, searching for it..."
fd -t f -name "metrics.py"

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 1003


🏁 Script executed:

fd -t f -g 'pyproject.toml' -o -g 'setup.cfg' -o -g 'setup.py' -o -g '.python-version' -o -g 'Pipfile'

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 194


🏁 Script executed:

find . -type f \( -name "pyproject.toml" -o -name "setup.cfg" -o -name "setup.py" -o -name ".python-version" \) | head -10

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 102


🏁 Script executed:

cat pyproject.toml 2>/dev/null || cat setup.py 2>/dev/null || cat setup.cfg 2>/dev/null || echo "No config files found"

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 99


🏁 Script executed:

rg "python|requires.python" -i --type toml --type txt --max-count 5

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 164


🏁 Script executed:

cat backend/pyproject.toml

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 2352


Add strict=True to zip() calls to prevent silent truncation.

The zip() calls on lines 2–5 will silently drop items if y_true and y_pred have different lengths, producing incorrect metrics. Since the project requires Python ≥3.10, use strict=True to raise a ValueError immediately on length mismatch.

Proposed fix
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))
+    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
📝 Committable suggestion


Suggested change
def compute_binary_metrics(y_true, y_pred):
    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
🧰 Tools
🪛 Ruff (0.14.14)

2-2: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


3-3: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


4-4: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


5-5: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


Comment on lines +35 to +40
"performance": {
"latency_ms": {
"mean": sum(p.latencies) / len(p.latencies),
"p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
"max": max(p.latencies),
},

⚠️ Potential issue | 🟡 Minor

Guard latency stats for empty datasets.
If the dataset is empty, mean, p95, and max will raise. A small guard avoids failures in edge cases.

🛠 Proposed guard
+latencies = p.latencies
+if latencies:
+    latency_stats = {
+        "mean": sum(latencies) / len(latencies),
+        "p95": sorted(latencies)[int(len(latencies) * 0.95)],
+        "max": max(latencies),
+    }
+else:
+    latency_stats = {"mean": 0.0, "p95": 0.0, "max": 0.0}
+
 write_json(
     {
         "guardrail": "lexical_slur",
         "num_samples": len(df),
         "metrics": metrics,
         "performance": {
             "latency_ms": {
-                "mean": sum(p.latencies) / len(p.latencies),
-                "p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
-                "max": max(p.latencies),
+                **latency_stats,
             },
             "memory_mb": p.peak_memory_mb,
         },
     },
     OUT_DIR / "metrics.json",
 )
📝 Committable suggestion


Suggested change
latencies = p.latencies
if latencies:
    latency_stats = {
        "mean": sum(latencies) / len(latencies),
        "p95": sorted(latencies)[int(len(latencies) * 0.95)],
        "max": max(latencies),
    }
else:
    latency_stats = {"mean": 0.0, "p95": 0.0, "max": 0.0}

write_json(
    {
        "guardrail": "lexical_slur",
        "num_samples": len(df),
        "metrics": metrics,
        "performance": {
            "latency_ms": {
                **latency_stats,
            },
            "memory_mb": p.peak_memory_mb,
        },
    },
    OUT_DIR / "metrics.json",
)

"""
stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

for gold_txt, pred_txt in zip(gold_texts, pred_texts):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find Python version requirement
fd -type f -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" -o -name "requirements.txt" | head -20

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 248


🏁 Script executed:

# Read the file and surrounding context
cat -n backend/app/eval/pii/entity_metrics.py | head -80

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 2861


🏁 Script executed:

# Also check if there are any other zip calls to understand the pattern
rg "zip\(" backend/app/eval/pii/ -A 2 -B 2

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 495


🏁 Script executed:

# Search for Python version requirement
rg "python_requires|version.*python|3\.[0-9]+" -t py --max-count=10 | head -20

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 60


🏁 Script executed:

# Look for pyproject.toml or setup files
cat pyproject.toml 2>/dev/null | head -40

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 60


🏁 Script executed:

# Check for any test files or examples showing usage of compute_entity_metrics
rg "compute_entity_metrics" --type py -B 3 -A 3

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 1305


🏁 Script executed:

# Search all config files
find . -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name ".python-version" 2>/dev/null | head -10

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 102


🏁 Script executed:

# Check if pyproject.toml exists in repo root
ls -la | grep -E "pyproject|setup|requirements"

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 60


🏁 Script executed:

# Check the actual run.py usage context
cat -n backend/app/eval/pii/run.py | grep -A 10 "entity_report = compute_entity_metrics"

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 365


🏁 Script executed:

# Check the pyproject.toml for Python version
cat backend/pyproject.toml

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 2352


Add strict=True to zip to catch length mismatches.
zip() without strict=True silently truncates when iterables have unequal lengths, which would produce incorrect entity metrics. Since the project requires Python 3.10+, use strict=True.

🔧 Proposed fix
-    for gold_txt, pred_txt in zip(gold_texts, pred_texts):
+    for gold_txt, pred_txt in zip(gold_texts, pred_texts, strict=True):
📝 Committable suggestion


Suggested change
    for gold_txt, pred_txt in zip(gold_texts, pred_texts, strict=True):
🧰 Tools
🪛 Ruff (0.14.14)

41-41: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


Comment on lines +29 to +31
# ---- Save outputs ----
write_csv(df, OUT_DIR / "predictions.csv")


⚠️ Potential issue | 🟠 Major

Avoid exporting raw PII in predictions artifacts.
predictions.csv currently includes source_text, which likely contains raw PII. If these artifacts are shared or stored, this becomes a compliance/privacy risk. Consider excluding raw text (or gating it behind an explicit flag).

🔒 Proposed safer export
-# ---- Save outputs ----
-write_csv(df, OUT_DIR / "predictions.csv")
+# ---- Save outputs ----
+# Avoid exporting raw PII by default
+write_csv(df.drop(columns=["source_text"]), OUT_DIR / "predictions.csv")
📝 Committable suggestion


Suggested change
# ---- Save outputs ----
# Avoid exporting raw PII by default
write_csv(df.drop(columns=["source_text"]), OUT_DIR / "predictions.csv")

"pandas>=2.3.2",
"numpy>=1.24.0",
"python-dotenv<2.0.0,>=1.0.0",
"scikit-learn",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd backend && wc -l pyproject.toml

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 95


🏁 Script executed:

cd backend && cat -n pyproject.toml

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 3003


🏁 Script executed:

rg "scikit-learn|sklearn" --type py backend/ -l

Repository: ProjectTech4DevAI/kaapi-guardrails

Length of output: 60


🌐 Web query:

scikit-learn latest stable version 2024 2025 Python support

💡 Result:

  • Latest stable (as of Jan 27, 2026): scikit-learn 1.8.0 — released Dec 10, 2025; PyPI metadata says Requires-Python >= 3.11 (classifiers include 3.11–3.14). [1]
  • Latest stable released in 2025: 1.8.0 (Dec 10, 2025). [1]
    • Earlier 2025 line: 1.7.0 (Jun 5, 2025) and bugfixes (1.7.1/1.7.2). [1]
    • Docs note 1.7 requires Python 3.10+. [2]
  • Latest stable released in 2024: 1.6.0 — released Dec 9, 2024 (with 1.5.x and 1.4.x earlier in 2024). [1]
    • Docs note 1.4 and 1.5 support Python 3.9–3.12, and 1.6 supports Python 3.9–3.13. [2]

Sources: PyPI release history/metadata [1]; scikit-learn install docs (Python-version support notes) [2].


Remove unbounded scikit-learn from production dependencies.

scikit-learn is not imported anywhere in the backend codebase, indicating it's used only in offline evaluation scripts. Keep it out of the main dependencies list to avoid bloating the production runtime. Either move it to dev-dependencies (if used in committed eval scripts) or an optional extra [project.optional-dependencies] (if installed separately for offline tooling).

If it remains in production deps, add version bounds for consistency with the rest of the project:

-    "scikit-learn",
+    "scikit-learn>=1.6.0,<2.0.0",
📝 Committable suggestion


Suggested change
    "scikit-learn>=1.6.0,<2.0.0",


Development

Successfully merging this pull request may close these issues.

Validator: Add Validator Evaluation
