Added testing setup #17
base: main
📝 Walkthrough
This PR introduces a new evaluation framework for testing validators with performance profiling and metrics computation, adds utility helpers for I/O operations, refactors test fixtures to use monkeypatch, and includes dependency and test configuration updates.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Dataset as Dataset<br/>(CSV)
    participant Validator as Validator<br/>(LexicalSlur/PII)
    participant Profiler as Profiler<br/>(Context Mgr)
    participant Metrics as Metrics<br/>Computation
    participant Output as Output<br/>(CSV/JSON)
    Dataset->>Validator: Load samples (commentText)
    loop Per sample
        Profiler->>Profiler: __enter__() start tracemalloc
        Profiler->>Validator: record(validate, text)
        Validator->>Validator: Execute validation
        Profiler->>Profiler: Measure latency (ms)
        Profiler->>Profiler: Store in latencies[]
    end
    Profiler->>Profiler: __exit__() capture peak_memory_mb
    Validator->>Metrics: y_true, y_pred (FailResult → 1/0)
    Metrics->>Metrics: compute_binary_metrics()
    Metrics->>Metrics: Calculate tp/tn/fp/fn, precision, recall, F1
    Metrics->>Output: Write predictions.csv (samples + preds)
    Metrics->>Output: Write metrics.json (stats + latency + memory)
```
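To make the diagram concrete, here is a minimal sketch of a profiler with the interface the diagram implies (`__enter__`/`__exit__`, `record()`, `latencies`, `peak_memory_mb`); the actual implementation in the PR may differ:

```python
import time
import tracemalloc


class Profiler:
    """Context manager that records per-call latency and peak memory (sketch)."""

    def __enter__(self):
        self.latencies = []          # per-sample latencies in milliseconds
        self.peak_memory_mb = 0.0
        tracemalloc.start()
        return self

    def record(self, fn, *args, **kwargs):
        # Time a single validator call and return its result unchanged.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append((time.perf_counter() - start) * 1000.0)
        return result

    def __exit__(self, exc_type, exc, tb):
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.peak_memory_mb = peak / (1024 * 1024)
        return False
```

A runner would then wrap the per-sample loop in `with Profiler() as p:` and call `p.record(validator.validate, text)` for each row, matching the flow shown above.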
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
* Validators: Added validator logs and organized code
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.
In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.
In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.
In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.
In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)
6-6: Make the marker filter configurable for full-suite runs. Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.
♻️ Proposed tweak
```diff
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"
```
`backend/app/eval/common/metrics.py` (lines 1–5):
```python
def compute_binary_metrics(y_true, y_pred):
    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))
```
🧩 Analysis chain: located compute_binary_metrics in backend/app/eval/common/metrics.py and confirmed backend/pyproject.toml requires Python ≥ 3.10.
Add strict=True to zip() calls to prevent silent truncation.
The zip() calls on lines 2–5 will silently drop items if y_true and y_pred have different lengths, producing incorrect metrics. Since the project requires Python ≥3.10, use strict=True to raise a ValueError immediately on length mismatch.
Proposed fix
```diff
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))
+    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
```
🧰 Tools
🪛 Ruff (0.14.14): B905 — `zip()` without an explicit `strict=` parameter (lines 2–5).
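As a quick illustration of the failure mode (a hypothetical snippet, not code from this PR): with mismatched lengths, plain `zip()` silently drops the unmatched sample, whereas `strict=True` raises immediately.

```python
y_true = [1, 0, 1]
y_pred = [1, 0]  # one prediction missing, e.g. a sample that errored out

# Silent truncation: the third sample is dropped and the counts still look plausible.
print(sum(yt == yp for yt, yp in zip(y_true, y_pred)))  # 2

# strict=True surfaces the mismatch as soon as the shorter iterable is exhausted.
try:
    sum(yt == yp for yt, yp in zip(y_true, y_pred, strict=True))
except ValueError as err:
    print(err)  # "zip() argument 2 is shorter than argument 1"
```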
| "performance": { | ||
| "latency_ms": { | ||
| "mean": sum(p.latencies) / len(p.latencies), | ||
| "p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)], | ||
| "max": max(p.latencies), | ||
| }, |
Guard latency stats for empty datasets.
If the dataset is empty, mean, p95, and max will raise. A small guard avoids failures in edge cases.
🛠 Proposed guard
```diff
+latencies = p.latencies
+if latencies:
+    latency_stats = {
+        "mean": sum(latencies) / len(latencies),
+        "p95": sorted(latencies)[int(len(latencies) * 0.95)],
+        "max": max(latencies),
+    }
+else:
+    latency_stats = {"mean": 0.0, "p95": 0.0, "max": 0.0}
+
 write_json(
     {
         "guardrail": "lexical_slur",
         "num_samples": len(df),
         "metrics": metrics,
         "performance": {
             "latency_ms": {
-                "mean": sum(p.latencies) / len(p.latencies),
-                "p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
-                "max": max(p.latencies),
+                **latency_stats,
             },
             "memory_mb": p.peak_memory_mb,
         },
     },
     OUT_DIR / "metrics.json",
 )
```
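If the guard above feels noisy inline, the same logic could live in a small helper; this is a sketch under the review's assumptions (per-sample latencies in milliseconds on `p.latencies`), not code from the PR:

```python
def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Mean/p95/max latency stats, with safe zeros for an empty run."""
    if not latencies_ms:
        return {"mean": 0.0, "p95": 0.0, "max": 0.0}
    ordered = sorted(latencies_ms)
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": ordered[int(len(ordered) * 0.95)],  # same index convention as the original code
        "max": ordered[-1],
    }
```

The `"performance"` block then reduces to `"latency_ms": summarize_latencies(p.latencies)`.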
| """ | ||
| stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0}) | ||
|
|
||
| for gold_txt, pred_txt in zip(gold_texts, pred_texts): |
🧩 Analysis chain: inspected entity_metrics.py and the callers of compute_entity_metrics, and confirmed backend/pyproject.toml requires Python ≥ 3.10.
Add strict=True to zip to catch length mismatches.
zip() without strict=True silently truncates when iterables have unequal lengths, which would produce incorrect entity metrics. Since the project requires Python 3.10+, use strict=True.
🔧 Proposed fix
```diff
-    for gold_txt, pred_txt in zip(gold_texts, pred_texts):
+    for gold_txt, pred_txt in zip(gold_texts, pred_texts, strict=True):
```
🧰 Tools
🪛 Ruff (0.14.14): B905 — `zip()` without an explicit `strict=` parameter (line 41).
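For readers skimming the review, a hypothetical sketch of how per-entity precision/recall/F1 can be derived from a tp/fp/fn stats dict shaped like the one above; the helper name and report schema are illustrative, not the PR's actual code:

```python
from collections import defaultdict


def finalize_entity_report(stats: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """Turn per-entity tp/fp/fn counts into precision, recall and F1."""
    report = {}
    for entity, counts in stats.items():
        tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[entity] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Example with a stats dict shaped like the one in entity_metrics.py
stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
stats["EMAIL"]["tp"] += 3
stats["EMAIL"]["fp"] += 1
print(finalize_entity_report(stats))  # {'EMAIL': {'precision': 0.75, 'recall': 1.0, 'f1': ...}}
```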
`backend/app/eval/pii/run.py` (lines 29–31):
```python
# ---- Save outputs ----
write_csv(df, OUT_DIR / "predictions.csv")
```
Avoid exporting raw PII in predictions artifacts.
predictions.csv currently includes source_text, which likely contains raw PII. If these artifacts are shared or stored, this becomes a compliance/privacy risk. Consider excluding raw text (or gating it behind an explicit flag).
🔒 Proposed safer export
```diff
-# ---- Save outputs ----
-write_csv(df, OUT_DIR / "predictions.csv")
+# ---- Save outputs ----
+# Avoid exporting raw PII by default
+write_csv(df.drop(columns=["source_text"]), OUT_DIR / "predictions.csv")
```
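A sketch of the flag-gated export the comment suggests; the column list and the `keep_raw_text` parameter are assumptions for illustration and would need to match the actual DataFrame schema:

```python
import pandas as pd

# Columns that may carry raw PII; drop them unless a caller explicitly opts in.
RAW_TEXT_COLUMNS = ("source_text", "text", "raw_text")


def export_predictions(df: pd.DataFrame, path, keep_raw_text: bool = False) -> None:
    """Write predictions to CSV, excluding raw-text columns by default."""
    if not keep_raw_text:
        df = df.drop(columns=[c for c in RAW_TEXT_COLUMNS if c in df.columns])
    df.to_csv(path, index=False)
```

run.py would then call `export_predictions(df, OUT_DIR / "predictions.csv")`, so raw text is only written when someone deliberately passes `keep_raw_text=True`.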
| "pandas>=2.3.2", | ||
| "numpy>=1.24.0", | ||
| "python-dotenv<2.0.0,>=1.0.0", | ||
| "scikit-learn", |
🧩 Analysis chain: scikit-learn is not imported anywhere under backend/, and recent releases (1.6 supports Python 3.9–3.13; 1.7 requires 3.10+; 1.8 requires 3.11+) make a bounded specifier practical.
Remove unbounded scikit-learn from production dependencies.
scikit-learn is not imported anywhere in the backend codebase, indicating it's used only in offline evaluation scripts. Keep it out of the main dependencies list to avoid bloating the production runtime. Either move it to dev-dependencies (if used in committed eval scripts) or an optional extra [project.optional-dependencies] (if installed separately for offline tooling).
If it remains in production deps, add version bounds for consistency with the rest of the project:
- "scikit-learn",
+ "scikit-learn>=1.6.0,<2.0.0",📝 Committable suggestion
Summary
Target issue is #7.
This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. It makes it straightforward to benchmark validators such as PII Remover and Lexical Slur Detection.
Added standalone evaluation runners under app/eval/ for:
- Lexical Slur Detection (app/eval/lexical_slur/run.py)
- PII Remover (app/eval/pii/run.py)
Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
Each validator produces:
- predictions.csv with row-level inputs, predictions, and labels
- metrics.json with evaluation metrics and performance stats (latency + peak memory)
Standardized output structure: each runner writes its artifacts to its own OUT_DIR.
How to test
Run the offline evaluation script:
```bash
python app/eval/lexical_slur/run.py
```
Expected outputs:
- predictions.csv contains row-level inputs, predictions, and labels.
- metrics.json contains binary classification metrics and performance stats (latency + peak memory).
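For a quick sanity check after the run, a small hypothetical snippet that reads metrics.json and prints the headline numbers; the key names follow the `write_json` structure shown in the review above, and the path is a placeholder for the runner's actual OUT_DIR:

```python
import json
from pathlib import Path

report = json.loads(Path("OUT_DIR/metrics.json").read_text())  # replace with the real output dir
print(report["guardrail"], "on", report["num_samples"], "samples")
print("metrics:", report["metrics"])
lat = report["performance"]["latency_ms"]
print(f"latency ms: mean={lat['mean']:.2f} p95={lat['p95']:.2f} max={lat['max']:.2f}")
print("peak memory (MB):", report["performance"]["memory_mb"])
```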
Run the PII evaluation script:
```bash
python app/eval/pii/run.py
```
Expected outputs:
- predictions.csv contains original text, anonymized output, and ground-truth masked text.
- metrics.json contains entity-level precision, recall, and F1 per PII type.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
- Evaluation framework for testing validators, with performance profiling and metrics computation.
- Utility helpers for I/O operations.
Chores
- Test fixtures refactored to use monkeypatch.
- Dependency and test configuration updates.