Added testing setup #17
base: main
📝 Walkthrough
This PR introduces a new evaluation framework for testing validators with performance profiling and metrics computation, adds utility helpers for I/O operations, refactors test fixtures to use monkeypatch, and includes dependency and test configuration updates.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Dataset as Dataset<br/>(CSV)
    participant Validator as Validator<br/>(LexicalSlur/PII)
    participant Profiler as Profiler<br/>(Context Mgr)
    participant Metrics as Metrics<br/>Computation
    participant Output as Output<br/>(CSV/JSON)
    Dataset->>Validator: Load samples (commentText)
    loop Per sample
        Profiler->>Profiler: __enter__() start tracemalloc
        Profiler->>Validator: record(validate, text)
        Validator->>Validator: Execute validation
        Profiler->>Profiler: Measure latency (ms)
        Profiler->>Profiler: Store in latencies[]
    end
    Profiler->>Profiler: __exit__() capture peak_memory_mb
    Validator->>Metrics: y_true, y_pred (FailResult → 1/0)
    Metrics->>Metrics: compute_binary_metrics()
    Metrics->>Metrics: Calculate tp/tn/fp/fn, precision, recall, F1
    Metrics->>Output: Write predictions.csv (samples + preds)
    Metrics->>Output: Write metrics.json (stats + latency + memory)
```
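To make the diagram concrete, here is a minimal sketch of a profiler with the interface the diagram implies (`__enter__`/`__exit__`, `record()`, `latencies`, `peak_memory_mb`); the actual implementation in the PR may differ:

```python
import time
import tracemalloc


class Profiler:
    """Context manager that records per-call latency and peak memory (sketch)."""

    def __enter__(self):
        self.latencies = []          # per-sample latencies in milliseconds
        self.peak_memory_mb = 0.0
        tracemalloc.start()
        return self

    def record(self, fn, *args, **kwargs):
        # Time a single validator call and return its result unchanged.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append((time.perf_counter() - start) * 1000.0)
        return result

    def __exit__(self, exc_type, exc, tb):
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.peak_memory_mb = peak / (1024 * 1024)
        return False
```

A runner would then wrap the per-sample loop in `with Profiler() as p:` and call `p.record(validator.validate, text)` for each row, matching the flow shown above.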
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
* Validators: Added validator logs and organized code
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.
In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.
In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.
In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.
In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)
6-6: Make the marker filter configurable for full-suite runs. Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.
♻️ Proposed tweak
```diff
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"
```
`backend/app/eval/common/metrics.py` (lines 1–5):
```python
def compute_binary_metrics(y_true, y_pred):
    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))
```
🧩 Analysis chain: located compute_binary_metrics in backend/app/eval/common/metrics.py and confirmed backend/pyproject.toml requires Python ≥ 3.10.
Add strict=True to zip() calls to prevent silent truncation.
The zip() calls on lines 2–5 will silently drop items if y_true and y_pred have different lengths, producing incorrect metrics. Since the project requires Python ≥3.10, use strict=True to raise a ValueError immediately on length mismatch.
Proposed fix
```diff
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred))
+    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
+    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
```
🧰 Tools
🪛 Ruff (0.14.14): B905 — `zip()` without an explicit `strict=` parameter (lines 2–5).
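As a quick illustration of the failure mode (a hypothetical snippet, not code from this PR): with mismatched lengths, plain `zip()` silently drops the unmatched sample, whereas `strict=True` raises immediately.

```python
y_true = [1, 0, 1]
y_pred = [1, 0]  # one prediction missing, e.g. a sample that errored out

# Silent truncation: the third sample is dropped and the counts still look plausible.
print(sum(yt == yp for yt, yp in zip(y_true, y_pred)))  # 2

# strict=True surfaces the mismatch as soon as the shorter iterable is exhausted.
try:
    sum(yt == yp for yt, yp in zip(y_true, y_pred, strict=True))
except ValueError as err:
    print(err)  # "zip() argument 2 is shorter than argument 1"
```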
| "performance": { | ||
| "latency_ms": { | ||
| "mean": sum(p.latencies) / len(p.latencies), | ||
| "p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)], | ||
| "max": max(p.latencies), | ||
| }, |
Guard latency stats for empty datasets.
If the dataset is empty, mean, p95, and max will raise. A small guard avoids failures in edge cases.
🛠 Proposed guard
```diff
+latencies = p.latencies
+if latencies:
+    latency_stats = {
+        "mean": sum(latencies) / len(latencies),
+        "p95": sorted(latencies)[int(len(latencies) * 0.95)],
+        "max": max(latencies),
+    }
+else:
+    latency_stats = {"mean": 0.0, "p95": 0.0, "max": 0.0}
+
 write_json(
     {
         "guardrail": "lexical_slur",
         "num_samples": len(df),
         "metrics": metrics,
         "performance": {
             "latency_ms": {
-                "mean": sum(p.latencies) / len(p.latencies),
-                "p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
-                "max": max(p.latencies),
+                **latency_stats,
             },
             "memory_mb": p.peak_memory_mb,
         },
     },
     OUT_DIR / "metrics.json",
 )
```
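If the guard above feels noisy inline, the same logic could live in a small helper; this is a sketch under the review's assumptions (per-sample latencies in milliseconds on `p.latencies`), not code from the PR:

```python
def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Mean/p95/max latency stats, with safe zeros for an empty run."""
    if not latencies_ms:
        return {"mean": 0.0, "p95": 0.0, "max": 0.0}
    ordered = sorted(latencies_ms)
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": ordered[int(len(ordered) * 0.95)],  # same index convention as the original code
        "max": ordered[-1],
    }
```

The `"performance"` block then reduces to `"latency_ms": summarize_latencies(p.latencies)`.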
| """ | ||
| stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0}) | ||
|
|
||
| for gold_txt, pred_txt in zip(gold_texts, pred_texts): |
🧩 Analysis chain: inspected entity_metrics.py and the callers of compute_entity_metrics, and confirmed backend/pyproject.toml requires Python ≥ 3.10.
Add strict=True to zip to catch length mismatches.
zip() without strict=True silently truncates when iterables have unequal lengths, which would produce incorrect entity metrics. Since the project requires Python 3.10+, use strict=True.
🔧 Proposed fix
```diff
-    for gold_txt, pred_txt in zip(gold_texts, pred_texts):
+    for gold_txt, pred_txt in zip(gold_texts, pred_texts, strict=True):
```
🧰 Tools
🪛 Ruff (0.14.14): B905 — `zip()` without an explicit `strict=` parameter (line 41).
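For readers skimming the review, a hypothetical sketch of how per-entity precision/recall/F1 can be derived from a tp/fp/fn stats dict shaped like the one above; the helper name and report schema are illustrative, not the PR's actual code:

```python
from collections import defaultdict


def finalize_entity_report(stats: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """Turn per-entity tp/fp/fn counts into precision, recall and F1."""
    report = {}
    for entity, counts in stats.items():
        tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[entity] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Example with a stats dict shaped like the one in entity_metrics.py
stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
stats["EMAIL"]["tp"] += 3
stats["EMAIL"]["fp"] += 1
print(finalize_entity_report(stats))  # {'EMAIL': {'precision': 0.75, 'recall': 1.0, 'f1': ...}}
```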
`backend/app/eval/pii/run.py` (lines 29–31):
```python
# ---- Save outputs ----
write_csv(df, OUT_DIR / "predictions.csv")
```
Avoid exporting raw PII in predictions artifacts.
predictions.csv currently includes source_text, which likely contains raw PII. If these artifacts are shared or stored, this becomes a compliance/privacy risk. Consider excluding raw text (or gating it behind an explicit flag).
🔒 Proposed safer export
```diff
-# ---- Save outputs ----
-write_csv(df, OUT_DIR / "predictions.csv")
+# ---- Save outputs ----
+# Avoid exporting raw PII by default
+write_csv(df.drop(columns=["source_text"]), OUT_DIR / "predictions.csv")
```
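A sketch of the flag-gated export the comment suggests; the column list and the `keep_raw_text` parameter are assumptions for illustration and would need to match the actual DataFrame schema:

```python
import pandas as pd

# Columns that may carry raw PII; drop them unless a caller explicitly opts in.
RAW_TEXT_COLUMNS = ("source_text", "text", "raw_text")


def export_predictions(df: pd.DataFrame, path, keep_raw_text: bool = False) -> None:
    """Write predictions to CSV, excluding raw-text columns by default."""
    if not keep_raw_text:
        df = df.drop(columns=[c for c in RAW_TEXT_COLUMNS if c in df.columns])
    df.to_csv(path, index=False)
```

run.py would then call `export_predictions(df, OUT_DIR / "predictions.csv")`, so raw text is only written when someone deliberately passes `keep_raw_text=True`.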
| "pandas>=2.3.2", | ||
| "numpy>=1.24.0", | ||
| "python-dotenv<2.0.0,>=1.0.0", | ||
| "scikit-learn", |
🧩 Analysis chain: scikit-learn is not imported anywhere under backend/, and recent releases (1.6 supports Python 3.9–3.13; 1.7 requires 3.10+; 1.8 requires 3.11+) make a bounded specifier practical.
Remove unbounded scikit-learn from production dependencies.
scikit-learn is not imported anywhere in the backend codebase, indicating it's used only in offline evaluation scripts. Keep it out of the main dependencies list to avoid bloating the production runtime. Either move it to dev-dependencies (if used in committed eval scripts) or an optional extra [project.optional-dependencies] (if installed separately for offline tooling).
If it remains in production deps, add version bounds for consistency with the rest of the project:
- "scikit-learn",
+ "scikit-learn>=1.6.0,<2.0.0",📝 Committable suggestion
Summary
Target issue is #7.
This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. It makes it straightforward to benchmark validators such as PII Remover and Lexical Slur Detection.
Added standalone evaluation runners under app/eval/ for:
- Lexical Slur Detection (app/eval/lexical_slur/run.py)
- PII Remover (app/eval/pii/run.py)
Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
Each validator produces:
- predictions.csv with row-level inputs, predictions, and labels
- metrics.json with evaluation metrics and performance stats (latency + peak memory)
Standardized output structure: each runner writes its artifacts to its own OUT_DIR.
How to test
Run the offline evaluation script:
```bash
python app/eval/lexical_slur/run.py
```
Expected outputs:
- predictions.csv contains row-level inputs, predictions, and labels.
- metrics.json contains binary classification metrics and performance stats (latency + peak memory).
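For a quick sanity check after the run, a small hypothetical snippet that reads metrics.json and prints the headline numbers; the key names follow the `write_json` structure shown in the review above, and the path is a placeholder for the runner's actual OUT_DIR:

```python
import json
from pathlib import Path

report = json.loads(Path("OUT_DIR/metrics.json").read_text())  # replace with the real output dir
print(report["guardrail"], "on", report["num_samples"], "samples")
print("metrics:", report["metrics"])
lat = report["performance"]["latency_ms"]
print(f"latency ms: mean={lat['mean']:.2f} p95={lat['p95']:.2f} max={lat['max']:.2f}")
print("peak memory (MB):", report["performance"]["memory_mb"])
```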
Run the PII evaluation script:
```bash
python app/eval/pii/run.py
```
Expected outputs:
- predictions.csv contains original text, anonymized output, and ground-truth masked text.
- metrics.json contains entity-level precision, recall, and F1 per PII type.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
- Evaluation framework for testing validators, with performance profiling and metrics computation.
- Utility helpers for I/O operations.
Chores
- Test fixtures refactored to use monkeypatch.
- Dependency and test configuration updates.