ComplyHat runs zero internal LLM calls. Your host agent brings its own reasoning; ComplyHat runs the deterministic statistical methods below and returns structured, audit-tagged citations. Every metric value, threshold, pass/fail ruling, dataset row count, subgroup size, engine version, and random seed is persisted with the report — a third party can re-derive every finding from the same inputs.
Documentation Index
Fetch the complete documentation index at: https://docs.complyhat.ai/llms.txt
Use this file to discover all available pages before exploring further.
Bias
Four fairness metrics. Each runs against a tabular dataset with an outcome column, a protected-class column, and — for two of them — a ground-truth column. All return a pass/fail ruling against a configurable threshold; the defaults below trace to legal or academic sources.
Before any of the four tests run, a data-quality gate checks subgroup sample sizes (warns if n < 30), class imbalance (warns if the smallest subgroup is under 5% of the dataset), and missing values (warns if more than 10% of rows are missing the protected-class column). Warnings are carried into the report so a reviewer can assess whether a pass ruling is statistically meaningful.
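A minimal sketch of that gate logic, assuming a pandas DataFrame input; the function name and warning strings are illustrative, not ComplyHat's API.

```python
import pandas as pd

def data_quality_gate(df: pd.DataFrame, protected_col: str) -> list[str]:
    """Illustrative pre-test checks mirroring the thresholds above."""
    warnings = []
    # Missing-value check: warn if > 10% of rows lack the protected-class column.
    missing_frac = df[protected_col].isna().mean()
    if missing_frac > 0.10:
        warnings.append(f"{missing_frac:.0%} of rows missing {protected_col!r}")
    counts = df[protected_col].value_counts(dropna=True)
    # Sample-size check: warn on any subgroup with n < 30.
    for group, n in counts.items():
        if n < 30:
            warnings.append(f"subgroup {group!r} has only n={n} rows")
    # Class-imbalance check: warn if the smallest subgroup is under 5% of the dataset.
    if len(counts) and counts.min() / len(df) < 0.05:
        warnings.append(f"smallest subgroup is {counts.min() / len(df):.1%} of the dataset")
    return warnings
```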
Disparate impact (Four-Fifths Rule)
For each subgroup g, compute the favorable rate favorable(g) / total(g). The reference group is the subgroup with the highest favorable rate. The adverse impact ratio for any other subgroup is rate(g) / rate(reference); a ratio below 0.80 fails the four-fifths rule.
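A short sketch of the computation, assuming per-group favorable and total counts; the function name is illustrative.

```python
def adverse_impact_ratios(favorable: dict[str, int], total: dict[str, int]) -> dict[str, float]:
    """Four-fifths rule: each subgroup's favorable rate relative to the
    reference (highest-rate) subgroup. Ratios below 0.80 flag disparate impact."""
    rates = {g: favorable[g] / total[g] for g in total}
    reference = max(rates.values())
    return {g: rate / reference for g, rate in rates.items()}

# Example: rates are A = 0.50, B = 0.35, so B's ratio is 0.70, below the 0.80 line.
ratios = adverse_impact_ratios({"A": 50, "B": 35}, {"A": 100, "B": 100})
```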
Statistical parity
Compares raw favorable rates across subgroups against the configured threshold. Like the four-fifths test, and unlike equal opportunity, it requires no ground-truth labels.
Equal opportunity
True positive rate per subgroup: fails when min(TPR) / max(TPR) < 0.80. Requires ground-truth labels — ComplyHat skips this test automatically when the model is deployed but not yet validated against outcomes.
Source: Hardt, Price, Srebro. Equality of Opportunity in Supervised Learning. NeurIPS 2016.
Predictive parity
Positive predictive value per subgroup: fails when max(PPV) − min(PPV) > 0.10.
Predictive parity and equal opportunity cannot both hold when base rates differ across groups (Chouldechova 2016). ComplyHat reports both metrics and lets the audit context determine which matters for your use case. Do not suppress either metric from the report.
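A combined sketch of both label-dependent metrics, assuming NumPy arrays of binary ground truth, binary predictions, and group labels; names and array conventions are illustrative.

```python
import numpy as np

def tpr_ppv_by_group(y_true, y_pred, groups):
    """Per-subgroup true positive rate and positive predictive value,
    plus the pass/fail conditions from the default thresholds above."""
    stats = {}
    for g in np.unique(groups):
        m = groups == g
        tp = np.sum((y_pred == 1) & (y_true == 1) & m)
        stats[g] = (
            tp / max(np.sum((y_true == 1) & m), 1),   # TPR (guard empty denominator)
            tp / max(np.sum((y_pred == 1) & m), 1),   # PPV
        )
    tprs = [t for t, _ in stats.values()]
    ppvs = [p for _, p in stats.values()]
    equal_opportunity_fail = min(tprs) / max(tprs) < 0.80
    predictive_parity_fail = max(ppvs) - min(ppvs) > 0.10
    return stats, equal_opportunity_fail, predictive_parity_fail
```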
Drift
Drift testing compares a baseline distribution (typically training data) against a production distribution. Two methods form the standard pair; two more are available when the standard pair falls short.
Population Stability Index (PSI)
For each bin i, with baseline proportion b_i and production proportion p_i, sum (p_i − b_i) × ln(p_i / b_i) over all bins.
| PSI range | Interpretation |
|---|---|
| < 0.10 | No material change |
| 0.10 – 0.25 | Moderate drift — monitor |
| >= 0.25 | Significant drift — investigate |
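A minimal PSI sketch with quantile binning, assuming a continuous numeric feature (quantile edges can collide on heavily tied data); the binning strategy and epsilon handling are illustrative choices, not ComplyHat's exact implementation.

```python
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index over quantile bins fit on the baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range production values
    b = np.histogram(baseline, bins=edges)[0].astype(float)
    p = np.histogram(production, bins=edges)[0].astype(float)
    eps = 1e-6                                      # avoid log(0) on empty bins
    b = np.clip(b / b.sum(), eps, None)
    p = np.clip(p / p.sum(), eps, None)
    return float(np.sum((p - b) * np.log(p / b)))
```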
Kolmogorov-Smirnov test
Flags drift when p < 0.05 and KS > 0.10. The dual gate is intentional: with a large enough production sample, any real numeric feature will produce a statistically significant result even when the KS statistic itself is trivially small. Both conditions must hold.
Source: Massey. The Kolmogorov-Smirnov Test for Goodness of Fit. JASA 1951. Standard in model risk since Federal Reserve SR 11-7 (2011) required ongoing monitoring of input data.
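A sketch of the dual gate using SciPy's two-sample KS test; the threshold constants mirror the defaults above, and the function name is illustrative.

```python
from scipy.stats import ks_2samp

def ks_drift(baseline, production, alpha: float = 0.05, min_stat: float = 0.10) -> bool:
    """Flag drift only when the result is both statistically and practically significant."""
    result = ks_2samp(baseline, production)
    return result.pvalue < alpha and result.statistic > min_stat
```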
Additional methods
Jensen-Shannon divergence (bounded in [0, 1], useful when PSI is numerically unstable) and chi-squared (for categorical features) are also available. Reports include all metric values that ran, so an auditor sees the full picture regardless of which methods triggered a flag.
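A sketch of that fallback pair via SciPy. Note that scipy's jensenshannon returns the distance (the square root of the divergence), so it is squared here; function names and binning conventions are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chisquare

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Base-2 Jensen-Shannon divergence between two binned distributions, in [0, 1]."""
    return float(jensenshannon(p, q, base=2) ** 2)

def categorical_drift_pvalue(baseline_counts: np.ndarray, production_counts: np.ndarray) -> float:
    """Chi-squared test of production category counts against baseline frequencies."""
    expected = baseline_counts / baseline_counts.sum() * production_counts.sum()
    return float(chisquare(production_counts, f_exp=expected).pvalue)
```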
Explainability
Two model-agnostic local explainers. Both return per-feature attribution scores for a single prediction. You pass in the decision plus a set of neighbor or background decisions with their precomputed outcomes — ComplyHat does not call your model's prediction function f.
LIME with intercept
Each neighbor is weighted by an exponential kernel of its Euclidean distance to the target decision in feature space. ComplyHat fits a weighted least-squares linear surrogate against the full design matrix — a leading column of ones plus the feature columns. The first coefficient is the intercept; the remaining coefficients are the per-feature slopes returned as local attributions. Without the intercept, the surrogate is forced through the origin, which biases slope estimates whenever the neighborhood mean is offset from zero. ComplyHat returns the intercept alongside the slopes so reviewers can audit it. Defaults: kernel width 0.75; up to 50,000 neighbors retained.
Source: Ribeiro, Singh, Guestrin. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. KDD 2016.
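A minimal sketch of the weighted least-squares surrogate with an explicit intercept column, assuming NumPy arrays of neighbor features and precomputed outcomes; the exact kernel form is an assumption based on the description above.

```python
import numpy as np

def lime_surrogate(x: np.ndarray, neighbors: np.ndarray, outcomes: np.ndarray,
                   kernel_width: float = 0.75):
    """Fit a locally weighted linear surrogate; returns (intercept, slopes)."""
    d = np.linalg.norm(neighbors - x, axis=1)            # Euclidean distance to target
    w = np.exp(-(d ** 2) / kernel_width ** 2)            # exponential kernel (assumed form)
    X = np.hstack([np.ones((len(neighbors), 1)), neighbors])  # leading column of ones
    sw = np.sqrt(w)[:, None]
    # Weighted least squares via the sqrt-weight trick on the full design matrix.
    coef, *_ = np.linalg.lstsq(X * sw, outcomes * sw.ravel(), rcond=None)
    return coef[0], coef[1:]                              # intercept, per-feature slopes
```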
Coalition-attribution proxy
Feature coalitions are enumerated (small feature sets) or sampled (large). Each coalition S of size |S| out of M features is weighted by the Kernel-SHAP kernel, (M − 1) / (C(M, |S|) × |S| × (M − |S|)). Exact Kernel SHAP would evaluate f on the masked vector; because ComplyHat cannot call f, the per-coalition outcome is approximated from the precomputed outcomes of background decisions that agree with the target on the coalition's retained features. The result is labeled coalition_attribution and must not be presented to a regulator as Shapley values.
Defaults: up to 50,000 coalitions; 10,000 background decisions retained.
Background: Lundberg, Lee. A Unified Approach to Interpreting Model Predictions. NeurIPS 2017. The ComplyHat proxy implementation is not a substitute for that method.
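A sketch of the coalition weighting and a matching-based outcome proxy, assuming boolean masks over M numeric features; the matching rule is an illustrative stand-in for ComplyHat's actual approximation.

```python
import numpy as np
from itertools import combinations
from math import comb

def shap_kernel_weight(M: int, s: int) -> float:
    """Kernel-SHAP weight for a coalition of size s out of M features.
    Empty and full coalitions get infinite weight in exact SHAP; skip them here."""
    return (M - 1) / (comb(M, s) * s * (M - s))

def coalition_outcome_proxy(x, mask, background, bg_outcomes):
    """Proxy for f(masked x): mean precomputed outcome of background decisions
    that agree with x on the coalition's retained (unmasked) features."""
    agree = np.all(np.isclose(background[:, mask], x[mask]), axis=1)
    return bg_outcomes[agree].mean() if agree.any() else bg_outcomes.mean()

# Enumerate all proper, non-empty coalitions for a small feature set (M = 4).
M = 4
masks = [np.isin(np.arange(M), c) for s in range(1, M) for c in combinations(range(M), s)]
```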
Completeness check
Both explainers report a completeness score: how closely the sum of attributions matches actual_prediction − baseline_prediction. Scores are in [0, 1]. A low completeness score on a low-sample run signals noisy attributions — treat it as a red flag before the explanation enters an audit trail.
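One plausible way to map the additivity gap onto [0, 1]; the normalization below is an assumption, since the docs specify only the score's range and intent.

```python
import numpy as np

def completeness_score(attributions, actual_prediction, baseline_prediction, eps=1e-9):
    """1.0 when attributions exactly sum to the prediction delta; decays toward 0
    as the relative gap grows. The normalization is illustrative, not ComplyHat's."""
    target = actual_prediction - baseline_prediction
    gap = abs(float(np.sum(attributions)) - target)
    return max(0.0, 1.0 - gap / max(abs(target), eps))
```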
Adversarial robustness
Adversarial testing probes whether a model's prediction is stable under input perturbations. ComplyHat runs two test families.
Boundary robustness
For each test point, find the smallest perturbation — in L-infinity or L2 norm — that flips the model's prediction. ComplyHat reports the median and 10th-percentile perturbation magnitudes across the test set. The pass threshold is use-case-dependent; your audit team sets it from the plausible perturbation range (pixel-noise tolerance for vision models, rounding tolerance for tabular models). Source: Szegedy et al. Intriguing properties of neural networks. ICLR 2014. Method is a black-box variant of Carlini, Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P 2017.
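A sketch of a bisection search for the flip boundary along a fixed random direction, assuming black-box access to a predict function supplied by the host; direction sampling, search bounds, and the predict signature are all illustrative.

```python
import numpy as np

def min_flip_magnitude(x, predict, norm=np.inf, max_eps=1.0, iters=20, seed=0):
    """Bisect on perturbation magnitude along one random direction until the
    label flips; returns max_eps if no flip is found within the search range."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(x.shape)
    direction /= np.linalg.norm(direction, ord=norm)    # unit step in the chosen norm
    base = predict(x)
    lo, hi = 0.0, max_eps
    if predict(x + hi * direction) == base:
        return max_eps                                  # no flip within range
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if predict(x + mid * direction) == base else (lo, mid)
    return hi
```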
Data-quality robustness
ComplyHat injects realistic corruptions — missing values, out-of-range numerics, mistyped categoricals — at controlled rates (1%, 5%, 10%) and reports the delta in prediction distribution per corruption type. This measures graceful degradation under ordinary production-data errors, which is what operational teams actually face. EU AI Act Article 15 (§1, §3) explicitly requires this kind of robustness evidence for high-risk systems.
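A sketch of rate-controlled corruption injection on a pandas DataFrame, shown for the missing-value case; the function name, column choice, and downstream scoring call are illustrative.

```python
import numpy as np
import pandas as pd

def inject_missing(df: pd.DataFrame, column: str, rate: float, seed: int = 0) -> pd.DataFrame:
    """Blank out a controlled fraction of one column, mimicking production data errors."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    hit = rng.random(len(df)) < rate
    corrupted.loc[hit, column] = np.nan
    return corrupted

# Run at each controlled rate, then compare prediction distributions downstream,
# e.g. (hypothetical model object):
# for rate in (0.01, 0.05, 0.10):
#     scored = model.predict(inject_missing(df, "income", rate))
```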
Reproducibility
For every report, ComplyHat persists:
- Metric values and thresholds
- Pass/fail rulings
- Dataset row count and subgroup sizes
- Data-quality warnings
- Engine version
- Random seeds used in sampling steps
ComplyHat runs zero internal LLM calls. Host agents (Claude Code, Codex, custom MCP clients) bring their own reasoning. ComplyHat returns structured citations and audit-tagged prose — never synthesized findings.
Next steps
Supported frameworks
Which metrics each regulator requires, at what cadence, and for which protected classes.
Tool reference
The MCP entry points that invoke these methods, with example requests and responses.