Toxicity Fairness Benchmark
Demographic fairness evaluation of commercial toxicity APIs · HateXplain dataset, 1,000-sample draw, seed 42
Benchmark results not found. Run python scripts/run_benchmark.py --sample 1000 --models perspective claude to generate the data.
Overall Accuracy
Percentage of correct predictions across all samples
Accuracy by Group
Prediction accuracy per demographic subgroup within the selected protected attribute
Fairness Summary
Gap metrics across demographic subgroups — smaller values indicate greater fairness
Gap metrics only include subgroups with at least 5 examples of each class. HateXplain's demographic annotations skew heavily toward toxic examples, so many identity slices lack enough non-toxic examples to produce reliable FPR and TPR estimates. Subgroups that don't meet this threshold are listed in the warning below when applicable, and appear as n/a in the table.
Acc. Gap = maximum accuracy difference across subgroups · DP Gap = maximum difference in positive-prediction rate (demographic parity) · TPR/FPR Gap = maximum differences in true/false positive rates (equalized odds components)
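The gap metrics above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation: the record keys (group, label, pred) are assumed names, and the 5-examples-per-class threshold matches the filtering rule described above.

```python
from collections import defaultdict

MIN_PER_CLASS = 5  # subgroups need >= 5 toxic and >= 5 non-toxic examples


def gap_metrics(records):
    """records: iterable of dicts with boolean 'group' slices:
    {'group': str, 'label': bool, 'pred': bool}.
    Returns max-minus-min gaps for accuracy, positive-prediction rate
    (demographic parity), TPR, and FPR across subgroups. TPR/FPR gaps
    are None (shown as n/a) when fewer than two subgroups meet the
    per-class minimum."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r)

    acc, dp, tpr, fpr = {}, {}, {}, {}
    for g, rows in by_group.items():
        n = len(rows)
        acc[g] = sum(r["pred"] == r["label"] for r in rows) / n
        dp[g] = sum(r["pred"] for r in rows) / n
        pos = [r for r in rows if r["label"]]       # toxic examples
        neg = [r for r in rows if not r["label"]]   # non-toxic examples
        if len(pos) >= MIN_PER_CLASS and len(neg) >= MIN_PER_CLASS:
            tpr[g] = sum(r["pred"] for r in pos) / len(pos)
            fpr[g] = sum(r["pred"] for r in neg) / len(neg)

    def gap(d):
        # A gap needs at least two qualifying subgroups to compare.
        return max(d.values()) - min(d.values()) if len(d) >= 2 else None

    return {"acc_gap": gap(acc), "dp_gap": gap(dp),
            "tpr_gap": gap(tpr), "fpr_gap": gap(fpr)}
```

Subgroups excluded by the threshold still contribute to the accuracy and DP gaps (which need no per-class minimum) but drop out of the TPR/FPR gaps, which is why those cells can read n/a while the others are populated.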
False Positive vs. False Negative Rate
Each point is one subgroup; bubble size = sample count. Tightly clustered points indicate approximately equalized odds across subgroups; points near the origin indicate low error rates overall.
Live Scorer
Score any text against all available models in real time using live API calls
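As one concrete example of a live call, the sketch below scores a text with Google's public Perspective API (the response path attributeScores → TOXICITY → summaryScore → value follows Perspective's documented schema). This is illustrative only; the dashboard's own scorer, its model list, and its key handling are not shown here, and the function names are assumptions.

```python
import json
import urllib.request

PERSPECTIVE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
)


def build_request(text, attributes=("TOXICITY",)):
    """Build the JSON body for a Perspective comments:analyze call."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {a: {} for a in attributes},
    }


def score_text(text, api_key):
    """POST the text to Perspective and return the toxicity summary score
    (a probability-like value in [0, 1]). Requires a valid API key."""
    body = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        f"{PERSPECTIVE_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Keeping the request construction separate from the network call makes the payload easy to unit-test without spending API quota.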