Toxicity Fairness Benchmark
Demographic fairness evaluation of commercial toxicity APIs · HateXplain dataset, 1,000-sample draw, seed 42
Benchmark results not found. Run python scripts/run_benchmark.py --sample 1000 --models perspective claude to generate the data.
Overall Accuracy
Percentage of correct predictions across all samples
Accuracy by Group
Prediction accuracy per demographic subgroup within the selected protected attribute
Fairness Summary
Gap metrics across demographic subgroups — smaller values indicate greater fairness
Gap metrics only include subgroups with at least 5 examples of each class. HateXplain's demographic annotations skew heavily toward toxic examples, so many identity slices lack enough non-toxic examples to produce reliable FPR and TPR estimates. Subgroups that don't meet this threshold are listed in the warning below when applicable, and appear as n/a in the table.
Acc. Gap = maximum accuracy difference across subgroups · DP Gap = maximum difference in positive-prediction rate (demographic parity) · TPR/FPR Gap = maximum differences in true/false positive rates (equalized odds components)
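The gap metrics above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation: the record keys (group, label, pred) are assumed names, and the 5-examples-per-class threshold matches the filtering rule described above.

```python
from collections import defaultdict

MIN_PER_CLASS = 5  # subgroups need >= 5 toxic and >= 5 non-toxic examples


def gap_metrics(records):
    """records: iterable of dicts with boolean 'group' slices:
    {'group': str, 'label': bool, 'pred': bool}.
    Returns max-minus-min gaps for accuracy, positive-prediction rate
    (demographic parity), TPR, and FPR across subgroups. TPR/FPR gaps
    are None (shown as n/a) when fewer than two subgroups meet the
    per-class minimum."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r)

    acc, dp, tpr, fpr = {}, {}, {}, {}
    for g, rows in by_group.items():
        n = len(rows)
        acc[g] = sum(r["pred"] == r["label"] for r in rows) / n
        dp[g] = sum(r["pred"] for r in rows) / n
        pos = [r for r in rows if r["label"]]       # toxic examples
        neg = [r for r in rows if not r["label"]]   # non-toxic examples
        if len(pos) >= MIN_PER_CLASS and len(neg) >= MIN_PER_CLASS:
            tpr[g] = sum(r["pred"] for r in pos) / len(pos)
            fpr[g] = sum(r["pred"] for r in neg) / len(neg)

    def gap(d):
        # A gap needs at least two qualifying subgroups to compare.
        return max(d.values()) - min(d.values()) if len(d) >= 2 else None

    return {"acc_gap": gap(acc), "dp_gap": gap(dp),
            "tpr_gap": gap(tpr), "fpr_gap": gap(fpr)}
```

Subgroups excluded by the threshold still contribute to the accuracy and DP gaps (which need no per-class minimum) but drop out of the TPR/FPR gaps, which is why those cells can read n/a while the others are populated.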
False Positive vs. False Negative Rate
Each point is one subgroup; bubble size = sample count. Tightly clustered points indicate approximately equalized odds across subgroups; points near the origin indicate low error rates overall.
Live Scorer
Score any text against all available models in real time using live API calls
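As one concrete example of a live call, the sketch below scores a text with Google's public Perspective API (the response path attributeScores → TOXICITY → summaryScore → value follows Perspective's documented schema). This is illustrative only; the dashboard's own scorer, its model list, and its key handling are not shown here, and the function names are assumptions.

```python
import json
import urllib.request

PERSPECTIVE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
)


def build_request(text, attributes=("TOXICITY",)):
    """Build the JSON body for a Perspective comments:analyze call."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {a: {} for a in attributes},
    }


def score_text(text, api_key):
    """POST the text to Perspective and return the toxicity summary score
    (a probability-like value in [0, 1]). Requires a valid API key."""
    body = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        f"{PERSPECTIVE_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Keeping the request construction separate from the network call makes the payload easy to unit-test without spending API quota.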