🧪

LLM Eval Kit

10 quality checks. LLM-as-judge. Multi-model comparison.
Score any LLM output. Zero API keys required.

10 Quality Checks · 3 LLM Judges · 43 Tests Passing · 0 API Keys Needed

Demo: paste an LLM output, click Score Output, and get a scorecard. Compare Models runs the same prompt through Model A and Model B side by side.

All 10 Checks + 3 Judges

🔍 Hallucination: hedging, fake citations, cutoff refs (rule-based)
📝 Placeholder: {{VAR}}, [TBD], Lorem ipsum (rule-based)
🤖 Style: AI tells such as "delve", "tapestry" (rule-based)
📅 Freshness: stale year references (rule-based)
📏 Length: too short or too long (rule-based)
🔒 PII: emails, SSNs, API keys, tokens (rule-based)
☠️ Toxicity: violence, insults, profanity (rule-based)
{ } JSON Validity: valid JSON? matches schema? correct types? (rule-based)
✅ Completeness: all prompt parts addressed? (rule-based)
⚖️ Consistency: self-contradictions detected (rule-based)
🧑‍⚖️ LLM Judge: G-Eval, criteria → CoT → score (LLM judge)
⚔️ Pairwise: compare A vs B, pick the winner (LLM judge)
📊 Rubric: score against defined levels (LLM judge)
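Rule-based checks like Placeholder reduce to a regex scan over the output. A minimal sketch of the idea, using illustrative patterns and names rather than llm-eval-kit's actual rule set:

```python
import re

# Illustrative placeholder patterns: template vars, TODO markers, filler text.
# Not the package's real rules, just the shape of a rule-based check.
PLACEHOLDER_PATTERNS = [
    re.compile(r"\{\{\s*[A-Z_]+\s*\}\}"),                # {{VAR}}
    re.compile(r"\[(?:TBD|TODO|FIXME)\]", re.IGNORECASE),  # [TBD]
    re.compile(r"lorem ipsum", re.IGNORECASE),             # filler text
]

def check_placeholder(text: str) -> dict:
    """Return a pass/fail verdict plus every placeholder fragment found."""
    hits = [m.group(0) for p in PLACEHOLDER_PATTERNS for m in p.finditer(text)]
    return {"check": "placeholder", "passed": not hits, "hits": hits}

print(check_placeholder("Dear {{NAME}}, see [TBD] for details."))
# passed=False, hits=['{{NAME}}', '[TBD]']
```

The same pattern (scan, collect hits, emit a verdict dict) generalizes to the PII, Style, and Freshness checks with different regex lists.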
pip install llm-eval-kit
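The three LLM judges come down to prompt construction plus a model call. A minimal sketch of the Pairwise judge's prompt assembly; the function name and wording are hypothetical, not the package's API:

```python
def build_pairwise_prompt(prompt: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-judge prompt: the judge model replies 'A' or 'B'."""
    return (
        "You are a strict evaluator. Given a prompt and two candidate answers,\n"
        "decide which answer is better. Reply with exactly one letter: A or B.\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Better answer:"
    )
```

Sending this prompt to any chat model and parsing the single-letter reply yields the "compare A vs B, pick the winner" verdict.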