A production-grade open-source framework for evaluating GPT-4, Claude, Gemini, Mistral and Llama on accuracy, latency, cost, hallucination rate, and reasoning quality — all in one place.
From raw benchmarks to beautiful dashboards, REST APIs, PDF reports, and a CLI tool — no duct tape required.
Evaluate hundreds of samples in parallel with a configurable concurrency semaphore. Automatic timeout handling and retry logic built in. (asyncio)
5-page interactive dashboard with radar charts, latency histograms, cost vs quality scatter plots, and one-click PDF export. (streamlit + plotly)
Run the exact same prompts on up to 5 models simultaneously and compare every metric on the same leaderboard. (parallel eval)
10 endpoints with full auto-generated OpenAPI docs at /docs. File upload, PDF generation, and CSV/JSON export, all via HTTP.
7 subcommands (run, compare, results, export, report, serve, dashboard) with rich terminal output, progress bars, and tables. (click + rich)
Auto-generate professional evaluation reports with a cover page, summary table, and per-model detail sections using ReportLab. (reportlab)
Multi-stage Dockerfile with separate API and Dashboard targets. Run docker-compose up and you're running.
Auto-loads MMLU and TruthfulQA from HF Hub with local caching. The dataset is also published on HuggingFace for easy reuse. (datasets library)
Pricing table for 15+ model variants. Track total spend and cost per 1K tokens, and estimate run costs before executing. (cost calculator)
All results are saved to SQLite automatically. Query, filter, export to CSV/JSON, and build comparison charts over time. (sqlite3)
Full pytest test suite covering all modules; no real API keys needed. GitHub Actions CI runs across Python 3.10/3.11/3.12. (pytest + coverage)
Upload your own CSV or JSON dataset from the dashboard, API, or CLI. Just provide prompt + expected columns.
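The cost-tracking feature listed above (a pricing table, cost per 1K tokens, and a pre-run estimate) can be illustrated with a small sketch. The model names, prices, and default token counts below are hypothetical placeholders, and these standalone functions are not the framework's own cost module.

```python
# Hypothetical per-1K-token prices in USD. Placeholders only, not real provider pricing.
PRICING_PER_1K = {
    "example-small": {"input": 0.0005, "output": 0.0015},
    "example-large": {"input": 0.0050, "output": 0.0150},
}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call, priced separately for input and output tokens."""
    price = PRICING_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


def cost_per_1k_tokens(total_cost, total_tokens):
    """Blended cost per 1K tokens across a whole run."""
    return 0.0 if total_tokens == 0 else total_cost / total_tokens * 1000


def estimate_run_cost(model, num_samples, avg_input_tokens=200, avg_output_tokens=50):
    """Pre-run estimate from assumed average token counts per sample."""
    return num_samples * call_cost(model, avg_input_tokens, avg_output_tokens)


print(f"Estimated cost for 100 samples: ${estimate_run_cost('example-small', 100):.4f}")
```

An estimate like this depends entirely on the assumed averages; the exact figure only becomes known once real token counts are available after the run.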
Every metric is computed per sample, fully async, with statistical aggregation and percentile breakdowns.
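As a rough illustration of per-sample metrics rolled up into averages and percentiles, the sketch below aggregates a list of sample records. The `SampleMetrics` type, the `aggregate` function, and the nearest-rank percentile method are assumptions made for the example; the framework may compute percentiles differently.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleMetrics:
    """Hypothetical per-sample record: correctness plus latency."""
    correct: bool
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(samples):
    """Collapse per-sample metrics into run-level statistics."""
    latencies = [s.latency_ms for s in samples]
    return {
        "accuracy": mean([1.0 if s.correct else 0.0 for s in samples]),
        "avg_latency_ms": mean(latencies),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Ten samples with latencies 100..1000 ms; the first eight are correct.
demo = [SampleMetrics(correct=i < 8, latency_ms=100.0 * (i + 1)) for i in range(10)]
print(aggregate(demo))
```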
From config to results in under 2 minutes.
Select any LiteLLM-compatible model and a benchmark (MMLU, TruthfulQA, or custom CSV). Set sample count and concurrency.
The engine issues API calls concurrently, using asyncio.Semaphore to cap how many are in flight at once. Each call has a configurable timeout.
For each response: accuracy check, latency record, token count + cost, hallucination score, reasoning quality — all computed in parallel.
Results are aggregated into a full EvaluationResult with percentile stats and persisted to SQLite for future comparison.
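The concurrency step in the pipeline above can be sketched with the standard library alone. This is a minimal illustration, assuming a stand-in `fake_model_call` coroutine in place of a real provider call; `evaluate_all` and its parameters are hypothetical names, and retry logic is left out to keep the example short.

```python
import asyncio


async def fake_model_call(prompt, delay):
    """Stand-in for a real API call (hypothetical); just sleeps, then answers."""
    await asyncio.sleep(delay)
    return f"answer to {prompt}"


async def evaluate_all(prompts, concurrency=3, timeout=0.5, delay=0.01):
    """Run all prompts concurrently, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def one(prompt):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await asyncio.wait_for(fake_model_call(prompt, delay), timeout)
            except asyncio.TimeoutError:
                return None  # a timed-out call is recorded as a missing result
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, peak


results, peak = asyncio.run(evaluate_all([f"q{i}" for i in range(10)]))
print(len(results), "results; peak concurrency:", peak)
```

Note that a semaphore bounds simultaneous requests rather than requests per second, so provider rate limits may still call for additional throttling.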
Install, configure, and run your first evaluation in minutes.
    # Install from PyPI
    pip install llm-evaluation-framework

    # Or clone from source
    git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
    cd LLM-Evaluation-Framework
    pip install -e .

    # Copy and fill in your API keys
    cp .env.example .env

    # Verify installation
    llm-eval --version
    # llm-eval, version 1.0.0
    # Run a single model evaluation (100 MMLU samples)
    llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

    ╭──────────────────────────────────────╮
    │ Evaluation: gpt-4o-mini              │
    ├──────────────────┬───────────────────┤
    │ Accuracy         │ 78.00%            │
    │ Avg Latency      │ 432 ms            │
    │ P95 Latency      │ 1240 ms           │
    │ Total Cost       │ $0.0023           │
    │ Cost / 1K Tokens │ $0.0015           │
    │ Hallucination    │ 2.40%             │
    │ Reasoning Score  │ 7.2 / 10          │
    │ Samples          │ 100               │
    │ Run ID           │ a3f92c1b          │
    ╰──────────────────┴───────────────────╯

    # Compare 3 models
    llm-eval compare \
      --models gpt-4o-mini \
      --models claude-3-haiku-20240307 \
      --models gemini/gemini-1.5-flash \
      --benchmark mmlu --samples 50

    # Show all stored results
    llm-eval results --benchmark mmlu --limit 20

    # Export results
    llm-eval export --format csv --output results.csv

    # Generate PDF report
    llm-eval report --run-ids a3f92c1b --output ./reports/

    # Launch dashboard
    llm-eval dashboard

    # Start REST API server
    llm-eval serve --port 8000
    import asyncio
    import json

    from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
    from llm_eval.benchmarks.mmlu import MMLUBenchmark


    async def main():
        # Initialize evaluator
        evaluator = LLMEvaluator()

        # Load benchmark
        samples = MMLUBenchmark().load(num_samples=100)

        # Configure run
        config = EvaluationConfig(
            model="gpt-4o-mini",
            benchmark="mmlu",
            num_samples=100,
            temperature=0.0,
            concurrency=10,  # 10 parallel calls
            timeout=30.0,
        )

        # Run evaluation
        result = await evaluator.evaluate(config, samples)

        # Access results
        print(f"Accuracy: {result.accuracy:.2%}")
        print(f"Avg Latency: {result.avg_latency_ms:.0f}ms")
        print(f"P95 Latency: {result.p95_latency_ms:.0f}ms")
        print(f"Total Cost: ${result.total_cost_usd:.4f}")
        print(f"Hallucinate: {result.hallucination_rate:.2%}")
        print(f"Reasoning: {result.avg_reasoning_score:.1f}/10")

        # Export to JSON
        print(json.dumps(result.to_dict(), indent=2))


    asyncio.run(main())
    import asyncio

    from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
    from llm_eval.benchmarks.mmlu import MMLUBenchmark


    async def compare_models():
        evaluator = LLMEvaluator()
        samples = MMLUBenchmark().load(num_samples=50)

        # All models run on the SAME samples: apples-to-apples comparison
        configs = [
            EvaluationConfig(model="gpt-4o-mini", benchmark="mmlu", num_samples=50),
            EvaluationConfig(model="claude-3-haiku-20240307", benchmark="mmlu", num_samples=50),
            EvaluationConfig(model="gemini/gemini-1.5-flash", benchmark="mmlu", num_samples=50),
            EvaluationConfig(model="mistral/mistral-small-latest", benchmark="mmlu", num_samples=50),
        ]

        # Runs all 4 evaluations in parallel
        results = await evaluator.evaluate_multiple(configs, samples)

        # Print leaderboard
        print(f"{'Model':<35} {'Accuracy':>10} {'Latency':>10} {'Cost/1K':>12} {'Reasoning':>12}")
        print("-" * 80)
        for r in sorted(results, key=lambda x: x.accuracy, reverse=True):
            print(
                f"{r.model:<35} {r.accuracy:>9.1%} {r.avg_latency_ms:>8.0f}ms "
                f"${r.cost_per_1k_tokens:>9.4f} {r.avg_reasoning_score:>9.1f}/10"
            )


    asyncio.run(compare_models())
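The comparison example above relies on every model seeing the same samples while the runs proceed in parallel. A minimal sketch of that pattern with `asyncio.gather` follows; `evaluate_model` and `compare_on_shared_samples` are hypothetical stand-ins written for this illustration, not the framework's `evaluate_multiple` implementation.

```python
import asyncio


async def evaluate_model(model, samples):
    """Stand-in for one model's evaluation run; records which prompts it saw."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"model": model, "prompts": [s["prompt"] for s in samples]}


async def compare_on_shared_samples(models, samples):
    """Fan the same frozen sample list out to every model concurrently."""
    shared = tuple(samples)  # load once, share with all runs
    return await asyncio.gather(*(evaluate_model(m, shared) for m in models))


demo_samples = [{"prompt": "What is 2+2?"}, {"prompt": "Capital of France?"}]
for row in asyncio.run(compare_on_shared_samples(["model-a", "model-b"], demo_samples)):
    print(row["model"], "saw", len(row["prompts"]), "prompts")
```

`asyncio.gather` returns results in the order the coroutines were passed, so the output lines up with the input model list regardless of which run finishes first.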
    # Start the API server
    uvicorn llm_eval.api.main:app --reload --port 8000
    # Open: http://localhost:8000/docs

    # Evaluate a model
    curl -X POST http://localhost:8000/evaluate \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-4o-mini","benchmark":"mmlu","num_samples":50}'

    {
      "run_id": "a3f92c1b",
      "model": "gpt-4o-mini",
      "accuracy": 0.78,
      "avg_latency_ms": 432.1,
      "p95_latency_ms": 1240.0,
      "total_cost_usd": 0.0012,
      "cost_per_1k_tokens": 0.0015,
      "hallucination_rate": 0.024,
      "avg_reasoning_score": 7.2,
      "created_at": "2025-01-20T14:32:01"
    }

    # Compare models
    curl -X POST http://localhost:8000/compare \
      -H "Content-Type: application/json" \
      -d '{"models":["gpt-4o-mini","claude-3-haiku-20240307"],"benchmark":"mmlu","num_samples":30}'

    # Export CSV
    curl http://localhost:8000/export/csv -o results.csv

    # Generate PDF report
    curl -X POST http://localhost:8000/report \
      -H "Content-Type: application/json" \
      -d '{"run_ids":["a3f92c1b"]}' \
      -o report.pdf
    # Clone and configure
    git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
    cd LLM-Evaluation-Framework
    cp .env.example .env
    # Edit .env and add your API keys

    # Start API + Dashboard with one command
    docker-compose up -d
    # API: http://localhost:8000/docs
    # Dashboard: http://localhost:8501

    # Build only the API image
    docker build --target api -t llm-eval-api .
    docker run -p 8000:8000 --env-file .env llm-eval-api

    # Build only the Dashboard image
    docker build --target dashboard -t llm-eval-dashboard .
    docker run -p 8501:8501 --env-file .env llm-eval-dashboard

    # View logs
    docker-compose logs -f api
    docker-compose logs -f dashboard
Three ways to benchmark: industry-standard datasets loaded from HuggingFace Hub, local cache, or your own CSV/JSON.
| Benchmark | Samples | Format | Subjects | Use Case | Source |
|---|---|---|---|---|---|
| MMLU | ~14,000 test | 4-choice MC | 57 academic subjects | General knowledge & reasoning | HuggingFace Hub |
| TruthfulQA | 817 questions | 4-choice MC | Health, law, history, myths | Factual truthfulness | HuggingFace Hub |
| Custom CSV | Any | prompt + expected | User-defined | Domain-specific evaluation | File Upload |
| Custom JSON | Any | Array of objects | User-defined | Programmatic benchmarks | File Upload |
    prompt,expected
    "What is 2+2?",4
    "Capital of France?",Paris
    "Who wrote Hamlet?","Shakespeare"
    "What is O(log n)?","Binary search"
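To make the CSV layout above concrete, here is a small sketch that parses such a file into sample records with the standard `csv` module. The `load_samples` helper and the record shape are assumptions for illustration and do not reproduce the framework's own loader; a missing `expected` column is treated as optional.

```python
import csv
import io


def load_samples(csv_text):
    """Parse CSV text with a required 'prompt' column and an optional 'expected' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "prompt" not in reader.fieldnames:
        raise ValueError("CSV needs a 'prompt' column")
    samples = []
    for row in reader:
        # Empty or absent expected values become None.
        samples.append({"prompt": row["prompt"], "expected": row.get("expected") or None})
    return samples


EXAMPLE = 'prompt,expected\n"What is 2+2?",4\n"Capital of France?",Paris\n'
for sample in load_samples(EXAMPLE):
    print(sample)
```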
[
{
"prompt": "What is 2+2?",
"expected": "4"
},
{
"prompt": "Capital of France?",
"expected": "Paris"
}
]
Any LiteLLM-compatible model works automatically. Pricing is pre-loaded for the most common variants.
Run the API with uvicorn llm_eval.api.main:app --reload then visit /docs for interactive Swagger UI.
Every layer is independently replaceable. Swap the benchmark loader, DB backend, or metrics engine without touching the rest.
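One common way to get this kind of replaceability is to code against a small interface. The sketch below assumes a hypothetical `ResultStore` protocol with a SQLite-backed and an in-memory implementation; the class names, table schema, and methods are invented for illustration and are not taken from the framework's source.

```python
import sqlite3
from typing import Optional, Protocol, Tuple


class ResultStore(Protocol):
    """Hypothetical storage interface that any backend can satisfy."""

    def save(self, run_id: str, model: str, accuracy: float) -> None: ...

    def best(self) -> Optional[Tuple[str, float]]: ...


class SQLiteStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(run_id TEXT PRIMARY KEY, model TEXT, accuracy REAL)"
        )

    def save(self, run_id, model, accuracy):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                (run_id, model, accuracy),
            )

    def best(self):
        return self.conn.execute(
            "SELECT model, accuracy FROM results ORDER BY accuracy DESC LIMIT 1"
        ).fetchone()


class InMemoryStore:
    def __init__(self):
        self.rows = {}

    def save(self, run_id, model, accuracy):
        self.rows[run_id] = (model, accuracy)

    def best(self):
        return max(self.rows.values(), key=lambda r: r[1], default=None)


def record_run(store: ResultStore, run_id, model, accuracy):
    """Caller code depends only on the interface, not on a concrete backend."""
    store.save(run_id, model, accuracy)


for store in (SQLiteStore(), InMemoryStore()):
    record_run(store, "r1", "model-a", 0.7)
    record_run(store, "r2", "model-b", 0.9)
    print(type(store).__name__, "best:", store.best())
```

Because `record_run` only touches the protocol's methods, swapping the backend is a one-line change at the call site.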
Representative benchmark runs on MMLU test set (100 samples). Your results will vary by API region and time of day.
| Model | Accuracy | Avg Latency | P95 Latency | Cost / 1K Tokens | Hallucination Rate | Reasoning Score |
|---|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892 ms | 2,140 ms | $0.0080 | 1.8% | 8.4 / 10 |
| Claude 3.5 Sonnet | 87.6% | 1,240 ms | 2,890 ms | $0.0090 | 2.1% | 8.6 / 10 |
| GPT-4o-mini | 78.4% | 432 ms | 1,100 ms | $0.0003 | 3.2% | 7.2 / 10 |
| Gemini 1.5 Flash | 76.8% | 380 ms | 910 ms | $0.0001 | 4.1% | 6.8 / 10 |
| Claude 3 Haiku | 74.2% | 410 ms | 980 ms | $0.0010 | 4.8% | 6.5 / 10 |
| Mistral Small | 71.0% | 520 ms | 1,320 ms | $0.0010 | 5.6% | 6.2 / 10 |
| Llama 3 8B | 64.4% | 680 ms | 1,820 ms | $0.0002 | 7.4% | 5.9 / 10 |
⚠️ These are sample results for illustration. Run the framework with real API keys for actual benchmarks.
Choose the setup that fits your workflow — from one-line pip install to full Docker deployment.
    pip install llm-evaluation-framework

    # With dashboard extras
    pip install "llm-evaluation-framework[dashboard]"

    # With PDF reports
    pip install "llm-evaluation-framework[reports]"

    # Full install
    pip install "llm-evaluation-framework[dashboard,reports,dev]"

    git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
    cd LLM-Evaluation-Framework
    python -m venv .venv && source .venv/bin/activate
    pip install -e ".[dashboard,reports,dev]"
    cp .env.example .env && nano .env

    git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
    cd LLM-Evaluation-Framework
    cp .env.example .env   # add API keys
    docker-compose up -d   # API at :8000, Dashboard at :8501

    from datasets import load_dataset

    ds = load_dataset("vigneshwar234/llm-eval-benchmark")
    # 1,200 samples: train / validation / test
    # Covers MMLU + TruthfulQA subjects
    # Science, math, CS, history, economics
Common questions about the LLM Evaluation Framework.
[...] .env.
[...] ollama/llama3, huggingface/meta-llama/Llama-3-8b) and set the custom API base URL in your .env.
Run pytest tests/ -v. All 40+ tests use mocked LiteLLM responses, so no real API keys are required. The CI runs the full suite on Python 3.10, 3.11, and 3.12 on every push.
[...] prompt column and an optional expected column. In the dashboard, use the ▶️ Run Evaluation page and select "custom" as the benchmark to upload your file. Via CLI: pipe your data through the Python API with CustomBenchmark.from_file("data.csv").
The evaluate_multiple method loads the same sample list once and passes it to each model configuration. All model evaluations run in parallel using asyncio.gather, so comparing 4 models takes roughly as long as the slowest single run, provider rate limits permitting. This guarantees apples-to-apples results since every model sees identical prompts.
The CostMetric.estimate_run_cost(model, num_samples) method gives you a pre-run cost estimate based on average token counts. The actual cost is computed per sample from real token counts after the run.
Free, open-source, production-grade. Star the repo to help others discover it. It means everything. 🙏