v1.0.0 — Production Ready — Open Source

Benchmark Any LLM
Like a Scientist

A production-grade open-source framework for evaluating GPT-4, Claude, Gemini, Mistral and Llama on accuracy, latency, cost, hallucination rate, and reasoning quality — all in one place.

10+ LLM providers · 5 eval metrics · 15K+ benchmark samples · 40+ unit tests · 100% async parallel · 3 deploy options

Features

Everything you need to
benchmark LLMs

From raw benchmarks to beautiful dashboards, REST APIs, PDF reports, and a CLI tool — no duct tape required.

Full Async Engine (asyncio)
Evaluate hundreds of samples in parallel with a configurable concurrency semaphore. Automatic timeout handling and retry logic are built in.

📊 Streamlit Dashboard (streamlit + plotly)
5-page interactive dashboard with radar charts, latency histograms, cost vs quality scatter plots, and one-click PDF export.

⚖️ Side-by-Side Comparison (parallel eval)
Run the exact same prompts on up to 5 models simultaneously and compare every metric on the same leaderboard.

🌐 FastAPI REST API (fastapi)
12 endpoints with auto-generated OpenAPI docs at /docs. File upload, PDF generation, and CSV/JSON export are all available over HTTP.

💻 CLI Tool (click + rich)
7 subcommands (run, compare, results, export, report, serve, dashboard) with rich terminal output, progress bars, and tables.

📄 PDF Reports (reportlab)
Auto-generate professional evaluation reports with a cover page, summary table, and per-model detail sections using ReportLab.

🐳 Docker Ready (docker-compose)
Multi-stage Dockerfile with separate API and Dashboard targets. Run docker-compose up and both services start.

🤗 HuggingFace Integration (datasets library)
Auto-loads MMLU and TruthfulQA from HF Hub with local caching. The dataset is also published on HuggingFace for easy reuse.

💰 Real-Time Cost Tracking (cost calculator)
Pricing table for 15+ model variants. Track total spend and cost per 1K tokens, and estimate run costs before executing.

🗄️ SQLite Persistence (sqlite3)
All results are saved to SQLite automatically. Query, filter, export to CSV/JSON, and build comparison charts over time.

🧪 40+ Unit Tests (pytest + coverage)
Full pytest suite covering all modules, with no real API keys needed. GitHub Actions CI runs on Python 3.10, 3.11, and 3.12.

📤 Custom Benchmarks (custom upload)
Upload your own CSV or JSON dataset from the dashboard, API, or CLI. Just provide prompt and expected columns.

Metrics

5 metrics that actually matter

Every metric is computed per sample, fully async, with statistical aggregation and percentile breakdowns.

🎯 Accuracy (0.0 – 1.0)
Cascading multi-strategy scorer: exact match → normalized match → multiple-choice letter detection → fuzzy match. Handles free-form and multiple-choice answers.

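The cascade described for the accuracy metric could look roughly like the sketch below. It uses only the standard library; the function name `score_accuracy`, the 0.85 fuzzy threshold, and the letter-detection regex are illustrative assumptions, not the framework's actual scorer.

```python
import re
import string
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def score_accuracy(response: str, expected: str, fuzzy_threshold: float = 0.85) -> float:
    """Return 1.0 as soon as one strategy in the cascade matches, else 0.0."""
    # 1. Exact match (ignoring surrounding whitespace)
    if response.strip() == expected.strip():
        return 1.0
    # 2. Normalized match (case and punctuation insensitive)
    if _normalize(response) == _normalize(expected):
        return 1.0
    # 3. Multiple-choice letter detection, only when the key is a single letter A-D.
    #    Assumption: the first standalone capital letter in the response is the answer.
    if re.fullmatch(r"[A-Da-d]", expected.strip()):
        match = re.search(r"\b([A-D])\b", response)
        return 1.0 if match and match.group(1) == expected.strip().upper() else 0.0
    # 4. Fuzzy match on normalized text (threshold is a hypothetical default)
    ratio = SequenceMatcher(None, _normalize(response), _normalize(expected)).ratio()
    return 1.0 if ratio >= fuzzy_threshold else 0.0
```

Ordering the strategies from strictest to loosest keeps the cheap checks first and lets the fuzzy step act only as a fallback for free-form answers.
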
Latency (ms · p50/p95/p99)
Wall-clock time per call. Reports mean, median, std, min, max, and percentiles (p50, p75, p90, p95, p99). Includes SLA violation rate.

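As a rough illustration of the latency statistics listed above, the following sketch aggregates a list of per-call latencies. The nearest-rank percentile method, the function names, and the `sla_ms` threshold parameter are assumptions made for the example; the framework may compute these differently.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latency(latencies_ms, sla_ms):
    """Aggregate per-call latencies (ms) into summary stats and an SLA violation rate."""
    values = sorted(latencies_ms)
    summary = {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "min": values[0],
        "max": values[-1],
        # Fraction of calls slower than the (hypothetical) SLA threshold
        "sla_violation_rate": sum(v > sla_ms for v in values) / len(values),
    }
    for p in (50, 75, 90, 95, 99):
        summary[f"p{p}"] = percentile(values, p)
    return summary
```

With only a handful of samples the upper percentiles collapse onto the maximum, which is one reason to report the sample count alongside p95 and p99.
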
💰 Cost (USD / 1K tokens)
Per-token pricing for 15+ provider/model variants. Reports total cost, cost per 1K tokens, and a pre-run cost estimate. No extra API calls are needed.

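The per-token arithmetic behind the cost metric can be sketched as follows. The two prices are copied from the supported-models pricing list in this document; the dictionary layout and function names are hypothetical.

```python
# USD per 1M tokens as (input, output), taken from this document's pricing list.
PRICING_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (5.00, 15.00),
}


def call_cost_usd(model, prompt_tokens, completion_tokens):
    """Cost of one call, computed from the token counts reported for that call."""
    input_price, output_price = PRICING_PER_1M[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Blended cost per 1K tokens across prompt and completion tokens."""
    return total_cost_usd / total_tokens * 1000 if total_tokens else 0.0
```

For example, a gpt-4o-mini call with 1,000 prompt tokens and 500 completion tokens costs (1000 × 0.15 + 500 × 0.60) / 1,000,000 = $0.00045, or $0.0003 per 1K tokens.
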
🤥 Hallucination Rate (0.0 – 1.0)
Linguistic signal analysis: uncertainty markers, hedging phrases, and ungrounded claims weighed against grounding signals. Runs entirely locally at zero cost.

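A heuristic of this kind might be sketched as below. The phrase lists, the ratio formula, and the 0.0 default for text with no signals are all assumptions chosen for illustration; they are not the framework's actual scorer.

```python
# Hypothetical phrase lists; a real scorer would likely use larger, tuned lists.
UNCERTAINTY_MARKERS = ("i think", "i believe", "probably", "might be", "not sure")
GROUNDING_SIGNALS = ("according to", "based on", "as stated in")


def hallucination_score(response: str) -> float:
    """Share of hedging signals among all detected signals, in the range 0.0 to 1.0."""
    text = response.lower()
    risky = sum(text.count(phrase) for phrase in UNCERTAINTY_MARKERS)
    grounded = sum(text.count(phrase) for phrase in GROUNDING_SIGNALS)
    if risky + grounded == 0:
        return 0.0  # assumption: no signals at all is treated as low risk
    return risky / (risky + grounded)
```

Because it only counts surface phrases, a scorer like this measures how hedged a response sounds, not whether its claims are true.
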
🧠 Reasoning Quality (1 – 10)
Scores chain-of-thought depth from reasoning marker density, grounding signals, and response length calibration. Distinguishes step-by-step answers from shallow ones.

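To make the idea of reasoning marker density concrete, here is a small sketch that maps density onto the 1 to 10 scale. The marker set and the `target_density` parameter are hypothetical, and the grounding and length-calibration components mentioned above are deliberately left out.

```python
# Hypothetical marker set; only the marker-density component is modeled here.
REASONING_MARKERS = {
    "because", "therefore", "thus", "hence", "since",
    "first", "second", "next", "finally",
}


def reasoning_score(response: str, target_density: float = 0.1) -> float:
    """Map the share of reasoning-marker words onto a 1-10 scale."""
    words = [w.strip(".,:;!?()").lower() for w in response.split()]
    if not words:
        return 1.0
    density = sum(w in REASONING_MARKERS for w in words) / len(words)
    # Density at or above target_density earns the full 10 points.
    return round(1 + 9 * min(1.0, density / target_density), 1)
```
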
How It Works

Simple 4-step evaluation pipeline

From config to results in under 2 minutes.

1

Choose Model + Benchmark

Select any LiteLLM-compatible model and a benchmark (MMLU, TruthfulQA, or custom CSV). Set sample count and concurrency.

2

Async Parallel Evaluation

The engine schedules all API calls at once and uses asyncio.Semaphore to cap how many are in flight at a time. Each call has a configurable timeout.

3

Per-Sample Metrics

For each response: accuracy check, latency record, token count + cost, hallucination score, reasoning quality — all computed in parallel.

4

Aggregation + Storage

Results are aggregated into a full EvaluationResult with percentile stats and persisted to SQLite for future comparison.
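The four steps above can be sketched end to end with the standard library. This is a simplified illustration under stated assumptions: `call_model` stands in for the real LiteLLM call, only accuracy and latency are computed, and the table schema and function names are hypothetical.

```python
import asyncio
import json
import sqlite3
import statistics
import time


async def evaluate_sample(sample, call_model, semaphore, timeout_s):
    """Run one sample under the concurrency cap and record its latency."""
    async with semaphore:  # at most `concurrency` calls are in flight
        start = time.perf_counter()
        try:
            answer = await asyncio.wait_for(call_model(sample["prompt"]), timeout=timeout_s)
        except asyncio.TimeoutError:
            answer = None  # a timed-out call is scored as incorrect
        latency_ms = (time.perf_counter() - start) * 1000
    correct = answer is not None and answer == sample["expected"]
    return {"correct": correct, "latency_ms": latency_ms}


async def run_pipeline(samples, call_model, concurrency=10, timeout_s=30.0):
    """Evaluate all samples concurrently, then aggregate per-sample metrics."""
    semaphore = asyncio.Semaphore(concurrency)
    per_sample = await asyncio.gather(
        *(evaluate_sample(s, call_model, semaphore, timeout_s) for s in samples)
    )
    latencies = [r["latency_ms"] for r in per_sample]
    return {
        "accuracy": sum(r["correct"] for r in per_sample) / len(per_sample),
        "avg_latency_ms": statistics.fmean(latencies),
        "num_samples": len(per_sample),
    }


def save_result(conn, run_id, model, result):
    """Persist one aggregated result as a JSON payload (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(run_id TEXT PRIMARY KEY, model TEXT, payload TEXT)"
    )
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?)", (run_id, model, json.dumps(result))
    )
    conn.commit()
```

Timing starts after the semaphore is acquired, so the recorded latency reflects the call itself rather than time spent waiting in the queue.
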

Quick Start

Up and running in 3 commands

Install, configure, and run your first evaluation in minutes.

# Install from PyPI
pip install llm-evaluation-framework

# Or clone from source
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
pip install -e .

# Copy and fill in your API keys
cp .env.example .env

# Verify installation
llm-eval --version
# llm-eval, version 1.0.0

# Run a single model evaluation (100 MMLU samples)
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

╭──────────────────────────────────────╮
│  Evaluation: gpt-4o-mini             │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Cost / 1K Tokens │ $0.0015           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
│ Samples          │ 100               │
│ Run ID           │ a3f92c1b          │
╰──────────────────┴───────────────────╯

# Compare 3 models
llm-eval compare \
  --models gpt-4o-mini \
  --models claude-3-haiku-20240307 \
  --models gemini/gemini-1.5-flash \
  --benchmark mmlu --samples 50

# Show all stored results
llm-eval results --benchmark mmlu --limit 20

# Export results
llm-eval export --format csv --output results.csv

# Generate PDF report
llm-eval report --run-ids a3f92c1b --output ./reports/

# Launch dashboard
llm-eval dashboard

# Start REST API server
llm-eval serve --port 8000

# Python API: evaluate a single model
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark

async def main():
    # Initialize evaluator
    evaluator = LLMEvaluator()

    # Load benchmark
    samples = MMLUBenchmark().load(num_samples=100)

    # Configure run
    config = EvaluationConfig(
        model="gpt-4o-mini",
        benchmark="mmlu",
        num_samples=100,
        temperature=0.0,
        concurrency=10,     # 10 parallel calls
        timeout=30.0,
    )

    # Run evaluation
    result = await evaluator.evaluate(config, samples)

    # Access results
    print(f"Accuracy:     {result.accuracy:.2%}")
    print(f"Avg Latency:  {result.avg_latency_ms:.0f}ms")
    print(f"P95 Latency:  {result.p95_latency_ms:.0f}ms")
    print(f"Total Cost:   ${result.total_cost_usd:.4f}")
    print(f"Hallucinate:  {result.hallucination_rate:.2%}")
    print(f"Reasoning:    {result.avg_reasoning_score:.1f}/10")

    # Export to JSON
    import json
    print(json.dumps(result.to_dict(), indent=2))

asyncio.run(main())

# Python API: compare several models on the same samples
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark

async def compare_models():
    evaluator = LLMEvaluator()
    samples = MMLUBenchmark().load(num_samples=50)

    # All models run on the SAME samples — apples-to-apples comparison
    configs = [
        EvaluationConfig(model="gpt-4o-mini",             benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="claude-3-haiku-20240307",  benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="gemini/gemini-1.5-flash", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="mistral/mistral-small-latest", benchmark="mmlu", num_samples=50),
    ]

    # Runs all 4 evaluations in parallel
    results = await evaluator.evaluate_multiple(configs, samples)

    # Print leaderboard
    print(f"{'Model':<35} {'Accuracy':>10} {'Latency':>10} {'Cost/1K':>12} {'Reasoning':>12}")
    print("-" * 80)
    for r in sorted(results, key=lambda x: x.accuracy, reverse=True):
        print(f"{r.model:<35} {r.accuracy:>9.1%} {r.avg_latency_ms:>8.0f}ms ${r.cost_per_1k_tokens:>9.4f} {r.avg_reasoning_score:>9.1f}/10")

asyncio.run(compare_models())

# Start the API server
uvicorn llm_eval.api.main:app --reload --port 8000
# Open: http://localhost:8000/docs

# Evaluate a model
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","benchmark":"mmlu","num_samples":50}'

{
  "run_id": "a3f92c1b",
  "model": "gpt-4o-mini",
  "accuracy": 0.78,
  "avg_latency_ms": 432.1,
  "p95_latency_ms": 1240.0,
  "total_cost_usd": 0.0012,
  "cost_per_1k_tokens": 0.0015,
  "hallucination_rate": 0.024,
  "avg_reasoning_score": 7.2,
  "created_at": "2025-01-20T14:32:01"
}

# Compare models
curl -X POST http://localhost:8000/compare \
  -H "Content-Type: application/json" \
  -d '{"models":["gpt-4o-mini","claude-3-haiku-20240307"],"benchmark":"mmlu","num_samples":30}'

# Export CSV
curl http://localhost:8000/export/csv -o results.csv

# Generate PDF report
curl -X POST http://localhost:8000/report \
  -H "Content-Type: application/json" \
  -d '{"run_ids":["a3f92c1b"]}' \
  -o report.pdf

# Clone and configure
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
cp .env.example .env
# Edit .env — add your API keys

# Start API + Dashboard with one command
docker-compose up -d

# API:       http://localhost:8000/docs
# Dashboard: http://localhost:8501

# Build only the API image
docker build --target api -t llm-eval-api .
docker run -p 8000:8000 --env-file .env llm-eval-api

# Build only the Dashboard image
docker build --target dashboard -t llm-eval-dashboard .
docker run -p 8501:8501 --env-file .env llm-eval-dashboard

# View logs
docker-compose logs -f api
docker-compose logs -f dashboard

Benchmarks

Built-in benchmarks + custom upload

Three sources are supported: industry-standard datasets loaded from HuggingFace Hub, a local cache of those datasets, or your own CSV/JSON files.

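| Benchmark   | Samples       | Format            | Subjects                    | Use Case                      | Source          |
|-------------|---------------|-------------------|-----------------------------|-------------------------------|-----------------|
| MMLU        | ~14,000 test  | 4-choice MC       | 57 academic subjects        | General knowledge & reasoning | HuggingFace Hub |
| TruthfulQA  | 817 questions | 4-choice MC       | Health, law, history, myths | Factual truthfulness          | HuggingFace Hub |
| Custom CSV  | Any           | prompt + expected | User-defined                | Domain-specific evaluation    | File Upload     |
| Custom JSON | Any           | Array of objects  | User-defined                | Programmatic benchmarks       | File Upload     |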

Custom Benchmark Format

📄 CSV Format
prompt,expected
"What is 2+2?",4
"Capital of France?",Paris
"Who wrote Hamlet?","Shakespeare"
"What is O(log n)?","Binary search"
📦 JSON Format
[
  {
    "prompt": "What is 2+2?",
    "expected": "4"
  },
  {
    "prompt": "Capital of France?",
    "expected": "Paris"
  }
]
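
Both formats reduce to the same list of prompt/expected records. The loader below is a minimal standard-library sketch of how such files could be read; `load_custom_benchmark` is a hypothetical helper, not the framework's CustomBenchmark class.

```python
import csv
import json
from pathlib import Path


def load_custom_benchmark(path):
    """Read a CSV or JSON benchmark file into a list of prompt/expected dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    elif suffix == ".json":
        rows = json.loads(path.read_text(encoding="utf-8"))
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")

    samples = []
    for index, row in enumerate(rows):
        if not row.get("prompt"):
            raise ValueError(f"Row {index} is missing a prompt")
        # `expected` is optional; rows without it carry None
        samples.append({"prompt": row["prompt"], "expected": row.get("expected")})
    return samples
```

Note that CSV values always arrive as strings, so the expected answer 4 in the CSV example is read as "4", matching the JSON example.
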
Supported Models

10+ models across 5 providers

Any LiteLLM-compatible model works. Pricing is pre-loaded for the most common variants.

| Provider        | Model             | Price per 1M tokens (input / output) |
|-----------------|-------------------|--------------------------------------|
| OpenAI          | GPT-4o            | $5 / $15                             |
| OpenAI          | GPT-4o-mini       | $0.15 / $0.60                        |
| OpenAI          | GPT-4-turbo       | $10 / $30                            |
| OpenAI          | o1                | $15 / $60                            |
| OpenAI          | o1-mini           | $3 / $12                             |
| Anthropic       | Claude 3.5 Sonnet | $3 / $15                             |
| Anthropic       | Claude 3.5 Haiku  | $0.80 / $4                           |
| Anthropic       | Claude 3 Opus     | $15 / $75                            |
| Google          | Gemini 1.5 Pro    | $3.5 / $10.5                         |
| Google          | Gemini 1.5 Flash  | $0.075 / $0.30                       |
| Mistral         | Mistral Large     | $4 / $12                             |
| Meta (Together) | Llama 3 70B       | $0.90 / $0.90                        |

+ Any model supported by LiteLLM works automatically.

REST API Reference

12 endpoints, full OpenAPI docs

Run the API with uvicorn llm_eval.api.main:app --reload then visit /docs for interactive Swagger UI.

| Method | Path              | Description                                                                      |
|--------|-------------------|----------------------------------------------------------------------------------|
| POST   | /evaluate         | Run a full evaluation (model, benchmark, num_samples, temperature, concurrency)  |
| POST   | /compare          | Compare multiple models side-by-side on the same samples (up to 5 models)        |
| POST   | /evaluate/custom  | Upload a CSV or JSON file as a custom benchmark dataset                          |
| GET    | /results          | List stored evaluation results with optional model/benchmark filters            |
| GET    | /results/{run_id} | Get detailed result for a specific run ID                                        |
| DELETE | /results/{run_id} | Delete a stored evaluation run                                                   |
| GET    | /export/csv       | Download all results as a CSV file                                               |
| GET    | /export/json      | Download all results as a JSON file                                              |
| POST   | /report           | Generate and download a professional PDF evaluation report                       |
| GET    | /models           | List all supported models with pricing                                           |
| GET    | /benchmarks       | List all available benchmarks with descriptions                                  |
| GET    | /health           | Health check; returns version and status                                         |

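For callers who prefer Python over curl, the /evaluate endpoint can be reached with the standard library alone. This sketch assumes the server is running on port 8000 as in the examples in this document; the helper names are hypothetical, and the payload fields mirror the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port used in this document's examples


def build_evaluate_request(model, benchmark, num_samples, base_url=BASE_URL):
    """Build a POST /evaluate request with a JSON body."""
    payload = {"model": model, "benchmark": benchmark, "num_samples": num_samples}
    return urllib.request.Request(
        url=f"{base_url}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_evaluation(request, timeout_s=300):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Splitting request construction from sending makes the payload easy to inspect or log before any API spend occurs.
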
Architecture

Designed for production

Every layer is independently replaceable. Swap the benchmark loader, DB backend, or metrics engine without touching the rest.

┌─────────────────────────────────────────────────────────────────────────┐
│                        LLM EVALUATION FRAMEWORK                         │
│                                                                         │
│   ┌──────────┐  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐    │
│   │  Click   │  │   FastAPI    │  │   Streamlit   │  │  ReportLab  │    │
│   │   CLI    │  │   REST API   │  │   Dashboard   │  │PDF Generator│    │
│   │  7 cmds  │  │ 12 endpoints │  │    5 pages    │  │             │    │
│   └────┬─────┘  └──────┬───────┘  └───────┬───────┘  └──────┬──────┘    │
│        └───────────────┴──────────────────┴─────────────────┘           │
│                                   │                                     │
│                       ┌───────────▼──────────┐                          │
│                       │    Core Evaluator    │                          │
│                       │    (Async Engine)    │                          │
│                       │  asyncio.Semaphore   │                          │
│                       │  timeout handling    │                          │
│                       │  progress callbacks  │                          │
│                       └───────────┬──────────┘                          │
│                                   │                                     │
│       ┌────────────────────┬──────┴──────┬────────────────────┐         │
│       │                    │             │                    │         │
│ ┌─────▼──────┐    ┌────────▼──────┐   ┌──▼──────────┐  ┌──────▼───┐     │
│ │  Metrics   │    │  Benchmarks   │   │  Database   │  │ LiteLLM  │     │
│ │            │    │               │   │  (SQLite)   │  │          │     │
│ │ accuracy   │    │ MMLU          │   │             │  │ OpenAI   │     │
│ │ hallucin.  │    │ TruthfulQA    │   │ save_result │  │ Anthropic│     │
│ │ latency    │    │ Custom CSV    │   │ list_results│  │ Google   │     │
│ │ cost       │    │ Custom JSON   │   │ export_csv  │  │ Mistral  │     │
│ │ reasoning  │    │               │   │ export_json │  │ Together │     │
│ └────────────┘    └───────────────┘   └─────────────┘  └──────────┘     │
└─────────────────────────────────────────────────────────────────────────┘

Benchmark Results

Sample evaluation results

Illustrative runs on the MMLU test set (100 samples). Your results will vary by API region and time of day.

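| Model             | Accuracy | Avg Latency | P95 Latency | Cost / 1K Tokens | Hallucination Rate | Reasoning Score |
|-------------------|----------|-------------|-------------|------------------|--------------------|-----------------|
| GPT-4o            | 88.2%    | 892 ms      | 2,140 ms    | $0.0080          | 1.8%               | 8.4 / 10        |
| Claude 3.5 Sonnet | 87.6%    | 1,240 ms    | 2,890 ms    | $0.0090          | 2.1%               | 8.6 / 10        |
| GPT-4o-mini       | 78.4%    | 432 ms      | 1,100 ms    | $0.0003          | 3.2%               | 7.2 / 10        |
| Gemini 1.5 Flash  | 76.8%    | 380 ms      | 910 ms      | $0.0001          | 4.1%               | 6.8 / 10        |
| Claude 3 Haiku    | 74.2%    | 410 ms      | 980 ms      | $0.0010          | 4.8%               | 6.5 / 10        |
| Mistral Small     | 71.0%    | 520 ms      | 1,320 ms    | $0.0010          | 5.6%               | 6.2 / 10        |
| Llama 3 8B        | 64.4%    | 680 ms      | 1,820 ms    | $0.0002          | 7.4%               | 5.9 / 10        |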

⚠️ These are sample results for illustration. Run the framework with real API keys for actual benchmarks.

Installation Guide

Multiple ways to get started

Choose the setup that fits your workflow — from one-line pip install to full Docker deployment.

⚡ pip (Recommended)
pip install llm-evaluation-framework

# With dashboard extras
pip install "llm-evaluation-framework[dashboard]"

# With PDF reports
pip install "llm-evaluation-framework[reports]"

# Full install
pip install "llm-evaluation-framework[dashboard,reports,dev]"
🔧 From Source
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dashboard,reports,dev]"
cp .env.example .env && nano .env
🐳 Docker Compose
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
cp .env.example .env  # add API keys
docker-compose up -d
# API at :8000, Dashboard at :8501
🤗 Use the Dataset
from datasets import load_dataset
ds = load_dataset("vigneshwar234/llm-eval-benchmark")
# 1,200 samples: train / validation / test
# Covers MMLU + TruthfulQA subjects
# Science, math, CS, history, economics
FAQ

Frequently asked questions

Common questions about the LLM Evaluation Framework.

Do I need all provider API keys to use this?
No. You only need the API key for the model you want to evaluate. The framework uses LiteLLM, so you can mix and match — test GPT-4o-mini with just an OpenAI key, or Claude with just an Anthropic key. Set whatever keys you have in .env.
How accurate is the hallucination metric?
The hallucination scorer uses heuristic linguistic signal analysis — it detects hedging phrases, uncertainty markers, and ungrounded claims without calling an external API. It's fast and zero-cost, but for production use cases you may want to swap it with an NLI model (e.g., a Hugging Face cross-encoder). The README explains how to extend it.
Can I evaluate open-source/local models?
Yes! LiteLLM supports Ollama, vLLM, HuggingFace TGI, and many other local inference servers. Use the appropriate LiteLLM model string (e.g., ollama/llama3, huggingface/meta-llama/Llama-3-8b) and set the custom API base URL in your .env.
How do I run the tests?
Run pytest tests/ -v. All 40+ tests use mocked LiteLLM responses — no real API keys required. The CI runs the full suite on Python 3.10, 3.11, and 3.12 on every push.
Can I use my own dataset format?
Yes. Upload any CSV or JSON file with a prompt column and an optional expected column. In the dashboard, use the ▶️ Run Evaluation page and select "custom" as the benchmark to upload your file. For scripted runs, load the file through the Python API with CustomBenchmark.from_file("data.csv").
How does the side-by-side comparison work?
The evaluate_multiple method loads the same sample list once and passes it to each model configuration. All model evaluations run in parallel using asyncio.gather, so comparing 4 models takes roughly as long as the slowest single model rather than the sum of all four, provider rate limits permitting. Because every model sees identical prompts, the comparison is apples-to-apples.
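The pattern behind this answer, one shared sample list fanned out to several models with asyncio.gather, can be sketched as follows. The function names and the `call_model` callable are hypothetical stand-ins, and scoring is reduced to exact match for brevity.

```python
import asyncio


async def evaluate_model(model, samples, call_model):
    """Score one model on the shared sample list (exact-match accuracy only)."""
    answers = await asyncio.gather(*(call_model(model, s["prompt"]) for s in samples))
    correct = sum(a == s["expected"] for a, s in zip(answers, samples))
    return {"model": model, "accuracy": correct / len(samples)}


async def evaluate_many(models, samples, call_model):
    """Run every model concurrently on the same samples."""
    # gather returns results in the same order as `models`
    return await asyncio.gather(
        *(evaluate_model(m, samples, call_model) for m in models)
    )
```

Passing the same `samples` object to every task is what keeps the comparison fair: no model can see a different subset or ordering.
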
Is there a cost estimate before running?
Yes. The CostMetric.estimate_run_cost(model, num_samples) method gives you a pre-run cost estimate based on average token counts. The actual cost is computed precisely per-sample from real token counts after the run.
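A pre-run estimate of this kind is simple arithmetic over assumed average token counts. The sketch below is not the CostMetric implementation; the function signature and the default averages of 200 prompt tokens and 50 completion tokens are hypothetical values chosen for illustration.

```python
def estimate_run_cost(
    num_samples,
    input_price_per_1m,
    output_price_per_1m,
    avg_prompt_tokens=200,      # assumed average, not a measured value
    avg_completion_tokens=50,   # assumed average, not a measured value
):
    """Estimate total USD cost of a run before making any API calls."""
    per_sample = (
        avg_prompt_tokens * input_price_per_1m
        + avg_completion_tokens * output_price_per_1m
    ) / 1_000_000
    return num_samples * per_sample
```

With the gpt-4o-mini prices listed in this document ($0.15 / $0.60 per 1M tokens), 100 samples at these assumed averages come to about $0.006.
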

Ready to benchmark your LLM?

Free, open-source, production-grade. Star the repo to help others discover it — it means everything. 🙏
