LLM Evaluation Framework

Features

Everything you need to
benchmark LLMs in production

From raw benchmarks to dashboards, REST APIs, PDF reports, and a CLI — batteries included.

⚡

Full Async Engine

Evaluate hundreds of samples in parallel with configurable concurrency. Automatic timeout handling and retries are built in.

asyncio

📊

Streamlit Dashboard

5-page interactive dashboard with radar charts, latency histograms, cost vs quality scatter plots, and one-click PDF export.

streamlit + plotly

⚖

Side-by-Side Comparison

Run the same prompts on up to 5 models simultaneously and compare every metric on the same leaderboard.

parallel eval

🌐

FastAPI REST API

12 endpoints with full OpenAPI docs at /docs. File upload, PDF generation, CSV/JSON export — all over HTTP.

fastapi

💻

CLI Tool

7 subcommands (run, compare, results, export, report, serve, dashboard) with rich terminal output and progress bars.

click + rich

📄

PDF Report Generator

Auto-generate professional evaluation reports with cover page, summary table, and per-model sections using ReportLab.

reportlab

🐙

Docker Ready

Multi-stage Dockerfile with separate API and Dashboard targets. docker-compose up and you are running.

docker-compose

🤗

HuggingFace Integration

Auto-loads MMLU and TruthfulQA from the HF Hub with local caching. The dataset is also published on HuggingFace for easy reuse.

datasets library

💰

Real-Time Cost Tracking

Pricing table for 15+ model variants. Track total spend, cost per 1K tokens, and estimate run costs before executing.

cost calculator

🗃

SQLite Persistence

All results saved automatically. Query, filter, export to CSV or JSON, and build comparison charts over time.

sqlite3

🧪

40+ Unit Tests

Full pytest suite covering all modules. No API keys needed. GitHub Actions CI runs across Python 3.10, 3.11, and 3.12.

pytest · 95% coverage

📤

Custom Benchmarks

Upload your own CSV or JSON dataset from the dashboard, API, or CLI. Provide prompt + expected columns.

custom upload

Evaluation Metrics

5 metrics that actually matter

Every metric is computed per sample, fully async, with statistical aggregation and percentile breakdowns.

🎯

Accuracy

Multi-strategy scorer: exact match → normalized → multiple-choice detection → fuzzy match cascade. Handles free-form and MC answers.

0.0 – 1.0
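
The cascade described for the Accuracy metric can be pictured roughly as follows. This is a minimal sketch, not the framework's actual scorer: the function names, the A-D letter pattern, and the 0.85 fuzzy threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Pull a leading choice letter out of 'B', '(b)', 'B) Paris' or 'Answer: C'."""
    pattern = r"^\s*(?:answer\s*(?:is)?\s*:?\s*)?\(?([A-D])\)?(?:[\s.):,]|$)"
    match = re.match(pattern, text, re.IGNORECASE)
    return match.group(1).upper() if match else None


def score(predicted: str, expected: str, fuzzy_threshold: float = 0.85) -> float:
    """Return 1.0 or 0.0 by trying progressively looser comparisons."""
    if predicted.strip() == expected.strip():            # 1. exact match
        return 1.0
    norm_pred, norm_exp = normalize(predicted), normalize(expected)
    if norm_pred == norm_exp:                            # 2. normalized match
        return 1.0
    if re.fullmatch(r"[A-Da-d]", expected.strip()):      # 3. multiple choice
        return 1.0 if extract_choice(predicted) == expected.strip().upper() else 0.0
    ratio = SequenceMatcher(None, norm_pred, norm_exp).ratio()
    return 1.0 if ratio >= fuzzy_threshold else 0.0      # 4. fuzzy match


print(score("  paris! ", "Paris"), score("B) Paris", "B"), score("C", "B"))
```

Ordering the checks from strict to loose keeps the cheap comparisons first and limits fuzzy matching to free-form answers, where small spelling differences should not count as errors.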

⚡

Latency

Wall-clock time per call. Reports mean, median, std, min, max, and full percentiles (p50, p75, p90, p95, p99). Includes SLA violation rate.

ms · p50 / p95 / p99
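
As a rough illustration of the latency summary, the sketch below computes the listed statistics from a list of per-call timings. The nearest-rank percentile method, the function names, and the 1000 ms SLA threshold are assumptions; the framework may use a different interpolation or threshold.

```python
import math
import statistics
from typing import Dict, List


def percentile(sorted_ms: List[float], pct: int) -> float:
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize_latency(latencies_ms: List[float], sla_ms: float = 1000.0) -> Dict[str, float]:
    """Aggregate per-call latencies into summary statistics."""
    ordered = sorted(latencies_ms)
    summary = {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.pstdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        # Share of calls that took longer than the SLA threshold.
        "sla_violation_rate": sum(ms > sla_ms for ms in ordered) / len(ordered),
    }
    for pct in (50, 75, 90, 95, 99):
        summary[f"p{pct}"] = percentile(ordered, pct)
    return summary


stats = summarize_latency([100, 200, 300, 400, 500, 600, 700, 800, 900, 2000])
print(stats["p50"], stats["p95"], stats["sla_violation_rate"])
```

With only ten timings, a single slow call dominates p95 and p99, which is one reason tail percentiles are more informative on larger runs.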

💰

Cost

Per-token pricing for 15+ provider variants. Reports total cost, cost per 1K tokens, and pre-run estimates. No extra API calls.

USD / 1K tokens
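
The cost figures follow from simple per-token arithmetic, so they need no extra API calls. The sketch below shows that arithmetic under stated assumptions: the model name and the prices in the table are placeholders, not real provider pricing, and the function names are hypothetical.

```python
from typing import Dict

# Placeholder prices in USD per 1M tokens (assumed values, for illustration only).
PRICING: Dict[str, Dict[str, float]] = {
    "example-small": {"input": 0.15, "output": 0.60},
}


def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call from its token counts."""
    price = PRICING[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000


def cost_per_1k_tokens(total_cost: float, total_tokens: int) -> float:
    """Blended cost per 1K tokens across input and output."""
    return total_cost / total_tokens * 1000


def estimate_run(model: str, num_samples: int, avg_prompt: int, avg_completion: int) -> float:
    """Pre-run estimate: expected cost per call times the number of samples."""
    return num_samples * call_cost(model, avg_prompt, avg_completion)


one_call = call_cost("example-small", prompt_tokens=1000, completion_tokens=500)
print(f"{one_call:.6f}", f"{estimate_run('example-small', 100, 1000, 500):.4f}")
```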

🔎

Hallucination Rate

Linguistic signal analysis — uncertainty markers, hedging phrases, ungrounded claims vs grounding signals. Runs locally, zero cost.

0.0 – 1.0
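
To make the idea of linguistic signal analysis concrete, here is a deliberately small sketch. The phrase lists, the ratio formula, and the choice to return 0.0 when no signal is present are all assumptions for illustration; the framework's own signal sets and weighting are not shown in this document.

```python
# Hypothetical phrase lists; a real scorer would use larger, tuned sets.
HEDGING = ("i think", "probably", "might be", "not sure", "as far as i know")
GROUNDING = ("according to", "as stated in", "the passage states", "the source says")


def hallucination_score(response: str) -> float:
    """Share of uncertainty signals among all detected signals, in [0.0, 1.0]."""
    text = response.lower()
    hedges = sum(text.count(phrase) for phrase in HEDGING)
    grounded = sum(text.count(phrase) for phrase in GROUNDING)
    total = hedges + grounded
    if total == 0:
        return 0.0  # assumption: no signal either way counts as not flagged
    return hedges / total


print(hallucination_score("According to the passage, the answer is 4."))
print(hallucination_score("I think it is probably 5, but I'm not sure."))
```

A score like this only reflects how a response is worded. It cannot tell whether a confidently stated claim is true, so it is best read as a relative signal between models.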

🧠

Reasoning Quality

Scores chain-of-thought depth: reasoning marker density, grounding signals, response length calibration. 1 = shallow, 10 = deep.

1 – 10

How It Works

Simple 4-step evaluation pipeline

From configuration to results in under 2 minutes.

Choose Model + Benchmark

Select any LiteLLM-compatible model and a benchmark (MMLU, TruthfulQA, or custom CSV). Set sample count and concurrency.

Async Parallel Evaluation

The engine schedules every API call up front and uses asyncio.Semaphore to cap how many run at once. Each call has a configurable timeout and retry logic.

Per-Sample Metrics

For each response: accuracy check, latency record, token count and cost, hallucination score, reasoning quality — all computed in parallel.

Aggregation + Storage

Results are aggregated into a full EvaluationResult with percentile statistics and persisted to SQLite automatically.
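
The evaluation step of the pipeline above can be sketched with the standard library alone. This is an illustration under assumptions, not the framework's engine: call_with_retry, evaluate_all, and fake_model are hypothetical names, the backoff schedule is invented, and fake_model stands in for a real provider call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def call_with_retry(call: Callable[[str], Awaitable[str]], prompt: str,
                          semaphore: asyncio.Semaphore, timeout_s: float,
                          max_retries: int) -> str:
    """Run one model call under the shared semaphore, with timeout and retry."""
    async with semaphore:  # at most `concurrency` calls hold a slot at once
        for attempt in range(max_retries + 1):
            try:
                return await asyncio.wait_for(call(prompt), timeout=timeout_s)
            except (asyncio.TimeoutError, ConnectionError):
                if attempt == max_retries:
                    raise
                await asyncio.sleep(0.01 * 2 ** attempt)  # short exponential backoff
    raise RuntimeError("max_retries must be >= 0")


async def evaluate_all(call, prompts: List[str], concurrency: int = 10,
                       timeout_s: float = 30.0, max_retries: int = 2) -> list:
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [call_with_retry(call, p, semaphore, timeout_s, max_retries) for p in prompts]
    # Every coroutine is scheduled up front; the semaphore decides how many run.
    return await asyncio.gather(*tasks, return_exceptions=True)


in_flight = 0
peak_in_flight = 0
attempts: Dict[str, int] = {}


async def fake_model(prompt: str) -> str:
    """Stand-in for an API call; prompts ending in '0' fail on their first attempt."""
    global in_flight, peak_in_flight
    attempts[prompt] = attempts.get(prompt, 0) + 1
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    try:
        await asyncio.sleep(0.01)
        if attempts[prompt] == 1 and prompt.endswith("0"):
            raise ConnectionError("transient failure")
        return f"answer to {prompt}"
    finally:
        in_flight -= 1


prompts = [f"question {i}" for i in range(20)]
results = asyncio.run(evaluate_all(fake_model, prompts, concurrency=5))
print(len(results), peak_in_flight, attempts["question 0"])
```

Passing return_exceptions=True means a call that exhausts its retries shows up as an exception object in the results list instead of cancelling the whole run, which suits per-sample scoring.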

Quick Start

Up and running in 3 commands

Install, configure, and run your first evaluation in minutes.

# Install from PyPI
pip install llm-evaluation-framework

# Or install from source
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework && pip install -e .

# Configure API keys
cp .env.example .env   # add OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Verify
llm-eval --version
llm-eval, version 1.0.0

# Evaluate a single model
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

╭──────────────────────────────────────╮
│  Evaluation: gpt-4o-mini             │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯

# Compare 3 models
llm-eval compare \
  --models gpt-4o-mini \
  --models claude-3-haiku-20240307 \
  --models gemini/gemini-1.5-flash \
  --benchmark mmlu --samples 50

# Export results
llm-eval export --format csv --output results.csv

# Generate PDF report
llm-eval report --run-ids a3f92c1b --output ./reports/

# Start dashboard
llm-eval dashboard --port 8501

import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark

async def main():
    evaluator = LLMEvaluator()
    samples   = MMLUBenchmark().load(num_samples=100)

    config = EvaluationConfig(
        model="gpt-4o-mini",
        benchmark="mmlu",
        num_samples=100,
        temperature=0.0,
        concurrency=10,
    )

    result = await evaluator.evaluate(config, samples)
    print(f"Accuracy:    {result.accuracy:.2%}")
    print(f"P95 Latency: {result.p95_latency_ms:.0f} ms")
    print(f"Total Cost:  ${result.total_cost_usd:.4f}")
    print(f"Hallucin.:   {result.hallucination_rate:.2%}")
    print(f"Reasoning:   {result.avg_reasoning_score:.1f}/10")

asyncio.run(main())

import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark

async def compare():
    evaluator = LLMEvaluator()
    samples   = MMLUBenchmark().load(num_samples=50)

    configs = [
        EvaluationConfig(model="gpt-4o-mini",            benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="claude-3-haiku-20240307", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="gemini/gemini-1.5-flash",benchmark="mmlu", num_samples=50),
    ]

    results = await evaluator.evaluate_multiple(configs, samples)

    for r in sorted(results, key=lambda x: x.accuracy, reverse=True):
        print(f"{r.model:<35} {r.accuracy:>7.1%} {r.avg_latency_ms:>7.0f}ms ${r.cost_per_1k_tokens:.4f}")

asyncio.run(compare())

# Start server
uvicorn llm_eval.api.main:app --reload --port 8000

# Evaluate
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","benchmark":"mmlu","num_samples":50}'

{
  "run_id": "a3f92c1b",
  "accuracy": 0.78,
  "avg_latency_ms": 432.1,
  "p95_latency_ms": 1240.0,
  "total_cost_usd": 0.0012,
  "hallucination_rate": 0.024,
  "avg_reasoning_score": 7.2
}

# Generate PDF report
curl -X POST http://localhost:8000/report \
  -H "Content-Type: application/json" \
  -d '{"run_ids":["a3f92c1b"]}' -o report.pdf

# Clone and configure
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework && cp .env.example .env

# Start API + Dashboard
docker-compose up -d

# API:       http://localhost:8000/docs
# Dashboard: http://localhost:8501

# View logs
docker-compose logs -f

Industry-standard datasets plus custom upload

Benchmark	Samples	Format	Subjects	Use Case	Source
MMLU	~14,000 test	4-choice MC	57 academic subjects	General knowledge & reasoning	HuggingFace Hub
TruthfulQA	817 questions	4-choice MC	Health, law, history, myths	Factual truthfulness	HuggingFace Hub
Custom CSV	Any	prompt + expected	User-defined	Domain-specific evaluation	File Upload
Custom JSON	Any	Array of objects	User-defined	Programmatic benchmarks	File Upload

Benchmark results: MMLU, 100 samples

Model	Accuracy	Avg Latency	P95 Latency	Cost / 1K Tokens	Hallucination	Reasoning
GPT-4o	88.2%	892 ms	2,140 ms	$0.0080	1.8%	8.4 / 10
Claude 3.5 Sonnet	87.6%	1,240 ms	2,890 ms	$0.0090	2.1%	8.6 / 10
GPT-4o-mini	78.4%	432 ms	1,100 ms	$0.0003	3.2%	7.2 / 10
Gemini 1.5 Flash	76.8%	380 ms	910 ms	$0.0001	4.1%	6.8 / 10
Claude 3 Haiku	74.2%	410 ms	980 ms	$0.0010	4.8%	6.5 / 10
Mistral Small	71.0%	520 ms	1,320 ms	$0.0010	5.6%	6.2 / 10

HuggingFace

Demo Space & Public Dataset

Try the framework instantly on HuggingFace Spaces, or load the benchmark dataset directly in Python.

🤗

LLM Evaluation Demo — HuggingFace Space

vigneshwar234 / llm-eval-demo · Gradio App

Gradio · Evaluation · Benchmarking · LLM · Open Source

Run live LLM evaluations directly in your browser — no installation required. Enter any prompt, pick a benchmark sample, and see accuracy, hallucination score, and reasoning quality computed in real time. Built with Gradio on HuggingFace Spaces.

Free · No Login

vigneshwar234 / llm-eval-benchmark

HuggingFace Dataset · 1,200 samples · MIT License

500 train · 200 validation · 500 test · 15+ subjects

from datasets import load_dataset

ds = load_dataset("vigneshwar234/llm-eval-benchmark")
print(ds)
# DatasetDict({
#   train: Dataset({num_rows: 500}),
#   validation: Dataset({num_rows: 200}),
#   test: Dataset({num_rows: 500})
# })

# Use as a custom benchmark
import pandas as pd

df = pd.DataFrame(ds["test"])
samples = df[["prompt", "expected"]].to_dict("records")

Architecture

Modular layers, independently replaceable

Swap the benchmark loader, DB backend, or metrics engine without touching the rest.

┌─────────────────────────────────────────────────────────────────────────┐
│                        LLM EVALUATION FRAMEWORK                         │
│                                                                         │
│   ┌──────────┐  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐    │
│   │  Click   │  │   FastAPI    │  │   Streamlit   │  │  ReportLab  │    │
│   │   CLI    │  │   REST API   │  │   Dashboard   │  │PDF Generator│    │
│   │  7 cmds  │  │ 12 endpoints │  │    5 pages    │  │             │    │
│   └────┬─────┘  └──────┬───────┘  └───────┬───────┘  └──────┬──────┘    │
│        └───────────────┴──────────────────┴─────────────────┘           │
│                                     │                                   │
│                         ┌───────────▼──────────┐                        │
│                         │    Core Evaluator    │                        │
│                         │  asyncio.Semaphore   │                        │
│                         │ configurable timeout │                        │
│                         │  progress callbacks  │                        │
│                         └───────────┬──────────┘                        │
│                                     │                                   │
│         ┌────────────────┬──────────┴──────────┬────────────────┐       │
│         │                │                     │                │       │
│   ┌─────▼──────┐  ┌──────▼──────┐   ┌──────────▼───┐   ┌────────▼───┐   │
│   │  Metrics   │  │ Benchmarks  │   │   Database   │   │  LiteLLM   │   │
│   │ accuracy   │  │ MMLU        │   │   (SQLite)   │   │ OpenAI     │   │
│   │ hallucin.  │  │ TruthfulQA  │   │ save_result  │   │ Anthropic  │   │
│   │ latency    │  │ Custom CSV  │   │ list_results │   │ Google     │   │
│   │ cost       │  │ HF Hub cache│   │ export_csv   │   │ Mistral    │   │
│   │ reasoning  │  │             │   │ export_json  │   │ Together   │   │
│   └────────────┘  └─────────────┘   └──────────────┘   └────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

REST API

12 endpoints, full OpenAPI docs

Start with uvicorn llm_eval.api.main:app --reload then open /docs for interactive Swagger UI.

Method	Endpoint	Description
POST	/evaluate	Run a full evaluation (model, benchmark, num_samples, temperature, concurrency)
POST	/compare	Compare multiple models side-by-side on the same samples (up to 5 models)
POST	/evaluate/custom	Upload a CSV or JSON file as a custom benchmark dataset
GET	/results	List stored evaluation results, filterable by model and benchmark
GET	/results/{run_id}	Get the detailed result for a specific run ID
DELETE	/results/{run_id}	Delete a stored evaluation run
GET	/export/csv	Download all results as a CSV file
GET	/export/json	Download all results as a JSON file
POST	/report	Generate and download a professional PDF evaluation report
GET	/models	List all supported models with pricing table
GET	/benchmarks	List all available benchmarks with descriptions
GET	/health	Health check; returns version and status

Community

Join the community

The framework is used and contributed to by researchers, ML engineers, and startups. Here is where to find us.

🐙

GitHub Repository

Source code, issues, pull requests, and releases. Star the repo to follow updates and help others discover it.

github.com/vignesh2027 →

🤗

HuggingFace Hub

Demo Space and benchmark dataset both live on HuggingFace. Load the dataset in one line, try the demo with no setup.

huggingface.co/vigneshwar234 →

💼

LinkedIn

Follow for benchmark results, new model comparisons, and updates as the framework evolves. Benchmark discussions welcome.

linkedin.com/in/vigneshwar-s-27 →

📜

GitHub Issues

Found a bug or want a feature? Open an issue. Contributions are welcome — see the contributing guide in the README.

Open an Issue →

📊

Related Projects

Connects with LangChain, LlamaIndex, RAGAS, and DeepEval ecosystems. The Python API is composable with most ML pipelines.

Contributing Guide →

🆕

Releases

Watch the GitHub repository for release notifications. Changelog is maintained in the repo with semantic versioning.

View Releases →

Installation

Four ways to get started

pip, source, Docker, or HuggingFace — whichever fits your workflow.

pip (Recommended)

pip install llm-evaluation-framework

# With dashboard
pip install "llm-evaluation-framework[dashboard]"

# Full install
pip install "llm-evaluation-framework[dashboard,reports,dev]"

From Source

git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dashboard,reports,dev]"
cp .env.example .env

Docker Compose

git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
cp .env.example .env
docker-compose up -d
# API :8000  Dashboard :8501

HuggingFace Dataset

from datasets import load_dataset
ds = load_dataset("vigneshwar234/llm-eval-benchmark")
# 1,200 samples: train / validation / test
# MMLU + TruthfulQA subjects mixed
# Science, math, CS, history, law

FAQ

Frequently asked questions

Answers to the most common questions from the community.

Do I need all provider API keys?

No. You only need the key for the model you want to test. Set whatever keys you have in .env and the rest are ignored.

How accurate is the hallucination metric?

The v1 scorer uses heuristic linguistic signal analysis — fast, free, and good for relative comparison between models. It detects hedging phrases and uncertainty markers but cannot verify factual correctness. The v2 roadmap includes an NLI-based scorer (DeBERTa cross-encoder) for ground-truth hallucination detection.

Can I evaluate local or open-source models?

Yes. LiteLLM supports Ollama, vLLM, and HuggingFace TGI. Use the appropriate model string — for example ollama/llama3 or hosted_vllm/meta-llama/Llama-3-8b — and set the custom API base URL in .env.

How many samples do I need for reliable rankings?

20 samples is a quick smoke test. 100 is where trends become visible. 500+ is where rankings become statistically reliable. For production decisions, use at least 200, and prefer 500+ when the models score within a few points of each other. The roadmap includes confidence intervals and bootstrapping to make this explicit.
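
The effect of sample count can be estimated with a standard normal-approximation confidence interval for a proportion. The sketch below is not part of the framework (confidence intervals are listed as a roadmap item); the function name and the 78% example accuracy are illustrative assumptions.

```python
import math


def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% confidence interval."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


for n in (20, 100, 200, 500):
    print(f"n={n:<4} margin=+/-{accuracy_margin(0.78, n):.3f}")
```

At 78% measured accuracy the margin is roughly 18 points with 20 samples, 8 points with 100, 6 points with 200, and under 4 points with 500. Two models whose scores differ by less than the margin cannot be ranked with confidence at that sample size.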

Can I upload my own dataset?

Yes. Upload any CSV or JSON file with a prompt column and an optional expected column via the dashboard, the /evaluate/custom API endpoint, or the Python API with CustomBenchmark.from_file("data.csv").
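
The expected file shape can be illustrated with a small standalone loader. This sketch uses only the standard csv module and is not the framework's CustomBenchmark implementation; load_samples is a hypothetical helper, and treating an empty expected cell as missing is an assumption.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO


def load_samples(handle: TextIO) -> List[Dict[str, Optional[str]]]:
    """Read rows with a required 'prompt' column and an optional 'expected' column."""
    reader = csv.DictReader(handle)
    if "prompt" not in (reader.fieldnames or []):
        raise ValueError("CSV must contain a 'prompt' column")
    return [
        # Empty or absent 'expected' cells become None (no reference answer).
        {"prompt": row["prompt"], "expected": row.get("expected") or None}
        for row in reader
    ]


csv_text = "prompt,expected\nWhat is 2 + 2?,4\nName a prime number.,\n"
samples = load_samples(io.StringIO(csv_text))
print(samples)
```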

How does the side-by-side comparison work?

The evaluate_multiple method loads the same sample list once and passes it to each model configuration, so every model sees identical prompts. All evaluations run concurrently with asyncio.gather, so comparing 4 models takes roughly as long as the slowest single model, not the sum of all four, provider rate limits permitting.
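
The pattern behind this answer is a single asyncio.gather over one coroutine per model, all sharing the same sample list. The sketch below imitates it with sleeps in place of real API calls; the evaluate function, the model names, and the delays are placeholders, not the framework's internals.

```python
import asyncio
import time
from typing import Dict, List


async def evaluate(model: str, samples: List[str], delay_s: float) -> Dict[str, object]:
    """Stand-in for one model's evaluation; the sleep imitates network-bound work."""
    await asyncio.sleep(delay_s)
    return {"model": model, "prompts": list(samples)}


async def evaluate_multiple(model_delays: Dict[str, float], samples: List[str]) -> list:
    jobs = [evaluate(model, samples, delay) for model, delay in model_delays.items()]
    # gather runs the jobs concurrently and returns results in input order.
    return await asyncio.gather(*jobs)


samples = ["What is 2 + 2?", "Name the capital of France."]
start = time.perf_counter()
results = asyncio.run(
    evaluate_multiple({"model-a": 0.03, "model-b": 0.01, "model-c": 0.02}, samples)
)
elapsed = time.perf_counter() - start
print([r["model"] for r in results], f"{elapsed:.3f}s")
```

Because the jobs overlap, the elapsed time tracks the slowest job and not the sum of all three delays.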

Why SQLite and not Postgres?

Zero setup, zero server, file is portable. SQLite handles millions of rows easily for an evaluation tool. The Database class abstracts the layer — switching to Postgres is a one-file change if your team needs it.
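
A storage layer of this kind can be small. The sketch below uses the standard sqlite3 module with an in-memory database; the method names save_result and list_results come from the architecture diagram, but the class name, table schema, signatures, and run IDs are assumptions.

```python
import sqlite3
from typing import List, Optional, Tuple


class ResultStore:
    """Minimal stand-in for a results database (schema is hypothetical)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "run_id TEXT PRIMARY KEY, model TEXT NOT NULL, "
            "benchmark TEXT NOT NULL, accuracy REAL NOT NULL)"
        )

    def save_result(self, run_id: str, model: str, benchmark: str, accuracy: float) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
                (run_id, model, benchmark, accuracy),
            )

    def list_results(self, model: Optional[str] = None,
                     benchmark: Optional[str] = None) -> List[Tuple]:
        query = "SELECT run_id, model, benchmark, accuracy FROM results WHERE 1=1"
        params: List[str] = []
        if model is not None:
            query += " AND model = ?"
            params.append(model)
        if benchmark is not None:
            query += " AND benchmark = ?"
            params.append(benchmark)
        return self.conn.execute(query + " ORDER BY accuracy DESC", params).fetchall()


store = ResultStore()
store.save_result("run-1", "gpt-4o-mini", "mmlu", 0.784)
store.save_result("run-2", "gpt-4o", "mmlu", 0.882)
print(store.list_results(benchmark="mmlu"))
```

Keeping all SQL inside one class is what would make a backend swap a contained change: callers only see save_result and list_results.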
