Evaluation and Benchmarking
Measuring retrieval quality, latency, reliability, and offline A/B outcomes.
Benchmark Dimensions
- Latency: p50/p95/p99 by endpoint.
- Reliability: success rate and error code distribution.
- Rerank behavior: fallback frequency and rank movement signals.
- Answer quality: citation coverage and evaluator judgment.
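The latency and reliability dimensions above can be summarized from raw per-request samples. A minimal sketch, assuming each sample is a `(latency_ms, ok)` tuple collected per endpoint (this shape is illustrative, not the repository's actual record format):

```python
import statistics

def summarize(samples):
    """Summarize latency percentiles and success rate for one endpoint.

    samples: list of (latency_ms, ok) tuples -- an assumed shape for
    illustration, not the benchmark script's actual output schema.
    """
    latencies = sorted(ms for ms, _ in samples)
    # quantiles(n=100) yields 99 cut points, one per percentile
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "success_rate": sum(1 for _, ok in samples if ok) / len(samples),
    }
```

The `inclusive` method interpolates between observed values, so percentiles are stable even for modest sample counts.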
Quick Benchmark
Run the API benchmark from the repository root:
python scripts/benchmark_flux.py --queries-file scripts/benchmark_queries.txt --loops 2 --output reports/flux_benchmark_latest.json

Offline A/B Evaluation
This repository includes an offline comparison harness with two retrieval paths:
- baseline: direct Tavily retrieval path
- flux: retrieval via Flux/search
Run:
python experiments/offline_eval/run_offline_eval.py --skip-synthesis --output experiments/offline_eval/outputs/run_latest.json
python experiments/offline_eval/score_offline_eval.py --run experiments/offline_eval/outputs/run_latest.json

Artifacts:
- run_latest.json (full run)
- judge_packet.jsonl (blind evaluation packet)
- judge_scores_template.csv (manual scoring sheet)
- scorecard_latest.md (presentation-ready summary)
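The blind evaluation packet works by hiding which system produced each answer. A sketch of how such a packet could be built from paired results, assuming each pair is a dict with `query`, `baseline`, and `flux` keys (an assumed shape, not the harness's actual schema):

```python
import random

def build_judge_packet(pairs, seed=0):
    """Build blind A/B entries from paired (baseline, flux) answers.

    pairs: list of dicts with "query", "baseline", "flux" keys -- an
    assumed layout for illustration only.
    Returns (packet, key): the packet randomizes which system appears
    as answer A vs B; the key records the assignment for unblinding.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    packet, key = [], {}
    for p in pairs:
        flip = rng.random() < 0.5
        a, b = ("flux", "baseline") if flip else ("baseline", "flux")
        packet.append({"query": p["query"],
                       "answer_a": p[a],
                       "answer_b": p[b]})
        key[p["query"]] = {"A": a, "B": b}
    return packet, key
```

Keeping the key separate from the packet is what makes the grading blind: graders only ever see `answer_a`/`answer_b`.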
Recommended Review Process
- Run the quantitative benchmark.
- Run the offline A/B dataset pass.
- Conduct blind human grading with the rubric.
- Publish a scorecard covering both latency and quality metrics.
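After blind grading, scores have to be mapped back to systems before they can appear on the scorecard. A minimal sketch, assuming score rows carry `query`, `side` ("A"/"B"), and `score` columns (an assumed layout for the manual scoring sheet, not its actual columns) plus the assignment key saved when the packet was built:

```python
from collections import defaultdict

def unblind_scores(rows, key):
    """Map blind A/B grades back to systems and average per system.

    rows: dicts with "query", "side" ("A"/"B"), "score" -- an assumed
    layout for illustration, not the real CSV columns.
    key: query -> {"A": system, "B": system} mapping from packet build.
    """
    totals = defaultdict(list)
    for r in rows:
        system = key[r["query"]][r["side"]]  # unblind this grade
        totals[system].append(float(r["score"]))
    return {s: sum(v) / len(v) for s, v in totals.items()}
```

The resulting per-system means slot directly into the quality half of the scorecard alongside the latency percentiles.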