Evaluation and Benchmarking
Measuring retrieval quality, latency, reliability, and offline A/B outcomes.
Benchmark Dimensions
- Latency: p50/p95/p99 by endpoint.
- Reliability: success rate and error code distribution.
- Rerank behavior: fallback frequency and rank movement signals.
- Answer quality: citation coverage and evaluator judgment.
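The latency and reliability dimensions above can be summarized from raw per-request samples. A minimal sketch, assuming each sample is a `(latency_ms, ok)` tuple collected per endpoint (this shape is illustrative, not the repository's actual record format):

```python
import statistics

def summarize(samples):
    """Summarize latency percentiles and success rate for one endpoint.

    samples: list of (latency_ms, ok) tuples -- an assumed shape for
    illustration, not the benchmark script's actual output schema.
    """
    latencies = sorted(ms for ms, _ in samples)
    # quantiles(n=100) yields 99 cut points, one per percentile
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "success_rate": sum(1 for _, ok in samples if ok) / len(samples),
    }
```

The `inclusive` method interpolates between observed values, so percentiles are stable even for modest sample counts.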
Quick Benchmark
Run the API benchmark from the repository root:
python scripts/benchmark_flux.py --queries-file scripts/benchmark_queries.txt --loops 2 --output reports/flux_benchmark_latest.json

Offline A/B Evaluation
This repository includes an offline comparison harness with two retrieval paths:
- baseline: direct Tavily retrieval path
- flux: retrieval via Flux/search
Run:
python experiments/offline_eval/run_offline_eval.py --skip-synthesis --output experiments/offline_eval/outputs/run_latest.json
python experiments/offline_eval/score_offline_eval.py --run experiments/offline_eval/outputs/run_latest.json

Artifacts:
- run_latest.json (full run)
- judge_packet.jsonl (blind evaluation packet)
- judge_scores_template.csv (manual scoring sheet)
- scorecard_latest.md (presentation-ready summary)
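The blind evaluation packet works by hiding which system produced each answer. A sketch of how such a packet could be built from paired results, assuming each pair is a dict with `query`, `baseline`, and `flux` keys (an assumed shape, not the harness's actual schema):

```python
import random

def build_judge_packet(pairs, seed=0):
    """Build blind A/B entries from paired (baseline, flux) answers.

    pairs: list of dicts with "query", "baseline", "flux" keys -- an
    assumed layout for illustration only.
    Returns (packet, key): the packet randomizes which system appears
    as answer A vs B; the key records the assignment for unblinding.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    packet, key = [], {}
    for p in pairs:
        flip = rng.random() < 0.5
        a, b = ("flux", "baseline") if flip else ("baseline", "flux")
        packet.append({"query": p["query"],
                       "answer_a": p[a],
                       "answer_b": p[b]})
        key[p["query"]] = {"A": a, "B": b}
    return packet, key
```

Keeping the key separate from the packet is what makes the grading blind: graders only ever see `answer_a`/`answer_b`.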
Recommended Review Process
- Run the quantitative benchmark.
- Run the offline A/B dataset pass.
- Conduct blind human grading with the rubric.
- Publish a scorecard covering both latency and quality metrics.
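After blind grading, scores have to be mapped back to systems before they can appear on the scorecard. A minimal sketch, assuming score rows carry `query`, `side` ("A"/"B"), and `score` columns (an assumed layout for the manual scoring sheet, not its actual columns) plus the assignment key saved when the packet was built:

```python
from collections import defaultdict

def unblind_scores(rows, key):
    """Map blind A/B grades back to systems and average per system.

    rows: dicts with "query", "side" ("A"/"B"), "score" -- an assumed
    layout for illustration, not the real CSV columns.
    key: query -> {"A": system, "B": system} mapping from packet build.
    """
    totals = defaultdict(list)
    for r in rows:
        system = key[r["query"]][r["side"]]  # unblind this grade
        totals[system].append(float(r["score"]))
    return {s: sum(v) / len(v) for s, v in totals.items()}
```

The resulting per-system means slot directly into the quality half of the scorecard alongside the latency percentiles.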