<!-- .github/pull_request_template.md -->

## Description

- Split the metrics dashboard into two modules: a calculator (statistics) and a generator (visualization); a rough sketch of the split is appended at the end of this description.
- Added aggregate metrics as a new phase in the evaluation pipeline.
- Created a Modal example that runs multiple evaluations in parallel and collects the results into a single combined output (see the sketch appended below).

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **New Features**
  - Enhanced metrics reporting with improved visualizations, including histogram and confidence interval plots.
  - Introduced an asynchronous evaluation process that supports parallel execution and streamlined result aggregation.
  - Added new configuration options to control metrics calculation and aggregated output storage.

- **Refactor**
  - Restructured dashboard generation and evaluation workflows into a more modular, maintainable design.
  - Improved error handling and logging for better feedback during evaluation processes.

- **Bug Fixes**
  - Updated test cases to ensure accurate validation of the new dashboard generation and metrics calculation functionalities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
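### Illustrative sketch: parallel evaluations on Modal

A minimal sketch of the parallel-evaluation pattern the Modal example is described as implementing, assuming Modal's `App` / `@app.function()` / `.map()` API. The `run_single_eval` function, its parameter dictionaries, and the combined-output format are hypothetical placeholders, not the actual code added in this PR.

```python
import json

import modal

app = modal.App("parallel-eval-sketch")


@app.function()
def run_single_eval(params: dict) -> dict:
    # Placeholder for one evaluation run; the real pipeline would build the
    # corpus, answer the questions, and score them with the chosen metrics.
    return {"params": params, "metrics": {"correctness": 0.0}}


@app.local_entrypoint()
def main():
    # Hypothetical parameter grid; each entry becomes one parallel run.
    param_sets = [
        {"benchmark": "hotpot_qa", "num_samples": 10},
        {"benchmark": "hotpot_qa", "num_samples": 50},
    ]

    # .map() fans the runs out to parallel Modal containers and yields
    # results as they complete.
    results = list(run_single_eval.map(param_sets))

    # Collect everything into a single combined output file.
    with open("combined_results.json", "w") as f:
        json.dump(results, f, indent=2)
```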
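### Illustrative sketch: calculator vs. generator split

A rough sketch of the calculator/generator separation, under the assumption that the calculator produces aggregate statistics (here a mean plus a bootstrapped 95% confidence interval) and the generator turns scores into plots. The function names, input format, and use of Plotly are assumptions for illustration, not the PR's actual interfaces.

```python
import random
import statistics

import plotly.graph_objects as go


def calculate_aggregate_metrics(scores: list[float], n_boot: int = 1000) -> dict:
    """Calculator side: pure statistics, no plotting."""
    # Bootstrap the mean by resampling the scores with replacement.
    boot_means = sorted(
        statistics.mean(random.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    return {
        "mean": statistics.mean(scores),
        "ci_lower": boot_means[int(0.025 * n_boot)],
        "ci_upper": boot_means[int(0.975 * n_boot)],
    }


def generate_histogram(metric_name: str, scores: list[float]) -> go.Figure:
    """Generator side: visualization only, consumes raw scores."""
    fig = go.Figure(go.Histogram(x=scores))
    fig.update_layout(title=f"{metric_name} distribution", xaxis_title=metric_name)
    return fig


if __name__ == "__main__":
    scores = [0.7, 0.8, 0.9, 0.65, 0.85]
    print(calculate_aggregate_metrics(scores))
    generate_histogram("correctness", scores).write_html("correctness_hist.html")
```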