cognee/notebooks/cognee_hotpot_eval.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluation on the hotpotQA dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"cognee[deepeval]\""
]
},
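{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both cognee and the deepeval metrics call an LLM, so API credentials need to be available before running the cells below. The variable names in this sketch are assumptions based on common cognee setups; adjust them to whatever your configuration expects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Assumed variable names; replace the placeholders with your own keys.\n",
"os.environ.setdefault(\"LLM_API_KEY\", \"YOUR_LLM_API_KEY\")\n",
"os.environ.setdefault(\"OPENAI_API_KEY\", \"YOUR_OPENAI_API_KEY\")"
]
},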
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evals.eval_on_hotpot import deepeval_answers, answer_qa_instance\n",
"from evals.qa_dataset_utils import load_qa_dataset\n",
"from evals.qa_metrics_utils import get_metrics\n",
"from evals.qa_context_provider_utils import qa_context_providers\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"import statistics\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_samples = 10 # With cognee, it takes ~1m10s per sample\n",
"dataset_name_or_filename = \"hotpotqa\"\n",
"dataset = load_qa_dataset(dataset_name_or_filename)"
]
},
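{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at what was loaded. `load_qa_dataset` is assumed to return a sequence of QA instances (the `random.sample` call below relies on that), so we can check its size and inspect a single instance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: dataset size and the shape of one QA instance.\n",
"print(f\"Loaded {len(dataset)} QA instances\")\n",
"print(dataset[0])"
]
},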
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Context Provider\n",
"**Options**: \n",
"- **cognee**: context with cognee \n",
"- **no_rag**: raw context \n",
"- **simple_rag**: context with simple rag \n",
"- **brute_force**: context with brute force triplet search"
]
},
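{
"cell_type": "markdown",
"metadata": {},
"source": [
"The options above are the keys of the `qa_context_providers` mapping imported earlier, so the available names can be listed directly (a minimal sanity check; the exact set may vary between cognee versions)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check: which context providers this cognee version exposes.\n",
"print(sorted(qa_context_providers.keys()))"
]
},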
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Here, \"cognee\" is used as context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"cognee\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances\n",
"1. **Random Sampling**: Selects a random subset of the dataset if `num_samples` is defined.\n",
"2. **Context Filename**: Defines the file path for storing contexts generated by the context provider.\n",
"3. **Answer Generation**: Iterates over the QA instances using `tqdm` for progress tracking and generates answers using the `answer_qa_instance` function asynchronously."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
" Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
" f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
" answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
" answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score\n",
"**Options**: \n",
"- **Correctness**: Is the actual output factually correct based on the expected output?\n",
"- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?\n",
"- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?\n",
"- **Empowerment**: How well does the answer help the reader understand and make informed judgements about the topic?\n",
"- **Directness**: How specifically and clearly does the answer address the question?\n",
"- **F1 Score**: the harmonic mean of the precision and recall, using word-level Exact Match\n",
"- **EM Score**: the rate at which the predicted strings exactly match their references, ignoring white spaces and capitalization.\n",
"\n",
"We can also calculate scores based on the same metrics with promptfoo"
]
},
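{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the last two definitions concrete, here is a toy, self-contained sketch of word-level F1 and EM on a single prediction/reference pair. It is purely illustrative and may differ in details (tokenization, normalization) from the metrics returned by `get_metrics`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: toy word-level F1 and EM, not necessarily identical\n",
"# to the implementations behind get_metrics / deepeval_answers.\n",
"from collections import Counter\n",
"\n",
"\n",
"def toy_f1(prediction: str, reference: str) -> float:\n",
"    pred_tokens = prediction.lower().split()\n",
"    ref_tokens = reference.lower().split()\n",
"    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())\n",
"    if overlap == 0:\n",
"        return 0.0\n",
"    precision = overlap / len(pred_tokens)\n",
"    recall = overlap / len(ref_tokens)\n",
"    return 2 * precision * recall / (precision + recall)\n",
"\n",
"\n",
"def toy_em(prediction: str, reference: str) -> float:\n",
"    return float(prediction.strip().lower() == reference.strip().lower())\n",
"\n",
"\n",
"print(toy_f1(\"the Eiffel Tower in Paris\", \"Eiffel Tower in Paris\"))  # ~0.89\n",
"print(toy_em(\" Paris \", \"paris\"))  # 1.0"
]
},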
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
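{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aggregation above, the mean of the first metric's score across `eval_results.test_results`, repeats for every metric in this notebook. As a convenience it could be wrapped in a small helper like the sketch below, which assumes the same result structure; the explicit form is kept in the remaining cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional helper: average the first metric's score across test results.\n",
"# Assumes the eval_results structure used in the cells above.\n",
"def mean_first_metric_score(eval_results) -> float:\n",
"    return statistics.mean(\n",
"        result.metrics_data[0].score for result in eval_results.test_results\n",
"    )\n",
"\n",
"\n",
"print(mean_first_metric_score(eval_results))"
]
},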
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating`\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
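{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, collect the scores computed above for the `\"cognee\"` provider into a single dictionary so they are easy to compare with the other context providers evaluated below; an analogous cell can be run at the end of each section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Gather the cognee scores computed above for later comparison.\n",
"cognee_scores = {\n",
"    \"Correctness\": Correctness,\n",
"    \"Comprehensiveness\": Comprehensiveness,\n",
"    \"Diversity\": Diversity,\n",
"    \"Empowerment\": Empowerment,\n",
"    \"Directness\": Directness,\n",
"    \"F1\": F1_score,\n",
"    \"EM\": EM,\n",
"}\n",
"print(cognee_scores)"
]
},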
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"no_rag\" as context provider"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"no_rag\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
" Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
" f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
" answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
" answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculate `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating`\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"simple_rag\" as context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"simple_rag\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
" Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
" f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
" answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
" answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculate `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating`\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1 Score\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"brute_force\" as context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"brute_force\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
" Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
" f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
" answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
" answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculate `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating`\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
" [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "cognee-c83GrcRT-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}