<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

- **New Features**
  - Introduced enhanced visualization capabilities that let users launch a dedicated server for visual displays.
- **Documentation**
  - Updated several interactive notebooks to include execution outputs and expanded explanatory content for better user guidance.
- **Style**
  - Refined formatting and layout across notebooks to ensure consistent presentation and improved readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluation on the hotpotQA dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"cognee[deepeval]\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evals.eval_on_hotpot import deepeval_answers, answer_qa_instance\n",
"from evals.qa_dataset_utils import load_qa_dataset\n",
"from evals.qa_metrics_utils import get_metrics\n",
"from evals.qa_context_provider_utils import qa_context_providers\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"import statistics\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_samples = 10 # With cognee, it takes ~1m10s per sample\n",
"dataset_name_or_filename = \"hotpotqa\"\n",
"dataset = load_qa_dataset(dataset_name_or_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Context Provider\n",
"**Options**: \n",
"- **cognee**: context with cognee \n",
"- **no_rag**: raw context \n",
"- **simple_rag**: context with simple rag \n",
"- **brute_force**: context with brute force triplet search"
]
},
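{
"cell_type": "markdown",
"metadata": {},
"source": [
"The registered provider names can also be listed directly. This is a minimal sketch; it assumes `qa_context_providers` is a plain mapping keyed by the provider names above, which is how it is indexed in the cells below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the context providers registered in this evals setup.\n",
"print(sorted(qa_context_providers))"
]
},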
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Here, \"cognee\" is used as the context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"cognee\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances\n",
"1. **Random Sampling**: Selects a random subset of the dataset if `num_samples` is defined.\n",
"2. **Context Filename**: Defines the file path for storing contexts generated by the context provider.\n",
"3. **Answer Generation**: Iterates over the QA instances with `tqdm` for progress tracking and generates each answer asynchronously with `answer_qa_instance`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
"    Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
"    f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
"    answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
"    answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score\n",
"**Options**: \n",
"- **Correctness**: Is the actual output factually correct based on the expected output?\n",
"- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?\n",
"- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?\n",
"- **Empowerment**: How well does the answer help the reader understand and make informed judgements about the topic?\n",
"- **Directness**: How specifically and clearly does the answer address the question?\n",
"- **F1 Score**: The harmonic mean of word-level precision and recall between the predicted and reference answers (see the sketch below)\n",
"- **EM Score**: The rate at which predicted strings exactly match their references, ignoring whitespace and capitalization\n",
"\n",
"We can also calculate scores for the same metrics with promptfoo."
]
},
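{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of how word-level F1 and EM can be computed for a single prediction/reference pair. It only illustrates the definitions above; the actual scoring in this notebook comes from `get_metrics` and `deepeval_answers`, and the helper names used here (`normalize`, `em_score`, `f1_score_words`) are purely illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"\n",
"def normalize(text: str) -> str:\n",
"    # Lowercase and collapse whitespace; punctuation handling is omitted for brevity.\n",
"    return \" \".join(text.lower().split())\n",
"\n",
"\n",
"def em_score(prediction: str, reference: str) -> float:\n",
"    # 1.0 if the normalized strings match exactly, otherwise 0.0.\n",
"    return float(normalize(prediction) == normalize(reference))\n",
"\n",
"\n",
"def f1_score_words(prediction: str, reference: str) -> float:\n",
"    # Word-level F1: harmonic mean of precision and recall over shared tokens.\n",
"    pred_tokens = normalize(prediction).split()\n",
"    ref_tokens = normalize(reference).split()\n",
"    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())\n",
"    if common == 0:\n",
"        return 0.0\n",
"    precision = common / len(pred_tokens)\n",
"    recall = common / len(ref_tokens)\n",
"    return 2 * precision * recall / (precision + recall)\n",
"\n",
"\n",
"print(em_score(\"Paris\", \"paris \"))\n",
"print(f1_score_words(\"the capital is Paris\", \"Paris is the capital of France\"))"
]
},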
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"no_rag\" as context provider"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"no_rag\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
"    Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
"    f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
"    answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
"    answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"simple_rag\" as context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"simple_rag\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
"    Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
"    f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
"    answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
"    answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"brute_force\" as context provider\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context_provider_name = \"brute_force\"\n",
"context_provider = qa_context_providers[context_provider_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generate Answers for QA Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"instances = dataset if not num_samples else random.sample(dataset, num_samples)\n",
"\n",
"out_path = \"out\"\n",
"if not Path(out_path).exists():\n",
"    Path(out_path).mkdir()\n",
"contexts_filename = out_path / Path(\n",
"    f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n",
")\n",
"\n",
"answers = []\n",
"for instance in tqdm(instances, desc=\"Getting answers\"):\n",
"    answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n",
"    answers.append(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Metrics for Evaluation and Calculate Score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Correctness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Correctness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Correctness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Correctness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Comprehensiveness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Comprehensiveness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Comprehensiveness = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Comprehensiveness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Diversity\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Diversity\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Diversity)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Empowerment\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Empowerment\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Empowerment = statistics.mean(\n",
"    [result.metrics_data[0].score for result in eval_results.test_results]\n",
")\n",
"print(Empowerment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Directness\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"Directness\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(Directness)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"F1 Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"F1\"]\n",
"eval_metrics = get_metrics(metric_name_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(F1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Calculating `\"Exact Match (EM) Score\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric_name_list = [\"EM\"]\n",
"eval_metrics = get_metrics(metric_name_list)\n",
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n",
"print(EM)"
]
}
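,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summarize Results\n",
"Optionally, the per-metric means computed above can be collected into a single summary for the current context provider. This is a minimal sketch (the `results_summary` dictionary is just an illustration): each variable simply holds the value from its most recent run, so re-run the relevant metric cells for a given provider before recording its numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Gather the most recently computed per-metric means for the current provider.\n",
"results_summary = {\n",
"    \"context_provider\": context_provider_name,\n",
"    \"Correctness\": Correctness,\n",
"    \"Comprehensiveness\": Comprehensiveness,\n",
"    \"Diversity\": Diversity,\n",
"    \"Empowerment\": Empowerment,\n",
"    \"Directness\": Directness,\n",
"    \"F1\": F1_score,\n",
"    \"EM\": EM,\n",
"}\n",
"results_summary"
]
}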
],
"metadata": {
"kernelspec": {
"display_name": "cognee-c83GrcRT-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}