{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation on the hotpotQA dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"cognee[deepeval]\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evals.eval_on_hotpot import deepeval_answers, answer_qa_instance\n", "from evals.qa_dataset_utils import load_qa_dataset\n", "from evals.qa_metrics_utils import get_metrics\n", "from evals.qa_context_provider_utils import qa_context_providers\n", "from pathlib import Path\n", "from tqdm import tqdm\n", "import statistics\n", "import random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples = 10 # With cognee, it takes ~1m10s per sample\n", "dataset_name_or_filename = \"hotpotqa\"\n", "dataset = load_qa_dataset(dataset_name_or_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Context Provider\n", "**Options**: \n", "- **cognee**: context with cognee \n", "- **no_rag**: raw context \n", "- **simple_rag**: context with simple rag \n", "- **brute_force**: context with brute force triplet search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Here, \"cognee\" is used as context provider\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "context_provider_name = \"cognee\"\n", "context_provider = qa_context_providers[context_provider_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate Answers for QA Instances\n", "1. **Random Sampling**: Selects a random subset of the dataset if `num_samples` is defined.\n", "2. **Context Filename**: Defines the file path for storing contexts generated by the context provider.\n", "3. **Answer Generation**: Iterates over the QA instances using `tqdm` for progress tracking and generates answers using the `answer_qa_instance` function asynchronously." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random.seed(42)\n", "instances = dataset if not num_samples else random.sample(dataset, num_samples)\n", "\n", "out_path = \"out\"\n", "if not Path(out_path).exists():\n", " Path(out_path).mkdir()\n", "contexts_filename = out_path / Path(\n", " f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n", ")\n", "\n", "answers = []\n", "for instance in tqdm(instances, desc=\"Getting answers\"):\n", " answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n", " answers.append(answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define Metrics for Evaluation and Calculate Score\n", "**Options**: \n", "- **Correctness**: Is the actual output factually correct based on the expected output?\n", "- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?\n", "- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?\n", "- **Empowerment**: How well does the answer help the reader understand and make informed judgements about the topic?\n", "- **Directness**: How specifically and clearly does the answer address the question?\n", "- **F1 Score**: the harmonic mean of the precision and recall, using word-level Exact Match\n", "- **EM Score**: the rate at which the predicted strings exactly match their references, ignoring white spaces and capitalization.\n", "\n", "We can also calculate scores based on the same metrics with promptfoo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Correctness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Correctness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Correctness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Correctness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Comprehensiveness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Comprehensiveness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Comprehensiveness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Comprehensiveness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Diversity\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Diversity\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Diversity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"##### Calculating`\"Empowerment\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Empowerment\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Empowerment = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Empowerment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Directness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Directness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Directness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"F1 Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"F1\"]\n", "eval_metrics = get_metrics(metric_name_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(F1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Exact Match (EM) Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"EM\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(EM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"no_rag\" as context provider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "context_provider_name = \"no_rag\"\n", "context_provider = qa_context_providers[context_provider_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate Answers for QA Instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random.seed(42)\n", "instances = dataset if not num_samples else random.sample(dataset, num_samples)\n", "\n", "out_path = \"out\"\n", "if not Path(out_path).exists():\n", " Path(out_path).mkdir()\n", "contexts_filename = out_path / Path(\n", " f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n", ")\n", "\n", "answers = []\n", "for instance in tqdm(instances, desc=\"Getting answers\"):\n", " answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n", " answers.append(answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define Metrics for 
Evaluation and Calculate Score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculate `\"Correctness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Correctness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Correctness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Correctness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Comprehensiveness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Comprehensiveness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Comprehensiveness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Comprehensiveness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Diversity\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Diversity\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Diversity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating`\"Empowerment\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Empowerment\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Empowerment = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Empowerment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Directness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Directness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Directness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"F1 Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"F1\"]\n", "eval_metrics = get_metrics(metric_name_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval_results = await deepeval_answers(instances, answers, 
eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(F1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Exact Match (EM) Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"EM\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(EM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"simple_rag\" as context provider\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "context_provider_name = \"simple_rag\"\n", "context_provider = qa_context_providers[context_provider_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate Answers for QA Instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random.seed(42)\n", "instances = dataset if not num_samples else random.sample(dataset, num_samples)\n", "\n", "out_path = \"out\"\n", "if not Path(out_path).exists():\n", " Path(out_path).mkdir()\n", "contexts_filename = out_path / Path(\n", " f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n", ")\n", "\n", "answers = []\n", "for instance in tqdm(instances, desc=\"Getting answers\"):\n", " answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n", " answers.append(answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define Metrics for Evaluation and Calculate Score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculate `\"Correctness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Correctness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Correctness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Correctness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Comprehensiveness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Comprehensiveness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Comprehensiveness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Comprehensiveness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Diversity\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Diversity\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", 
"eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Diversity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating`\"Empowerment\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Empowerment\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Empowerment = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Empowerment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Directness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Directness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Directness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"F1\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"F1 Score\"]\n", "eval_metrics = get_metrics(metric_name_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(F1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Exact Match (EM) Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"EM\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(EM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"brute_force\" as context provider\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "context_provider_name = \"brute_force\"\n", "context_provider = qa_context_providers[context_provider_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate Answers for QA Instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random.seed(42)\n", "instances = dataset if not num_samples else random.sample(dataset, num_samples)\n", "\n", "out_path = \"out\"\n", "if not Path(out_path).exists():\n", " Path(out_path).mkdir()\n", "contexts_filename = out_path / 
Path(\n", " f\"contexts_{dataset_name_or_filename.split('.')[0]}_{context_provider_name}.json\"\n", ")\n", "\n", "answers = []\n", "for instance in tqdm(instances, desc=\"Getting answers\"):\n", " answer = await answer_qa_instance(instance, context_provider, contexts_filename)\n", " answers.append(answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define Metrics for Evaluation and Calculate Score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculate `\"Correctness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Correctness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Correctness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Correctness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Comprehensiveness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Comprehensiveness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Comprehensiveness = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Comprehensiveness)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Diversity\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Diversity\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Diversity = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Diversity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating`\"Empowerment\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Empowerment\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Empowerment = statistics.mean(\n", " [result.metrics_data[0].score for result in eval_results.test_results]\n", ")\n", "print(Empowerment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Directness\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"Directness\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Directness = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(Directness)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "##### Calculating `\"F1 Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"F1\"]\n", "eval_metrics = get_metrics(metric_name_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F1_score = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(F1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculating `\"Exact Match (EM) Score\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_name_list = [\"EM\"]\n", "eval_metrics = get_metrics(metric_name_list)\n", "eval_results = await deepeval_answers(instances, answers, eval_metrics[\"deepeval_metrics\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EM = statistics.mean([result.metrics_data[0].score for result in eval_results.test_results])\n", "print(EM)" ] } ], "metadata": { "kernelspec": { "display_name": "cognee-c83GrcRT-py3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 2 }