QA Evaluation

Repeated runs of QA evaluation on a 24-item HotpotQA subset, comparing Mem0, Graphiti, LightRAG, and Cognee (with multiple retriever configurations). Modal is used for distributed benchmark execution.

Dataset

  • hotpot_qa_24_corpus.json and hotpot_qa_24_qa_pairs.json - corpus documents and question-answer pairs (a loading sketch follows this list)
  • hotpot_qa_24_instance_filter.json for instance filtering
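
A minimal loading sketch, assuming the corpus file holds the passages to ingest and the QA-pairs file is a list of question/answer records (the field names below are assumptions, not the documented schema):

```python
import json

def load_dataset(corpus_path="hotpot_qa_24_corpus.json",
                 qa_path="hotpot_qa_24_qa_pairs.json"):
    """Hypothetical loader; adjust field names to the actual JSON schema."""
    with open(corpus_path) as f:
        corpus = json.load(f)      # passages to ingest into each system
    with open(qa_path) as f:
        qa_pairs = json.load(f)    # e.g. [{"question": ..., "answer": ...}, ...]
    return corpus, qa_pairs
```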

Systems Evaluated

  • Mem0: OpenAI-based memory QA system
  • Graphiti: LangChain + Neo4j knowledge graph QA
  • LightRAG: Falkor's GraphRAG-SDK
  • Cognee: Multiple retriever configurations (GRAPH_COMPLETION, GRAPH_COMPLETION_COT, GRAPH_COMPLETION_CONTEXT_EXTENSION); see the sketch after this list
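
The Cognee configurations differ only in the search type used at query time. A rough sketch of a single-question call, assuming cognee's async search API (the import path and keyword names here are assumptions and may differ from the repo's actual code):

```python
import asyncio

import cognee
from cognee import SearchType  # assumed import path for the search-type enum

async def answer_with_cognee(question: str,
                             search_type=SearchType.GRAPH_COMPLETION) -> str:
    # Assumes the corpus was already added and cognified; the keyword
    # argument names are assumptions about cognee's search signature.
    results = await cognee.search(query_type=search_type, query_text=question)
    return str(results[0]) if results else ""

# asyncio.run(answer_with_cognee("Example question?", SearchType.GRAPH_COMPLETION_COT))
```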

Project Structure

  • src/ - Analysis scripts and QA implementations
  • src/modal_apps/ - Modal deployment configurations
  • src/qa/ - QA benchmark classes
  • src/helpers/ and src/analysis/ - Utilities

Notes:

  • Use pyproject.toml for dependency management
  • Ensure Modal CLI is configured
  • Modular QA benchmark classes enable parallel execution on platforms beyond Modal (a rough interface sketch follows this list)
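
An illustrative shape for such a class, not the repo's actual interface in src/qa/:

```python
from abc import ABC, abstractmethod

class QABenchmark(ABC):
    """Illustrative base class; concrete subclasses wrap Mem0, Graphiti, etc."""

    def __init__(self, corpus: dict, qa_pairs: list[dict]):
        self.corpus = corpus
        self.qa_pairs = qa_pairs

    @abstractmethod
    def ingest(self) -> None:
        """Load the corpus into the system under test."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Return the system's answer to a single question."""

    def run(self) -> list[dict]:
        self.ingest()
        return [
            {"question": qa["question"],
             "golden_answer": qa["answer"],
             "answer": self.answer(qa["question"])}
            for qa in self.qa_pairs
        ]
```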

Running Benchmarks (Modal)

Execute repeated runs via Modal apps:

  • modal run modal_apps/modal_qa_benchmark_<system>.py

Where <system> is one of: mem0, graphiti, lightrag, cognee

Raw results are stored in Modal volumes under /qa-benchmarks/<benchmark>/{answers,evaluated}
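
A minimal sketch of what one of these Modal apps looks like, assuming illustrative app, volume, and function names (the repo's actual apps in src/modal_apps/ will differ in detail):

```python
import json
import os

import modal

app = modal.App("qa-benchmark-example")
image = modal.Image.debian_slim().pip_install("openai")  # plus the system under test
results_volume = modal.Volume.from_name("qa-benchmarks", create_if_missing=True)

@app.function(image=image, volumes={"/qa-benchmarks": results_volume}, timeout=3600)
def run_benchmark(run_id: int) -> None:
    # Placeholder body: ingest the corpus, answer the 24 questions, evaluate.
    answers = [{"question": "...", "answer": "..."}]
    out_path = f"/qa-benchmarks/example/answers/run_{run_id}.json"
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as f:
        json.dump(answers, f)
    results_volume.commit()  # persist the new files to the shared volume

@app.local_entrypoint()
def main(runs: int = 5):
    # Fan the repeated runs out to parallel Modal workers.
    list(run_benchmark.map(range(runs)))
```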

Results Analysis

  • python run_cross_benchmark_analysis.py
  • Downloads the Modal volumes and processes the evaluated JSONs
  • Generates per-benchmark CSVs and a cross-benchmark summary (aggregation sketched after this list)
  • Use visualize_benchmarks.py to create comparison charts
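
The roll-up can be pictured roughly as follows; the directory layout and metric field names are assumptions, and run_cross_benchmark_analysis.py remains the authoritative implementation:

```python
import glob
import json

import pandas as pd

def summarize(evaluated_dir: str) -> pd.DataFrame:
    # Each evaluated file is assumed to hold a list of per-question records
    # with numeric metric fields (e.g. exact match, F1, correctness scores).
    rows = []
    for path in glob.glob(f"{evaluated_dir}/*.json"):
        with open(path) as f:
            records = json.load(f)
        for record in records:
            rows.append({"run": path,
                         **{k: v for k, v in record.items()
                            if isinstance(v, (int, float))}})
    df = pd.DataFrame(rows)
    # Mean of every numeric metric per run, then distribution across runs.
    return df.groupby("run").mean(numeric_only=True).describe()

# summarize("qa-benchmarks/cognee/evaluated").to_csv("cognee_summary.csv")
```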

Results

  • 45 evaluation cycles on 24 HotpotQA questions, scored with multiple metrics (EM, F1, DeepEval Correctness, Human-like Correctness)
  • Significant variance in the metrics across these small runs, largely due to LLM-as-a-judge inconsistencies
  • Cognee showed consistent improvements across all measured dimensions compared to Mem0, LightRAG, and Graphiti

Visualization Results

The following charts visualize the benchmark results and performance comparisons:

Comprehensive Metrics Comparison

(Figure: comprehensive_metrics_comparison.png)

A comprehensive comparison of all evaluated systems across multiple metrics, showing Cognee's performance relative to Mem0, Graphiti, and LightRAG.

Optimized Cognee Configurations

(Figure: optimized_cognee_configurations.png)

Performance analysis of different Cognee retriever configurations (GRAPH_COMPLETION, GRAPH_COMPLETION_COT, GRAPH_COMPLETION_CONTEXT_EXTENSION), showing optimization results.

Notes

  • Traditional QA metrics (EM/F1) miss the core value of AI memory systems: they measure letter- and word-level overlap rather than information content (illustrated by the sketch below)
  • HotpotQA benchmark mismatch: the dataset is designed for multi-hop reasoning but operates in constrained contexts, unlike the real-world cross-context linking these systems target
  • DeepEval variance: LLM-as-a-judge evaluation carries the inconsistencies of the underlying language model
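
To make the first point concrete, here is the standard SQuAD-style EM/F1 computation; an answer that is semantically correct but phrased differently from the gold string still scores near zero:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> int:
    return int(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct but verbose answer is punished by both metrics:
pred = "Yes, both films were directed by the same person."
print(exact_match(pred, "yes"))   # 0
print(round(f1(pred, "yes"), 2))  # 0.22
```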