feat(#1357): Lexical chunk retriever (#1392)

<!-- .github/pull_request_template.md -->

## Description
<!-- 
Please provide a clear, human-generated description of the changes in
this PR.
DO NOT use AI-generated descriptions. We want to understand your thought
process and reasoning.
-->
I Implemented Lexical Chunk Retriever In the LexicalRetriever class is
Inherite the BaseRetriever and The DocumentChunk are lazy loaded when
first time query is made because it save time during object
initialization
and the function get_context and the get_completion are Implemented same
as the ChunksRetriever the only diffrence is that the DocumentChunk are
converted to match the output type of the ChunksRetriever using function
get_own_properties in the utils.

## Type of Change
<!-- Please check the relevant option -->
- [-] Bug fix (non-breaking change that fixes an issue)
- [-] New feature (non-breaking change that adds functionality)
- [-] Breaking change (fix or feature that would cause existing
functionality to change)
- [-] Documentation update
- [-] Code refactoring
- [-] Performance improvement
- [-] Other (please specify):

## Changes Made
<!-- List the specific changes made in this PR -->
- Added LexicalRetriever base class with customizable tokenizer & scorer
     - Implemented caching of DocumentChunk tokens and payloads 
- Added robust initialization with error handling and logging -
Implemented get_context with top_k ranking and optional scores
- Implemented get_completion consistent with BaseRetriever interface
- Added JaccardChunksRetriever demo using set/multiset Jaccard
similarity
- Support for stopwords and multiset frequency-aware similarity -
Integrated logging for initialization, scoring, and retrieval

## Testing

- Manual tests: initialized retriever, retrieved chunks with toy corpus
    - Edge cases: empty corpus, empty query, scorer/tokenizer errors 
    - Verified Jaccard similarity results for single/multiset cases 
    - Code formatted and linted


## Screenshots/Videos (if applicable)
<!-- Add screenshots or videos to help explain your changes -->

## Pre-submission Checklist
<!-- Please check all boxes that apply before submitting your PR -->
- [-] **I have tested my changes thoroughly before submitting this PR**
- [-] **This PR contains minimal changes necessary to address the
issue/feature**
- [-] My code follows the project's coding standards and style
guidelines
- [-] I have added tests that prove my fix is effective or that my
feature works
- [-] I have added necessary documentation (if applicable)
- [-] All new and existing tests pass
- [-] I have searched existing PRs to ensure this change hasn't been
submitted already
- [-] I have linked any relevant issues in the description
- [-] My commits have clear and descriptive messages

## Related Issues
<!-- Link any related issues using "Fixes #issue_number" or "Relates to
#issue_number" -->
Relates to  #1392
## Additional Notes
<!-- Add any additional notes, concerns, or context for reviewers -->
Int the cognee/modules/chunking/models/DocumentChunk.py
don't remove the optional  from is_part_of attributes.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Andrej Milicevic <milicevicandrej@yahoo.com>
Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com>
This commit is contained in:
Chaitany 2025-09-19 21:54:33 +05:30 committed by GitHub
parent 38c05ba71a
commit 96eb0d448a
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
10 changed files with 276 additions and 76 deletions

View file

@ -10,7 +10,7 @@ WORKFLOWS=(
"test_kuzu.yml"
"test_multimetric_qa_eval_run.yaml"
"test_graphrag_vs_rag_notebook.yml"
"test_gemini.yml"
"test_llms.yml"
"test_multimedia_example.yaml"
"test_deduplication.yml"
"test_eval_framework.yml"

View file

@ -1,29 +0,0 @@
name: test | gemini
on:
workflow_call:
jobs:
test-gemini:
name: Run Gemini Test
runs-on: ubuntu-22.04
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Cognee Setup
uses: ./.github/actions/cognee_setup
with:
python-version: '3.11.x'
- name: Run Gemini Simple Example
env:
LLM_PROVIDER: "gemini"
LLM_API_KEY: ${{ secrets.GEMINI_API_KEY }}
LLM_MODEL: "gemini/gemini-1.5-flash"
EMBEDDING_PROVIDER: "gemini"
EMBEDDING_API_KEY: ${{ secrets.GEMINI_API_KEY }}
EMBEDDING_MODEL: "gemini/text-embedding-004"
EMBEDDING_DIMENSIONS: "768"
EMBEDDING_MAX_TOKENS: "8076"
run: uv run python ./examples/python/simple_example.py

86
.github/workflows/test_llms.yml vendored Normal file
View file

@ -0,0 +1,86 @@
name: LLM Test Suites
permissions:
contents: read
on:
workflow_call:
env:
RUNTIME__LOG_LEVEL: ERROR
ENV: 'dev'
jobs:
test-gemini:
name: Run Gemini Test
runs-on: ubuntu-22.04
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Cognee Setup
uses: ./.github/actions/cognee_setup
with:
python-version: '3.11.x'
- name: Run Gemini Simple Example
env:
LLM_PROVIDER: "gemini"
LLM_API_KEY: ${{ secrets.GEMINI_API_KEY }}
LLM_MODEL: "gemini/gemini-1.5-flash"
EMBEDDING_PROVIDER: "gemini"
EMBEDDING_API_KEY: ${{ secrets.GEMINI_API_KEY }}
EMBEDDING_MODEL: "gemini/text-embedding-004"
EMBEDDING_DIMENSIONS: "768"
EMBEDDING_MAX_TOKENS: "8076"
run: uv run python ./examples/python/simple_example.py
test-fastembed:
name: Run Fastembed Test
runs-on: ubuntu-22.04
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Cognee Setup
uses: ./.github/actions/cognee_setup
with:
python-version: '3.11.x'
- name: Run Fastembed Simple Example
env:
LLM_PROVIDER: "openai"
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_MODEL: ${{ secrets.LLM_MODEL }}
LLM_ENDPOINT: ${{ secrets.LLM_ENDPOINT }}
LLM_API_VERSION: ${{ secrets.LLM_API_VERSION }}
EMBEDDING_PROVIDER: "fastembed"
EMBEDDING_MODEL: "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIMENSIONS: "384"
EMBEDDING_MAX_TOKENS: "256"
run: uv run python ./examples/python/simple_example.py
test-openrouter:
name: Run OpenRouter Test
runs-on: ubuntu-22.04
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Cognee Setup
uses: ./.github/actions/cognee_setup
with:
python-version: '3.11.x'
- name: Run OpenRouter Simple Example
env:
LLM_PROVIDER: "custom"
LLM_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
LLM_MODEL: "openrouter/x-ai/grok-code-fast-1"
LLM_ENDPOINT: "https://openrouter.ai/api/v1"
EMBEDDING_PROVIDER: "openai"
EMBEDDING_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EMBEDDING_MODEL: "openai/text-embedding-3-large"
EMBEDDING_DIMENSIONS: "3072"
EMBEDDING_MAX_TOKENS: "8191"
run: uv run python ./examples/python/simple_example.py

View file

@ -1,30 +0,0 @@
name: test | openrouter
on:
workflow_call:
jobs:
test-openrouter:
name: Run OpenRouter Test
runs-on: ubuntu-22.04
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Cognee Setup
uses: ./.github/actions/cognee_setup
with:
python-version: '3.11.x'
- name: Run OpenRouter Simple Example
env:
LLM_PROVIDER: "custom"
LLM_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
LLM_MODEL: "openrouter/x-ai/grok-code-fast-1"
LLM_ENDPOINT: "https://openrouter.ai/api/v1"
EMBEDDING_PROVIDER: "openai"
EMBEDDING_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EMBEDDING_MODEL: "openai/text-embedding-3-large"
EMBEDDING_DIMENSIONS: "3072"
EMBEDDING_MAX_TOKENS: "8191"
run: uv run python ./examples/python/simple_example.py

View file

@ -115,16 +115,10 @@ jobs:
secrets: inherit
# Additional LLM tests
gemini-tests:
name: Gemini Tests
needs: [basic-tests, e2e-tests]
uses: ./.github/workflows/test_gemini.yml
secrets: inherit
openrouter-tests:
name: OpenRouter Tests
needs: [basic-tests, e2e-tests]
uses: ./.github/workflows/test_openrouter.yml
llm-tests:
name: LLM Test Suite
needs: [ basic-tests, e2e-tests ]
uses: ./.github/workflows/test_llms.yml
secrets: inherit
# Ollama tests moved to the end
@ -138,8 +132,7 @@ jobs:
different-operating-systems-tests,
vector-db-tests,
example-tests,
gemini-tests,
openrouter-tests,
llm-tests,
mcp-test,
relational-db-migration-tests,
docker-compose-test,
@ -161,8 +154,7 @@ jobs:
example-tests,
db-examples-tests,
mcp-test,
gemini-tests,
openrouter-tests,
llm-tests,
ollama-tests,
relational-db-migration-tests,
docker-compose-test,
@ -183,8 +175,7 @@ jobs:
"${{ needs.example-tests.result }}" == "success" &&
"${{ needs.db-examples-tests.result }}" == "success" &&
"${{ needs.relational-db-migration-tests.result }}" == "success" &&
"${{ needs.gemini-tests.result }}" == "success" &&
"${{ needs.openrouter-tests.result }}" == "success" &&
"${{ needs.llm-tests.result }}" == "success" &&
"${{ needs.docker-compose-test.result }}" == "success" &&
"${{ needs.docker-ci-test.result }}" == "success" &&
"${{ needs.ollama-tests.result }}" == "success" ]]; then

View file

@ -82,6 +82,9 @@ async def search(
Best for: General-purpose queries or when you're unsure which search type is best.
Returns: The results from the automatically selected search type.
**CHUNKS_LEXICAL**:
Token-based lexical chunk search (e.g., Jaccard). Best for: exact-term matching, stopword-aware lookups.
Returns: Ranked text chunks (optionally with scores).
Args:
query_text: Your question or search query in natural language.

View file

@ -0,0 +1,56 @@
from cognee.modules.retrieval.lexical_retriever import LexicalRetriever
import re
from collections import Counter
from typing import Optional
class JaccardChunksRetriever(LexicalRetriever):
"""
Retriever that specializes LexicalRetriever to use Jaccard similarity.
"""
def __init__(self, top_k: int = 10, with_scores: bool = False,
stop_words: Optional[list[str]] = None, multiset_jaccard: bool = False):
"""
Parameters
----------
top_k : int
Number of top results to return.
with_scores : bool
If True, return (payload, score) pairs. Otherwise, only payloads.
stop_words : list[str], optional
List of tokens to filter out.
multiset_jaccard : bool
If True, use multiset Jaccard (frequency aware).
"""
self.stop_words = {t.lower() for t in stop_words} if stop_words else set()
self.multiset_jaccard = multiset_jaccard
super().__init__(
tokenizer=self._tokenizer,
scorer=self._scorer,
top_k=top_k,
with_scores=with_scores
)
def _tokenizer(self, text: str) -> list[str]:
"""
Tokenizer: lowercases, splits on word characters (w+), filters stopwords.
"""
tokens = re.findall(r"\w+", text.lower())
return [t for t in tokens if t not in self.stop_words]
def _scorer(self, query_tokens: list[str], chunk_tokens: list[str]) -> float:
"""
Jaccard similarity scorer.
- If multiset_jaccard=True, uses frequency-aware Jaccard.
- Otherwise, normal set Jaccard.
"""
if self.multiset_jaccard:
q_counts, c_counts = Counter(query_tokens), Counter(chunk_tokens)
numerator = sum(min(q_counts[t], c_counts[t]) for t in set(q_counts) | set(c_counts))
denominator = sum(max(q_counts[t], c_counts[t]) for t in set(q_counts) | set(c_counts))
return numerator / denominator if denominator else 0.0
else:
q_set, c_set = set(query_tokens), set(chunk_tokens)
if not q_set or not c_set:
return 0.0
return len(q_set & c_set) / len(q_set | c_set)

View file

@ -0,0 +1,117 @@
import asyncio
from typing import Any, Callable, Optional
from heapq import nlargest
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.modules.retrieval.base_retriever import BaseRetriever
from cognee.modules.retrieval.exceptions.exceptions import NoDataError
from cognee.shared.logging_utils import get_logger
logger = get_logger("LexicalRetriever")
class LexicalRetriever(BaseRetriever):
def __init__(self, tokenizer: Callable, scorer: Callable, top_k: int = 10, with_scores: bool = False):
if not callable(tokenizer) or not callable(scorer):
raise TypeError("tokenizer and scorer must be callables")
if not isinstance(top_k, int) or top_k <= 0:
raise ValueError("top_k must be a positive integer")
self.tokenizer = tokenizer
self.scorer = scorer
self.top_k = top_k
self.with_scores = bool(with_scores)
# Cache keyed by dataset context
self.chunks: dict[str, Any] = {} # {chunk_id: tokens}
self.payloads: dict[str, Any] = {} # {chunk_id: original_document}
self._initialized = False
self._init_lock = asyncio.Lock()
async def initialize(self):
"""Initialize retriever by reading all DocumentChunks from graph_engine."""
async with self._init_lock:
if self._initialized:
return
logger.info("Initializing LexicalRetriever by loading DocumentChunks from graph engine")
try:
graph_engine = await get_graph_engine()
nodes, _ = await graph_engine.get_filtered_graph_data([{"type": ["DocumentChunk"]}])
except Exception as e:
logger.error("Graph engine initialization failed")
raise NoDataError("Graph engine initialization failed") from e
chunk_count = 0
for node in nodes:
try:
chunk_id, document = node
except Exception:
logger.warning("Skipping node with unexpected shape: %r", node)
continue
if document.get("type") == "DocumentChunk" and document.get("text"):
try:
tokens = self.tokenizer(document["text"])
if not tokens:
continue
self.chunks[str(document.get("id",chunk_id))] = tokens
self.payloads[str(document.get("id",chunk_id))] = document
chunk_count += 1
except Exception as e:
logger.error("Tokenizer failed for chunk %s: %s", chunk_id, str(e))
if chunk_count == 0:
logger.error("Initialization completed but no valid chunks were loaded.")
raise NoDataError("No valid chunks loaded during initialization.")
self._initialized = True
logger.info("Initialized with %d document chunks", len(self.chunks))
async def get_context(self, query: str) -> Any:
"""Retrieves relevant chunks for the given query."""
if not self._initialized:
await self.initialize()
if not self.chunks:
logger.warning("No chunks available in retriever")
return []
try:
query_tokens = self.tokenizer(query)
except Exception as e:
logger.error("Failed to tokenize query: %s", str(e))
return []
if not query_tokens:
logger.warning("Query produced no tokens")
return []
results = []
for chunk_id, chunk_tokens in self.chunks.items():
try:
score = self.scorer(query_tokens, chunk_tokens)
if not isinstance(score, (int, float)):
logger.warning("Non-numeric score for chunk %s → treated as 0.0", chunk_id)
score = 0.0
except Exception as e:
logger.error("Scorer failed for chunk %s: %s", chunk_id, str(e))
score = 0.0
results.append((chunk_id, score))
top_results = nlargest(self.top_k, results, key=lambda x: x[1])
logger.info("Retrieved %d/%d chunks for query (len=%d)", len(top_results), len(results), len(query_tokens))
if self.with_scores:
return [(self.payloads[chunk_id], score) for chunk_id, score in top_results]
else:
return [self.payloads[chunk_id] for chunk_id, _ in top_results]
async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
"""Returns context for the given query (retrieves if not provided)."""
if context is None:
context = await self.get_context(query)
return context

View file

@ -15,6 +15,7 @@ from cognee.modules.retrieval.completion_retriever import CompletionRetriever
from cognee.modules.retrieval.graph_completion_retriever import GraphCompletionRetriever
from cognee.modules.retrieval.temporal_retriever import TemporalRetriever
from cognee.modules.retrieval.coding_rules_retriever import CodingRulesRetriever
from cognee.modules.retrieval.jaccard_retrival import JaccardChunksRetriever
from cognee.modules.retrieval.graph_summary_completion_retriever import (
GraphSummaryCompletionRetriever,
)
@ -152,6 +153,10 @@ async def get_search_type_tools(
TemporalRetriever(top_k=top_k).get_completion,
TemporalRetriever(top_k=top_k).get_context,
],
SearchType.CHUNKS_LEXICAL: (lambda _r=JaccardChunksRetriever(top_k=top_k): [
_r.get_completion,
_r.get_context,
])(),
SearchType.CODING_RULES: [
CodingRulesRetriever(rules_nodeset_name=node_name).get_existing_rules,
],

View file

@ -17,3 +17,4 @@ class SearchType(Enum):
FEEDBACK = "FEEDBACK"
TEMPORAL = "TEMPORAL"
CODING_RULES = "CODING_RULES"
CHUNKS_LEXICAL = "CHUNKS_LEXICAL"