Implement automatic entity resolution to prevent duplicate nodes in the knowledge graph.

The system uses a 3-layer approach:
1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
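
The first two layers and the numeric suffix guard described above can be sketched as follows. This is a minimal illustration, not the code in `lightrag/entity_resolution/`: the names `numeric_suffix`, `sketch_similarity`, and `sketch_resolve` are hypothetical, and the standard library's `difflib.SequenceMatcher` is an assumed stand-in for whatever similarity measure the module actually uses.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Mirrors ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85 from env.example.
FUZZY_THRESHOLD = 0.85

# Matches a run of digits at the very end of a name, e.g. the "13" in "IL-13".
_TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def numeric_suffix(name: str) -> Optional[str]:
    """Return the trailing digits of a name, or None if it has none (hypothetical helper)."""
    match = _TRAILING_NUMBER.search(name)
    return match.group(1) if match else None


def sketch_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (assumed metric: difflib)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def sketch_resolve(name: str, known: Iterable[str]) -> Optional[str]:
    """Return the existing entity that `name` should merge into, or None."""
    known = list(known)

    # Layer 1: case-insensitive exact match.
    folded = name.casefold()
    for candidate in known:
        if candidate.casefold() == folded:
            return candidate

    # Layer 2: fuzzy match, skipping candidates whose numeric suffix differs.
    best, best_score = None, 0.0
    for candidate in known:
        if numeric_suffix(candidate) != numeric_suffix(name):
            continue  # e.g. "Interleukin-4" must never merge into "Interleukin-13"
        score = sketch_similarity(name, candidate)
        if score > best_score:
            best, best_score = candidate, score

    # Layer 3 (vector similarity + LLM verification) is out of scope for this sketch.
    return best if best_score > FUZZY_THRESHOLD else None
```

Without the suffix check, "Interleukin-4" and "Interleukin-13" would score above 0.85 under this metric and be merged, which is the kind of false positive the commit describes fixing.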
29 lines
670 B
Python
"""
Entity Resolution Module for LightRAG

Provides automatic entity deduplication using a 3-layer approach:
1. Case normalization (exact match)
2. Fuzzy string matching (typos)
3. Vector similarity + LLM verification (semantic matches)
"""

from .resolver import (
    resolve_entity,
    resolve_entity_with_vdb,
    ResolutionResult,
    get_cached_alias,
    store_alias,
    fuzzy_similarity,
)
from .config import EntityResolutionConfig, DEFAULT_CONFIG

__all__ = [
    "resolve_entity",
    "resolve_entity_with_vdb",
    "ResolutionResult",
    "EntityResolutionConfig",
    "DEFAULT_CONFIG",
    "get_cached_alias",
    "store_alias",
    "fuzzy_similarity",
]
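
The commit message states that thresholds are configurable through environment variables. One way such a configuration object could be populated is sketched below. The class `SketchResolutionConfig`, its field names, and the `from_env` method are hypothetical and are not the fields of the real `EntityResolutionConfig`; only the variable names and default values are taken from env.example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SketchResolutionConfig:
    """Hypothetical stand-in for a resolution config; defaults follow env.example."""

    enabled: bool = True
    fuzzy_threshold: float = 0.85
    vector_threshold: float = 0.5
    max_candidates: int = 3

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "SketchResolutionConfig":
        """Build a config from a mapping of environment variables (os.environ by default)."""
        env = os.environ if env is None else env
        enabled_raw = env.get("ENTITY_RESOLUTION_ENABLED", "true")
        return cls(
            # Treat common truthy spellings as enabled; anything else disables resolution.
            enabled=enabled_raw.strip().lower() in {"1", "true", "yes"},
            fuzzy_threshold=float(env.get("ENTITY_RESOLUTION_FUZZY_THRESHOLD", "0.85")),
            vector_threshold=float(env.get("ENTITY_RESOLUTION_VECTOR_THRESHOLD", "0.5")),
            max_candidates=int(env.get("ENTITY_RESOLUTION_MAX_CANDIDATES", "3")),
        )
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the sketch easy to exercise in unit tests, for example `SketchResolutionConfig.from_env({"ENTITY_RESOLUTION_FUZZY_THRESHOLD": "0.9"})`.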