Implement automatic entity resolution to prevent duplicate nodes in the knowledge graph.

The system uses a 3-layer approach:
1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
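The first two layers and the numeric suffix guard can be sketched roughly as follows. This is an illustration only, not the code in lightrag/entity_resolution/: the helper names (`fuzzy_ratio`, `suffixes_conflict`, `resolve_locally`) are hypothetical, and `difflib.SequenceMatcher` is assumed as the similarity measure because it reproduces the scores quoted in the config comments (Dupixant/Dupixent at about 0.88). Layer 3 (vector search plus LLM verification) is left out; the sketch returns `None` where it would take over.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Sequence

FUZZY_THRESHOLD = 0.85  # mirrors ENTITY_RESOLUTION_FUZZY_THRESHOLD

_TRAILING_NUMBER = re.compile(r"(\d+)$")


def fuzzy_ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suffixes_conflict(a: str, b: str) -> bool:
    """True when both names end in a number and the numbers differ."""
    match_a = _TRAILING_NUMBER.search(a.strip())
    match_b = _TRAILING_NUMBER.search(b.strip())
    return bool(match_a and match_b and match_a.group(1) != match_b.group(1))


def resolve_locally(
    name: str,
    known_entities: Sequence[str],
    threshold: float = FUZZY_THRESHOLD,
) -> Optional[str]:
    """Return the known entity that `name` resolves to, or None."""
    # Layer 1: case-insensitive exact match.
    key = name.strip().lower()
    for candidate in known_entities:
        if candidate.strip().lower() == key:
            return candidate

    # Layer 2: best fuzzy match above the threshold, skipping candidates
    # whose trailing number differs from the query's.
    best_match, best_score = None, 0.0
    for candidate in known_entities:
        if suffixes_conflict(name, candidate):
            continue
        score = fuzzy_ratio(name, candidate)
        if score > best_score:
            best_match, best_score = candidate, score
    if best_match is not None and best_score > threshold:
        return best_match

    # Layer 3 (vector similarity + LLM verification) would run here.
    return None
```

Under this metric the guard matters mostly for long names: "Interleukin-4" and "Interleukin-13" differ only in the trailing number and score above 0.85 on string similarity alone, so without the suffix check they would be merged.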
57 lines
2.1 KiB
Python
"""Configuration for Entity Resolution

Uses the same LLM that LightRAG is configured with - no separate model config needed.
"""

from dataclasses import dataclass, field


@dataclass
class EntityResolutionConfig:
    """Configuration for the entity resolution system."""

    # Whether entity resolution is enabled
    enabled: bool = True

    # Fuzzy pre-resolution: Enable/disable within-batch fuzzy matching before
    # VDB lookup. When enabled, entities in the same batch are matched by string
    # similarity alone. Set to False to skip fuzzy pre-resolution entirely (only
    # exact case-insensitive matches will be accepted within batch; all other
    # resolution goes to VDB/LLM). Disabling reduces false positives but may
    # miss obvious typo corrections.
    fuzzy_pre_resolution_enabled: bool = True

    # Fuzzy string matching threshold (0-1)
    # Above this = auto-match (catches typos like Dupixant/Dupixent at 0.88)
    # Below this = continue to vector search
    # Tuning advice:
    #   0.90+ = Very conservative, near-identical strings only
    #           (would no longer auto-match Dupixent/Dupixant at 0.88)
    #   0.85  = Balanced default, catches typos, avoids most false positives
    #   0.80  = Aggressive, may merge distinct entities with similar names
    #   <0.75 = Not recommended, high false positive risk (Celebrex/Cerebyx=0.67)
    # Test with your domain data; pharmaceutical names need higher thresholds.
    fuzzy_threshold: float = 0.85

    # Vector similarity threshold for finding candidates
    # Low threshold = cast wide net, LLM will verify
    # 0.5 catches FDA/US Food and Drug Administration at 0.67
    vector_threshold: float = 0.5

    # Maximum number of vector candidates to verify with LLM
    # Limits cost - uses same LLM as LightRAG main config
    max_candidates: int = 3

    # LLM verification prompt template
    llm_prompt_template: str = field(
        default="""Are these two terms referring to the same entity?
Consider typos, misspellings, abbreviations, or alternate names.

Term A: {term_a}
Term B: {term_b}

Answer only YES or NO.""",
    )


# Default configuration
DEFAULT_CONFIG = EntityResolutionConfig()
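The commit message lists four environment variables in env.example, but the file shown above only defines defaults and does not show how those variables are read. One possible way to wire them up is sketched below. `ResolutionSettings` is a hypothetical stand-in for `EntityResolutionConfig` so that the snippet is self-contained, and `settings_from_env` and `_env_bool` are illustrative names, not part of the module.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ResolutionSettings:
    """Hypothetical stand-in mirroring the numeric fields of the config."""

    enabled: bool = True
    fuzzy_threshold: float = 0.85
    vector_threshold: float = 0.5
    max_candidates: int = 3


def _env_bool(env: Mapping[str, str], key: str, default: bool) -> bool:
    """Interpret common truthy spellings; fall back to the default if unset."""
    raw = env.get(key)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def settings_from_env(env: Optional[Mapping[str, str]] = None) -> ResolutionSettings:
    """Build settings from environment variables, keeping defaults for unset keys."""
    env = os.environ if env is None else env
    defaults = ResolutionSettings()
    return ResolutionSettings(
        enabled=_env_bool(env, "ENTITY_RESOLUTION_ENABLED", defaults.enabled),
        fuzzy_threshold=float(
            env.get("ENTITY_RESOLUTION_FUZZY_THRESHOLD", defaults.fuzzy_threshold)
        ),
        vector_threshold=float(
            env.get("ENTITY_RESOLUTION_VECTOR_THRESHOLD", defaults.vector_threshold)
        ),
        max_candidates=int(
            env.get("ENTITY_RESOLUTION_MAX_CANDIDATES", defaults.max_candidates)
        ),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `settings_from_env({"ENTITY_RESOLUTION_FUZZY_THRESHOLD": "0.9"})` for a stricter pharmaceutical-name setup.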