Implement automatic entity resolution to prevent duplicate nodes in the
knowledge graph.

The system uses a 3-layer approach:
1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching, >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
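
Layers 1 and 2 and the numeric suffix guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in lightrag/entity_resolution/: the names `resolve` and `numeric_suffix` are hypothetical, `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the module actually uses, and the rule "block only when both names end in different numbers" is a guess at how the suffix check behaves. Layer 3 (vector similarity + LLM verification) is left out.

```python
import os
import re
from difflib import SequenceMatcher

# Default mirrors ENTITY_RESOLUTION_FUZZY_THRESHOLD from env.example.
FUZZY_THRESHOLD = float(os.environ.get("ENTITY_RESOLUTION_FUZZY_THRESHOLD", "0.85"))


def numeric_suffix(name):
    """Return the trailing number of a name such as 'IL-4', or None (hypothetical helper)."""
    match = re.search(r"(\d+)$", name.strip())
    return int(match.group(1)) if match else None


def resolve(name, existing, threshold=FUZZY_THRESHOLD):
    """Return the existing entity that `name` duplicates, or None (hypothetical helper)."""
    key = name.strip().casefold()

    # Layer 1: case-insensitive exact match.
    for candidate in existing:
        if candidate.strip().casefold() == key:
            return candidate

    # Layer 2: fuzzy match above the threshold, guarded by the suffix check.
    suffix = numeric_suffix(name)
    best, best_score = None, 0.0
    for candidate in existing:
        other = numeric_suffix(candidate)
        # Assumed rule: names ending in different numbers never match.
        if suffix is not None and other is not None and suffix != other:
            continue
        score = SequenceMatcher(None, key, candidate.strip().casefold()).ratio()
        if score > threshold and score > best_score:
            best, best_score = candidate, score

    # None means no match in layers 1 and 2; a full system would go on to layer 3.
    return best
```

With this sketch, `resolve("Interleukin-13", ["Interleukin-4"])` returns None even though the two strings are about 89% similar, while `resolve("tumor necrosis factor", ["Tumour necrosis factor"])` returns the existing spelling.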
# Start from pgvector image (has vector extension pre-built correctly)
FROM pgvector/pgvector:pg17

# Install build dependencies for AGE
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    postgresql-server-dev-17 \
    libreadline-dev \
    zlib1g-dev \
    flex \
    bison \
    && rm -rf /var/lib/apt/lists/*

# Install Apache AGE 1.6.0 for PG17
RUN cd /tmp \
    && git clone --branch release/PG17/1.6.0 https://github.com/apache/age.git \
    && cd age \
    && make \
    && make install \
    && rm -rf /tmp/age

# Add initialization script to create extensions
RUN echo "CREATE EXTENSION IF NOT EXISTS vector;" > /docker-entrypoint-initdb.d/01-vector.sql \
    && echo "CREATE EXTENSION IF NOT EXISTS age;" > /docker-entrypoint-initdb.d/02-age.sql \
    && echo "SET search_path = ag_catalog, public;" >> /docker-entrypoint-initdb.d/02-age.sql