Merge branch 'HKUDS:main' into main

Ken Chen 2025-06-24 20:20:54 +08:00 committed by GitHub
commit 12054fa8d9
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
22 changed files with 1514 additions and 3058 deletions


@@ -4,7 +4,7 @@
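## 🎉 News
- [X] [2025.06.05]🎯📢LightRAG now integrates MinerU, supporting multimodal document parsing and RAG (PDFs, images, Office documents, tables, formulas, etc.). See the [multimodal processing module](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#multimodal-document-processing-mineru-integration) below.
- [X] [2025.06.05]🎯📢LightRAG now integrates RAG-Anything, supporting comprehensive multimodal document parsing and RAG capabilities (PDFs, images, Office documents, tables, formulas, etc.). See the [multimodal processing module](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#多模态文档处理rag-anything集成) below.
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality.
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which simplifies RAG with small models.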
@@ -932,6 +932,94 @@ rag.insert_custom_kg(custom_kg)
</details>
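## Delete Functions
LightRAG provides comprehensive deletion capabilities, allowing you to delete documents, entities, and relationships.
<details>
<summary> <b>Delete Entities</b> </summary>
You can delete an entity, together with all of its associated relationships, by entity name:
```python
# Delete an entity and all of its relationships (synchronous version)
rag.delete_by_entity("Google")
# Asynchronous version
await rag.adelete_by_entity("Google")
```
Deleting an entity:
- Removes the entity node from the knowledge graph
- Deletes all relationships associated with the entity
- Removes the related embedding vectors from the vector database
- Maintains knowledge graph integrity
</details>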
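<details>
<summary> <b>Delete Relations</b> </summary>
You can delete the relationship between two specific entities:
```python
# Delete the relationship between two entities (synchronous version)
rag.delete_by_relation("Google", "Gmail")
# Asynchronous version
await rag.adelete_by_relation("Google", "Gmail")
```
Deleting a relationship:
- Removes the specified relationship edge
- Deletes the relationship's embedding vector from the vector database
- Preserves both entity nodes and their other relationships
</details>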
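<details>
<summary> <b>Delete by Document ID</b> </summary>
You can delete an entire document, together with all knowledge derived from it, by its document ID:
```python
# Delete by document ID (asynchronous version)
await rag.adelete_by_doc_id("doc-12345")
```
Optimizations applied when deleting by document ID:
- **Smart Cleanup**: Automatically identifies and removes entities and relationships that belong only to this document
- **Preserve Shared Knowledge**: Entities and relationships that also exist in other documents are preserved and their descriptions are rebuilt
- **Cache Optimization**: Clears the related LLM cache to reduce storage overhead
- **Incremental Rebuilding**: Rebuilds the affected entity and relationship descriptions from the remaining documents
The deletion process includes:
1. Delete all text chunks related to the document
2. Identify and delete entities and relationships that belong only to this document
3. Rebuild entities and relationships that still exist in other documents
4. Update all related vector indexes
5. Clean up document status records
Note: Deletion by document ID is an asynchronous operation because it involves a complex knowledge graph reconstruction process.
</details>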
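<details>
<summary> <b>Deletion Notes</b> </summary>
**Important Reminders:**
1. **Irreversible Operations**: All deletion operations are irreversible; use them with caution
2. **Performance Considerations**: Deleting large amounts of data may take some time, especially when deleting by document ID
3. **Data Consistency**: Deletion operations automatically maintain consistency between the knowledge graph and the vector database
4. **Backup Recommendations**: Back up your data before performing important deletion operations
**Batch Deletion Recommendations:**
- For batch deletions, use the asynchronous methods for better performance
- For large-scale deletions, consider processing in batches to avoid excessive system load
</details>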
## Entity Merging
<details>
@@ -1003,31 +1091,59 @@ rag.merge_entities(
</details>
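## Multimodal Document Processing (MinerU Integration)
## Multimodal Document Processing (RAG-Anything Integration)
LightRAG now supports multimodal document parsing and retrieval-augmented generation (RAG) through [MinerU](https://github.com/opendatalab/MinerU). You can extract structured content (text, images, tables, formulas, etc.) from PDFs, images, and Office documents and use it in your RAG pipeline.
LightRAG now integrates seamlessly with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an **All-in-One Multimodal Document Processing RAG system** built specifically for LightRAG. RAG-Anything provides advanced parsing and retrieval-augmented generation (RAG) capabilities, letting you process multimodal documents and extract structured content (text, images, tables, and formulas) from a variety of document formats for integration into your RAG pipeline.
**Key Features:**
- Parses PDFs, images, DOC/DOCX/PPT/PPTX, and other formats
- Extracts and indexes text, images, tables, formulas, and document structure
- Queries and retrieves multimodal content (text, images, tables, formulas) within RAG
- Integrates seamlessly with LightRAG Core and RAGAnything
- **End-to-End Multimodal Pipeline**: A complete workflow from document ingestion and parsing to intelligent multimodal question answering
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and other file formats
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical equations, and heterogeneous content types
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
- **Hybrid Intelligent Retrieval**: Advanced search across textual and multimodal content with contextual understanding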
**Quick Start:**
1. Install dependencies:
1. Install RAG-Anything:
```bash
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
pip install raganything
```
2. Download the MinerU model weights (see the [MinerU Integration Guide](docs/mineru_integration_zh.md))
3. Process files with the new `MineruParser` or RAGAnything's `process_document_complete`:
2. Process multimodal documents:
```python
from lightrag.mineru_parser import MineruParser
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
# Or auto-detect the file type:
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
```
4. To query multimodal content with LightRAG, see [docs/mineru_integration_zh.md](docs/mineru_integration_zh.md).
```python
import asyncio
from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed

async def main():
    # Initialize RAGAnything with LightRAG integration
    rag = RAGAnything(
        working_dir="./rag_storage",
        llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
            "gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
        ),
        embedding_func=lambda texts: openai_embed(
            texts, model="text-embedding-3-large", api_key="your-api-key"
        ),
        embedding_dim=3072,
    )

    # Process multimodal documents
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output"
    )

    # Query multimodal content
    result = await rag.query_with_multimodal(
        "What are the main findings shown in the figures and tables?",
        mode="hybrid"
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```
For detailed documentation and advanced usage, see the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
## Token Usage Tracking

README.md

@@ -39,7 +39,8 @@
</div>
## 🎉 News
- [X] [2025.06.05]🎯📢LightRAG now supports multi-modal data handling through MinerU integration, enabling comprehensive document parsing and RAG capabilities across diverse formats including PDFs, images, Office documents, tables, and formulas. Please refer to the new [multimodal section](https://github.com/HKUDS/LightRAG/?tab=readme-ov-file#multimodal-document-processing-mineru-integration) for details.
- [X] [2025.06.16]🎯📢Our team has released [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an All-in-One Multimodal RAG System for seamless text, image, table, and equation processing.
- [X] [2025.06.05]🎯📢LightRAG now supports comprehensive multimodal data handling through [RAG-Anything](https://github.com/HKUDS/RAG-Anything) integration, enabling seamless document parsing and RAG capabilities across diverse formats including PDFs, images, Office documents, tables, and formulas. Please refer to the new [multimodal section](https://github.com/HKUDS/LightRAG/?tab=readme-ov-file#multimodal-document-processing-rag-anything-integration) for details.
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which makes RAG simpler with small models.
@@ -191,7 +192,7 @@ async def main():
rag.insert("Your text")
# Perform hybrid search
mode="hybrid"
mode = "hybrid"
print(
await rag.query(
"What are the top themes in this story?",
@@ -987,6 +988,89 @@ These operations maintain data consistency across both the graph database and ve
</details>
## Delete Functions
LightRAG provides comprehensive deletion capabilities, allowing you to delete documents, entities, and relationships.
<details>
<summary> <b>Delete Entities</b> </summary>
You can delete entities by their name along with all associated relationships:
```python
# Delete entity and all its relationships (synchronous version)
rag.delete_by_entity("Google")
# Asynchronous version
await rag.adelete_by_entity("Google")
```
Deleting an entity:
- Removes the entity node from the knowledge graph
- Deletes all associated relationships
- Removes related embedding vectors from the vector database
- Maintains knowledge graph integrity
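
The effects listed above can be illustrated with a small in-memory sketch. This is not LightRAG's implementation: `ToyKnowledgeStore`, its fields, and the `ent:`/`rel:` key scheme are hypothetical stand-ins for the graph and vector storages, used only to show why removing a node also has to remove its edges and embeddings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKnowledgeStore:
    """Hypothetical stand-in for a graph store plus a vector store."""

    nodes: set[str] = field(default_factory=set)
    edges: set[tuple[str, str]] = field(default_factory=set)
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def delete_entity(self, name: str) -> None:
        # Collect every edge that touches the entity.
        doomed = {edge for edge in self.edges if name in edge}
        self.edges -= doomed
        # Drop the embeddings stored for the entity and for each removed edge.
        self.vectors.pop(f"ent:{name}", None)
        for src, tgt in doomed:
            self.vectors.pop(f"rel:{src}->{tgt}", None)
        # Finally remove the node itself, so no dangling edges remain.
        self.nodes.discard(name)


store = ToyKnowledgeStore(
    nodes={"Google", "Gmail", "Alphabet"},
    edges={("Google", "Gmail"), ("Alphabet", "Google")},
    vectors={
        "ent:Google": [0.1],
        "ent:Gmail": [0.2],
        "ent:Alphabet": [0.3],
        "rel:Google->Gmail": [0.4],
        "rel:Alphabet->Google": [0.5],
    },
)
store.delete_entity("Google")
print(sorted(store.nodes), sorted(store.vectors))
```

After the call, only the two unrelated nodes and their embeddings remain, which is the "integrity" property the list refers to.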
</details>
<details>
<summary> <b>Delete Relations</b> </summary>
You can delete relationships between two specific entities:
```python
# Delete relationship between two entities (synchronous version)
rag.delete_by_relation("Google", "Gmail")
# Asynchronous version
await rag.adelete_by_relation("Google", "Gmail")
```
Deleting a relationship:
- Removes the specified relationship edge
- Deletes the relationship's embedding vector from the vector database
- Preserves both entity nodes and their other relationships
</details>
<details>
<summary> <b>Delete by Document ID</b> </summary>
You can delete an entire document, together with all knowledge derived from it, by its document ID:
```python
# Delete by document ID (asynchronous version)
await rag.adelete_by_doc_id("doc-12345")
```
Optimized processing when deleting by document ID:
- **Smart Cleanup**: Automatically identifies and removes entities and relationships that belong only to this document
- **Preserve Shared Knowledge**: If entities or relationships exist in other documents, they are preserved and their descriptions are rebuilt
- **Cache Optimization**: Clears related LLM cache to reduce storage overhead
- **Incremental Rebuilding**: Reconstructs affected entity and relationship descriptions from remaining documents
The deletion process includes:
1. Delete all text chunks related to the document
2. Identify and delete entities and relationships that belong only to this document
3. Rebuild entities and relationships that still exist in other documents
4. Update all related vector indexes
5. Clean up document status records
Note: Deletion by document ID is an asynchronous operation as it involves complex knowledge graph reconstruction processes.
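
The "delete what is exclusive, rebuild what is shared" decision in steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the library's code: `plan_doc_deletion` and the `sources` mapping (entity or relation key to the set of document IDs that mention it) are assumptions made for the example.

```python
def plan_doc_deletion(
    sources: dict[str, set[str]], doc_id: str
) -> tuple[set[str], set[str]]:
    """Split the items touched by `doc_id` into (to_delete, to_rebuild).

    `sources` is a hypothetical bookkeeping structure that maps an entity
    or relation key to the IDs of the documents that mention it.
    """
    to_delete: set[str] = set()
    to_rebuild: set[str] = set()
    for key, doc_ids in sources.items():
        if doc_id not in doc_ids:
            continue  # not affected by this deletion
        if doc_ids == {doc_id}:
            to_delete.add(key)  # belongs only to the deleted document
        else:
            to_rebuild.add(key)  # shared: keep it and rebuild its description
    return to_delete, to_rebuild


sources = {
    "Google": {"doc-12345", "doc-67890"},
    "Gmail": {"doc-12345"},
    "Google->Gmail": {"doc-12345"},
    "Android": {"doc-67890"},
}
to_delete, to_rebuild = plan_doc_deletion(sources, "doc-12345")
print(sorted(to_delete), sorted(to_rebuild))
```

In this toy data, `Gmail` and the `Google->Gmail` relation are removed, `Google` is kept for rebuilding, and `Android` is untouched.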
</details>
**Important Reminders:**
1. **Irreversible Operations**: All deletion operations are irreversible; use them with caution
2. **Performance Considerations**: Deleting large amounts of data may take some time, especially deletion by document ID
3. **Data Consistency**: Deletion operations automatically maintain consistency between the knowledge graph and vector database
4. **Backup Recommendations**: Consider backing up data before performing important deletion operations
**Batch Deletion Recommendations:**
- For batch deletion operations, consider using asynchronous methods for better performance
- For large-scale deletions, consider processing in batches to avoid excessive system load
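
One way to combine both recommendations is to await deletions in fixed-size batches. The sketch below is an assumption-laden illustration: `delete_in_batches` is a hypothetical helper, and `fake_delete` is a placeholder for a real coroutine such as `rag.adelete_by_doc_id`. Whether concurrent deletions are safe for a given storage backend is not stated here, so `batch_size=1` is the conservative choice when in doubt.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def delete_in_batches(
    delete_one: Callable[[str], Awaitable[None]],
    doc_ids: Sequence[str],
    batch_size: int = 10,
) -> int:
    """Await `delete_one` for every ID, at most `batch_size` at a time."""
    done = 0
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start : start + batch_size]
        # Run one batch concurrently, then move on to the next.
        await asyncio.gather(*(delete_one(doc_id) for doc_id in batch))
        done += len(batch)
    return done


async def fake_delete(doc_id: str) -> None:
    # Placeholder for a real call such as `rag.adelete_by_doc_id(doc_id)`.
    await asyncio.sleep(0)


deleted = asyncio.run(
    delete_in_batches(fake_delete, [f"doc-{i}" for i in range(25)], batch_size=10)
)
print(deleted)
```

Bounding the batch size keeps the number of in-flight deletions predictable, which is the point of the "avoid excessive system load" advice.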
## Entity Merging
<details>
@@ -1058,31 +1142,59 @@ When merging entities:
</details>
## Multimodal Document Processing (MinerU Integration)
## Multimodal Document Processing (RAG-Anything Integration)
LightRAG now supports comprehensive multi-modal document processing through [MinerU](https://github.com/opendatalab/MinerU) integration, enabling advanced parsing and retrieval-augmented generation (RAG) capabilities. This powerful feature allows you to handle multi-modal documents seamlessly, extracting structured content—including text, images, tables, and formulas—from various document formats for integration into your RAG pipeline.
LightRAG now seamlessly integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), a comprehensive **All-in-One Multimodal Document Processing RAG system** built specifically for LightRAG. RAG-Anything enables advanced parsing and retrieval-augmented generation (RAG) capabilities, allowing you to handle multimodal documents seamlessly and extract structured content—including text, images, tables, and formulas—from various document formats for integration into your RAG pipeline.
**Key Features:**
- **Multimodal Document Handling**: Process complex documents containing mixed content types (text, images, tables, formulas)
- **Comprehensive Format Support**: Parse PDFs, images, DOC/DOCX/PPT/PPTX, and additional file types
- **Multi-Element Extraction**: Extract and index text, images, tables, formulas, and document structure
- **Multimodal Retrieval**: Query and retrieve diverse content types (text, images, tables, formulas) within RAG workflows
- **Seamless Integration**: Works smoothly with LightRAG core and RAG-Anything frameworks
- **End-to-End Multimodal Pipeline**: Complete workflow from document ingestion and parsing to intelligent multimodal query answering
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and diverse file formats
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical equations, and heterogeneous content types
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
- **Hybrid Intelligent Retrieval**: Advanced search capabilities spanning textual and multimodal content with contextual understanding
**Quick Start:**
1. Install dependencies:
1. Install RAG-Anything:
```bash
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
pip install raganything
```
2. Download MinerU model weights (refer to [MinerU Integration Guide](docs/mineru_integration_en.md))
3. Process multi-modal documents using the new MineruParser or RAG-Anything's process_document_complete:
2. Process multimodal documents:
```python
from lightrag.mineru_parser import MineruParser
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
# or for any file type:
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
import asyncio
from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed

async def main():
    # Initialize RAGAnything with LightRAG integration
    rag = RAGAnything(
        working_dir="./rag_storage",
        llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
            "gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
        ),
        embedding_func=lambda texts: openai_embed(
            texts, model="text-embedding-3-large", api_key="your-api-key"
        ),
        embedding_dim=3072,
    )

    # Process multimodal documents
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output"
    )

    # Query multimodal content
    result = await rag.query_with_multimodal(
        "What are the main findings shown in the figures and tables?",
        mode="hybrid"
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```
4. To query multimodal content with LightRAG, refer to [docs/mineru_integration_en.md](docs/mineru_integration_en.md).
For detailed documentation and advanced usage, please refer to the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
## Token Usage Tracking
@@ -1225,6 +1337,33 @@ Valid modes are:
</details>
## Troubleshooting
### Common Initialization Errors
If you encounter these errors when using LightRAG:
1. **`AttributeError: __aenter__`**
- **Cause**: Storage backends not initialized
- **Solution**: Call `await rag.initialize_storages()` after creating the LightRAG instance
2. **`KeyError: 'history_messages'`**
- **Cause**: Pipeline status not initialized
- **Solution**: Call `await initialize_pipeline_status()` after initializing storages
3. **Both errors in sequence**
- **Cause**: Neither initialization method was called
- **Solution**: Always follow this pattern:
```python
rag = LightRAG(...)
await rag.initialize_storages()
await initialize_pipeline_status()
```
### Model Switching Issues
When switching between different embedding models, you must clear the data directory to avoid errors. The only file you may want to preserve is `kv_store_llm_response_cache.json` if you wish to retain the LLM cache.
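
A minimal sketch of that cleanup, assuming the data directory is a plain folder on disk: `clear_working_dir` is a hypothetical helper, and the file names written in the demo (other than the cache file named above) are placeholders rather than LightRAG's actual storage files.

```python
import shutil
import tempfile
from pathlib import Path

CACHE_FILE = "kv_store_llm_response_cache.json"


def clear_working_dir(working_dir: str, keep: str = CACHE_FILE) -> list[str]:
    """Delete everything in `working_dir` except the file named `keep`."""
    removed = []
    for entry in Path(working_dir).iterdir():
        if entry.name == keep:
            continue  # preserve the LLM response cache
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)


# Demo on a throwaway directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    for name in (CACHE_FILE, "placeholder_vectors.json", "placeholder_graph.xml"):
        (Path(tmp) / name).write_text("{}")
    print(clear_working_dir(tmp))
    print([p.name for p in Path(tmp).iterdir()])
```

Point the helper at the same path you pass as `working_dir` only after confirming that nothing else in that folder needs to survive.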
## LightRAG API
The LightRAG Server is designed to provide Web UI and API support. **For more information about LightRAG Server, please refer to [LightRAG Server](./lightrag/api/README.md).**
@@ -1490,7 +1629,47 @@ def extract_queries(file_path):
</details>
## Star History
## 🔗 Related Projects
*Ecosystem & Extensions*
<div align="center">
<table>
<tr>
<td align="center">
<a href="https://github.com/HKUDS/RAG-Anything">
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
<span style="font-size: 32px;">📸</span>
</div>
<b>RAG-Anything</b><br>
<sub>Multimodal RAG</sub>
</a>
</td>
<td align="center">
<a href="https://github.com/HKUDS/VideoRAG">
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
<span style="font-size: 32px;">🎥</span>
</div>
<b>VideoRAG</b><br>
<sub>Extreme Long-Context Video RAG</sub>
</a>
</td>
<td align="center">
<a href="https://github.com/HKUDS/MiniRAG">
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
<span style="font-size: 32px;"></span>
</div>
<b>MiniRAG</b><br>
<sub>Extremely Simple RAG</sub>
</a>
</td>
</tr>
</table>
</div>
---
## ⭐ Star History
<a href="https://star-history.com/#HKUDS/LightRAG&Date">
<picture>
@@ -1500,42 +1679,22 @@ def extract_queries(file_path):
</picture>
</a>
## Contribution
## 🤝 Contribution
Thank you to all our contributors!
<div align="center">
We thank all our contributors for their valuable contributions.
</div>
<a href="https://github.com/HKUDS/LightRAG/graphs/contributors">
<img src="https://contrib.rocks/image?repo=HKUDS/LightRAG" />
</a>
<div align="center">
<a href="https://github.com/HKUDS/LightRAG/graphs/contributors">
<img src="https://contrib.rocks/image?repo=HKUDS/LightRAG" style="border-radius: 15px; box-shadow: 0 0 20px rgba(0, 217, 255, 0.3);" />
</a>
</div>
## Troubleshooting
---
### Common Initialization Errors
If you encounter these errors when using LightRAG:
1. **`AttributeError: __aenter__`**
- **Cause**: Storage backends not initialized
- **Solution**: Call `await rag.initialize_storages()` after creating the LightRAG instance
2. **`KeyError: 'history_messages'`**
- **Cause**: Pipeline status not initialized
- **Solution**: Call `await initialize_pipeline_status()` after initializing storages
3. **Both errors in sequence**
- **Cause**: Neither initialization method was called
- **Solution**: Always follow this pattern:
```python
rag = LightRAG(...)
await rag.initialize_storages()
await initialize_pipeline_status()
```
### Model Switching Issues
When switching between different embedding models, you must clear the data directory to avoid errors. The only file you may want to preserve is `kv_store_llm_response_cache.json` if you wish to retain the LLM cache.
## 🌟Citation
## 📖 Citation
```python
@article{guo2024lightrag,
@@ -1548,4 +1707,31 @@ primaryClass={cs.IR}
}
```
**Thank you for your interest in our work!**
---
<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 15px; padding: 30px; margin: 30px 0;">
<div>
<img src="https://user-images.githubusercontent.com/74038190/212284100-561aa473-3905-4a80-b561-0d28506553ee.gif" width="500">
</div>
<div style="margin-top: 20px;">
<a href="https://github.com/HKUDS/LightRAG" style="text-decoration: none;">
<img src="https://img.shields.io/badge/⭐%20Star%20us%20on%20GitHub-1a1a2e?style=for-the-badge&logo=github&logoColor=white">
</a>
<a href="https://github.com/HKUDS/LightRAG/issues" style="text-decoration: none;">
<img src="https://img.shields.io/badge/🐛%20Report%20Issues-ff6b6b?style=for-the-badge&logo=github&logoColor=white">
</a>
<a href="https://github.com/HKUDS/LightRAG/discussions" style="text-decoration: none;">
<img src="https://img.shields.io/badge/💬%20Discussions-4ecdc4?style=for-the-badge&logo=github&logoColor=white">
</a>
</div>
</div>
<div align="center">
<div style="width: 100%; max-width: 600px; margin: 20px auto; padding: 20px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2);">
<div style="display: flex; justify-content: center; align-items: center; gap: 15px;">
<span style="font-size: 24px;"></span>
<span style="color: #00d9ff; font-size: 18px;">Thank you for visiting LightRAG!</span>
<span style="font-size: 24px;"></span>
</div>
</div>
</div>


@@ -1,360 +0,0 @@
# MinerU Integration Guide
### About MinerU
MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and office documents. It provides the following features:
- Text extraction while preserving document structure (headings, paragraphs, lists, etc.)
- Handling complex layouts including multi-column formats
- Automatic formula recognition and conversion to LaTeX format
- Image, table, and footnote extraction
- Automatic scanned document detection and OCR application
- Support for multiple output formats (Markdown, JSON)
### Installation
#### Installing MinerU Dependencies
If you have already installed LightRAG but don't have MinerU support, you can add MinerU support by installing the magic-pdf package directly:
```bash
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
```
These are the MinerU-related dependencies required by LightRAG.
#### MinerU Model Weights
MinerU requires model weight files to function properly. After installation, you need to download the required model weights. You can use either Hugging Face or ModelScope to download the models.
##### Option 1: Download from Hugging Face
```bash
pip install huggingface_hub
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
```
##### Option 2: Download from ModelScope (Recommended for users in China)
```bash
pip install modelscope
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
python download_models.py
```
Both methods will automatically download the model files and configure the model directory in the configuration file. The configuration file is located in your user directory and named `magic-pdf.json`.
> **Note for Windows users**: User directory is at `C:\Users\username`
> **Note for Linux users**: User directory is at `/home/username`
> **Note for macOS users**: User directory is at `/Users/username`
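
The three notes above all describe the current user's home directory, so the configuration file can be located in a platform-independent way. The helper names below are hypothetical, and the sketch assumes only what the text states: the file is called `magic-pdf.json`, sits in the user directory, and (as its extension suggests) contains JSON.

```python
import json
from pathlib import Path
from typing import Optional


def magic_pdf_config_path() -> Path:
    """Return the expected location of the MinerU configuration file."""
    return Path.home() / "magic-pdf.json"


def load_magic_pdf_config() -> Optional[dict]:
    """Load the configuration if the file exists; return None otherwise."""
    path = magic_pdf_config_path()
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


print(magic_pdf_config_path())
```

A `None` result from `load_magic_pdf_config()` is a quick hint that the model download script has not been run yet.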
#### Optional: LibreOffice Installation
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
**Linux/macOS:**
```bash
apt-get/yum/brew install libreoffice
```
**Windows:**
1. Install LibreOffice
2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
### Using MinerU Parser
#### Basic Usage
```python
from lightrag.mineru_parser import MineruParser
# Parse a PDF document
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
# Parse an image
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
# Parse an Office document
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
# Auto-detect and parse any supported document type
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
```
#### RAGAnything Integration
In RAGAnything, you can directly use file paths as input to the `process_document_complete` method to process documents. Here's a complete configuration example:
```python
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.raganything import RAGAnything
# Initialize RAGAnything
rag = RAGAnything(
    working_dir="./rag_storage",  # Working directory
    llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
        "gpt-4o-mini",  # Model to use
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key="your-api-key",  # Replace with your API key
        base_url="your-base-url",  # Replace with your API base URL
        **kwargs,
    ),
    vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
        "gpt-4o",  # Vision model
        "",
        system_prompt=None,
        history_messages=[],
        messages=[
            {"role": "system", "content": system_prompt} if system_prompt else None,
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                }
            ]} if image_data else {"role": "user", "content": prompt}
        ],
        api_key="your-api-key",  # Replace with your API key
        base_url="your-base-url",  # Replace with your API base URL
        **kwargs,
    ) if image_data else openai_complete_if_cache(
        "gpt-4o-mini",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key="your-api-key",  # Replace with your API key
        base_url="your-base-url",  # Replace with your API base URL
        **kwargs,
    ),
    embedding_func=lambda texts: openai_embed(
        texts,
        model="text-embedding-3-large",
        api_key="your-api-key",  # Replace with your API key
        base_url="your-base-url",  # Replace with your API base URL
    ),
    embedding_dim=3072,
    max_token_size=8192
)
# Process a single file
await rag.process_document_complete(
file_path="path/to/document.pdf",
output_dir="./output",
parse_method="auto"
)
# Query the processed document
result = await rag.query_with_multimodal(
"What is the main content of the document?",
mode="hybrid"
)
```
MinerU categorizes document content into text, formulas, images, and tables, processing each with its corresponding ingestion type:
- Text content: `ingestion_type='text'`
- Image content: `ingestion_type='image'`
- Table content: `ingestion_type='table'`
- Formula content: `ingestion_type='equation'`
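
The mapping above amounts to routing each parsed item by its category. The sketch below shows that routing on plain dictionaries; it assumes each item carries its category under a `type` key, which is an assumption for illustration and not something the text confirms, and `group_by_ingestion_type` is a hypothetical helper.

```python
from collections import defaultdict

# Ingestion types taken from the list above; anything else is bucketed
# as "unknown" in this sketch.
INGESTION_TYPES = {"text", "image", "table", "equation"}


def group_by_ingestion_type(content_list: list[dict]) -> dict[str, list[dict]]:
    """Bucket parsed items by their `type` field (assumed key name)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in content_list:
        kind = item.get("type")
        groups[kind if kind in INGESTION_TYPES else "unknown"].append(item)
    return dict(groups)


sample = [
    {"type": "text", "text": "Introduction"},
    {"type": "table", "table_body": "| a | b |"},
    {"type": "equation", "text": "E = mc^2"},
    {"type": "footnote", "text": "1"},
]
groups = group_by_ingestion_type(sample)
print({kind: len(items) for kind, items in groups.items()})
```

Keeping an explicit "unknown" bucket makes it obvious when a parser emits a category that no processor handles.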
#### Query Examples
Here are some common query examples:
```python
# Query text content
result = await rag.query_with_multimodal(
"What is the main topic of the document?",
mode="hybrid"
)
# Query image-related content
result = await rag.query_with_multimodal(
"Describe the images and figures in the document",
mode="hybrid"
)
# Query table-related content
result = await rag.query_with_multimodal(
"Tell me about the experimental results and data tables",
mode="hybrid"
)
```
#### Command Line Tool
We also provide a command-line tool for document parsing:
```bash
python examples/mineru_example.py path/to/document.pdf
```
Optional parameters:
- `--output` or `-o`: Specify output directory
- `--method` or `-m`: Choose parsing method (auto, ocr, txt)
- `--stats`: Display content statistics
### Output Format
MinerU generates three files for each parsed document:
1. `{filename}.md` - Markdown representation of the document
2. `{filename}_content_list.json` - Structured JSON content
3. `{filename}_model.json` - Detailed model parsing results
The `content_list.json` file contains all structured content extracted from the document, including:
- Text blocks (body text, headings, etc.)
- Images (paths and optional captions)
- Tables (table content and optional captions)
- Lists
- Formulas
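
Given the naming scheme above, the expected output paths can be derived from the input file name. This is a sketch under two assumptions that the text does not confirm: the three files are written directly into the output directory, and the content list is a JSON array. `output_paths` and `load_content_list` are hypothetical helpers.

```python
import json
from pathlib import Path


def output_paths(doc_path: str, output_dir: str) -> dict[str, Path]:
    """Build the three expected output paths for a parsed document."""
    stem = Path(doc_path).stem  # "document" for "path/to/document.pdf"
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "content_list": out / f"{stem}_content_list.json",
        "model": out / f"{stem}_model.json",
    }


def load_content_list(content_list_file: Path) -> list:
    """Read the structured content list (assumed to be a JSON array)."""
    return json.loads(content_list_file.read_text(encoding="utf-8"))


paths = output_paths("path/to/document.pdf", "output_dir")
print(paths["content_list"])
```

If the parser nests its results in a per-document subfolder, only the `out` line needs to change.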
### Troubleshooting
If you encounter issues with MinerU:
1. Check that model weights are correctly downloaded
2. Ensure you have sufficient RAM (16GB+ recommended)
3. For CUDA acceleration issues, see [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
4. If parsing Office documents fails, verify LibreOffice is properly installed
5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, it might be due to an incomplete model download. Try re-downloading the models.
6. For users with newer graphics cards (H100, etc.) and garbled OCR text, try upgrading the CUDA version used by Paddle:
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
7. If you encounter a "filename too long" error, the latest version of MineruParser includes logic to automatically handle this issue.
#### Updating Existing Models
If you have previously downloaded models and need to update them, you can simply run the download script again. The script will update the model directory to the latest version.
### Advanced Configuration
The MinerU configuration file `magic-pdf.json` supports various customization options, including:
- Model directory path
- OCR engine selection
- GPU acceleration settings
- Cache settings
For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
### Using Modal Processors Directly
You can also use LightRAG's modal processors directly without going through MinerU. This is useful when you want to process specific types of content or have more control over the processing pipeline.
Each modal processor returns a tuple containing:
1. A description of the processed content
2. Entity information that can be used for further processing or storage
The processors support different types of content:
- `ImageModalProcessor`: Processes images with captions and footnotes
- `TableModalProcessor`: Processes tables with captions and footnotes
- `EquationModalProcessor`: Processes mathematical equations in LaTeX format
- `GenericModalProcessor`: A base processor that can be extended for custom content types
> **Note**: A complete working example can be found in `examples/modalprocessors_example.py`. You can run it using:
> ```bash
> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
> ```
<details>
<summary> Here's an example of how to use different modal processors: </summary>
```python
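from lightrag import LightRAG
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.modalprocessors import (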
ImageModalProcessor,
TableModalProcessor,
EquationModalProcessor,
GenericModalProcessor
)
# Initialize LightRAG
lightrag = LightRAG(
working_dir="./rag_storage",
embedding_func=lambda texts: openai_embed(
texts,
model="text-embedding-3-large",
api_key="your-api-key",
base_url="your-base-url",
),
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
"gpt-4o-mini",
prompt,
system_prompt=system_prompt,
history_messages=history_messages,
api_key="your-api-key",
base_url="your-base-url",
**kwargs,
),
)
# Process an image
image_processor = ImageModalProcessor(
lightrag=lightrag,
modal_caption_func=vision_model_func
)
image_content = {
"img_path": "image.jpg",
"img_caption": ["Example image caption"],
"img_footnote": ["Example image footnote"]
}
description, entity_info = await image_processor.process_multimodal_content(
modal_content=image_content,
content_type="image",
file_path="image_example.jpg",
entity_name="Example Image"
)
# Process a table
table_processor = TableModalProcessor(
lightrag=lightrag,
modal_caption_func=llm_model_func
)
table_content = {
"table_body": """
| Name | Age | Occupation |
|------|-----|------------|
| John | 25 | Engineer |
| Mary | 30 | Designer |
""",
"table_caption": ["Employee Information Table"],
"table_footnote": ["Data updated as of 2024"]
}
description, entity_info = await table_processor.process_multimodal_content(
modal_content=table_content,
content_type="table",
file_path="table_example.md",
entity_name="Employee Table"
)
# Process an equation
equation_processor = EquationModalProcessor(
lightrag=lightrag,
modal_caption_func=llm_model_func
)
equation_content = {
"text": "E = mc^2",
"text_format": "LaTeX"
}
description, entity_info = await equation_processor.process_multimodal_content(
modal_content=equation_content,
content_type="equation",
file_path="equation_example.txt",
entity_name="Mass-Energy Equivalence"
)
```
</details>


@@ -1,358 +0,0 @@
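# MinerU Integration Guide
### About MinerU
MinerU is a powerful open-source tool for extracting high-quality structured data from PDFs, images, and Office documents. It provides the following features:
- Text extraction that preserves document structure (headings, paragraphs, lists, etc.)
- Handling of complex layouts, including multi-column formats
- Automatic formula recognition and conversion to LaTeX
- Extraction of images, tables, and footnotes
- Automatic detection of scanned documents and application of OCR
- Support for multiple output formats (Markdown, JSON)
### Installation
#### Installing MinerU Dependencies
If you have already installed LightRAG without MinerU support, you can add it by installing the magic-pdf package directly:
```bash
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
```
These are the MinerU-related dependencies required by LightRAG.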
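#### MinerU Model Weights
MinerU needs model weight files to run properly. After installation, download the required model weights from either Hugging Face or ModelScope.
##### Option 1: Download from Hugging Face
```bash
pip install huggingface_hub
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
```
##### Option 2: Download from ModelScope (recommended for users in China)
```bash
pip install modelscope
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
python download_models.py
```
Both methods automatically download the model files and set the model directory in the configuration file. The configuration file is located in your user directory and is named `magic-pdf.json`.
> **Note for Windows users**: The user directory is `C:\Users\username`
> **Note for Linux users**: The user directory is `/home/username`
> **Note for macOS users**: The user directory is `/Users/username`
#### Optional: Installing LibreOffice
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
**Linux/macOS:**
```bash
apt-get/yum/brew install libreoffice
```
**Windows:**
1. Install LibreOffice
2. Add the installation directory to your PATH environment variable: `install_dir\LibreOffice\program`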
### 使用 MinerU 解析器
#### 基本用法
```python
from lightrag.mineru_parser import MineruParser
# Parse a PDF document
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
# Parse an image
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
# Parse an Office document
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
# Auto-detect and parse any supported document type
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
```
#### RAGAnything integration
In RAGAnything, you can pass a file path directly to the `process_document_complete` method to process a document. A complete configuration example follows:
```python
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.raganything import RAGAnything
# Initialize RAGAnything
rag = RAGAnything(
working_dir="./rag_storage", # Working directory
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
"gpt-4o-mini", # Model to use
prompt,
system_prompt=system_prompt,
history_messages=history_messages,
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
**kwargs,
),
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
"gpt-4o", # Vision model
"",
system_prompt=None,
history_messages=[],
messages=[
{"role": "system", "content": system_prompt} if system_prompt else None,
{"role": "user", "content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_data}"
}
}
]} if image_data else {"role": "user", "content": prompt}
],
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
**kwargs,
) if image_data else openai_complete_if_cache(
"gpt-4o-mini",
prompt,
system_prompt=system_prompt,
history_messages=history_messages,
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
**kwargs,
),
embedding_func=lambda texts: openai_embed(
texts,
model="text-embedding-3-large",
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
),
embedding_dim=3072,
max_token_size=8192
)
# Process a single file
await rag.process_document_complete(
file_path="path/to/document.pdf",
output_dir="./output",
parse_method="auto"
)
# Query the processed document
result = await rag.query_with_multimodal(
"What is the main content of the document?",
mode="hybrid"
)
```
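MinerU classifies document content into text, formulas, images, and tables, and each is processed with its corresponding ingestion type:
- Text content: `ingestion_type='text'`
- Image content: `ingestion_type='image'`
- Table content: `ingestion_type='table'`
- Formula content: `ingestion_type='equation'`
#### Query examples
Some common query examples: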
```python
# Query text content
result = await rag.query_with_multimodal(
"What is the main topic of the document?",
mode="hybrid"
)
# Query image-related content
result = await rag.query_with_multimodal(
"Describe the images and figures in the document",
mode="hybrid"
)
# Query table-related content
result = await rag.query_with_multimodal(
"Tell me about the experimental results and data tables",
mode="hybrid"
)
```
#### Command-line tool
A command-line tool for document parsing is also provided:
```bash
python examples/mineru_example.py path/to/document.pdf
```
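Optional arguments:
- `--output` or `-o`: output directory
- `--method` or `-m`: parsing method (auto, ocr, txt)
- `--stats`: display content statistics
### Output format
MinerU generates three files for each parsed document:
1. `{filename}.md` - Markdown representation of the document
2. `{filename}_content_list.json` - structured JSON content
3. `{filename}_model.json` - detailed model parsing results
The `content_list.json` file contains all structured content extracted from the document, including:
- Text blocks (body text, headings, etc.)
- Images (paths and optional captions)
- Tables (table content and optional captions)
- Lists
- Formulas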
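As a minimal sketch of working with this output, the snippet below loads a `content_list.json`-style file and counts its blocks by type. It assumes each entry is a JSON object with a `type` key, which is how the example script's statistics code reads it; the sample entries, the file name, and the helper `summarize_content_list` are hypothetical and only serve the illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_content_list(path):
    """Count content blocks by their "type" field (entries without one count as "unknown")."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return Counter(item.get("type", "unknown") for item in items)


# Hypothetical sample standing in for a real `{filename}_content_list.json`.
sample = [
    {"type": "text", "text": "Introduction"},
    {"type": "table", "table_body": "| a | b |"},
    {"type": "text", "text": "Conclusion"},
    {"text": "entry without a type"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "example_content_list.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    counts = summarize_content_list(sample_path)

for content_type, count in counts.items():
    print(f"- {content_type}: {count}")
```

With real parser output, pass the path of the generated `_content_list.json` file instead of the temporary sample.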
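### Troubleshooting
If you run into problems with MinerU:
1. Check that the model weights were downloaded correctly
2. Make sure enough memory is available (16 GB or more recommended)
3. For CUDA acceleration issues, see the [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
4. If parsing Office documents fails, verify that LibreOffice is installed correctly
5. If you hit `pickle.UnpicklingError: invalid load key, 'v'.`, the model download is probably incomplete. Try downloading the models again.
6. If you use a newer GPU (H100, etc.) and see garbled OCR text, try upgrading the CUDA version used by Paddle: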
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
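7. If you hit a "file name too long" error, the latest version of MineruParser already includes logic to handle it automatically.
#### Updating existing models
If you downloaded the models earlier and need to update them, run the download script again. It updates the model directory to the latest version.
### Advanced configuration
The MinerU configuration file `magic-pdf.json` supports a range of customization options, including:
- Model directory path
- OCR engine selection
- GPU acceleration settings
- Cache settings
For the full set of configuration options, see the [official MinerU documentation](https://mineru.readthedocs.io/).
### Using the modal processors directly
You can also use LightRAG's modal processors directly, without going through MinerU. This is especially useful when you want to process a specific type of content or need more control over the processing pipeline.
Each modal processor returns a tuple containing:
1. A description of the processed content
2. Entity information that can be used for further processing or storage
The processors support different content types:
- `ImageModalProcessor`: processes images with captions and footnotes
- `TableModalProcessor`: processes tables with captions and footnotes
- `EquationModalProcessor`: processes mathematical formulas in LaTeX format
- `GenericModalProcessor`: a base processor that can be extended for custom content types
> **Note**: A complete runnable example is available in `examples/modalprocessors_example.py`. Run it with: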
> ```bash
> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
> ```
<details>
<summary> Examples of using the different modal processors </summary>
```python
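from lightrag import LightRAG
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.modalprocessors import (
ImageModalProcessor,
TableModalProcessor,
EquationModalProcessor,
GenericModalProcessor
)
# Initialize LightRAG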
lightrag = LightRAG(
working_dir="./rag_storage",
embedding_func=lambda texts: openai_embed(
texts,
model="text-embedding-3-large",
api_key="your-api-key",
base_url="your-base-url",
),
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
"gpt-4o-mini",
prompt,
system_prompt=system_prompt,
history_messages=history_messages,
api_key="your-api-key",
base_url="your-base-url",
**kwargs,
),
)
# Process an image
image_processor = ImageModalProcessor(
lightrag=lightrag,
modal_caption_func=vision_model_func
)
image_content = {
"img_path": "image.jpg",
"img_caption": ["Example image caption"],
"img_footnote": ["Example image footnote"]
}
description, entity_info = await image_processor.process_multimodal_content(
modal_content=image_content,
content_type="image",
file_path="image_example.jpg",
entity_name="Example Image"
)
# Process a table
table_processor = TableModalProcessor(
lightrag=lightrag,
modal_caption_func=llm_model_func
)
table_content = {
"table_body": """
| Name | Age | Occupation |
|------|-----|------|
| Zhang San | 25 | Engineer |
| Li Si | 30 | Designer |
""",
"table_caption": ["Employee Information Table"],
"table_footnote": ["Data updated as of 2024"]
}
description, entity_info = await table_processor.process_multimodal_content(
modal_content=table_content,
content_type="table",
file_path="table_example.md",
entity_name="Employee Table"
)
# Process an equation
equation_processor = EquationModalProcessor(
lightrag=lightrag,
modal_caption_func=llm_model_func
)
equation_content = {
"text": "E = mc^2",
"text_format": "LaTeX"
}
description, entity_info = await equation_processor.process_multimodal_content(
modal_content=equation_content,
content_type="equation",
file_path="equation_example.txt",
entity_name="Mass-Energy Equivalence"
)
```
</details>

View file

@ -1,85 +0,0 @@
#!/usr/bin/env python
"""
Example script demonstrating the basic usage of MinerU parser
This example shows how to:
1. Parse different types of documents (PDF, images, office documents)
2. Use different parsing methods
3. Display document statistics
"""
import os
import argparse
from lightrag.mineru_parser import MineruParser


def parse_document(
    file_path: str, output_dir: str = None, method: str = "auto", stats: bool = False
):
    """
    Parse a document using MinerU parser

    Args:
        file_path: Path to the document
        output_dir: Output directory for parsed results
        method: Parsing method (auto, ocr, txt)
        stats: Whether to display content statistics
    """
    try:
        # Parse the document
        content_list, md_content = MineruParser.parse_document(
            file_path=file_path, parse_method=method, output_dir=output_dir
        )

        # Display statistics if requested
        if stats:
            print("\nDocument Statistics:")
            print(f"Total content blocks: {len(content_list)}")

            # Count different types of content
            content_types = {}
            for item in content_list:
                content_type = item.get("type", "unknown")
                content_types[content_type] = content_types.get(content_type, 0) + 1

            print("\nContent Type Distribution:")
            for content_type, count in content_types.items():
                print(f"- {content_type}: {count}")

        return content_list, md_content

    except Exception as e:
        print(f"Error parsing document: {str(e)}")
        return None, None


def main():
    """Main function to run the example"""
    parser = argparse.ArgumentParser(description="MinerU Parser Example")
    parser.add_argument("file_path", help="Path to the document to parse")
    parser.add_argument("--output", "-o", help="Output directory path")
    parser.add_argument(
        "--method",
        "-m",
        choices=["auto", "ocr", "txt"],
        default="auto",
        help="Parsing method (auto, ocr, txt)",
    )
    parser.add_argument(
        "--stats", action="store_true", help="Display content statistics"
    )

    args = parser.parse_args()

    # Create output directory if specified
    if args.output:
        os.makedirs(args.output, exist_ok=True)

    # Parse document
    content_list, md_content = parse_document(
        args.file_path, args.output, args.method, args.stats
    )


if __name__ == "__main__":
    main()

View file

@ -9,7 +9,7 @@ import argparse
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag import LightRAG
from lightrag.modalprocessors import (
from raganything.modalprocessors import (
ImageModalProcessor,
TableModalProcessor,
EquationModalProcessor,

View file

@ -12,7 +12,7 @@ import os
import argparse
import asyncio
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.raganything import RAGAnything
from raganything.raganything import RAGAnything
async def process_with_rag(

View file

@ -1 +1 @@
__api_version__ = "0173"
__api_version__ = "0174"

View file

@ -355,7 +355,13 @@ def create_app(args):
)
# Add routes
app.include_router(create_document_routes(rag, doc_manager, api_key))
app.include_router(
create_document_routes(
rag,
doc_manager,
api_key,
)
)
app.include_router(create_query_routes(rag, api_key, args.top_k))
app.include_router(create_graph_routes(rag, api_key))

View file

@ -12,11 +12,18 @@ import pipmaster as pm
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Any, Literal
from fastapi import APIRouter, BackgroundTasks, Depends, File, HTTPException, UploadFile
from fastapi import (
APIRouter,
BackgroundTasks,
Depends,
File,
HTTPException,
UploadFile,
)
from pydantic import BaseModel, Field, field_validator
from lightrag import LightRAG
from lightrag.base import DocProcessingStatus, DocStatus
from lightrag.base import DeletionResult, DocProcessingStatus, DocStatus
from lightrag.api.utils_api import get_combined_auth_dependency
from ..config import global_args
@ -252,6 +259,40 @@ Attributes:
"""
class DeleteDocRequest(BaseModel):
doc_id: str = Field(..., description="The ID of the document to delete.")
@field_validator("doc_id", mode="after")
@classmethod
def validate_doc_id(cls, doc_id: str) -> str:
if not doc_id or not doc_id.strip():
raise ValueError("Document ID cannot be empty")
return doc_id.strip()
class DeleteEntityRequest(BaseModel):
entity_name: str = Field(..., description="The name of the entity to delete.")
@field_validator("entity_name", mode="after")
@classmethod
def validate_entity_name(cls, entity_name: str) -> str:
if not entity_name or not entity_name.strip():
raise ValueError("Entity name cannot be empty")
return entity_name.strip()
class DeleteRelationRequest(BaseModel):
source_entity: str = Field(..., description="The name of the source entity.")
target_entity: str = Field(..., description="The name of the target entity.")
@field_validator("source_entity", "target_entity", mode="after")
@classmethod
def validate_entity_names(cls, entity_name: str) -> str:
if not entity_name or not entity_name.strip():
raise ValueError("Entity name cannot be empty")
return entity_name.strip()
class DocStatusResponse(BaseModel):
id: str = Field(description="Document identifier")
content_summary: str = Field(description="Summary of document content")
@ -1318,6 +1359,119 @@ def create_document_routes(
logger.error(traceback.format_exc())
raise HTTPException(status_code=500, detail=str(e))
class DeleteDocByIdResponse(BaseModel):
"""Response model for single document deletion operation."""
status: Literal["success", "fail", "not_found", "busy"] = Field(
description="Status of the deletion operation"
)
message: str = Field(description="Message describing the operation result")
doc_id: Optional[str] = Field(
default=None, description="The ID of the document."
)
@router.delete(
"/delete_document",
response_model=DeleteDocByIdResponse,
dependencies=[Depends(combined_auth)],
summary="Delete a document and all its associated data by its ID.",
)
# TODO This method needs to be modified to be asynchronous (please do not use)
async def delete_document(
delete_request: DeleteDocRequest,
) -> DeleteDocByIdResponse:
"""
This method needs to be modified to be asynchronous (please do not use)
Deletes a specific document and all its associated data, including its status,
text chunks, vector embeddings, and any related graph data.
It is disabled when llm cache for entity extraction is disabled.
This operation is irreversible and will interact with the pipeline status.
Args:
delete_request (DeleteDocRequest): The request containing the document ID.
Returns:
DeleteDocByIdResponse: The result of the deletion operation.
- status="success": The document was successfully deleted.
- status="not_found": The document with the specified ID was not found.
- status="fail": The deletion operation failed.
- status="busy": The pipeline is busy with another operation.
Raises:
HTTPException:
- 500: If an unexpected internal error occurs.
"""
# The rag object is initialized from the server startup args,
# so we can access its properties here.
if not rag.enable_llm_cache_for_entity_extract:
raise HTTPException(
status_code=403,
detail="Operation not allowed when LLM cache for entity extraction is disabled.",
)
from lightrag.kg.shared_storage import (
get_namespace_data,
get_pipeline_status_lock,
)
doc_id = delete_request.doc_id
pipeline_status = await get_namespace_data("pipeline_status")
pipeline_status_lock = get_pipeline_status_lock()
async with pipeline_status_lock:
if pipeline_status.get("busy", False):
return DeleteDocByIdResponse(
status="busy",
message="Cannot delete document while pipeline is busy",
doc_id=doc_id,
)
pipeline_status.update(
{
"busy": True,
"job_name": f"Deleting Document: {doc_id}",
"job_start": datetime.now().isoformat(),
"latest_message": "Starting document deletion process",
}
)
# Use slice assignment to clear the list in place
pipeline_status["history_messages"][:] = [
f"Starting deletion for doc_id: {doc_id}"
]
try:
result = await rag.adelete_by_doc_id(doc_id)
if "history_messages" in pipeline_status:
pipeline_status["history_messages"].append(result.message)
if result.status == "not_found":
raise HTTPException(status_code=404, detail=result.message)
if result.status == "fail":
raise HTTPException(status_code=500, detail=result.message)
return DeleteDocByIdResponse(
doc_id=result.doc_id,
message=result.message,
status=result.status,
)
except Exception as e:
error_msg = f"Error deleting document {doc_id}: {str(e)}"
logger.error(error_msg)
logger.error(traceback.format_exc())
if "history_messages" in pipeline_status:
pipeline_status["history_messages"].append(error_msg)
# Re-raise as HTTPException for consistent error handling by FastAPI
raise HTTPException(status_code=500, detail=error_msg)
finally:
async with pipeline_status_lock:
pipeline_status["busy"] = False
completion_msg = f"Document deletion process for {doc_id} completed."
pipeline_status["latest_message"] = completion_msg
if "history_messages" in pipeline_status:
pipeline_status["history_messages"].append(completion_msg)
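
The busy-flag handling in `delete_document` above follows a common pattern: check and set the flag under a lock, run the long job outside the lock, and clear the flag in a `finally` block. A minimal standalone sketch of that pattern, using only `asyncio`, is shown below; `run_exclusive`, `demo`, and `slow_delete` are hypothetical names for illustration and are not part of LightRAG.

```python
import asyncio


async def run_exclusive(status, lock, job_name, job):
    """Run `job` only if no other job holds the busy flag; always clear the flag afterwards."""
    async with lock:
        if status.get("busy", False):
            return "busy"  # another job is running; refuse instead of waiting
        status["busy"] = True
        status["job_name"] = job_name
    try:
        return await job()
    finally:
        async with lock:
            status["busy"] = False


async def demo():
    status, lock = {"busy": False}, asyncio.Lock()

    async def slow_delete():
        await asyncio.sleep(0.05)  # stands in for the real deletion work
        return "success"

    # Start one deletion, then try a second while the first is still running.
    first = asyncio.create_task(run_exclusive(status, lock, "delete doc-1", slow_delete))
    await asyncio.sleep(0.01)
    second = await run_exclusive(status, lock, "delete doc-2", slow_delete)
    return await first, second, status["busy"]


outcome = asyncio.run(demo())
print(outcome)  # ('success', 'busy', False)
```

Holding the lock only while reading and writing the flag keeps status checks responsive even when the job itself takes a long time.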
@router.post(
"/clear_cache",
response_model=ClearCacheResponse,
@ -1371,4 +1525,77 @@ def create_document_routes(
logger.error(traceback.format_exc())
raise HTTPException(status_code=500, detail=str(e))
@router.delete(
"/delete_entity",
response_model=DeletionResult,
dependencies=[Depends(combined_auth)],
)
async def delete_entity(request: DeleteEntityRequest):
"""
Delete an entity and all its relationships from the knowledge graph.
Args:
request (DeleteEntityRequest): The request body containing the entity name.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
Raises:
HTTPException: If the entity is not found (404) or an error occurs (500).
"""
try:
result = await rag.adelete_by_entity(entity_name=request.entity_name)
if result.status == "not_found":
raise HTTPException(status_code=404, detail=result.message)
if result.status == "fail":
raise HTTPException(status_code=500, detail=result.message)
# Set doc_id to empty string since this is an entity operation, not document
result.doc_id = ""
return result
except HTTPException:
raise
except Exception as e:
error_msg = f"Error deleting entity '{request.entity_name}': {str(e)}"
logger.error(error_msg)
logger.error(traceback.format_exc())
raise HTTPException(status_code=500, detail=error_msg)
@router.delete(
"/delete_relation",
response_model=DeletionResult,
dependencies=[Depends(combined_auth)],
)
async def delete_relation(request: DeleteRelationRequest):
"""
Delete a relationship between two entities from the knowledge graph.
Args:
request (DeleteRelationRequest): The request body containing the source and target entity names.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
Raises:
HTTPException: If the relation is not found (404) or an error occurs (500).
"""
try:
result = await rag.adelete_by_relation(
source_entity=request.source_entity,
target_entity=request.target_entity,
)
if result.status == "not_found":
raise HTTPException(status_code=404, detail=result.message)
if result.status == "fail":
raise HTTPException(status_code=500, detail=result.message)
# Set doc_id to empty string since this is a relation operation, not document
result.doc_id = ""
return result
except HTTPException:
raise
except Exception as e:
error_msg = f"Error deleting relation from '{request.source_entity}' to '{request.target_entity}': {str(e)}"
logger.error(error_msg)
logger.error(traceback.format_exc())
raise HTTPException(status_code=500, detail=error_msg)
return router

View file

@ -278,6 +278,21 @@ class BaseKVStorage(StorageNameSpace, ABC):
False: if the cache drop failed, or the cache mode is not supported
"""
# async def drop_cache_by_chunk_ids(self, chunk_ids: list[str] | None = None) -> bool:
# """Delete specific cache records from storage by chunk IDs
# Important notes for in-memory storage:
# 1. Changes will be persisted to disk during the next index_done_callback
# 2. Update flags to notify other processes that data persistence is needed
# Args:
# chunk_ids (list[str]): List of chunk IDs to be dropped from storage
# Returns:
# True: if the cache was dropped successfully
# False: if the cache drop failed, or the operation is not supported
# """
@dataclass
class BaseGraphStorage(StorageNameSpace, ABC):
@ -598,3 +613,13 @@ class StoragesStatus(str, Enum):
CREATED = "created"
INITIALIZED = "initialized"
FINALIZED = "finalized"
@dataclass
class DeletionResult:
"""Represents the result of a deletion operation."""
status: Literal["success", "not_found", "fail"]
doc_id: str
message: str
status_code: int = 200
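
A minimal sketch of how a caller might turn such a result into an HTTP status, mirroring the `not_found` → 404 and `fail` → 500 handling in the delete routes. The dataclass is re-declared so the snippet runs on its own; `http_status_for` is a hypothetical helper and not part of LightRAG.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class DeletionResult:
    """Local copy of the dataclass above, re-declared to keep the sketch standalone."""

    status: Literal["success", "not_found", "fail"]
    doc_id: str
    message: str
    status_code: int = 200


def http_status_for(result: DeletionResult) -> int:
    """Hypothetical helper: derive the HTTP status from the result's status string."""
    return {"success": 200, "not_found": 404, "fail": 500}[result.status]


missing = DeletionResult(
    status="not_found",
    doc_id="doc-12345",
    message="Document doc-12345 not found.",
    status_code=404,
)
deleted = DeletionResult(status="success", doc_id="doc-1", message="Deleted.")
print(http_status_for(missing), http_status_for(deleted))  # 404 200
```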

View file

@ -172,6 +172,53 @@ class JsonKVStorage(BaseKVStorage):
except Exception:
return False
# async def drop_cache_by_chunk_ids(self, chunk_ids: list[str] | None = None) -> bool:
# """Delete specific cache records from storage by chunk IDs
# Important notes for in-memory storage:
# 1. Changes will be persisted to disk during the next index_done_callback
# 2. Update flags to notify other processes that data persistence is needed
# Args:
# chunk_ids (list[str]): List of chunk IDs to be dropped from storage
# Returns:
# True: if the cache was dropped successfully
# False: if the cache drop failed
# """
# if not chunk_ids:
# return False
# try:
# async with self._storage_lock:
# # Iterate through all cache modes to find entries with matching chunk_ids
# for mode_key, mode_data in list(self._data.items()):
# if isinstance(mode_data, dict):
# # Check each cached entry in this mode
# for cache_key, cache_entry in list(mode_data.items()):
# if (
# isinstance(cache_entry, dict)
# and cache_entry.get("chunk_id") in chunk_ids
# ):
# # Remove this cache entry
# del mode_data[cache_key]
# logger.debug(
# f"Removed cache entry {cache_key} for chunk {cache_entry.get('chunk_id')}"
# )
# # If the mode is now empty, remove it entirely
# if not mode_data:
# del self._data[mode_key]
# # Set update flags to notify persistence is needed
# await set_all_update_flags(self.namespace)
# logger.info(f"Cleared cache for {len(chunk_ids)} chunk IDs")
# return True
# except Exception as e:
# logger.error(f"Error clearing cache by chunk IDs: {e}")
# return False
async def drop(self) -> dict[str, str]:
"""Drop all data from storage and clean up resources
This action will persist the data to disk immediately.

View file

@ -1,4 +1,3 @@
import inspect
import os
import re
from dataclasses import dataclass
@ -307,7 +306,7 @@ class Neo4JStorage(BaseGraphStorage):
for label in node_dict["labels"]
if label != "base"
]
logger.debug(f"Neo4j query node {query} return: {node_dict}")
# logger.debug(f"Neo4j query node {query} return: {node_dict}")
return node_dict
return None
finally:
@ -382,9 +381,9 @@ class Neo4JStorage(BaseGraphStorage):
return 0
degree = record["degree"]
logger.debug(
f"Neo4j query node degree for {node_id} return: {degree}"
)
# logger.debug(
# f"Neo4j query node degree for {node_id} return: {degree}"
# )
return degree
finally:
await result.consume() # Ensure result is fully consumed
@ -424,7 +423,7 @@ class Neo4JStorage(BaseGraphStorage):
logger.warning(f"No node found with label '{nid}'")
degrees[nid] = 0
logger.debug(f"Neo4j batch node degree query returned: {degrees}")
# logger.debug(f"Neo4j batch node degree query returned: {degrees}")
return degrees
async def edge_degree(self, src_id: str, tgt_id: str) -> int:
@ -512,7 +511,7 @@ class Neo4JStorage(BaseGraphStorage):
if records:
try:
edge_result = dict(records[0]["edge_properties"])
logger.debug(f"Result: {edge_result}")
# logger.debug(f"Result: {edge_result}")
# Ensure required keys exist with defaults
required_keys = {
"weight": 0.0,
@ -528,9 +527,9 @@ class Neo4JStorage(BaseGraphStorage):
f"missing {key}, using default: {default_value}"
)
logger.debug(
f"{inspect.currentframe().f_code.co_name}:query:{query}:result:{edge_result}"
)
# logger.debug(
# f"{inspect.currentframe().f_code.co_name}:query:{query}:result:{edge_result}"
# )
return edge_result
except (KeyError, TypeError, ValueError) as e:
logger.error(
@ -545,9 +544,9 @@ class Neo4JStorage(BaseGraphStorage):
"keywords": None,
}
logger.debug(
f"{inspect.currentframe().f_code.co_name}: No edge found between {source_node_id} and {target_node_id}"
)
# logger.debug(
# f"{inspect.currentframe().f_code.co_name}: No edge found between {source_node_id} and {target_node_id}"
# )
# Return None when no edge found
return None
finally:
@ -766,9 +765,6 @@ class Neo4JStorage(BaseGraphStorage):
result = await tx.run(
query, entity_id=node_id, properties=properties
)
logger.debug(
f"Upserted node with entity_id '{node_id}' and properties: {properties}"
)
await result.consume() # Ensure result is fully consumed
await session.execute_write(execute_upsert)
@ -824,12 +820,7 @@ class Neo4JStorage(BaseGraphStorage):
properties=edge_properties,
)
try:
records = await result.fetch(2)
if records:
logger.debug(
f"Upserted edge from '{source_node_id}' to '{target_node_id}'"
f"with properties: {edge_properties}"
)
await result.fetch(2)
finally:
await result.consume() # Ensure result is consumed

View file

@ -106,6 +106,35 @@ class PostgreSQLDB:
):
pass
async def _migrate_llm_cache_add_chunk_id(self):
"""Add chunk_id column to LIGHTRAG_LLM_CACHE table if it doesn't exist"""
try:
# Check if chunk_id column exists
check_column_sql = """
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'lightrag_llm_cache'
AND column_name = 'chunk_id'
"""
column_info = await self.query(check_column_sql)
if not column_info:
logger.info("Adding chunk_id column to LIGHTRAG_LLM_CACHE table")
add_column_sql = """
ALTER TABLE LIGHTRAG_LLM_CACHE
ADD COLUMN chunk_id VARCHAR(255) NULL
"""
await self.execute(add_column_sql)
logger.info(
"Successfully added chunk_id column to LIGHTRAG_LLM_CACHE table"
)
else:
logger.info(
"chunk_id column already exists in LIGHTRAG_LLM_CACHE table"
)
except Exception as e:
logger.warning(f"Failed to add chunk_id column to LIGHTRAG_LLM_CACHE: {e}")
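
The migration above is a check-then-alter step that is safe to run on every startup. The sketch below shows the same idea in a self-contained form; it deliberately swaps PostgreSQL for the standard-library `sqlite3` module (so `PRAGMA table_info` replaces the `information_schema.columns` query), and the helper `add_column_if_missing` and the table name `llm_cache` are hypothetical.

```python
import sqlite3


def add_column_if_missing(conn, table, column, ddl_type):
    """Add `column` to `table` unless it already exists; return True if it was added.

    Identifiers are interpolated into the SQL, so only pass trusted names.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_cache (id TEXT PRIMARY KEY, return_value TEXT)")

first_run = add_column_if_missing(conn, "llm_cache", "chunk_id", "TEXT")
second_run = add_column_if_missing(conn, "llm_cache", "chunk_id", "TEXT")
print(first_run, second_run)  # True False
```

Running the helper twice shows the idempotence that lets the migration sit in the normal initialization path.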
async def _migrate_timestamp_columns(self):
"""Migrate timestamp columns in tables to timezone-aware types, assuming original data is in UTC time"""
# Tables and columns that need migration
@ -203,6 +232,13 @@ class PostgreSQLDB:
logger.error(f"PostgreSQL, Failed to migrate timestamp columns: {e}")
# Don't throw an exception, allow the initialization process to continue
# Migrate LLM cache table to add chunk_id field if needed
try:
await self._migrate_llm_cache_add_chunk_id()
except Exception as e:
logger.error(f"PostgreSQL, Failed to migrate LLM cache chunk_id field: {e}")
# Don't throw an exception, allow the initialization process to continue
async def query(
self,
sql: str,
@ -253,25 +289,31 @@ class PostgreSQLDB:
sql: str,
data: dict[str, Any] | None = None,
upsert: bool = False,
ignore_if_exists: bool = False,
with_age: bool = False,
graph_name: str | None = None,
):
try:
async with self.pool.acquire() as connection: # type: ignore
if with_age and graph_name:
await self.configure_age(connection, graph_name) # type: ignore
await self.configure_age(connection, graph_name)
elif with_age and not graph_name:
raise ValueError("Graph name is required when with_age is True")
if data is None:
await connection.execute(sql) # type: ignore
await connection.execute(sql)
else:
await connection.execute(sql, *data.values()) # type: ignore
await connection.execute(sql, *data.values())
except (
asyncpg.exceptions.UniqueViolationError,
asyncpg.exceptions.DuplicateTableError,
asyncpg.exceptions.DuplicateObjectError, # Catch "already exists" error
asyncpg.exceptions.InvalidSchemaNameError, # Also catch for AGE extension "already exists"
) as e:
if upsert:
if ignore_if_exists:
# If the flag is set, just ignore these specific errors
pass
elif upsert:
print("Key value duplicate, but upsert succeeded.")
else:
logger.error(f"Upsert error: {e}")
@ -497,6 +539,7 @@ class PGKVStorage(BaseKVStorage):
"original_prompt": v["original_prompt"],
"return_value": v["return"],
"mode": mode,
"chunk_id": v.get("chunk_id"),
}
await self.db.execute(upsert_sql, _data)
@ -1175,16 +1218,15 @@ class PGGraphStorage(BaseGraphStorage):
]
for query in queries:
try:
await self.db.execute(
query,
upsert=True,
with_age=True,
graph_name=self.graph_name,
)
# logger.info(f"Successfully executed: {query}")
except Exception:
continue
# Use the new flag to silently ignore "already exists" errors
# at the source, preventing log spam.
await self.db.execute(
query,
upsert=True,
ignore_if_exists=True, # Pass the new flag
with_age=True,
graph_name=self.graph_name,
)
async def finalize(self):
if self.db is not None:
@ -2357,6 +2399,7 @@ TABLES = {
mode varchar(32) NOT NULL,
original_prompt TEXT,
return_value TEXT,
chunk_id VARCHAR(255) NULL,
create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
update_time TIMESTAMP,
CONSTRAINT LIGHTRAG_LLM_CACHE_PK PRIMARY KEY (workspace, mode, id)
@ -2389,10 +2432,10 @@ SQL_TEMPLATES = {
chunk_order_index, full_doc_id, file_path
FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 AND id=$2
""",
"get_by_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
"get_by_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND mode=$2
""",
"get_by_mode_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
"get_by_mode_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND mode=$2 AND id=$3
""",
"get_by_ids_full_docs": """SELECT id, COALESCE(content, '') as content
@ -2402,7 +2445,7 @@ SQL_TEMPLATES = {
chunk_order_index, full_doc_id, file_path
FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 AND id IN ({ids})
""",
"get_by_ids_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
"get_by_ids_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND mode= IN ({ids})
""",
"filter_keys": "SELECT id FROM {table_name} WHERE workspace=$1 AND id IN ({ids})",
@ -2411,12 +2454,13 @@ SQL_TEMPLATES = {
ON CONFLICT (workspace,id) DO UPDATE
SET content = $2, update_time = CURRENT_TIMESTAMP
""",
"upsert_llm_response_cache": """INSERT INTO LIGHTRAG_LLM_CACHE(workspace,id,original_prompt,return_value,mode)
VALUES ($1, $2, $3, $4, $5)
"upsert_llm_response_cache": """INSERT INTO LIGHTRAG_LLM_CACHE(workspace,id,original_prompt,return_value,mode,chunk_id)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (workspace,mode,id) DO UPDATE
SET original_prompt = EXCLUDED.original_prompt,
return_value=EXCLUDED.return_value,
mode=EXCLUDED.mode,
chunk_id=EXCLUDED.chunk_id,
update_time = CURRENT_TIMESTAMP
""",
"upsert_chunk": """INSERT INTO LIGHTRAG_DOC_CHUNKS (workspace, id, tokens,

View file

@ -35,6 +35,7 @@ from lightrag.kg import (
from lightrag.kg.shared_storage import (
get_namespace_data,
get_pipeline_status_lock,
get_graph_db_lock,
)
from .base import (
@ -47,6 +48,7 @@ from .base import (
QueryParam,
StorageNameSpace,
StoragesStatus,
DeletionResult,
)
from .namespace import NameSpace, make_namespace
from .operate import (
@ -56,6 +58,7 @@ from .operate import (
kg_query,
naive_query,
query_with_keywords,
_rebuild_knowledge_from_chunks,
)
from .prompt import GRAPH_FIELD_SEP
from .utils import (
@ -1207,6 +1210,7 @@ class LightRAG:
cast(StorageNameSpace, storage_inst).index_done_callback()
for storage_inst in [ # type: ignore
self.full_docs,
self.doc_status,
self.text_chunks,
self.llm_response_cache,
self.entities_vdb,
@ -1674,24 +1678,45 @@ class LightRAG:
# Return the dictionary containing statuses only for the found document IDs
return found_statuses
# TODO: Deprecated (Deleting documents can cause hallucinations in RAG.)
# Document delete is not working properly for most of the storage implementations.
async def adelete_by_doc_id(self, doc_id: str) -> None:
"""Delete a document and all its related data
async def adelete_by_doc_id(self, doc_id: str) -> DeletionResult:
"""Delete a document and all its related data, including chunks, graph elements, and cached entries.
This method orchestrates a comprehensive deletion process for a given document ID.
It ensures that not only the document itself but also all its derived and associated
data across different storage layers are removed. This includes:
1. **Document and Status**: Deletes the document from `full_docs` and its status from `doc_status`.
2. **Chunks**: Removes all associated text chunks from `chunks_vdb`.
3. **Graph Data**:
- Deletes related entities from `entities_vdb`.
- Deletes related relationships from `relationships_vdb`.
- Removes corresponding nodes and edges from the `chunk_entity_relation_graph`.
4. **Graph Reconstruction**: If entities or relationships are partially affected, it triggers
a reconstruction of their data from the remaining chunks to ensure consistency.
Args:
doc_id: Document ID to delete
doc_id (str): The unique identifier of the document to be deleted.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
- `status` (str): "success", "not_found", or "fail".
- `doc_id` (str): The ID of the document attempted to be deleted.
- `message` (str): A summary of the operation's result.
- `status_code` (int): HTTP status code (e.g., 200, 404, 500).
"""
try:
# 1. Get the document status and related data
if not await self.doc_status.get_by_id(doc_id):
logger.warning(f"Document {doc_id} not found")
return
return DeletionResult(
status="not_found",
doc_id=doc_id,
message=f"Document {doc_id} not found.",
status_code=404,
)
logger.debug(f"Starting deletion for document {doc_id}")
logger.info(f"Starting optimized deletion for document {doc_id}")
# 2. Get all chunks related to this document
# Find all chunks where full_doc_id equals the current doc_id
all_chunks = await self.text_chunks.get_all()
related_chunks = {
chunk_id: chunk_data
@ -1702,241 +1727,197 @@ class LightRAG:
if not related_chunks:
logger.warning(f"No chunks found for document {doc_id}")
return
# Still need to delete the doc status and full doc
await self.full_docs.delete([doc_id])
await self.doc_status.delete([doc_id])
return DeletionResult(
status="success",
doc_id=doc_id,
message=f"Document {doc_id} found but had no associated chunks. Document entry deleted.",
status_code=200,
)
# Get all related chunk IDs
chunk_ids = set(related_chunks.keys())
logger.debug(f"Found {len(chunk_ids)} chunks to delete")
logger.info(f"Found {len(chunk_ids)} chunks to delete")
# TODO: self.entities_vdb.client_storage only works for local storage, need to fix this
# # 3. **OPTIMIZATION 1**: Clear LLM cache for related chunks
# logger.info("Clearing LLM cache for related chunks...")
# cache_cleared = await self.llm_response_cache.drop_cache_by_chunk_ids(
# list(chunk_ids)
# )
# if cache_cleared:
# logger.info(f"Successfully cleared cache for {len(chunk_ids)} chunks")
# else:
# logger.warning(
# "Failed to clear chunk cache or cache clearing not supported"
# )
# 3. Before deleting, check the related entities and relationships for these chunks
for chunk_id in chunk_ids:
# Check entities
entities_storage = await self.entities_vdb.client_storage
entities = [
dp
for dp in entities_storage["data"]
if chunk_id in dp.get("source_id")
]
logger.debug(f"Chunk {chunk_id} has {len(entities)} related entities")
# Check relationships
relationships_storage = await self.relationships_vdb.client_storage
relations = [
dp
for dp in relationships_storage["data"]
if chunk_id in dp.get("source_id")
]
logger.debug(f"Chunk {chunk_id} has {len(relations)} related relations")
# Continue with the original deletion process...
# 4. Delete chunks from vector database
if chunk_ids:
await self.chunks_vdb.delete(chunk_ids)
await self.text_chunks.delete(chunk_ids)
# 5. Find and process entities and relationships that have these chunks as source
# Get all nodes and edges from the graph storage using storage-agnostic methods
# 4. Analyze entities and relationships that will be affected
entities_to_delete = set()
entities_to_update = {} # entity_name -> new_source_id
entities_to_rebuild = {} # entity_name -> remaining_chunk_ids
relationships_to_delete = set()
relationships_to_update = {} # (src, tgt) -> new_source_id
relationships_to_rebuild = {} # (src, tgt) -> remaining_chunk_ids
# Process entities - use storage-agnostic methods
all_labels = await self.chunk_entity_relation_graph.get_all_labels()
for node_label in all_labels:
node_data = await self.chunk_entity_relation_graph.get_node(node_label)
if node_data and "source_id" in node_data:
# Split source_id using GRAPH_FIELD_SEP
sources = set(node_data["source_id"].split(GRAPH_FIELD_SEP))
sources.difference_update(chunk_ids)
if not sources:
entities_to_delete.add(node_label)
logger.debug(
f"Entity {node_label} marked for deletion - no remaining sources"
)
else:
new_source_id = GRAPH_FIELD_SEP.join(sources)
entities_to_update[node_label] = new_source_id
logger.debug(
f"Entity {node_label} will be updated with new source_id: {new_source_id}"
)
# Use graph database lock to ensure atomic merges and updates
graph_db_lock = get_graph_db_lock(enable_logging=False)
async with graph_db_lock:
# Process entities
# TODO: There is a performance issue when iterating get_all_labels for PostgreSQL
all_labels = await self.chunk_entity_relation_graph.get_all_labels()
for node_label in all_labels:
node_data = await self.chunk_entity_relation_graph.get_node(
node_label
)
if node_data and "source_id" in node_data:
# Split source_id using GRAPH_FIELD_SEP
sources = set(node_data["source_id"].split(GRAPH_FIELD_SEP))
remaining_sources = sources - chunk_ids
# Process relationships
for node_label in all_labels:
node_edges = await self.chunk_entity_relation_graph.get_node_edges(
node_label
)
if node_edges:
for src, tgt in node_edges:
edge_data = await self.chunk_entity_relation_graph.get_edge(
src, tgt
)
if edge_data and "source_id" in edge_data:
# Split source_id using GRAPH_FIELD_SEP
sources = set(edge_data["source_id"].split(GRAPH_FIELD_SEP))
sources.difference_update(chunk_ids)
if not sources:
relationships_to_delete.add((src, tgt))
logger.debug(
f"Relationship {src}-{tgt} marked for deletion - no remaining sources"
)
else:
new_source_id = GRAPH_FIELD_SEP.join(sources)
relationships_to_update[(src, tgt)] = new_source_id
logger.debug(
f"Relationship {src}-{tgt} will be updated with new source_id: {new_source_id}"
if not remaining_sources:
entities_to_delete.add(node_label)
logger.debug(
f"Entity {node_label} marked for deletion - no remaining sources"
)
elif remaining_sources != sources:
# Entity needs to be rebuilt from remaining chunks
entities_to_rebuild[node_label] = remaining_sources
logger.debug(
f"Entity {node_label} will be rebuilt from {len(remaining_sources)} remaining chunks"
)
# Process relationships
# TODO: There is a performance issue when iterating get_all_labels for PostgreSQL
for node_label in all_labels:
node_edges = await self.chunk_entity_relation_graph.get_node_edges(
node_label
)
if node_edges:
for src, tgt in node_edges:
# To avoid processing the same edge twice in an undirected graph
if (tgt, src) in relationships_to_delete or (
tgt,
src,
) in relationships_to_rebuild:
continue
edge_data = await self.chunk_entity_relation_graph.get_edge(
src, tgt
)
if edge_data and "source_id" in edge_data:
# Split source_id using GRAPH_FIELD_SEP
sources = set(
edge_data["source_id"].split(GRAPH_FIELD_SEP)
)
remaining_sources = sources - chunk_ids
# Delete entities
if entities_to_delete:
for entity in entities_to_delete:
await self.entities_vdb.delete_entity(entity)
logger.debug(f"Deleted entity {entity} from vector DB")
await self.chunk_entity_relation_graph.remove_nodes(
list(entities_to_delete)
)
logger.debug(f"Deleted {len(entities_to_delete)} entities from graph")
if not remaining_sources:
relationships_to_delete.add((src, tgt))
logger.debug(
f"Relationship {src}-{tgt} marked for deletion - no remaining sources"
)
elif remaining_sources != sources:
# Relationship needs to be rebuilt from remaining chunks
relationships_to_rebuild[(src, tgt)] = (
remaining_sources
)
logger.debug(
f"Relationship {src}-{tgt} will be rebuilt from {len(remaining_sources)} remaining chunks"
)
# Update entities
for entity, new_source_id in entities_to_update.items():
node_data = await self.chunk_entity_relation_graph.get_node(entity)
if node_data:
node_data["source_id"] = new_source_id
await self.chunk_entity_relation_graph.upsert_node(
entity, node_data
# 5. Delete chunks from storage
if chunk_ids:
await self.chunks_vdb.delete(chunk_ids)
await self.text_chunks.delete(chunk_ids)
logger.info(f"Deleted {len(chunk_ids)} chunks from storage")
# 6. Delete entities that have no remaining sources
if entities_to_delete:
# Delete from vector database
entity_vdb_ids = [
compute_mdhash_id(entity, prefix="ent-")
for entity in entities_to_delete
]
await self.entities_vdb.delete(entity_vdb_ids)
# Delete from graph
await self.chunk_entity_relation_graph.remove_nodes(
list(entities_to_delete)
)
logger.debug(
f"Updated entity {entity} with new source_id: {new_source_id}"
logger.info(f"Deleted {len(entities_to_delete)} entities")
# 7. Delete relationships that have no remaining sources
if relationships_to_delete:
# Delete from vector database
rel_ids_to_delete = []
for src, tgt in relationships_to_delete:
rel_ids_to_delete.extend(
[
compute_mdhash_id(src + tgt, prefix="rel-"),
compute_mdhash_id(tgt + src, prefix="rel-"),
]
)
await self.relationships_vdb.delete(rel_ids_to_delete)
# Delete from graph
await self.chunk_entity_relation_graph.remove_edges(
list(relationships_to_delete)
)
logger.info(f"Deleted {len(relationships_to_delete)} relationships")
# 8. **OPTIMIZATION 2**: Rebuild entities and relationships from remaining chunks
if entities_to_rebuild or relationships_to_rebuild:
logger.info(
f"Rebuilding {len(entities_to_rebuild)} entities and {len(relationships_to_rebuild)} relationships..."
)
await _rebuild_knowledge_from_chunks(
entities_to_rebuild=entities_to_rebuild,
relationships_to_rebuild=relationships_to_rebuild,
knowledge_graph_inst=self.chunk_entity_relation_graph,
entities_vdb=self.entities_vdb,
relationships_vdb=self.relationships_vdb,
text_chunks=self.text_chunks,
llm_response_cache=self.llm_response_cache,
global_config=asdict(self),
)
# Delete relationships
if relationships_to_delete:
for src, tgt in relationships_to_delete:
rel_id_0 = compute_mdhash_id(src + tgt, prefix="rel-")
rel_id_1 = compute_mdhash_id(tgt + src, prefix="rel-")
await self.relationships_vdb.delete([rel_id_0, rel_id_1])
logger.debug(f"Deleted relationship {src}-{tgt} from vector DB")
await self.chunk_entity_relation_graph.remove_edges(
list(relationships_to_delete)
)
logger.debug(
f"Deleted {len(relationships_to_delete)} relationships from graph"
)
# Update relationships
for (src, tgt), new_source_id in relationships_to_update.items():
edge_data = await self.chunk_entity_relation_graph.get_edge(src, tgt)
if edge_data:
edge_data["source_id"] = new_source_id
await self.chunk_entity_relation_graph.upsert_edge(
src, tgt, edge_data
)
logger.debug(
f"Updated relationship {src}-{tgt} with new source_id: {new_source_id}"
)
# 6. Delete original document and status
# 9. Delete original document and status
await self.full_docs.delete([doc_id])
await self.doc_status.delete([doc_id])
# 7. Ensure all indexes are updated
# 10. Ensure all indexes are updated
await self._insert_done()
logger.info(
f"Successfully deleted document {doc_id} and related data. "
f"Deleted {len(entities_to_delete)} entities and {len(relationships_to_delete)} relationships. "
f"Updated {len(entities_to_update)} entities and {len(relationships_to_update)} relationships."
success_message = f"""Successfully deleted document {doc_id}.
Deleted: {len(entities_to_delete)} entities, {len(relationships_to_delete)} relationships.
Rebuilt: {len(entities_to_rebuild)} entities, {len(relationships_to_rebuild)} relationships."""
logger.info(success_message)
return DeletionResult(
status="success",
doc_id=doc_id,
message=success_message,
status_code=200,
)
async def process_data(data_type, vdb, chunk_id):
# Check data (entities or relationships)
storage = await vdb.client_storage
data_with_chunk = [
dp
for dp in storage["data"]
if chunk_id in (dp.get("source_id") or "").split(GRAPH_FIELD_SEP)
]
data_for_vdb = {}
if data_with_chunk:
logger.warning(
f"found {len(data_with_chunk)} {data_type} still referencing chunk {chunk_id}"
)
for item in data_with_chunk:
old_sources = item["source_id"].split(GRAPH_FIELD_SEP)
new_sources = [src for src in old_sources if src != chunk_id]
if not new_sources:
logger.info(
f"{data_type} {item.get('entity_name', 'N/A')} is deleted because source_id is not exists"
)
await vdb.delete_entity(item)
else:
item["source_id"] = GRAPH_FIELD_SEP.join(new_sources)
item_id = item["__id__"]
data_for_vdb[item_id] = item.copy()
if data_type == "entities":
data_for_vdb[item_id]["content"] = data_for_vdb[
item_id
].get("content") or (
item.get("entity_name", "")
+ (item.get("description") or "")
)
else: # relationships
data_for_vdb[item_id]["content"] = data_for_vdb[
item_id
].get("content") or (
(item.get("keywords") or "")
+ (item.get("src_id") or "")
+ (item.get("tgt_id") or "")
+ (item.get("description") or "")
)
if data_for_vdb:
await vdb.upsert(data_for_vdb)
logger.info(f"Successfully updated {data_type} in vector DB")
# Add verification step
async def verify_deletion():
# Verify if the document has been deleted
if await self.full_docs.get_by_id(doc_id):
logger.warning(f"Document {doc_id} still exists in full_docs")
# Verify if chunks have been deleted
all_remaining_chunks = await self.text_chunks.get_all()
remaining_related_chunks = {
chunk_id: chunk_data
for chunk_id, chunk_data in all_remaining_chunks.items()
if isinstance(chunk_data, dict)
and chunk_data.get("full_doc_id") == doc_id
}
if remaining_related_chunks:
logger.warning(
f"Found {len(remaining_related_chunks)} remaining chunks"
)
# Verify entities and relationships
for chunk_id in chunk_ids:
await process_data("entities", self.entities_vdb, chunk_id)
await process_data(
"relationships", self.relationships_vdb, chunk_id
)
await verify_deletion()
except Exception as e:
logger.error(f"Error while deleting document {doc_id}: {e}")
error_message = f"Error while deleting document {doc_id}: {e}"
logger.error(error_message)
logger.error(traceback.format_exc())
return DeletionResult(
status="fail",
doc_id=doc_id,
message=error_message,
status_code=500,
)
async def adelete_by_entity(self, entity_name: str) -> None:
async def adelete_by_entity(self, entity_name: str) -> DeletionResult:
"""Asynchronously delete an entity and all its relationships.
Args:
entity_name: Name of the entity to delete
entity_name: Name of the entity to delete.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
"""
from .utils_graph import adelete_by_entity
@ -1947,16 +1928,29 @@ class LightRAG:
entity_name,
)
def delete_by_entity(self, entity_name: str) -> None:
def delete_by_entity(self, entity_name: str) -> DeletionResult:
"""Synchronously delete an entity and all its relationships.
Args:
entity_name: Name of the entity to delete.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
"""
loop = always_get_an_event_loop()
return loop.run_until_complete(self.adelete_by_entity(entity_name))
async def adelete_by_relation(self, source_entity: str, target_entity: str) -> None:
async def adelete_by_relation(
self, source_entity: str, target_entity: str
) -> DeletionResult:
"""Asynchronously delete a relation between two entities.
Args:
source_entity: Name of the source entity
target_entity: Name of the target entity
source_entity: Name of the source entity.
target_entity: Name of the target entity.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
"""
from .utils_graph import adelete_by_relation
@ -1967,7 +1961,18 @@ class LightRAG:
target_entity,
)
def delete_by_relation(self, source_entity: str, target_entity: str) -> None:
def delete_by_relation(
self, source_entity: str, target_entity: str
) -> DeletionResult:
"""Synchronously delete a relation between two entities.
Args:
source_entity: Name of the source entity.
target_entity: Name of the target entity.
Returns:
DeletionResult: An object containing the outcome of the deletion process.
"""
loop = always_get_an_event_loop()
return loop.run_until_complete(
self.adelete_by_relation(source_entity, target_entity)
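
The entity and relationship analysis in `adelete_by_doc_id` above reduces to a set difference over chunk IDs. A minimal sketch of that classification, assuming `GRAPH_FIELD_SEP` is the string `"<SEP>"` (the diff shows only the constant's name) and using a hypothetical helper `classify_by_remaining_sources` that is not part of LightRAG:

```python
# Assumed separator value; the diff only references the constant by name.
GRAPH_FIELD_SEP = "<SEP>"


def classify_by_remaining_sources(source_ids, chunk_ids):
    """Split graph items into those to delete and those to rebuild.

    source_ids: mapping of item name -> source_id string (chunk IDs joined
                by GRAPH_FIELD_SEP), as stored on a node or edge.
    chunk_ids:  set of chunk IDs that belong to the document being deleted.
    """
    to_delete = set()
    to_rebuild = {}
    for name, source_id in source_ids.items():
        sources = set(source_id.split(GRAPH_FIELD_SEP))
        remaining = sources - chunk_ids
        if not remaining:
            # Every source chunk is going away: the item has no support left.
            to_delete.add(name)
        elif remaining != sources:
            # Some, but not all, sources are going away: rebuild from the rest.
            to_rebuild[name] = remaining
        # Otherwise the item is unaffected and is left alone.
    return to_delete, to_rebuild


to_delete, to_rebuild = classify_by_remaining_sources(
    {"A": "c1", "B": "c1<SEP>c2", "C": "c3"},
    {"c1"},
)
```

Items whose sources are untouched appear in neither result, which mirrors the `elif remaining_sources != sources` branch in the new code.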

View file

@ -50,7 +50,7 @@ async def _ollama_model_if_cache(
kwargs.pop("max_tokens", None)
# kwargs.pop("response_format", None) # allow json
host = kwargs.pop("host", None)
timeout = kwargs.pop("timeout", None) or 300 # Default timeout 300s
timeout = kwargs.pop("timeout", None) or 600 # Default timeout 600s
kwargs.pop("hashing_kv", None)
api_key = kwargs.pop("api_key", None)
headers = {
@ -146,7 +146,7 @@ async def ollama_embed(texts: list[str], embed_model, **kwargs) -> np.ndarray:
headers["Authorization"] = f"Bearer {api_key}"
host = kwargs.pop("host", None)
timeout = kwargs.pop("timeout", None) or 90 # Default timeout 90s
timeout = kwargs.pop("timeout", None) or 300 # Default timeout 300s
ollama_client = ollama.AsyncClient(host=host, timeout=timeout, headers=headers)
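
Both hunks in this file rely on `kwargs.pop("timeout", None) or <default>`. A small sketch of how that expression behaves, using a hypothetical helper `resolve_timeout` that is not part of LightRAG. Note that `or` replaces any falsy value, so an explicit `0` falls back to the default just as `None` does:

```python
def resolve_timeout(kwargs, default):
    """Pop 'timeout' from kwargs, falling back to default when it is missing or falsy."""
    return kwargs.pop("timeout", None) or default


explicit = {"timeout": 30, "host": "example"}
explicit_result = resolve_timeout(explicit, 600)   # caller-supplied value wins

missing_result = resolve_timeout({}, 600)          # key absent -> default
none_result = resolve_timeout({"timeout": None}, 600)  # None -> default
zero_result = resolve_timeout({"timeout": 0}, 600)     # 0 is falsy -> default
```

If a caller ever needs to pass `0` deliberately, an explicit `is None` check would be required instead of `or`.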

View file

@ -1,513 +0,0 @@
# type: ignore
"""
MinerU Document Parser Utility
This module provides functionality for parsing PDF, image, and office documents using the MinerU library
and converting the parsing results into Markdown and JSON formats.
"""
from __future__ import annotations
__all__ = ["MineruParser"]
import os
import json
import argparse
from pathlib import Path
from typing import (
Dict,
List,
Optional,
Union,
Tuple,
Any,
TypeVar,
cast,
TYPE_CHECKING,
ClassVar,
)
# Type stubs for magic_pdf
FileBasedDataWriter = Any
FileBasedDataReader = Any
PymuDocDataset = Any
InferResult = Any
PipeResult = Any
SupportedPdfParseMethod = Any
doc_analyze = Any
read_local_office = Any
read_local_images = Any
if TYPE_CHECKING:
from magic_pdf.data.data_reader_writer import (
FileBasedDataWriter,
FileBasedDataReader,
)
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
from magic_pdf.data.read_api import read_local_office, read_local_images
else:
# MinerU imports
from magic_pdf.data.data_reader_writer import (
FileBasedDataWriter,
FileBasedDataReader,
)
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
from magic_pdf.data.read_api import read_local_office, read_local_images
T = TypeVar("T")
class MineruParser:
"""
MinerU document parsing utility class
Supports parsing PDF, image and office documents (like Word, PPT, etc.),
converting the content into structured data and generating markdown and JSON output
"""
__slots__: ClassVar[Tuple[str, ...]] = ()
def __init__(self) -> None:
"""Initialize MineruParser"""
pass
@staticmethod
def safe_write(
writer: Any,
content: Union[str, bytes, Dict[str, Any], List[Any]],
filename: str,
) -> None:
"""
Safely write content to a file, ensuring the filename is valid
Args:
writer: The writer object to use
content: The content to write
filename: The filename to write to
"""
# Ensure the filename isn't too long
if len(filename) > 200: # Most filesystems have limits around 255 characters
# Truncate the filename while keeping the extension
base, ext = os.path.splitext(filename)
filename = base[:190] + ext # Leave room for the extension and some margin
# Handle specific content types
if isinstance(content, str):
# Ensure str content is encoded to bytes if required
try:
writer.write(content, filename)
except TypeError:
# If the writer expects bytes, convert string to bytes
writer.write(content.encode("utf-8"), filename)
else:
# For dict/list content, always encode as JSON string first
if isinstance(content, (dict, list)):
try:
writer.write(
json.dumps(content, ensure_ascii=False, indent=4), filename
)
except TypeError:
# If the writer expects bytes, convert JSON string to bytes
writer.write(
json.dumps(content, ensure_ascii=False, indent=4).encode(
"utf-8"
),
filename,
)
else:
# Regular content (assumed to be bytes or compatible)
writer.write(content, filename)
@staticmethod
def parse_pdf(
pdf_path: Union[str, Path],
output_dir: Optional[str] = None,
use_ocr: bool = False,
) -> Tuple[List[Dict[str, Any]], str]:
"""
Parse PDF document
Args:
pdf_path: Path to the PDF file
output_dir: Output directory path
use_ocr: Whether to force OCR parsing
Returns:
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
"""
try:
# Convert to Path object for easier handling
pdf_path = Path(pdf_path)
name_without_suff = pdf_path.stem
# Prepare output directories - ensure file name is in path
if output_dir:
base_output_dir = Path(output_dir)
local_md_dir = base_output_dir / name_without_suff
else:
local_md_dir = pdf_path.parent / name_without_suff
local_image_dir = local_md_dir / "images"
image_dir = local_image_dir.name
# Create directories
os.makedirs(local_image_dir, exist_ok=True)
os.makedirs(local_md_dir, exist_ok=True)
# Initialize writers and reader
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
reader = FileBasedDataReader("") # type: ignore
# Read PDF bytes
pdf_bytes = reader.read(str(pdf_path)) # type: ignore
# Create dataset instance
ds = PymuDocDataset(pdf_bytes) # type: ignore
# Process based on PDF type and user preference
if use_ocr or ds.classify() == SupportedPdfParseMethod.OCR: # type: ignore
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
pipe_result = infer_result.pipe_ocr_mode(image_writer) # type: ignore
else:
infer_result = ds.apply(doc_analyze, ocr=False) # type: ignore
pipe_result = infer_result.pipe_txt_mode(image_writer) # type: ignore
# Draw visualizations
try:
infer_result.draw_model(
os.path.join(local_md_dir, f"{name_without_suff}_model.pdf")
) # type: ignore
pipe_result.draw_layout(
os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf")
) # type: ignore
pipe_result.draw_span(
os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf")
) # type: ignore
except Exception as e:
print(f"Warning: Failed to draw visualizations: {str(e)}")
# Get data using API methods
md_content = pipe_result.get_markdown(image_dir) # type: ignore
content_list = pipe_result.get_content_list(image_dir) # type: ignore
# Save files using dump methods (consistent with API)
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir) # type: ignore
pipe_result.dump_content_list(
md_writer, f"{name_without_suff}_content_list.json", image_dir
) # type: ignore
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
# Save model result - convert JSON string to bytes before writing
model_inference_result = infer_result.get_infer_res() # type: ignore
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
try:
# Try to write to a file manually to avoid FileBasedDataWriter issues
model_file_path = os.path.join(
local_md_dir, f"{name_without_suff}_model.json"
)
with open(model_file_path, "w", encoding="utf-8") as f:
f.write(json_str)
except Exception as e:
print(
f"Warning: Failed to save model result using file write: {str(e)}"
)
try:
# If direct file write fails, try using the writer with bytes encoding
md_writer.write(
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
) # type: ignore
except Exception as e2:
print(
f"Warning: Failed to save model result using writer: {str(e2)}"
)
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
except Exception as e:
print(f"Error in parse_pdf: {str(e)}")
raise
@staticmethod
def parse_office_doc(
doc_path: Union[str, Path], output_dir: Optional[str] = None
) -> Tuple[List[Dict[str, Any]], str]:
"""
Parse office document (Word, PPT, etc.)
Args:
doc_path: Path to the document file
output_dir: Output directory path
Returns:
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
"""
try:
# Convert to Path object for easier handling
doc_path = Path(doc_path)
name_without_suff = doc_path.stem
# Prepare output directories - ensure file name is in path
if output_dir:
base_output_dir = Path(output_dir)
local_md_dir = base_output_dir / name_without_suff
else:
local_md_dir = doc_path.parent / name_without_suff
local_image_dir = local_md_dir / "images"
image_dir = local_image_dir.name
# Create directories
os.makedirs(local_image_dir, exist_ok=True)
os.makedirs(local_md_dir, exist_ok=True)
# Initialize writers
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
# Read office document
ds = read_local_office(str(doc_path))[0] # type: ignore
# Apply chain of operations according to API documentation
# This follows the pattern shown in MS-Office example in the API docs
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{name_without_suff}.md", image_dir
) # type: ignore
# Re-execute for getting the content data
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
pipe_result = infer_result.pipe_txt_mode(image_writer) # type: ignore
# Get data for return values and additional outputs
md_content = pipe_result.get_markdown(image_dir) # type: ignore
content_list = pipe_result.get_content_list(image_dir) # type: ignore
# Save additional output files
pipe_result.dump_content_list(
md_writer, f"{name_without_suff}_content_list.json", image_dir
) # type: ignore
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
# Save model result - convert JSON string to bytes before writing
model_inference_result = infer_result.get_infer_res() # type: ignore
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
try:
# Try to write to a file manually to avoid FileBasedDataWriter issues
model_file_path = os.path.join(
local_md_dir, f"{name_without_suff}_model.json"
)
with open(model_file_path, "w", encoding="utf-8") as f:
f.write(json_str)
except Exception as e:
print(
f"Warning: Failed to save model result using file write: {str(e)}"
)
try:
# If direct file write fails, try using the writer with bytes encoding
md_writer.write(
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
) # type: ignore
except Exception as e2:
print(
f"Warning: Failed to save model result using writer: {str(e2)}"
)
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
except Exception as e:
print(f"Error in parse_office_doc: {str(e)}")
raise
@staticmethod
def parse_image(
image_path: Union[str, Path], output_dir: Optional[str] = None
) -> Tuple[List[Dict[str, Any]], str]:
"""
Parse image document
Args:
image_path: Path to the image file
output_dir: Output directory path
Returns:
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
"""
try:
# Convert to Path object for easier handling
image_path = Path(image_path)
name_without_suff = image_path.stem
# Prepare output directories - ensure file name is in path
if output_dir:
base_output_dir = Path(output_dir)
local_md_dir = base_output_dir / name_without_suff
else:
local_md_dir = image_path.parent / name_without_suff
local_image_dir = local_md_dir / "images"
image_dir = local_image_dir.name
# Create directories
os.makedirs(local_image_dir, exist_ok=True)
os.makedirs(local_md_dir, exist_ok=True)
# Initialize writers
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
# Read image
ds = read_local_images(str(image_path))[0] # type: ignore
# Apply chain of operations according to API documentation
# This follows the pattern shown in Image example in the API docs
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{name_without_suff}.md", image_dir
) # type: ignore
# Re-execute for getting the content data
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
pipe_result = infer_result.pipe_ocr_mode(image_writer) # type: ignore
# Get data for return values and additional outputs
md_content = pipe_result.get_markdown(image_dir) # type: ignore
content_list = pipe_result.get_content_list(image_dir) # type: ignore
# Save additional output files
pipe_result.dump_content_list(
md_writer, f"{name_without_suff}_content_list.json", image_dir
) # type: ignore
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
# Save model result - convert JSON string to bytes before writing
model_inference_result = infer_result.get_infer_res() # type: ignore
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
try:
# Try to write to a file manually to avoid FileBasedDataWriter issues
model_file_path = os.path.join(
local_md_dir, f"{name_without_suff}_model.json"
)
with open(model_file_path, "w", encoding="utf-8") as f:
f.write(json_str)
except Exception as e:
print(
f"Warning: Failed to save model result using file write: {str(e)}"
)
try:
# If direct file write fails, try using the writer with bytes encoding
md_writer.write(
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
) # type: ignore
except Exception as e2:
print(
f"Warning: Failed to save model result using writer: {str(e2)}"
)
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
except Exception as e:
print(f"Error in parse_image: {str(e)}")
raise
@staticmethod
def parse_document(
file_path: Union[str, Path],
parse_method: str = "auto",
output_dir: Optional[str] = None,
save_results: bool = True,
) -> Tuple[List[Dict[str, Any]], str]:
"""
Parse document using MinerU based on file extension
Args:
file_path: Path to the file to be parsed
parse_method: Parsing method, supports "auto", "ocr", "txt", default is "auto"
output_dir: Output directory path, if None, use the directory of the input file
save_results: Whether to save parsing results to files
Returns:
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
"""
# Convert to Path object
file_path = Path(file_path)
if not file_path.exists():
raise FileNotFoundError(f"File does not exist: {file_path}")
# Get file extension
ext = file_path.suffix.lower()
# Choose appropriate parser based on file type
if ext in [".pdf"]:
return MineruParser.parse_pdf(
file_path, output_dir, use_ocr=(parse_method == "ocr")
)
elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
return MineruParser.parse_image(file_path, output_dir)
elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
return MineruParser.parse_office_doc(file_path, output_dir)
else:
# For unsupported file types, default to PDF parsing
print(
f"Warning: Unsupported file extension '{ext}', trying generic PDF parser"
)
return MineruParser.parse_pdf(
file_path, output_dir, use_ocr=(parse_method == "ocr")
)
def main():
"""
Main function to run the MinerU parser from command line
"""
parser = argparse.ArgumentParser(description="Parse documents using MinerU")
parser.add_argument("file_path", help="Path to the document to parse")
parser.add_argument("--output", "-o", help="Output directory path")
parser.add_argument(
"--method",
"-m",
choices=["auto", "ocr", "txt"],
default="auto",
help="Parsing method (auto, ocr, txt)",
)
parser.add_argument(
"--stats", action="store_true", help="Display content statistics"
)
args = parser.parse_args()
try:
# Parse the document
content_list, md_content = MineruParser.parse_document(
file_path=args.file_path, parse_method=args.method, output_dir=args.output
)
# Display statistics if requested
if args.stats:
print("\nDocument Statistics:")
print(f"Total content blocks: {len(content_list)}")
# Count different types of content
content_types = {}
for item in content_list:
content_type = item.get("type", "unknown")
content_types[content_type] = content_types.get(content_type, 0) + 1
print("\nContent Type Distribution:")
for content_type, count in content_types.items():
print(f"- {content_type}: {count}")
except Exception as e:
print(f"Error: {str(e)}")
return 1
return 0
if __name__ == "__main__":
exit(main())
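
The extension-based routing in `parse_document` from the removed module can be sketched on its own, without MinerU installed. The helper name `pick_parser` and the string labels it returns are illustrative assumptions; the extension lists are copied from the code above:

```python
from pathlib import Path

# Extension groups as listed in parse_document.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def pick_parser(file_path):
    """Return a label for the parser that parse_document would route to."""
    ext = Path(file_path).suffix.lower()  # case-insensitive match on the suffix
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in OFFICE_EXTS:
        return "office"
    # Unsupported extensions fall back to the PDF parser, as in the original.
    return "pdf"
```

For example, `pick_parser("slides.pptx")` selects the office path, while an unrecognized file such as `notes.txt` falls through to the PDF parser.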

View file

@ -1,699 +0,0 @@
"""
Specialized processors for different modalities
Includes:
- ImageModalProcessor: Specialized processor for image content
- TableModalProcessor: Specialized processor for table content
- EquationModalProcessor: Specialized processor for equation content
- GenericModalProcessor: Processor for other modal content
"""
import re
import json
import time
import asyncio
import base64
from typing import Dict, Any, Tuple, cast
from pathlib import Path
from lightrag.base import StorageNameSpace
from lightrag.utils import (
logger,
compute_mdhash_id,
)
from lightrag.lightrag import LightRAG
from dataclasses import asdict
from lightrag.kg.shared_storage import get_namespace_data, get_pipeline_status_lock
class BaseModalProcessor:
"""Base class for modal processors"""
def __init__(self, lightrag: LightRAG, modal_caption_func):
"""Initialize base processor
Args:
lightrag: LightRAG instance
modal_caption_func: Function for generating descriptions
"""
self.lightrag = lightrag
self.modal_caption_func = modal_caption_func
# Use LightRAG's storage instances
self.text_chunks_db = lightrag.text_chunks
self.chunks_vdb = lightrag.chunks_vdb
self.entities_vdb = lightrag.entities_vdb
self.relationships_vdb = lightrag.relationships_vdb
self.knowledge_graph_inst = lightrag.chunk_entity_relation_graph
# Use LightRAG's configuration and functions
self.embedding_func = lightrag.embedding_func
self.llm_model_func = lightrag.llm_model_func
self.global_config = asdict(lightrag)
self.hashing_kv = lightrag.llm_response_cache
self.tokenizer = lightrag.tokenizer
async def process_multimodal_content(
self,
modal_content,
content_type: str,
file_path: str = "manual_creation",
entity_name: str = None,
) -> Tuple[str, Dict[str, Any]]:
"""Process multimodal content"""
# Subclasses need to implement specific processing logic
raise NotImplementedError("Subclasses must implement this method")
async def _create_entity_and_chunk(
self, modal_chunk: str, entity_info: Dict[str, Any], file_path: str
) -> Tuple[str, Dict[str, Any]]:
"""Create entity and text chunk"""
# Create chunk
chunk_id = compute_mdhash_id(str(modal_chunk), prefix="chunk-")
tokens = len(self.tokenizer.encode(modal_chunk))
chunk_data = {
"tokens": tokens,
"content": modal_chunk,
"chunk_order_index": 0,
"full_doc_id": chunk_id,
"file_path": file_path,
}
# Store chunk
await self.text_chunks_db.upsert({chunk_id: chunk_data})
# Create entity node
node_data = {
"entity_id": entity_info["entity_name"],
"entity_type": entity_info["entity_type"],
"description": entity_info["summary"],
"source_id": chunk_id,
"file_path": file_path,
"created_at": int(time.time()),
}
await self.knowledge_graph_inst.upsert_node(
entity_info["entity_name"], node_data
)
# Insert entity into vector database
entity_vdb_data = {
compute_mdhash_id(entity_info["entity_name"], prefix="ent-"): {
"entity_name": entity_info["entity_name"],
"entity_type": entity_info["entity_type"],
"content": f"{entity_info['entity_name']}\n{entity_info['summary']}",
"source_id": chunk_id,
"file_path": file_path,
}
}
await self.entities_vdb.upsert(entity_vdb_data)
# Process entity and relationship extraction
await self._process_chunk_for_extraction(chunk_id, entity_info["entity_name"])
# Ensure all storage updates are complete
await self._insert_done()
return entity_info["summary"], {
"entity_name": entity_info["entity_name"],
"entity_type": entity_info["entity_type"],
"description": entity_info["summary"],
"chunk_id": chunk_id,
}
async def _process_chunk_for_extraction(
self, chunk_id: str, modal_entity_name: str
):
"""Process chunk for entity and relationship extraction"""
chunk_data = await self.text_chunks_db.get_by_id(chunk_id)
if not chunk_data:
logger.error(f"Chunk {chunk_id} not found")
return
# Create text chunk for vector database
chunk_vdb_data = {
chunk_id: {
"content": chunk_data["content"],
"full_doc_id": chunk_id,
"tokens": chunk_data["tokens"],
"chunk_order_index": chunk_data["chunk_order_index"],
"file_path": chunk_data["file_path"],
}
}
await self.chunks_vdb.upsert(chunk_vdb_data)
# Trigger extraction process
from lightrag.operate import extract_entities, merge_nodes_and_edges
pipeline_status = await get_namespace_data("pipeline_status")
pipeline_status_lock = get_pipeline_status_lock()
# Prepare chunk for extraction
chunks = {chunk_id: chunk_data}
# Extract entities and relationships
chunk_results = await extract_entities(
chunks=chunks,
global_config=self.global_config,
pipeline_status=pipeline_status,
pipeline_status_lock=pipeline_status_lock,
llm_response_cache=self.hashing_kv,
)
# Add "belongs_to" relationships for all extracted entities
for maybe_nodes, _ in chunk_results:
for entity_name in maybe_nodes.keys():
if entity_name != modal_entity_name: # Skip self-relationship
# Create belongs_to relationship
relation_data = {
"description": f"Entity {entity_name} belongs to {modal_entity_name}",
"keywords": "belongs_to,part_of,contained_in",
"source_id": chunk_id,
"weight": 10.0,
"file_path": chunk_data.get("file_path", "manual_creation"),
}
await self.knowledge_graph_inst.upsert_edge(
entity_name, modal_entity_name, relation_data
)
relation_id = compute_mdhash_id(
entity_name + modal_entity_name, prefix="rel-"
)
relation_vdb_data = {
relation_id: {
"src_id": entity_name,
"tgt_id": modal_entity_name,
"keywords": relation_data["keywords"],
"content": f"{relation_data['keywords']}\t{entity_name}\n{modal_entity_name}\n{relation_data['description']}",
"source_id": chunk_id,
"file_path": chunk_data.get("file_path", "manual_creation"),
}
}
await self.relationships_vdb.upsert(relation_vdb_data)
await merge_nodes_and_edges(
chunk_results=chunk_results,
knowledge_graph_inst=self.knowledge_graph_inst,
entity_vdb=self.entities_vdb,
relationships_vdb=self.relationships_vdb,
global_config=self.global_config,
pipeline_status=pipeline_status,
pipeline_status_lock=pipeline_status_lock,
llm_response_cache=self.hashing_kv,
)
async def _insert_done(self) -> None:
await asyncio.gather(
*[
cast(StorageNameSpace, storage_inst).index_done_callback()
for storage_inst in [
self.text_chunks_db,
self.chunks_vdb,
self.entities_vdb,
self.relationships_vdb,
self.knowledge_graph_inst,
]
]
)
class ImageModalProcessor(BaseModalProcessor):
"""Processor specialized for image content"""
def __init__(self, lightrag: LightRAG, modal_caption_func):
"""Initialize image processor
Args:
lightrag: LightRAG instance
modal_caption_func: Function for generating descriptions (supporting image understanding)
"""
super().__init__(lightrag, modal_caption_func)
def _encode_image_to_base64(self, image_path: str) -> str:
"""Encode image to base64"""
try:
with open(image_path, "rb") as image_file:
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
return encoded_string
except Exception as e:
logger.error(f"Failed to encode image {image_path}: {e}")
return ""
async def process_multimodal_content(
self,
modal_content,
content_type: str,
file_path: str = "manual_creation",
entity_name: str = None,
) -> Tuple[str, Dict[str, Any]]:
"""Process image content"""
try:
# Parse image content
if isinstance(modal_content, str):
try:
content_data = json.loads(modal_content)
except json.JSONDecodeError:
content_data = {"description": modal_content}
else:
content_data = modal_content
image_path = content_data.get("img_path")
captions = content_data.get("img_caption", [])
footnotes = content_data.get("img_footnote", [])
# Build detailed visual analysis prompt
vision_prompt = f"""Please analyze this image in detail and provide a JSON response with the following structure:
{{
"detailed_description": "A comprehensive and detailed visual description of the image following these guidelines:
- Describe the overall composition and layout
- Identify all objects, people, text, and visual elements
- Explain relationships between elements
- Note colors, lighting, and visual style
- Describe any actions or activities shown
- Include technical details if relevant (charts, diagrams, etc.)
- Always use specific names instead of pronouns",
"entity_info": {{
"entity_name": "{entity_name if entity_name else 'unique descriptive name for this image'}",
"entity_type": "image",
"summary": "concise summary of the image content and its significance (max 100 words)"
}}
}}
Additional context:
- Image Path: {image_path}
- Captions: {captions if captions else 'None'}
- Footnotes: {footnotes if footnotes else 'None'}
Focus on providing accurate, detailed visual analysis that would be useful for knowledge retrieval."""
# If image path exists, try to encode image
image_base64 = ""
if image_path and Path(image_path).exists():
image_base64 = self._encode_image_to_base64(image_path)
# Call vision model
if image_base64:
# Use real image for analysis
response = await self.modal_caption_func(
vision_prompt,
image_data=image_base64,
system_prompt="You are an expert image analyst. Provide detailed, accurate descriptions.",
)
else:
# Analyze based on existing text information
text_prompt = f"""Based on the following image information, provide analysis:
Image Path: {image_path}
Captions: {captions}
Footnotes: {footnotes}
{vision_prompt}"""
response = await self.modal_caption_func(
text_prompt,
system_prompt="You are an expert image analyst. Provide detailed analysis based on available information.",
)
# Parse response
enhanced_caption, entity_info = self._parse_response(response, entity_name)
# Build complete image content
modal_chunk = f"""Image Content Analysis:
Image Path: {image_path}
Captions: {', '.join(captions) if captions else 'None'}
Footnotes: {', '.join(footnotes) if footnotes else 'None'}
Visual Analysis: {enhanced_caption}"""
return await self._create_entity_and_chunk(
modal_chunk, entity_info, file_path
)
except Exception as e:
logger.error(f"Error processing image content: {e}")
# Fallback processing
fallback_entity = {
"entity_name": entity_name
if entity_name
else f"image_{compute_mdhash_id(str(modal_content))}",
"entity_type": "image",
"summary": f"Image content: {str(modal_content)[:100]}",
}
return str(modal_content), fallback_entity
def _parse_response(
self, response: str, entity_name: str = None
) -> Tuple[str, Dict[str, Any]]:
"""Parse model response"""
try:
response_data = json.loads(
re.search(r"\{.*\}", response, re.DOTALL).group(0)
)
description = response_data.get("detailed_description", "")
entity_data = response_data.get("entity_info", {})
if not description or not entity_data:
raise ValueError("Missing required fields in response")
if not all(
key in entity_data for key in ["entity_name", "entity_type", "summary"]
):
raise ValueError("Missing required fields in entity_info")
entity_data["entity_name"] = (
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
)
if entity_name:
entity_data["entity_name"] = entity_name
return description, entity_data
except (json.JSONDecodeError, AttributeError, ValueError) as e:
logger.error(f"Error parsing image analysis response: {e}")
fallback_entity = {
"entity_name": entity_name
if entity_name
else f"image_{compute_mdhash_id(response)}",
"entity_type": "image",
"summary": response[:100] + "..." if len(response) > 100 else response,
}
return response, fallback_entity
class TableModalProcessor(BaseModalProcessor):
"""Processor specialized for table content"""
async def process_multimodal_content(
self,
modal_content,
content_type: str,
file_path: str = "manual_creation",
entity_name: str = None,
) -> Tuple[str, Dict[str, Any]]:
"""Process table content"""
# Parse table content
if isinstance(modal_content, str):
try:
content_data = json.loads(modal_content)
except json.JSONDecodeError:
content_data = {"table_body": modal_content}
else:
content_data = modal_content
table_img_path = content_data.get("img_path")
table_caption = content_data.get("table_caption", [])
table_body = content_data.get("table_body", "")
table_footnote = content_data.get("table_footnote", [])
# Build table analysis prompt
table_prompt = f"""Please analyze this table content and provide a JSON response with the following structure:
{{
"detailed_description": "A comprehensive analysis of the table including:
- Table structure and organization
- Column headers and their meanings
- Key data points and patterns
- Statistical insights and trends
- Relationships between data elements
- Significance of the data presented
Always use specific names and values instead of general references.",
"entity_info": {{
"entity_name": "{entity_name if entity_name else 'descriptive name for this table'}",
"entity_type": "table",
"summary": "concise summary of the table's purpose and key findings (max 100 words)"
}}
}}
Table Information:
Image Path: {table_img_path}
Caption: {table_caption if table_caption else 'None'}
Body: {table_body}
Footnotes: {table_footnote if table_footnote else 'None'}
Focus on extracting meaningful insights and relationships from the tabular data."""
response = await self.modal_caption_func(
table_prompt,
system_prompt="You are an expert data analyst. Provide detailed table analysis with specific insights.",
)
# Parse response
enhanced_caption, entity_info = self._parse_table_response(
response, entity_name
)
# TODO: Add Retry Mechanism
# Build complete table content
modal_chunk = f"""Table Analysis:
Image Path: {table_img_path}
Caption: {', '.join(table_caption) if table_caption else 'None'}
Structure: {table_body}
Footnotes: {', '.join(table_footnote) if table_footnote else 'None'}
Analysis: {enhanced_caption}"""
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
def _parse_table_response(
self, response: str, entity_name: str = None
) -> Tuple[str, Dict[str, Any]]:
"""Parse table analysis response"""
try:
response_data = json.loads(
re.search(r"\{.*\}", response, re.DOTALL).group(0)
)
description = response_data.get("detailed_description", "")
entity_data = response_data.get("entity_info", {})
if not description or not entity_data:
raise ValueError("Missing required fields in response")
if not all(
key in entity_data for key in ["entity_name", "entity_type", "summary"]
):
raise ValueError("Missing required fields in entity_info")
entity_data["entity_name"] = (
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
)
if entity_name:
entity_data["entity_name"] = entity_name
return description, entity_data
except (json.JSONDecodeError, AttributeError, ValueError) as e:
logger.error(f"Error parsing table analysis response: {e}")
fallback_entity = {
"entity_name": entity_name
if entity_name
else f"table_{compute_mdhash_id(response)}",
"entity_type": "table",
"summary": response[:100] + "..." if len(response) > 100 else response,
}
return response, fallback_entity
class EquationModalProcessor(BaseModalProcessor):
"""Processor specialized for equation content"""
async def process_multimodal_content(
self,
modal_content,
content_type: str,
file_path: str = "manual_creation",
entity_name: str = None,
) -> Tuple[str, Dict[str, Any]]:
"""Process equation content"""
# Parse equation content
if isinstance(modal_content, str):
try:
content_data = json.loads(modal_content)
except json.JSONDecodeError:
# Store plain strings under "text", the key that is read back below
content_data = {"text": modal_content}
else:
content_data = modal_content
equation_text = content_data.get("text")
equation_format = content_data.get("text_format", "")
# Build equation analysis prompt
equation_prompt = f"""Please analyze this mathematical equation and provide a JSON response with the following structure:
{{
"detailed_description": "A comprehensive analysis of the equation including:
- Mathematical meaning and interpretation
- Variables and their definitions
- Mathematical operations and functions used
- Application domain and context
- Physical or theoretical significance
- Relationship to other mathematical concepts
- Practical applications or use cases
Always use specific mathematical terminology.",
"entity_info": {{
"entity_name": "{entity_name if entity_name else 'descriptive name for this equation'}",
"entity_type": "equation",
"summary": "concise summary of the equation's purpose and significance (max 100 words)"
}}
}}
Equation Information:
Equation: {equation_text}
Format: {equation_format}
Focus on providing mathematical insights and explaining the equation's significance."""
response = await self.modal_caption_func(
equation_prompt,
system_prompt="You are an expert mathematician. Provide detailed mathematical analysis.",
)
# Parse response
enhanced_caption, entity_info = self._parse_equation_response(
response, entity_name
)
# Build complete equation content
modal_chunk = f"""Mathematical Equation Analysis:
Equation: {equation_text}
Format: {equation_format}
Mathematical Analysis: {enhanced_caption}"""
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
def _parse_equation_response(
self, response: str, entity_name: str = None
) -> Tuple[str, Dict[str, Any]]:
"""Parse equation analysis response"""
try:
response_data = json.loads(
re.search(r"\{.*\}", response, re.DOTALL).group(0)
)
description = response_data.get("detailed_description", "")
entity_data = response_data.get("entity_info", {})
if not description or not entity_data:
raise ValueError("Missing required fields in response")
if not all(
key in entity_data for key in ["entity_name", "entity_type", "summary"]
):
raise ValueError("Missing required fields in entity_info")
entity_data["entity_name"] = (
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
)
if entity_name:
entity_data["entity_name"] = entity_name
return description, entity_data
except (json.JSONDecodeError, AttributeError, ValueError) as e:
logger.error(f"Error parsing equation analysis response: {e}")
fallback_entity = {
"entity_name": entity_name
if entity_name
else f"equation_{compute_mdhash_id(response)}",
"entity_type": "equation",
"summary": response[:100] + "..." if len(response) > 100 else response,
}
return response, fallback_entity
class GenericModalProcessor(BaseModalProcessor):
"""Generic processor for other types of modal content"""
async def process_multimodal_content(
self,
modal_content,
content_type: str,
file_path: str = "manual_creation",
entity_name: str = None,
) -> Tuple[str, Dict[str, Any]]:
"""Process generic modal content"""
# Build generic analysis prompt
generic_prompt = f"""Please analyze this {content_type} content and provide a JSON response with the following structure:
{{
"detailed_description": "A comprehensive analysis of the content including:
- Content structure and organization
- Key information and elements
- Relationships between components
- Context and significance
- Relevant details for knowledge retrieval
Always use specific terminology appropriate for {content_type} content.",
"entity_info": {{
"entity_name": "{entity_name if entity_name else f'descriptive name for this {content_type}'}",
"entity_type": "{content_type}",
"summary": "concise summary of the content's purpose and key points (max 100 words)"
}}
}}
Content: {str(modal_content)}
Focus on extracting meaningful information that would be useful for knowledge retrieval."""
response = await self.modal_caption_func(
generic_prompt,
system_prompt=f"You are an expert content analyst specializing in {content_type} content.",
)
# Parse response
enhanced_caption, entity_info = self._parse_generic_response(
response, entity_name, content_type
)
# Build complete content
modal_chunk = f"""{content_type.title()} Content Analysis:
Content: {str(modal_content)}
Analysis: {enhanced_caption}"""
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
def _parse_generic_response(
self, response: str, entity_name: str = None, content_type: str = "content"
) -> Tuple[str, Dict[str, Any]]:
"""Parse generic analysis response"""
try:
response_data = json.loads(
re.search(r"\{.*\}", response, re.DOTALL).group(0)
)
description = response_data.get("detailed_description", "")
entity_data = response_data.get("entity_info", {})
if not description or not entity_data:
raise ValueError("Missing required fields in response")
if not all(
key in entity_data for key in ["entity_name", "entity_type", "summary"]
):
raise ValueError("Missing required fields in entity_info")
entity_data["entity_name"] = (
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
)
if entity_name:
entity_data["entity_name"] = entity_name
return description, entity_data
except (json.JSONDecodeError, AttributeError, ValueError) as e:
logger.error(f"Error parsing generic analysis response: {e}")
fallback_entity = {
"entity_name": entity_name
if entity_name
else f"{content_type}_{compute_mdhash_id(response)}",
"entity_type": content_type,
"summary": response[:100] + "..." if len(response) > 100 else response,
}
return response, fallback_entity
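The four `_parse_*_response` methods in this file differ only in the prefix used for the fallback entity name, so they could plausibly share one helper. The sketch below illustrates that idea under stated assumptions: `parse_modal_response` is a hypothetical function, not part of the removed module, and `hashlib.md5` stands in for `compute_mdhash_id`, whose exact output format is not shown here.

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional, Tuple

REQUIRED_ENTITY_KEYS = ("entity_name", "entity_type", "summary")


def parse_modal_response(
    response: str, fallback_type: str, entity_name: Optional[str] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (description, entity_info), falling back to the raw response."""
    try:
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match is None:
            raise ValueError("No JSON object found in response")
        data = json.loads(match.group(0))  # JSONDecodeError is a ValueError
        description = data.get("detailed_description", "")
        entity = data.get("entity_info", {})
        if not description or not isinstance(entity, dict):
            raise ValueError("Missing required fields in response")
        if not all(key in entity for key in REQUIRED_ENTITY_KEYS):
            raise ValueError("Missing required fields in entity_info")
        # A caller-supplied name wins; otherwise tag the model's name with its type.
        entity["entity_name"] = (
            entity_name or f"{entity['entity_name']} ({entity['entity_type']})"
        )
        return description, entity
    except ValueError:
        # Hypothetical stand-in for compute_mdhash_id(response).
        digest = hashlib.md5(response.encode("utf-8")).hexdigest()
        summary = response[:100] + "..." if len(response) > 100 else response
        return response, {
            "entity_name": entity_name or f"{fallback_type}_{digest}",
            "entity_type": fallback_type,
            "summary": summary,
        }


good = (
    'Sure: {"detailed_description": "A bar chart of sales.", '
    '"entity_info": {"entity_name": "Sales chart", "entity_type": "image", '
    '"summary": "Quarterly sales."}}'
)
description, entity = parse_modal_response(good, "image")
raw, fallback = parse_modal_response("no JSON here", "table")
```

Checking for a missing regex match explicitly avoids relying on an `AttributeError` from `.group(0)`, and the `isinstance` guard covers a model that returns `entity_info` as a string rather than an object.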


@ -240,6 +240,466 @@ async def _handle_single_relationship_extraction(
)
async def _rebuild_knowledge_from_chunks(
entities_to_rebuild: dict[str, set[str]],
relationships_to_rebuild: dict[tuple[str, str], set[str]],
knowledge_graph_inst: BaseGraphStorage,
entities_vdb: BaseVectorStorage,
relationships_vdb: BaseVectorStorage,
text_chunks: BaseKVStorage,
llm_response_cache: BaseKVStorage,
global_config: dict[str, str],
) -> None:
"""Rebuild entity and relationship descriptions from cached extraction results
This function reuses cached LLM extraction results instead of calling the LLM again,
following the same approach as the insert process.
Args:
entities_to_rebuild: Dict mapping entity_name -> set of remaining chunk_ids
relationships_to_rebuild: Dict mapping (src, tgt) -> set of remaining chunk_ids
"""
if not entities_to_rebuild and not relationships_to_rebuild:
return
# Get all referenced chunk IDs
all_referenced_chunk_ids = set()
for chunk_ids in entities_to_rebuild.values():
all_referenced_chunk_ids.update(chunk_ids)
for chunk_ids in relationships_to_rebuild.values():
all_referenced_chunk_ids.update(chunk_ids)
logger.info(
f"Rebuilding knowledge from {len(all_referenced_chunk_ids)} cached chunk extractions"
)
# Get cached extraction results for these chunks
cached_results = await _get_cached_extraction_results(
llm_response_cache, all_referenced_chunk_ids
)
if not cached_results:
logger.warning("No cached extraction results found, cannot rebuild")
return
# Process cached results to get entities and relationships for each chunk
chunk_entities = {} # chunk_id -> {entity_name: [entity_data]}
chunk_relationships = {} # chunk_id -> {(src, tgt): [relationship_data]}
for chunk_id, extraction_result in cached_results.items():
try:
entities, relationships = await _parse_extraction_result(
text_chunks=text_chunks,
extraction_result=extraction_result,
chunk_id=chunk_id,
)
chunk_entities[chunk_id] = entities
chunk_relationships[chunk_id] = relationships
except Exception as e:
logger.error(
f"Failed to parse cached extraction result for chunk {chunk_id}: {e}"
)
continue
# Rebuild entities
for entity_name, chunk_ids in entities_to_rebuild.items():
try:
await _rebuild_single_entity(
knowledge_graph_inst=knowledge_graph_inst,
entities_vdb=entities_vdb,
entity_name=entity_name,
chunk_ids=chunk_ids,
chunk_entities=chunk_entities,
llm_response_cache=llm_response_cache,
global_config=global_config,
)
logger.debug(
f"Rebuilt entity {entity_name} from {len(chunk_ids)} cached extractions"
)
except Exception as e:
logger.error(f"Failed to rebuild entity {entity_name}: {e}")
# Rebuild relationships
for (src, tgt), chunk_ids in relationships_to_rebuild.items():
try:
await _rebuild_single_relationship(
knowledge_graph_inst=knowledge_graph_inst,
relationships_vdb=relationships_vdb,
src=src,
tgt=tgt,
chunk_ids=chunk_ids,
chunk_relationships=chunk_relationships,
llm_response_cache=llm_response_cache,
global_config=global_config,
)
logger.debug(
f"Rebuilt relationship {src}-{tgt} from {len(chunk_ids)} cached extractions"
)
except Exception as e:
logger.error(f"Failed to rebuild relationship {src}-{tgt}: {e}")
logger.info("Completed rebuilding knowledge from cached extractions")
async def _get_cached_extraction_results(
llm_response_cache: BaseKVStorage, chunk_ids: set[str]
) -> dict[str, str]:
"""Get cached extraction results for specific chunk IDs
Args:
chunk_ids: Set of chunk IDs to get cached results for
Returns:
Dict mapping chunk_id -> extraction_result_text
"""
cached_results = {}
# Get all cached data for "default" mode (entity extraction cache)
default_cache = await llm_response_cache.get_by_id("default") or {}
for cache_entry in default_cache.values():
if (
isinstance(cache_entry, dict)
and cache_entry.get("cache_type") == "extract"
and cache_entry.get("chunk_id") in chunk_ids
):
chunk_id = cache_entry["chunk_id"]
extraction_result = cache_entry["return"]
cached_results[chunk_id] = extraction_result
logger.info(
f"Found {len(cached_results)} cached extraction results for {len(chunk_ids)} chunk IDs"
)
return cached_results
async def _parse_extraction_result(
text_chunks: BaseKVStorage, extraction_result: str, chunk_id: str
) -> tuple[dict, dict]:
"""Parse cached extraction result using the same logic as extract_entities
Args:
extraction_result: The cached LLM extraction result
chunk_id: The chunk ID for source tracking
Returns:
Tuple of (entities_dict, relationships_dict)
"""
# Get chunk data for file_path
chunk_data = await text_chunks.get_by_id(chunk_id)
file_path = (
chunk_data.get("file_path", "unknown_source")
if chunk_data
else "unknown_source"
)
context_base = dict(
tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"],
record_delimiter=PROMPTS["DEFAULT_RECORD_DELIMITER"],
completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"],
)
maybe_nodes = defaultdict(list)
maybe_edges = defaultdict(list)
# Parse the extraction result using the same logic as in extract_entities
records = split_string_by_multi_markers(
extraction_result,
[context_base["record_delimiter"], context_base["completion_delimiter"]],
)
for record in records:
record = re.search(r"\((.*)\)", record)
if record is None:
continue
record = record.group(1)
record_attributes = split_string_by_multi_markers(
record, [context_base["tuple_delimiter"]]
)
# Try to parse as entity
entity_data = await _handle_single_entity_extraction(
record_attributes, chunk_id, file_path
)
if entity_data is not None:
maybe_nodes[entity_data["entity_name"]].append(entity_data)
continue
# Try to parse as relationship
relationship_data = await _handle_single_relationship_extraction(
record_attributes, chunk_id, file_path
)
if relationship_data is not None:
maybe_edges[
(relationship_data["src_id"], relationship_data["tgt_id"])
].append(relationship_data)
return dict(maybe_nodes), dict(maybe_edges)
async def _rebuild_single_entity(
knowledge_graph_inst: BaseGraphStorage,
entities_vdb: BaseVectorStorage,
entity_name: str,
chunk_ids: set[str],
chunk_entities: dict,
llm_response_cache: BaseKVStorage,
global_config: dict[str, str],
) -> None:
"""Rebuild a single entity from cached extraction results"""
# Get current entity data
current_entity = await knowledge_graph_inst.get_node(entity_name)
if not current_entity:
return
# Helper function to update entity in both graph and vector storage
async def _update_entity_storage(
final_description: str, entity_type: str, file_paths: set[str]
):
# Update entity in graph storage
updated_entity_data = {
**current_entity,
"description": final_description,
"entity_type": entity_type,
"source_id": GRAPH_FIELD_SEP.join(chunk_ids),
"file_path": GRAPH_FIELD_SEP.join(file_paths)
if file_paths
else current_entity.get("file_path", "unknown_source"),
}
await knowledge_graph_inst.upsert_node(entity_name, updated_entity_data)
# Update entity in vector database
entity_vdb_id = compute_mdhash_id(entity_name, prefix="ent-")
# Delete old vector record first
try:
await entities_vdb.delete([entity_vdb_id])
except Exception as e:
logger.debug(
f"Could not delete old entity vector record {entity_vdb_id}: {e}"
)
# Insert new vector record
entity_content = f"{entity_name}\n{final_description}"
await entities_vdb.upsert(
{
entity_vdb_id: {
"content": entity_content,
"entity_name": entity_name,
"source_id": updated_entity_data["source_id"],
"description": final_description,
"entity_type": entity_type,
"file_path": updated_entity_data["file_path"],
}
}
)
# Helper function to generate final description with optional LLM summary
async def _generate_final_description(combined_description: str) -> str:
# NOTE: this compares a character count against a token limit, so it is only a rough threshold
if len(combined_description) > global_config["summary_to_max_tokens"]:
return await _handle_entity_relation_summary(
entity_name,
combined_description,
global_config,
llm_response_cache=llm_response_cache,
)
else:
return combined_description
# Collect all entity data from relevant chunks
all_entity_data = []
for chunk_id in chunk_ids:
if chunk_id in chunk_entities and entity_name in chunk_entities[chunk_id]:
all_entity_data.extend(chunk_entities[chunk_id][entity_name])
if not all_entity_data:
logger.warning(
f"No cached entity data found for {entity_name}, trying to rebuild from relationships"
)
# Get all edges connected to this entity
edges = await knowledge_graph_inst.get_node_edges(entity_name)
if not edges:
logger.warning(f"No relationships found for entity {entity_name}")
return
# Collect relationship data to extract entity information
relationship_descriptions = []
file_paths = set()
# Get edge data for all connected relationships
for src_id, tgt_id in edges:
edge_data = await knowledge_graph_inst.get_edge(src_id, tgt_id)
if edge_data:
if edge_data.get("description"):
relationship_descriptions.append(edge_data["description"])
if edge_data.get("file_path"):
edge_file_paths = edge_data["file_path"].split(GRAPH_FIELD_SEP)
file_paths.update(edge_file_paths)
# Generate description from relationships or fallback to current
if relationship_descriptions:
combined_description = GRAPH_FIELD_SEP.join(relationship_descriptions)
final_description = await _generate_final_description(combined_description)
else:
final_description = current_entity.get("description", "")
entity_type = current_entity.get("entity_type", "UNKNOWN")
await _update_entity_storage(final_description, entity_type, file_paths)
return
# Process cached entity data
descriptions = []
entity_types = []
file_paths = set()
for entity_data in all_entity_data:
if entity_data.get("description"):
descriptions.append(entity_data["description"])
if entity_data.get("entity_type"):
entity_types.append(entity_data["entity_type"])
if entity_data.get("file_path"):
file_paths.add(entity_data["file_path"])
# Combine all descriptions
combined_description = (
GRAPH_FIELD_SEP.join(descriptions)
if descriptions
else current_entity.get("description", "")
)
# Get most common entity type
entity_type = (
max(set(entity_types), key=entity_types.count)
if entity_types
else current_entity.get("entity_type", "UNKNOWN")
)
# Generate final description and update storage
final_description = await _generate_final_description(combined_description)
await _update_entity_storage(final_description, entity_type, file_paths)
async def _rebuild_single_relationship(
knowledge_graph_inst: BaseGraphStorage,
relationships_vdb: BaseVectorStorage,
src: str,
tgt: str,
chunk_ids: set[str],
chunk_relationships: dict,
llm_response_cache: BaseKVStorage,
global_config: dict[str, str],
) -> None:
"""Rebuild a single relationship from cached extraction results"""
# Get current relationship data
current_relationship = await knowledge_graph_inst.get_edge(src, tgt)
if not current_relationship:
return
# Collect all relationship data from relevant chunks
all_relationship_data = []
for chunk_id in chunk_ids:
if chunk_id in chunk_relationships:
# Check both (src, tgt) and (tgt, src) since relationships can be bidirectional
for edge_key in [(src, tgt), (tgt, src)]:
if edge_key in chunk_relationships[chunk_id]:
all_relationship_data.extend(
chunk_relationships[chunk_id][edge_key]
)
if not all_relationship_data:
logger.warning(f"No cached relationship data found for {src}-{tgt}")
return
# Merge descriptions and keywords
descriptions = []
keywords = []
weights = []
file_paths = set()
for rel_data in all_relationship_data:
if rel_data.get("description"):
descriptions.append(rel_data["description"])
if rel_data.get("keywords"):
keywords.append(rel_data["keywords"])
if rel_data.get("weight"):
weights.append(rel_data["weight"])
if rel_data.get("file_path"):
file_paths.add(rel_data["file_path"])
# Combine descriptions and keywords
combined_description = (
GRAPH_FIELD_SEP.join(descriptions)
if descriptions
else current_relationship.get("description", "")
)
combined_keywords = (
# Sort the deduplicated keywords so the stored string is deterministic
", ".join(sorted(set(keywords)))
if keywords
else current_relationship.get("keywords", "")
)
# Sum the weights across the remaining chunks (an averaging variant is intentionally not used)
weight = sum(weights) if weights else current_relationship.get("weight", 1.0)
# Summarize with the LLM if the description is too long
# (character count compared against a token limit, so only a rough threshold)
if len(combined_description) > global_config["summary_to_max_tokens"]:
final_description = await _handle_entity_relation_summary(
f"{src}-{tgt}",
combined_description,
global_config,
llm_response_cache=llm_response_cache,
)
else:
final_description = combined_description
# Update relationship in graph storage
updated_relationship_data = {
**current_relationship,
"description": final_description,
"keywords": combined_keywords,
"weight": weight,
"source_id": GRAPH_FIELD_SEP.join(chunk_ids),
"file_path": GRAPH_FIELD_SEP.join(file_paths)
if file_paths
else current_relationship.get("file_path", "unknown_source"),
}
await knowledge_graph_inst.upsert_edge(src, tgt, updated_relationship_data)
# Update relationship in vector database
rel_vdb_id = compute_mdhash_id(src + tgt, prefix="rel-")
rel_vdb_id_reverse = compute_mdhash_id(tgt + src, prefix="rel-")
# Delete old vector records first (both directions to be safe)
try:
await relationships_vdb.delete([rel_vdb_id, rel_vdb_id_reverse])
except Exception as e:
logger.debug(
f"Could not delete old relationship vector records {rel_vdb_id}, {rel_vdb_id_reverse}: {e}"
)
# Insert new vector record
rel_content = f"{combined_keywords}\t{src}\n{tgt}\n{final_description}"
await relationships_vdb.upsert(
{
rel_vdb_id: {
"src_id": src,
"tgt_id": tgt,
"source_id": updated_relationship_data["source_id"],
"content": rel_content,
"keywords": combined_keywords,
"description": final_description,
"weight": weight,
"file_path": updated_relationship_data["file_path"],
}
}
)
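The merge rules in the relationship rebuild above (join descriptions, de-duplicate keywords, sum weights, fall back to the current edge when a field is missing) can be sketched in isolation. This is a minimal sketch, not the project's implementation: `merge_relationship_records` is a hypothetical helper, the `"<SEP>"` value of `GRAPH_FIELD_SEP` is an assumption, and the sketch de-duplicates keywords in first-seen order and sorts file paths, whereas the code above uses plain `set()` ordering.

```python
from typing import Any

# Assumption: the separator used for multi-valued graph fields.
GRAPH_FIELD_SEP = "<SEP>"


def merge_relationship_records(
    records: list[dict[str, Any]],
    current: dict[str, Any],
    sep: str = GRAPH_FIELD_SEP,
) -> dict[str, Any]:
    """Combine cached relationship records into one edge payload (hypothetical helper)."""
    descriptions = [r["description"] for r in records if r.get("description")]
    keywords = [r["keywords"] for r in records if r.get("keywords")]
    weights = [r["weight"] for r in records if r.get("weight")]
    file_paths = {r["file_path"] for r in records if r.get("file_path")}

    return {
        **current,
        "description": sep.join(descriptions)
        if descriptions
        else current.get("description", ""),
        # dict.fromkeys de-duplicates while keeping first-seen order.
        "keywords": ", ".join(dict.fromkeys(keywords))
        if keywords
        else current.get("keywords", ""),
        # Weights are summed, not averaged, so repeated mentions strengthen the edge.
        "weight": sum(weights) if weights else current.get("weight", 1.0),
        # Sorted only to make the output deterministic in this sketch.
        "file_path": sep.join(sorted(file_paths))
        if file_paths
        else current.get("file_path", "unknown_source"),
    }


merged = merge_relationship_records(
    [
        {"description": "Google develops Gmail", "keywords": "product", "weight": 2.0, "file_path": "a.txt"},
        {"description": "Gmail is run by Google", "keywords": "product", "weight": 1.0, "file_path": "b.txt"},
    ],
    current={"description": "stale", "weight": 5.0},
)
print(merged)
```

Unlike the function in the diff, which returns early when no cached records are found, this sketch simply falls back to the current edge values.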
async def _merge_nodes_then_upsert(
entity_name: str,
nodes_data: list[dict],
@ -757,6 +1217,7 @@ async def extract_entities(
use_llm_func,
llm_response_cache=llm_response_cache,
cache_type="extract",
chunk_id=chunk_key,
)
history = pack_user_ass_to_openai_messages(hint_prompt, final_result)
@ -773,6 +1234,7 @@ async def extract_entities(
llm_response_cache=llm_response_cache,
history_messages=history,
cache_type="extract",
chunk_id=chunk_key,
)
history += pack_user_ass_to_openai_messages(continue_prompt, glean_result)


@ -1,686 +0,0 @@
"""
Complete MinerU parsing + multimodal content insertion Pipeline
This script integrates:
1. MinerU document parsing
2. Pure text content LightRAG insertion
3. Specialized processing for multimodal content (using different processors)
"""
import os
import asyncio
import logging
from pathlib import Path
from typing import Dict, List, Any, Tuple, Optional, Callable
import sys
# Add project root directory to Python path
sys.path.insert(0, str(Path(__file__).parent.parent))
from lightrag import LightRAG, QueryParam
from lightrag.utils import EmbeddingFunc, setup_logger
# Import parser and multimodal processors
from lightrag.mineru_parser import MineruParser
# Import specialized processors
from lightrag.modalprocessors import (
ImageModalProcessor,
TableModalProcessor,
EquationModalProcessor,
GenericModalProcessor,
)
class RAGAnything:
"""Multimodal Document Processing Pipeline - Complete document parsing and insertion pipeline"""
def __init__(
self,
lightrag: Optional[LightRAG] = None,
llm_model_func: Optional[Callable] = None,
vision_model_func: Optional[Callable] = None,
embedding_func: Optional[Callable] = None,
working_dir: str = "./rag_storage",
embedding_dim: int = 3072,
max_token_size: int = 8192,
):
"""
Initialize Multimodal Document Processing Pipeline
Args:
lightrag: Optional pre-initialized LightRAG instance
llm_model_func: LLM model function for text analysis
vision_model_func: Vision model function for image analysis
embedding_func: Embedding function for text vectorization
working_dir: Working directory for storage (used when creating new RAG)
embedding_dim: Embedding dimension (used when creating new RAG)
max_token_size: Maximum token size for embeddings (used when creating new RAG)
"""
self.working_dir = working_dir
self.llm_model_func = llm_model_func
self.vision_model_func = vision_model_func
self.embedding_func = embedding_func
self.embedding_dim = embedding_dim
self.max_token_size = max_token_size
# Set up logging
setup_logger("RAGAnything")
self.logger = logging.getLogger("RAGAnything")
# Create working directory if needed
if not os.path.exists(working_dir):
os.makedirs(working_dir)
# Use provided LightRAG or mark for later initialization
self.lightrag = lightrag
self.modal_processors = {}
# If LightRAG is provided, initialize processors immediately
if self.lightrag is not None:
self._initialize_processors()
def _initialize_processors(self):
"""Initialize multimodal processors with appropriate model functions"""
if self.lightrag is None:
raise ValueError(
"LightRAG instance must be initialized before creating processors"
)
# Create different multimodal processors
self.modal_processors = {
"image": ImageModalProcessor(
lightrag=self.lightrag,
modal_caption_func=self.vision_model_func or self.llm_model_func,
),
"table": TableModalProcessor(
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
),
"equation": EquationModalProcessor(
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
),
"generic": GenericModalProcessor(
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
),
}
self.logger.info("Multimodal processors initialized")
self.logger.info(f"Available processors: {list(self.modal_processors.keys())}")
async def _ensure_lightrag_initialized(self):
"""Ensure LightRAG instance is initialized, create if necessary"""
if self.lightrag is not None:
return
# Validate required functions
if self.llm_model_func is None:
raise ValueError(
"llm_model_func must be provided when LightRAG is not pre-initialized"
)
if self.embedding_func is None:
raise ValueError(
"embedding_func must be provided when LightRAG is not pre-initialized"
)
from lightrag.kg.shared_storage import initialize_pipeline_status
# Create LightRAG instance with provided functions
self.lightrag = LightRAG(
working_dir=self.working_dir,
llm_model_func=self.llm_model_func,
embedding_func=EmbeddingFunc(
embedding_dim=self.embedding_dim,
max_token_size=self.max_token_size,
func=self.embedding_func,
),
)
await self.lightrag.initialize_storages()
await initialize_pipeline_status()
# Initialize processors after LightRAG is ready
self._initialize_processors()
self.logger.info("LightRAG and multimodal processors initialized")
def parse_document(
self,
file_path: str,
output_dir: str = "./output",
parse_method: str = "auto",
display_stats: bool = True,
) -> Tuple[List[Dict[str, Any]], str]:
"""
Parse document using MinerU
Args:
file_path: Path to the file to parse
output_dir: Output directory
parse_method: Parse method ("auto", "ocr", "txt")
display_stats: Whether to display content statistics
Returns:
(content_list, md_content): Content list and markdown text
"""
self.logger.info(f"Starting document parsing: {file_path}")
file_path = Path(file_path)
if not file_path.exists():
raise FileNotFoundError(f"File not found: {file_path}")
# Choose appropriate parsing method based on file extension
ext = file_path.suffix.lower()
try:
if ext in [".pdf"]:
self.logger.info(
f"Detected PDF file, using PDF parser (OCR={parse_method == 'ocr'})..."
)
content_list, md_content = MineruParser.parse_pdf(
file_path, output_dir, use_ocr=(parse_method == "ocr")
)
elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
self.logger.info("Detected image file, using image parser...")
content_list, md_content = MineruParser.parse_image(
file_path, output_dir
)
elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
self.logger.info("Detected Office document, using Office parser...")
content_list, md_content = MineruParser.parse_office_doc(
file_path, output_dir
)
else:
# For other or unknown formats, use generic parser
self.logger.info(
f"Using generic parser for {ext} file (method={parse_method})..."
)
content_list, md_content = MineruParser.parse_document(
file_path, parse_method=parse_method, output_dir=output_dir
)
except Exception as e:
self.logger.error(f"Error during parsing with specific parser: {str(e)}")
self.logger.warning("Falling back to generic parser...")
# If specific parser fails, fall back to generic parser
content_list, md_content = MineruParser.parse_document(
file_path, parse_method=parse_method, output_dir=output_dir
)
self.logger.info(
f"Parsing complete! Extracted {len(content_list)} content blocks"
)
self.logger.info(f"Markdown text length: {len(md_content)} characters")
# Display content statistics if requested
if display_stats:
self.logger.info("\nContent Information:")
self.logger.info(f"* Total blocks in content_list: {len(content_list)}")
self.logger.info(f"* Markdown content length: {len(md_content)} characters")
# Count elements by type
block_types: Dict[str, int] = {}
for block in content_list:
if isinstance(block, dict):
block_type = block.get("type", "unknown")
if isinstance(block_type, str):
block_types[block_type] = block_types.get(block_type, 0) + 1
self.logger.info("* Content block types:")
for block_type, count in block_types.items():
self.logger.info(f" - {block_type}: {count}")
return content_list, md_content
def _separate_content(
self, content_list: List[Dict[str, Any]]
) -> Tuple[str, List[Dict[str, Any]]]:
"""
Separate text content and multimodal content
Args:
content_list: Content list from MinerU parsing
Returns:
(text_content, multimodal_items): Pure text content and multimodal items list
"""
text_parts = []
multimodal_items = []
for item in content_list:
content_type = item.get("type", "text")
if content_type == "text":
# Text content
text = item.get("text", "")
if text.strip():
text_parts.append(text)
else:
# Multimodal content (image, table, equation, etc.)
multimodal_items.append(item)
# Merge all text content
text_content = "\n\n".join(text_parts)
self.logger.info("Content separation complete:")
self.logger.info(f" - Text content length: {len(text_content)} characters")
self.logger.info(f" - Multimodal items count: {len(multimodal_items)}")
# Count multimodal types
modal_types = {}
for item in multimodal_items:
modal_type = item.get("type", "unknown")
modal_types[modal_type] = modal_types.get(modal_type, 0) + 1
if modal_types:
self.logger.info(f" - Multimodal type distribution: {modal_types}")
return text_content, multimodal_items
async def _insert_text_content(
self,
input: str | list[str],
split_by_character: str | None = None,
split_by_character_only: bool = False,
ids: str | list[str] | None = None,
file_paths: str | list[str] | None = None,
):
"""
Insert pure text content into LightRAG
Args:
input: Single document string or list of document strings
split_by_character: If not None, split the text on this character; any chunk longer than
chunk_token_size is split again by token size.
split_by_character_only: If True, split on the character only; ignored when
split_by_character is None.
ids: A single document ID or a list of unique document IDs; if omitted, MD5 hash IDs are generated.
file_paths: A single file path or a list of file paths, used for citation.
"""
self.logger.info("Starting text content insertion into LightRAG...")
# Use LightRAG's insert method with all parameters
await self.lightrag.ainsert(
input=input,
file_paths=file_paths,
split_by_character=split_by_character,
split_by_character_only=split_by_character_only,
ids=ids,
)
self.logger.info("Text content insertion complete")
async def _process_multimodal_content(
self, multimodal_items: List[Dict[str, Any]], file_path: str
):
"""
Process multimodal content (using specialized processors)
Args:
multimodal_items: List of multimodal items
file_path: File path (for reference)
"""
if not multimodal_items:
self.logger.debug("No multimodal content to process")
return
self.logger.info("Starting multimodal content processing...")
file_name = os.path.basename(file_path)
for i, item in enumerate(multimodal_items):
try:
content_type = item.get("type", "unknown")
self.logger.info(
f"Processing item {i+1}/{len(multimodal_items)}: {content_type} content"
)
# Select appropriate processor
processor = self._get_processor_for_type(content_type)
if processor:
(
enhanced_caption,
entity_info,
) = await processor.process_multimodal_content(
modal_content=item,
content_type=content_type,
file_path=file_name,
)
self.logger.info(
f"{content_type} processing complete: {entity_info.get('entity_name', 'Unknown')}"
)
else:
self.logger.warning(
f"No suitable processor found for {content_type} type content"
)
except Exception as e:
self.logger.error(f"Error processing multimodal content: {str(e)}")
self.logger.debug("Exception details:", exc_info=True)
continue
self.logger.info("Multimodal content processing complete")
def _get_processor_for_type(self, content_type: str):
"""
Get appropriate processor based on content type
Args:
content_type: Content type
Returns:
Corresponding processor instance
"""
# Direct mapping to corresponding processor
if content_type == "image":
return self.modal_processors.get("image")
elif content_type == "table":
return self.modal_processors.get("table")
elif content_type == "equation":
return self.modal_processors.get("equation")
else:
# For other types, use generic processor
return self.modal_processors.get("generic")
async def process_document_complete(
self,
file_path: str,
output_dir: str = "./output",
parse_method: str = "auto",
display_stats: bool = True,
split_by_character: str | None = None,
split_by_character_only: bool = False,
doc_id: str | None = None,
):
"""
Complete document processing workflow
Args:
file_path: Path to the file to process
output_dir: MinerU output directory
parse_method: Parse method
display_stats: Whether to display content statistics
split_by_character: Optional character to split the text by
split_by_character_only: If True, split only by the specified character
doc_id: Optional document ID, if not provided MD5 hash will be generated
"""
# Ensure LightRAG is initialized
await self._ensure_lightrag_initialized()
self.logger.info(f"Starting complete document processing: {file_path}")
# Step 1: Parse document using MinerU
content_list, md_content = self.parse_document(
file_path, output_dir, parse_method, display_stats
)
# Step 2: Separate text and multimodal content
text_content, multimodal_items = self._separate_content(content_list)
# Step 3: Insert pure text content with all parameters
if text_content.strip():
file_name = os.path.basename(file_path)
await self._insert_text_content(
text_content,
file_paths=file_name,
split_by_character=split_by_character,
split_by_character_only=split_by_character_only,
ids=doc_id,
)
# Step 4: Process multimodal content (using specialized processors)
if multimodal_items:
await self._process_multimodal_content(multimodal_items, file_path)
self.logger.info(f"Document {file_path} processing complete!")
async def process_folder_complete(
self,
folder_path: str,
output_dir: str = "./output",
parse_method: str = "auto",
display_stats: bool = False,
split_by_character: str | None = None,
split_by_character_only: bool = False,
file_extensions: Optional[List[str]] = None,
recursive: bool = True,
max_workers: int = 1,
):
"""
Process all files in a folder in batch
Args:
folder_path: Path to the folder to process
output_dir: MinerU output directory
parse_method: Parse method
display_stats: Whether to display content statistics for each file (recommended False for batch processing)
split_by_character: Optional character to split text by
split_by_character_only: If True, split only by the specified character
file_extensions: List of file extensions to process, e.g. [".pdf", ".docx"]. If None, process all supported formats
recursive: Whether to recursively process subfolders
max_workers: Maximum number of concurrent workers
"""
# Ensure LightRAG is initialized
await self._ensure_lightrag_initialized()
folder_path = Path(folder_path)
if not folder_path.exists() or not folder_path.is_dir():
raise ValueError(
f"Folder does not exist or is not a valid directory: {folder_path}"
)
# Supported file formats
supported_extensions = {
".pdf",
".jpg",
".jpeg",
".png",
".bmp",
".tiff",
".tif",
".doc",
".docx",
".ppt",
".pptx",
".txt",
".md",
}
# Use specified extensions or all supported formats
if file_extensions:
target_extensions = {ext.lower() for ext in file_extensions}
# Validate if all are supported formats
unsupported = target_extensions - supported_extensions
if unsupported:
self.logger.warning(
f"The following file formats may not be fully supported: {unsupported}"
)
else:
target_extensions = supported_extensions
# Collect all files to process
files_to_process = []
if recursive:
# Recursively traverse all subfolders
for file_path in folder_path.rglob("*"):
if (
file_path.is_file()
and file_path.suffix.lower() in target_extensions
):
files_to_process.append(file_path)
else:
# Process only current folder
for file_path in folder_path.glob("*"):
if (
file_path.is_file()
and file_path.suffix.lower() in target_extensions
):
files_to_process.append(file_path)
if not files_to_process:
self.logger.info(f"No files to process found in {folder_path}")
return
self.logger.info(f"Found {len(files_to_process)} files to process")
self.logger.info("File type distribution:")
# Count file types
file_type_count = {}
for file_path in files_to_process:
ext = file_path.suffix.lower()
file_type_count[ext] = file_type_count.get(ext, 0) + 1
for ext, count in sorted(file_type_count.items()):
self.logger.info(f" {ext}: {count} files")
# Create progress tracking
processed_count = 0
failed_files = []
# Use semaphore to control concurrency
semaphore = asyncio.Semaphore(max_workers)
async def process_single_file(file_path: Path, index: int) -> None:
"""Process a single file"""
async with semaphore:
nonlocal processed_count
try:
self.logger.info(
f"[{index}/{len(files_to_process)}] Processing: {file_path}"
)
# Create separate output directory for each file
file_output_dir = Path(output_dir) / file_path.stem
file_output_dir.mkdir(parents=True, exist_ok=True)
# Process file
await self.process_document_complete(
file_path=str(file_path),
output_dir=str(file_output_dir),
parse_method=parse_method,
display_stats=display_stats,
split_by_character=split_by_character,
split_by_character_only=split_by_character_only,
)
processed_count += 1
self.logger.info(
f"[{index}/{len(files_to_process)}] Successfully processed: {file_path}"
)
except Exception as e:
self.logger.error(
f"[{index}/{len(files_to_process)}] Failed to process: {file_path}"
)
self.logger.error(f"Error: {str(e)}")
failed_files.append((file_path, str(e)))
# Create all processing tasks
tasks = []
for index, file_path in enumerate(files_to_process, 1):
task = process_single_file(file_path, index)
tasks.append(task)
# Wait for all tasks to complete
await asyncio.gather(*tasks, return_exceptions=True)
# Output processing statistics
self.logger.info("\n===== Batch Processing Complete =====")
self.logger.info(f"Total files: {len(files_to_process)}")
self.logger.info(f"Successfully processed: {processed_count}")
self.logger.info(f"Failed: {len(failed_files)}")
if failed_files:
self.logger.info("\nFailed files:")
for file_path, error in failed_files:
self.logger.info(f" - {file_path}: {error}")
return {
"total": len(files_to_process),
"success": processed_count,
"failed": len(failed_files),
"failed_files": failed_files,
}
async def query_with_multimodal(self, query: str, mode: str = "hybrid") -> str:
"""
Query with multimodal content support
Args:
query: Query content
mode: Query mode
Returns:
Query result
"""
if self.lightrag is None:
raise ValueError(
"No LightRAG instance available. "
"Please either:\n"
"1. Provide a pre-initialized LightRAG instance when creating RAGAnything, or\n"
"2. Process documents first using process_document_complete() or process_folder_complete() "
"to create and populate the LightRAG instance."
)
result = await self.lightrag.aquery(query, param=QueryParam(mode=mode))
return result
def get_processor_info(self) -> Dict[str, Any]:
"""Get processor information"""
if not self.modal_processors:
return {"status": "Not initialized"}
info = {
"status": "Initialized",
"processors": {},
"models": {
"llm_model": "External function"
if self.llm_model_func
else "Not provided",
"vision_model": "External function"
if self.vision_model_func
else "Not provided",
"embedding_model": "External function"
if self.embedding_func
else "Not provided",
},
}
for proc_type, processor in self.modal_processors.items():
info["processors"][proc_type] = {
"class": processor.__class__.__name__,
"supports": self._get_processor_supports(proc_type),
}
return info
def _get_processor_supports(self, proc_type: str) -> List[str]:
"""Get processor supported features"""
supports_map = {
"image": [
"Image content analysis",
"Visual understanding",
"Image description generation",
"Image entity extraction",
],
"table": [
"Table structure analysis",
"Data statistics",
"Trend identification",
"Table entity extraction",
],
"equation": [
"Mathematical formula parsing",
"Variable identification",
"Formula meaning explanation",
"Formula entity extraction",
],
"generic": [
"General content analysis",
"Structured processing",
"Entity extraction",
],
}
return supports_map.get(proc_type, ["Basic processing"])
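The batch method in the removed script limits concurrency with an `asyncio.Semaphore` and records per-file failures instead of aborting the whole run. The following is a minimal, self-contained sketch of that pattern; `run_bounded` and `fake_parse` are hypothetical names used only for illustration and are not part of LightRAG.

```python
import asyncio


async def run_bounded(items, worker, max_workers=1):
    """Run worker(item) for every item, with at most max_workers running at once."""
    semaphore = asyncio.Semaphore(max_workers)
    succeeded = 0
    failed = []

    async def guarded(item):
        nonlocal succeeded
        async with semaphore:
            try:
                await worker(item)
                succeeded += 1
            except Exception as exc:
                # Record the failure and let the remaining items continue.
                failed.append((item, str(exc)))

    await asyncio.gather(*(guarded(item) for item in items))
    return {
        "total": len(items),
        "success": succeeded,
        "failed": len(failed),
        "failed_files": failed,
    }


async def fake_parse(name):
    """Stand-in for real document processing; fails on '.bad' files."""
    await asyncio.sleep(0)
    if name.endswith(".bad"):
        raise ValueError(f"cannot parse {name}")


summary = asyncio.run(run_bounded(["a.pdf", "b.bad", "c.md"], fake_parse, max_workers=2))
print(summary)
```

Because each worker catches its own exceptions, `gather` does not need `return_exceptions=True` here; the summary dictionary carries the failure details instead.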


@ -990,6 +990,7 @@ class CacheData:
max_val: float | None = None
mode: str = "default"
cache_type: str = "query"
chunk_id: str | None = None
async def save_to_cache(hashing_kv, cache_data: CacheData):
@ -1030,6 +1031,7 @@ async def save_to_cache(hashing_kv, cache_data: CacheData):
mode_cache[cache_data.args_hash] = {
"return": cache_data.content,
"cache_type": cache_data.cache_type,
"chunk_id": cache_data.chunk_id,
"embedding": cache_data.quantized.tobytes().hex()
if cache_data.quantized is not None
else None,
@ -1534,6 +1536,7 @@ async def use_llm_func_with_cache(
max_tokens: int = None,
history_messages: list[dict[str, str]] = None,
cache_type: str = "extract",
chunk_id: str | None = None,
) -> str:
"""Call LLM function with cache support
@ -1547,6 +1550,7 @@ async def use_llm_func_with_cache(
max_tokens: Maximum tokens for generation
history_messages: History messages list
cache_type: Type of cache
chunk_id: Chunk identifier to store in cache
Returns:
LLM response text
@ -1589,6 +1593,7 @@ async def use_llm_func_with_cache(
content=res,
prompt=_prompt,
cache_type=cache_type,
chunk_id=chunk_id,
),
)


@ -4,6 +4,7 @@ import time
import asyncio
from typing import Any, cast
from .base import DeletionResult
from .kg.shared_storage import get_graph_db_lock
from .prompt import GRAPH_FIELD_SEP
from .utils import compute_mdhash_id, logger
@ -12,7 +13,7 @@ from .base import StorageNameSpace
async def adelete_by_entity(
chunk_entity_relation_graph, entities_vdb, relationships_vdb, entity_name: str
) -> None:
) -> DeletionResult:
"""Asynchronously delete an entity and all its relationships.
Args:
@ -25,18 +26,43 @@ async def adelete_by_entity(
# Use graph database lock to ensure atomic graph and vector db operations
async with graph_db_lock:
try:
# Check if the entity exists
if not await chunk_entity_relation_graph.has_node(entity_name):
logger.warning(f"Entity '{entity_name}' not found.")
return DeletionResult(
status="not_found",
doc_id=entity_name,
message=f"Entity '{entity_name}' not found.",
status_code=404,
)
# Retrieve related relationships before deleting the node
edges = await chunk_entity_relation_graph.get_node_edges(entity_name)
related_relations_count = len(edges) if edges else 0
await entities_vdb.delete_entity(entity_name)
await relationships_vdb.delete_entity_relation(entity_name)
await chunk_entity_relation_graph.delete_node(entity_name)
logger.info(
f"Entity '{entity_name}' and its relationships have been deleted."
)
message = f"Entity '{entity_name}' and its {related_relations_count} relationships have been deleted."
logger.info(message)
await _delete_by_entity_done(
entities_vdb, relationships_vdb, chunk_entity_relation_graph
)
return DeletionResult(
status="success",
doc_id=entity_name,
message=message,
status_code=200,
)
except Exception as e:
logger.error(f"Error while deleting entity '{entity_name}': {e}")
error_message = f"Error while deleting entity '{entity_name}': {e}"
logger.error(error_message)
return DeletionResult(
status="fail",
doc_id=entity_name,
message=error_message,
status_code=500,
)
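With this change the deletion helpers report their outcome through a `DeletionResult` instead of returning `None`, so callers can branch on the status. A minimal sketch of such a caller follows. The dataclass here is an assumed shape inferred from the keyword arguments used in this diff (`status`, `doc_id`, `message`, `status_code`), not the definition in `lightrag.base`, and `describe` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class DeletionResult:
    """Assumed shape, inferred from the keyword arguments used in the diff."""

    status: str  # "success", "not_found", or "fail"
    doc_id: str
    message: str
    status_code: int = 200


def describe(result: DeletionResult) -> str:
    """Turn a deletion result into a short, user-facing summary."""
    if result.status == "success":
        return f"deleted {result.doc_id}"
    if result.status == "not_found":
        return f"nothing to delete for {result.doc_id}"
    return f"deletion failed ({result.status_code}): {result.message}"


ok = DeletionResult("success", "Google", "Entity 'Google' deleted.", 200)
missing = DeletionResult("not_found", "Gmail", "Entity 'Gmail' not found.", 404)
broken = DeletionResult("fail", "Google", "storage unavailable", 500)
print(describe(ok), describe(missing), describe(broken), sep="\n")
```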
async def _delete_by_entity_done(
@ -60,7 +86,7 @@ async def adelete_by_relation(
relationships_vdb,
source_entity: str,
target_entity: str,
) -> None:
) -> DeletionResult:
"""Asynchronously delete a relation between two entities.
Args:
@ -69,6 +95,7 @@ async def adelete_by_relation(
source_entity: Name of the source entity
target_entity: Name of the target entity
"""
relation_str = f"{source_entity} -> {target_entity}"
graph_db_lock = get_graph_db_lock(enable_logging=False)
# Use graph database lock to ensure atomic graph and vector db operations
async with graph_db_lock:
@ -78,29 +105,45 @@ async def adelete_by_relation(
source_entity, target_entity
)
if not edge_exists:
logger.warning(
f"Relation from '{source_entity}' to '{target_entity}' does not exist"
message = f"Relation from '{source_entity}' to '{target_entity}' does not exist"
logger.warning(message)
return DeletionResult(
status="not_found",
doc_id=relation_str,
message=message,
status_code=404,
)
return
# Delete relation from vector database
relation_id = compute_mdhash_id(
source_entity + target_entity, prefix="rel-"
)
await relationships_vdb.delete([relation_id])
rel_ids_to_delete = [
compute_mdhash_id(source_entity + target_entity, prefix="rel-"),
compute_mdhash_id(target_entity + source_entity, prefix="rel-"),
]
await relationships_vdb.delete(rel_ids_to_delete)
# Delete relation from knowledge graph
await chunk_entity_relation_graph.remove_edges(
[(source_entity, target_entity)]
)
logger.info(
f"Successfully deleted relation from '{source_entity}' to '{target_entity}'"
)
message = f"Successfully deleted relation from '{source_entity}' to '{target_entity}'"
logger.info(message)
await _delete_relation_done(relationships_vdb, chunk_entity_relation_graph)
return DeletionResult(
status="success",
doc_id=relation_str,
message=message,
status_code=200,
)
except Exception as e:
logger.error(
f"Error while deleting relation from '{source_entity}' to '{target_entity}': {e}"
error_message = f"Error while deleting relation from '{source_entity}' to '{target_entity}': {e}"
logger.error(error_message)
return DeletionResult(
status="fail",
doc_id=relation_str,
message=error_message,
status_code=500,
)
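The relation ID is a hash of the concatenated entity names, so it depends on argument order. That is why the updated code deletes both the forward and the reverse ID. The sketch below illustrates this, assuming `compute_mdhash_id` is the prefix plus an MD5 hex digest of the content; the local definition is a stand-in under that assumption, and `relation_vector_ids` is a hypothetical helper.

```python
from hashlib import md5


def compute_mdhash_id(content: str, prefix: str = "") -> str:
    """Local stand-in, assumed to behave like lightrag.utils.compute_mdhash_id."""
    return prefix + md5(content.encode()).hexdigest()


def relation_vector_ids(source: str, target: str) -> list[str]:
    """Return the vector-store IDs for both directions of a relation."""
    return [
        compute_mdhash_id(source + target, prefix="rel-"),
        compute_mdhash_id(target + source, prefix="rel-"),
    ]


ids = relation_vector_ids("Google", "Gmail")
reversed_ids = relation_vector_ids("Gmail", "Google")
print(ids)
```

Deleting both IDs removes the vector record regardless of which direction was used when the relation was first inserted.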