Merge branch 'HKUDS:main' into main
This commit is contained in:
commit
12054fa8d9
22 changed files with 1514 additions and 3058 deletions
150
README-zh.md
|
|
@@ -4,7 +4,7 @@
|
|||
|
||||
## 🎉 News
|
||||
|
||||
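- [X] [2025.06.05]🎯📢LightRAG now integrates MinerU, supporting multimodal document parsing and RAG (PDFs, images, Office documents, tables, formulas, etc.). See the [multimodal processing module](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#multimodal-document-processing-mineru-integration) below for details.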
|
||||
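- [X] [2025.06.05]🎯📢LightRAG now integrates RAG-Anything, supporting comprehensive multimodal document parsing and RAG capabilities (PDFs, images, Office documents, tables, formulas, etc.). See the [multimodal processing module](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#多模态文档处理rag-anything集成) below for details.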
|
||||
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality.
|
||||
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
||||
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which simplifies RAG with small models.
|
||||
|
|
@@ -932,6 +932,94 @@ rag.insert_custom_kg(custom_kg)
|
|||
|
||||
</details>
|
||||
|
||||
## Delete Functions

LightRAG provides comprehensive deletion capabilities, allowing you to delete documents, entities, and relationships.
|
||||
|
||||
<details>
<summary> <b>Delete Entities</b> </summary>

You can delete an entity and all of its associated relationships by entity name:

```python
# Delete an entity and all of its relationships (synchronous version)
rag.delete_by_entity("Google")

# Asynchronous version
await rag.adelete_by_entity("Google")
```

Deleting an entity:
- Removes the entity node from the knowledge graph
- Deletes all of the entity's associated relationships
- Removes the related embedding vectors from the vector database
- Maintains the integrity of the knowledge graph

</details>
|
||||
|
||||
<details>
<summary> <b>Delete Relations</b> </summary>

You can delete the relationship between two specific entities:

```python
# Delete the relationship between two entities (synchronous version)
rag.delete_by_relation("Google", "Gmail")

# Asynchronous version
await rag.adelete_by_relation("Google", "Gmail")
```

Deleting a relationship:
- Removes the specified relationship edge
- Deletes the relationship's embedding vector from the vector database
- Preserves both entity nodes and their other relationships

</details>
|
||||
|
||||
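<details>
<summary> <b>Delete by Document ID</b> </summary>

You can delete an entire document and all of its related knowledge by document ID:

```python
# Delete by document ID (asynchronous version)
await rag.adelete_by_doc_id("doc-12345")
```

Optimized processing when deleting by document ID:
- **Smart Cleanup**: Automatically identifies and deletes entities and relationships that belong only to this document
- **Preserve Shared Knowledge**: Entities or relationships that also exist in other documents are kept, and their descriptions are rebuilt
- **Cache Optimization**: Clears the related LLM cache to reduce storage overhead
- **Incremental Rebuilding**: Rebuilds the descriptions of affected entities and relationships from the remaining documents

The deletion process includes:
1. Deleting all text chunks related to the document
2. Identifying and deleting the entities and relationships that belong only to this document
3. Rebuilding the entities and relationships that still exist in other documents
4. Updating all related vector indexes
5. Cleaning up the document status records

Note: Deleting by document ID is an asynchronous operation because it involves a complex knowledge graph reconstruction process.

</details>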
|
||||
|
||||
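<details>
<summary> <b>Deletion Notes</b> </summary>

**Important Reminders:**

1. **Irreversible Operations**: All deletion operations are irreversible; use them with caution
2. **Performance Considerations**: Deleting large amounts of data may take some time, especially when deleting by document ID
3. **Data Consistency**: Deletion operations automatically maintain consistency between the knowledge graph and the vector database
4. **Backup Recommendations**: Back up your data before performing important deletion operations

**Batch Deletion Recommendations:**
- For batch deletions, use the asynchronous methods for better performance
- For large-scale deletions, consider working in batches to avoid excessive system load

</details>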
|
||||
|
||||
## Entity Merging
|
||||
|
||||
<details>
|
||||
|
|
@@ -1003,31 +1091,59 @@ rag.merge_entities(
|
|||
|
||||
</details>
|
||||
|
||||
## Multimodal Document Processing (MinerU Integration)
|
||||
## Multimodal Document Processing (RAG-Anything Integration)
|
||||
|
||||
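LightRAG now supports multimodal document parsing and retrieval-augmented generation (RAG) through [MinerU](https://github.com/opendatalab/MinerU). You can extract structured content (text, images, tables, formulas, etc.) from PDFs, images, and Office documents and use it in your RAG pipeline.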
|
||||
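LightRAG now integrates seamlessly with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an **all-in-one multimodal document processing RAG system** built specifically for LightRAG. RAG-Anything provides advanced parsing and retrieval-augmented generation (RAG) capabilities, letting you process multimodal documents and extract structured content (including text, images, tables, and formulas) from a variety of document formats for integration into your RAG pipeline.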
|
||||
|
||||
**Key Features:**
|
||||
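- Parses PDFs, images, DOC/DOCX/PPT/PPTX, and other formats
- Extracts and indexes text, images, tables, formulas, and document structure
- Queries and retrieves multimodal content (text, images, tables, formulas) in RAG
- Integrates seamlessly with LightRAG Core and RAGAnything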
|
||||
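- **End-to-End Multimodal Pipeline**: A complete workflow from document ingestion and parsing to intelligent multimodal question answering
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and various other file formats
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical formulas, and heterogeneous content types
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
- **Hybrid Intelligent Retrieval**: Advanced search across textual and multimodal content with contextual understanding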
|
||||
|
||||
**Quick Start:**
|
||||
1. Install dependencies:
|
||||
1. Install RAG-Anything:
|
||||
```bash
|
||||
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
||||
pip install raganything
|
||||
```
|
||||
2. Download the MinerU model weights (see the [MinerU Integration Guide](docs/mineru_integration_zh.md))
3. Process files with the new `MineruParser` or RAGAnything's `process_document_complete`:
|
||||
2. Process multimodal documents:
|
||||
```python
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
||||
# Or detect the file type automatically:
|
||||
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
||||
```
|
||||
4. To query multimodal content with LightRAG, see [docs/mineru_integration_zh.md](docs/mineru_integration_zh.md).
|
||||
import asyncio
from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed

async def main():
    # Initialize RAGAnything with LightRAG integration
    rag = RAGAnything(
        working_dir="./rag_storage",
        llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
            "gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
        ),
        embedding_func=lambda texts: openai_embed(
            texts, model="text-embedding-3-large", api_key="your-api-key"
        ),
        embedding_dim=3072,
    )

    # Process multimodal documents
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output"
    )

    # Query multimodal content
    result = await rag.query_with_multimodal(
        "What are the main findings shown in the charts?",
        mode="hybrid"
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
|
||||
```
|
||||
|
||||
For detailed documentation and advanced usage, see the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
|
||||
|
||||
## Token Usage Tracking
|
||||
|
||||
|
|
|
|||
288
README.md
|
|
@@ -39,7 +39,8 @@
|
|||
</div>
|
||||
|
||||
## 🎉 News
|
||||
- [X] [2025.06.05]🎯📢LightRAG now supports multi-modal data handling through MinerU integration, enabling comprehensive document parsing and RAG capabilities across diverse formats including PDFs, images, Office documents, tables, and formulas. Please refer to the new [multimodal section](https://github.com/HKUDS/LightRAG/?tab=readme-ov-file#multimodal-document-processing-mineru-integration) for details.
|
||||
- [X] [2025.06.16]🎯📢Our team has released [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an All-in-One Multimodal RAG System for seamless text, image, table, and equation processing.
|
||||
- [X] [2025.06.05]🎯📢LightRAG now supports comprehensive multimodal data handling through [RAG-Anything](https://github.com/HKUDS/RAG-Anything) integration, enabling seamless document parsing and RAG capabilities across diverse formats including PDFs, images, Office documents, tables, and formulas. Please refer to the new [multimodal section](https://github.com/HKUDS/LightRAG/?tab=readme-ov-file#multimodal-document-processing-rag-anything-integration) for details.
|
||||
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
|
||||
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
||||
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which makes RAG simpler with small models.
|
||||
|
|
@@ -191,7 +192,7 @@ async def main():
|
|||
rag.insert("Your text")
|
||||
|
||||
# Perform hybrid search
|
||||
mode="hybrid"
|
||||
mode = "hybrid"
|
||||
print(
|
||||
await rag.query(
|
||||
"What are the top themes in this story?",
|
||||
|
|
@@ -987,6 +988,89 @@ These operations maintain data consistency across both the graph database and ve
|
|||
|
||||
</details>
|
||||
|
||||
## Delete Functions
|
||||
|
||||
LightRAG provides comprehensive deletion capabilities, allowing you to delete documents, entities, and relationships.
|
||||
|
||||
<details>
|
||||
<summary> <b>Delete Entities</b> </summary>
|
||||
|
||||
You can delete an entity by name, along with all of its associated relationships:
|
||||
|
||||
```python
|
||||
# Delete entity and all its relationships (synchronous version)
|
||||
rag.delete_by_entity("Google")
|
||||
|
||||
# Asynchronous version
|
||||
await rag.adelete_by_entity("Google")
|
||||
```
|
||||
|
||||
When deleting an entity:
|
||||
- Removes the entity node from the knowledge graph
|
||||
- Deletes all associated relationships
|
||||
- Removes related embedding vectors from the vector database
|
||||
- Maintains knowledge graph integrity
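
The effects listed above can be illustrated with a toy in-memory model. This is a minimal sketch and not LightRAG's implementation: `ToyGraph`, its fields, and the `("entity", name)` / `("relation", edge)` vector keys are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Hypothetical stand-in for a graph store plus a vector store."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)      # (source, target) pairs
    vectors: dict = field(default_factory=dict)  # key -> embedding

    def delete_by_entity(self, name):
        # Collect every edge that touches the entity.
        touching = {edge for edge in self.edges if name in edge}
        # Drop the embeddings of the entity and of each removed edge.
        self.vectors.pop(("entity", name), None)
        for edge in touching:
            self.vectors.pop(("relation", edge), None)
        # Remove the edges, then the node, so no dangling edge remains.
        self.edges -= touching
        self.nodes.discard(name)


graph = ToyGraph(
    nodes={"Google", "Gmail", "Alphabet"},
    edges={("Google", "Gmail"), ("Alphabet", "Google")},
    vectors={
        ("entity", "Google"): [0.1],
        ("entity", "Gmail"): [0.2],
        ("relation", ("Google", "Gmail")): [0.3],
        ("relation", ("Alphabet", "Google")): [0.4],
    },
)
graph.delete_by_entity("Google")
print(sorted(graph.nodes))  # ['Alphabet', 'Gmail']
```

The other entities survive, while every edge and embedding tied to the deleted entity is gone, which is the sense of "integrity" assumed in this sketch.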
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary> <b>Delete Relations</b> </summary>
|
||||
|
||||
You can delete relationships between two specific entities:
|
||||
|
||||
```python
|
||||
# Delete relationship between two entities (synchronous version)
|
||||
rag.delete_by_relation("Google", "Gmail")
|
||||
|
||||
# Asynchronous version
|
||||
await rag.adelete_by_relation("Google", "Gmail")
|
||||
```
|
||||
|
||||
When deleting a relationship:
|
||||
- Removes the specified relationship edge
|
||||
- Deletes the relationship's embedding vector from the vector database
|
||||
- Preserves both entity nodes and their other relationships
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary> <b>Delete by Document ID</b> </summary>
|
||||
|
||||
You can delete an entire document and all of its related knowledge by document ID:
|
||||
|
||||
```python
|
||||
# Delete by document ID (asynchronous version)
|
||||
await rag.adelete_by_doc_id("doc-12345")
|
||||
```
|
||||
|
||||
Optimized processing when deleting by document ID:
|
||||
- **Smart Cleanup**: Automatically identifies and removes entities and relationships that belong only to this document
|
||||
- **Preserve Shared Knowledge**: If entities or relationships exist in other documents, they are preserved and their descriptions are rebuilt
|
||||
- **Cache Optimization**: Clears related LLM cache to reduce storage overhead
|
||||
- **Incremental Rebuilding**: Reconstructs affected entity and relationship descriptions from remaining documents
|
||||
|
||||
The deletion process includes:
|
||||
1. Delete all text chunks related to the document
|
||||
2. Identify and delete entities and relationships that belong only to this document
|
||||
3. Rebuild entities and relationships that still exist in other documents
|
||||
4. Update all related vector indexes
|
||||
5. Clean up document status records
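
Steps 2 and 3 above hinge on telling document-exclusive entities apart from shared ones. The following is a rough sketch of that decision under stated assumptions: `plan_doc_deletion` and the `entity_sources` mapping are hypothetical and do not reflect how LightRAG tracks sources internally.

```python
def plan_doc_deletion(entity_sources, doc_id):
    """Split entities into those to delete and those to rebuild.

    entity_sources maps an entity name to the set of document IDs
    that mention it (a hypothetical bookkeeping structure).
    """
    to_delete, to_rebuild = [], []
    for name, sources in entity_sources.items():
        if doc_id not in sources:
            continue  # untouched by this deletion
        if sources - {doc_id}:
            to_rebuild.append(name)  # still backed by other documents
        else:
            to_delete.append(name)   # belonged only to this document
    return sorted(to_delete), sorted(to_rebuild)


sources = {
    "Google": {"doc-12345", "doc-67890"},
    "Gmail": {"doc-12345"},
    "Android": {"doc-67890"},
}
to_delete, to_rebuild = plan_doc_deletion(sources, "doc-12345")
print(to_delete, to_rebuild)  # ['Gmail'] ['Google']
```

The same partition would apply to relationships; entities in the rebuild list are the ones whose descriptions need to be regenerated from the remaining documents.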
|
||||
|
||||
Note: Deletion by document ID is an asynchronous operation because it involves reconstructing parts of the knowledge graph.
|
||||
|
||||
</details>
|
||||
|
||||
**Important Reminders:**
|
||||
|
||||
1. **Irreversible Operations**: All deletion operations are irreversible; use them with caution
|
||||
2. **Performance Considerations**: Deleting large amounts of data may take some time, especially when deleting by document ID
|
||||
3. **Data Consistency**: Deletion operations automatically maintain consistency between the knowledge graph and vector database
|
||||
4. **Backup Recommendations**: Consider backing up data before performing important deletion operations
|
||||
|
||||
**Batch Deletion Recommendations:**
|
||||
- For batch deletion operations, consider using asynchronous methods for better performance
|
||||
- For large-scale deletions, consider processing in batches to avoid excessive system load
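
One way to follow both recommendations is to run the asynchronous deletions in fixed-size batches. The sketch below is illustrative only: `delete_in_batches` and `fake_delete` are hypothetical helpers, and whether concurrent deletions are safe depends on your storage backend (use `batch_size=1` to run them strictly one at a time).

```python
import asyncio


async def delete_in_batches(delete_one, doc_ids, batch_size=10, pause=0.0):
    """Run delete_one(doc_id) over doc_ids, batch_size at a time."""
    results = []
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        # Run one batch concurrently; collect exceptions instead of aborting.
        results += await asyncio.gather(
            *(delete_one(doc_id) for doc_id in batch),
            return_exceptions=True,
        )
        await asyncio.sleep(pause)  # give the backend room between batches
    return results


async def fake_delete(doc_id):
    # Stand-in for a real call such as rag.adelete_by_doc_id(doc_id).
    await asyncio.sleep(0)
    return doc_id


ids = [f"doc-{i}" for i in range(25)]
done = asyncio.run(delete_in_batches(fake_delete, ids, batch_size=10))
print(len(done))  # 25
```

Because `return_exceptions=True` is set, a failed deletion shows up as an exception object in the result list rather than cancelling the rest of the batch.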
|
||||
|
||||
## Entity Merging
|
||||
|
||||
<details>
|
||||
|
|
@@ -1058,31 +1142,59 @@ When merging entities:
|
|||
|
||||
</details>
|
||||
|
||||
## Multimodal Document Processing (MinerU Integration)
|
||||
## Multimodal Document Processing (RAG-Anything Integration)
|
||||
|
||||
LightRAG now supports comprehensive multi-modal document processing through [MinerU](https://github.com/opendatalab/MinerU) integration, enabling advanced parsing and retrieval-augmented generation (RAG) capabilities. This powerful feature allows you to handle multi-modal documents seamlessly, extracting structured content—including text, images, tables, and formulas—from various document formats for integration into your RAG pipeline.
|
||||
LightRAG now seamlessly integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), a comprehensive **All-in-One Multimodal Document Processing RAG system** built specifically for LightRAG. RAG-Anything enables advanced parsing and retrieval-augmented generation (RAG) capabilities, allowing you to handle multimodal documents seamlessly and extract structured content—including text, images, tables, and formulas—from various document formats for integration into your RAG pipeline.
|
||||
|
||||
**Key Features:**
|
||||
- **Multimodal Document Handling**: Process complex documents containing mixed content types (text, images, tables, formulas)
|
||||
- **Comprehensive Format Support**: Parse PDFs, images, DOC/DOCX/PPT/PPTX, and additional file types
|
||||
- **Multi-Element Extraction**: Extract and index text, images, tables, formulas, and document structure
|
||||
- **Multimodal Retrieval**: Query and retrieve diverse content types (text, images, tables, formulas) within RAG workflows
|
||||
- **Seamless Integration**: Works smoothly with LightRAG core and RAG-Anything frameworks
|
||||
- **End-to-End Multimodal Pipeline**: Complete workflow from document ingestion and parsing to intelligent multimodal query answering
|
||||
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and diverse file formats
|
||||
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical equations, and heterogeneous content types
|
||||
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
|
||||
- **Hybrid Intelligent Retrieval**: Advanced search capabilities spanning textual and multimodal content with contextual understanding
|
||||
|
||||
**Quick Start:**
|
||||
1. Install dependencies:
|
||||
1. Install RAG-Anything:
|
||||
```bash
|
||||
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
||||
pip install raganything
|
||||
```
|
||||
2. Download MinerU model weights (refer to [MinerU Integration Guide](docs/mineru_integration_en.md))
|
||||
3. Process multi-modal documents using the new MineruParser or RAG-Anything's process_document_complete:
|
||||
2. Process multimodal documents:
|
||||
```python
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
||||
# or for any file type:
|
||||
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
||||
import asyncio
from raganything import RAGAnything
from lightrag.llm.openai import openai_complete_if_cache, openai_embed

async def main():
    # Initialize RAGAnything with LightRAG integration
    rag = RAGAnything(
        working_dir="./rag_storage",
        llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
            "gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
        ),
        embedding_func=lambda texts: openai_embed(
            texts, model="text-embedding-3-large", api_key="your-api-key"
        ),
        embedding_dim=3072,
    )

    # Process multimodal documents
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output"
    )

    # Query multimodal content
    result = await rag.query_with_multimodal(
        "What are the main findings shown in the figures and tables?",
        mode="hybrid"
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
|
||||
```
|
||||
4. To query multimodal content with LightRAG, refer to [docs/mineru_integration_en.md](docs/mineru_integration_en.md).
|
||||
|
||||
For detailed documentation and advanced usage, please refer to the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
|
||||
|
||||
## Token Usage Tracking
|
||||
|
||||
|
|
@@ -1225,6 +1337,33 @@ Valid modes are:
|
|||
|
||||
</details>
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Initialization Errors
|
||||
|
||||
If you encounter these errors when using LightRAG:
|
||||
|
||||
1. **`AttributeError: __aenter__`**
|
||||
- **Cause**: Storage backends not initialized
|
||||
- **Solution**: Call `await rag.initialize_storages()` after creating the LightRAG instance
|
||||
|
||||
2. **`KeyError: 'history_messages'`**
|
||||
- **Cause**: Pipeline status not initialized
|
||||
- **Solution**: Call `await initialize_pipeline_status()` after initializing storages
|
||||
|
||||
3. **Both errors in sequence**
|
||||
- **Cause**: Neither initialization method was called
|
||||
- **Solution**: Always follow this pattern:
|
||||
```python
from lightrag import LightRAG
from lightrag.kg.shared_storage import initialize_pipeline_status

rag = LightRAG(...)
await rag.initialize_storages()
await initialize_pipeline_status()
```
|
||||
|
||||
### Model Switching Issues
|
||||
|
||||
When switching between different embedding models, you must clear the data directory to avoid errors. The only file you may want to preserve is `kv_store_llm_response_cache.json` if you wish to retain the LLM cache.
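
Clearing the data directory while keeping the LLM cache could be scripted roughly as follows. This is a sketch under assumptions: `clear_working_dir` is a hypothetical helper, the demo runs on a temporary directory, and `vdb_entities.json` is used only as an example of a file that gets removed. Back up a real working directory before trying anything like this.

```python
import shutil
import tempfile
from pathlib import Path

CACHE_FILE = "kv_store_llm_response_cache.json"


def clear_working_dir(working_dir, keep=(CACHE_FILE,)):
    """Delete everything in working_dir except the names listed in keep."""
    removed = []
    for entry in Path(working_dir).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)


# Demonstrate on a throwaway directory rather than a real working directory.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / CACHE_FILE).write_text("{}")
(demo_dir / "vdb_entities.json").write_text("{}")
removed = clear_working_dir(demo_dir)
print(removed)  # ['vdb_entities.json']
```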
|
||||
|
||||
## LightRAG API
|
||||
|
||||
The LightRAG Server is designed to provide Web UI and API support. **For more information about LightRAG Server, please refer to [LightRAG Server](./lightrag/api/README.md).**
|
||||
|
|
@@ -1490,7 +1629,47 @@ def extract_queries(file_path):
|
|||
|
||||
</details>
|
||||
|
||||
## Star History
|
||||
## 🔗 Related Projects
|
||||
|
||||
*Ecosystem & Extensions*
|
||||
|
||||
<div align="center">
|
||||
<table>
|
||||
<tr>
|
||||
<td align="center">
|
||||
<a href="https://github.com/HKUDS/RAG-Anything">
|
||||
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
|
||||
<span style="font-size: 32px;">📸</span>
|
||||
</div>
|
||||
<b>RAG-Anything</b><br>
|
||||
<sub>Multimodal RAG</sub>
|
||||
</a>
|
||||
</td>
|
||||
<td align="center">
|
||||
<a href="https://github.com/HKUDS/VideoRAG">
|
||||
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
|
||||
<span style="font-size: 32px;">🎥</span>
|
||||
</div>
|
||||
<b>VideoRAG</b><br>
|
||||
<sub>Extreme Long-Context Video RAG</sub>
|
||||
</a>
|
||||
</td>
|
||||
<td align="center">
|
||||
<a href="https://github.com/HKUDS/MiniRAG">
|
||||
<div style="width: 100px; height: 100px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2); display: flex; align-items: center; justify-content: center; margin-bottom: 10px;">
|
||||
<span style="font-size: 32px;">✨</span>
|
||||
</div>
|
||||
<b>MiniRAG</b><br>
|
||||
<sub>Extremely Simple RAG</sub>
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## ⭐ Star History
|
||||
|
||||
<a href="https://star-history.com/#HKUDS/LightRAG&Date">
|
||||
<picture>
|
||||
|
|
@@ -1500,42 +1679,22 @@ def extract_queries(file_path):
|
|||
</picture>
|
||||
</a>
|
||||
|
||||
## Contribution
|
||||
## 🤝 Contribution
|
||||
|
||||
Thank you to all our contributors!
|
||||
<div align="center">
|
||||
We thank all our contributors for their valuable contributions.
|
||||
</div>
|
||||
|
||||
<a href="https://github.com/HKUDS/LightRAG/graphs/contributors">
|
||||
<img src="https://contrib.rocks/image?repo=HKUDS/LightRAG" />
|
||||
</a>
|
||||
<div align="center">
|
||||
<a href="https://github.com/HKUDS/LightRAG/graphs/contributors">
|
||||
<img src="https://contrib.rocks/image?repo=HKUDS/LightRAG" style="border-radius: 15px; box-shadow: 0 0 20px rgba(0, 217, 255, 0.3);" />
|
||||
</a>
|
||||
</div>
|
||||
|
||||
## Troubleshooting
|
||||
---
|
||||
|
||||
### Common Initialization Errors
|
||||
|
||||
If you encounter these errors when using LightRAG:
|
||||
|
||||
1. **`AttributeError: __aenter__`**
|
||||
- **Cause**: Storage backends not initialized
|
||||
- **Solution**: Call `await rag.initialize_storages()` after creating the LightRAG instance
|
||||
|
||||
2. **`KeyError: 'history_messages'`**
|
||||
- **Cause**: Pipeline status not initialized
|
||||
- **Solution**: Call `await initialize_pipeline_status()` after initializing storages
|
||||
|
||||
3. **Both errors in sequence**
|
||||
- **Cause**: Neither initialization method was called
|
||||
- **Solution**: Always follow this pattern:
|
||||
```python
|
||||
rag = LightRAG(...)
|
||||
await rag.initialize_storages()
|
||||
await initialize_pipeline_status()
|
||||
```
|
||||
|
||||
### Model Switching Issues
|
||||
|
||||
When switching between different embedding models, you must clear the data directory to avoid errors. The only file you may want to preserve is `kv_store_llm_response_cache.json` if you wish to retain the LLM cache.
|
||||
|
||||
## 🌟Citation
|
||||
## 📖 Citation
|
||||
|
||||
```python
|
||||
@article{guo2024lightrag,
|
||||
|
|
@@ -1548,4 +1707,31 @@ primaryClass={cs.IR}
|
|||
}
|
||||
```
|
||||
|
||||
**Thank you for your interest in our work!**
|
||||
---
|
||||
|
||||
<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 15px; padding: 30px; margin: 30px 0;">
|
||||
<div>
|
||||
<img src="https://user-images.githubusercontent.com/74038190/212284100-561aa473-3905-4a80-b561-0d28506553ee.gif" width="500">
|
||||
</div>
|
||||
<div style="margin-top: 20px;">
|
||||
<a href="https://github.com/HKUDS/LightRAG" style="text-decoration: none;">
|
||||
<img src="https://img.shields.io/badge/⭐%20Star%20us%20on%20GitHub-1a1a2e?style=for-the-badge&logo=github&logoColor=white">
|
||||
</a>
|
||||
<a href="https://github.com/HKUDS/LightRAG/issues" style="text-decoration: none;">
|
||||
<img src="https://img.shields.io/badge/🐛%20Report%20Issues-ff6b6b?style=for-the-badge&logo=github&logoColor=white">
|
||||
</a>
|
||||
<a href="https://github.com/HKUDS/LightRAG/discussions" style="text-decoration: none;">
|
||||
<img src="https://img.shields.io/badge/💬%20Discussions-4ecdc4?style=for-the-badge&logo=github&logoColor=white">
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div align="center">
|
||||
<div style="width: 100%; max-width: 600px; margin: 20px auto; padding: 20px; background: linear-gradient(135deg, rgba(0, 217, 255, 0.1) 0%, rgba(0, 217, 255, 0.05) 100%); border-radius: 15px; border: 1px solid rgba(0, 217, 255, 0.2);">
|
||||
<div style="display: flex; justify-content: center; align-items: center; gap: 15px;">
|
||||
<span style="font-size: 24px;">⭐</span>
|
||||
<span style="color: #00d9ff; font-size: 18px;">Thank you for visiting LightRAG!</span>
|
||||
<span style="font-size: 24px;">⭐</span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
|
|
|||
|
|
@@ -1,360 +0,0 @@
|
|||
# MinerU Integration Guide
|
||||
|
||||
### About MinerU
|
||||
|
||||
MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and office documents. It provides the following features:
|
||||
|
||||
- Text extraction while preserving document structure (headings, paragraphs, lists, etc.)
|
||||
- Handling complex layouts including multi-column formats
|
||||
- Automatic formula recognition and conversion to LaTeX format
|
||||
- Image, table, and footnote extraction
|
||||
- Automatic scanned document detection and OCR application
|
||||
- Support for multiple output formats (Markdown, JSON)
|
||||
|
||||
### Installation
|
||||
|
||||
#### Installing MinerU Dependencies
|
||||
|
||||
If you have already installed LightRAG but don't have MinerU support, you can add MinerU support by installing the magic-pdf package directly:
|
||||
|
||||
```bash
|
||||
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
||||
```
|
||||
|
||||
These are the MinerU-related dependencies required by LightRAG.
|
||||
|
||||
#### MinerU Model Weights
|
||||
|
||||
MinerU requires model weight files to function properly. After installation, you need to download the required model weights. You can use either Hugging Face or ModelScope to download the models.
|
||||
|
||||
##### Option 1: Download from Hugging Face
|
||||
|
||||
```bash
|
||||
pip install huggingface_hub
|
||||
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
||||
python download_models_hf.py
|
||||
```
|
||||
|
||||
##### Option 2: Download from ModelScope (Recommended for users in China)
|
||||
|
||||
```bash
|
||||
pip install modelscope
|
||||
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
||||
python download_models.py
|
||||
```
|
||||
|
||||
Both methods will automatically download the model files and configure the model directory in the configuration file. The configuration file is located in your user directory and named `magic-pdf.json`.
|
||||
|
||||
> **Note for Windows users**: User directory is at `C:\Users\username`
|
||||
> **Note for Linux users**: User directory is at `/home/username`
|
||||
> **Note for macOS users**: User directory is at `/Users/username`
|
||||
|
||||
#### Optional: LibreOffice Installation
|
||||
|
||||
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
|
||||
|
||||
**Linux/macOS:**
|
||||
```bash
|
||||
apt-get/yum/brew install libreoffice
|
||||
```
|
||||
|
||||
**Windows:**
|
||||
1. Install LibreOffice
|
||||
2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
|
||||
|
||||
### Using MinerU Parser
|
||||
|
||||
#### Basic Usage
|
||||
|
||||
```python
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
|
||||
# Parse a PDF document
|
||||
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
||||
|
||||
# Parse an image
|
||||
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
||||
|
||||
# Parse an Office document
|
||||
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
||||
|
||||
# Auto-detect and parse any supported document type
|
||||
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
||||
```
|
||||
|
||||
#### RAGAnything Integration
|
||||
|
||||
In RAGAnything, you can directly use file paths as input to the `process_document_complete` method to process documents. Here's a complete configuration example:
|
||||
|
||||
```python
|
||||
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
||||
from lightrag.raganything import RAGAnything
|
||||
|
||||
|
||||
# Initialize RAGAnything
|
||||
rag = RAGAnything(
|
||||
working_dir="./rag_storage", # Working directory
|
||||
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
||||
"gpt-4o-mini", # Model to use
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key", # Replace with your API key
|
||||
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
),
|
||||
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
||||
"gpt-4o", # Vision model
|
||||
"",
|
||||
system_prompt=None,
|
||||
history_messages=[],
|
||||
messages=[
|
||||
{"role": "system", "content": system_prompt} if system_prompt else None,
|
||||
{"role": "user", "content": [
|
||||
{"type": "text", "text": prompt},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": f"data:image/jpeg;base64,{image_data}"
|
||||
}
|
||||
}
|
||||
]} if image_data else {"role": "user", "content": prompt}
|
||||
],
|
||||
api_key="your-api-key", # Replace with your API key
|
||||
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
) if image_data else openai_complete_if_cache(
|
||||
"gpt-4o-mini",
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key", # Replace with your API key
|
||||
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
),
|
||||
embedding_func=lambda texts: openai_embed(
|
||||
texts,
|
||||
model="text-embedding-3-large",
|
||||
api_key="your-api-key", # Replace with your API key
|
||||
base_url="your-base-url", # Replace with your API base URL
|
||||
),
|
||||
embedding_dim=3072,
|
||||
max_token_size=8192
|
||||
)
|
||||
|
||||
# Process a single file
|
||||
await rag.process_document_complete(
|
||||
file_path="path/to/document.pdf",
|
||||
output_dir="./output",
|
||||
parse_method="auto"
|
||||
)
|
||||
|
||||
# Query the processed document
|
||||
result = await rag.query_with_multimodal(
|
||||
"What is the main content of the document?",
|
||||
mode="hybrid"
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
MinerU categorizes document content into text, formulas, images, and tables, processing each with its corresponding ingestion type:
|
||||
- Text content: `ingestion_type='text'`
|
||||
- Image content: `ingestion_type='image'`
|
||||
- Table content: `ingestion_type='table'`
|
||||
- Formula content: `ingestion_type='equation'`
|
||||
|
||||
#### Query Examples
|
||||
|
||||
Here are some common query examples:
|
||||
|
||||
```python
|
||||
# Query text content
|
||||
result = await rag.query_with_multimodal(
|
||||
"What is the main topic of the document?",
|
||||
mode="hybrid"
|
||||
)
|
||||
|
||||
# Query image-related content
|
||||
result = await rag.query_with_multimodal(
|
||||
"Describe the images and figures in the document",
|
||||
mode="hybrid"
|
||||
)
|
||||
|
||||
# Query table-related content
|
||||
result = await rag.query_with_multimodal(
|
||||
"Tell me about the experimental results and data tables",
|
||||
mode="hybrid"
|
||||
)
|
||||
```
|
||||
|
||||
#### Command Line Tool
|
||||
|
||||
We also provide a command-line tool for document parsing:
|
||||
|
||||
```bash
|
||||
python examples/mineru_example.py path/to/document.pdf
|
||||
```
|
||||
|
||||
Optional parameters:
|
||||
- `--output` or `-o`: Specify output directory
|
||||
- `--method` or `-m`: Choose parsing method (auto, ocr, txt)
|
||||
- `--stats`: Display content statistics
|
||||
|
||||
### Output Format
|
||||
|
||||
MinerU generates three files for each parsed document:
|
||||
|
||||
1. `{filename}.md` - Markdown representation of the document
|
||||
2. `{filename}_content_list.json` - Structured JSON content
|
||||
3. `{filename}_model.json` - Detailed model parsing results
|
||||
|
||||
The `content_list.json` file contains all structured content extracted from the document, including:
|
||||
- Text blocks (body text, headings, etc.)
|
||||
- Images (paths and optional captions)
|
||||
- Tables (table content and optional captions)
|
||||
- Lists
|
||||
- Formulas
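
A quick way to inspect such a file is to count its entries per content type. The sketch below assumes each entry is a JSON object with a `type` key; that key, the other field names in the sample, and the `summarize_content_list` helper are assumptions made for illustration, so check them against a real `{filename}_content_list.json` before relying on them.

```python
import json
from collections import Counter


def summarize_content_list(items):
    """Count entries per content type in a parsed content list."""
    return Counter(item.get("type", "unknown") for item in items)


# Inline sample standing in for json.load() on a *_content_list.json file.
sample = json.loads("""
[
  {"type": "text", "text": "Introduction"},
  {"type": "image", "img_path": "images/fig1.jpg"},
  {"type": "table", "table_body": "<table></table>"},
  {"type": "text", "text": "Conclusion"}
]
""")
counts = summarize_content_list(sample)
print(counts["text"], counts["image"], counts["table"])  # 2 1 1
```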
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
If you encounter issues with MinerU:
|
||||
|
||||
1. Check that model weights are correctly downloaded
|
||||
2. Ensure you have sufficient RAM (16GB+ recommended)
|
||||
3. For CUDA acceleration issues, see [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
|
||||
4. If parsing Office documents fails, verify LibreOffice is properly installed
|
||||
5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, it might be due to an incomplete model download. Try re-downloading the models.
|
||||
6. For users with newer graphics cards (H100, etc.) and garbled OCR text, try upgrading the CUDA version used by Paddle:
|
||||
```bash
|
||||
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
|
||||
```
|
||||
7. If you encounter a "filename too long" error, update to the latest version of MineruParser, which includes logic to handle this automatically.
|
||||
|
||||
#### Updating Existing Models
|
||||
|
||||
If you have previously downloaded models and need to update them, you can simply run the download script again. The script will update the model directory to the latest version.
|
||||
|
||||
### Advanced Configuration
|
||||
|
||||
The MinerU configuration file `magic-pdf.json` supports various customization options, including:
|
||||
|
||||
- Model directory path
|
||||
- OCR engine selection
|
||||
- GPU acceleration settings
|
||||
- Cache settings
|
||||
|
||||
For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
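
To see which options an installation currently sets, the file can be inspected directly. The sketch below assumes `magic-pdf.json` sits in the user's home directory and only lists whatever top-level keys it finds; it makes no assumption about what those keys are. The helper name `load_mineru_config` is hypothetical.

```python
import json
from pathlib import Path


def load_mineru_config(config_path=None):
    """Return the parsed config, or None if the file does not exist.

    Hypothetical helper: defaults to ~/magic-pdf.json, an assumed location.
    """
    path = Path(config_path) if config_path else Path.home() / "magic-pdf.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


config = load_mineru_config()
if config is None:
    print("magic-pdf.json not found in the home directory")
else:
    # Print each top-level option and its current value
    for key in sorted(config):
        print(f"{key}: {config[key]!r}")
```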
|
||||
|
||||
### Using Modal Processors Directly
|
||||
|
||||
You can also use LightRAG's modal processors directly without going through MinerU. This is useful when you want to process specific types of content or have more control over the processing pipeline.
|
||||
|
||||
Each modal processor returns a tuple containing:
|
||||
1. A description of the processed content
|
||||
2. Entity information that can be used for further processing or storage
|
||||
|
||||
The processors support different types of content:
|
||||
- `ImageModalProcessor`: Processes images with captions and footnotes
|
||||
- `TableModalProcessor`: Processes tables with captions and footnotes
|
||||
- `EquationModalProcessor`: Processes mathematical equations in LaTeX format
|
||||
- `GenericModalProcessor`: A base processor that can be extended for custom content types
|
||||
|
||||
> **Note**: A complete working example can be found in `examples/modalprocessors_example.py`. You can run it using:
|
||||
> ```bash
|
||||
> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
|
||||
> ```
|
||||
|
||||
<details>
|
||||
<summary> Here's an example of how to use different modal processors: </summary>
|
||||
|
||||
```python
|
||||
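from lightrag import LightRAG
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.modalprocessors import (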
|
||||
ImageModalProcessor,
|
||||
TableModalProcessor,
|
||||
EquationModalProcessor,
|
||||
GenericModalProcessor
|
||||
)
|
||||
|
||||
# Initialize LightRAG
|
||||
lightrag = LightRAG(
|
||||
working_dir="./rag_storage",
|
||||
embedding_func=lambda texts: openai_embed(
|
||||
texts,
|
||||
model="text-embedding-3-large",
|
||||
api_key="your-api-key",
|
||||
base_url="your-base-url",
|
||||
),
|
||||
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
||||
"gpt-4o-mini",
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key",
|
||||
base_url="your-base-url",
|
||||
**kwargs,
|
||||
),
|
||||
)
|
||||
|
||||
# Process an image (vision_model_func is assumed to be a vision-capable model function defined beforehand)
|
||||
image_processor = ImageModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=vision_model_func
|
||||
)
|
||||
|
||||
image_content = {
|
||||
"img_path": "image.jpg",
|
||||
"img_caption": ["Example image caption"],
|
||||
"img_footnote": ["Example image footnote"]
|
||||
}
|
||||
|
||||
description, entity_info = await image_processor.process_multimodal_content(
|
||||
modal_content=image_content,
|
||||
content_type="image",
|
||||
file_path="image_example.jpg",
|
||||
entity_name="Example Image"
|
||||
)
|
||||
|
||||
# Process a table (llm_model_func is assumed to be a text LLM function defined beforehand)
|
||||
table_processor = TableModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=llm_model_func
|
||||
)
|
||||
|
||||
table_content = {
|
||||
"table_body": """
|
||||
| Name | Age | Occupation |
|
||||
|------|-----|------------|
|
||||
| John | 25 | Engineer |
|
||||
| Mary | 30 | Designer |
|
||||
""",
|
||||
"table_caption": ["Employee Information Table"],
|
||||
"table_footnote": ["Data updated as of 2024"]
|
||||
}
|
||||
|
||||
description, entity_info = await table_processor.process_multimodal_content(
|
||||
modal_content=table_content,
|
||||
content_type="table",
|
||||
file_path="table_example.md",
|
||||
entity_name="Employee Table"
|
||||
)
|
||||
|
||||
# Process an equation
|
||||
equation_processor = EquationModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=llm_model_func
|
||||
)
|
||||
|
||||
equation_content = {
|
||||
"text": "E = mc^2",
|
||||
"text_format": "LaTeX"
|
||||
}
|
||||
|
||||
description, entity_info = await equation_processor.process_multimodal_content(
|
||||
modal_content=equation_content,
|
||||
content_type="equation",
|
||||
file_path="equation_example.txt",
|
||||
entity_name="Mass-Energy Equivalence"
|
||||
)
|
||||
```
|
||||
|
||||
</details>
|
||||
|
|
@ -1,358 +0,0 @@
|
|||
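# MinerU Integration Guide

### About MinerU

MinerU is a powerful open-source tool for extracting high-quality structured data from PDFs, images, and Office documents. It provides the following features:

- Text extraction that preserves document structure (headings, paragraphs, lists, etc.)
- Handling of complex layouts, including multi-column formats
- Automatic formula recognition and conversion to LaTeX
- Extraction of images, tables, and footnotes
- Automatic detection of scanned documents, with OCR applied
- Support for multiple output formats (Markdown, JSON)

### Installation

#### Installing MinerU Dependencies

If you have already installed LightRAG without MinerU support, you can add it directly by installing the magic-pdf package: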
|
||||
|
||||
```bash
|
||||
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
||||
```
|
||||
|
||||
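These are the MinerU-related dependencies required by LightRAG.

#### MinerU Model Weights

MinerU requires model weight files to run properly. After installation, you need to download the required model weights. You can download them from either Hugging Face or ModelScope.

##### Option 1: Download from Hugging Face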
|
||||
|
||||
```bash
|
||||
pip install huggingface_hub
|
||||
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
||||
python download_models_hf.py
|
||||
```
|
||||
|
||||
##### Option 2: Download from ModelScope (recommended for users in China)
|
||||
|
||||
```bash
|
||||
pip install modelscope
|
||||
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
||||
python download_models.py
|
||||
```
|
||||
|
||||
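Both methods automatically download the model files and set the model directory in the configuration file. The configuration file is named `magic-pdf.json` and is located in your user directory.

> **Note for Windows users**: The user directory is `C:\Users\username`
> **Note for Linux users**: The user directory is `/home/username`
> **Note for macOS users**: The user directory is `/Users/username`

#### Optional: Install LibreOffice

To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice: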
|
||||
|
||||
**Linux/macOS:**
|
||||
```bash
|
||||
apt-get/yum/brew install libreoffice
|
||||
```
|
||||
|
||||
**Windows:**
|
||||
1. Install LibreOffice
2. Add the installation directory to your PATH environment variable: `install_dir\LibreOffice\program`

### Using the MinerU Parser

#### Basic Usage
|
||||
|
||||
```python
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
|
||||
# Parse a PDF document
|
||||
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
||||
|
||||
# Parse an image
|
||||
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
||||
|
||||
# Parse an Office document
|
||||
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
||||
|
||||
# Auto-detect and parse any supported document type
|
||||
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
||||
```
|
||||
|
||||
#### RAGAnything Integration

In RAGAnything, you can pass a file path directly to the `process_document_complete` method to process a document. Here is a complete configuration example:
|
||||
|
||||
```python
|
||||
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
||||
from lightrag.raganything import RAGAnything
|
||||
|
||||
|
||||
# Initialize RAGAnything
|
||||
rag = RAGAnything(
|
||||
working_dir="./rag_storage", # Working directory
|
||||
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
"gpt-4o-mini", # Model to use
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
),
|
||||
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
"gpt-4o", # Vision model
|
||||
"",
|
||||
system_prompt=None,
|
||||
history_messages=[],
|
||||
messages=[
|
||||
{"role": "system", "content": system_prompt} if system_prompt else None,
|
||||
{"role": "user", "content": [
|
||||
{"type": "text", "text": prompt},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": f"data:image/jpeg;base64,{image_data}"
|
||||
}
|
||||
}
|
||||
]} if image_data else {"role": "user", "content": prompt}
|
||||
],
|
||||
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
) if image_data else openai_complete_if_cache(
|
||||
"gpt-4o-mini",
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
|
||||
**kwargs,
|
||||
),
|
||||
embedding_func=lambda texts: openai_embed(
|
||||
texts,
|
||||
model="text-embedding-3-large",
|
||||
api_key="your-api-key", # Replace with your API key
base_url="your-base-url", # Replace with your API base URL
|
||||
),
|
||||
embedding_dim=3072,
|
||||
max_token_size=8192
|
||||
)
|
||||
|
||||
# Process a single file
|
||||
await rag.process_document_complete(
|
||||
file_path="path/to/document.pdf",
|
||||
output_dir="./output",
|
||||
parse_method="auto"
|
||||
)
|
||||
|
||||
# Query the processed document
|
||||
result = await rag.query_with_multimodal(
|
||||
"What is the main content of the document?",
|
||||
mode="hybrid"
|
||||
)
|
||||
```
|
||||
|
||||
MinerU classifies document content into text, formulas, images, and tables, and processes each with the corresponding ingestion type:

- Text content: `ingestion_type='text'`
- Image content: `ingestion_type='image'`
- Table content: `ingestion_type='table'`
- Formula content: `ingestion_type='equation'`

#### Query Examples

Here are some common query examples:
|
||||
|
||||
```python
|
||||
# Query text content
|
||||
result = await rag.query_with_multimodal(
|
||||
"What is the main topic of the document?",
|
||||
mode="hybrid"
|
||||
)
|
||||
|
||||
# Query image-related content
|
||||
result = await rag.query_with_multimodal(
|
||||
"Describe the images and figures in the document",
|
||||
mode="hybrid"
|
||||
)
|
||||
|
||||
# Query table-related content
|
||||
result = await rag.query_with_multimodal(
|
||||
"Tell me about the experimental results and data tables",
|
||||
mode="hybrid"
|
||||
)
|
||||
```
|
||||
|
||||
#### Command Line Tool

We also provide a command-line tool for document parsing:
|
||||
|
||||
```bash
|
||||
python examples/mineru_example.py path/to/document.pdf
|
||||
```
|
||||
|
||||
Optional parameters:

- `--output` or `-o`: Specify the output directory
- `--method` or `-m`: Choose the parsing method (auto, ocr, txt)
- `--stats`: Display content statistics

### Output Format

MinerU generates three files for each parsed document:

1. `{filename}.md` - Markdown representation of the document
2. `{filename}_content_list.json` - Structured JSON content
3. `{filename}_model.json` - Detailed model parsing results

The `content_list.json` file contains all structured content extracted from the document, including:

- Text blocks (body text, headings, etc.)
- Images (paths and optional captions)
- Tables (table content and optional captions)
- Lists
- Formulas

### Troubleshooting

If you encounter issues with MinerU:

1. Check that the model weights were downloaded correctly
2. Ensure you have sufficient RAM (16GB+ recommended)
3. For CUDA acceleration issues, see the [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
4. If parsing Office documents fails, verify that LibreOffice is installed correctly
5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, the model download may be incomplete. Try re-downloading the models.
6. If OCR text comes out garbled on newer graphics cards (H100, etc.), try upgrading the CUDA version used by Paddle:
|
||||
```bash
|
||||
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
|
||||
```
|
||||
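7. If you encounter a "filename too long" error, update to the latest version of MineruParser, which includes logic to handle this automatically.

#### Updating Existing Models

If you previously downloaded the models and need to update them, simply run the download script again. The script updates the model directory to the latest version.

### Advanced Configuration

The MinerU configuration file `magic-pdf.json` supports various customization options, including:

- Model directory path
- OCR engine selection
- GPU acceleration settings
- Cache settings

For the complete set of configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).

### Using Modal Processors Directly

You can also use LightRAG's modal processors directly, without going through MinerU. This is useful when you want to process specific types of content or need more control over the processing pipeline.

Each modal processor returns a tuple containing:

1. A description of the processed content
2. Entity information that can be used for further processing or storage

The processors support different types of content:

- `ImageModalProcessor`: Processes images with captions and footnotes
- `TableModalProcessor`: Processes tables with captions and footnotes
- `EquationModalProcessor`: Processes mathematical equations in LaTeX format
- `GenericModalProcessor`: A base processor that can be extended for custom content types

> **Note**: A complete working example can be found in `examples/modalprocessors_example.py`. You can run it with: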
|
||||
> ```bash
|
||||
> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
|
||||
> ```
|
||||
|
||||
<details>
|
||||
<summary> Here's an example of how to use different modal processors: </summary>
|
||||
|
||||
```python
|
||||
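from lightrag import LightRAG
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.modalprocessors import (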
|
||||
ImageModalProcessor,
|
||||
TableModalProcessor,
|
||||
EquationModalProcessor,
|
||||
GenericModalProcessor
|
||||
)
|
||||
|
||||
# Initialize LightRAG
|
||||
lightrag = LightRAG(
|
||||
working_dir="./rag_storage",
|
||||
embedding_func=lambda texts: openai_embed(
|
||||
texts,
|
||||
model="text-embedding-3-large",
|
||||
api_key="your-api-key",
|
||||
base_url="your-base-url",
|
||||
),
|
||||
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
||||
"gpt-4o-mini",
|
||||
prompt,
|
||||
system_prompt=system_prompt,
|
||||
history_messages=history_messages,
|
||||
api_key="your-api-key",
|
||||
base_url="your-base-url",
|
||||
**kwargs,
|
||||
),
|
||||
)
|
||||
|
||||
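# Process an image (vision_model_func is assumed to be a vision-capable model function defined beforehand)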
|
||||
image_processor = ImageModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=vision_model_func
|
||||
)
|
||||
|
||||
image_content = {
    "img_path": "image.jpg",
    "img_caption": ["Example image caption"],
    "img_footnote": ["Example image footnote"]
}
|
||||
|
||||
description, entity_info = await image_processor.process_multimodal_content(
|
||||
modal_content=image_content,
|
||||
content_type="image",
|
||||
file_path="image_example.jpg",
|
||||
entity_name="Example Image"
|
||||
)
|
||||
|
||||
# Process a table (llm_model_func is assumed to be a text LLM function defined beforehand)
|
||||
table_processor = TableModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=llm_model_func
|
||||
)
|
||||
|
||||
table_content = {
    "table_body": """
    | Name | Age | Occupation |
    |------|-----|------------|
    | Zhang San | 25 | Engineer |
    | Li Si | 30 | Designer |
    """,
    "table_caption": ["Employee Information Table"],
    "table_footnote": ["Data updated as of 2024"]
}
|
||||
|
||||
description, entity_info = await table_processor.process_multimodal_content(
|
||||
modal_content=table_content,
|
||||
content_type="table",
|
||||
file_path="table_example.md",
|
||||
entity_name="Employee Table"
|
||||
)
|
||||
|
||||
# Process an equation
|
||||
equation_processor = EquationModalProcessor(
|
||||
lightrag=lightrag,
|
||||
modal_caption_func=llm_model_func
|
||||
)
|
||||
|
||||
equation_content = {
|
||||
"text": "E = mc^2",
|
||||
"text_format": "LaTeX"
|
||||
}
|
||||
|
||||
description, entity_info = await equation_processor.process_multimodal_content(
|
||||
modal_content=equation_content,
|
||||
content_type="equation",
|
||||
file_path="equation_example.txt",
|
||||
entity_name="Mass-Energy Equivalence"
|
||||
)
|
||||
```
|
||||
</details>
|
||||
|
|
@ -1,85 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
"""
|
||||
Example script demonstrating the basic usage of MinerU parser
|
||||
|
||||
This example shows how to:
|
||||
1. Parse different types of documents (PDF, images, office documents)
|
||||
2. Use different parsing methods
|
||||
3. Display document statistics
|
||||
"""
|
||||
|
||||
import os
|
||||
import argparse
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
|
||||
|
||||
def parse_document(
|
||||
file_path: str, output_dir: str = None, method: str = "auto", stats: bool = False
|
||||
):
|
||||
"""
|
||||
Parse a document using MinerU parser
|
||||
|
||||
Args:
|
||||
file_path: Path to the document
|
||||
output_dir: Output directory for parsed results
|
||||
method: Parsing method (auto, ocr, txt)
|
||||
stats: Whether to display content statistics
|
||||
"""
|
||||
try:
|
||||
# Parse the document
|
||||
content_list, md_content = MineruParser.parse_document(
|
||||
file_path=file_path, parse_method=method, output_dir=output_dir
|
||||
)
|
||||
|
||||
# Display statistics if requested
|
||||
if stats:
|
||||
print("\nDocument Statistics:")
|
||||
print(f"Total content blocks: {len(content_list)}")
|
||||
|
||||
# Count different types of content
|
||||
content_types = {}
|
||||
for item in content_list:
|
||||
content_type = item.get("type", "unknown")
|
||||
content_types[content_type] = content_types.get(content_type, 0) + 1
|
||||
|
||||
print("\nContent Type Distribution:")
|
||||
for content_type, count in content_types.items():
|
||||
print(f"- {content_type}: {count}")
|
||||
|
||||
return content_list, md_content
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error parsing document: {str(e)}")
|
||||
return None, None
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function to run the example"""
|
||||
parser = argparse.ArgumentParser(description="MinerU Parser Example")
|
||||
parser.add_argument("file_path", help="Path to the document to parse")
|
||||
parser.add_argument("--output", "-o", help="Output directory path")
|
||||
parser.add_argument(
|
||||
"--method",
|
||||
"-m",
|
||||
choices=["auto", "ocr", "txt"],
|
||||
default="auto",
|
||||
help="Parsing method (auto, ocr, txt)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--stats", action="store_true", help="Display content statistics"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Create output directory if specified
|
||||
if args.output:
|
||||
os.makedirs(args.output, exist_ok=True)
|
||||
|
||||
# Parse document
|
||||
content_list, md_content = parse_document(
|
||||
args.file_path, args.output, args.method, args.stats
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -9,7 +9,7 @@ import argparse
|
|||
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
||||
from lightrag.kg.shared_storage import initialize_pipeline_status
|
||||
from lightrag import LightRAG
|
||||
from lightrag.modalprocessors import (
|
||||
from raganything.modalprocessors import (
|
||||
ImageModalProcessor,
|
||||
TableModalProcessor,
|
||||
EquationModalProcessor,
|
||||
|
|
|
|||
|
|
@ -12,7 +12,7 @@ import os
|
|||
import argparse
|
||||
import asyncio
|
||||
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
||||
from lightrag.raganything import RAGAnything
|
||||
from raganything.raganything import RAGAnything
|
||||
|
||||
|
||||
async def process_with_rag(
|
||||
|
|
|
|||
|
|
@ -1 +1 @@
|
|||
__api_version__ = "0173"
|
||||
__api_version__ = "0174"
|
||||
|
|
|
|||
|
|
@ -355,7 +355,13 @@ def create_app(args):
|
|||
)
|
||||
|
||||
# Add routes
|
||||
app.include_router(create_document_routes(rag, doc_manager, api_key))
|
||||
app.include_router(
|
||||
create_document_routes(
|
||||
rag,
|
||||
doc_manager,
|
||||
api_key,
|
||||
)
|
||||
)
|
||||
app.include_router(create_query_routes(rag, api_key, args.top_k))
|
||||
app.include_router(create_graph_routes(rag, api_key))
|
||||
|
||||
|
|
|
|||
|
|
@ -12,11 +12,18 @@ import pipmaster as pm
|
|||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional, Any, Literal
|
||||
from fastapi import APIRouter, BackgroundTasks, Depends, File, HTTPException, UploadFile
|
||||
from fastapi import (
|
||||
APIRouter,
|
||||
BackgroundTasks,
|
||||
Depends,
|
||||
File,
|
||||
HTTPException,
|
||||
UploadFile,
|
||||
)
|
||||
from pydantic import BaseModel, Field, field_validator
|
||||
|
||||
from lightrag import LightRAG
|
||||
from lightrag.base import DocProcessingStatus, DocStatus
|
||||
from lightrag.base import DeletionResult, DocProcessingStatus, DocStatus
|
||||
from lightrag.api.utils_api import get_combined_auth_dependency
|
||||
from ..config import global_args
|
||||
|
||||
|
|
@ -252,6 +259,40 @@ Attributes:
|
|||
"""
|
||||
|
||||
|
||||
class DeleteDocRequest(BaseModel):
|
||||
doc_id: str = Field(..., description="The ID of the document to delete.")
|
||||
|
||||
@field_validator("doc_id", mode="after")
|
||||
@classmethod
|
||||
def validate_doc_id(cls, doc_id: str) -> str:
|
||||
if not doc_id or not doc_id.strip():
|
||||
raise ValueError("Document ID cannot be empty")
|
||||
return doc_id.strip()
|
||||
|
||||
|
||||
class DeleteEntityRequest(BaseModel):
|
||||
entity_name: str = Field(..., description="The name of the entity to delete.")
|
||||
|
||||
@field_validator("entity_name", mode="after")
|
||||
@classmethod
|
||||
def validate_entity_name(cls, entity_name: str) -> str:
|
||||
if not entity_name or not entity_name.strip():
|
||||
raise ValueError("Entity name cannot be empty")
|
||||
return entity_name.strip()
|
||||
|
||||
|
||||
class DeleteRelationRequest(BaseModel):
|
||||
source_entity: str = Field(..., description="The name of the source entity.")
|
||||
target_entity: str = Field(..., description="The name of the target entity.")
|
||||
|
||||
@field_validator("source_entity", "target_entity", mode="after")
|
||||
@classmethod
|
||||
def validate_entity_names(cls, entity_name: str) -> str:
|
||||
if not entity_name or not entity_name.strip():
|
||||
raise ValueError("Entity name cannot be empty")
|
||||
return entity_name.strip()
|
||||
|
||||
|
||||
class DocStatusResponse(BaseModel):
|
||||
id: str = Field(description="Document identifier")
|
||||
content_summary: str = Field(description="Summary of document content")
|
||||
|
|
@ -1318,6 +1359,119 @@ def create_document_routes(
|
|||
logger.error(traceback.format_exc())
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
|
||||
class DeleteDocByIdResponse(BaseModel):
|
||||
"""Response model for single document deletion operation."""
|
||||
|
||||
status: Literal["success", "fail", "not_found", "busy"] = Field(
|
||||
description="Status of the deletion operation"
|
||||
)
|
||||
message: str = Field(description="Message describing the operation result")
|
||||
doc_id: Optional[str] = Field(
|
||||
default=None, description="The ID of the document."
|
||||
)
|
||||
|
||||
@router.delete(
|
||||
"/delete_document",
|
||||
response_model=DeleteDocByIdResponse,
|
||||
dependencies=[Depends(combined_auth)],
|
||||
summary="Delete a document and all its associated data by its ID.",
|
||||
)
|
||||
|
||||
# TODO This method needs to be modified to be asynchronous (please do not use)
|
||||
async def delete_document(
|
||||
delete_request: DeleteDocRequest,
|
||||
) -> DeleteDocByIdResponse:
|
||||
"""
|
||||
This method needs to be modified to be asynchronous (please do not use)
|
||||
|
||||
Deletes a specific document and all its associated data, including its status,
|
||||
text chunks, vector embeddings, and any related graph data.
|
||||
It is disabled when llm cache for entity extraction is disabled.
|
||||
|
||||
This operation is irreversible and will interact with the pipeline status.
|
||||
|
||||
Args:
|
||||
delete_request (DeleteDocRequest): The request containing the document ID.
|
||||
|
||||
Returns:
|
||||
DeleteDocByIdResponse: The result of the deletion operation.
|
||||
- status="success": The document was successfully deleted.
|
||||
- status="not_found": The document with the specified ID was not found.
|
||||
- status="fail": The deletion operation failed.
|
||||
- status="busy": The pipeline is busy with another operation.
|
||||
|
||||
Raises:
|
||||
HTTPException:
|
||||
- 403: If the LLM cache for entity extraction is disabled.
- 500: If an unexpected internal error occurs.
|
||||
"""
|
||||
# The rag object is initialized from the server startup args,
|
||||
# so we can access its properties here.
|
||||
if not rag.enable_llm_cache_for_entity_extract:
|
||||
raise HTTPException(
|
||||
status_code=403,
|
||||
detail="Operation not allowed when LLM cache for entity extraction is disabled.",
|
||||
)
|
||||
from lightrag.kg.shared_storage import (
|
||||
get_namespace_data,
|
||||
get_pipeline_status_lock,
|
||||
)
|
||||
|
||||
doc_id = delete_request.doc_id
|
||||
pipeline_status = await get_namespace_data("pipeline_status")
|
||||
pipeline_status_lock = get_pipeline_status_lock()
|
||||
|
||||
async with pipeline_status_lock:
|
||||
if pipeline_status.get("busy", False):
|
||||
return DeleteDocByIdResponse(
|
||||
status="busy",
|
||||
message="Cannot delete document while pipeline is busy",
|
||||
doc_id=doc_id,
|
||||
)
|
||||
pipeline_status.update(
|
||||
{
|
||||
"busy": True,
|
||||
"job_name": f"Deleting Document: {doc_id}",
|
||||
"job_start": datetime.now().isoformat(),
|
||||
"latest_message": "Starting document deletion process",
|
||||
}
|
||||
)
|
||||
# Use slice assignment to clear the list in place
|
||||
pipeline_status["history_messages"][:] = [
|
||||
f"Starting deletion for doc_id: {doc_id}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = await rag.adelete_by_doc_id(doc_id)
|
||||
if "history_messages" in pipeline_status:
|
||||
pipeline_status["history_messages"].append(result.message)
|
||||
|
||||
if result.status == "not_found":
|
||||
raise HTTPException(status_code=404, detail=result.message)
|
||||
if result.status == "fail":
|
||||
raise HTTPException(status_code=500, detail=result.message)
|
||||
|
||||
return DeleteDocByIdResponse(
|
||||
doc_id=result.doc_id,
|
||||
message=result.message,
|
||||
status=result.status,
|
||||
)
|
||||
|
||||
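except HTTPException:
# Re-raise HTTP errors raised above (e.g. 404) unchanged instead of wrapping them in a new 500
raise
except Exception as e: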
|
||||
error_msg = f"Error deleting document {doc_id}: {str(e)}"
|
||||
logger.error(error_msg)
|
||||
logger.error(traceback.format_exc())
|
||||
if "history_messages" in pipeline_status:
|
||||
pipeline_status["history_messages"].append(error_msg)
|
||||
# Re-raise as HTTPException for consistent error handling by FastAPI
|
||||
raise HTTPException(status_code=500, detail=error_msg)
|
||||
finally:
|
||||
async with pipeline_status_lock:
|
||||
pipeline_status["busy"] = False
|
||||
completion_msg = f"Document deletion process for {doc_id} completed."
|
||||
pipeline_status["latest_message"] = completion_msg
|
||||
if "history_messages" in pipeline_status:
|
||||
pipeline_status["history_messages"].append(completion_msg)
|
||||
|
||||
@router.post(
|
||||
"/clear_cache",
|
||||
response_model=ClearCacheResponse,
|
||||
|
|
@ -1371,4 +1525,77 @@ def create_document_routes(
|
|||
logger.error(traceback.format_exc())
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
|
||||
@router.delete(
|
||||
"/delete_entity",
|
||||
response_model=DeletionResult,
|
||||
dependencies=[Depends(combined_auth)],
|
||||
)
|
||||
async def delete_entity(request: DeleteEntityRequest):
|
||||
"""
|
||||
Delete an entity and all its relationships from the knowledge graph.
|
||||
|
||||
Args:
|
||||
request (DeleteEntityRequest): The request body containing the entity name.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
|
||||
Raises:
|
||||
HTTPException: If the entity is not found (404) or an error occurs (500).
|
||||
"""
|
||||
try:
|
||||
result = await rag.adelete_by_entity(entity_name=request.entity_name)
|
||||
if result.status == "not_found":
|
||||
raise HTTPException(status_code=404, detail=result.message)
|
||||
if result.status == "fail":
|
||||
raise HTTPException(status_code=500, detail=result.message)
|
||||
# Set doc_id to empty string since this is an entity operation, not document
|
||||
result.doc_id = ""
|
||||
return result
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
error_msg = f"Error deleting entity '{request.entity_name}': {str(e)}"
|
||||
logger.error(error_msg)
|
||||
logger.error(traceback.format_exc())
|
||||
raise HTTPException(status_code=500, detail=error_msg)
|
||||
|
||||
@router.delete(
|
||||
"/delete_relation",
|
||||
response_model=DeletionResult,
|
||||
dependencies=[Depends(combined_auth)],
|
||||
)
|
||||
async def delete_relation(request: DeleteRelationRequest):
|
||||
"""
|
||||
Delete a relationship between two entities from the knowledge graph.
|
||||
|
||||
Args:
|
||||
request (DeleteRelationRequest): The request body containing the source and target entity names.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
|
||||
Raises:
|
||||
HTTPException: If the relation is not found (404) or an error occurs (500).
|
||||
"""
|
||||
try:
|
||||
result = await rag.adelete_by_relation(
|
||||
source_entity=request.source_entity,
|
||||
target_entity=request.target_entity,
|
||||
)
|
||||
if result.status == "not_found":
|
||||
raise HTTPException(status_code=404, detail=result.message)
|
||||
if result.status == "fail":
|
||||
raise HTTPException(status_code=500, detail=result.message)
|
||||
# Set doc_id to empty string since this is a relation operation, not document
|
||||
result.doc_id = ""
|
||||
return result
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
error_msg = f"Error deleting relation from '{request.source_entity}' to '{request.target_entity}': {str(e)}"
|
||||
logger.error(error_msg)
|
||||
logger.error(traceback.format_exc())
|
||||
raise HTTPException(status_code=500, detail=error_msg)
|
||||
|
||||
return router
|
||||
|
|
|
|||
|
|
@ -278,6 +278,21 @@ class BaseKVStorage(StorageNameSpace, ABC):
|
|||
False: if the cache drop failed, or the cache mode is not supported
|
||||
"""
|
||||
|
||||
# async def drop_cache_by_chunk_ids(self, chunk_ids: list[str] | None = None) -> bool:
|
||||
# """Delete specific cache records from storage by chunk IDs
|
||||
|
||||
# Important notes for in-memory storage:
|
||||
# 1. Changes will be persisted to disk during the next index_done_callback
|
||||
# 2. update flags to notify other processes that data persistence is needed
|
||||
|
||||
# Args:
|
||||
# chunk_ids (list[str]): List of chunk IDs to be dropped from storage
|
||||
|
||||
# Returns:
|
||||
# True: if the cache was dropped successfully
|
||||
# False: if the cache drop failed, or the operation is not supported
|
||||
# """
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseGraphStorage(StorageNameSpace, ABC):
|
||||
|
|
@ -598,3 +613,13 @@ class StoragesStatus(str, Enum):
|
|||
CREATED = "created"
|
||||
INITIALIZED = "initialized"
|
||||
FINALIZED = "finalized"
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeletionResult:
|
||||
"""Represents the result of a deletion operation."""
|
||||
|
||||
status: Literal["success", "not_found", "fail"]
|
||||
doc_id: str
|
||||
message: str
|
||||
status_code: int = 200
|
||||
|
|
|
|||
|
|
@ -172,6 +172,53 @@ class JsonKVStorage(BaseKVStorage):
|
|||
except Exception:
|
||||
return False
|
||||
|
||||
# async def drop_cache_by_chunk_ids(self, chunk_ids: list[str] | None = None) -> bool:
|
||||
# """Delete specific cache records from storage by chunk IDs
|
||||
|
||||
# Important notes for in-memory storage:
|
||||
# 1. Changes will be persisted to disk during the next index_done_callback
|
||||
# 2. update flags to notify other processes that data persistence is needed
|
||||
|
||||
# Args:
|
||||
# chunk_ids (list[str]): List of chunk IDs to be dropped from storage
|
||||
|
||||
# Returns:
|
||||
# True: if the cache was dropped successfully
|
||||
# False: if the cache drop failed
|
||||
# """
|
||||
# if not chunk_ids:
|
||||
# return False
|
||||
|
||||
# try:
|
||||
# async with self._storage_lock:
|
||||
# # Iterate through all cache modes to find entries with matching chunk_ids
|
||||
# for mode_key, mode_data in list(self._data.items()):
|
||||
# if isinstance(mode_data, dict):
|
||||
# # Check each cached entry in this mode
|
||||
# for cache_key, cache_entry in list(mode_data.items()):
|
||||
# if (
|
||||
# isinstance(cache_entry, dict)
|
||||
# and cache_entry.get("chunk_id") in chunk_ids
|
||||
# ):
|
||||
# # Remove this cache entry
|
||||
# del mode_data[cache_key]
|
||||
# logger.debug(
|
||||
# f"Removed cache entry {cache_key} for chunk {cache_entry.get('chunk_id')}"
|
||||
# )
|
||||
|
||||
# # If the mode is now empty, remove it entirely
|
||||
# if not mode_data:
|
||||
# del self._data[mode_key]
|
||||
|
||||
# # Set update flags to notify persistence is needed
|
||||
# await set_all_update_flags(self.namespace)
|
||||
|
||||
# logger.info(f"Cleared cache for {len(chunk_ids)} chunk IDs")
|
||||
# return True
|
||||
# except Exception as e:
|
||||
# logger.error(f"Error clearing cache by chunk IDs: {e}")
|
||||
# return False
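
The commented-out method above prunes a two-level cache (`mode -> cache_key -> entry`) by chunk ID. A minimal sketch of that pruning on a plain dictionary, without the storage lock or update flags; the function name and the returned count are illustrative choices, not the project's API.

```python
def prune_cache_by_chunk_ids(cache: dict, chunk_ids: list[str]) -> int:
    """Remove cache entries whose 'chunk_id' is in chunk_ids; return how many were removed."""
    targets = set(chunk_ids)
    removed = 0
    # Iterate over snapshots (list(...)) because entries are deleted while looping.
    for mode_key, mode_data in list(cache.items()):
        if not isinstance(mode_data, dict):
            continue
        for cache_key, entry in list(mode_data.items()):
            if isinstance(entry, dict) and entry.get("chunk_id") in targets:
                del mode_data[cache_key]
                removed += 1
        # Drop the mode bucket entirely once it is empty.
        if not mode_data:
            del cache[mode_key]
    return removed


cache = {
    "default": {"a": {"chunk_id": "c1"}, "b": {"chunk_id": "c2"}},
    "local": {"x": {"chunk_id": "c1"}},
}
removed = prune_cache_by_chunk_ids(cache, ["c1"])
```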
|
||||
|
||||
async def drop(self) -> dict[str, str]:
|
||||
"""Drop all data from storage and clean up resources
|
||||
This action will persist the data to disk immediately.
|
||||
|
|
|
|||
|
|
@ -1,4 +1,3 @@
|
|||
import inspect
|
||||
import os
|
||||
import re
|
||||
from dataclasses import dataclass
|
||||
|
|
@ -307,7 +306,7 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
for label in node_dict["labels"]
|
||||
if label != "base"
|
||||
]
|
||||
logger.debug(f"Neo4j query node {query} return: {node_dict}")
|
||||
# logger.debug(f"Neo4j query node {query} return: {node_dict}")
|
||||
return node_dict
|
||||
return None
|
||||
finally:
|
||||
|
|
@ -382,9 +381,9 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
return 0
|
||||
|
||||
degree = record["degree"]
|
||||
logger.debug(
|
||||
f"Neo4j query node degree for {node_id} return: {degree}"
|
||||
)
|
||||
# logger.debug(
|
||||
# f"Neo4j query node degree for {node_id} return: {degree}"
|
||||
# )
|
||||
return degree
|
||||
finally:
|
||||
await result.consume() # Ensure result is fully consumed
|
||||
|
|
@ -424,7 +423,7 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
logger.warning(f"No node found with label '{nid}'")
|
||||
degrees[nid] = 0
|
||||
|
||||
logger.debug(f"Neo4j batch node degree query returned: {degrees}")
|
||||
# logger.debug(f"Neo4j batch node degree query returned: {degrees}")
|
||||
return degrees
|
||||
|
||||
async def edge_degree(self, src_id: str, tgt_id: str) -> int:
|
||||
|
|
@ -512,7 +511,7 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
if records:
|
||||
try:
|
||||
edge_result = dict(records[0]["edge_properties"])
|
||||
logger.debug(f"Result: {edge_result}")
|
||||
# logger.debug(f"Result: {edge_result}")
|
||||
# Ensure required keys exist with defaults
|
||||
required_keys = {
|
||||
"weight": 0.0,
|
||||
|
|
@ -528,9 +527,9 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
f"missing {key}, using default: {default_value}"
|
||||
)
|
||||
|
||||
logger.debug(
|
||||
f"{inspect.currentframe().f_code.co_name}:query:{query}:result:{edge_result}"
|
||||
)
|
||||
# logger.debug(
|
||||
# f"{inspect.currentframe().f_code.co_name}:query:{query}:result:{edge_result}"
|
||||
# )
|
||||
return edge_result
|
||||
except (KeyError, TypeError, ValueError) as e:
|
||||
logger.error(
|
||||
|
|
@ -545,9 +544,9 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
"keywords": None,
|
||||
}
|
||||
|
||||
logger.debug(
|
||||
f"{inspect.currentframe().f_code.co_name}: No edge found between {source_node_id} and {target_node_id}"
|
||||
)
|
||||
# logger.debug(
|
||||
# f"{inspect.currentframe().f_code.co_name}: No edge found between {source_node_id} and {target_node_id}"
|
||||
# )
|
||||
# Return None when no edge found
|
||||
return None
|
||||
finally:
|
||||
|
|
@ -766,9 +765,6 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
result = await tx.run(
|
||||
query, entity_id=node_id, properties=properties
|
||||
)
|
||||
logger.debug(
|
||||
f"Upserted node with entity_id '{node_id}' and properties: {properties}"
|
||||
)
|
||||
await result.consume() # Ensure result is fully consumed
|
||||
|
||||
await session.execute_write(execute_upsert)
|
||||
|
|
@ -824,12 +820,7 @@ class Neo4JStorage(BaseGraphStorage):
|
|||
properties=edge_properties,
|
||||
)
|
||||
try:
|
||||
records = await result.fetch(2)
|
||||
if records:
|
||||
logger.debug(
|
||||
f"Upserted edge from '{source_node_id}' to '{target_node_id}'"
|
||||
f"with properties: {edge_properties}"
|
||||
)
|
||||
await result.fetch(2)
|
||||
finally:
|
||||
await result.consume() # Ensure result is consumed
|
||||
|
||||
|
|
|
|||
|
|
@ -106,6 +106,35 @@ class PostgreSQLDB:
|
|||
):
|
||||
pass
|
||||
|
||||
async def _migrate_llm_cache_add_chunk_id(self):
|
||||
"""Add chunk_id column to LIGHTRAG_LLM_CACHE table if it doesn't exist"""
|
||||
try:
|
||||
# Check if chunk_id column exists
|
||||
check_column_sql = """
|
||||
SELECT column_name
|
||||
FROM information_schema.columns
|
||||
WHERE table_name = 'lightrag_llm_cache'
|
||||
AND column_name = 'chunk_id'
|
||||
"""
|
||||
|
||||
column_info = await self.query(check_column_sql)
|
||||
if not column_info:
|
||||
logger.info("Adding chunk_id column to LIGHTRAG_LLM_CACHE table")
|
||||
add_column_sql = """
|
||||
ALTER TABLE LIGHTRAG_LLM_CACHE
|
||||
ADD COLUMN chunk_id VARCHAR(255) NULL
|
||||
"""
|
||||
await self.execute(add_column_sql)
|
||||
logger.info(
|
||||
"Successfully added chunk_id column to LIGHTRAG_LLM_CACHE table"
|
||||
)
|
||||
else:
|
||||
logger.info(
|
||||
"chunk_id column already exists in LIGHTRAG_LLM_CACHE table"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to add chunk_id column to LIGHTRAG_LLM_CACHE: {e}")
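
The migration above follows a check-then-alter pattern so that it can run on every startup. A minimal runnable sketch of the same idea, assuming SQLite from the standard library instead of PostgreSQL (so `PRAGMA table_info` stands in for the `information_schema.columns` query); the table and helper names are illustrative.

```python
import sqlite3


def add_column_if_missing(
    conn: sqlite3.Connection, table: str, column: str, ddl_type: str
) -> bool:
    """Add `column` to `table` unless it already exists. Returns True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_cache (id TEXT PRIMARY KEY, return_value TEXT)")
first_run = add_column_if_missing(conn, "llm_cache", "chunk_id", "TEXT")
second_run = add_column_if_missing(conn, "llm_cache", "chunk_id", "TEXT")
```

The second call is a no-op, which is what makes the migration safe to repeat.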
|
||||
|
||||
async def _migrate_timestamp_columns(self):
|
||||
"""Migrate timestamp columns in tables to timezone-aware types, assuming original data is in UTC time"""
|
||||
# Tables and columns that need migration
|
||||
|
|
@ -203,6 +232,13 @@ class PostgreSQLDB:
|
|||
logger.error(f"PostgreSQL, Failed to migrate timestamp columns: {e}")
|
||||
# Don't throw an exception, allow the initialization process to continue
|
||||
|
||||
# Migrate LLM cache table to add chunk_id field if needed
|
||||
try:
|
||||
await self._migrate_llm_cache_add_chunk_id()
|
||||
except Exception as e:
|
||||
logger.error(f"PostgreSQL, Failed to migrate LLM cache chunk_id field: {e}")
|
||||
# Don't throw an exception, allow the initialization process to continue
|
||||
|
||||
async def query(
|
||||
self,
|
||||
sql: str,
|
||||
|
|
@ -253,25 +289,31 @@ class PostgreSQLDB:
|
|||
sql: str,
|
||||
data: dict[str, Any] | None = None,
|
||||
upsert: bool = False,
|
||||
ignore_if_exists: bool = False,
|
||||
with_age: bool = False,
|
||||
graph_name: str | None = None,
|
||||
):
|
||||
try:
|
||||
async with self.pool.acquire() as connection: # type: ignore
|
||||
if with_age and graph_name:
|
||||
await self.configure_age(connection, graph_name) # type: ignore
|
||||
await self.configure_age(connection, graph_name)
|
||||
elif with_age and not graph_name:
|
||||
raise ValueError("Graph name is required when with_age is True")
|
||||
|
||||
if data is None:
|
||||
await connection.execute(sql) # type: ignore
|
||||
await connection.execute(sql)
|
||||
else:
|
||||
await connection.execute(sql, *data.values()) # type: ignore
|
||||
await connection.execute(sql, *data.values())
|
||||
except (
|
||||
asyncpg.exceptions.UniqueViolationError,
|
||||
asyncpg.exceptions.DuplicateTableError,
|
||||
asyncpg.exceptions.DuplicateObjectError, # Catch "already exists" error
|
||||
asyncpg.exceptions.InvalidSchemaNameError, # Also catch for AGE extension "already exists"
|
||||
) as e:
|
||||
if upsert:
|
||||
if ignore_if_exists:
|
||||
# If the flag is set, just ignore these specific errors
|
||||
pass
|
||||
elif upsert:
|
||||
print("Key value duplicate, but upsert succeeded.")
|
||||
else:
|
||||
logger.error(f"Upsert error: {e}")
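
The new branch order means `ignore_if_exists` takes precedence over `upsert` when a duplicate or "already exists" error is caught. A small sketch of that precedence as a pure function; the function and the returned labels are hypothetical and only mirror the three branches shown in the hunk.

```python
def duplicate_error_action(upsert: bool, ignore_if_exists: bool) -> str:
    """Mirror the branch order used when a duplicate/already-exists error is caught."""
    if ignore_if_exists:
        return "ignore"  # silently swallow the error
    if upsert:
        return "report-upsert"  # duplicate key, but the upsert is treated as fine
    return "log-error"  # unexpected duplicate: log it


both_flags = duplicate_error_action(upsert=True, ignore_if_exists=True)
upsert_only = duplicate_error_action(upsert=True, ignore_if_exists=False)
no_flags = duplicate_error_action(upsert=False, ignore_if_exists=False)
```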
|
||||
|
|
@ -497,6 +539,7 @@ class PGKVStorage(BaseKVStorage):
|
|||
"original_prompt": v["original_prompt"],
|
||||
"return_value": v["return"],
|
||||
"mode": mode,
|
||||
"chunk_id": v.get("chunk_id"),
|
||||
}
|
||||
|
||||
await self.db.execute(upsert_sql, _data)
|
||||
|
|
@ -1175,16 +1218,15 @@ class PGGraphStorage(BaseGraphStorage):
|
|||
]
|
||||
|
||||
for query in queries:
|
||||
try:
|
||||
await self.db.execute(
|
||||
query,
|
||||
upsert=True,
|
||||
with_age=True,
|
||||
graph_name=self.graph_name,
|
||||
)
|
||||
# logger.info(f"Successfully executed: {query}")
|
||||
except Exception:
|
||||
continue
|
||||
# Use the new flag to silently ignore "already exists" errors
|
||||
# at the source, preventing log spam.
|
||||
await self.db.execute(
|
||||
query,
|
||||
upsert=True,
|
||||
ignore_if_exists=True, # Pass the new flag
|
||||
with_age=True,
|
||||
graph_name=self.graph_name,
|
||||
)
|
||||
|
||||
async def finalize(self):
|
||||
if self.db is not None:
|
||||
|
|
@ -2357,6 +2399,7 @@ TABLES = {
|
|||
mode varchar(32) NOT NULL,
|
||||
original_prompt TEXT,
|
||||
return_value TEXT,
|
||||
chunk_id VARCHAR(255) NULL,
|
||||
create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
update_time TIMESTAMP,
|
||||
CONSTRAINT LIGHTRAG_LLM_CACHE_PK PRIMARY KEY (workspace, mode, id)
|
||||
|
|
@ -2389,10 +2432,10 @@ SQL_TEMPLATES = {
|
|||
chunk_order_index, full_doc_id, file_path
|
||||
FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 AND id=$2
|
||||
""",
|
||||
"get_by_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
|
||||
"get_by_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
|
||||
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND mode=$2
|
||||
""",
|
||||
"get_by_mode_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
|
||||
"get_by_mode_id_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
|
||||
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND mode=$2 AND id=$3
|
||||
""",
|
||||
"get_by_ids_full_docs": """SELECT id, COALESCE(content, '') as content
|
||||
|
|
@ -2402,7 +2445,7 @@ SQL_TEMPLATES = {
|
|||
chunk_order_index, full_doc_id, file_path
|
||||
FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 AND id IN ({ids})
|
||||
""",
|
||||
"get_by_ids_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode
|
||||
"get_by_ids_llm_response_cache": """SELECT id, original_prompt, COALESCE(return_value, '') as "return", mode, chunk_id
|
||||
FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND id IN ({ids})
|
||||
""",
|
||||
"filter_keys": "SELECT id FROM {table_name} WHERE workspace=$1 AND id IN ({ids})",
|
||||
|
|
@ -2411,12 +2454,13 @@ SQL_TEMPLATES = {
|
|||
ON CONFLICT (workspace,id) DO UPDATE
|
||||
SET content = $2, update_time = CURRENT_TIMESTAMP
|
||||
""",
|
||||
"upsert_llm_response_cache": """INSERT INTO LIGHTRAG_LLM_CACHE(workspace,id,original_prompt,return_value,mode)
|
||||
VALUES ($1, $2, $3, $4, $5)
|
||||
"upsert_llm_response_cache": """INSERT INTO LIGHTRAG_LLM_CACHE(workspace,id,original_prompt,return_value,mode,chunk_id)
|
||||
VALUES ($1, $2, $3, $4, $5, $6)
|
||||
ON CONFLICT (workspace,mode,id) DO UPDATE
|
||||
SET original_prompt = EXCLUDED.original_prompt,
|
||||
return_value=EXCLUDED.return_value,
|
||||
mode=EXCLUDED.mode,
|
||||
chunk_id=EXCLUDED.chunk_id,
|
||||
update_time = CURRENT_TIMESTAMP
|
||||
""",
|
||||
"upsert_chunk": """INSERT INTO LIGHTRAG_DOC_CHUNKS (workspace, id, tokens,
|
||||
|
|
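
The `upsert_llm_response_cache` template relies on `ON CONFLICT ... DO UPDATE` with `EXCLUDED` to overwrite an existing row, now including `chunk_id`. A minimal sketch of that behavior, assuming SQLite (3.24 or newer) from the standard library rather than PostgreSQL, with `?` placeholders instead of `$1..$6` and a reduced, illustrative table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE llm_cache (
           workspace TEXT, mode TEXT, id TEXT,
           return_value TEXT, chunk_id TEXT,
           PRIMARY KEY (workspace, mode, id)
       )"""
)

UPSERT = """INSERT INTO llm_cache (workspace, id, return_value, mode, chunk_id)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (workspace, mode, id) DO UPDATE
            SET return_value = excluded.return_value,
                chunk_id = excluded.chunk_id"""

# The first call inserts; the second hits the primary key and updates in place.
conn.execute(UPSERT, ("ws", "k1", "first answer", "default", None))
conn.execute(UPSERT, ("ws", "k1", "second answer", "default", "chunk-abc"))
rows = conn.execute("SELECT return_value, chunk_id FROM llm_cache").fetchall()
```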
|
|||
|
|
@ -35,6 +35,7 @@ from lightrag.kg import (
|
|||
from lightrag.kg.shared_storage import (
|
||||
get_namespace_data,
|
||||
get_pipeline_status_lock,
|
||||
get_graph_db_lock,
|
||||
)
|
||||
|
||||
from .base import (
|
||||
|
|
@ -47,6 +48,7 @@ from .base import (
|
|||
QueryParam,
|
||||
StorageNameSpace,
|
||||
StoragesStatus,
|
||||
DeletionResult,
|
||||
)
|
||||
from .namespace import NameSpace, make_namespace
|
||||
from .operate import (
|
||||
|
|
@ -56,6 +58,7 @@ from .operate import (
|
|||
kg_query,
|
||||
naive_query,
|
||||
query_with_keywords,
|
||||
_rebuild_knowledge_from_chunks,
|
||||
)
|
||||
from .prompt import GRAPH_FIELD_SEP
|
||||
from .utils import (
|
||||
|
|
@ -1207,6 +1210,7 @@ class LightRAG:
|
|||
cast(StorageNameSpace, storage_inst).index_done_callback()
|
||||
for storage_inst in [ # type: ignore
|
||||
self.full_docs,
|
||||
self.doc_status,
|
||||
self.text_chunks,
|
||||
self.llm_response_cache,
|
||||
self.entities_vdb,
|
||||
|
|
@ -1674,24 +1678,45 @@ class LightRAG:
|
|||
# Return the dictionary containing statuses only for the found document IDs
|
||||
return found_statuses
|
||||
|
||||
# TODO: Deprecated (Deleting documents can cause hallucinations in RAG.)
|
||||
# Document delete is not working properly for most of the storage implementations.
|
||||
async def adelete_by_doc_id(self, doc_id: str) -> None:
|
||||
"""Delete a document and all its related data
|
||||
async def adelete_by_doc_id(self, doc_id: str) -> DeletionResult:
|
||||
"""Delete a document and all its related data, including chunks, graph elements, and cached entries.
|
||||
|
||||
This method orchestrates a comprehensive deletion process for a given document ID.
|
||||
It ensures that not only the document itself but also all its derived and associated
|
||||
data across different storage layers are removed. This includes:
|
||||
1. **Document and Status**: Deletes the document from `full_docs` and its status from `doc_status`.
|
||||
2. **Chunks**: Removes all associated text chunks from `chunks_vdb` and `text_chunks`.
|
||||
3. **Graph Data**:
|
||||
- Deletes related entities from `entities_vdb`.
|
||||
- Deletes related relationships from `relationships_vdb`.
|
||||
- Removes corresponding nodes and edges from the `chunk_entity_relation_graph`.
|
||||
4. **Graph Reconstruction**: If entities or relationships are partially affected, it triggers
|
||||
a reconstruction of their data from the remaining chunks to ensure consistency.
|
||||
|
||||
Args:
|
||||
doc_id: Document ID to delete
|
||||
doc_id (str): The unique identifier of the document to be deleted.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
- `status` (str): "success", "not_found", or "fail".
|
||||
- `doc_id` (str): The ID of the document attempted to be deleted.
|
||||
- `message` (str): A summary of the operation's result.
|
||||
- `status_code` (int): HTTP status code (e.g., 200, 404, 500).
|
||||
"""
|
||||
try:
|
||||
# 1. Get the document status and related data
|
||||
if not await self.doc_status.get_by_id(doc_id):
|
||||
logger.warning(f"Document {doc_id} not found")
|
||||
return
|
||||
return DeletionResult(
|
||||
status="not_found",
|
||||
doc_id=doc_id,
|
||||
message=f"Document {doc_id} not found.",
|
||||
status_code=404,
|
||||
)
|
||||
|
||||
logger.debug(f"Starting deletion for document {doc_id}")
|
||||
logger.info(f"Starting optimized deletion for document {doc_id}")
|
||||
|
||||
# 2. Get all chunks related to this document
|
||||
# Find all chunks where full_doc_id equals the current doc_id
|
||||
all_chunks = await self.text_chunks.get_all()
|
||||
related_chunks = {
|
||||
chunk_id: chunk_data
|
||||
|
|
@ -1702,241 +1727,197 @@ class LightRAG:
|
|||
|
||||
if not related_chunks:
|
||||
logger.warning(f"No chunks found for document {doc_id}")
|
||||
return
|
||||
# Still need to delete the doc status and full doc
|
||||
await self.full_docs.delete([doc_id])
|
||||
await self.doc_status.delete([doc_id])
|
||||
return DeletionResult(
|
||||
status="success",
|
||||
doc_id=doc_id,
|
||||
message=f"Document {doc_id} found but had no associated chunks. Document entry deleted.",
|
||||
status_code=200,
|
||||
)
|
||||
|
||||
# Get all related chunk IDs
|
||||
chunk_ids = set(related_chunks.keys())
|
||||
logger.debug(f"Found {len(chunk_ids)} chunks to delete")
|
||||
logger.info(f"Found {len(chunk_ids)} chunks to delete")
|
||||
|
||||
# TODO: self.entities_vdb.client_storage only works for local storage, need to fix this
|
||||
# # 3. **OPTIMIZATION 1**: Clear LLM cache for related chunks
|
||||
# logger.info("Clearing LLM cache for related chunks...")
|
||||
# cache_cleared = await self.llm_response_cache.drop_cache_by_chunk_ids(
|
||||
# list(chunk_ids)
|
||||
# )
|
||||
# if cache_cleared:
|
||||
# logger.info(f"Successfully cleared cache for {len(chunk_ids)} chunks")
|
||||
# else:
|
||||
# logger.warning(
|
||||
# "Failed to clear chunk cache or cache clearing not supported"
|
||||
# )
|
||||
|
||||
# 3. Before deleting, check the related entities and relationships for these chunks
|
||||
for chunk_id in chunk_ids:
|
||||
# Check entities
|
||||
entities_storage = await self.entities_vdb.client_storage
|
||||
entities = [
|
||||
dp
|
||||
for dp in entities_storage["data"]
|
||||
if chunk_id in dp.get("source_id")
|
||||
]
|
||||
logger.debug(f"Chunk {chunk_id} has {len(entities)} related entities")
|
||||
|
||||
# Check relationships
|
||||
relationships_storage = await self.relationships_vdb.client_storage
|
||||
relations = [
|
||||
dp
|
||||
for dp in relationships_storage["data"]
|
||||
if chunk_id in dp.get("source_id")
|
||||
]
|
||||
logger.debug(f"Chunk {chunk_id} has {len(relations)} related relations")
|
||||
|
||||
# Continue with the original deletion process...
|
||||
|
||||
# 4. Delete chunks from vector database
|
||||
if chunk_ids:
|
||||
await self.chunks_vdb.delete(chunk_ids)
|
||||
await self.text_chunks.delete(chunk_ids)
|
||||
|
||||
# 5. Find and process entities and relationships that have these chunks as source
|
||||
# Get all nodes and edges from the graph storage using storage-agnostic methods
|
||||
# 4. Analyze entities and relationships that will be affected
|
||||
entities_to_delete = set()
|
||||
entities_to_update = {} # entity_name -> new_source_id
|
||||
entities_to_rebuild = {} # entity_name -> remaining_chunk_ids
|
||||
relationships_to_delete = set()
|
||||
relationships_to_update = {} # (src, tgt) -> new_source_id
|
||||
relationships_to_rebuild = {} # (src, tgt) -> remaining_chunk_ids
|
||||
|
||||
# Process entities - use storage-agnostic methods
|
||||
all_labels = await self.chunk_entity_relation_graph.get_all_labels()
|
||||
for node_label in all_labels:
|
||||
node_data = await self.chunk_entity_relation_graph.get_node(node_label)
|
||||
if node_data and "source_id" in node_data:
|
||||
# Split source_id using GRAPH_FIELD_SEP
|
||||
sources = set(node_data["source_id"].split(GRAPH_FIELD_SEP))
|
||||
sources.difference_update(chunk_ids)
|
||||
if not sources:
|
||||
entities_to_delete.add(node_label)
|
||||
logger.debug(
|
||||
f"Entity {node_label} marked for deletion - no remaining sources"
|
||||
)
|
||||
else:
|
||||
new_source_id = GRAPH_FIELD_SEP.join(sources)
|
||||
entities_to_update[node_label] = new_source_id
|
||||
logger.debug(
|
||||
f"Entity {node_label} will be updated with new source_id: {new_source_id}"
|
||||
)
|
||||
# Use graph database lock to ensure atomic merges and updates
|
||||
graph_db_lock = get_graph_db_lock(enable_logging=False)
|
||||
async with graph_db_lock:
|
||||
# Process entities
|
||||
# TODO: There is a performance issue when iterating get_all_labels for PostgreSQL
|
||||
all_labels = await self.chunk_entity_relation_graph.get_all_labels()
|
||||
for node_label in all_labels:
|
||||
node_data = await self.chunk_entity_relation_graph.get_node(
|
||||
node_label
|
||||
)
|
||||
if node_data and "source_id" in node_data:
|
||||
# Split source_id using GRAPH_FIELD_SEP
|
||||
sources = set(node_data["source_id"].split(GRAPH_FIELD_SEP))
|
||||
remaining_sources = sources - chunk_ids
|
||||
|
||||
# Process relationships
|
||||
for node_label in all_labels:
|
||||
node_edges = await self.chunk_entity_relation_graph.get_node_edges(
|
||||
node_label
|
||||
)
|
||||
if node_edges:
|
||||
for src, tgt in node_edges:
|
||||
edge_data = await self.chunk_entity_relation_graph.get_edge(
|
||||
src, tgt
|
||||
)
|
||||
if edge_data and "source_id" in edge_data:
|
||||
# Split source_id using GRAPH_FIELD_SEP
|
||||
sources = set(edge_data["source_id"].split(GRAPH_FIELD_SEP))
|
||||
sources.difference_update(chunk_ids)
|
||||
if not sources:
|
||||
relationships_to_delete.add((src, tgt))
|
||||
logger.debug(
|
||||
f"Relationship {src}-{tgt} marked for deletion - no remaining sources"
|
||||
)
|
||||
else:
|
||||
new_source_id = GRAPH_FIELD_SEP.join(sources)
|
||||
relationships_to_update[(src, tgt)] = new_source_id
|
||||
logger.debug(
|
||||
f"Relationship {src}-{tgt} will be updated with new source_id: {new_source_id}"
|
||||
if not remaining_sources:
|
||||
entities_to_delete.add(node_label)
|
||||
logger.debug(
|
||||
f"Entity {node_label} marked for deletion - no remaining sources"
|
||||
)
|
||||
elif remaining_sources != sources:
|
||||
# Entity needs to be rebuilt from remaining chunks
|
||||
entities_to_rebuild[node_label] = remaining_sources
|
||||
logger.debug(
|
||||
f"Entity {node_label} will be rebuilt from {len(remaining_sources)} remaining chunks"
|
||||
)
|
||||
|
||||
# Process relationships
|
||||
# TODO: There is a performance issue when iterating get_all_labels for PostgreSQL
|
||||
for node_label in all_labels:
|
||||
node_edges = await self.chunk_entity_relation_graph.get_node_edges(
|
||||
node_label
|
||||
)
|
||||
if node_edges:
|
||||
for src, tgt in node_edges:
|
||||
# To avoid processing the same edge twice in an undirected graph
|
||||
if (tgt, src) in relationships_to_delete or (
|
||||
tgt,
|
||||
src,
|
||||
) in relationships_to_rebuild:
|
||||
continue
|
||||
|
||||
edge_data = await self.chunk_entity_relation_graph.get_edge(
|
||||
src, tgt
|
||||
)
|
||||
if edge_data and "source_id" in edge_data:
|
||||
# Split source_id using GRAPH_FIELD_SEP
|
||||
sources = set(
|
||||
edge_data["source_id"].split(GRAPH_FIELD_SEP)
|
||||
)
|
||||
remaining_sources = sources - chunk_ids
|
||||
|
||||
# Delete entities
|
||||
if entities_to_delete:
|
||||
for entity in entities_to_delete:
|
||||
await self.entities_vdb.delete_entity(entity)
|
||||
logger.debug(f"Deleted entity {entity} from vector DB")
|
||||
await self.chunk_entity_relation_graph.remove_nodes(
|
||||
list(entities_to_delete)
|
||||
)
|
||||
logger.debug(f"Deleted {len(entities_to_delete)} entities from graph")
|
||||
if not remaining_sources:
|
||||
relationships_to_delete.add((src, tgt))
|
||||
logger.debug(
|
||||
f"Relationship {src}-{tgt} marked for deletion - no remaining sources"
|
||||
)
|
||||
elif remaining_sources != sources:
|
||||
# Relationship needs to be rebuilt from remaining chunks
|
||||
relationships_to_rebuild[(src, tgt)] = (
|
||||
remaining_sources
|
||||
)
|
||||
logger.debug(
|
||||
f"Relationship {src}-{tgt} will be rebuilt from {len(remaining_sources)} remaining chunks"
|
||||
)
|
||||
|
||||
# Update entities
|
||||
for entity, new_source_id in entities_to_update.items():
|
||||
node_data = await self.chunk_entity_relation_graph.get_node(entity)
|
||||
if node_data:
|
||||
node_data["source_id"] = new_source_id
|
||||
await self.chunk_entity_relation_graph.upsert_node(
|
||||
entity, node_data
|
||||
# 5. Delete chunks from storage
|
||||
if chunk_ids:
|
||||
await self.chunks_vdb.delete(chunk_ids)
|
||||
await self.text_chunks.delete(chunk_ids)
|
||||
logger.info(f"Deleted {len(chunk_ids)} chunks from storage")
|
||||
|
||||
# 6. Delete entities that have no remaining sources
|
||||
if entities_to_delete:
|
||||
# Delete from vector database
|
||||
entity_vdb_ids = [
|
||||
compute_mdhash_id(entity, prefix="ent-")
|
||||
for entity in entities_to_delete
|
||||
]
|
||||
await self.entities_vdb.delete(entity_vdb_ids)
|
||||
|
||||
# Delete from graph
|
||||
await self.chunk_entity_relation_graph.remove_nodes(
|
||||
list(entities_to_delete)
|
||||
)
|
||||
logger.debug(
|
||||
f"Updated entity {entity} with new source_id: {new_source_id}"
|
||||
logger.info(f"Deleted {len(entities_to_delete)} entities")
|
||||
|
||||
# 7. Delete relationships that have no remaining sources
|
||||
if relationships_to_delete:
|
||||
# Delete from vector database
|
||||
rel_ids_to_delete = []
|
||||
for src, tgt in relationships_to_delete:
|
||||
rel_ids_to_delete.extend(
|
||||
[
|
||||
compute_mdhash_id(src + tgt, prefix="rel-"),
|
||||
compute_mdhash_id(tgt + src, prefix="rel-"),
|
||||
]
|
||||
)
|
||||
await self.relationships_vdb.delete(rel_ids_to_delete)
|
||||
|
||||
# Delete from graph
|
||||
await self.chunk_entity_relation_graph.remove_edges(
|
||||
list(relationships_to_delete)
|
||||
)
|
||||
logger.info(f"Deleted {len(relationships_to_delete)} relationships")
|
||||
|
||||
# 8. **OPTIMIZATION 2**: Rebuild entities and relationships from remaining chunks
|
||||
if entities_to_rebuild or relationships_to_rebuild:
|
||||
logger.info(
|
||||
f"Rebuilding {len(entities_to_rebuild)} entities and {len(relationships_to_rebuild)} relationships..."
|
||||
)
|
||||
await _rebuild_knowledge_from_chunks(
|
||||
entities_to_rebuild=entities_to_rebuild,
|
||||
relationships_to_rebuild=relationships_to_rebuild,
|
||||
knowledge_graph_inst=self.chunk_entity_relation_graph,
|
||||
entities_vdb=self.entities_vdb,
|
||||
relationships_vdb=self.relationships_vdb,
|
||||
text_chunks=self.text_chunks,
|
||||
llm_response_cache=self.llm_response_cache,
|
||||
global_config=asdict(self),
|
||||
)
|
||||
|
||||
# Delete relationships
|
||||
if relationships_to_delete:
|
||||
for src, tgt in relationships_to_delete:
|
||||
rel_id_0 = compute_mdhash_id(src + tgt, prefix="rel-")
|
||||
rel_id_1 = compute_mdhash_id(tgt + src, prefix="rel-")
|
||||
await self.relationships_vdb.delete([rel_id_0, rel_id_1])
|
||||
logger.debug(f"Deleted relationship {src}-{tgt} from vector DB")
|
||||
await self.chunk_entity_relation_graph.remove_edges(
|
||||
list(relationships_to_delete)
|
||||
)
|
||||
logger.debug(
|
||||
f"Deleted {len(relationships_to_delete)} relationships from graph"
|
||||
)
|
||||
|
||||
# Update relationships
|
||||
for (src, tgt), new_source_id in relationships_to_update.items():
|
||||
edge_data = await self.chunk_entity_relation_graph.get_edge(src, tgt)
|
||||
if edge_data:
|
||||
edge_data["source_id"] = new_source_id
|
||||
await self.chunk_entity_relation_graph.upsert_edge(
|
||||
src, tgt, edge_data
|
||||
)
|
||||
logger.debug(
|
||||
f"Updated relationship {src}-{tgt} with new source_id: {new_source_id}"
|
||||
)
|
||||
|
||||
# 6. Delete original document and status
|
||||
# 9. Delete original document and status
|
||||
await self.full_docs.delete([doc_id])
|
||||
await self.doc_status.delete([doc_id])
|
||||
|
||||
# 7. Ensure all indexes are updated
|
||||
# 10. Ensure all indexes are updated
|
||||
await self._insert_done()
|
||||
|
||||
logger.info(
|
||||
f"Successfully deleted document {doc_id} and related data. "
|
||||
f"Deleted {len(entities_to_delete)} entities and {len(relationships_to_delete)} relationships. "
|
||||
f"Updated {len(entities_to_update)} entities and {len(relationships_to_update)} relationships."
|
||||
success_message = f"""Successfully deleted document {doc_id}.
|
||||
Deleted: {len(entities_to_delete)} entities, {len(relationships_to_delete)} relationships.
|
||||
Rebuilt: {len(entities_to_rebuild)} entities, {len(relationships_to_rebuild)} relationships."""
|
||||
|
||||
logger.info(success_message)
|
||||
return DeletionResult(
|
||||
status="success",
|
||||
doc_id=doc_id,
|
||||
message=success_message,
|
||||
status_code=200,
|
||||
)
|
||||
|
||||
async def process_data(data_type, vdb, chunk_id):
|
||||
# Check data (entities or relationships)
|
||||
storage = await vdb.client_storage
|
||||
data_with_chunk = [
|
||||
dp
|
||||
for dp in storage["data"]
|
||||
if chunk_id in (dp.get("source_id") or "").split(GRAPH_FIELD_SEP)
|
||||
]
|
||||
|
||||
data_for_vdb = {}
|
||||
if data_with_chunk:
|
||||
logger.warning(
|
||||
f"found {len(data_with_chunk)} {data_type} still referencing chunk {chunk_id}"
|
||||
)
|
||||
|
||||
for item in data_with_chunk:
|
||||
old_sources = item["source_id"].split(GRAPH_FIELD_SEP)
|
||||
new_sources = [src for src in old_sources if src != chunk_id]
|
||||
|
||||
if not new_sources:
|
||||
logger.info(
|
||||
f"{data_type} {item.get('entity_name', 'N/A')} is deleted because source_id is not exists"
|
||||
)
|
||||
await vdb.delete_entity(item)
|
||||
else:
|
||||
item["source_id"] = GRAPH_FIELD_SEP.join(new_sources)
|
||||
item_id = item["__id__"]
|
||||
data_for_vdb[item_id] = item.copy()
|
||||
if data_type == "entities":
|
||||
data_for_vdb[item_id]["content"] = data_for_vdb[
|
||||
item_id
|
||||
].get("content") or (
|
||||
item.get("entity_name", "")
|
||||
+ (item.get("description") or "")
|
||||
)
|
||||
else: # relationships
|
||||
data_for_vdb[item_id]["content"] = data_for_vdb[
|
||||
item_id
|
||||
].get("content") or (
|
||||
(item.get("keywords") or "")
|
||||
+ (item.get("src_id") or "")
|
||||
+ (item.get("tgt_id") or "")
|
||||
+ (item.get("description") or "")
|
||||
)
|
||||
|
||||
if data_for_vdb:
|
||||
await vdb.upsert(data_for_vdb)
|
||||
logger.info(f"Successfully updated {data_type} in vector DB")
|
||||
|
||||
# Add verification step
|
||||
async def verify_deletion():
|
||||
# Verify if the document has been deleted
|
||||
if await self.full_docs.get_by_id(doc_id):
|
||||
logger.warning(f"Document {doc_id} still exists in full_docs")
|
||||
|
||||
# Verify if chunks have been deleted
|
||||
all_remaining_chunks = await self.text_chunks.get_all()
|
||||
remaining_related_chunks = {
|
||||
chunk_id: chunk_data
|
||||
for chunk_id, chunk_data in all_remaining_chunks.items()
|
||||
if isinstance(chunk_data, dict)
|
||||
and chunk_data.get("full_doc_id") == doc_id
|
||||
}
|
||||
|
||||
if remaining_related_chunks:
|
||||
logger.warning(
|
||||
f"Found {len(remaining_related_chunks)} remaining chunks"
|
||||
)
|
||||
|
||||
# Verify entities and relationships
|
||||
for chunk_id in chunk_ids:
|
||||
await process_data("entities", self.entities_vdb, chunk_id)
|
||||
await process_data(
|
||||
"relationships", self.relationships_vdb, chunk_id
|
||||
)
|
||||
|
||||
await verify_deletion()
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error while deleting document {doc_id}: {e}")
|
||||
error_message = f"Error while deleting document {doc_id}: {e}"
|
||||
logger.error(error_message)
|
||||
logger.error(traceback.format_exc())
|
||||
return DeletionResult(
|
||||
status="fail",
|
||||
doc_id=doc_id,
|
||||
message=error_message,
|
||||
status_code=500,
|
||||
)
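
The core of the reworked deletion is the decision made per entity or relationship: delete it when no source chunks remain, rebuild it when only some sources were removed, and leave it alone otherwise. A minimal sketch of that classification on plain dictionaries; the helper name is hypothetical, and the `"<SEP>"` value for `GRAPH_FIELD_SEP` is an assumption made for the example.

```python
GRAPH_FIELD_SEP = "<SEP>"  # assumed separator value, for illustration only


def classify_by_remaining_sources(
    source_ids: dict[str, str], deleted_chunk_ids: set[str]
) -> tuple[set[str], dict[str, set[str]]]:
    """Split graph elements into those to delete and those to rebuild."""
    to_delete: set[str] = set()
    to_rebuild: dict[str, set[str]] = {}
    for name, source_id in source_ids.items():
        sources = set(source_id.split(GRAPH_FIELD_SEP))
        remaining = sources - deleted_chunk_ids
        if not remaining:
            # Every source chunk is going away: the element has no support left.
            to_delete.add(name)
        elif remaining != sources:
            # Partially affected: rebuild from the chunks that survive.
            to_rebuild[name] = remaining
        # Otherwise the element is untouched by this deletion.
    return to_delete, to_rebuild


entities = {
    "Google": "chunk-1<SEP>chunk-2",
    "Gmail": "chunk-1",
    "Bing": "chunk-3",
}
to_delete, to_rebuild = classify_by_remaining_sources(entities, {"chunk-1"})
```

Here `Gmail` is deleted, `Google` is rebuilt from `chunk-2`, and `Bing` is not touched.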
|
||||
|
||||
async def adelete_by_entity(self, entity_name: str) -> None:
|
||||
async def adelete_by_entity(self, entity_name: str) -> DeletionResult:
|
||||
"""Asynchronously delete an entity and all its relationships.
|
||||
|
||||
Args:
|
||||
entity_name: Name of the entity to delete
|
||||
entity_name: Name of the entity to delete.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
"""
|
||||
from .utils_graph import adelete_by_entity
|
||||
|
||||
|
|
@ -1947,16 +1928,29 @@ class LightRAG:
|
|||
entity_name,
|
||||
)
|
||||
|
||||
def delete_by_entity(self, entity_name: str) -> None:
|
||||
def delete_by_entity(self, entity_name: str) -> DeletionResult:
|
||||
"""Synchronously delete an entity and all its relationships.
|
||||
|
||||
Args:
|
||||
entity_name: Name of the entity to delete.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
"""
|
||||
loop = always_get_an_event_loop()
|
||||
return loop.run_until_complete(self.adelete_by_entity(entity_name))
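
The synchronous methods are thin wrappers that drive the async versions to completion on an event loop and pass the result through. A minimal sketch of that pattern using only `asyncio`; it creates a private loop instead of calling the project's `always_get_an_event_loop` helper, and `adelete`/`delete` are stand-in names.

```python
import asyncio


async def adelete(name: str) -> str:
    """Stand-in for an async deletion method (hypothetical)."""
    await asyncio.sleep(0)
    return f"deleted {name}"


def delete(name: str) -> str:
    """Blocking wrapper: run the coroutine to completion and return its result."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(adelete(name))
    finally:
        loop.close()


result = delete("Google")
```

A wrapper like this is meant for plain synchronous callers; code already running inside an event loop would normally await the async method directly.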
|
||||
|
||||
async def adelete_by_relation(self, source_entity: str, target_entity: str) -> None:
|
||||
async def adelete_by_relation(
|
||||
self, source_entity: str, target_entity: str
|
||||
) -> DeletionResult:
|
||||
"""Asynchronously delete a relation between two entities.
|
||||
|
||||
Args:
|
||||
source_entity: Name of the source entity
|
||||
target_entity: Name of the target entity
|
||||
source_entity: Name of the source entity.
|
||||
target_entity: Name of the target entity.
|
||||
|
||||
Returns:
|
||||
DeletionResult: An object containing the outcome of the deletion process.
|
||||
"""
|
||||
from .utils_graph import adelete_by_relation
|
||||
|
||||
|
|
@@ -1967,7 +1961,18 @@ class LightRAG:
             target_entity,
         )
 
-    def delete_by_relation(self, source_entity: str, target_entity: str) -> None:
+    def delete_by_relation(
+        self, source_entity: str, target_entity: str
+    ) -> DeletionResult:
+        """Synchronously delete a relation between two entities.
+
+        Args:
+            source_entity: Name of the source entity.
+            target_entity: Name of the target entity.
+
+        Returns:
+            DeletionResult: An object containing the outcome of the deletion process.
+        """
         loop = always_get_an_event_loop()
         return loop.run_until_complete(
             self.adelete_by_relation(source_entity, target_entity)
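The hunks above change the deletion methods to return a `DeletionResult` instead of `None`, with each synchronous method running its async counterpart to completion on an event loop. A minimal sketch of that pattern follows. The `DeletionResult` fields (`status`, `message`), the in-memory `GRAPH` dict, and the use of `asyncio.new_event_loop()` in place of `always_get_an_event_loop()` are assumptions made for illustration; the diff does not show the real definitions.

```python
import asyncio
from dataclasses import dataclass


# Hypothetical result type; the real DeletionResult fields are not shown in this diff.
@dataclass
class DeletionResult:
    status: str   # assumed values: "success" or "not_found"
    message: str


# Stand-in for the knowledge graph: entity name -> set of related entity names.
GRAPH = {"Google": {"Gmail"}, "Gmail": {"Google"}}


async def adelete_by_entity(entity_name: str) -> DeletionResult:
    """Remove an entity and every edge that touches it."""
    if entity_name not in GRAPH:
        return DeletionResult("not_found", f"Entity '{entity_name}' not found.")
    for neighbor in GRAPH.pop(entity_name):
        GRAPH.get(neighbor, set()).discard(entity_name)
    return DeletionResult("success", f"Entity '{entity_name}' deleted.")


def delete_by_entity(entity_name: str) -> DeletionResult:
    """Synchronous wrapper: run the coroutine to completion and return its result."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(adelete_by_entity(entity_name))
    finally:
        loop.close()


result = delete_by_entity("Google")
print(result.status, "-", result.message)
```

Returning a result object lets callers distinguish a successful deletion from a missing entity without catching exceptions, which a `None` return could not express.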
|
|||
|
|
@@ -50,7 +50,7 @@ async def _ollama_model_if_cache(
     kwargs.pop("max_tokens", None)
     # kwargs.pop("response_format", None) # allow json
     host = kwargs.pop("host", None)
-    timeout = kwargs.pop("timeout", None) or 300  # Default timeout 300s
+    timeout = kwargs.pop("timeout", None) or 600  # Default timeout 600s
     kwargs.pop("hashing_kv", None)
     api_key = kwargs.pop("api_key", None)
     headers = {
@@ -146,7 +146,7 @@ async def ollama_embed(texts: list[str], embed_model, **kwargs) -> np.ndarray:
         headers["Authorization"] = f"Bearer {api_key}"
 
     host = kwargs.pop("host", None)
-    timeout = kwargs.pop("timeout", None) or 90  # Default time out 90s
+    timeout = kwargs.pop("timeout", None) or 300  # Default time out 300s
 
     ollama_client = ollama.AsyncClient(host=host, timeout=timeout, headers=headers)
 
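Both Ollama hunks rely on the `kwargs.pop("timeout", None) or <default>` idiom. The sketch below isolates that expression to show how it behaves; the helper name `pop_timeout` and the sample host value are hypothetical and not part of the changed code.

```python
from typing import Any


def pop_timeout(kwargs: dict[str, Any], default: int) -> int:
    """Remove "timeout" from kwargs; fall back to default when it is missing or falsy."""
    return kwargs.pop("timeout", None) or default


opts = {"timeout": 30, "host": "http://localhost:11434"}  # illustrative values
print(pop_timeout(opts, 600))             # an explicit value wins
print(opts)                               # "timeout" has been removed from the dict
print(pop_timeout({}, 600))               # key missing: the default is used
print(pop_timeout({"timeout": 0}, 600))   # 0 is falsy, so the default is used as well
```

Because `or` tests truthiness, a caller cannot request a timeout of `0` through this idiom; an explicit `is None` check would be needed for that.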
|
|||
|
|
@@ -1,513 +0,0 @@
|
|||
# type: ignore
|
||||
"""
|
||||
MinerU Document Parser Utility
|
||||
|
||||
This module provides functionality for parsing PDF, image and office documents using MinerU library,
|
||||
and converts the parsing results into markdown and JSON formats
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
__all__ = ["MineruParser"]
|
||||
|
||||
import os
|
||||
import json
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from typing import (
|
||||
Dict,
|
||||
List,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
Any,
|
||||
TypeVar,
|
||||
cast,
|
||||
TYPE_CHECKING,
|
||||
ClassVar,
|
||||
)
|
||||
|
||||
# Type stubs for magic_pdf
|
||||
FileBasedDataWriter = Any
|
||||
FileBasedDataReader = Any
|
||||
PymuDocDataset = Any
|
||||
InferResult = Any
|
||||
PipeResult = Any
|
||||
SupportedPdfParseMethod = Any
|
||||
doc_analyze = Any
|
||||
read_local_office = Any
|
||||
read_local_images = Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from magic_pdf.data.data_reader_writer import (
|
||||
FileBasedDataWriter,
|
||||
FileBasedDataReader,
|
||||
)
|
||||
from magic_pdf.data.dataset import PymuDocDataset
|
||||
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
|
||||
from magic_pdf.config.enums import SupportedPdfParseMethod
|
||||
from magic_pdf.data.read_api import read_local_office, read_local_images
|
||||
else:
|
||||
# MinerU imports
|
||||
from magic_pdf.data.data_reader_writer import (
|
||||
FileBasedDataWriter,
|
||||
FileBasedDataReader,
|
||||
)
|
||||
from magic_pdf.data.dataset import PymuDocDataset
|
||||
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
|
||||
from magic_pdf.config.enums import SupportedPdfParseMethod
|
||||
from magic_pdf.data.read_api import read_local_office, read_local_images
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
class MineruParser:
|
||||
"""
|
||||
MinerU document parsing utility class
|
||||
|
||||
Supports parsing PDF, image and office documents (like Word, PPT, etc.),
|
||||
converting the content into structured data and generating markdown and JSON output
|
||||
"""
|
||||
|
||||
__slots__: ClassVar[Tuple[str, ...]] = ()
|
||||
|
||||
def __init__(self) -> None:
|
||||
"""Initialize MineruParser"""
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
def safe_write(
|
||||
writer: Any,
|
||||
content: Union[str, bytes, Dict[str, Any], List[Any]],
|
||||
filename: str,
|
||||
) -> None:
|
||||
"""
|
||||
Safely write content to a file, ensuring the filename is valid
|
||||
|
||||
Args:
|
||||
writer: The writer object to use
|
||||
content: The content to write
|
||||
filename: The filename to write to
|
||||
"""
|
||||
# Ensure the filename isn't too long
|
||||
if len(filename) > 200: # Most filesystems have limits around 255 characters
|
||||
# Truncate the filename while keeping the extension
|
||||
base, ext = os.path.splitext(filename)
|
||||
filename = base[:190] + ext # Leave room for the extension and some margin
|
||||
|
||||
# Handle specific content types
|
||||
if isinstance(content, str):
|
||||
# Ensure str content is encoded to bytes if required
|
||||
try:
|
||||
writer.write(content, filename)
|
||||
except TypeError:
|
||||
# If the writer expects bytes, convert string to bytes
|
||||
writer.write(content.encode("utf-8"), filename)
|
||||
else:
|
||||
# For dict/list content, always encode as JSON string first
|
||||
if isinstance(content, (dict, list)):
|
||||
try:
|
||||
writer.write(
|
||||
json.dumps(content, ensure_ascii=False, indent=4), filename
|
||||
)
|
||||
except TypeError:
|
||||
# If the writer expects bytes, convert JSON string to bytes
|
||||
writer.write(
|
||||
json.dumps(content, ensure_ascii=False, indent=4).encode(
|
||||
"utf-8"
|
||||
),
|
||||
filename,
|
||||
)
|
||||
else:
|
||||
# Regular content (assumed to be bytes or compatible)
|
||||
writer.write(content, filename)
|
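The filename guard at the top of `safe_write` can be read in isolation. The following sketch reproduces only that truncation step, using the same 200 and 190 character thresholds; the function name `truncate_filename` and the constant names are assumptions introduced for this example.

```python
import os

MAX_NAME_LEN = 200   # length above which safe_write truncates
KEEP_BASE_LEN = 190  # characters of the base name that are kept


def truncate_filename(filename: str) -> str:
    """Shorten an over-long filename while keeping its extension."""
    if len(filename) > MAX_NAME_LEN:
        base, ext = os.path.splitext(filename)
        filename = base[:KEEP_BASE_LEN] + ext
    return filename


long_name = "a" * 300 + ".json"
short = truncate_filename(long_name)
print(len(short), short[-5:])
```

Note that only the base name is bounded here, so an unusually long extension could still push the result past the limit.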
||||
|
||||
@staticmethod
|
||||
def parse_pdf(
|
||||
pdf_path: Union[str, Path],
|
||||
output_dir: Optional[str] = None,
|
||||
use_ocr: bool = False,
|
||||
) -> Tuple[List[Dict[str, Any]], str]:
|
||||
"""
|
||||
Parse PDF document
|
||||
|
||||
Args:
|
||||
pdf_path: Path to the PDF file
|
||||
output_dir: Output directory path
|
||||
use_ocr: Whether to force OCR parsing
|
||||
|
||||
Returns:
|
||||
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
|
||||
"""
|
||||
try:
|
||||
# Convert to Path object for easier handling
|
||||
pdf_path = Path(pdf_path)
|
||||
name_without_suff = pdf_path.stem
|
||||
|
||||
# Prepare output directories - ensure file name is in path
|
||||
if output_dir:
|
||||
base_output_dir = Path(output_dir)
|
||||
local_md_dir = base_output_dir / name_without_suff
|
||||
else:
|
||||
local_md_dir = pdf_path.parent / name_without_suff
|
||||
|
||||
local_image_dir = local_md_dir / "images"
|
||||
image_dir = local_image_dir.name
|
||||
|
||||
# Create directories
|
||||
os.makedirs(local_image_dir, exist_ok=True)
|
||||
os.makedirs(local_md_dir, exist_ok=True)
|
||||
|
||||
# Initialize writers and reader
|
||||
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
|
||||
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
|
||||
reader = FileBasedDataReader("") # type: ignore
|
||||
|
||||
# Read PDF bytes
|
||||
pdf_bytes = reader.read(str(pdf_path)) # type: ignore
|
||||
|
||||
# Create dataset instance
|
||||
ds = PymuDocDataset(pdf_bytes) # type: ignore
|
||||
|
||||
# Process based on PDF type and user preference
|
||||
if use_ocr or ds.classify() == SupportedPdfParseMethod.OCR: # type: ignore
|
||||
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
|
||||
pipe_result = infer_result.pipe_ocr_mode(image_writer) # type: ignore
|
||||
else:
|
||||
infer_result = ds.apply(doc_analyze, ocr=False) # type: ignore
|
||||
pipe_result = infer_result.pipe_txt_mode(image_writer) # type: ignore
|
||||
|
||||
# Draw visualizations
|
||||
try:
|
||||
infer_result.draw_model(
|
||||
os.path.join(local_md_dir, f"{name_without_suff}_model.pdf")
|
||||
) # type: ignore
|
||||
pipe_result.draw_layout(
|
||||
os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf")
|
||||
) # type: ignore
|
||||
pipe_result.draw_span(
|
||||
os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf")
|
||||
) # type: ignore
|
||||
except Exception as e:
|
||||
print(f"Warning: Failed to draw visualizations: {str(e)}")
|
||||
|
||||
# Get data using API methods
|
||||
md_content = pipe_result.get_markdown(image_dir) # type: ignore
|
||||
content_list = pipe_result.get_content_list(image_dir) # type: ignore
|
||||
|
||||
# Save files using dump methods (consistent with API)
|
||||
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir) # type: ignore
|
||||
pipe_result.dump_content_list(
|
||||
md_writer, f"{name_without_suff}_content_list.json", image_dir
|
||||
) # type: ignore
|
||||
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
|
||||
|
||||
# Save model result - serialize the inference result to a JSON string before writing
|
||||
model_inference_result = infer_result.get_infer_res() # type: ignore
|
||||
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
|
||||
|
||||
try:
|
||||
# Try to write to a file manually to avoid FileBasedDataWriter issues
|
||||
model_file_path = os.path.join(
|
||||
local_md_dir, f"{name_without_suff}_model.json"
|
||||
)
|
||||
with open(model_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(json_str)
|
||||
except Exception as e:
|
||||
print(
|
||||
f"Warning: Failed to save model result using file write: {str(e)}"
|
||||
)
|
||||
try:
|
||||
# If direct file write fails, try using the writer with bytes encoding
|
||||
md_writer.write(
|
||||
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
|
||||
) # type: ignore
|
||||
except Exception as e2:
|
||||
print(
|
||||
f"Warning: Failed to save model result using writer: {str(e2)}"
|
||||
)
|
||||
|
||||
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error in parse_pdf: {str(e)}")
|
||||
raise
|
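All three parse methods derive their output locations the same way: a directory named after the input file's stem, plus an `images` subdirectory. A small sketch of that derivation, with a hypothetical helper name `output_dirs` and made-up paths:

```python
from pathlib import Path
from typing import Optional


def output_dirs(doc_path: str, output_dir: Optional[str] = None) -> tuple[Path, Path]:
    """Return (markdown dir, image dir) for a document, mirroring the parser's layout."""
    doc = Path(doc_path)
    # Use the explicit output directory if given, otherwise the document's own folder.
    md_dir = (Path(output_dir) if output_dir else doc.parent) / doc.stem
    return md_dir, md_dir / "images"


print(output_dirs("/data/report.pdf"))
print(output_dirs("/data/report.pdf", "/out"))
```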
||||
|
||||
@staticmethod
|
||||
def parse_office_doc(
|
||||
doc_path: Union[str, Path], output_dir: Optional[str] = None
|
||||
) -> Tuple[List[Dict[str, Any]], str]:
|
||||
"""
|
||||
Parse office document (Word, PPT, etc.)
|
||||
|
||||
Args:
|
||||
doc_path: Path to the document file
|
||||
output_dir: Output directory path
|
||||
|
||||
Returns:
|
||||
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
|
||||
"""
|
||||
try:
|
||||
# Convert to Path object for easier handling
|
||||
doc_path = Path(doc_path)
|
||||
name_without_suff = doc_path.stem
|
||||
|
||||
# Prepare output directories - ensure file name is in path
|
||||
if output_dir:
|
||||
base_output_dir = Path(output_dir)
|
||||
local_md_dir = base_output_dir / name_without_suff
|
||||
else:
|
||||
local_md_dir = doc_path.parent / name_without_suff
|
||||
|
||||
local_image_dir = local_md_dir / "images"
|
||||
image_dir = local_image_dir.name
|
||||
|
||||
# Create directories
|
||||
os.makedirs(local_image_dir, exist_ok=True)
|
||||
os.makedirs(local_md_dir, exist_ok=True)
|
||||
|
||||
# Initialize writers
|
||||
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
|
||||
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
|
||||
|
||||
# Read office document
|
||||
ds = read_local_office(str(doc_path))[0] # type: ignore
|
||||
|
||||
# Apply chain of operations according to API documentation
|
||||
# This follows the pattern shown in MS-Office example in the API docs
|
||||
ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
|
||||
md_writer, f"{name_without_suff}.md", image_dir
|
||||
) # type: ignore
|
||||
|
||||
# Re-run the analysis to obtain the content data
|
||||
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
|
||||
pipe_result = infer_result.pipe_txt_mode(image_writer) # type: ignore
|
||||
|
||||
# Get data for return values and additional outputs
|
||||
md_content = pipe_result.get_markdown(image_dir) # type: ignore
|
||||
content_list = pipe_result.get_content_list(image_dir) # type: ignore
|
||||
|
||||
# Save additional output files
|
||||
pipe_result.dump_content_list(
|
||||
md_writer, f"{name_without_suff}_content_list.json", image_dir
|
||||
) # type: ignore
|
||||
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
|
||||
|
||||
# Save model result - serialize the inference result to a JSON string before writing
|
||||
model_inference_result = infer_result.get_infer_res() # type: ignore
|
||||
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
|
||||
|
||||
try:
|
||||
# Try to write to a file manually to avoid FileBasedDataWriter issues
|
||||
model_file_path = os.path.join(
|
||||
local_md_dir, f"{name_without_suff}_model.json"
|
||||
)
|
||||
with open(model_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(json_str)
|
||||
except Exception as e:
|
||||
print(
|
||||
f"Warning: Failed to save model result using file write: {str(e)}"
|
||||
)
|
||||
try:
|
||||
# If direct file write fails, try using the writer with bytes encoding
|
||||
md_writer.write(
|
||||
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
|
||||
) # type: ignore
|
||||
except Exception as e2:
|
||||
print(
|
||||
f"Warning: Failed to save model result using writer: {str(e2)}"
|
||||
)
|
||||
|
||||
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error in parse_office_doc: {str(e)}")
|
||||
raise
|
||||
|
||||
@staticmethod
|
||||
def parse_image(
|
||||
image_path: Union[str, Path], output_dir: Optional[str] = None
|
||||
) -> Tuple[List[Dict[str, Any]], str]:
|
||||
"""
|
||||
Parse image document
|
||||
|
||||
Args:
|
||||
image_path: Path to the image file
|
||||
output_dir: Output directory path
|
||||
|
||||
Returns:
|
||||
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
|
||||
"""
|
||||
try:
|
||||
# Convert to Path object for easier handling
|
||||
image_path = Path(image_path)
|
||||
name_without_suff = image_path.stem
|
||||
|
||||
# Prepare output directories - ensure file name is in path
|
||||
if output_dir:
|
||||
base_output_dir = Path(output_dir)
|
||||
local_md_dir = base_output_dir / name_without_suff
|
||||
else:
|
||||
local_md_dir = image_path.parent / name_without_suff
|
||||
|
||||
local_image_dir = local_md_dir / "images"
|
||||
image_dir = local_image_dir.name
|
||||
|
||||
# Create directories
|
||||
os.makedirs(local_image_dir, exist_ok=True)
|
||||
os.makedirs(local_md_dir, exist_ok=True)
|
||||
|
||||
# Initialize writers
|
||||
image_writer = FileBasedDataWriter(str(local_image_dir)) # type: ignore
|
||||
md_writer = FileBasedDataWriter(str(local_md_dir)) # type: ignore
|
||||
|
||||
# Read image
|
||||
ds = read_local_images(str(image_path))[0] # type: ignore
|
||||
|
||||
# Apply chain of operations according to API documentation
|
||||
# This follows the pattern shown in Image example in the API docs
|
||||
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
|
||||
md_writer, f"{name_without_suff}.md", image_dir
|
||||
) # type: ignore
|
||||
|
||||
# Re-run the analysis to obtain the content data
|
||||
infer_result = ds.apply(doc_analyze, ocr=True) # type: ignore
|
||||
pipe_result = infer_result.pipe_ocr_mode(image_writer) # type: ignore
|
||||
|
||||
# Get data for return values and additional outputs
|
||||
md_content = pipe_result.get_markdown(image_dir) # type: ignore
|
||||
content_list = pipe_result.get_content_list(image_dir) # type: ignore
|
||||
|
||||
# Save additional output files
|
||||
pipe_result.dump_content_list(
|
||||
md_writer, f"{name_without_suff}_content_list.json", image_dir
|
||||
) # type: ignore
|
||||
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json") # type: ignore
|
||||
|
||||
# Save model result - serialize the inference result to a JSON string before writing
|
||||
model_inference_result = infer_result.get_infer_res() # type: ignore
|
||||
json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
|
||||
|
||||
try:
|
||||
# Try to write to a file manually to avoid FileBasedDataWriter issues
|
||||
model_file_path = os.path.join(
|
||||
local_md_dir, f"{name_without_suff}_model.json"
|
||||
)
|
||||
with open(model_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(json_str)
|
||||
except Exception as e:
|
||||
print(
|
||||
f"Warning: Failed to save model result using file write: {str(e)}"
|
||||
)
|
||||
try:
|
||||
# If direct file write fails, try using the writer with bytes encoding
|
||||
md_writer.write(
|
||||
json_str.encode("utf-8"), f"{name_without_suff}_model.json"
|
||||
) # type: ignore
|
||||
except Exception as e2:
|
||||
print(
|
||||
f"Warning: Failed to save model result using writer: {str(e2)}"
|
||||
)
|
||||
|
||||
return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error in parse_image: {str(e)}")
|
||||
raise
|
||||
|
||||
@staticmethod
|
||||
def parse_document(
|
||||
file_path: Union[str, Path],
|
||||
parse_method: str = "auto",
|
||||
output_dir: Optional[str] = None,
|
||||
save_results: bool = True,
|
||||
) -> Tuple[List[Dict[str, Any]], str]:
|
||||
"""
|
||||
Parse document using MinerU based on file extension
|
||||
|
||||
Args:
|
||||
file_path: Path to the file to be parsed
|
||||
parse_method: Parsing method, supports "auto", "ocr", "txt", default is "auto"
|
||||
output_dir: Output directory path, if None, use the directory of the input file
|
||||
save_results: Whether to save parsing results to files
|
||||
|
||||
Returns:
|
||||
Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
|
||||
"""
|
||||
# Convert to Path object
|
||||
file_path = Path(file_path)
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"File does not exist: {file_path}")
|
||||
|
||||
# Get file extension
|
||||
ext = file_path.suffix.lower()
|
||||
|
||||
# Choose appropriate parser based on file type
|
||||
if ext in [".pdf"]:
|
||||
return MineruParser.parse_pdf(
|
||||
file_path, output_dir, use_ocr=(parse_method == "ocr")
|
||||
)
|
||||
elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
|
||||
return MineruParser.parse_image(file_path, output_dir)
|
||||
elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
|
||||
return MineruParser.parse_office_doc(file_path, output_dir)
|
||||
else:
|
||||
# For unsupported file types, default to PDF parsing
|
||||
print(
|
||||
f"Warning: Unsupported file extension '{ext}', trying generic PDF parser"
|
||||
)
|
||||
return MineruParser.parse_pdf(
|
||||
file_path, output_dir, use_ocr=(parse_method == "ocr")
|
||||
)
|
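The dispatch in `parse_document` is driven purely by the lowercased file extension, with unknown extensions falling back to the PDF parser. The sketch below models only that decision and returns the name of the method that would be called; `pick_parser` and the two set constants are hypothetical names used for illustration.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def pick_parser(file_path: str) -> str:
    """Return the name of the method that parse_document would dispatch to."""
    ext = Path(file_path).suffix.lower()
    if ext == ".pdf":
        return "parse_pdf"
    if ext in IMAGE_EXTS:
        return "parse_image"
    if ext in OFFICE_EXTS:
        return "parse_office_doc"
    # Unknown extensions fall back to the PDF parser, as in the removed code.
    return "parse_pdf"


for name in ("paper.pdf", "scan/B.PNG", "slides.pptx", "notes.txt"):
    print(name, "->", pick_parser(name))
```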
||||
|
||||
|
||||
def main():
|
||||
"""
|
||||
Main function to run the MinerU parser from command line
|
||||
"""
|
||||
parser = argparse.ArgumentParser(description="Parse documents using MinerU")
|
||||
parser.add_argument("file_path", help="Path to the document to parse")
|
||||
parser.add_argument("--output", "-o", help="Output directory path")
|
||||
parser.add_argument(
|
||||
"--method",
|
||||
"-m",
|
||||
choices=["auto", "ocr", "txt"],
|
||||
default="auto",
|
||||
help="Parsing method (auto, ocr, txt)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--stats", action="store_true", help="Display content statistics"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
try:
|
||||
# Parse the document
|
||||
content_list, md_content = MineruParser.parse_document(
|
||||
file_path=args.file_path, parse_method=args.method, output_dir=args.output
|
||||
)
|
||||
|
||||
# Display statistics if requested
|
||||
if args.stats:
|
||||
print("\nDocument Statistics:")
|
||||
print(f"Total content blocks: {len(content_list)}")
|
||||
|
||||
# Count different types of content
|
||||
content_types = {}
|
||||
for item in content_list:
|
||||
content_type = item.get("type", "unknown")
|
||||
content_types[content_type] = content_types.get(content_type, 0) + 1
|
||||
|
||||
print("\nContent Type Distribution:")
|
||||
for content_type, count in content_types.items():
|
||||
print(f"- {content_type}: {count}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {str(e)}")
|
||||
return 1
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
exit(main())
|
||||
|
|
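The `--stats` branch of `main()` tallies content blocks by their `type` field with a hand-rolled dictionary. An equivalent tally can be sketched with `collections.Counter`; the sample `content_list` below is invented for the example and only assumes that each item is a dict with an optional `type` key.

```python
from collections import Counter

# Hypothetical content list shaped like the parser output.
content_list = [
    {"type": "text"},
    {"type": "image"},
    {"type": "text"},
    {"text": "item without a type field"},
]

# Items without a "type" key are counted as "unknown", as in main().
content_types = Counter(item.get("type", "unknown") for item in content_list)

print(f"Total content blocks: {len(content_list)}")
for content_type, count in content_types.items():
    print(f"- {content_type}: {count}")
```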
@@ -1,699 +0,0 @@
|
|||
"""
|
||||
Specialized processors for different modalities
|
||||
|
||||
Includes:
|
||||
- ImageModalProcessor: Specialized processor for image content
|
||||
- TableModalProcessor: Specialized processor for table content
|
||||
- EquationModalProcessor: Specialized processor for equation content
|
||||
- GenericModalProcessor: Processor for other modal content
|
||||
"""
|
||||
|
||||
import re
|
||||
import json
|
||||
import time
|
||||
import asyncio
|
||||
import base64
|
||||
from typing import Dict, Any, Tuple, cast
|
||||
from pathlib import Path
|
||||
|
||||
from lightrag.base import StorageNameSpace
|
||||
from lightrag.utils import (
|
||||
logger,
|
||||
compute_mdhash_id,
|
||||
)
|
||||
from lightrag.lightrag import LightRAG
|
||||
from dataclasses import asdict
|
||||
from lightrag.kg.shared_storage import get_namespace_data, get_pipeline_status_lock
|
||||
|
||||
|
||||
class BaseModalProcessor:
|
||||
"""Base class for modal processors"""
|
||||
|
||||
def __init__(self, lightrag: LightRAG, modal_caption_func):
|
||||
"""Initialize base processor
|
||||
|
||||
Args:
|
||||
lightrag: LightRAG instance
|
||||
modal_caption_func: Function for generating descriptions
|
||||
"""
|
||||
self.lightrag = lightrag
|
||||
self.modal_caption_func = modal_caption_func
|
||||
|
||||
# Use LightRAG's storage instances
|
||||
self.text_chunks_db = lightrag.text_chunks
|
||||
self.chunks_vdb = lightrag.chunks_vdb
|
||||
self.entities_vdb = lightrag.entities_vdb
|
||||
self.relationships_vdb = lightrag.relationships_vdb
|
||||
self.knowledge_graph_inst = lightrag.chunk_entity_relation_graph
|
||||
|
||||
# Use LightRAG's configuration and functions
|
||||
self.embedding_func = lightrag.embedding_func
|
||||
self.llm_model_func = lightrag.llm_model_func
|
||||
self.global_config = asdict(lightrag)
|
||||
self.hashing_kv = lightrag.llm_response_cache
|
||||
self.tokenizer = lightrag.tokenizer
|
||||
|
||||
async def process_multimodal_content(
|
||||
self,
|
||||
modal_content,
|
||||
content_type: str,
|
||||
file_path: str = "manual_creation",
|
||||
entity_name: str = None,
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Process multimodal content"""
|
||||
# Subclasses need to implement specific processing logic
|
||||
raise NotImplementedError("Subclasses must implement this method")
|
||||
|
||||
async def _create_entity_and_chunk(
|
||||
self, modal_chunk: str, entity_info: Dict[str, Any], file_path: str
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Create entity and text chunk"""
|
||||
# Create chunk
|
||||
chunk_id = compute_mdhash_id(str(modal_chunk), prefix="chunk-")
|
||||
tokens = len(self.tokenizer.encode(modal_chunk))
|
||||
|
||||
chunk_data = {
|
||||
"tokens": tokens,
|
||||
"content": modal_chunk,
|
||||
"chunk_order_index": 0,
|
||||
"full_doc_id": chunk_id,
|
||||
"file_path": file_path,
|
||||
}
|
||||
|
||||
# Store chunk
|
||||
await self.text_chunks_db.upsert({chunk_id: chunk_data})
|
||||
|
||||
# Create entity node
|
||||
node_data = {
|
||||
"entity_id": entity_info["entity_name"],
|
||||
"entity_type": entity_info["entity_type"],
|
||||
"description": entity_info["summary"],
|
||||
"source_id": chunk_id,
|
||||
"file_path": file_path,
|
||||
"created_at": int(time.time()),
|
||||
}
|
||||
|
||||
await self.knowledge_graph_inst.upsert_node(
|
||||
entity_info["entity_name"], node_data
|
||||
)
|
||||
|
||||
# Insert entity into vector database
|
||||
entity_vdb_data = {
|
||||
compute_mdhash_id(entity_info["entity_name"], prefix="ent-"): {
|
||||
"entity_name": entity_info["entity_name"],
|
||||
"entity_type": entity_info["entity_type"],
|
||||
"content": f"{entity_info['entity_name']}\n{entity_info['summary']}",
|
||||
"source_id": chunk_id,
|
||||
"file_path": file_path,
|
||||
}
|
||||
}
|
||||
await self.entities_vdb.upsert(entity_vdb_data)
|
||||
|
||||
# Process entity and relationship extraction
|
||||
await self._process_chunk_for_extraction(chunk_id, entity_info["entity_name"])
|
||||
|
||||
# Ensure all storage updates are complete
|
||||
await self._insert_done()
|
||||
|
||||
return entity_info["summary"], {
|
||||
"entity_name": entity_info["entity_name"],
|
||||
"entity_type": entity_info["entity_type"],
|
||||
"description": entity_info["summary"],
|
||||
"chunk_id": chunk_id,
|
||||
}
|
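`_create_entity_and_chunk` keys both the chunk and the entity record by a hash of their text, so identical content always maps to the same ID. The sketch below assumes `compute_mdhash_id` is a prefix plus a hex MD5 digest; that implementation is not shown in this diff and is reproduced here only as an assumption.

```python
from hashlib import md5


def compute_mdhash_id(content: str, prefix: str = "") -> str:
    """Assumed implementation: prefix followed by the hex MD5 digest of the content."""
    return prefix + md5(content.encode()).hexdigest()


chunk_id = compute_mdhash_id("Image Content Analysis: ...", prefix="chunk-")
entity_key = compute_mdhash_id("Figure 1 (image)", prefix="ent-")

print(chunk_id)
print(entity_key)
```

With content-derived IDs, upserting the same modal chunk twice overwrites the existing record instead of creating a duplicate.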
||||
|
||||
async def _process_chunk_for_extraction(
|
||||
self, chunk_id: str, modal_entity_name: str
|
||||
):
|
||||
"""Process chunk for entity and relationship extraction"""
|
||||
chunk_data = await self.text_chunks_db.get_by_id(chunk_id)
|
||||
if not chunk_data:
|
||||
logger.error(f"Chunk {chunk_id} not found")
|
||||
return
|
||||
|
||||
# Create text chunk for vector database
|
||||
chunk_vdb_data = {
|
||||
chunk_id: {
|
||||
"content": chunk_data["content"],
|
||||
"full_doc_id": chunk_id,
|
||||
"tokens": chunk_data["tokens"],
|
||||
"chunk_order_index": chunk_data["chunk_order_index"],
|
||||
"file_path": chunk_data["file_path"],
|
||||
}
|
||||
}
|
||||
|
||||
await self.chunks_vdb.upsert(chunk_vdb_data)
|
||||
|
||||
# Trigger extraction process
|
||||
from lightrag.operate import extract_entities, merge_nodes_and_edges
|
||||
|
||||
pipeline_status = await get_namespace_data("pipeline_status")
|
||||
pipeline_status_lock = get_pipeline_status_lock()
|
||||
|
||||
# Prepare chunk for extraction
|
||||
chunks = {chunk_id: chunk_data}
|
||||
|
||||
# Extract entities and relationships
|
||||
chunk_results = await extract_entities(
|
||||
chunks=chunks,
|
||||
global_config=self.global_config,
|
||||
pipeline_status=pipeline_status,
|
||||
pipeline_status_lock=pipeline_status_lock,
|
||||
llm_response_cache=self.hashing_kv,
|
||||
)
|
||||
|
||||
# Add "belongs_to" relationships for all extracted entities
|
||||
for maybe_nodes, _ in chunk_results:
|
||||
for entity_name in maybe_nodes.keys():
|
||||
if entity_name != modal_entity_name: # Skip self-relationship
|
||||
# Create belongs_to relationship
|
||||
relation_data = {
|
||||
"description": f"Entity {entity_name} belongs to {modal_entity_name}",
|
||||
"keywords": "belongs_to,part_of,contained_in",
|
||||
"source_id": chunk_id,
|
||||
"weight": 10.0,
|
||||
"file_path": chunk_data.get("file_path", "manual_creation"),
|
||||
}
|
||||
await self.knowledge_graph_inst.upsert_edge(
|
||||
entity_name, modal_entity_name, relation_data
|
||||
)
|
||||
|
||||
relation_id = compute_mdhash_id(
|
||||
entity_name + modal_entity_name, prefix="rel-"
|
||||
)
|
||||
relation_vdb_data = {
|
||||
relation_id: {
|
||||
"src_id": entity_name,
|
||||
"tgt_id": modal_entity_name,
|
||||
"keywords": relation_data["keywords"],
|
||||
"content": f"{relation_data['keywords']}\t{entity_name}\n{modal_entity_name}\n{relation_data['description']}",
|
||||
"source_id": chunk_id,
|
||||
"file_path": chunk_data.get("file_path", "manual_creation"),
|
||||
}
|
||||
}
|
||||
await self.relationships_vdb.upsert(relation_vdb_data)
|
||||
|
||||
await merge_nodes_and_edges(
|
||||
chunk_results=chunk_results,
|
||||
knowledge_graph_inst=self.knowledge_graph_inst,
|
||||
entity_vdb=self.entities_vdb,
|
||||
relationships_vdb=self.relationships_vdb,
|
||||
global_config=self.global_config,
|
||||
pipeline_status=pipeline_status,
|
||||
pipeline_status_lock=pipeline_status_lock,
|
||||
llm_response_cache=self.hashing_kv,
|
||||
)
|
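For every extracted entity other than the modal entity itself, `_process_chunk_for_extraction` writes a `belongs_to` edge to the graph and a matching record to the relationship vector store. The sketch below builds those two payloads as plain dicts without touching any storage; `build_belongs_to` is a hypothetical helper, and the MD5-based `rel-` ID is an assumption about `compute_mdhash_id`.

```python
from hashlib import md5
from typing import Any


def build_belongs_to(
    entity: str, modal_entity: str, chunk_id: str, file_path: str = "manual_creation"
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (graph edge data, vector-store record) for one belongs_to link."""
    edge = {
        "description": f"Entity {entity} belongs to {modal_entity}",
        "keywords": "belongs_to,part_of,contained_in",
        "source_id": chunk_id,
        "weight": 10.0,
        "file_path": file_path,
    }
    # Assumed ID scheme: "rel-" + MD5 of the concatenated endpoint names.
    relation_id = "rel-" + md5((entity + modal_entity).encode()).hexdigest()
    record = {
        relation_id: {
            "src_id": entity,
            "tgt_id": modal_entity,
            "keywords": edge["keywords"],
            "content": f"{edge['keywords']}\t{entity}\n{modal_entity}\n{edge['description']}",
            "source_id": chunk_id,
            "file_path": file_path,
        }
    }
    return edge, record


edge, record = build_belongs_to("Axis label", "Figure 1 (image)", "chunk-abc")
print(edge["description"])
```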
||||
|
||||
async def _insert_done(self) -> None:
|
||||
await asyncio.gather(
|
||||
*[
|
||||
cast(StorageNameSpace, storage_inst).index_done_callback()
|
||||
for storage_inst in [
|
||||
self.text_chunks_db,
|
||||
self.chunks_vdb,
|
||||
self.entities_vdb,
|
||||
self.relationships_vdb,
|
||||
self.knowledge_graph_inst,
|
||||
]
|
||||
]
|
||||
)
|
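`_insert_done` flushes all five storages concurrently by gathering their `index_done_callback()` coroutines. A minimal, self-contained sketch of that pattern follows; `FakeStorage` is a hypothetical stand-in for a storage backend, not a LightRAG class.

```python
import asyncio


class FakeStorage:
    """Hypothetical stand-in for a storage backend exposing index_done_callback()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.flushed = False

    async def index_done_callback(self) -> None:
        await asyncio.sleep(0)  # pretend to persist pending writes
        self.flushed = True


async def insert_done(storages: list[FakeStorage]) -> None:
    # Start every flush concurrently and wait until all of them have finished.
    await asyncio.gather(*[s.index_done_callback() for s in storages])


storages = [FakeStorage(n) for n in ("text_chunks", "chunks_vdb", "entities_vdb")]
asyncio.run(insert_done(storages))
print([s.flushed for s in storages])
```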
||||
|
||||
|
||||
class ImageModalProcessor(BaseModalProcessor):
|
||||
"""Processor specialized for image content"""
|
||||
|
||||
def __init__(self, lightrag: LightRAG, modal_caption_func):
|
||||
"""Initialize image processor
|
||||
|
||||
Args:
|
||||
lightrag: LightRAG instance
|
||||
modal_caption_func: Function for generating descriptions (supporting image understanding)
|
||||
"""
|
||||
super().__init__(lightrag, modal_caption_func)
|
||||
|
||||
def _encode_image_to_base64(self, image_path: str) -> str:
|
||||
"""Encode image to base64"""
|
||||
try:
|
||||
with open(image_path, "rb") as image_file:
|
||||
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
|
||||
return encoded_string
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to encode image {image_path}: {e}")
|
||||
return ""
|
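`_encode_image_to_base64` reads the image bytes and returns them as a base64 string, or an empty string on failure, which the caller then treats as "no image available". The sketch below shows the same behavior on a temporary file; `encode_file_to_base64` is a hypothetical name, and it catches only `OSError` rather than every exception as the removed method does.

```python
import base64
import tempfile
from pathlib import Path


def encode_file_to_base64(path: str) -> str:
    """Return the file's bytes as a base64 string, or "" if the file cannot be read."""
    try:
        return base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    except OSError:
        return ""


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pixel.bin"
    sample.write_bytes(b"\x89PNG\r\n")  # a few bytes standing in for image data
    encoded = encode_file_to_base64(str(sample))
    missing = encode_file_to_base64(str(Path(tmp) / "nope.png"))

print(repr(encoded), repr(missing))
```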
||||
|
||||
async def process_multimodal_content(
|
||||
self,
|
||||
modal_content,
|
||||
content_type: str,
|
||||
file_path: str = "manual_creation",
|
||||
entity_name: str = None,
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Process image content"""
|
||||
try:
|
||||
# Parse image content
|
||||
if isinstance(modal_content, str):
|
||||
try:
|
||||
content_data = json.loads(modal_content)
|
||||
except json.JSONDecodeError:
|
||||
content_data = {"description": modal_content}
|
||||
else:
|
||||
content_data = modal_content
|
||||
|
||||
image_path = content_data.get("img_path")
|
||||
captions = content_data.get("img_caption", [])
|
||||
footnotes = content_data.get("img_footnote", [])
|
||||
|
||||
# Build detailed visual analysis prompt
|
||||
vision_prompt = f"""Please analyze this image in detail and provide a JSON response with the following structure:
|
||||
|
||||
{{
|
||||
"detailed_description": "A comprehensive and detailed visual description of the image following these guidelines:
|
||||
- Describe the overall composition and layout
|
||||
- Identify all objects, people, text, and visual elements
|
||||
- Explain relationships between elements
|
||||
- Note colors, lighting, and visual style
|
||||
- Describe any actions or activities shown
|
||||
- Include technical details if relevant (charts, diagrams, etc.)
|
||||
- Always use specific names instead of pronouns",
|
||||
"entity_info": {{
|
||||
"entity_name": "{entity_name if entity_name else 'unique descriptive name for this image'}",
|
||||
"entity_type": "image",
|
||||
"summary": "concise summary of the image content and its significance (max 100 words)"
|
||||
}}
|
||||
}}
|
||||
|
||||
Additional context:
|
||||
- Image Path: {image_path}
|
||||
- Captions: {captions if captions else 'None'}
|
||||
- Footnotes: {footnotes if footnotes else 'None'}
|
||||
|
||||
Focus on providing accurate, detailed visual analysis that would be useful for knowledge retrieval."""
|
||||
|
||||
# If image path exists, try to encode image
|
||||
image_base64 = ""
|
||||
if image_path and Path(image_path).exists():
|
||||
image_base64 = self._encode_image_to_base64(image_path)
|
||||
|
||||
# Call vision model
|
||||
if image_base64:
|
||||
# Use real image for analysis
|
||||
response = await self.modal_caption_func(
|
||||
vision_prompt,
|
||||
image_data=image_base64,
|
||||
system_prompt="You are an expert image analyst. Provide detailed, accurate descriptions.",
|
||||
)
|
||||
else:
|
||||
# Analyze based on existing text information
|
||||
text_prompt = f"""Based on the following image information, provide analysis:
|
||||
|
||||
Image Path: {image_path}
|
||||
Captions: {captions}
|
||||
Footnotes: {footnotes}
|
||||
|
||||
{vision_prompt}"""
|
||||
|
||||
response = await self.modal_caption_func(
|
||||
text_prompt,
|
||||
system_prompt="You are an expert image analyst. Provide detailed analysis based on available information.",
|
||||
)
|
||||
|
||||
# Parse response
|
||||
enhanced_caption, entity_info = self._parse_response(response, entity_name)
|
||||
|
||||
# Build complete image content
|
||||
modal_chunk = f"""
|
||||
Image Content Analysis:
|
||||
Image Path: {image_path}
|
||||
Captions: {', '.join(captions) if captions else 'None'}
|
||||
Footnotes: {', '.join(footnotes) if footnotes else 'None'}
|
||||
|
||||
Visual Analysis: {enhanced_caption}"""
|
||||
|
||||
return await self._create_entity_and_chunk(
|
||||
modal_chunk, entity_info, file_path
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing image content: {e}")
|
||||
# Fallback processing
|
||||
fallback_entity = {
|
||||
"entity_name": entity_name
|
||||
if entity_name
|
||||
else f"image_{compute_mdhash_id(str(modal_content))}",
|
||||
"entity_type": "image",
|
||||
"summary": f"Image content: {str(modal_content)[:100]}",
|
||||
}
|
||||
return str(modal_content), fallback_entity
|
||||
|
||||
def _parse_response(
|
||||
self, response: str, entity_name: str = None
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Parse model response"""
|
||||
try:
|
||||
response_data = json.loads(
|
||||
re.search(r"\{.*\}", response, re.DOTALL).group(0)
|
||||
)
|
||||
|
||||
description = response_data.get("detailed_description", "")
|
||||
entity_data = response_data.get("entity_info", {})
|
||||
|
||||
if not description or not entity_data:
|
||||
raise ValueError("Missing required fields in response")
|
||||
|
||||
if not all(
|
||||
key in entity_data for key in ["entity_name", "entity_type", "summary"]
|
||||
):
|
||||
raise ValueError("Missing required fields in entity_info")
|
||||
|
||||
entity_data["entity_name"] = (
|
||||
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
||||
)
|
||||
if entity_name:
|
||||
entity_data["entity_name"] = entity_name
|
||||
|
||||
return description, entity_data
|
||||
|
||||
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
||||
logger.error(f"Error parsing image analysis response: {e}")
|
||||
fallback_entity = {
|
||||
"entity_name": entity_name
|
||||
if entity_name
|
||||
else f"image_{compute_mdhash_id(response)}",
|
||||
"entity_type": "image",
|
||||
"summary": response[:100] + "..." if len(response) > 100 else response,
|
||||
}
|
||||
return response, fallback_entity
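The parse-then-fall-back pattern in `_parse_response` can be tried in isolation. The sketch below is a standalone approximation, not the class method itself: `parse_analysis_response` and its `default_type` parameter are hypothetical names, and `hashlib.md5` stands in for `compute_mdhash_id`, whose exact output format is not shown here.

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional, Tuple

REQUIRED_KEYS = ("entity_name", "entity_type", "summary")


def parse_analysis_response(
    response: str, entity_name: Optional[str] = None, default_type: str = "image"
) -> Tuple[str, Dict[str, Any]]:
    """Extract the JSON object from a model reply, falling back to raw text."""
    try:
        # Greedy match from the first "{" to the last "}" in the reply.
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in response")
        data = json.loads(match.group(0))
        description = data.get("detailed_description", "")
        entity = data.get("entity_info", {})
        if not description or not isinstance(entity, dict) or not entity:
            raise ValueError("missing required fields in response")
        if not all(key in entity for key in REQUIRED_KEYS):
            raise ValueError("missing required fields in entity_info")
        # A caller-supplied name wins; otherwise tag the model's name with its type.
        entity["entity_name"] = (
            entity_name or f"{entity['entity_name']} ({entity['entity_type']})"
        )
        return description, entity
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        digest = hashlib.md5(response.encode("utf-8")).hexdigest()
        summary = response[:100] + "..." if len(response) > 100 else response
        fallback = {
            "entity_name": entity_name or f"{default_type}_{digest}",
            "entity_type": default_type,
            "summary": summary,
        }
        return response, fallback
```

Because the regular expression is greedy, a reply containing two separate JSON objects would be matched as one span and fail to parse, which then takes the fallback path.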
|
||||
|
||||
|
||||
class TableModalProcessor(BaseModalProcessor):
|
||||
"""Processor specialized for table content"""
|
||||
|
||||
async def process_multimodal_content(
|
||||
self,
|
||||
modal_content,
|
||||
content_type: str,
|
||||
file_path: str = "manual_creation",
|
||||
entity_name: str = None,
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Process table content"""
|
||||
# Parse table content
|
||||
if isinstance(modal_content, str):
|
||||
try:
|
||||
content_data = json.loads(modal_content)
|
||||
except json.JSONDecodeError:
|
||||
content_data = {"table_body": modal_content}
|
||||
else:
|
||||
content_data = modal_content
|
||||
|
||||
table_img_path = content_data.get("img_path")
|
||||
table_caption = content_data.get("table_caption", [])
|
||||
table_body = content_data.get("table_body", "")
|
||||
table_footnote = content_data.get("table_footnote", [])
|
||||
|
||||
# Build table analysis prompt
|
||||
table_prompt = f"""Please analyze this table content and provide a JSON response with the following structure:
|
||||
|
||||
{{
|
||||
"detailed_description": "A comprehensive analysis of the table including:
|
||||
- Table structure and organization
|
||||
- Column headers and their meanings
|
||||
- Key data points and patterns
|
||||
- Statistical insights and trends
|
||||
- Relationships between data elements
|
||||
- Significance of the data presented
|
||||
Always use specific names and values instead of general references.",
|
||||
"entity_info": {{
|
||||
"entity_name": "{entity_name if entity_name else 'descriptive name for this table'}",
|
||||
"entity_type": "table",
|
||||
"summary": "concise summary of the table's purpose and key findings (max 100 words)"
|
||||
}}
|
||||
}}
|
||||
|
||||
Table Information:
|
||||
Image Path: {table_img_path}
|
||||
Caption: {table_caption if table_caption else 'None'}
|
||||
Body: {table_body}
|
||||
Footnotes: {table_footnote if table_footnote else 'None'}
|
||||
|
||||
Focus on extracting meaningful insights and relationships from the tabular data."""
|
||||
|
||||
response = await self.modal_caption_func(
|
||||
table_prompt,
|
||||
system_prompt="You are an expert data analyst. Provide detailed table analysis with specific insights.",
|
||||
)
|
||||
|
||||
# Parse response
|
||||
enhanced_caption, entity_info = self._parse_table_response(
|
||||
response, entity_name
|
||||
)
|
||||
|
||||
# TODO: Add Retry Mechanism
|
||||
|
||||
# Build complete table content
|
||||
modal_chunk = f"""Table Analysis:
|
||||
Image Path: {table_img_path}
|
||||
Caption: {', '.join(table_caption) if table_caption else 'None'}
|
||||
Structure: {table_body}
|
||||
Footnotes: {', '.join(table_footnote) if table_footnote else 'None'}
|
||||
|
||||
Analysis: {enhanced_caption}"""
|
||||
|
||||
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
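The table and equation processors both start by normalizing `modal_content` into a dict: JSON strings are decoded, other strings are wrapped under a single key, and dicts pass through. A minimal sketch of that step follows; `normalize_modal_content` and `fallback_key` are hypothetical names. It also guards one case the code above does not appear to handle: a string such as `"123"` is valid JSON but decodes to a non-dict, on which `.get` would fail.

```python
import json
from typing import Any, Dict, Union


def normalize_modal_content(
    modal_content: Union[str, Dict[str, Any]], fallback_key: str
) -> Dict[str, Any]:
    """Return modal content as a dict, wrapping raw strings under fallback_key."""
    if not isinstance(modal_content, str):
        return modal_content
    try:
        parsed = json.loads(modal_content)
    except json.JSONDecodeError:
        return {fallback_key: modal_content}
    # Valid JSON that is not an object (number, list, ...) is treated as raw text.
    return parsed if isinstance(parsed, dict) else {fallback_key: modal_content}
```

For a table, `fallback_key` would be `"table_body"`, matching the key that `process_multimodal_content` reads afterwards.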
|
||||
|
||||
def _parse_table_response(
|
||||
self, response: str, entity_name: str = None
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Parse table analysis response"""
|
||||
try:
|
||||
response_data = json.loads(
|
||||
re.search(r"\{.*\}", response, re.DOTALL).group(0)
|
||||
)
|
||||
|
||||
description = response_data.get("detailed_description", "")
|
||||
entity_data = response_data.get("entity_info", {})
|
||||
|
||||
if not description or not entity_data:
|
||||
raise ValueError("Missing required fields in response")
|
||||
|
||||
if not all(
|
||||
key in entity_data for key in ["entity_name", "entity_type", "summary"]
|
||||
):
|
||||
raise ValueError("Missing required fields in entity_info")
|
||||
|
||||
entity_data["entity_name"] = (
|
||||
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
||||
)
|
||||
if entity_name:
|
||||
entity_data["entity_name"] = entity_name
|
||||
|
||||
return description, entity_data
|
||||
|
||||
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
||||
logger.error(f"Error parsing table analysis response: {e}")
|
||||
fallback_entity = {
|
||||
"entity_name": entity_name
|
||||
if entity_name
|
||||
else f"table_{compute_mdhash_id(response)}",
|
||||
"entity_type": "table",
|
||||
"summary": response[:100] + "..." if len(response) > 100 else response,
|
||||
}
|
||||
return response, fallback_entity
|
||||
|
||||
|
||||
class EquationModalProcessor(BaseModalProcessor):
|
||||
"""Processor specialized for equation content"""
|
||||
|
||||
async def process_multimodal_content(
|
||||
self,
|
||||
modal_content,
|
||||
content_type: str,
|
||||
file_path: str = "manual_creation",
|
||||
entity_name: str = None,
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Process equation content"""
|
||||
# Parse equation content
|
||||
if isinstance(modal_content, str):
|
||||
try:
|
||||
content_data = json.loads(modal_content)
|
||||
except json.JSONDecodeError:
|
||||
content_data = {"text": modal_content}
|
||||
else:
|
||||
content_data = modal_content
|
||||
|
||||
equation_text = content_data.get("text")
|
||||
equation_format = content_data.get("text_format", "")
|
||||
|
||||
# Build equation analysis prompt
|
||||
equation_prompt = f"""Please analyze this mathematical equation and provide a JSON response with the following structure:
|
||||
|
||||
{{
|
||||
"detailed_description": "A comprehensive analysis of the equation including:
|
||||
- Mathematical meaning and interpretation
|
||||
- Variables and their definitions
|
||||
- Mathematical operations and functions used
|
||||
- Application domain and context
|
||||
- Physical or theoretical significance
|
||||
- Relationship to other mathematical concepts
|
||||
- Practical applications or use cases
|
||||
Always use specific mathematical terminology.",
|
||||
"entity_info": {{
|
||||
"entity_name": "{entity_name if entity_name else 'descriptive name for this equation'}",
|
||||
"entity_type": "equation",
|
||||
"summary": "concise summary of the equation's purpose and significance (max 100 words)"
|
||||
}}
|
||||
}}
|
||||
|
||||
Equation Information:
|
||||
Equation: {equation_text}
|
||||
Format: {equation_format}
|
||||
|
||||
Focus on providing mathematical insights and explaining the equation's significance."""
|
||||
|
||||
response = await self.modal_caption_func(
|
||||
equation_prompt,
|
||||
system_prompt="You are an expert mathematician. Provide detailed mathematical analysis.",
|
||||
)
|
||||
|
||||
# Parse response
|
||||
enhanced_caption, entity_info = self._parse_equation_response(
|
||||
response, entity_name
|
||||
)
|
||||
|
||||
# Build complete equation content
|
||||
modal_chunk = f"""Mathematical Equation Analysis:
|
||||
Equation: {equation_text}
|
||||
Format: {equation_format}
|
||||
|
||||
Mathematical Analysis: {enhanced_caption}"""
|
||||
|
||||
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
|
||||
|
||||
def _parse_equation_response(
|
||||
self, response: str, entity_name: str = None
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Parse equation analysis response"""
|
||||
try:
|
||||
response_data = json.loads(
|
||||
re.search(r"\{.*\}", response, re.DOTALL).group(0)
|
||||
)
|
||||
|
||||
description = response_data.get("detailed_description", "")
|
||||
entity_data = response_data.get("entity_info", {})
|
||||
|
||||
if not description or not entity_data:
|
||||
raise ValueError("Missing required fields in response")
|
||||
|
||||
if not all(
|
||||
key in entity_data for key in ["entity_name", "entity_type", "summary"]
|
||||
):
|
||||
raise ValueError("Missing required fields in entity_info")
|
||||
|
||||
entity_data["entity_name"] = (
|
||||
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
||||
)
|
||||
if entity_name:
|
||||
entity_data["entity_name"] = entity_name
|
||||
|
||||
return description, entity_data
|
||||
|
||||
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
||||
logger.error(f"Error parsing equation analysis response: {e}")
|
||||
fallback_entity = {
|
||||
"entity_name": entity_name
|
||||
if entity_name
|
||||
else f"equation_{compute_mdhash_id(response)}",
|
||||
"entity_type": "equation",
|
||||
"summary": response[:100] + "..." if len(response) > 100 else response,
|
||||
}
|
||||
return response, fallback_entity
|
||||
|
||||
|
||||
class GenericModalProcessor(BaseModalProcessor):
|
||||
"""Generic processor for other types of modal content"""
|
||||
|
||||
async def process_multimodal_content(
|
||||
self,
|
||||
modal_content,
|
||||
content_type: str,
|
||||
file_path: str = "manual_creation",
|
||||
entity_name: str = None,
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Process generic modal content"""
|
||||
# Build generic analysis prompt
|
||||
generic_prompt = f"""Please analyze this {content_type} content and provide a JSON response with the following structure:
|
||||
|
||||
{{
|
||||
"detailed_description": "A comprehensive analysis of the content including:
|
||||
- Content structure and organization
|
||||
- Key information and elements
|
||||
- Relationships between components
|
||||
- Context and significance
|
||||
- Relevant details for knowledge retrieval
|
||||
Always use specific terminology appropriate for {content_type} content.",
|
||||
"entity_info": {{
|
||||
"entity_name": "{entity_name if entity_name else f'descriptive name for this {content_type}'}",
|
||||
"entity_type": "{content_type}",
|
||||
"summary": "concise summary of the content's purpose and key points (max 100 words)"
|
||||
}}
|
||||
}}
|
||||
|
||||
Content: {str(modal_content)}
|
||||
|
||||
Focus on extracting meaningful information that would be useful for knowledge retrieval."""
|
||||
|
||||
response = await self.modal_caption_func(
|
||||
generic_prompt,
|
||||
system_prompt=f"You are an expert content analyst specializing in {content_type} content.",
|
||||
)
|
||||
|
||||
# Parse response
|
||||
enhanced_caption, entity_info = self._parse_generic_response(
|
||||
response, entity_name, content_type
|
||||
)
|
||||
|
||||
# Build complete content
|
||||
modal_chunk = f"""{content_type.title()} Content Analysis:
|
||||
Content: {str(modal_content)}
|
||||
|
||||
Analysis: {enhanced_caption}"""
|
||||
|
||||
return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
|
||||
|
||||
def _parse_generic_response(
|
||||
self, response: str, entity_name: str = None, content_type: str = "content"
|
||||
) -> Tuple[str, Dict[str, Any]]:
|
||||
"""Parse generic analysis response"""
|
||||
try:
|
||||
response_data = json.loads(
|
||||
re.search(r"\{.*\}", response, re.DOTALL).group(0)
|
||||
)
|
||||
|
||||
description = response_data.get("detailed_description", "")
|
||||
entity_data = response_data.get("entity_info", {})
|
||||
|
||||
if not description or not entity_data:
|
||||
raise ValueError("Missing required fields in response")
|
||||
|
||||
if not all(
|
||||
key in entity_data for key in ["entity_name", "entity_type", "summary"]
|
||||
):
|
||||
raise ValueError("Missing required fields in entity_info")
|
||||
|
||||
entity_data["entity_name"] = (
|
||||
entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
||||
)
|
||||
if entity_name:
|
||||
entity_data["entity_name"] = entity_name
|
||||
|
||||
return description, entity_data
|
||||
|
||||
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
||||
logger.error(f"Error parsing generic analysis response: {e}")
|
||||
fallback_entity = {
|
||||
"entity_name": entity_name
|
||||
if entity_name
|
||||
else f"{content_type}_{compute_mdhash_id(response)}",
|
||||
"entity_type": content_type,
|
||||
"summary": response[:100] + "..." if len(response) > 100 else response,
|
||||
}
|
||||
return response, fallback_entity
|
||||
|
|
@ -240,6 +240,466 @@ async def _handle_single_relationship_extraction(
|
|||
)
|
||||
|
||||
|
||||
async def _rebuild_knowledge_from_chunks(
    entities_to_rebuild: dict[str, set[str]],
    relationships_to_rebuild: dict[tuple[str, str], set[str]],
    knowledge_graph_inst: BaseGraphStorage,
    entities_vdb: BaseVectorStorage,
    relationships_vdb: BaseVectorStorage,
    text_chunks: BaseKVStorage,
    llm_response_cache: BaseKVStorage,
    global_config: dict[str, str],
) -> None:
    """Rebuild entity and relationship descriptions from cached extraction results

    This function reuses cached LLM extraction results rather than running
    extraction again, following the same approach as the insert process.

    Args:
        entities_to_rebuild: Dict mapping entity_name -> set of remaining chunk_ids
        relationships_to_rebuild: Dict mapping (src, tgt) -> set of remaining chunk_ids
        knowledge_graph_inst: Graph storage holding the entities and relationships
        entities_vdb: Vector storage for entity embeddings
        relationships_vdb: Vector storage for relationship embeddings
        text_chunks: Chunk storage, used to look up each chunk's file_path
        llm_response_cache: Cache holding the earlier LLM extraction results
        global_config: Global configuration passed through to the summary step
    """
    if not entities_to_rebuild and not relationships_to_rebuild:
        return

    # Get all referenced chunk IDs
    all_referenced_chunk_ids = set()
    for chunk_ids in entities_to_rebuild.values():
        all_referenced_chunk_ids.update(chunk_ids)
    for chunk_ids in relationships_to_rebuild.values():
        all_referenced_chunk_ids.update(chunk_ids)

    logger.info(
        f"Rebuilding knowledge from {len(all_referenced_chunk_ids)} cached chunk extractions"
    )

    # Get cached extraction results for these chunks
    cached_results = await _get_cached_extraction_results(
        llm_response_cache, all_referenced_chunk_ids
    )

    if not cached_results:
        logger.warning("No cached extraction results found, cannot rebuild")
        return

    # Process cached results to get entities and relationships for each chunk
    chunk_entities = {}  # chunk_id -> {entity_name: [entity_data]}
    chunk_relationships = {}  # chunk_id -> {(src, tgt): [relationship_data]}

    for chunk_id, extraction_result in cached_results.items():
        try:
            entities, relationships = await _parse_extraction_result(
                text_chunks=text_chunks,
                extraction_result=extraction_result,
                chunk_id=chunk_id,
            )
            chunk_entities[chunk_id] = entities
            chunk_relationships[chunk_id] = relationships
        except Exception as e:
            logger.error(
                f"Failed to parse cached extraction result for chunk {chunk_id}: {e}"
            )
            continue

    # Rebuild entities
    for entity_name, chunk_ids in entities_to_rebuild.items():
        try:
            await _rebuild_single_entity(
                knowledge_graph_inst=knowledge_graph_inst,
                entities_vdb=entities_vdb,
                entity_name=entity_name,
                chunk_ids=chunk_ids,
                chunk_entities=chunk_entities,
                llm_response_cache=llm_response_cache,
                global_config=global_config,
            )
            logger.debug(
                f"Rebuilt entity {entity_name} from {len(chunk_ids)} cached extractions"
            )
        except Exception as e:
            logger.error(f"Failed to rebuild entity {entity_name}: {e}")

    # Rebuild relationships
    for (src, tgt), chunk_ids in relationships_to_rebuild.items():
        try:
            await _rebuild_single_relationship(
                knowledge_graph_inst=knowledge_graph_inst,
                relationships_vdb=relationships_vdb,
                src=src,
                tgt=tgt,
                chunk_ids=chunk_ids,
                chunk_relationships=chunk_relationships,
                llm_response_cache=llm_response_cache,
                global_config=global_config,
            )
            logger.debug(
                f"Rebuilt relationship {src}-{tgt} from {len(chunk_ids)} cached extractions"
            )
        except Exception as e:
            logger.error(f"Failed to rebuild relationship {src}-{tgt}: {e}")

    logger.info("Completed rebuilding knowledge from cached extractions")
|
||||
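The first step of `_rebuild_knowledge_from_chunks` is to gather every chunk ID still referenced by an affected entity or relationship, so the cache is scanned once. The sketch below shows that step alone, using the input shapes given in the docstring; `collect_chunk_ids` is a hypothetical name and the sample IDs are invented for illustration.

```python
from typing import Dict, Set, Tuple


def collect_chunk_ids(
    entities_to_rebuild: Dict[str, Set[str]],
    relationships_to_rebuild: Dict[Tuple[str, str], Set[str]],
) -> Set[str]:
    """Union of all chunk IDs referenced by the entities and relationships."""
    referenced: Set[str] = set()
    for chunk_ids in entities_to_rebuild.values():
        referenced.update(chunk_ids)
    for chunk_ids in relationships_to_rebuild.values():
        referenced.update(chunk_ids)
    return referenced


# Invented sample data: "Google" still appears in two chunks after a deletion.
entities = {"Google": {"chunk-a", "chunk-b"}}
relationships = {("Google", "Gmail"): {"chunk-b", "chunk-c"}}
needed = collect_chunk_ids(entities, relationships)
```

Here `needed` contains each of the three chunk IDs once, even though `chunk-b` is referenced twice.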
|
||||
|
||||
async def _get_cached_extraction_results(
    llm_response_cache: BaseKVStorage, chunk_ids: set[str]
) -> dict[str, str]:
    """Get cached extraction results for specific chunk IDs

    Args:
        llm_response_cache: LLM response cache storage to read from
        chunk_ids: Set of chunk IDs to get cached results for

    Returns:
        Dict mapping chunk_id -> extraction_result_text
    """
    cached_results = {}

    # Get all cached data for "default" mode (entity extraction cache)
    default_cache = await llm_response_cache.get_by_id("default") or {}

    for cache_key, cache_entry in default_cache.items():
        if (
            isinstance(cache_entry, dict)
            and cache_entry.get("cache_type") == "extract"
            and cache_entry.get("chunk_id") in chunk_ids
        ):
            chunk_id = cache_entry["chunk_id"]
            extraction_result = cache_entry["return"]
            cached_results[chunk_id] = extraction_result

    logger.info(
        f"Found {len(cached_results)} cached extraction results for {len(chunk_ids)} chunk IDs"
    )
    return cached_results
|
||||
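The filter in `_get_cached_extraction_results` keeps only cache entries that are dicts, are tagged `cache_type == "extract"`, and belong to a requested chunk. A synchronous sketch of just that filter is below; `select_extraction_results` is a hypothetical name, and the entry layout is taken from the fields the function reads.

```python
from typing import Any, Dict, Set


def select_extraction_results(
    default_cache: Dict[str, Any], chunk_ids: Set[str]
) -> Dict[str, str]:
    """Map chunk_id -> cached extraction text for the requested chunks."""
    results: Dict[str, str] = {}
    for entry in default_cache.values():
        if (
            isinstance(entry, dict)
            and entry.get("cache_type") == "extract"
            and entry.get("chunk_id") in chunk_ids
        ):
            # A later entry for the same chunk replaces an earlier one.
            results[entry["chunk_id"]] = entry["return"]
    return results
```

One consequence worth noting: the `extract_entities` hunks in this diff tag both the initial extraction and the gleaning pass with the same `chunk_id`, so with this one-value-per-chunk mapping only one of those cached results survives per chunk.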
|
||||
|
||||
async def _parse_extraction_result(
    text_chunks: BaseKVStorage, extraction_result: str, chunk_id: str
) -> tuple[dict, dict]:
    """Parse cached extraction result using the same logic as extract_entities

    Args:
        text_chunks: Chunk storage, used to look up the chunk's file_path
        extraction_result: The cached LLM extraction result
        chunk_id: The chunk ID for source tracking

    Returns:
        Tuple of (entities_dict, relationships_dict)
    """

    # Get chunk data for file_path
    chunk_data = await text_chunks.get_by_id(chunk_id)
    file_path = (
        chunk_data.get("file_path", "unknown_source")
        if chunk_data
        else "unknown_source"
    )
    context_base = dict(
        tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"],
        record_delimiter=PROMPTS["DEFAULT_RECORD_DELIMITER"],
        completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"],
    )
    maybe_nodes = defaultdict(list)
    maybe_edges = defaultdict(list)

    # Parse the extraction result using the same logic as in extract_entities
    records = split_string_by_multi_markers(
        extraction_result,
        [context_base["record_delimiter"], context_base["completion_delimiter"]],
    )
    for record in records:
        record = re.search(r"\((.*)\)", record)
        if record is None:
            continue
        record = record.group(1)
        record_attributes = split_string_by_multi_markers(
            record, [context_base["tuple_delimiter"]]
        )

        # Try to parse as entity
        entity_data = await _handle_single_entity_extraction(
            record_attributes, chunk_id, file_path
        )
        if entity_data is not None:
            maybe_nodes[entity_data["entity_name"]].append(entity_data)
            continue

        # Try to parse as relationship
        relationship_data = await _handle_single_relationship_extraction(
            record_attributes, chunk_id, file_path
        )
        if relationship_data is not None:
            maybe_edges[
                (relationship_data["src_id"], relationship_data["tgt_id"])
            ].append(relationship_data)

    return dict(maybe_nodes), dict(maybe_edges)
|
||||
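The record-splitting part of `_parse_extraction_result` can be illustrated without the storage and handler dependencies. In the sketch below, `split_by_markers` is a simplified stand-in for `split_string_by_multi_markers`, and the three delimiter values are assumptions made for the example; the real values come from `PROMPTS` and are not shown in this diff.

```python
import re
from typing import List

# Assumed delimiter values, for illustration only.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def split_by_markers(content: str, markers: List[str]) -> List[str]:
    """Split on any of the markers, dropping empty pieces (stand-in helper)."""
    if not markers:
        return [content]
    pattern = "|".join(re.escape(marker) for marker in markers)
    return [part.strip() for part in re.split(pattern, content) if part.strip()]


def parse_records(extraction_result: str) -> List[List[str]]:
    """Return the attribute list of every parenthesized record in the text."""
    rows: List[List[str]] = []
    for record in split_by_markers(
        extraction_result, [RECORD_DELIMITER, COMPLETION_DELIMITER]
    ):
        match = re.search(r"\((.*)\)", record)
        if match is None:
            continue  # skip text that is not a "(...)" record
        rows.append(split_by_markers(match.group(1), [TUPLE_DELIMITER]))
    return rows
```

In the function above, each attribute list is then offered first to the entity handler and, if that returns `None`, to the relationship handler.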
|
||||
|
||||
async def _rebuild_single_entity(
|
||||
knowledge_graph_inst: BaseGraphStorage,
|
||||
entities_vdb: BaseVectorStorage,
|
||||
entity_name: str,
|
||||
chunk_ids: set[str],
|
||||
chunk_entities: dict,
|
||||
llm_response_cache: BaseKVStorage,
|
||||
global_config: dict[str, str],
|
||||
) -> None:
|
||||
"""Rebuild a single entity from cached extraction results"""
|
||||
|
||||
# Get current entity data
|
||||
current_entity = await knowledge_graph_inst.get_node(entity_name)
|
||||
if not current_entity:
|
||||
return
|
||||
|
||||
# Helper function to update entity in both graph and vector storage
|
||||
async def _update_entity_storage(
|
||||
final_description: str, entity_type: str, file_paths: set[str]
|
||||
):
|
||||
# Update entity in graph storage
|
||||
updated_entity_data = {
|
||||
**current_entity,
|
||||
"description": final_description,
|
||||
"entity_type": entity_type,
|
||||
"source_id": GRAPH_FIELD_SEP.join(chunk_ids),
|
||||
"file_path": GRAPH_FIELD_SEP.join(file_paths)
|
||||
if file_paths
|
||||
else current_entity.get("file_path", "unknown_source"),
|
||||
}
|
||||
await knowledge_graph_inst.upsert_node(entity_name, updated_entity_data)
|
||||
|
||||
# Update entity in vector database
|
||||
entity_vdb_id = compute_mdhash_id(entity_name, prefix="ent-")
|
||||
|
||||
# Delete old vector record first
|
||||
try:
|
||||
await entities_vdb.delete([entity_vdb_id])
|
||||
except Exception as e:
|
||||
logger.debug(
|
||||
f"Could not delete old entity vector record {entity_vdb_id}: {e}"
|
||||
)
|
||||
|
||||
# Insert new vector record
|
||||
entity_content = f"{entity_name}\n{final_description}"
|
||||
await entities_vdb.upsert(
|
||||
{
|
||||
entity_vdb_id: {
|
||||
"content": entity_content,
|
||||
"entity_name": entity_name,
|
||||
"source_id": updated_entity_data["source_id"],
|
||||
"description": final_description,
|
||||
"entity_type": entity_type,
|
||||
"file_path": updated_entity_data["file_path"],
|
||||
}
|
||||
}
|
||||
)
|
||||
|
||||
# Helper function to generate final description with optional LLM summary
|
||||
async def _generate_final_description(combined_description: str) -> str:
|
||||
if len(combined_description) > global_config["summary_to_max_tokens"]:
|
||||
return await _handle_entity_relation_summary(
|
||||
entity_name,
|
||||
combined_description,
|
||||
global_config,
|
||||
llm_response_cache=llm_response_cache,
|
||||
)
|
||||
else:
|
||||
return combined_description
|
||||
|
||||
# Collect all entity data from relevant chunks
|
||||
all_entity_data = []
|
||||
for chunk_id in chunk_ids:
|
||||
if chunk_id in chunk_entities and entity_name in chunk_entities[chunk_id]:
|
||||
all_entity_data.extend(chunk_entities[chunk_id][entity_name])
|
||||
|
||||
if not all_entity_data:
|
||||
logger.warning(
|
||||
f"No cached entity data found for {entity_name}, trying to rebuild from relationships"
|
||||
)
|
||||
|
||||
# Get all edges connected to this entity
|
||||
edges = await knowledge_graph_inst.get_node_edges(entity_name)
|
||||
if not edges:
|
||||
logger.warning(f"No relationships found for entity {entity_name}")
|
||||
return
|
||||
|
||||
# Collect relationship data to extract entity information
|
||||
relationship_descriptions = []
|
||||
file_paths = set()
|
||||
|
||||
# Get edge data for all connected relationships
|
||||
for src_id, tgt_id in edges:
|
||||
edge_data = await knowledge_graph_inst.get_edge(src_id, tgt_id)
|
||||
if edge_data:
|
||||
if edge_data.get("description"):
|
||||
relationship_descriptions.append(edge_data["description"])
|
||||
|
||||
if edge_data.get("file_path"):
|
||||
edge_file_paths = edge_data["file_path"].split(GRAPH_FIELD_SEP)
|
||||
file_paths.update(edge_file_paths)
|
||||
|
||||
# Generate description from relationships or fallback to current
|
||||
if relationship_descriptions:
|
||||
combined_description = GRAPH_FIELD_SEP.join(relationship_descriptions)
|
||||
final_description = await _generate_final_description(combined_description)
|
||||
else:
|
||||
final_description = current_entity.get("description", "")
|
||||
|
||||
entity_type = current_entity.get("entity_type", "UNKNOWN")
|
||||
await _update_entity_storage(final_description, entity_type, file_paths)
|
||||
return
|
||||
|
||||
# Process cached entity data
|
||||
descriptions = []
|
||||
entity_types = []
|
||||
file_paths = set()
|
||||
|
||||
for entity_data in all_entity_data:
|
||||
if entity_data.get("description"):
|
||||
descriptions.append(entity_data["description"])
|
||||
if entity_data.get("entity_type"):
|
||||
entity_types.append(entity_data["entity_type"])
|
||||
if entity_data.get("file_path"):
|
||||
file_paths.add(entity_data["file_path"])
|
||||
|
||||
# Combine all descriptions
|
||||
combined_description = (
|
||||
GRAPH_FIELD_SEP.join(descriptions)
|
||||
if descriptions
|
||||
else current_entity.get("description", "")
|
||||
)
|
||||
|
||||
# Get most common entity type
|
||||
entity_type = (
|
||||
max(set(entity_types), key=entity_types.count)
|
||||
if entity_types
|
||||
else current_entity.get("entity_type", "UNKNOWN")
|
||||
)
|
||||
|
||||
# Generate final description and update storage
|
||||
final_description = await _generate_final_description(combined_description)
|
||||
await _update_entity_storage(final_description, entity_type, file_paths)
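The merge step at the end of `_rebuild_single_entity` joins the surviving descriptions and picks the most frequent entity type. The sketch below isolates that logic. `merge_entity_records` is a hypothetical name, `FIELD_SEP` is an assumed stand-in for `GRAPH_FIELD_SEP`, and `collections.Counter` is swapped in for the `max(set(...), key=list.count)` idiom used above: it counts in one pass and, on Python 3.7+, breaks ties by first occurrence rather than by set iteration order.

```python
from collections import Counter
from typing import Dict, List, Tuple

FIELD_SEP = "<SEP>"  # assumed stand-in for GRAPH_FIELD_SEP


def merge_entity_records(
    records: List[Dict[str, str]], current: Dict[str, str]
) -> Tuple[str, str]:
    """Return (combined_description, entity_type) for one entity."""
    descriptions = [r["description"] for r in records if r.get("description")]
    entity_types = [r["entity_type"] for r in records if r.get("entity_type")]

    # Fall back to the values already stored in the graph when nothing was cached.
    combined = (
        FIELD_SEP.join(descriptions)
        if descriptions
        else current.get("description", "")
    )
    if entity_types:
        entity_type = Counter(entity_types).most_common(1)[0][0]
    else:
        entity_type = current.get("entity_type", "UNKNOWN")
    return combined, entity_type
```

The combined description would then go through the length check in `_generate_final_description` before being written back.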
|
||||
|
||||
|
||||
async def _rebuild_single_relationship(
|
||||
knowledge_graph_inst: BaseGraphStorage,
|
||||
relationships_vdb: BaseVectorStorage,
|
||||
src: str,
|
||||
tgt: str,
|
||||
chunk_ids: set[str],
|
||||
chunk_relationships: dict,
|
||||
llm_response_cache: BaseKVStorage,
|
||||
global_config: dict[str, str],
|
||||
) -> None:
|
||||
"""Rebuild a single relationship from cached extraction results"""
|
||||
|
||||
# Get current relationship data
|
||||
current_relationship = await knowledge_graph_inst.get_edge(src, tgt)
|
||||
if not current_relationship:
|
||||
return
|
||||
|
||||
# Collect all relationship data from relevant chunks
|
||||
all_relationship_data = []
|
||||
for chunk_id in chunk_ids:
|
||||
if chunk_id in chunk_relationships:
|
||||
# Check both (src, tgt) and (tgt, src) since relationships can be bidirectional
|
||||
for edge_key in [(src, tgt), (tgt, src)]:
|
||||
if edge_key in chunk_relationships[chunk_id]:
|
||||
all_relationship_data.extend(
|
||||
chunk_relationships[chunk_id][edge_key]
|
||||
)
|
||||
|
||||
if not all_relationship_data:
|
||||
logger.warning(f"No cached relationship data found for {src}-{tgt}")
|
||||
return
|
||||
|
||||
# Merge descriptions and keywords
|
||||
descriptions = []
|
||||
keywords = []
|
||||
weights = []
|
||||
file_paths = set()
|
||||
|
||||
for rel_data in all_relationship_data:
|
||||
if rel_data.get("description"):
|
||||
descriptions.append(rel_data["description"])
|
||||
if rel_data.get("keywords"):
|
||||
keywords.append(rel_data["keywords"])
|
||||
if rel_data.get("weight"):
|
||||
weights.append(rel_data["weight"])
|
||||
if rel_data.get("file_path"):
|
||||
file_paths.add(rel_data["file_path"])
|
||||
|
||||
# Combine descriptions and keywords
|
||||
combined_description = (
|
||||
GRAPH_FIELD_SEP.join(descriptions)
|
||||
if descriptions
|
||||
else current_relationship.get("description", "")
|
||||
)
|
||||
combined_keywords = (
|
||||
", ".join(set(keywords))
|
||||
if keywords
|
||||
else current_relationship.get("keywords", "")
|
||||
)
|
||||
# weight = (
|
||||
# sum(weights) / len(weights)
|
||||
# if weights
|
||||
# else current_relationship.get("weight", 1.0)
|
||||
# )
|
||||
weight = sum(weights) if weights else current_relationship.get("weight", 1.0)
|
||||
|
||||
# Use summary if description is too long
|
||||
if len(combined_description) > global_config["summary_to_max_tokens"]:
|
||||
final_description = await _handle_entity_relation_summary(
|
||||
f"{src}-{tgt}",
|
||||
combined_description,
|
||||
global_config,
|
||||
llm_response_cache=llm_response_cache,
|
||||
)
|
||||
else:
|
||||
final_description = combined_description
|
||||
|
||||
# Update relationship in graph storage
|
||||
updated_relationship_data = {
|
||||
**current_relationship,
|
||||
"description": final_description,
|
||||
"keywords": combined_keywords,
|
||||
"weight": weight,
|
||||
"source_id": GRAPH_FIELD_SEP.join(chunk_ids),
|
||||
"file_path": GRAPH_FIELD_SEP.join(file_paths)
|
||||
if file_paths
|
||||
else current_relationship.get("file_path", "unknown_source"),
|
||||
}
|
||||
await knowledge_graph_inst.upsert_edge(src, tgt, updated_relationship_data)
|
||||
|
||||
# Update relationship in vector database
|
||||
rel_vdb_id = compute_mdhash_id(src + tgt, prefix="rel-")
|
||||
rel_vdb_id_reverse = compute_mdhash_id(tgt + src, prefix="rel-")
|
||||
|
||||
# Delete old vector records first (both directions to be safe)
|
||||
try:
|
||||
await relationships_vdb.delete([rel_vdb_id, rel_vdb_id_reverse])
|
||||
except Exception as e:
|
||||
logger.debug(
|
||||
f"Could not delete old relationship vector records {rel_vdb_id}, {rel_vdb_id_reverse}: {e}"
|
||||
)
|
||||
|
||||
# Insert new vector record
|
||||
rel_content = f"{combined_keywords}\t{src}\n{tgt}\n{final_description}"
|
||||
await relationships_vdb.upsert(
|
||||
{
|
||||
rel_vdb_id: {
|
||||
"src_id": src,
|
||||
"tgt_id": tgt,
|
||||
"source_id": updated_relationship_data["source_id"],
|
||||
"content": rel_content,
|
||||
"keywords": combined_keywords,
|
||||
"description": final_description,
|
||||
"weight": weight,
|
||||
"file_path": updated_relationship_data["file_path"],
|
||||
}
|
||||
}
|
||||
)
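For relationships, the rebuild deduplicates keywords and sums the weights of the remaining records (the averaging variant is commented out above). A minimal sketch of that merge follows; `merge_relationship_fields` is a hypothetical name, and `dict.fromkeys` replaces `set(...)` as a deliberate variation so the keyword order is deterministic.

```python
from typing import Any, Dict, List, Tuple


def merge_relationship_fields(
    records: List[Dict[str, Any]], current: Dict[str, Any]
) -> Tuple[str, float]:
    """Return (combined_keywords, weight) for one relationship."""
    keywords = [r["keywords"] for r in records if r.get("keywords")]
    weights = [r["weight"] for r in records if r.get("weight")]

    # dict.fromkeys removes duplicates while keeping first-seen order.
    combined_keywords = (
        ", ".join(dict.fromkeys(keywords))
        if keywords
        else current.get("keywords", "")
    )
    # Weights accumulate across chunks; fall back to the stored weight.
    weight = sum(weights) if weights else current.get("weight", 1.0)
    return combined_keywords, weight
```

Summing means an edge mentioned in many chunks ends up heavier than one mentioned once, which is presumably why the average was dropped; that reading is an inference from the code, not something the diff states.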
|
||||
|
||||
|
||||
async def _merge_nodes_then_upsert(
|
||||
entity_name: str,
|
||||
nodes_data: list[dict],
|
||||
|
|
@ -757,6 +1217,7 @@ async def extract_entities(
|
|||
use_llm_func,
|
||||
llm_response_cache=llm_response_cache,
|
||||
cache_type="extract",
|
||||
chunk_id=chunk_key,
|
||||
)
|
||||
history = pack_user_ass_to_openai_messages(hint_prompt, final_result)
|
||||
|
||||
|
|
@ -773,6 +1234,7 @@ async def extract_entities(
|
|||
llm_response_cache=llm_response_cache,
|
||||
history_messages=history,
|
||||
cache_type="extract",
|
||||
chunk_id=chunk_key,
|
||||
)
|
||||
|
||||
history += pack_user_ass_to_openai_messages(continue_prompt, glean_result)
|
||||
|
|
|
|||
|
|
@ -1,686 +0,0 @@
|
|||
"""
|
||||
Complete MinerU parsing + multimodal content insertion Pipeline
|
||||
|
||||
This script integrates:
|
||||
1. MinerU document parsing
|
||||
2. Pure text content LightRAG insertion
|
||||
3. Specialized processing for multimodal content (using different processors)
|
||||
"""
|
||||
|
||||
import os
|
||||
import asyncio
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any, Tuple, Optional, Callable
|
||||
import sys
|
||||
|
||||
# Add project root directory to Python path
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||
|
||||
from lightrag import LightRAG, QueryParam
|
||||
from lightrag.utils import EmbeddingFunc, setup_logger
|
||||
|
||||
# Import parser and multimodal processors
|
||||
from lightrag.mineru_parser import MineruParser
|
||||
|
||||
# Import specialized processors
|
||||
from lightrag.modalprocessors import (
|
||||
ImageModalProcessor,
|
||||
TableModalProcessor,
|
||||
EquationModalProcessor,
|
||||
GenericModalProcessor,
|
||||
)
|
||||
|
||||
|
||||
class RAGAnything:
|
||||
"""Multimodal Document Processing Pipeline - Complete document parsing and insertion pipeline"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
lightrag: Optional[LightRAG] = None,
|
||||
llm_model_func: Optional[Callable] = None,
|
||||
vision_model_func: Optional[Callable] = None,
|
||||
embedding_func: Optional[Callable] = None,
|
||||
working_dir: str = "./rag_storage",
|
||||
embedding_dim: int = 3072,
|
||||
max_token_size: int = 8192,
|
||||
):
|
||||
"""
|
||||
Initialize Multimodal Document Processing Pipeline
|
||||
|
||||
Args:
|
||||
lightrag: Optional pre-initialized LightRAG instance
|
||||
llm_model_func: LLM model function for text analysis
|
||||
vision_model_func: Vision model function for image analysis
|
||||
embedding_func: Embedding function for text vectorization
|
||||
working_dir: Working directory for storage (used when creating new RAG)
|
||||
embedding_dim: Embedding dimension (used when creating new RAG)
|
||||
max_token_size: Maximum token size for embeddings (used when creating new RAG)
|
||||
"""
|
||||
self.working_dir = working_dir
|
||||
self.llm_model_func = llm_model_func
|
||||
self.vision_model_func = vision_model_func
|
||||
self.embedding_func = embedding_func
|
||||
self.embedding_dim = embedding_dim
|
||||
self.max_token_size = max_token_size
|
||||
|
||||
# Set up logging
|
||||
setup_logger("RAGAnything")
|
||||
self.logger = logging.getLogger("RAGAnything")
|
||||
|
||||
# Create working directory if needed
|
||||
if not os.path.exists(working_dir):
|
||||
os.makedirs(working_dir)
|
||||
|
||||
# Use provided LightRAG or mark for later initialization
|
||||
self.lightrag = lightrag
|
||||
self.modal_processors = {}
|
||||
|
||||
# If LightRAG is provided, initialize processors immediately
|
||||
if self.lightrag is not None:
|
||||
self._initialize_processors()
|
||||
|
||||
def _initialize_processors(self):
|
||||
"""Initialize multimodal processors with appropriate model functions"""
|
||||
if self.lightrag is None:
|
||||
raise ValueError(
|
||||
"LightRAG instance must be initialized before creating processors"
|
||||
)
|
||||
|
||||
# Create different multimodal processors
|
||||
self.modal_processors = {
|
||||
"image": ImageModalProcessor(
|
||||
lightrag=self.lightrag,
|
||||
modal_caption_func=self.vision_model_func or self.llm_model_func,
|
||||
),
|
||||
"table": TableModalProcessor(
|
||||
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
|
||||
),
|
||||
"equation": EquationModalProcessor(
|
||||
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
|
||||
),
|
||||
"generic": GenericModalProcessor(
|
||||
lightrag=self.lightrag, modal_caption_func=self.llm_model_func
|
||||
),
|
||||
}
|
||||
|
||||
self.logger.info("Multimodal processors initialized")
|
||||
self.logger.info(f"Available processors: {list(self.modal_processors.keys())}")
|
||||
|
||||
async def _ensure_lightrag_initialized(self):
|
||||
"""Ensure LightRAG instance is initialized, create if necessary"""
|
||||
if self.lightrag is not None:
|
||||
return
|
||||
|
||||
# Validate required functions
|
||||
if self.llm_model_func is None:
|
||||
raise ValueError(
|
||||
"llm_model_func must be provided when LightRAG is not pre-initialized"
|
||||
)
|
||||
if self.embedding_func is None:
|
||||
raise ValueError(
|
||||
"embedding_func must be provided when LightRAG is not pre-initialized"
|
||||
)
|
||||
|
||||
from lightrag.kg.shared_storage import initialize_pipeline_status
|
||||
|
||||
# Create LightRAG instance with provided functions
|
||||
self.lightrag = LightRAG(
|
||||
working_dir=self.working_dir,
|
||||
llm_model_func=self.llm_model_func,
|
||||
embedding_func=EmbeddingFunc(
|
||||
embedding_dim=self.embedding_dim,
|
||||
max_token_size=self.max_token_size,
|
||||
func=self.embedding_func,
|
||||
),
|
||||
)
|
||||
|
||||
await self.lightrag.initialize_storages()
|
||||
await initialize_pipeline_status()
|
||||
|
||||
# Initialize processors after LightRAG is ready
|
||||
self._initialize_processors()
|
||||
|
||||
self.logger.info("LightRAG and multimodal processors initialized")
|
||||
|
||||
def parse_document(
|
||||
self,
|
||||
file_path: str,
|
||||
output_dir: str = "./output",
|
||||
parse_method: str = "auto",
|
||||
display_stats: bool = True,
|
||||
) -> Tuple[List[Dict[str, Any]], str]:
|
||||
"""
|
||||
Parse document using MinerU
|
||||
|
||||
Args:
|
||||
file_path: Path to the file to parse
|
||||
output_dir: Output directory
|
||||
parse_method: Parse method ("auto", "ocr", "txt")
|
||||
display_stats: Whether to display content statistics
|
||||
|
||||
Returns:
|
||||
(content_list, md_content): Content list and markdown text
|
||||
"""
|
||||
self.logger.info(f"Starting document parsing: {file_path}")
|
||||
|
||||
file_path = Path(file_path)
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"File not found: {file_path}")
|
||||
|
||||
# Choose appropriate parsing method based on file extension
|
||||
ext = file_path.suffix.lower()
|
||||
|
||||
try:
|
||||
if ext in [".pdf"]:
|
||||
self.logger.info(
|
||||
f"Detected PDF file, using PDF parser (OCR={parse_method == 'ocr'})..."
|
||||
)
|
||||
content_list, md_content = MineruParser.parse_pdf(
|
||||
file_path, output_dir, use_ocr=(parse_method == "ocr")
|
||||
)
|
||||
elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
|
||||
self.logger.info("Detected image file, using image parser...")
|
||||
content_list, md_content = MineruParser.parse_image(
|
||||
file_path, output_dir
|
||||
)
|
||||
elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
|
||||
self.logger.info("Detected Office document, using Office parser...")
|
||||
content_list, md_content = MineruParser.parse_office_doc(
|
||||
file_path, output_dir
|
||||
)
|
||||
else:
|
||||
# For other or unknown formats, use generic parser
|
||||
self.logger.info(
|
||||
f"Using generic parser for {ext} file (method={parse_method})..."
|
||||
)
|
||||
content_list, md_content = MineruParser.parse_document(
|
||||
file_path, parse_method=parse_method, output_dir=output_dir
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error during parsing with specific parser: {str(e)}")
|
||||
self.logger.warning("Falling back to generic parser...")
|
||||
# If specific parser fails, fall back to generic parser
|
||||
content_list, md_content = MineruParser.parse_document(
|
||||
file_path, parse_method=parse_method, output_dir=output_dir
|
||||
)
|
||||
|
||||
self.logger.info(
|
||||
f"Parsing complete! Extracted {len(content_list)} content blocks"
|
||||
)
|
||||
self.logger.info(f"Markdown text length: {len(md_content)} characters")
|
||||
|
||||
# Display content statistics if requested
|
||||
if display_stats:
|
||||
self.logger.info("\nContent Information:")
|
||||
self.logger.info(f"* Total blocks in content_list: {len(content_list)}")
|
||||
self.logger.info(f"* Markdown content length: {len(md_content)} characters")
|
||||
|
||||
# Count elements by type
|
||||
block_types: Dict[str, int] = {}
|
||||
for block in content_list:
|
||||
if isinstance(block, dict):
|
||||
block_type = block.get("type", "unknown")
|
||||
if isinstance(block_type, str):
|
||||
block_types[block_type] = block_types.get(block_type, 0) + 1
|
||||
|
||||
self.logger.info("* Content block types:")
|
||||
for block_type, count in block_types.items():
|
||||
self.logger.info(f" - {block_type}: {count}")
|
||||
|
||||
return content_list, md_content
|
||||
|
||||
def _separate_content(
|
||||
self, content_list: List[Dict[str, Any]]
|
||||
) -> Tuple[str, List[Dict[str, Any]]]:
|
||||
"""
|
||||
Separate text content and multimodal content
|
||||
|
||||
Args:
|
||||
content_list: Content list from MinerU parsing
|
||||
|
||||
Returns:
|
||||
(text_content, multimodal_items): Pure text content and multimodal items list
|
||||
"""
|
||||
text_parts = []
|
||||
multimodal_items = []
|
||||
|
||||
for item in content_list:
|
||||
content_type = item.get("type", "text")
|
||||
|
||||
if content_type == "text":
|
||||
# Text content
|
||||
text = item.get("text", "")
|
||||
if text.strip():
|
||||
text_parts.append(text)
|
||||
else:
|
||||
# Multimodal content (image, table, equation, etc.)
|
||||
multimodal_items.append(item)
|
||||
|
||||
# Merge all text content
|
||||
text_content = "\n\n".join(text_parts)
|
||||
|
||||
self.logger.info("Content separation complete:")
|
||||
self.logger.info(f" - Text content length: {len(text_content)} characters")
|
||||
self.logger.info(f" - Multimodal items count: {len(multimodal_items)}")
|
||||
|
||||
# Count multimodal types
|
||||
modal_types = {}
|
||||
for item in multimodal_items:
|
||||
modal_type = item.get("type", "unknown")
|
||||
modal_types[modal_type] = modal_types.get(modal_type, 0) + 1
|
||||
|
||||
if modal_types:
|
||||
self.logger.info(f" - Multimodal type distribution: {modal_types}")
|
||||
|
||||
return text_content, multimodal_items
|
||||
|
||||
async def _insert_text_content(
|
||||
self,
|
||||
input: str | list[str],
|
||||
split_by_character: str | None = None,
|
||||
split_by_character_only: bool = False,
|
||||
ids: str | list[str] | None = None,
|
||||
file_paths: str | list[str] | None = None,
|
||||
):
|
||||
"""
|
||||
Insert pure text content into LightRAG
|
||||
|
||||
Args:
|
||||
input: Single document string or list of document strings
|
||||
split_by_character: If not None, split the text on this character; any resulting chunk longer than
chunk_token_size is split again by token size.
split_by_character_only: If True, split on the character only; ignored when
split_by_character is None.
ids: A single document ID or a list of unique document IDs; if not provided, MD5 hash IDs are generated
file_paths: A single file path or a list of file paths, used for citation
|
||||
"""
|
||||
self.logger.info("Starting text content insertion into LightRAG...")
|
||||
|
||||
# Use LightRAG's insert method with all parameters
|
||||
await self.lightrag.ainsert(
|
||||
input=input,
|
||||
file_paths=file_paths,
|
||||
split_by_character=split_by_character,
|
||||
split_by_character_only=split_by_character_only,
|
||||
ids=ids,
|
||||
)
|
||||
|
||||
self.logger.info("Text content insertion complete")
|
||||
|
||||
async def _process_multimodal_content(
|
||||
self, multimodal_items: List[Dict[str, Any]], file_path: str
|
||||
):
|
||||
"""
|
||||
Process multimodal content (using specialized processors)
|
||||
|
||||
Args:
|
||||
multimodal_items: List of multimodal items
|
||||
file_path: File path (for reference)
|
||||
"""
|
||||
if not multimodal_items:
|
||||
self.logger.debug("No multimodal content to process")
|
||||
return
|
||||
|
||||
self.logger.info("Starting multimodal content processing...")
|
||||
|
||||
file_name = os.path.basename(file_path)
|
||||
|
||||
for i, item in enumerate(multimodal_items):
|
||||
try:
|
||||
content_type = item.get("type", "unknown")
|
||||
self.logger.info(
|
||||
f"Processing item {i+1}/{len(multimodal_items)}: {content_type} content"
|
||||
)
|
||||
|
||||
# Select appropriate processor
|
||||
processor = self._get_processor_for_type(content_type)
|
||||
|
||||
if processor:
|
||||
(
|
||||
enhanced_caption,
|
||||
entity_info,
|
||||
) = await processor.process_multimodal_content(
|
||||
modal_content=item,
|
||||
content_type=content_type,
|
||||
file_path=file_name,
|
||||
)
|
||||
self.logger.info(
|
||||
f"{content_type} processing complete: {entity_info.get('entity_name', 'Unknown')}"
|
||||
)
|
||||
else:
|
||||
self.logger.warning(
|
||||
f"No suitable processor found for {content_type} type content"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing multimodal content: {str(e)}")
|
||||
self.logger.debug("Exception details:", exc_info=True)
|
||||
continue
|
||||
|
||||
self.logger.info("Multimodal content processing complete")
|
||||
|
||||
def _get_processor_for_type(self, content_type: str):
|
||||
"""
|
||||
Get appropriate processor based on content type
|
||||
|
||||
Args:
|
||||
content_type: Content type
|
||||
|
||||
Returns:
|
||||
Corresponding processor instance
|
||||
"""
|
||||
# Direct mapping to corresponding processor
|
||||
if content_type == "image":
|
||||
return self.modal_processors.get("image")
|
||||
elif content_type == "table":
|
||||
return self.modal_processors.get("table")
|
||||
elif content_type == "equation":
|
||||
return self.modal_processors.get("equation")
|
||||
else:
|
||||
# For other types, use generic processor
|
||||
return self.modal_processors.get("generic")
|
||||
|
||||
async def process_document_complete(
|
||||
self,
|
||||
file_path: str,
|
||||
output_dir: str = "./output",
|
||||
parse_method: str = "auto",
|
||||
display_stats: bool = True,
|
||||
split_by_character: str | None = None,
|
||||
split_by_character_only: bool = False,
|
||||
doc_id: str | None = None,
|
||||
):
|
||||
"""
|
||||
Complete document processing workflow
|
||||
|
||||
Args:
|
||||
file_path: Path to the file to process
|
||||
output_dir: MinerU output directory
|
||||
parse_method: Parse method
|
||||
display_stats: Whether to display content statistics
|
||||
split_by_character: Optional character to split the text by
|
||||
split_by_character_only: If True, split only by the specified character
|
||||
doc_id: Optional document ID, if not provided MD5 hash will be generated
|
||||
"""
|
||||
# Ensure LightRAG is initialized
|
||||
await self._ensure_lightrag_initialized()
|
||||
|
||||
self.logger.info(f"Starting complete document processing: {file_path}")
|
||||
|
||||
# Step 1: Parse document using MinerU
|
||||
content_list, md_content = self.parse_document(
|
||||
file_path, output_dir, parse_method, display_stats
|
||||
)
|
||||
|
||||
# Step 2: Separate text and multimodal content
|
||||
text_content, multimodal_items = self._separate_content(content_list)
|
||||
|
||||
# Step 3: Insert pure text content with all parameters
|
||||
if text_content.strip():
|
||||
file_name = os.path.basename(file_path)
|
||||
await self._insert_text_content(
|
||||
text_content,
|
||||
file_paths=file_name,
|
||||
split_by_character=split_by_character,
|
||||
split_by_character_only=split_by_character_only,
|
||||
ids=doc_id,
|
||||
)
|
||||
|
||||
# Step 4: Process multimodal content (using specialized processors)
|
||||
if multimodal_items:
|
||||
await self._process_multimodal_content(multimodal_items, file_path)
|
||||
|
||||
self.logger.info(f"Document {file_path} processing complete!")
|
||||
|
||||
async def process_folder_complete(
|
||||
self,
|
||||
folder_path: str,
|
||||
output_dir: str = "./output",
|
||||
parse_method: str = "auto",
|
||||
display_stats: bool = False,
|
||||
split_by_character: str | None = None,
|
||||
split_by_character_only: bool = False,
|
||||
file_extensions: Optional[List[str]] = None,
|
||||
recursive: bool = True,
|
||||
max_workers: int = 1,
|
||||
):
|
||||
"""
|
||||
Process all files in a folder in batch
|
||||
|
||||
Args:
|
||||
folder_path: Path to the folder to process
|
||||
output_dir: MinerU output directory
|
||||
parse_method: Parse method
|
||||
display_stats: Whether to display content statistics for each file (recommended False for batch processing)
|
||||
split_by_character: Optional character to split text by
|
||||
split_by_character_only: If True, split only by the specified character
|
||||
file_extensions: List of file extensions to process, e.g. [".pdf", ".docx"]. If None, process all supported formats
|
||||
recursive: Whether to recursively process subfolders
|
||||
max_workers: Maximum number of concurrent workers
|
||||
"""
|
||||
# Ensure LightRAG is initialized
|
||||
await self._ensure_lightrag_initialized()
|
||||
|
||||
folder_path = Path(folder_path)
|
||||
if not folder_path.exists() or not folder_path.is_dir():
|
||||
raise ValueError(
|
||||
f"Folder does not exist or is not a valid directory: {folder_path}"
|
||||
)
|
||||
|
||||
# Supported file formats
|
||||
supported_extensions = {
|
||||
".pdf",
|
||||
".jpg",
|
||||
".jpeg",
|
||||
".png",
|
||||
".bmp",
|
||||
".tiff",
|
||||
".tif",
|
||||
".doc",
|
||||
".docx",
|
||||
".ppt",
|
||||
".pptx",
|
||||
".txt",
|
||||
".md",
|
||||
}
|
||||
|
||||
# Use specified extensions or all supported formats
|
||||
if file_extensions:
|
||||
target_extensions = set(ext.lower() for ext in file_extensions)
|
||||
# Validate if all are supported formats
|
||||
unsupported = target_extensions - supported_extensions
|
||||
if unsupported:
|
||||
self.logger.warning(
|
||||
f"The following file formats may not be fully supported: {unsupported}"
|
||||
)
|
||||
else:
|
||||
target_extensions = supported_extensions
|
||||
|
||||
# Collect all files to process
|
||||
files_to_process = []
|
||||
|
||||
if recursive:
|
||||
# Recursively traverse all subfolders
|
||||
for file_path in folder_path.rglob("*"):
|
||||
if (
|
||||
file_path.is_file()
|
||||
and file_path.suffix.lower() in target_extensions
|
||||
):
|
||||
files_to_process.append(file_path)
|
||||
else:
|
||||
# Process only current folder
|
||||
for file_path in folder_path.glob("*"):
|
||||
if (
|
||||
file_path.is_file()
|
||||
and file_path.suffix.lower() in target_extensions
|
||||
):
|
||||
files_to_process.append(file_path)
|
||||
|
||||
if not files_to_process:
|
||||
self.logger.info(f"No files to process found in {folder_path}")
|
||||
return
|
||||
|
||||
self.logger.info(f"Found {len(files_to_process)} files to process")
|
||||
self.logger.info("File type distribution:")
|
||||
|
||||
# Count file types
|
||||
file_type_count = {}
|
||||
for file_path in files_to_process:
|
||||
ext = file_path.suffix.lower()
|
||||
file_type_count[ext] = file_type_count.get(ext, 0) + 1
|
||||
|
||||
for ext, count in sorted(file_type_count.items()):
|
||||
self.logger.info(f" {ext}: {count} files")
|
||||
|
||||
# Create progress tracking
|
||||
processed_count = 0
|
||||
failed_files = []
|
||||
|
||||
# Use semaphore to control concurrency
|
||||
semaphore = asyncio.Semaphore(max_workers)
|
||||
|
||||
async def process_single_file(file_path: Path, index: int) -> None:
|
||||
"""Process a single file"""
|
||||
async with semaphore:
|
||||
nonlocal processed_count
|
||||
try:
|
||||
self.logger.info(
|
||||
f"[{index}/{len(files_to_process)}] Processing: {file_path}"
|
||||
)
|
||||
|
||||
# Create separate output directory for each file
|
||||
file_output_dir = Path(output_dir) / file_path.stem
|
||||
file_output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Process file
|
||||
await self.process_document_complete(
|
||||
file_path=str(file_path),
|
||||
output_dir=str(file_output_dir),
|
||||
parse_method=parse_method,
|
||||
display_stats=display_stats,
|
||||
split_by_character=split_by_character,
|
||||
split_by_character_only=split_by_character_only,
|
||||
)
|
||||
|
||||
processed_count += 1
|
||||
self.logger.info(
|
||||
f"[{index}/{len(files_to_process)}] Successfully processed: {file_path}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
f"[{index}/{len(files_to_process)}] Failed to process: {file_path}"
|
||||
)
|
||||
self.logger.error(f"Error: {str(e)}")
|
||||
failed_files.append((file_path, str(e)))
|
||||
|
||||
# Create all processing tasks
|
||||
tasks = []
|
||||
for index, file_path in enumerate(files_to_process, 1):
|
||||
task = process_single_file(file_path, index)
|
||||
tasks.append(task)
|
||||
|
||||
# Wait for all tasks to complete
|
||||
await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
# Output processing statistics
|
||||
self.logger.info("\n===== Batch Processing Complete =====")
|
||||
self.logger.info(f"Total files: {len(files_to_process)}")
|
||||
self.logger.info(f"Successfully processed: {processed_count}")
|
||||
self.logger.info(f"Failed: {len(failed_files)}")
|
||||
|
||||
if failed_files:
|
||||
self.logger.info("\nFailed files:")
|
||||
for file_path, error in failed_files:
|
||||
self.logger.info(f" - {file_path}: {error}")
|
||||
|
||||
return {
|
||||
"total": len(files_to_process),
|
||||
"success": processed_count,
|
||||
"failed": len(failed_files),
|
||||
"failed_files": failed_files,
|
||||
}
|
||||
|
||||
async def query_with_multimodal(self, query: str, mode: str = "hybrid") -> str:
|
||||
"""
|
||||
Query with multimodal content support
|
||||
|
||||
Args:
|
||||
query: Query content
|
||||
mode: Query mode
|
||||
|
||||
Returns:
|
||||
Query result
|
||||
"""
|
||||
if self.lightrag is None:
|
||||
raise ValueError(
|
||||
"No LightRAG instance available. "
|
||||
"Please either:\n"
|
||||
"1. Provide a pre-initialized LightRAG instance when creating RAGAnything, or\n"
|
||||
"2. Process documents first using process_document_complete() or process_folder_complete() "
|
||||
"to create and populate the LightRAG instance."
|
||||
)
|
||||
|
||||
result = await self.lightrag.aquery(query, param=QueryParam(mode=mode))
|
||||
|
||||
return result
|
||||
|
||||
def get_processor_info(self) -> Dict[str, Any]:
|
||||
"""Get processor information"""
|
||||
if not self.modal_processors:
|
||||
return {"status": "Not initialized"}
|
||||
|
||||
info = {
|
||||
"status": "Initialized",
|
||||
"processors": {},
|
||||
"models": {
|
||||
"llm_model": "External function"
|
||||
if self.llm_model_func
|
||||
else "Not provided",
|
||||
"vision_model": "External function"
|
||||
if self.vision_model_func
|
||||
else "Not provided",
|
||||
"embedding_model": "External function"
|
||||
if self.embedding_func
|
||||
else "Not provided",
|
||||
},
|
||||
}
|
||||
|
||||
for proc_type, processor in self.modal_processors.items():
|
||||
info["processors"][proc_type] = {
|
||||
"class": processor.__class__.__name__,
|
||||
"supports": self._get_processor_supports(proc_type),
|
||||
}
|
||||
|
||||
return info
|
||||
|
||||
def _get_processor_supports(self, proc_type: str) -> List[str]:
|
||||
"""Get processor supported features"""
|
||||
supports_map = {
|
||||
"image": [
|
||||
"Image content analysis",
|
||||
"Visual understanding",
|
||||
"Image description generation",
|
||||
"Image entity extraction",
|
||||
],
|
||||
"table": [
|
||||
"Table structure analysis",
|
||||
"Data statistics",
|
||||
"Trend identification",
|
||||
"Table entity extraction",
|
||||
],
|
||||
"equation": [
|
||||
"Mathematical formula parsing",
|
||||
"Variable identification",
|
||||
"Formula meaning explanation",
|
||||
"Formula entity extraction",
|
||||
],
|
||||
"generic": [
|
||||
"General content analysis",
|
||||
"Structured processing",
|
||||
"Entity extraction",
|
||||
],
|
||||
}
|
||||
return supports_map.get(proc_type, ["Basic processing"])
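The batch routine in the removed file above bounds concurrency with an `asyncio.Semaphore` and records failures instead of aborting the run. A minimal, self-contained sketch of that pattern follows. `process_one`, `process_batch`, and the file names are hypothetical stand-ins for illustration, not part of LightRAG.

```python
import asyncio


async def process_one(name: str) -> None:
    """Hypothetical stand-in for a per-file processing coroutine."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    if name.endswith(".bad"):
        raise ValueError(f"cannot parse {name}")


async def process_batch(names: list[str], max_workers: int = 2) -> dict:
    """Run process_one over names with at most max_workers in flight."""
    semaphore = asyncio.Semaphore(max_workers)
    processed = 0
    failed: list[tuple[str, str]] = []

    async def worker(name: str) -> None:
        nonlocal processed
        async with semaphore:
            try:
                await process_one(name)
                processed += 1
            except Exception as exc:  # record the failure and keep going
                failed.append((name, str(exc)))

    await asyncio.gather(*(worker(name) for name in names))
    return {
        "total": len(names),
        "success": processed,
        "failed": len(failed),
        "failed_files": failed,
    }


summary = asyncio.run(process_batch(["a.pdf", "b.docx", "c.bad"]))
```

Because each worker catches its own exception, one unreadable file does not cancel the remaining tasks, and the summary dict reports both counts.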
|
||||
|
|
@ -990,6 +990,7 @@ class CacheData:
|
|||
max_val: float | None = None
|
||||
mode: str = "default"
|
||||
cache_type: str = "query"
|
||||
chunk_id: str | None = None
|
||||
|
||||
|
||||
async def save_to_cache(hashing_kv, cache_data: CacheData):
|
||||
|
|
@ -1030,6 +1031,7 @@ async def save_to_cache(hashing_kv, cache_data: CacheData):
|
|||
mode_cache[cache_data.args_hash] = {
|
||||
"return": cache_data.content,
|
||||
"cache_type": cache_data.cache_type,
|
||||
"chunk_id": cache_data.chunk_id,
|
||||
"embedding": cache_data.quantized.tobytes().hex()
|
||||
if cache_data.quantized is not None
|
||||
else None,
|
||||
|
|
@ -1534,6 +1536,7 @@ async def use_llm_func_with_cache(
|
|||
max_tokens: int = None,
|
||||
history_messages: list[dict[str, str]] = None,
|
||||
cache_type: str = "extract",
|
||||
chunk_id: str | None = None,
|
||||
) -> str:
|
||||
"""Call LLM function with cache support
|
||||
|
||||
|
|
@ -1547,6 +1550,7 @@ async def use_llm_func_with_cache(
|
|||
max_tokens: Maximum tokens for generation
|
||||
history_messages: History messages list
|
||||
cache_type: Type of cache
|
||||
chunk_id: Chunk identifier to store in cache
|
||||
|
||||
Returns:
|
||||
LLM response text
|
||||
|
|
@ -1589,6 +1593,7 @@ async def use_llm_func_with_cache(
|
|||
content=res,
|
||||
prompt=_prompt,
|
||||
cache_type=cache_type,
|
||||
chunk_id=chunk_id,
|
||||
),
|
||||
)
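The hunks above thread a `chunk_id` from the extraction call into each stored cache entry. As a hedged illustration of what that field makes possible, the sketch below builds entries with the same keys the diff shows (`return`, `cache_type`, `chunk_id`) and looks up the entries produced for one chunk. `CacheRecord`, `to_entry`, and `keys_for_chunk` are hypothetical names, and the lookup is an assumed use case, not something this diff confirms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRecord:
    """Hypothetical stand-in for the fields written by save_to_cache."""

    content: str
    cache_type: str = "extract"
    chunk_id: Optional[str] = None


def to_entry(record: CacheRecord) -> dict:
    """Build the stored dict; chunk_id is passed through, None included."""
    return {
        "return": record.content,
        "cache_type": record.cache_type,
        "chunk_id": record.chunk_id,
    }


def keys_for_chunk(mode_cache: dict[str, dict], chunk_id: str) -> list[str]:
    """Return the cache keys whose entry was produced for chunk_id."""
    return [
        key
        for key, entry in mode_cache.items()
        if entry.get("chunk_id") == chunk_id
    ]


mode_cache = {
    "hash-1": to_entry(CacheRecord("entities A", chunk_id="chunk-1")),
    "hash-2": to_entry(CacheRecord("entities B", chunk_id="chunk-2")),
    "hash-3": to_entry(CacheRecord("answer", cache_type="query")),
}
stale_keys = keys_for_chunk(mode_cache, "chunk-1")
```

Entries that were not produced for a specific chunk, such as query results, simply carry `chunk_id` as `None` and never match the lookup.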
|
||||
|
||||
|
|
|
|||
|
|
@ -4,6 +4,7 @@ import time
|
|||
import asyncio
|
||||
from typing import Any, cast
|
||||
|
||||
from .base import DeletionResult
|
||||
from .kg.shared_storage import get_graph_db_lock
|
||||
from .prompt import GRAPH_FIELD_SEP
|
||||
from .utils import compute_mdhash_id, logger
|
||||
|
|
@ -12,7 +13,7 @@ from .base import StorageNameSpace
|
|||
|
||||
async def adelete_by_entity(
|
||||
chunk_entity_relation_graph, entities_vdb, relationships_vdb, entity_name: str
|
||||
) -> None:
|
||||
) -> DeletionResult:
|
||||
"""Asynchronously delete an entity and all its relationships.
|
||||
|
||||
Args:
|
||||
|
|
@ -25,18 +26,43 @@ async def adelete_by_entity(
|
|||
# Use graph database lock to ensure atomic graph and vector db operations
|
||||
async with graph_db_lock:
|
||||
try:
|
||||
# Check if the entity exists
|
||||
if not await chunk_entity_relation_graph.has_node(entity_name):
|
||||
logger.warning(f"Entity '{entity_name}' not found.")
|
||||
return DeletionResult(
|
||||
status="not_found",
|
||||
doc_id=entity_name,
|
||||
message=f"Entity '{entity_name}' not found.",
|
||||
status_code=404,
|
||||
)
|
||||
# Retrieve related relationships before deleting the node
|
||||
edges = await chunk_entity_relation_graph.get_node_edges(entity_name)
|
||||
related_relations_count = len(edges) if edges else 0
|
||||
|
||||
await entities_vdb.delete_entity(entity_name)
|
||||
await relationships_vdb.delete_entity_relation(entity_name)
|
||||
await chunk_entity_relation_graph.delete_node(entity_name)
|
||||
|
||||
logger.info(
|
||||
f"Entity '{entity_name}' and its relationships have been deleted."
|
||||
)
|
||||
message = f"Entity '{entity_name}' and its {related_relations_count} relationships have been deleted."
|
||||
logger.info(message)
|
||||
await _delete_by_entity_done(
|
||||
entities_vdb, relationships_vdb, chunk_entity_relation_graph
|
||||
)
|
||||
return DeletionResult(
|
||||
status="success",
|
||||
doc_id=entity_name,
|
||||
message=message,
|
||||
status_code=200,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error while deleting entity '{entity_name}': {e}")
|
||||
error_message = f"Error while deleting entity '{entity_name}': {e}"
|
||||
logger.error(error_message)
|
||||
return DeletionResult(
|
||||
status="fail",
|
||||
doc_id=entity_name,
|
||||
message=error_message,
|
||||
status_code=500,
|
||||
)
|
||||
|
||||
|
||||
async def _delete_by_entity_done(
|
||||
|
|
@ -60,7 +86,7 @@ async def adelete_by_relation(
|
|||
relationships_vdb,
|
||||
source_entity: str,
|
||||
target_entity: str,
|
||||
) -> None:
|
||||
) -> DeletionResult:
|
||||
"""Asynchronously delete a relation between two entities.
|
||||
|
||||
Args:
|
||||
|
|
@ -69,6 +95,7 @@ async def adelete_by_relation(
|
|||
source_entity: Name of the source entity
|
||||
target_entity: Name of the target entity
|
||||
"""
|
||||
relation_str = f"{source_entity} -> {target_entity}"
|
||||
graph_db_lock = get_graph_db_lock(enable_logging=False)
|
||||
# Use graph database lock to ensure atomic graph and vector db operations
|
||||
async with graph_db_lock:
|
||||
|
|
@ -78,29 +105,45 @@ async def adelete_by_relation(
|
|||
source_entity, target_entity
|
||||
)
|
||||
if not edge_exists:
|
||||
logger.warning(
|
||||
f"Relation from '{source_entity}' to '{target_entity}' does not exist"
|
||||
message = f"Relation from '{source_entity}' to '{target_entity}' does not exist"
|
||||
logger.warning(message)
|
||||
return DeletionResult(
|
||||
status="not_found",
|
||||
doc_id=relation_str,
|
||||
message=message,
|
||||
status_code=404,
|
||||
)
|
||||
return
|
||||
|
||||
# Delete relation from vector database
|
||||
relation_id = compute_mdhash_id(
|
||||
source_entity + target_entity, prefix="rel-"
|
||||
)
|
||||
await relationships_vdb.delete([relation_id])
|
||||
rel_ids_to_delete = [
|
||||
compute_mdhash_id(source_entity + target_entity, prefix="rel-"),
|
||||
compute_mdhash_id(target_entity + source_entity, prefix="rel-"),
|
||||
]
|
||||
|
||||
await relationships_vdb.delete(rel_ids_to_delete)
|
||||
|
||||
# Delete relation from knowledge graph
|
||||
await chunk_entity_relation_graph.remove_edges(
|
||||
[(source_entity, target_entity)]
|
||||
)
|
||||
|
||||
logger.info(
|
||||
f"Successfully deleted relation from '{source_entity}' to '{target_entity}'"
|
||||
)
|
||||
message = f"Successfully deleted relation from '{source_entity}' to '{target_entity}'"
|
||||
logger.info(message)
|
||||
await _delete_relation_done(relationships_vdb, chunk_entity_relation_graph)
|
||||
return DeletionResult(
|
||||
status="success",
|
||||
doc_id=relation_str,
|
||||
message=message,
|
||||
status_code=200,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Error while deleting relation from '{source_entity}' to '{target_entity}': {e}"
|
||||
error_message = f"Error while deleting relation from '{source_entity}' to '{target_entity}': {e}"
|
||||
logger.error(error_message)
|
||||
return DeletionResult(
|
||||
status="fail",
|
||||
doc_id=relation_str,
|
||||
message=error_message,
|
||||
status_code=500,
|
||||
)
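The relation-deletion hunk above does two things: it returns a structured `DeletionResult` instead of `None`, and it removes the vector entry under both key orderings. The sketch below reproduces that flow over plain in-memory containers. The `DeletionResult` here is an illustrative mirror of the four fields used in the diff, `mdhash_id` assumes that `compute_mdhash_id` is a prefix plus an MD5 hex digest, and treating edges as undirected is a simplification for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DeletionResult:
    """Illustrative mirror of the fields used in the diff, not the real class."""

    status: str
    doc_id: str
    message: str
    status_code: int = 200


def mdhash_id(content: str, prefix: str = "") -> str:
    """Assumed shape of compute_mdhash_id: prefix plus MD5 hex digest."""
    return prefix + hashlib.md5(content.encode()).hexdigest()


def delete_relation(
    edges: set[tuple[str, str]],
    vectors: dict[str, dict],
    src: str,
    tgt: str,
) -> DeletionResult:
    """Delete one relation from toy graph and vector stores."""
    relation_str = f"{src} -> {tgt}"
    if (src, tgt) not in edges and (tgt, src) not in edges:
        message = f"Relation from '{src}' to '{tgt}' does not exist"
        return DeletionResult("not_found", relation_str, message, 404)

    # The vector key depends on argument order, so remove both candidates.
    for rel_id in (
        mdhash_id(src + tgt, prefix="rel-"),
        mdhash_id(tgt + src, prefix="rel-"),
    ):
        vectors.pop(rel_id, None)

    edges.discard((src, tgt))
    edges.discard((tgt, src))
    message = f"Successfully deleted relation from '{src}' to '{tgt}'"
    return DeletionResult("success", relation_str, message, 200)


edges = {("Google", "Gmail")}
# The vector entry was stored under the reversed ordering.
vectors = {mdhash_id("Gmail" + "Google", prefix="rel-"): {"content": "..."}}
first = delete_relation(edges, vectors, "Google", "Gmail")
second = delete_relation(edges, vectors, "Google", "Gmail")
```

Deleting only the `src + tgt` key would have left the reversed-order entry behind in this example, which is the case the second hash covers. A caller can branch on `status` or `status_code` rather than parsing log output.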
|
||||
|
||||
|
||||
|
|
|
|||