feat: adding docstrings (#1045)
## Description

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
This commit is contained in: parent ada3f7b086, commit c936f5e0a3.
3 changed files with 403 additions and 0 deletions.
@@ -16,6 +16,128 @@ async def add(
    graph_db_config: dict = None,
    dataset_id: UUID = None,
):
    """
    Add data to Cognee for knowledge graph processing.

    This is the first step in the Cognee workflow - it ingests raw data and prepares it
    for processing. The function accepts various data formats including text, files, and
    binary streams, then stores them in a specified dataset for further processing.

    Prerequisites:
        - **LLM_API_KEY**: Must be set in environment variables for content processing
        - **Database Setup**: Relational and vector databases must be configured
        - **User Authentication**: Uses the default user if none is provided (created automatically)

    Supported Input Types:
        - **Text strings**: Direct text content (str) - any string not starting with "/" or "file://"
        - **File paths**: Local file paths as strings in these formats:
            * Absolute paths: "/path/to/document.pdf"
            * File URLs: "file:///path/to/document.pdf" or "file://relative/path.txt"
            * S3 paths: "s3://bucket-name/path/to/file.pdf"
        - **Binary file objects**: File handles/streams (BinaryIO)
        - **Lists**: Multiple files or text strings in a single call

    Supported File Formats:
        - Text files (.txt, .md, .csv)
        - PDFs (.pdf)
        - Images (.png, .jpg, .jpeg) - extracted via OCR/vision models
        - Audio files (.mp3, .wav) - transcribed to text
        - Code files (.py, .js, .ts, etc.) - parsed for structure and content
        - Office documents (.docx, .pptx)

    Workflow:
        1. **Data Resolution**: Resolves file paths and validates accessibility
        2. **Content Extraction**: Extracts text content from various file formats
        3. **Dataset Storage**: Stores processed content in the specified dataset
        4. **Metadata Tracking**: Records file metadata, timestamps, and user permissions
        5. **Permission Assignment**: Grants the user read/write/delete/share permissions on the dataset

    Args:
        data: The data to ingest. Can be:
            - Single text string: "Your text content here"
            - Absolute file path: "/path/to/document.pdf"
            - File URL: "file:///absolute/path/to/document.pdf" or "file://relative/path.txt"
            - S3 path: "s3://my-bucket/documents/file.pdf"
            - List of mixed types: ["text content", "/path/file.pdf", "file://doc.txt", file_handle]
            - Binary file object: open("file.txt", "rb")
        dataset_name: Name of the dataset to store data in. Defaults to "main_dataset".
            Create separate datasets to organize different knowledge domains.
        user: User object for authentication and permissions. Uses the default user if None.
            Default user: "default_user@example.com" (created automatically on first use).
            Users can only access datasets they have permissions for.
        node_set: Optional list of node identifiers for graph organization and access control.
            Used for grouping related data points in the knowledge graph.
        vector_db_config: Optional configuration for the vector database (for custom setups).
        graph_db_config: Optional configuration for the graph database (for custom setups).
        dataset_id: Optional specific dataset UUID to use instead of dataset_name.

    Returns:
        PipelineRunInfo: Information about the ingestion pipeline execution, including:
            - Pipeline run ID for tracking
            - Dataset ID where the data was stored
            - Processing status and any errors
            - Execution timestamps and metadata

    Next Steps:
        After successfully adding data, call `cognify()` to process the ingested content:

        ```python
        import cognee

        # Step 1: Add your data (text content or file path)
        await cognee.add("Your document content")  # Raw text
        # OR
        await cognee.add("/path/to/your/file.pdf")  # File path

        # Step 2: Process into a knowledge graph
        await cognee.cognify()

        # Step 3: Search and query
        results = await cognee.search("What insights can you find?")
        ```

    Example Usage:
        ```python
        # Add a single text document
        await cognee.add("Natural language processing is a field of AI...")

        # Add multiple files with different path formats
        await cognee.add([
            "/absolute/path/to/research_paper.pdf",  # Absolute path
            "file://relative/path/to/dataset.csv",   # Relative file URL
            "file:///absolute/path/to/report.docx",  # Absolute file URL
            "s3://my-bucket/documents/data.json",    # S3 path
            "Additional context text",               # Raw text content
        ])

        # Add to a specific dataset
        await cognee.add(
            data="Project documentation content",
            dataset_name="project_docs",
        )

        # Add a single file
        await cognee.add("/home/user/documents/analysis.pdf")
        ```

    Environment Variables:
        Required:
            - LLM_API_KEY: API key for your LLM provider (OpenAI, Anthropic, etc.)

        Optional:
            - LLM_PROVIDER: "openai" (default), "anthropic", "gemini", "ollama"
            - LLM_MODEL: Model name (default: "gpt-4o-mini")
            - DEFAULT_USER_EMAIL: Custom default user email
            - DEFAULT_USER_PASSWORD: Custom default user password
            - VECTOR_DB_PROVIDER: "lancedb" (default), "chromadb", "qdrant", "weaviate"
            - GRAPH_DATABASE_PROVIDER: "kuzu" (default), "neo4j", "networkx"

    Raises:
        FileNotFoundError: If specified file paths don't exist
        PermissionError: If the user lacks access to the files or the dataset
        UnsupportedFileTypeError: If the file format cannot be processed
        InvalidValueError: If LLM_API_KEY is not set or invalid
    """
    tasks = [
        Task(resolve_data_directories),
        Task(ingest_data, dataset_name, user, node_set, dataset_id),

@@ -39,6 +39,151 @@ async def cognify(
    graph_db_config: dict = None,
    run_in_background: bool = False,
):
    """
    Transform ingested data into a structured knowledge graph.

    This is the core processing step in Cognee that converts raw text and documents
    into an intelligent knowledge graph. It analyzes content, extracts entities and
    relationships, and creates semantic connections for enhanced search and reasoning.

    Prerequisites:
        - **LLM_API_KEY**: Must be configured (required for entity extraction and graph generation)
        - **Data Added**: Must have data previously added via `cognee.add()`
        - **Vector Database**: Must be accessible for embeddings storage
        - **Graph Database**: Must be accessible for relationship storage

    Input Requirements:
        - **Datasets**: Must contain data previously added via `cognee.add()`
        - **Content Types**: Works with any text-extractable content, including:
            * Natural language documents
            * Structured data (CSV, JSON)
            * Code repositories
            * Academic papers and technical documentation
            * Mixed multimedia content (with text extraction)

    Processing Pipeline:
        1. **Document Classification**: Identifies document types and structures
        2. **Permission Validation**: Ensures the user has processing rights
        3. **Text Chunking**: Breaks content into semantically meaningful segments
        4. **Entity Extraction**: Identifies key concepts, people, places, and organizations
        5. **Relationship Detection**: Discovers connections between entities
        6. **Graph Construction**: Builds a semantic knowledge graph with embeddings
        7. **Content Summarization**: Creates hierarchical summaries for navigation

    Graph Model Customization:
        The `graph_model` parameter allows custom knowledge structures:
        - **Default**: General-purpose KnowledgeGraph for any domain
        - **Custom Models**: Domain-specific schemas (e.g., scientific papers, code analysis)
        - **Ontology Integration**: Use `ontology_file_path` for predefined vocabularies

    Args:
        datasets: Dataset name(s) or dataset UUID(s) to process. Processes all available data if None.
            - Single dataset: "my_dataset"
            - Multiple datasets: ["docs", "research", "reports"]
            - None: Process all datasets for the user
        user: User context for authentication and data access. Uses the default if None.
        graph_model: Pydantic model defining the knowledge graph structure.
            Defaults to KnowledgeGraph for general-purpose processing.
        chunker: Text chunking strategy (TextChunker, LangchainChunker).
            - TextChunker: Paragraph-based chunking (default, most reliable)
            - LangchainChunker: Recursive character splitting with overlap
            Determines how documents are segmented for processing.
        chunk_size: Maximum tokens per chunk. Auto-calculated based on the LLM if None.
            Formula: min(embedding_max_tokens, llm_max_tokens // 2)
            Default limits: ~512-8192 tokens depending on the models.
            Smaller chunks = more granular but potentially fragmented knowledge.
        ontology_file_path: Path to an RDF/OWL ontology file for domain-specific entity types.
            Useful for specialized fields like medical or legal documents.
        vector_db_config: Custom vector database configuration for embeddings storage.
        graph_db_config: Custom graph database configuration for relationship storage.
        run_in_background: If True, starts processing asynchronously and returns immediately.
            If False, waits for completion before returning.
            Background mode is recommended for large datasets (>100MB).
            Use the pipeline_run_id from the return value to monitor progress.

    Returns:
        Union[dict, list[PipelineRunInfo]]:
            - **Blocking mode**: Dictionary mapping dataset_id -> PipelineRunInfo with:
                * Processing status (completed/failed/in_progress)
                * Extracted entity and relationship counts
                * Processing duration and resource usage
                * Error details if any failures occurred
            - **Background mode**: List of PipelineRunInfo objects for tracking progress
                * Use pipeline_run_id to monitor status
                * Check completion via the pipeline monitoring APIs

    Next Steps:
        After successful cognify processing, use the search functions to query the knowledge:

        ```python
        import cognee
        from cognee import SearchType

        # Process your data into a knowledge graph
        await cognee.cognify()

        # Query for insights using different search types:

        # 1. Natural language completion with graph context
        insights = await cognee.search(
            "What are the main themes?",
            query_type=SearchType.GRAPH_COMPLETION,
        )

        # 2. Get entity relationships and connections
        relationships = await cognee.search(
            "connections between concepts",
            query_type=SearchType.INSIGHTS,
        )

        # 3. Find relevant document chunks
        chunks = await cognee.search(
            "specific topic",
            query_type=SearchType.CHUNKS,
        )
        ```

    Advanced Usage:
        ```python
        # Custom domain model for scientific papers
        class ScientificPaper(DataPoint):
            title: str
            authors: List[str]
            methodology: str
            findings: List[str]

        await cognee.cognify(
            datasets=["research_papers"],
            graph_model=ScientificPaper,
            ontology_file_path="scientific_ontology.owl",
        )

        # Background processing for large datasets
        run_info = await cognee.cognify(
            datasets=["large_corpus"],
            run_in_background=True,
        )
        # Check status later with run_info.pipeline_run_id
        ```

    Environment Variables:
        Required:
            - LLM_API_KEY: API key for your LLM provider

        Optional (same as the add function):
            - LLM_PROVIDER, LLM_MODEL, VECTOR_DB_PROVIDER, GRAPH_DATABASE_PROVIDER
            - LLM_RATE_LIMIT_ENABLED: Enable rate limiting (default: False)
            - LLM_RATE_LIMIT_REQUESTS: Max requests per interval (default: 60)

    Raises:
        DatasetNotFoundError: If the specified datasets don't exist
        PermissionError: If the user lacks processing rights
        InvalidValueError: If LLM_API_KEY is not set
        OntologyParsingError: If the ontology file is malformed
        ValueError: If chunks exceed max token limits (reduce chunk_size)
        DatabaseNotCreatedError: If the databases are not properly initialized
    """
    tasks = await get_default_tasks(user, graph_model, chunker, chunk_size, ontology_file_path)

    if run_in_background:

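The `chunk_size` auto-calculation stated in the docstring above (`min(embedding_max_tokens, llm_max_tokens // 2)`) is easy to sanity-check standalone. The sketch below is illustrative only; the function name and the example token limits are made up, not cognee identifiers or real model limits.

```python
def default_chunk_size(embedding_max_tokens: int, llm_max_tokens: int) -> int:
    """Apply the docstring's auto-calculation formula for chunk_size.

    Hypothetical helper: the effective chunk size is capped by the embedding
    model's window and by half of the LLM's context window, whichever is
    smaller (leaving the other half of the LLM context for the prompt).
    """
    return min(embedding_max_tokens, llm_max_tokens // 2)


# Example: an 8192-token embedding model with a 16k-context LLM
print(default_chunk_size(8192, 16384))  # 8192

# Example: a small 512-token embedding model dominates the limit
print(default_chunk_size(512, 16384))   # 512
```

This is why the docstring quotes a ~512-8192 range: either the embedding window or half the LLM context becomes the binding constraint.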
@@ -20,6 +20,142 @@ async def search(
    node_type: Optional[Type] = None,
    node_name: Optional[List[str]] = None,
) -> list:
    """
    Search and query the knowledge graph for insights, information, and connections.

    This is the final step in the Cognee workflow; it retrieves information from the
    processed knowledge graph. It supports multiple search modes optimized for different
    use cases - from simple fact retrieval to complex reasoning and code analysis.

    Search Prerequisites:
        - **LLM_API_KEY**: Required for the GRAPH_COMPLETION and RAG_COMPLETION search types
        - **Data Added**: Must have data previously added via `cognee.add()`
        - **Knowledge Graph Built**: Must have processed data via `cognee.cognify()`
        - **Dataset Permissions**: User must have 'read' permission on the target datasets
        - **Vector Database**: Must be accessible for semantic search functionality

    Search Types & Use Cases:

        **GRAPH_COMPLETION** (Default - Recommended):
            Natural language Q&A using full graph context and LLM reasoning.
            Best for: Complex questions, analysis, summaries, insights.
            Returns: Conversational AI responses with graph-backed context.

        **RAG_COMPLETION**:
            Traditional RAG using document chunks without graph structure.
            Best for: Direct document retrieval, specific fact-finding.
            Returns: LLM responses based on relevant text chunks.

        **INSIGHTS**:
            Structured entity relationships and semantic connections.
            Best for: Understanding concept relationships, knowledge mapping.
            Returns: Formatted relationship data and entity connections.

        **CHUNKS**:
            Raw text segments that match the query semantically.
            Best for: Finding specific passages, citations, exact content.
            Returns: Ranked list of relevant text chunks with metadata.

        **SUMMARIES**:
            Pre-generated hierarchical summaries of content.
            Best for: Quick overviews, document abstracts, topic summaries.
            Returns: Multi-level summaries from detailed to high-level.

        **CODE**:
            Code-specific search with syntax and semantic understanding.
            Best for: Finding functions, classes, implementation patterns.
            Returns: Structured code information with context and relationships.

        **CYPHER**:
            Direct graph database queries using Cypher syntax.
            Best for: Advanced users, specific graph traversals, debugging.
            Returns: Raw graph query results.

    Args:
        query_text: Your question or search query in natural language.
            Examples:
            - "What are the main themes in this research?"
            - "How do these concepts relate to each other?"
            - "Find information about machine learning algorithms"
            - "What functions handle user authentication?"
        query_type: SearchType enum specifying the search mode.
            Defaults to GRAPH_COMPLETION for conversational AI responses.
        user: User context for data access permissions. Uses the default if None.
        datasets: Dataset name(s) to search within. Searches all accessible datasets if None.
            - Single dataset: "research_papers"
            - Multiple datasets: ["docs", "reports", "analysis"]
            - None: Search across all user datasets
        dataset_ids: Alternative to datasets - use specific UUID identifiers.
        system_prompt_path: Custom system prompt file for LLM-based search types.
            Defaults to "answer_simple_question.txt".
        top_k: Maximum number of results to return (1-N).
            Higher values provide more comprehensive but potentially noisy results.
        node_type: Filter results to specific entity types (for advanced filtering).
        node_name: Filter results to specific named entities (for targeted search).

    Returns:
        list: Search results in a format determined by query_type:

        **GRAPH_COMPLETION/RAG_COMPLETION**:
            [List of conversational AI response strings]
        **INSIGHTS**:
            [List of formatted relationship descriptions and entity connections]
        **CHUNKS**:
            [List of relevant text passages with source metadata]
        **SUMMARIES**:
            [List of hierarchical summaries from general to specific]
        **CODE**:
            [List of structured code information with context]

    Performance & Optimization:
        - **GRAPH_COMPLETION**: Slower but most intelligent; uses LLM + graph context
        - **RAG_COMPLETION**: Medium speed; uses LLM + document chunks (no graph traversal)
        - **INSIGHTS**: Fast; returns structured relationships without LLM processing
        - **CHUNKS**: Fastest; pure vector similarity search without LLM
        - **SUMMARIES**: Fast; returns pre-computed summaries
        - **CODE**: Medium speed; specialized for code understanding
        - **top_k**: Start with 10; increase for comprehensive analysis (max 100)
        - **datasets**: Specify datasets to improve speed and relevance

    Next Steps After Search:
        - Use results for further analysis or application integration
        - Combine different search types for comprehensive understanding
        - Export insights for reporting or downstream processing
        - Iterate with refined queries based on initial results

    Environment Variables:
        Required for LLM-based search types (GRAPH_COMPLETION, RAG_COMPLETION):
            - LLM_API_KEY: API key for your LLM provider

        Optional:
            - LLM_PROVIDER, LLM_MODEL: Configure the LLM for search responses
            - VECTOR_DB_PROVIDER: Must match what was used during cognify
            - GRAPH_DATABASE_PROVIDER: Must match what was used during cognify

    Raises:
        DatasetNotFoundError: If the specified datasets don't exist or aren't accessible
        PermissionDeniedError: If the user lacks read access to the requested datasets
        NoDataError: If no relevant data is found for the search query
        InvalidValueError: If LLM_API_KEY is not set (for LLM-based search types)
        ValueError: If query_text is empty or the search parameters are invalid
        CollectionNotFoundError: If the vector collection is not found (data not processed)
    """
    # We use lists from now on for datasets
    if isinstance(datasets, UUID) or isinstance(datasets, str):
        datasets = [datasets]

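The normalization shown at the end of the search hunk (wrapping a bare dataset name or UUID into a list) can be exercised on its own. The sketch below mirrors the `isinstance` check from the diff; the surrounding function `normalize_datasets` is a hypothetical wrapper added here for illustration, not part of cognee's API.

```python
from typing import List, Optional, Union
from uuid import UUID, uuid4


def normalize_datasets(
    datasets: Optional[Union[str, UUID, List[Union[str, UUID]]]],
) -> Optional[list]:
    """Wrap a single dataset identifier in a list, as the search code does.

    None passes through unchanged, meaning "search all accessible datasets".
    """
    # Same check as in the diff: a lone str or UUID becomes a one-item list
    if isinstance(datasets, UUID) or isinstance(datasets, str):
        datasets = [datasets]
    return datasets


print(normalize_datasets("research_papers"))          # ['research_papers']
print(normalize_datasets(None))                       # None
print(normalize_datasets(["docs", "reports"]))        # ['docs', 'reports']
```

Accepting `str | UUID | list | None` at the boundary and normalizing once keeps the rest of the pipeline working with a single shape (`list` or `None`).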