# PIPELINES
Cognee uses [tasks](https://github.com/topoteretes/cognee/blob/main/cognee/modules/pipelines/tasks/Task.py) grouped into pipelines that populate graph and vector stores. [These tasks](https://github.com/topoteretes/cognee/tree/main/cognee/tasks) analyze and enrich data, enhancing the quality of answers produced by Large Language Models (LLMs).
The tasks are managed and executed asynchronously using the `run_tasks` and `run_tasks_parallel` functions.
```python
pipeline = run_tasks(tasks, documents)

async for result in pipeline:
    print(result)
```
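For example, a pipeline can be assembled from plain async functions wrapped in `Task` objects. The sketch below is illustrative only: the two task functions are invented for the example, and the import paths are assumptions based on the repository layout that may differ between cognee versions.
```python
import asyncio

# Assumed import paths; check your cognee version for the exact locations.
from cognee.modules.pipelines.tasks.Task import Task
from cognee.modules.pipelines.operations.run_tasks import run_tasks

# Hypothetical tasks: each task's output becomes the next task's input.
async def split_into_chunks(documents: list[str], chunk_size: int = 50):
    return [doc[i:i + chunk_size] for doc in documents for i in range(0, len(doc), chunk_size)]

async def count_chunks(chunks: list[str]):
    return len(chunks)

async def main():
    documents = ["A long document about knowledge graphs...", "Another document..."]
    tasks = [Task(split_into_chunks, chunk_size=50), Task(count_chunks)]

    # run_tasks returns an async generator that yields results as the tasks run.
    pipeline = run_tasks(tasks, documents)
    async for result in pipeline:
        print(result)

asyncio.run(main())
```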
## Main pipeline: [cognee.cognify](https://github.com/topoteretes/cognee/blob/168cb5d1bf1964b5b0c645b2f3d8638d84554fda/cognee/api/v1/cognify/cognify_v2.py#L38)
This is the main pipeline currently implemented in cognee. It processes data in a structured way and populates the graph and vector stores.
The `cognify` function is the entry point for processing datasets: it handles dataset retrieval and user authorization, and manages the execution of the pipeline of tasks that processes the documents.
### Parameters
- `datasets: Union[str, list[str]] = None`: A string or list of dataset names to be processed.
- `user: User = None`: The user requesting the processing. If not provided, the default user is retrieved.
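A minimal usage sketch (the dataset name is invented for illustration, and `cognee.add` is shown only to indicate that data has to be ingested before `cognify` runs):
```python
import asyncio
import cognee

async def main():
    # Ingest a document into a dataset; "my_dataset" is an illustrative name.
    await cognee.add("Natural language processing lets computers interpret text.", "my_dataset")

    # Run the cognify pipeline; omitting `user` falls back to the default user.
    await cognee.cognify(datasets=["my_dataset"])

asyncio.run(main())
```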
### Steps in the Function
#### User Authentication
```python
if user is None:
    user = await get_default_user()
```
If no user is provided, the function retrieves the default user.
#### Handling Empty or String Dataset Input
```python
existing_datasets = await get_datasets(user.id)

if datasets is None or len(datasets) == 0:
    datasets = existing_datasets

if type(datasets[0]) == str:
    datasets = await get_datasets_by_name(datasets, user.id)
```
If no datasets are provided, the function retrieves all datasets owned by the user. If a list of dataset names (strings) is provided, they are converted into dataset objects.
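In practice, both call shapes below are accepted inside an async context (the dataset names are invented for illustration):
```python
# A single name or a list of names; strings are resolved to dataset objects.
await cognee.cognify("research_papers")
await cognee.cognify(["research_papers", "meeting_notes"])
```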
#### Selecting datasets from the input list that are owned by the user
```python
existing_datasets_map = {
    generate_dataset_name(dataset.name): True for dataset in existing_datasets
}
```
This map records which dataset names the user actually owns; the next step uses it to skip any requested dataset that is not in the map.
#### Run Cognify Pipeline for Each Dataset
```python
awaitables = []

for dataset in datasets:
    dataset_name = generate_dataset_name(dataset.name)

    if dataset_name in existing_datasets_map:
        awaitables.append(run_cognify_pipeline(dataset, user))

return await asyncio.gather(*awaitables)
```
The `run_cognify_pipeline` function is defined within `cognify` and is responsible for processing a single dataset; this is where most of the heavy lifting occurs. The datasets are processed concurrently using `asyncio.gather`.
#### Pipeline Tasks
The pipeline consists of several tasks, each responsible for a different part of the processing (a sketch assembling them follows the list):
- `classify_documents`: Converts each document into one of the specific `Document` types: `PdfDocument`, `AudioDocument`, `ImageDocument`, or `TextDocument`.
- `check_permissions_on_documents`: Checks whether the user has the necessary permissions to access the documents; in this case, it checks for "write" permission.
- `extract_chunks_from_documents`: Extracts text chunks based on the document type.
- `add_data_points`: Creates nodes and edges from the chunks and their properties, and adds them to the graph engine.
- `extract_graph_from_data`: Generates knowledge graphs from the document chunks.
- `summarize_text`: Extracts a summary for each chunk using an LLM.
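
Putting these steps together, the body of `run_cognify_pipeline` amounts roughly to the sketch below. It is not the exact implementation: the document-loading helper and the task arguments are assumptions; only the task order follows the list above.
```python
# Rough sketch of run_cognify_pipeline for a single dataset. The
# get_dataset_documents helper and the task arguments are hypothetical.
async def run_cognify_pipeline(dataset, user):
    documents = await get_dataset_documents(dataset.id)  # hypothetical helper

    tasks = [
        Task(classify_documents),
        Task(check_permissions_on_documents, user=user, permissions=["write"]),
        Task(extract_chunks_from_documents),
        Task(add_data_points),
        Task(extract_graph_from_data),
        Task(summarize_text),
    ]

    # Drain the async generator so every task runs over the documents.
    pipeline = run_tasks(tasks, documents)
    async for result in pipeline:
        print(result)
```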