cognee/docs/conceptual_overview.md

# Conceptual Overview - cognee

## Introduction

!!! info "What is cognee?"
    cognee is a framework for data processing that enables LLMs to produce deterministic and traceable outputs.

cognee focuses on creating tools that assist developers in introducing greater predictability and management into their Retrieval-Augmented Generation (RAG) workflows through the use of graph architectures, vector stores and auto-optimizing pipelines.

Displaying this information as a graph is the clearest way to grasp the content of your documents. Crucially, a graph lets you systematically navigate and extract data from documents based on their organization, an idea often termed 'document hierarchies'.

## Core Concepts

### Concept 1: Data Pipelines

Most of the data we feed into a system is unstructured, semi-structured, or structured. Rows from a database are structured data, JSON documents are semi-structured, and logs are typically unstructured. To organize and process this data, we need custom loaders for each data type, plus a way to unify the loaded data and organize it well together.

![Data Pipelines](img/pipelines.png)
Data Pipeline Example

In the example above, we have a data pipeline that imports the data from various sources, normalizes it, and stores it in a database. It also creates relevant identifiers and relationships between the data.
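
Such a pipeline can be sketched in a few lines. Everything below (the loader functions, the record shape, the in-memory `store`) is a hypothetical illustration, not cognee's actual internals:

```python
import csv
import io
import json
import uuid

def load_structured(csv_text):
    """Loader for structured data: CSV rows become dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

def load_semi_structured(json_text):
    """Loader for semi-structured data: a JSON document."""
    return [json.loads(json_text)]

def load_unstructured(log_text):
    """Loader for unstructured data: raw log lines wrapped in dicts."""
    return [{"text": line} for line in log_text.splitlines() if line.strip()]

def normalize(records, source):
    """Unify every record: attach a stable identifier and its source."""
    return [
        {"id": str(uuid.uuid4()), "source": source, "payload": record}
        for record in records
    ]

store = []  # stand-in for a database table
store += normalize(load_structured("name,age\nAda,36"), "users.csv")
store += normalize(load_semi_structured('{"event": "login"}'), "events.json")
store += normalize(load_unstructured("ERROR disk full\n"), "app.log")

for row in store:
    print(row["source"], row["payload"])
```

The point of the `normalize` step is that downstream stages (enrichment, graph building) only ever see one record shape, regardless of where the data came from.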

### Concept 2: Data Enrichment with LLMs

LLMs are adept at processing unstructured data. We can easily extract summaries, keywords, and other useful information from documents.

![Data Enrichment](img/enrichment.png)
Data Enrichment Example

We decompose content into graphs, allowing us to more precisely map out the relationships between entities and concepts.
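
In practice an LLM emits the extracted relationships; in the sketch below the triples are hand-written so the example stays self-contained, and a plain adjacency dict stands in for the networkx or Neo4j graph cognee actually targets:

```python
from collections import defaultdict

# Triples an LLM extraction step might return for a short document.
# (In cognee this extraction is prompted against the source text.)
triples = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "field", "radioactivity"),
    ("Pierre Curie", "spouse_of", "Marie Curie"),
]

# Build an adjacency-list graph: node -> list of (relation, neighbor).
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

def neighbors(node):
    """Entities directly connected to a node, with the naming relation."""
    return graph.get(node, [])

print(neighbors("Marie Curie"))
# [('won', 'Nobel Prize in Physics'), ('field', 'radioactivity')]
```

Once facts live as explicit edges like these, "what is X connected to?" becomes a lookup rather than a fresh LLM inference.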

### Concept 3: Linguistic Analysis

LLMs are probabilistic models, meaning they can make mistakes. To mitigate this, we can use a combination of NLP and LLMs to determine how to analyze the data and score each part of the text.

![Linguistic analysis](img/linguistic_analysis.png)
Linguistic analysis
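
As a toy illustration of scoring text before handing it to an LLM, the snippet below uses a crude keyword-density heuristic as a stand-in for a real NLP stack:

```python
import re

def score_sentence(sentence, keywords):
    """Score a sentence by the fraction of its words that are keywords.

    A real pipeline would use POS tagging, NER, and similar NLP tools;
    this heuristic only illustrates ranking text before LLM processing.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)

keywords = {"graph", "pipeline", "retrieval"}
sentences = [
    "The pipeline builds a graph for retrieval.",
    "It was a sunny day.",
]
ranked = sorted(sentences, key=lambda s: score_sentence(s, keywords), reverse=True)
print(ranked[0])  # the keyword-dense sentence ranks first
```

The scores give the system a cheap, deterministic signal for deciding which parts of a text deserve (expensive, probabilistic) LLM attention.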
### Concept 4: Graphs

Knowledge graphs simply map out knowledge, linking specific facts and their connections. When Large Language Models (LLMs) process text, they infer these links, leading to occasional inaccuracies due to their probabilistic nature.

Clearly defined relationships enhance their accuracy.

This structured approach can extend beyond concepts to document layouts, pages, or other organizational schemas.

![Graph structure](img/graph_structure.png)
Graph Structure
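
The same idea applied to document structure: a small hierarchy of document → section → chunk nodes. The node names and edge list below are purely illustrative, not cognee's actual schema:

```python
# Edges of a small document hierarchy: (parent, child).
edges = [
    ("report.pdf", "1. Introduction"),
    ("report.pdf", "2. Methods"),
    ("1. Introduction", "chunk-001"),
    ("2. Methods", "chunk-002"),
    ("2. Methods", "chunk-003"),
]

def children(node):
    """Direct children of a node in the hierarchy."""
    return [c for p, c in edges if p == node]

def descendants(node):
    """All sections and chunks reachable under a node, depth-first."""
    out = []
    for child in children(node):
        out.append(child)
        out.extend(descendants(child))
    return out

# Navigating the hierarchy narrows retrieval to a known sub-tree.
print(descendants("2. Methods"))  # ['chunk-002', 'chunk-003']
```

Retrieval can then be scoped to the sub-tree that matches the question, instead of searching every chunk of every document.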
### Concept 5: Vector and Graph Retrieval

cognee lets you combine multiple vector and graph retrieval methods to find the most relevant information.

!!! info "Learn more?"
    Check out the learning materials to see how you can use these methods in your projects.

### Concept 6: Auto-Optimizing Pipelines

Integrating knowledge graphs into Retrieval-Augmented Generation (RAG) pipelines leads to an intriguing outcome: the system's adeptness at contextual understanding allows it to be evaluated in a way Machine Learning (ML) engineers are accustomed to.

This involves bombarding the RAG system with hundreds of synthetic questions, enabling the knowledge graph to evolve and refine its context autonomously over time.

This method paves the way for developing self-improving memory engines that can adapt to new data and user feedback.
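
One way to picture that evaluation loop, with a stub retriever standing in for the real graph (every name and the word-overlap retriever here are hypothetical simplifications):

```python
def retrieve(question, graph):
    """Stub retriever: return facts whose words overlap the question's."""
    q_words = set(question.lower().split())
    return [fact for fact in graph if q_words & set(fact.lower().split())]

def evaluate(graph, synthetic_qa):
    """Fraction of synthetic questions whose expected fact is retrieved."""
    hits = 0
    for question, expected_fact in synthetic_qa:
        if expected_fact in retrieve(question, graph):
            hits += 1
    return hits / len(synthetic_qa)

graph = ["cognee builds knowledge graphs", "pipelines ingest documents"]
synthetic_qa = [
    ("what builds knowledge graphs?", "cognee builds knowledge graphs"),
    ("how are documents ingested?", "pipelines ingest documents"),
    ("who maintains the roadmap?", "the roadmap lives in GitHub"),
]

score = evaluate(graph, synthetic_qa)
print(f"coverage: {score:.2f}")
```

A low coverage score flags questions the graph cannot yet answer; feeding those gaps back into the ingestion step is what lets the context refine itself over time.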

## Architecture Overview

A high-level diagram of cognee's architecture, illustrating the main components and their interactions.

![Architecture](img/architecture.png)
Architecture

Main components:

  • Data Pipelines: Responsible for ingesting, processing, and transforming data from various sources.
  • LLMs: Large Language Models that process unstructured data and generate text.
  • Graphs: Knowledge graphs that represent relationships between entities and concepts.
  • Vector Stores: Databases that store vector representations of data for efficient retrieval.
  • dspy module: Pipelines that automatically adjust based on feedback and data changes.
  • Search wrapper: Retrieves relevant information from the knowledge graph and vector stores.

## How It Fits Into Your Projects

!!! info "How cognee fits into your projects"
    cognee is a self-contained library that simplifies the process of loading and structuring LLM context. It can be integrated into your data pipelines to enhance your AI applications.

By integrating cognee into your data pipelines, you can leverage LLMs, knowledge graphs, and vector retrieval to build accurate and explainable AI solutions.

Check out some case studies to see how cognee has been used in real-world applications.