Update README.md
parent e494ec6c9e, commit b49553ab1c
1 changed file: README.md (150 additions, 36 deletions)
@@ -1,6 +1,6 @@
# cognee

We build for developers who need a reliable, production-ready data layer for AI applications.

<p>
@@ -11,7 +11,7 @@ Deterministic LLMs Outputs for AI Engineers using graphs, LLMs and vector retrie

<p>
<i>Developer-friendly framework for creating a reliable data layer for AI applications using graph and vector stores.</i>
</p>

<p>
@@ -29,7 +29,9 @@ Deterministic LLMs Outputs for AI Engineers using graphs, LLMs and vector retrie
</a>
</p>

cognee implements scalable, modular data pipelines that allow for the creation of an LLM-enriched data layer using graph and vector stores.

Try it in a Google Colab <a href="https://colab.research.google.com/drive/1jayZ5JRwDaUGFvCw9UZySBG-iB9gpYfu?usp=sharing">notebook</a> or have a look at our <a href="https://topoteretes.github.io/cognee">documentation</a>.
@@ -74,6 +76,26 @@ or
import cognee
cognee.config.llm_api_key = "YOUR_OPENAI_API_KEY"
```
You can use different LLM providers; for more info, check out our <a href="https://topoteretes.github.io/cognee">documentation</a>.

In the next step, make sure to launch a Postgres instance. Here is an example from our docker-compose:
```
postgres:
  image: postgres:latest
  container_name: postgres
  environment:
    POSTGRES_USER: cognee
    POSTGRES_PASSWORD: cognee
    POSTGRES_DB: cognee_db
  volumes:
    - postgres_data:/var/lib/postgresql/data
  ports:
    - 5432:5432
  networks:
    - cognee-network
```
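With these compose values, clients connect using a standard Postgres connection URL. A minimal sketch of assembling one (the `localhost` host is an assumption for a local setup; this helper is illustrative, not part of cognee):

```python
# Build a Postgres connection URL from the compose values above.
# "localhost" is assumed because the compose file maps port 5432 locally.
user = "cognee"
password = "cognee"
database = "cognee_db"
host, port = "localhost", 5432

dsn = f"postgresql://{user}:{password}@{host}:{port}/{database}"
print(dsn)  # → postgresql://cognee:cognee@localhost:5432/cognee_db
```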

If you are using NetworkX, create an account on Graphistry to visualize results:
```
@@ -81,15 +103,20 @@ If you are using Networkx, create an account on Graphistry to visualize results:
cognee.config.set_graphistry_password = "YOUR_PASSWORD"
```

(Optional) To run the UI, run:
```
docker-compose up cognee
```
Then navigate to localhost:3000/wizard

You can also use Ollama or Anyscale as your LLM provider. For more info on local models, check our [docs](https://topoteretes.github.io/cognee).

### Run the default example

Make sure to launch the Postgres instance first. Navigate to the cognee folder and run:
```
docker compose up postgres
```

Run the default cognee pipeline:

```
import cognee

@@ -106,32 +133,128 @@ await search_results = cognee.search("SIMILARITY", {'query': 'Tell me about NLP'
print(search_results)

```
Add alternative data types:
```
cognee.add("file://{absolute_path_to_file}", dataset_name)
```
Or
```
cognee.add("data://{absolute_path_to_directory}", dataset_name)

# This is useful if you have a directory with files organized in subdirectories.
# You can target which directory to add by providing dataset_name.
# Example:
#          root
#         /    \
#    reports   bills
#    /     \
#  2024    2023
#
# cognee.add("data://{absolute_path_to_root}", "reports.2024")
# This will add just the directory 2024 under reports.
```
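The dotted `dataset_name` above maps onto the directory hierarchy. A minimal sketch of that convention (the helper name and its behavior are hypothetical illustrations, not cognee's API):

```python
# Hypothetical helper illustrating the dotted dataset_name convention
# described above; not part of the cognee API.
def dataset_to_subdirectory(root: str, dataset_name: str) -> str:
    """Map a dotted dataset name like "reports.2024" to a subdirectory path."""
    parts = dataset_name.split(".")
    return "/".join([root.rstrip("/")] + parts)

print(dataset_to_subdirectory("/data/root", "reports.2024"))  # → /data/root/reports/2024
```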
### Create your pipelines

The cognee framework consists of tasks that can be grouped into pipelines. Each task can be an independent piece of business logic that is tied to other tasks to form a pipeline.
Here is an example of how it looks for the default cognify pipeline:

1. To prepare the data for the pipeline run, we first need to add it to our metastore and normalize it.

Start with:
```
docker compose up postgres
```
And then run:
```
text = """Natural language processing (NLP) is an interdisciplinary
subfield of computer science and information retrieval"""

await cognee.add([text], "example_dataset") # Add a new piece of information
```

Read more [here](docs/index.md#run).
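`cognee.add` is a coroutine, so outside an async context (like a notebook cell) it has to be driven with `asyncio.run`. A minimal sketch of that pattern with a stand-in coroutine (the `add` function below is a placeholder, not cognee's):

```python
import asyncio

# Stand-in coroutine illustrating the `await` pattern used above;
# it only mimics the shape of an async add call, not cognee's behavior.
async def add(data, dataset_name):
    return {"dataset": dataset_name, "items": len(data)}

async def main():
    return await add(["some text"], "example_dataset")

result = asyncio.run(main())
print(result)  # → {'dataset': 'example_dataset', 'items': 1}
```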
2. In the next step we create a task. The task can be any business logic we need; the important part is that it is encapsulated in one function.

Here we show an example of a naive LLM classifier that takes a Pydantic model and stores the data in both the graph and vector stores after analyzing each chunk.
We provide just a snippet for reference; feel free to check out the full implementation in our repo.

```
import asyncio
from typing import Type
from uuid import NAMESPACE_OID, uuid5

from pydantic import BaseModel

# DocumentChunk, extract_categories and get_vector_engine come from cognee's internals.

async def chunk_naive_llm_classifier(data_chunks: list[DocumentChunk], classification_model: Type[BaseModel]):
    if len(data_chunks) == 0:
        return data_chunks

    chunk_classifications = await asyncio.gather(
        *[extract_categories(chunk.text, classification_model) for chunk in data_chunks],
    )

    classification_data_points = []

    for chunk_index, chunk in enumerate(data_chunks):
        chunk_classification = chunk_classifications[chunk_index]
        classification_data_points.append(uuid5(NAMESPACE_OID, chunk_classification.label.type))

        for classification_subclass in chunk_classification.label.subclass:
            classification_data_points.append(uuid5(NAMESPACE_OID, classification_subclass.value))

    vector_engine = get_vector_engine()

    class Keyword(BaseModel):
        uuid: str
        text: str
        chunk_id: str
        document_id: str

    collection_name = "classification"

    if await vector_engine.has_collection(collection_name):
        existing_data_points = await vector_engine.retrieve(
            collection_name,
            list(set(classification_data_points)),
        ) if len(classification_data_points) > 0 else []

        existing_points_map = {point.id: True for point in existing_data_points}
    else:
        existing_points_map = {}
        await vector_engine.create_collection(collection_name, payload_schema=Keyword)

    data_points = []
    nodes = []
    edges = []

    for (chunk_index, data_chunk) in enumerate(data_chunks):
        chunk_classification = chunk_classifications[chunk_index]
        classification_type_label = chunk_classification.label.type
        classification_type_id = uuid5(NAMESPACE_OID, classification_type_label)

        ...

```
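The snippet keys classification entries with `uuid5(NAMESPACE_OID, label)`, so the same label always yields the same id and previously stored points can be detected across runs. A quick illustration of that determinism:

```python
from uuid import NAMESPACE_OID, uuid5

# uuid5 is a name-based, deterministic UUID: the same namespace and
# name always produce the same id, so repeated labels collapse together.
a = uuid5(NAMESPACE_OID, "computer science")
b = uuid5(NAMESPACE_OID, "computer science")
c = uuid5(NAMESPACE_OID, "information retrieval")

print(a == b)  # True: identical labels share one id
print(a == c)  # False: different labels get different ids
```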

To see existing tasks, have a look at the cognee.tasks module.

3. Once we have our tasks, it is time to group them into a pipeline.

This snippet shows how a group of tasks can be added to a pipeline, and how they pass information forward from one to another.

```
tasks = [
    Task(document_to_ontology, root_node_id = root_node_id),
    Task(source_documents_to_chunks, parent_node_id = root_node_id), # Classify documents and save them as nodes in the graph db, extract text chunks based on the document type
    Task(chunk_to_graph_decomposition, topology_model = KnowledgeGraph, task_config = { "batch_size": 10 }), # Set the graph topology for the document chunk data
    Task(chunks_into_graph, graph_model = KnowledgeGraph, collection_name = "entities"), # Generate knowledge graphs from the document chunks and attach them to chunk nodes
    Task(chunk_update_check, collection_name = "chunks"), # Find all affected chunks, so we don't process unchanged chunks
    Task(
        save_chunks_to_store,
        collection_name = "chunks",
    ), # Save the document chunks in the vector db and as nodes in the graph db (connected to the document node and between each other)
    run_tasks_parallel([
        Task(
            chunk_extract_summary,
            summarization_model = cognee_config.summarization_model,
            collection_name = "chunk_summaries",
        ), # Summarize the document chunks
        Task(
            chunk_naive_llm_classifier,
            classification_model = cognee_config.classification_model,
        ),
    ]),
    Task(chunk_remove_disconnected), # Remove the obsolete document chunks
]

pipeline = run_tasks(tasks, documents)

```
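The runner mechanics implied above can be sketched roughly as follows: each task's output becomes the next task's input. This is a simplified, synchronous, hypothetical model; the real cognee `Task`/`run_tasks` implementation is async and more featureful:

```python
# Simplified, hypothetical sketch of the tasks-into-pipeline idea above;
# not the actual cognee implementation.
class Task:
    def __init__(self, fn, **params):
        self.fn = fn          # the business-logic function this task wraps
        self.params = params  # keyword arguments bound at pipeline build time

    def run(self, data):
        return self.fn(data, **self.params)

def run_tasks(tasks, data):
    # Each task receives the previous task's output.
    for task in tasks:
        data = task.run(data)
    return data

# Usage: two toy tasks chained into a pipeline.
tasks = [
    Task(lambda chunks, sep: sep.join(chunks), sep=" "),
    Task(lambda text, prefix: prefix + text, prefix="summary: "),
]
print(run_tasks(tasks, ["graph", "vector"]))  # → summary: graph vector
```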

To see the working code, check the cognee.api.v1.cognify default pipeline in our repo.

## Vector retrieval, Graphs and LLMs

Cognee supports a variety of tools and services for different operations:

- **Modular**: Cognee is modular by nature, using tasks grouped into pipelines

- **Local Setup**: By default, LanceDB runs locally with NetworkX and OpenAI.

@@ -140,6 +263,8 @@ Cognee supports a variety of tools and services for different operations:
- **Language Models (LLMs)**: You can use either Anyscale or Ollama as your LLM provider.

- **Graph Stores**: In addition to NetworkX, Neo4j is also supported for graph storage.

- **User management**: Create individual user graphs and manage permissions

## Demo

@@ -151,17 +276,6 @@ Check out our demo notebook [here](https://github.com/topoteretes/cognee/blob/ma

## How it works

## Star History