Document | Roadmap | Twitter | Discord | Demo
📕 Table of Contents
💡 What is RAGFlow?
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.
🎮 Demo
Try our demo at https://demo.ragflow.io.
🔥 Latest Updates
- 2024-08-22 Supports text to SQL statements through RAG.
- 2024-08-02 Supports GraphRAG inspired by graphrag and mind map.
- 2024-07-23 Supports audio file parsing.
- 2024-07-21 Supports more LLMs (LocalAI, OpenRouter, StepFun, and Nvidia).
- 2024-07-18 Adds more components (Wikipedia, PubMed, Baidu, and Duckduckgo) to the graph.
- 2024-07-08 Supports workflow based on Graph.
- 2024-06-27 Supports Markdown and Docx in the Q&A parsing method.
- 2024-06-27 Supports extracting images from Docx files.
- 2024-06-27 Supports extracting tables from Markdown files.
- 2024-06-06 Supports Self-RAG, which is enabled by default in dialog settings.
- 2024-05-23 Supports RAPTOR for better text retrieval.
- 2024-05-15 Integrates OpenAI GPT-4o.
🌟 Key Features
🍭 "Quality in, quality out"
- Deep document understanding-based knowledge extraction from unstructured data with complicated formats.
- Finds "needle in a data haystack" of literally unlimited tokens.
🍱 Template-based chunking
- Intelligent and explainable.
- Plenty of template options to choose from.
🌱 Grounded citations with reduced hallucinations
- Visualization of text chunking to allow human intervention.
- Quick view of the key references and traceable citations to support grounded answers.
🍔 Compatibility with heterogeneous data sources
- Supports Word, slides, Excel, txt, images, scanned copies, structured data, web pages, and more.
🛀 Automated and effortless RAG workflow
- Streamlined RAG orchestration catered to both personal and large businesses.
- Configurable LLMs as well as embedding models.
- Multiple recall paired with fused re-ranking.
- Intuitive APIs for seamless integration with business.
🔎 System Architecture
🎬 Get Started
📝 Prerequisites
- CPU >= 4 cores
- RAM >= 16 GB
- Disk >= 50 GB
- Docker >= 24.0.0 & Docker Compose >= v2.26.1
If you have not installed Docker on your local machine (Windows, Mac, or Linux), see Install Docker Engine.
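To sanity-check the prerequisites above on a Linux host, here is a minimal sketch; the `ver_ge` helper is ours (not part of RAGFlow) and relies on GNU `sort -V`, and the tool names assume a typical Linux environment:

```shell
# ver_ge: returns 0 (success) when dotted version $1 >= $2 (relies on GNU `sort -V`).
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report the resources named in the prerequisites above:
echo "CPU cores: $(nproc 2>/dev/null || echo '?')"                    # want >= 4
free -g 2>/dev/null | awk '/^Mem:/ {print "RAM (GB): " $2}'           # want >= 16

# Compare the installed Docker version against the 24.0.0 minimum:
if command -v docker >/dev/null 2>&1; then
  dv=$(docker --version | sed -E 's/[^0-9]*([0-9]+\.[0-9]+\.[0-9]+).*/\1/')
  ver_ge "$dv" 24.0.0 && echo "Docker $dv OK" || echo "Docker $dv is older than 24.0.0"
else
  echo "Docker not found"
fi
```

The same `ver_ge` check works for the Docker Compose minimum (`docker compose version`, want >= v2.26.1).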
🚀 Start up the server
- Ensure `vm.max_map_count` >= 262144:

  To check the value of `vm.max_map_count`:

  ```bash
  $ sysctl vm.max_map_count
  ```

  Reset `vm.max_map_count` to a value of at least 262144 if it is not:

  ```bash
  # In this case, we set it to 262144:
  $ sudo sysctl -w vm.max_map_count=262144
  ```

  This change will be reset after a system reboot. To ensure your change remains permanent, add or update the `vm.max_map_count` value in /etc/sysctl.conf accordingly:

  ```bash
  vm.max_map_count=262144
  ```
- Clone the repo:

  ```bash
  $ git clone https://github.com/infiniflow/ragflow.git
  ```
- Start up the server with the pre-built Docker images:

  Running the following commands automatically downloads the dev version of the RAGFlow Docker image. To download and run a specified version instead, update `RAGFLOW_VERSION` in docker/.env to the intended version, for example `RAGFLOW_VERSION=v0.10.0`, before running the following commands.

  ```bash
  $ cd ragflow/docker
  $ chmod +x ./entrypoint.sh
  $ docker compose up -d
  ```

  The core image is about 9 GB in size and may take a while to load.
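If you script your deployments, the `RAGFLOW_VERSION` edit described above can be automated. A small sketch; `pin_version` is a hypothetical helper of ours, and v0.10.0 is simply the example tag used in this section:

```shell
# pin_version: rewrite the RAGFLOW_VERSION line in a .env-style file
# (keeps a .bak backup of the original).
pin_version() {
  # $1 = path to the .env file, $2 = desired image tag
  sed -i.bak "s|^RAGFLOW_VERSION=.*|RAGFLOW_VERSION=$2|" "$1"
}

# Usage, from the repo root:
# pin_version docker/.env v0.10.0
```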
- Check the server status after having the server up and running:

  ```bash
  $ docker logs -f ragflow-server
  ```

  The following output confirms a successful launch of the system:

  ```
      ____                 ______ __
     / __ \ ____ _ ____ _ / ____// /____  _      __
    / /_/ // __ `// __ `// /_   / // __ \| | /| / /
   / _, _// /_/ // /_/ // __/  / // /_/ /| |/ |/ /
  /_/ |_| \__,_/ \__, //_/    /_/ \____/ |__/|__/
                /____/

   * Running on all addresses (0.0.0.0)
   * Running on http://127.0.0.1:9380
   * Running on http://x.x.x.x:9380
   INFO:werkzeug:Press CTRL+C to quit
  ```

  If you skip this confirmation step and directly log in to RAGFlow, your browser may prompt a `network anomaly` error because, at that moment, your RAGFlow may not be fully initialized.
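If you prefer scripting over watching logs, you can poll the HTTP port (9380, per the log output above) until the server answers. `wait_for` below is our own generic retry helper, not part of RAGFlow:

```shell
# wait_for: run a command up to $1 times, sleeping 1s between attempts;
# returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage: block for up to 60 seconds until the RAGFlow HTTP server responds:
# wait_for 60 curl -fsS http://127.0.0.1:9380
```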
- In your web browser, enter the IP address of your server and log in to RAGFlow.

  With the default settings, you only need to enter `http://IP_OF_YOUR_MACHINE` (sans port number), as the default HTTP serving port `80` can be omitted when using the default configurations.
- In service_conf.yaml, select the desired LLM factory in `user_default_llm` and update the `API_KEY` field with the corresponding API key.

  See llm_api_key_setup for more information.
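For orientation only, a sketch of what the relevant section of service_conf.yaml may look like. This README only names `user_default_llm` and the `API_KEY` field, so the exact key names, casing, and factory values below are illustrative and should be checked against your copy of the file:

```yaml
user_default_llm:
  factory: 'OpenAI'        # hypothetical example; use the LLM factory you selected
  api_key: 'sk-xxxxxxxx'   # the API key for that factory (this README calls the field API_KEY)
```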
The show is now on!
🔧 Configurations
When it comes to system configurations, you will need to manage the following files:
- .env: Keeps the fundamental setups for the system, such as `SVR_HTTP_PORT`, `MYSQL_PASSWORD`, and `MINIO_PASSWORD`.
- service_conf.yaml: Configures the back-end services.
- docker-compose.yml: The system relies on docker-compose.yml to start up.
You must ensure that changes to the .env file are in line with those in the service_conf.yaml file.

The ./docker/README file provides a detailed description of the environment settings and service configurations. You are REQUIRED to ensure that all environment settings listed in ./docker/README are aligned with the corresponding configurations in service_conf.yaml.
To update the default HTTP serving port (80), go to docker-compose.yml and change 80:80 to <YOUR_SERVING_PORT>:80.
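For example, to serve on host port 8080 instead, the mapping in docker-compose.yml would look roughly like this (only the host side of the mapping changes; the container side stays 80, and your compose file may list additional ports):

```yaml
ports:
  - "8080:80"   # was "80:80"; host port 8080 now forwards to the container's port 80
```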
Updates to all system configurations require a system reboot to take effect:

```bash
$ docker-compose up -d
```
🛠️ Build from source
To build the Docker images from source:
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:dev .
$ cd docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
🛠️ Launch service from source
To launch the service from source:
- Clone the repository:

  ```bash
  $ git clone https://github.com/infiniflow/ragflow.git
  $ cd ragflow/
  ```
- Create a virtual environment, ensuring that Anaconda or Miniconda is installed:

  ```bash
  $ conda create -n ragflow python=3.11.0
  $ conda activate ragflow
  $ pip install -r requirements.txt
  ```

  ```bash
  # If your CUDA version is higher than 12.0, run the following additional commands:
  $ pip uninstall -y onnxruntime-gpu
  $ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
  ```
- Copy the entry script and configure environment variables:

  ```bash
  # Get the Python path:
  $ which python
  # Get the ragflow project path:
  $ pwd
  ```

  ```bash
  $ cp docker/entrypoint.sh .
  $ vi entrypoint.sh
  ```

  ```bash
  # Adjust configurations according to your actual situation
  # (the following two export commands are newly added):
  # - Assign the result of `which python` to `PY`.
  # - Assign the result of `pwd` to `PYTHONPATH`.
  # - Comment out `LD_LIBRARY_PATH`, if it is configured.
  # - Optional: Add a Hugging Face mirror.
  PY=${PY}
  export PYTHONPATH=${PYTHONPATH}
  export HF_ENDPOINT=https://hf-mirror.com
  ```
- Launch the third-party services (MinIO, Elasticsearch, Redis, and MySQL):

  ```bash
  $ cd docker
  $ docker compose -f docker-compose-base.yml up -d
  ```
- Check the configuration files, ensuring that:
  - The settings in docker/.env match those in conf/service_conf.yaml.
  - The IP addresses and ports for related services in service_conf.yaml match the local machine IP and the ports exposed by the container.
- Launch the RAGFlow backend service:

  ```bash
  $ chmod +x ./entrypoint.sh
  $ bash ./entrypoint.sh
  ```
- Launch the frontend service:

  ```bash
  $ cd web
  $ npm install --registry=https://registry.npmmirror.com --force
  $ vim .umirc.ts  # Update proxy.target to http://127.0.0.1:9380
  $ npm run dev
  ```
- Deploy the frontend service:

  ```bash
  $ cd web
  $ npm install --registry=https://registry.npmmirror.com --force
  $ umi build
  $ mkdir -p /ragflow/web
  $ cp -r dist /ragflow/web
  $ apt install nginx -y
  $ cp ../docker/nginx/proxy.conf /etc/nginx
  $ cp ../docker/nginx/nginx.conf /etc/nginx
  $ cp ../docker/nginx/ragflow.conf /etc/nginx/conf.d
  $ systemctl start nginx
  ```
📚 Documentation
📜 Roadmap
See the RAGFlow Roadmap 2024.
🏄 Community
🙌 Contributing
RAGFlow flourishes via open-source collaboration. In this spirit, we embrace diverse contributions from the community. If you would like to be a part, review our Contribution Guidelines first.