ragflow/deepdoc/parser
gsmini 53c653b099
fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859)
i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl

from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")

        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)
   
        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)

        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")


        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()  
        self._filter_forpages()   
        callback(0.68, "Text merging finished")

        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t  ]|\u3000){2,}", " ", b["text"].strip())

        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls


```

show err like this:
```
  File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```

i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)

and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```


---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```

but `pypdf.PdfReader` source code do not has `close` func, source code
use like this

```
 with open(stream, "rb") as fh:
         stream = BytesIO(fh.read())
          self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156

so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
2025-04-14 09:40:13 +08:00
..
resume Fix:when start with source code not in docker env report 'UnicodeDec… (#5802) 2025-03-10 11:22:06 +08:00
__init__.py Update comments (#4569) 2025-01-21 20:52:28 +08:00
docx_parser.py Update comments (#4569) 2025-01-21 20:52:28 +08:00
excel_parser.py Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613) 2025-03-28 09:33:49 +08:00
figure_parser.py Feat: add VLM-boosted DocX parser (#6307) 2025-03-20 11:24:44 +08:00
html_parser.py Update comments (#4569) 2025-01-21 20:52:28 +08:00
json_parser.py Update comments (#4569) 2025-01-21 20:52:28 +08:00
markdown_parser.py Feat:Optimize the table extraction logic in the Markdown parser: (#5663) 2025-03-07 17:02:35 +08:00
pdf_parser.py fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859) 2025-04-14 09:40:13 +08:00
ppt_parser.py Feat: support pic base bullet for PPT (#6406) 2025-03-24 09:31:31 +08:00
txt_parser.py Fix: delimiter issue. (#5720) 2025-03-06 17:51:22 +08:00
utils.py Update comments (#4569) 2025-01-21 20:52:28 +08:00