RAG Document Parsers: A Look at the Core Techniques

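RAG has been gaining popularity lately, yet document parsing, one of its key steps, gets little attention. In the end, no matter how advanced the retrieval and generation techniques are, the final result depends on the quality of the documents themselves. If the parsed document is missing information or its formatting is scrambled, no amount of tuning the retrieval strategy, the embedding model, or the large language model (LLM) will make up for it.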

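This article introduces three popular document extraction strategies and shows how each one works in practice, using a table from Amazon's Q1 2024 earnings release as the example.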

1 Text parsers: the basic tools

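Text parsers have been around for years. These tools read a document and extract the text from it. Common choices include PyPDF, PyMuPDF, and PDFMiner. This section focuses on PyMuPDF and uses it together with LlamaIndex to parse a specific page. The code looks like this: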

from llama_index.core.schema import TextNode
from llama_index.core.node_parser import SentenceSplitter
import fitz  # PyMuPDF
file_path = "/content/AMZN-Q1-2024-Earnings-Release.pdf"
doc = fitz.open(file_path)
text_parser = SentenceSplitter(chunk_size=2048)
# Extract the plain text of every page and split it into chunks
text_chunks = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
# Wrap each chunk in a LlamaIndex TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    nodes.append(node)
# Print the chunk that contains the table used in this article
print(nodes[10].text)

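PyMuPDF does a good job of extracting the text, but it does not preserve the formatting well. That can cause problems in the later generation step, especially when the LLM cannot work out the structure of the document.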

Here is the text it extracted from Amazon's financial statement:

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)
(unaudited)
Three Months Ended
March 31,
20232024
Net income$3,172 $10,431
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30386(1,096)
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158)95536
Less: reclassification adjustment for losses (gains) included in “Other income(expense), net,” net of tax of $(10) and $0331
Net change128537
Other, net of tax of $0 and $(1)—1
Total other comprehensive income (loss)514(558)
Comprehensive income$3,686 $9,873

Next, let's see how OCR performs at document parsing.

2 OCR: image-based recognition

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
# Render every PDF page as an image (file_path is defined in the previous example)
pages = convert_from_path(file_path)
i = 10
filename = "page" + str(i) + ".jpg"
pages[i].save(filename, "JPEG")
outfile = "page" + str(i) + "_text.txt"
# Run Tesseract on the page image and rejoin words hyphenated across lines
text = str(pytesseract.image_to_string(Image.open(filename)))
text = text.replace("-\n", "")
with open(outfile, "a") as f:
    f.write(text)
print(text)

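OCR captures the text and structure of the document better, as the output below shows, although recognition errors still slip in, such as the stray "§" before the 2024 net income figure.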

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)
(unaudited)
Three Months Ended
March 31,
2023 2024
Net income $ 3,172 §$ 10,431
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30 386 (1,096)
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158) 95 536
Less: reclassification adjustment for losses (gains) included in “Other income(expense), net,” net of tax of $(10) and $0 33 1
Net change 128 231
Other, net of tax of $0 and $(1) _— 1
Total other comprehensive income (loss) 514 (558)
Comprehensive income $ 3,686 $ 9,873
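The OCR output reports a 2024 "Net change" of 231, while the two lines above it read 536 and 1. A simple arithmetic check can flag this kind of misread digit. The following is a minimal sketch, not part of the original pipeline: the helper names `parse_amount` and `check_subtotal` are hypothetical, and it assumes that "Net change" is the sum of the two preceding lines, which the 2023 column (95 + 33 = 128) is consistent with.

```python
import re


def parse_amount(token: str) -> int:
    """Convert an accounting-style token such as '(1,096)' or '$3,172' to an int.

    Parentheses around the number are treated as a negative sign.
    """
    token = token.strip()
    negative = token.startswith("(") and token.endswith(")")
    digits = re.sub(r"[^\d]", "", token)
    if not digits:
        raise ValueError(f"no digits in {token!r}")
    value = int(digits)
    return -value if negative else value


def check_subtotal(components: list[str], subtotal: str) -> bool:
    """Return True when the components add up to the reported subtotal."""
    return sum(parse_amount(c) for c in components) == parse_amount(subtotal)


# 2023 column as read by OCR: 95 + 33 should equal 128
ocr_2023_ok = check_subtotal(["95", "33"], "128")
# 2024 column as read by OCR: 536 + 1 does not equal 231
ocr_2024_ok = check_subtotal(["536", "1"], "231")

print(ocr_2023_ok, ocr_2024_ok)  # prints: True False
```

A failed check does not say which figure is wrong, only that the row deserves a second look or a second parser.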

Finally, let's look at intelligent document parsing.

3 Intelligent document parsing (IDP): structured extraction

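Intelligent document parsing (IDP) is an emerging technology that aims to extract all relevant information from a document and present it in a structured format. Several IDP tools are available, including LlamaParse, DocSumo, Unstructured.io, and Azure Doc Intelligence.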

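What these tools have in common is that they combine OCR (optical character recognition), text extraction, multimodal large language models (LLMs), and conversion to markdown in order to extract text effectively. Take LlamaParse from LlamaIndex as an example: you first obtain an API key, and then you can parse documents through the API.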

import getpass
import os
from copy import deepcopy
from llama_index.core.schema import TextNode
# Prompt for the LlamaCloud API key instead of hard-coding it
os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass()
from llama_parse import LlamaParse
import nest_asyncio
# Allow nested event loops, which notebooks need for the async client
nest_asyncio.apply()
# Parse the PDF into markdown (file_path is defined in the first example)
documents = LlamaParse(result_type="markdown").load_data(file_path)
def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page node, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)
    return nodes
nodes_lp = get_page_nodes(documents)
print(nodes_lp[10].text)

The output below is structured as markdown and is probably the best-structured representation of the three.

# Amazon.com, Inc.
# Consolidated Statements of Comprehensive Income
| |Three Months Ended March 31, 2023|Three Months Ended March 31, 2024|
|---|---|---|
|Net income|$3,172|$10,431|
|Other comprehensive income (loss):| | |
|Foreign currency translation adjustments, net of tax of $(10) and $30|386|(1,096)|
|Available-for-sale debt securities:| | |
|Change in net unrealized gains (losses), net of tax of $(29) and $(158)|95|536|
|Less: reclassification adjustment for losses (gains) included in “Other income (expense), net,” net of tax of $(10) and $0|33|1|
|Net change|128|537|
|Other, net of tax of $0 and $(1)|—|1|
|Total other comprehensive income (loss)|514|(558)|
|Comprehensive income|$3,686|$9,873|
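One practical benefit of markdown output is that the table can be turned back into records with very little code. The following is a rough sketch using only the standard library; the function name `parse_markdown_table` and the fallback column name `label` are assumptions made for illustration, and the parser does not handle escaped pipe characters inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Turn a simple pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Only lines that start and end with a pipe are treated as table rows
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the header separator row, e.g. |---|---|---|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # The first header cell is empty in this table, so give it a name
    keys = [name or "label" for name in header]
    return [dict(zip(keys, row)) for row in body]


sample = """\
| |Three Months Ended March 31, 2023|Three Months Ended March 31, 2024|
|---|---|---|
|Net income|$3,172|$10,431|
|Comprehensive income|$3,686|$9,873|
"""

records = parse_markdown_table(sample)
print(records[0]["Three Months Ended March 31, 2024"])  # prints: $10,431
```

The flattened output of the text parser offers no such column boundaries, which is why the same lookup is much harder there.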

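There is one caveat, though: this output drops some key context. In particular, the parsed document no longer contains the "(in millions)" unit label, which may lead the generator LLM to misread the figures.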
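One possible way to compensate is to recover the unit from the raw text of the same page and put it back in front of the markdown chunk. The following is a minimal sketch under that assumption; `find_unit` and `add_unit_note` are hypothetical helpers, not part of LlamaParse or LlamaIndex, and the pattern only covers "(in thousands)", "(in millions)", and "(in billions)" style labels.

```python
import re
from typing import Optional

# Matches unit labels such as "(in millions)" or "(in millions, except per share data)"
UNIT_PATTERN = re.compile(
    r"\(\s*in\s+(thousands|millions|billions)[^)]*\)", re.IGNORECASE
)


def find_unit(raw_page_text: str) -> Optional[str]:
    """Return the unit word found in the raw page text, or None."""
    match = UNIT_PATTERN.search(raw_page_text)
    return match.group(1).lower() if match else None


def add_unit_note(markdown_chunk: str, raw_page_text: str) -> str:
    """Prepend the unit label to the chunk when the chunk itself lacks one."""
    unit = find_unit(raw_page_text)
    if unit is None or UNIT_PATTERN.search(markdown_chunk):
        return markdown_chunk
    return f"(in {unit})\n\n{markdown_chunk}"


raw_text = (
    "AMAZON.COM, INC.\n"
    "Consolidated Statements of Comprehensive Income\n"
    "(in millions)\n(unaudited)"
)
chunk = "|Net income|$3,172|$10,431|"

print(add_unit_note(chunk, raw_text))
```

The same unit could instead be stored in the node metadata; which option works better depends on how the chunks are passed to the generator.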

4 Conclusion

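To improve the performance of your RAG application, the key is choosing a suitable document parser. Each parsing strategy has its own strengths and limitations, as the examples above show: the text parser extracted the text but lost the table layout, OCR kept more of the layout but introduced recognition errors, and LlamaParse produced the cleanest structure but dropped the "(in millions)" unit.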

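In the end, which parser to choose depends on your specific use case. The best approach is to try several parsers, evaluate how each performs in your application, and pick the one that best meets your needs. Sometimes combining several methods works better. Keep experimenting and adjusting to get the best results from your RAG application.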
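A comparison like this can start very small. The following is a rough sketch, with hypothetical names `coverage` and `rank_parsers`, that scores each parser by how many expected strings appear in its output. The sample texts are abbreviated excerpts of the three outputs shown earlier, and the expected strings are chosen only for illustration.

```python
def coverage(parsed_text: str, expected: list[str]) -> float:
    """Fraction of expected strings that occur in the parsed text."""
    if not expected:
        return 0.0
    hits = sum(1 for item in expected if item in parsed_text)
    return hits / len(expected)


def rank_parsers(
    outputs: dict[str, str], expected: list[str]
) -> list[tuple[str, float]]:
    """Score every parser output and sort from best to worst coverage."""
    scores = [(name, coverage(text, expected)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Abbreviated excerpts of the outputs shown in sections 1 to 3
outputs = {
    "pymupdf": "(in millions)(unaudited) Net change128537",
    "ocr": "(in millions) (unaudited) Net change 128 231",
    "llamaparse": "|Net change|128|537|",
}
# Facts the parsed page should preserve: the unit and the 2024 net change
expected = ["in millions", "537"]

ranked = rank_parsers(outputs, expected)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Substring matching is crude: it rewards the text parser here even though its figures are fused together, so it measures whether content survived, not whether the structure did. A fuller evaluation would also run real questions through the whole RAG pipeline.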
