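Document Parsing with Unstructured

Installation and Setup

Python Requirements

Python 3.9 or later is recommended.

Choosing an Installation Method

Method	Command	When to use
Basic	`pip install unstructured`	Plain-text formats (Markdown, HTML, TXT)
Local inference	`pip install "unstructured[local-inference]"`	PDF and image OCR, with the full pipeline running locally
All document types	`pip install "unstructured[all-docs]"`	All supported formats; system dependencies such as Tesseract and Poppler are installed separately
Selective	`pip install "unstructured[pdf,docx]"`	Support for the listed formats only
Serverless	`pip install unstructured-client`	Hosted API; 3-second startup, high concurrency
Docker	`docker pull downloads.unstructured.io/...`	Containerized deployment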

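Core System Dependencies

Tesseract OCR: text recognition in images

# macOS
brew install tesseract tesseract-lang

# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim

# Verify the installation
tesseract --list-langs

Poppler: PDF content extraction

# macOS
brew install poppler

# Linux (Debian/Ubuntu)
sudo apt-get install poppler-utils

# Verify the installation
pdfinfo -v

Pandoc: rich-text format conversion (version 2.14.2 or later)

libmagic: file type detection (required on Linux/macOS, optional on Windows)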

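Core Concepts

The Five Functional Modules

Module	Purpose	Typical use
Partitioning	Breaks a document into structured elements	Identifying titles, paragraphs, and tables
Cleaning	Removes unwanted text	Stripping boilerplate and sentence fragments
Staging	Formats data for downstream consumers	ML inference, data labeling
Chunking	Splits a document into smaller pieces	RAG applications, similarity search
Embedding	Converts text to vectors	Storage in a vector database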
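
To make the chunking stage concrete, here is a minimal sketch of title-based chunking over already-partitioned elements. It is not the library's own chunking implementation: the elements are plain dictionaries standing in for real Element objects, and the function name `chunk_by_title` and the `max_chars` limit are assumptions chosen for illustration.

```python
from typing import Dict, List


def chunk_by_title(elements: List[Dict[str, str]], max_chars: int = 200) -> List[str]:
    """Group element texts into chunks.

    A new chunk starts at every Title element, or when adding the next
    text would push the current chunk past max_chars.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for el in elements:
        text = el["text"].strip()
        if not text:
            continue
        starts_section = el["category"] == "Title"
        too_long = size + len(text) > max_chars
        if current and (starts_section or too_long):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical sample data shaped like partition output (category + text).
sample = [
    {"category": "Title", "text": "Introduction"},
    {"category": "NarrativeText", "text": "Unstructured splits documents into elements."},
    {"category": "Title", "text": "Usage"},
    {"category": "NarrativeText", "text": "Call partition() and inspect the result."},
]
chunks = chunk_by_title(sample)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each title together with the text that follows it tends to give retrieval systems chunks that carry their own context.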

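Element-Based Parsing

Key advantages:

Structure preservation: keeps the document's logical structure and context
Fine-grained control: filter and process elements by type
Rich metadata: page numbers, coordinates, languages, file information

Unstructured does not merely "read" a PDF; it interprets the document and breaks it down into typed elements.

Basic usage:

from unstructured.partition.auto import partition

# Detect the file type automatically and parse the file
elements = partition(filename="document.pdf", strategy="auto")

# Inspect an element
print(elements[0].text)        # text content
print(elements[0].category)    # element type
print(elements[0].metadata)    # metadata object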

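Element Object Structure

Core fields:

text: the text content (for tables, plain text; an HTML rendering is stored in metadata.text_as_html when table structure is inferred)
category: the element type (Title, NarrativeText, Table, Image, and so on)
metadata: the metadata object

Element types (category):

Type	Description
`Title`	Document titles and subtitles
`NarrativeText`	Paragraphs of running prose
`Table`	Tabular data
`Text`	Text that fits no more specific type
`Image`	All images
`Formula`	Mathematical formulas (for example, y = Wx + b)
`Header` / `Footer`	Page headers and footers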
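
The sketch below shows how the category field can drive filtering and counting. It uses a small stand-in class, `FakeElement`, which is hypothetical and only mimics the `text` and `category` attributes described above; real code would apply the same expressions to the list returned by partition.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class FakeElement:
    """Hypothetical stand-in exposing only the two attributes used below."""
    text: str
    category: str


elements: List[FakeElement] = [
    FakeElement("Quarterly Report", "Title"),
    FakeElement("Confidential", "Header"),
    FakeElement("Revenue grew in the third quarter.", "NarrativeText"),
    FakeElement("Costs stayed flat.", "NarrativeText"),
    FakeElement("Page 1", "Footer"),
]

# Count how many elements of each type the document contains.
counts = Counter(e.category for e in elements)

# Drop page furniture and keep only the body text.
body = [e.text for e in elements if e.category not in {"Header", "Footer"}]

print(dict(counts))
print(body)
```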

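Metadata fields:

Field	Description	Example
`page_number`	Page number (1-based)	1, 2, 3
`coordinates.points`	Bounding-box coordinates (four corner points)	`[top-left, bottom-left, bottom-right, top-right]`
`coordinates.system`	Coordinate system type	`PixelSpace` (pixel coordinates)
`coordinates.layout_width`	Page width (pixels)	1920
`coordinates.layout_height`	Page height (pixels)	1080
`languages`	Detected languages	`["zho"]` (Chinese, ISO 639-3)
`filename`	Original file name	`document.pdf`
`last_modified`	Last-modified time	ISO 8601 format
`filetype`	MIME type	`application/pdf`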
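
As an illustration of how the coordinate fields fit together, the following sketch converts the four corner points into a bounding box expressed as fractions of the page size. The function name `relative_bbox` and the sample numbers are assumptions; the inputs mirror the `points`, `layout_width`, and `layout_height` fields in the table above.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def relative_bbox(
    points: Sequence[Point], layout_width: float, layout_height: float
) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) as fractions of the page size.

    Works for any corner ordering because it only takes minima and maxima.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (
        min(xs) / layout_width,
        min(ys) / layout_height,
        max(xs) / layout_width,
        max(ys) / layout_height,
    )


# Hypothetical corner points: top-left, bottom-left, bottom-right, top-right.
points = [(192.0, 108.0), (192.0, 324.0), (960.0, 324.0), (960.0, 108.0)]
bbox = relative_bbox(points, layout_width=1920, layout_height=1080)
print(bbox)
```

Relative coordinates are convenient when the same region has to be located on a page image rendered at a different resolution.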

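Notes on table extraction:

Table extraction is now built into the core unstructured library, so the extract_tables parameter no longer needs to be passed to unstructured-inference. Table elements can be selected through the category attribute of each element.

# Select the table elements
tables = [e for e in elements if e.category == "Table"]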
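
Once table elements are selected, a common next step is turning their HTML rendering into rows. The sketch below does this with the standard-library HTML parser. It assumes the table markup is available as a string (for example from `metadata.text_as_html`, when the parse populated it); the class and function names are illustrative, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in a simple, non-nested table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical table markup, shaped like an HTML table rendering.
sample_html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```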

Common Partition Parameters

Parameter	Purpose	Example
`encoding`	Character encoding	`"utf-8"`
`include_page_breaks`	Include page-break elements	`True`
`strategy`	Parsing strategy	`"auto"`, `"fast"`, `"hi_res"`, `"ocr_only"`, `"vlm"`
`languages`	OCR language packs	`["eng", "chi_sim"]`
`skip_infer_table_types`	File types for which table inference is skipped	`["pdf", "docx"]`
`fields_include`	Controls which output fields are returned	`["text", "type", "metadata"]`
`metadata_include`	Metadata keys to keep	`["page_number", "filename"]`
`metadata_exclude`	Metadata keys to drop	`["coordinates"]`
`content_type`	Explicit MIME type	`"application/pdf"`
`starting_page_number`	Page number assigned to the first page	`1`
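
The include and exclude options are easiest to understand as a filter over a metadata dictionary. The sketch below is one plausible reading of those semantics, written against a plain dict rather than the library's metadata object; the helper name `filter_metadata` and the rule that the two options are mutually exclusive are assumptions.

```python
from typing import Any, Dict, Iterable, Optional


def filter_metadata(
    metadata: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Keep only `include` keys, or drop `exclude` keys (not both at once)."""
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    if include is not None:
        wanted = set(include)
        return {k: v for k, v in metadata.items() if k in wanted}
    if exclude is not None:
        unwanted = set(exclude)
        return {k: v for k, v in metadata.items() if k not in unwanted}
    return dict(metadata)


# Hypothetical metadata for one element.
meta = {
    "page_number": 1,
    "filename": "document.pdf",
    "filetype": "application/pdf",
    "coordinates": {"layout_width": 1920, "layout_height": 1080},
}
kept = filter_metadata(meta, include=["page_number", "filename"])
slim = filter_metadata(meta, exclude=["coordinates"])
print(kept)
print(sorted(slim))
```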

Parsing Documents in Practice

A General-Purpose Parsing Function

from pathlib import Path
from typing import Any, Dict, List

from unstructured.documents.elements import Element
from unstructured.partition.auto import partition


# Custom parsing helper that works with any supported file format
def parse_file_with_unstructured(file_path: str) -> Dict[str, Any]:
    """
    Parse a single file with Unstructured.

    Args:
        file_path: Path to the file.

    Returns:
        Dict: The parsed elements plus summary statistics, or an empty dict on failure.
    """
    print(f"\nParsing file: {file_path}")

    try:
        # partition() detects the file type and parses the file. The default
        # strategy is "auto"; there is also a "fast" strategy, roughly 100x
        # faster than image-to-text models.
        elements: List[Element] = partition(filename=file_path, strategy="auto")

        # Collect the results
        analysis: Dict[str, Any] = {
            "file_path": file_path,
            "file_extension": Path(file_path).suffix.lower(),
            "total_elements": len(elements),
            "element_types": {},
            "elements": elements,
            "text_content": "",
            "statistics": {}
        }

        # Count elements by type
        for element in elements:
            element_type = type(element).__name__
            analysis["element_types"][element_type] = analysis["element_types"].get(element_type, 0) + 1

        # Extract the text content
        text_parts = []

        for element in elements:
            if hasattr(element, 'text') and element.text:
                text_parts.append(element.text)

        analysis["text_content"] = "\n\n".join(text_parts)

        # Compute statistics
        analysis["statistics"]["total_characters"] = len(analysis["text_content"])

        print("   Parsing complete")
        print(f"   Total elements: {analysis['total_elements']}")
        print(f"   Element types: {analysis['element_types']}")
        print(f"   Total characters: {analysis['statistics']['total_characters']}")
        print(f"   Text content: {analysis['text_content'][:3000]} ")

        return analysis

    except Exception as e:
        print(f"Failed to parse file: {e}")
        return {}

Parsing Markdown

from unstructured.partition.md import partition_md

elements = partition_md(
    filename="document.md",
    languages=["zho"],
    include_page_breaks=True
)

Parsing HTML

from unstructured.partition.html import partition_html

# Local file
elements = partition_html(filename="page.html")

# Remote URL
elements = partition_html(
    url="https://example.com",
    headers={"User-Agent": "MyBot"},
    ssl_verify=False  # disables TLS certificate verification; avoid in production
)

Parsing Excel

from unstructured.partition.xlsx import partition_xlsx

elements = partition_xlsx(
    filename="data.xlsx",
    languages=["zho"]
)

Parsing CSV

from unstructured.partition.csv import partition_csv

elements = partition_csv(
    filename="data.csv",
    encoding="utf-8"
)

Parsing Word

from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_page_breaks=True
)

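Parsing Images

Prerequisites:

Install poppler-utils and tesseract-ocr.
On first use, the YOLOX layout model is downloaded automatically (this requires access to HuggingFace, which may need a proxy on restricted networks).
Model source: HuggingFace

from unstructured.partition.image import partition_image

elements = partition_image(
    filename="screenshot.png",
    strategy="ocr_only",
    languages=["eng", "chi_sim"]
)

Notes:

OCR quality depends heavily on the quality of the source image.
Preprocessing can help: binarization, denoising, deskewing.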
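
To show what the binarization step involves, here is a dependency-free sketch of Otsu thresholding on a tiny grayscale image held as nested lists. Otsu's method is a choice made for this illustration, not something the source prescribes; production code would normally rely on an imaging library, and denoising and deskewing are not covered here.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the Otsu threshold for 8-bit grayscale values (0-255).

    Picks the threshold that maximizes the between-class variance of the
    "dark" (<= t) and "bright" (> t) pixel groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_bright = total - w_dark
        if w_bright == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        between_var = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map every pixel to 0 (dark) or 255 (bright) using the Otsu threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


# Hypothetical 2x3 grayscale image: dark text pixels and a bright background.
image = [[30, 40, 200], [35, 210, 220]]
threshold = otsu_threshold([p for row in image for p in row])
binary = binarize(image)
print(threshold, binary)
```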

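Parsing PDFs

Strategy comparison:

Strategy	When to use	Characteristics
`auto`	Standard PDFs	Default; picks a strategy based on the document's characteristics
`fast`	Text-only PDFs	Fastest; basic text extraction
`hi_res`	Complex layouts	Uses a layout detection model; high accuracy
`ocr_only`	Scanned documents	OCR-only extraction
`vlm`	Extremely complex documents	Interpretation by a multimodal model

PDF-specific parameters:

Parameter	Purpose	Default / example
`extract_images_in_pdf`	Extract embedded image blocks (requires hi_res)	`True`
`extract_image_block_types`	Element types to extract as images	`["Image", "Table"]`
`extract_image_block_to_payload`	Return the images as base64 in the output	`False`
`extract_image_block_output_dir`	Directory where images are saved	`"./images"`
`max_partition`	Maximum number of characters in a single element	`1500`
`languages` / `ocr_languages`	OCR language packs	`["eng", "zho"]`
`skip_infer_table_types`	File types for which table inference is skipped	`["pdf"]`
`split_pdf_page`	Process large files page by page	`True`
`infer_table_structure`	Infer table structure	`True`

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",                    # high-accuracy mode
    extract_images_in_pdf=True,           # extract images
    extract_image_block_types=["Table", "Image"],  # element types to extract
    extract_image_block_output_dir="./images",     # output directory
    languages=["eng", "zho"],             # languages
    split_pdf_page=True,                  # page-by-page processing
    infer_table_structure=True,           # infer table structure
    include_page_breaks=True              # include page-break elements
)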

OCR Engine Configuration

Switching the OCR engine:

# Use Tesseract (the default)
export OCR_AGENT="unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"

# Use PaddleOCR (often better for Chinese)
export OCR_AGENT="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
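
The same switch can be made from inside a Python process by setting the environment variable before any parsing call runs. The sketch below only writes the variable. The helper name `select_ocr_agent` and the short aliases are assumptions, and it is assumed here that the library reads `OCR_AGENT` when OCR is first needed, so the call should come before importing or invoking the partition functions.

```python
import os

# The two values shown in the shell commands above, keyed by a short alias.
OCR_AGENTS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
}


def select_ocr_agent(name: str) -> str:
    """Set OCR_AGENT for the current process and return the value written."""
    try:
        agent = OCR_AGENTS[name]
    except KeyError:
        raise ValueError(f"unknown OCR agent alias: {name!r}") from None
    os.environ["OCR_AGENT"] = agent
    return agent


selected = select_ocr_agent("paddle")
print(os.environ["OCR_AGENT"])
```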

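Common Problems and Solutions

Installation Problems

Dependency chain errors

Problem: Could not build wheels for pikepdf
Solution: install qpdf, libheif, and pillow beforehand

Difficulty installing hi_res

Problem: detectron2 is hard to install on Windows
Solution: use the remote API or a Linux/GPU container

Recognition Problems

Inaccurate Chinese recognition

Confirm that the Tesseract Chinese language pack is installed
Use languages=["chi_sim", "eng"]
Consider switching to PaddleOCR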
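
The first check in the list above can be automated by parsing the output of `tesseract --list-langs`. This is a rough sketch: the helper names are made up, the sample output is illustrative, and the parser assumes the usual layout of one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_tesseract_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Skip the header line, which starts with "List of available languages".
    return [ln for ln in lines if not ln.lower().startswith("list of")]


def installed_tesseract_langs() -> Optional[List[str]]:
    """Return installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list to stderr, so both streams are scanned.
    return parse_tesseract_langs(result.stdout + "\n" + result.stderr)


# Illustrative output; the real path and language list depend on the machine.
sample_output = """List of available languages in "/usr/share/tessdata/" (3):
chi_sim
eng
osd
"""
langs = parse_tesseract_langs(sample_output)
print(langs, "chi_sim" in langs)
```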

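Misaligned table conversion

Use strategy="hi_res" for better accuracy
Tune the table inference parameters
Consider dedicated tools (Camelot, Tabula)

Garbled mixed Chinese and English text

Use strategy="hi_res" for layout detection
Preprocess low-quality scans before parsing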

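Performance Optimization

Choosing a strategy:

Large volumes of digital PDFs (with a text layer)
└─ Use the fast strategy + PyMuPDF

Complex layouts/tables
└─ Use the hi_res strategy (expensive; consider asynchronous processing)

Scanned documents
└─ Use the ocr_only strategy + image preprocessing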
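
The decision tree above can be written as a small routing function. The sketch assumes the caller already knows whether a document has a text layer and whether its layout is complex; how those flags are detected is out of scope, and the function name and the priority given to complex layouts are choices made for this example.

```python
def choose_strategy(has_text_layer: bool, complex_layout: bool) -> str:
    """Map coarse document traits to a partition strategy name."""
    if complex_layout:
        # Tables and multi-column layouts benefit from layout detection.
        return "hi_res"
    if has_text_layer:
        # Digital PDFs with extractable text need no OCR.
        return "fast"
    # No text layer and a simple layout: treat the file as a scan.
    return "ocr_only"


for traits in [(True, False), (True, True), (False, False)]:
    print(traits, "->", choose_strategy(*traits))
```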

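Batch processing:

Process files through an asynchronous queue
Offload hi_res inference to a dedicated inference service
Route documents to different pipelines by type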
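
A minimal sketch of the asynchronous queue idea follows, using only the standard library (Python 3.9+ for `asyncio.to_thread`). The `fake_parse` function is a hypothetical stand-in for a blocking parse call such as the general-purpose helper shown earlier; the worker count and the file names are arbitrary.

```python
import asyncio
from pathlib import Path
from typing import Dict, List


def fake_parse(path: str) -> int:
    """Hypothetical stand-in for a blocking parse call; returns a fake count."""
    return len(Path(path).name)


async def worker(queue: "asyncio.Queue[str]", results: Dict[str, int]) -> None:
    while True:
        path = await queue.get()
        try:
            # Run the blocking parse in a thread so the event loop stays free.
            results[path] = await asyncio.to_thread(fake_parse, path)
        except Exception:
            results[path] = -1  # record the failure and keep the worker alive
        finally:
            queue.task_done()


async def parse_all(paths: List[str], concurrency: int = 2) -> Dict[str, int]:
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    for p in paths:
        queue.put_nowait(p)
    results: Dict[str, int] = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued file has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(parse_all(["a.pdf", "report.docx", "scan.png"]))
print(results)
```

Bounding the number of workers keeps memory use predictable when individual parses are expensive, which matters most for hi_res jobs.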

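Summary

Unstructured provides powerful document parsing capabilities:

Multi-format support: a single interface for 7+ document formats
Flexible strategies: choose the parsing strategy that suits the scenario
Structured output: document structure and metadata are preserved
RAG-friendly: integrates directly with LangChain and LlamaIndex

Best practices:

Start with the fast strategy and move up to hi_res as needed
Choose the OCR engine that suits the document type
Preprocess low-quality images
Use metadata for fine-grained control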
