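- Live image streaming: WebSocket JPEG frame transmission + frame-rate control + PC browser preview
- PDF upload and processing: upload and processing are decoupled; two types are supported, ocrpdf and markdown
- Real MinerU integration: markdown processing + images ZIP packaging
- OCRmyPDF integration: ocrpdf produces a searchable two-layer PDF
- Mobile task-management panel: status polling + SAF directory selection for downloads
- PC management panel: /dashboard for file and task management
- Network layer: OkHttp client, WebSocket image streaming, LAN discovery placeholder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>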
392
requirements/mineru-integration.md
Normal file
@@ -0,0 +1,392 @@
# MinerU Integration Guide for the FairScan PC Server

> This document records the local MinerU environment and API usage, and describes how to integrate MinerU into the FairScan PC server
> in place of the current simulated processing logic.

---

## 1. Local Environment

> **Unified environment**: MinerU and OCRmyPDF share a single conda environment, `MinerU`,
> and the PC server is always started after `conda activate MinerU`.

| Item | Value |
|------|-----|
| MinerU source path | `F:/datasets_rm/MinerU/` |
| **Installed version** | **3.0.9** |
| **Latest version** | **3.2.2** (446 commits ahead) |
| Conda environment | `D:/ProgramData/miniconda3/envs/MinerU/` |
| Python | 3.10.20 |
| PyTorch | 2.6.0+cu124 |
| CUDA | 12.4 |
| GPU | NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM) |
| Transformers | 4.57.6 |
| onnxruntime | 1.23.2 |
| Pipeline models | ✅ Downloaded (HF cache: `C:/Users/32892/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0`) |
| VLM model | ✅ Downloaded (HF cache: `C:/Users/32892/.cache/huggingface/hub/models--opendatalab--MinerU2.5-2509-1.2B`) |
| HF Hub offline mode | ✅ `HF_HUB_OFFLINE=1` (set when main.py starts) |
| OCRmyPDF | ✅ v15.4.4 installed (source at `F:/datasets_rm/ocRmypdf`, same conda environment) |
| Tesseract | ❌ Not yet installed (required dependency of OCRmyPDF) |

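
Because the table mixes components that are installed with one that is still missing, a quick probe from inside the activated environment can confirm the current state. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not part of MinerU or OCRmyPDF, and it only reports what it finds without installing anything. Printing its result after `conda activate MinerU` should show `tesseract: False` until Tesseract is installed and on `PATH`.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report which components from the table are visible to this interpreter.

    Hypothetical helper for illustration; it probes and does not install anything.
    """
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        # find_spec returns None when a top-level package cannot be imported
        "mineru": importlib.util.find_spec("mineru") is not None,
        "ocrmypdf": importlib.util.find_spec("ocrmypdf") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        # OCRmyPDF runs the tesseract executable, so it has to be on PATH
        "tesseract": shutil.which("tesseract") is not None,
    }
```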

---

## 2. Prerequisite: Upgrade MinerU (Strongly Recommended)

The installed 3.0.9 is well behind the latest 3.2.2 (446 commits). The main improvements are:

- **`aio_do_parse()` async interface**: can be awaited directly without blocking the FastAPI event loop
- **Concurrency lock optimization**: Layout/MFR/OCR use separate inference locks, which reduces GPU contention
- **PDF rendering fixes**: numerous PDFium resource-leak and crash fixes
- **Image analysis**: new `image_analysis` parameter
- **Client-side output generation**: new `client_side_output_generation` option

### 2.1 Pull the Latest Code

```bash
cd F:/datasets_rm/MinerU
git checkout main
git pull origin main
git checkout mineru-3.2.2-released
```

### 2.2 Update the Installation

```bash
# Run from the MinerU source directory (F:/datasets_rm/MinerU)
conda activate MinerU
pip install -e .
```

### 2.3 Verify

```bash
# Check the version
python -c "from mineru.version import __version__; print(__version__)"  # should print 3.2.2

# Verify that the models are available
python -c "
from mineru.utils.models_download_utils import auto_download_and_get_model_root_path
print('Pipeline:', auto_download_and_get_model_root_path('models/README.md', 'pipeline'))
print('VLM:', auto_download_and_get_model_root_path('/', 'vlm'))
"
```

> **Note**: The `~/.mineru.json` configuration file is only needed if you later use `model_source=local` to point at a custom model path. The default HuggingFace cache mode does not need it.


---

## 3. MinerU Programming Interface

### 3.1 Core Function: `do_parse`

```python
from mineru.cli.common import do_parse, read_fn
from mineru.utils.enum_class import MakeMode
from pathlib import Path

# Signature shown for reference only; the body is elided
def do_parse(
    output_dir: str,                    # output directory path
    pdf_file_names: list[str],          # PDF file names (without extension)
    pdf_bytes_list: list[bytes],        # PDF file contents as bytes
    p_lang_list: list[str],             # languages ("ch", "en", "japan", ...)
    backend: str = "pipeline",          # "pipeline" | "vlm-auto-engine" | "hybrid-auto-engine"
    parse_method: str = "auto",         # "auto" | "txt" | "ocr"
    formula_enable: bool = True,
    table_enable: bool = True,
    server_url: str | None = None,      # remote server URL (http-client backend only)
    f_dump_md: bool = True,             # write the .md file
    f_dump_middle_json: bool = True,    # write _middle.json
    f_dump_model_output: bool = True,   # write _model.json
    f_dump_orig_pdf: bool = True,       # write a copy of the original PDF
    f_dump_content_list: bool = True,   # write _content_list.json
    f_draw_layout_bbox: bool = True,    # write a PDF with layout boxes drawn
    f_draw_span_bbox: bool = True,      # write a PDF with span boxes drawn
    f_make_md_mode: MakeMode = MakeMode.MM_MD,  # Markdown mode
    start_page_id: int = 0,
    end_page_id: int | None = None,     # None = all pages
    **kwargs,
): ...
```
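
To make the signature concrete, here is a minimal sketch of a synchronous call that keeps only the Markdown output. `parse_kwargs` and `parse_to_markdown` are hypothetical helper names introduced for illustration; the keyword names come from the signature above. The sketch reads the PDF with `Path.read_bytes()`, which assumes the input is already a PDF (for image inputs, `read_fn` is the documented route), and it imports MinerU lazily so the argument-building part can be exercised without MinerU installed.

```python
from pathlib import Path


def parse_kwargs(pdf_path: str, output_dir: str, lang: str = "ch") -> dict:
    """Build keyword arguments for do_parse (hypothetical helper)."""
    path = Path(pdf_path)
    return {
        "output_dir": output_dir,
        "pdf_file_names": [path.stem],          # names are passed without extension
        "pdf_bytes_list": [path.read_bytes()],  # assumes the file is already a PDF
        "p_lang_list": [lang],
        "backend": "pipeline",
        "parse_method": "auto",
        # Keep only the Markdown output and switch the debugging artifacts off
        "f_dump_md": True,
        "f_dump_middle_json": False,
        "f_dump_model_output": False,
        "f_dump_orig_pdf": False,
        "f_dump_content_list": False,
        "f_draw_layout_bbox": False,
        "f_draw_span_bbox": False,
    }


def parse_to_markdown(pdf_path: str, output_dir: str, lang: str = "ch") -> None:
    """Run MinerU on one PDF; blocks until parsing finishes."""
    # Imported lazily so this module loads even where MinerU is not installed
    from mineru.cli.common import do_parse

    do_parse(**parse_kwargs(pdf_path, output_dir, lang))
```

Since this call blocks, it suits scripts and quick experiments rather than a FastAPI `async` endpoint.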

### 3.2 The `read_fn` Helper

```python
from mineru.cli.common import read_fn

# Read a PDF file as bytes
pdf_bytes = read_fn("F:/path/to/doc.pdf")

# Image files are also supported (converted to PDF bytes automatically)
png_bytes = read_fn("scan.png")
```

### 3.3 Output Directory Structure

The pipeline backend (`backend="pipeline"`) produces:

```
{output_dir}/
  {pdf_name}/
    auto/                              # parse_method="auto"
      {pdf_name}.md                    # ★ Markdown output (main artifact)
      {pdf_name}_middle.json           # intermediate parse result
      {pdf_name}_model.json            # raw model output
      {pdf_name}_content_list.json
      {pdf_name}_origin.pdf            # copy of the original PDF
      {pdf_name}_layout.pdf            # layout visualization
      {pdf_name}_span.pdf              # span visualization
      images/                          # extracted images
```
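
The tree above can be turned into a small path helper, and the Markdown file can be bundled with its `images/` folder so that the two travel together. This is a sketch under two assumptions: the layout is exactly the one shown above, and the Markdown refers to its pictures through relative `images/...` links. `markdown_output_dir` and `bundle_markdown` are hypothetical names, not MinerU functions.

```python
import zipfile
from pathlib import Path


def markdown_output_dir(output_dir, pdf_name: str, parse_method: str = "auto") -> Path:
    """Directory that holds the outputs, following the tree shown above."""
    return Path(output_dir) / pdf_name / parse_method


def bundle_markdown(output_dir, pdf_name: str, zip_path) -> Path:
    """Zip {pdf_name}.md together with images/ (hypothetical helper)."""
    root = markdown_output_dir(output_dir, pdf_name)
    md_file = root / f"{pdf_name}.md"
    if not md_file.exists():
        raise FileNotFoundError(md_file)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(md_file, md_file.name)
        images = root / "images"
        if images.is_dir():
            for img in sorted(images.rglob("*")):
                if img.is_file():
                    # Store paths relative to the Markdown file so links keep resolving
                    zf.write(img, img.relative_to(root).as_posix())
    return Path(zip_path)
```

Writing the archive outside the `auto/` directory keeps it from being picked up as one of its own inputs.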

### 3.4 Language Codes

| Code | Language |
|------|------|
| `ch` | Simplified Chinese |
| `ch_server` | Chinese, server model (faster) |
| `ch_lite` | Chinese, lightweight model |
| `en` | English |
| `japan` | Japanese |
| `korean` | Korean |
| `chinese_cht` | Traditional Chinese |

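
If the mobile client is later allowed to choose a language, the server may want to validate the value before passing it on. A minimal sketch follows; `SUPPORTED_LANGS` and `normalize_lang` are hypothetical names, and the set contains only the codes listed in the table above (MinerU may accept others). Unknown or missing values fall back to `ch`, the default used elsewhere in this document.

```python
from typing import Optional

# Only the codes from the table above; this is not an exhaustive MinerU list
SUPPORTED_LANGS = {"ch", "ch_server", "ch_lite", "en", "japan", "korean", "chinese_cht"}


def normalize_lang(value: Optional[str], default: str = "ch") -> str:
    """Return a known language code, or the default when the value is unknown."""
    code = (value or "").strip().lower()
    return code if code in SUPPORTED_LANGS else default
```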

---

## 4. Integration Options

### Option A: Direct Async API Call (Strongly Recommended, Requires v3.2.2)

After upgrading to v3.2.2, `aio_do_parse()`, MinerU's native async interface, can be used directly, with no need for `asyncio.to_thread()`.

**Advantages**:
- **Native async**: awaited directly, does not block the FastAPI event loop
- Simplest option; no inter-process communication is needed
- Output file paths are directly accessible

**Prerequisites**:
- The FairScan PC server runs inside the MinerU conda environment
- `F:/datasets_rm/MinerU` is installed via `pip install -e .`

**Implementation sketch**:

```python
# ---- New code for pc-server/main.py ----
# tasks_db, artifacts_db, artifacts_map, TASKS_DIR and logger are assumed
# to be defined in main.py already.

import uuid
from pathlib import Path
from mineru.cli.common import aio_do_parse, read_fn

async def real_mineru_processing(task_id: str):
    """Process a PDF for real through MinerU's async interface."""
    task = tasks_db.get(task_id)
    if task is None:
        return

    file_name = task.get("fileName", "document.pdf")
    base_name = Path(file_name).stem
    upload_path = Path(task["uploadPath"])
    process_type = task.get("processType", "ocrpdf")
    lang = task.get("options", {}).get("lang", "ch")

    task["status"] = "processing"
    task["progress"] = 10
    task["message"] = "MinerU processing started..."

    output_dir = TASKS_DIR / task_id

    try:
        # Inside the try block so that I/O errors also mark the task as failed
        output_dir.mkdir(parents=True, exist_ok=True)
        pdf_bytes = read_fn(upload_path)

        if process_type == "markdown":
            await aio_do_parse(
                output_dir=str(output_dir),
                pdf_file_names=[base_name],
                pdf_bytes_list=[pdf_bytes],
                p_lang_list=[lang],
                backend="pipeline",
                f_dump_md=True,
                f_dump_middle_json=False,
                f_dump_model_output=False,
                f_dump_orig_pdf=False,
                f_dump_content_list=False,
                f_draw_layout_bbox=False,
                f_draw_span_bbox=False,
            )
            md_path = output_dir / base_name / "auto" / f"{base_name}.md"
            if md_path.exists():
                art_id = str(uuid.uuid4())
                artifacts_db[task_id] = [{
                    "artifactId": art_id, "fileName": f"{base_name}.md",
                    "fileSize": md_path.stat().st_size, "fileType": "md",
                    "filePath": str(md_path),
                }]
                artifacts_map[art_id] = artifacts_db[task_id][0]
                task.update(status="completed", progress=100,
                            message="MinerU Markdown completed")
                return

        elif process_type == "ocrpdf":
            await aio_do_parse(
                output_dir=str(output_dir),
                pdf_file_names=[base_name],
                pdf_bytes_list=[pdf_bytes],
                p_lang_list=[lang],
                backend="pipeline",
                f_dump_md=False,
                f_dump_middle_json=False,
                f_dump_model_output=False,
                f_dump_orig_pdf=False,
                f_dump_content_list=False,
                f_draw_layout_bbox=True,
                f_draw_span_bbox=False,
            )
            # Note: _layout.pdf is the layout visualization (boxes drawn on the pages),
            # which is served here under the name {base_name}_ocr.pdf.
            layout_pdf = output_dir / base_name / "auto" / f"{base_name}_layout.pdf"
            if layout_pdf.exists():
                art_id = str(uuid.uuid4())
                artifacts_db[task_id] = [{
                    "artifactId": art_id, "fileName": f"{base_name}_ocr.pdf",
                    "fileSize": layout_pdf.stat().st_size, "fileType": "pdf",
                    "filePath": str(layout_pdf),
                }]
                artifacts_map[art_id] = artifacts_db[task_id][0]
                task.update(status="completed", progress=100,
                            message="OCR PDF completed")
                return

        # Reached when the expected output file is missing or the type is unknown
        task["status"] = "failed"
        task["message"] = "MinerU did not produce output"

    except Exception as e:
        task["status"] = "failed"
        task["message"] = f"MinerU error: {str(e)}"
        logger.error(f"MinerU task {task_id} failed: {e}")
```
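
With a single 8 GB GPU, several uploads arriving at once could start several parses in parallel. One possible guard, sketched here as an assumption rather than a MinerU requirement, is to push every job through a semaphore so that they run one after another. `GpuQueue` is a hypothetical name; the semaphore is created lazily inside the running event loop so the class can be instantiated at import time.

```python
import asyncio


class GpuQueue:
    """Run coroutine jobs with at most `max_parallel` active at once (hypothetical helper)."""

    def __init__(self, max_parallel: int = 1):
        self._max_parallel = max_parallel
        self._sem = None

    async def run(self, job, *args):
        # Create the semaphore on first use, inside the running event loop
        if self._sem is None:
            self._sem = asyncio.Semaphore(self._max_parallel)
        async with self._sem:
            return await job(*args)
```

In the server this might be used as `asyncio.create_task(gpu_queue.run(real_mineru_processing, task_id))`, keeping a reference to the returned task so it is not garbage-collected while running.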

### Option B: Subprocess Call (Fallback)

Invoke the `mineru` CLI in a child process:

```python
import asyncio

async def mineru_subprocess(task_id: str):
    task = tasks_db[task_id]
    upload_path = task["uploadPath"]
    output_dir = TASKS_DIR / task_id

    cmd = [
        "D:/ProgramData/miniconda3/envs/MinerU/python.exe",
        "-m", "mineru.cli.client",
        "-p", str(upload_path),
        "-o", str(output_dir),
        "-b", "pipeline",
        "-l", "ch",
    ]

    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        # Merge stderr into stdout so an unread stderr pipe cannot fill up and stall the child
        stderr=asyncio.subprocess.STDOUT,
    )

    # Drain the output line by line (optional: watch it for progress information)
    while True:
        line = await proc.stdout.readline()
        if not line:
            break
        # Parse progress...

    returncode = await proc.wait()
    if returncode == 0:
        task["status"] = "completed"
    else:
        task["status"] = "failed"
```

**Advantage**: Process isolation; a MinerU crash does not take down the FairScan service.

**Disadvantage**: Progress monitoring is difficult and requires IPC.
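
Since progress is hard to observe from outside the child process, a hung run is also hard to notice. One way to bound it, sketched below with standard `asyncio` calls only, is to wrap the child in a timeout and kill it when the limit is exceeded. `run_with_timeout` is a hypothetical helper name, and the choice of timeout value is left to the caller.

```python
import asyncio


async def run_with_timeout(cmd, timeout_s: float):
    """Run cmd; return (returncode, combined output). Kill the child on timeout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # one pipe to drain instead of two
    )
    try:
        # communicate() reads all output and waits for the process to exit
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the child before re-raising
        raise
    return proc.returncode, out.decode(errors="replace")
```

The `cmd` list from the example above can be passed in unchanged; the caller would catch `asyncio.TimeoutError` and mark the task as failed.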

### Option C: MinerU FastAPI Service

Run `mineru-api`, the FastAPI service bundled with MinerU, as a microservice and have FairScan call it over HTTP.

This option matches the recommendation for atomic services in pc-api-spec.md, but it is more complex to implement.

---

## 5. Mapping to pc-api-spec.md

Per the interface specification, the two `processType` values map to MinerU as follows:

| processType | MinerU backend | Output file | File type |
|-------------|-----------|---------|---------|
| `markdown` | `backend="pipeline"` | `{name}.md` | `text/markdown` |
| `ocrpdf` | `backend="pipeline"` + `f_draw_layout_bbox=True` | `{name}_layout.pdf` | `application/pdf` |

Both types share the same MinerU `do_parse` call; only the output options differ.
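
Because only the output options differ, the mapping can be expressed as data instead of two near-identical branches. The sketch below mirrors the table; `PROCESS_TYPES` and `plan_for` are hypothetical names, and only the two flags that differ between the types are listed (the remaining dump flags would be set to `False` in both cases, as in Option A).

```python
# Per-type settings taken from the table above
PROCESS_TYPES = {
    "markdown": {
        "flags": {"f_dump_md": True, "f_draw_layout_bbox": False},
        "suffix": ".md",
        "mime": "text/markdown",
    },
    "ocrpdf": {
        "flags": {"f_dump_md": False, "f_draw_layout_bbox": True},
        "suffix": "_layout.pdf",
        "mime": "application/pdf",
    },
}


def plan_for(process_type: str, base_name: str) -> dict:
    """Return backend, flags and expected output name; raises KeyError for unknown types."""
    spec = PROCESS_TYPES[process_type]
    return {
        "backend": "pipeline",
        "flags": dict(spec["flags"]),  # copy so callers cannot mutate the table
        "output_name": f"{base_name}{spec['suffix']}",
        "mime": spec["mime"],
    }
```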

---

## 6. Suggested Integration Steps

### Step 1: Upgrade MinerU to the Latest Version

```bash
cd F:/datasets_rm/MinerU
git checkout main && git pull origin main
git checkout mineru-3.2.2-released
conda activate MinerU
pip install -e .
```

Verify:
```bash
python -c "from mineru.cli.common import aio_do_parse; print('OK')"
```

### Step 2: Switch the PC Server's Runtime Environment

```bash
conda activate MinerU
cd E:/race_save/FairScan_cyy/FairScan/pc-server
python main.py
```

### Step 3: Replace `simulate_processing` with the Real MinerU Call

In `main.py`, replace `simulate_processing` with `real_mineru_processing` (see the Option A implementation).

### Step 4: End-to-End Testing

1. Start with a small PDF (1-2 pages) and `parse_method="txt"` (fast)
2. Once that works, switch to `parse_method="auto"` (full OCR + formulas + tables)
3. Test downloading the artifacts after processing completes

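
---

## 7. Notes

| Item | Notes |
|------|------|
| **GPU memory** | The RTX 4060 has 8 GB of VRAM. The pipeline backend needs roughly 4-6 GB and the VLM backend roughly 6-8 GB, so the pipeline backend is recommended. |
| **Processing speed** | For ordinary A4 PDFs, the pipeline backend takes about 3-8 seconds per page, depending on content complexity. |
| **Language** | `ch` (Simplified Chinese) is passed by default. FairScan can add language selection later. |
| **Page limits** | Use `start_page_id` / `end_page_id` to restrict the page range. |
| **Large files** | PDFs over 100 pages are best processed in batches. |
| **Timeouts** | Processing time grows in proportion to page count; do not set HTTP timeouts too short. |
| **Locking model** | `do_parse` is not thread-safe. From a FastAPI `async` endpoint, call it in a worker thread so it does not block the event loop, and avoid running several calls at the same time. |
| **Error handling** | `do_parse` raises an exception on failure; catch it and set `task["status"] = "failed"`. |
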
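
The page-limit, large-file and timeout rows can be combined into a little arithmetic. The sketch below splits a document into page ranges and derives a timeout from the upper speed estimate of 8 seconds per page. `page_batches` and `suggested_timeout` are hypothetical names; the sketch assumes `end_page_id` is 0-based and inclusive, and the 60-second margin is an arbitrary allowance, not a measured value. For example, a 250-page PDF with the default batch size yields the ranges (0, 99), (100, 199) and (200, 249), and a 10-page job gets 10 × 8 + 60 = 140 seconds.

```python
def page_batches(total_pages: int, batch_size: int = 100):
    """Yield (start_page_id, end_page_id) pairs; 0-based, end assumed inclusive."""
    if total_pages < 0 or batch_size <= 0:
        raise ValueError("total_pages must be >= 0 and batch_size must be > 0")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages) - 1


def suggested_timeout(pages: int, seconds_per_page: float = 8.0, margin: float = 60.0) -> float:
    """Rough timeout in seconds: worst-case per-page time plus a fixed margin."""
    return pages * seconds_per_page + margin
```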

---

## 8. Key Reference Files

| File | Description |
|------|------|
| `F:/datasets_rm/MinerU/mineru/cli/common.py` | Main entry point, `do_parse()` |
| `F:/datasets_rm/MinerU/mineru/cli/client.py` | CLI argument definitions |
| `F:/datasets_rm/MinerU/mineru/cli/output_paths.py` | Output path resolution |
| `F:/datasets_rm/MinerU/mineru/utils/config_reader.py` | Configuration reading |
| `F:/datasets_rm/MinerU/mineru/utils/enum_class.py` | Enum type definitions |
| `F:/datasets_rm/MinerU/mineru.template.json` | Configuration file template |
| `E:/race_save/FairScan_cyy/FairScan/pc-server/main.py` | FairScan PC server (needs modification) |
| `E:/race_save/FairScan_cyy/FairScan/requirements/pc-api-spec.md` | API interface specification |