# Video Subtitle OCR Extractor - CnOCR Integration

## Overview

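The subtitle extractor now supports three OCR engines:
- **PaddleOCR**: open-source OCR engine from Baidu
- **EasyOCR**: lightweight OCR engine
- **CnOCR**: OCR engine specialized for Chinese (new)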

## Installing and Configuring CnOCR

### 1. Automatic installation (recommended)

```bash
cd code
python install_cnocr.py
```

### 2. Manual installation

```bash
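# Install CnOCR (quoted so shells such as zsh do not expand the brackets)
pip install "cnocr[ort-cpu]" -i https://pypi.tuna.tsinghua.edu.cn/simple

# Create the model directory
mkdir -p /root/autodl-tmp/llm/cnocr

# Set the environment variable
export CNOCR_HOME=/root/autodl-tmp/llm/cnocr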
```
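
The same configuration can also be applied from Python before CnOCR is imported. The following is a minimal sketch, assuming CnOCR picks up `CNOCR_HOME` as the manual steps imply; the helper name `configure_cnocr_home` is illustrative and not part of this project.

```python
import os
from pathlib import Path

# Default path taken from the manual installation steps in this README.
DEFAULT_CNOCR_HOME = "/root/autodl-tmp/llm/cnocr"


def configure_cnocr_home(path: str = DEFAULT_CNOCR_HOME) -> Path:
    """Create the model directory and point CNOCR_HOME at it.

    Hypothetical helper: call it before importing cnocr so the
    variable is already set when the library looks for its models.
    """
    model_dir = Path(path)
    model_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    os.environ["CNOCR_HOME"] = str(model_dir)
    return model_dir
```

Calling `configure_cnocr_home()` at the top of an entry script avoids relying on the shell environment, which is easy to lose when the script is launched from a scheduler or an IDE.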

## Usage

### 1. Use CnOCR only

```bash
python ocr_subtitle_extractor.py your_video.mp4 -e cnocr
```

### 2. Use all OCR engines

```bash
python ocr_subtitle_extractor.py your_video.mp4 -e all
```

### 3. Full parameter example

```bash
python ocr_subtitle_extractor.py your_video.mp4 \
    -e cnocr \
    -l ch \
    -i 30 \
    -c 0.5 \
    -o results \
    -f json \
    --position bottom
```
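
To get a feel for what `-i 30` means in practice, the sketch below lists the timestamps that would be sampled. It assumes sampling starts at frame 0 and takes every `interval`-th frame; the extractor's actual sampling logic is not shown in this README, and `sampled_timestamps` is a hypothetical helper.

```python
from typing import List


def sampled_timestamps(total_frames: int, fps: float, interval: int = 30) -> List[float]:
    """Return the timestamps (in seconds) of the frames that would be sampled.

    Assumption: frames 0, interval, 2*interval, ... are processed.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive number of frames")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [frame_index / fps for frame_index in range(0, total_frames, interval)]


if __name__ == "__main__":
    # A 3-second clip at 30 fps with the default interval is sampled once per second.
    print(sampled_timestamps(total_frames=90, fps=30.0, interval=30))
```

Under these assumptions, a smaller interval catches short-lived subtitles at the cost of running OCR on more frames.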

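## Parameter Reference

- `-e, --engine`: OCR engine to use
  - `paddleocr`: PaddleOCR only
  - `easyocr`: EasyOCR only
  - `cnocr`: CnOCR only (new)
  - `all`: all three engines

- `-l, --language`: language setting
  - `ch`: Chinese
  - `en`: English
  - `ch_en`: mixed Chinese and English

- `-i, --interval`: frame sampling interval, in frames (default: 30)
- `-c, --confidence`: confidence threshold (default: 0.5)
- `-o, --output`: output directory
- `-f, --format`: output format (`json`/`txt`/`srt`)
- `--position`: subtitle region position (`full`/`center`/`bottom`)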
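
The options above can be mirrored with `argparse`. This is an illustrative sketch, not the parser used by `ocr_subtitle_extractor.py`: only the defaults for `--interval` and `--confidence` are documented, so the other options are left without defaults here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(
        description="Extract subtitles from a video with OCR (illustrative sketch)."
    )
    parser.add_argument("video", help="Path to the input video file")
    parser.add_argument(
        "-e", "--engine",
        choices=["paddleocr", "easyocr", "cnocr", "all"],
        help="OCR engine to use (default not documented, so none is set here)",
    )
    parser.add_argument(
        "-l", "--language",
        choices=["ch", "en", "ch_en"],
        help="Recognition language",
    )
    parser.add_argument("-i", "--interval", type=int, default=30,
                        help="Frame sampling interval, in frames")
    parser.add_argument("-c", "--confidence", type=float, default=0.5,
                        help="Confidence threshold")
    parser.add_argument("-o", "--output", help="Output directory")
    parser.add_argument("-f", "--format", choices=["json", "txt", "srt"],
                        help="Output format")
    parser.add_argument("--position", choices=["full", "center", "bottom"],
                        help="Subtitle region position")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using `choices` makes a typo such as `-e cnorc` fail at argument parsing rather than after the video has been opened.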

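## CnOCR Features

1. **Optimized for Chinese**: better recognition quality on Chinese text
2. **Lightweight**: relatively small models and fast inference
3. **Easy to deploy**: simple installation with few dependencies
4. **Multiple models**: supports a range of detection and recognition models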

## Testing the CnOCR Integration

```bash
python test_cnocr.py
```

The script will:
1. Test the CnOCR installation
2. Test model download
3. Test the integration with the subtitle extractor
4. Print the test results
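
Before running the full test script, a quick pre-check can confirm that the `cnocr` package is visible to the current interpreter. The sketch below is not the content of `test_cnocr.py`; `cnocr_status` is a hypothetical helper that only uses the standard library and does not import CnOCR or download any model.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional, Union


def cnocr_status() -> Dict[str, Union[bool, Optional[str]]]:
    """Report whether the cnocr package can be found, and its version if known."""
    installed = importlib.util.find_spec("cnocr") is not None
    version: Optional[str] = None
    if installed:
        try:
            version = metadata.version("cnocr")
        except metadata.PackageNotFoundError:
            # The module is importable but has no distribution metadata.
            version = None
    return {"installed": installed, "version": version}


if __name__ == "__main__":
    print(cnocr_status())
```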

## Model Storage Location

All CnOCR model files are downloaded to:
```
/root/autodl-tmp/llm/cnocr/
```

The required models are downloaded automatically on first use, so the first run may take a while.

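## Output Format

When CnOCR is used, the `engine` field of each recognition result is set to `"CnOCR"`, which makes it easy to tell results from different engines apart.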
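
When results from several engines end up in one list, the `engine` field can be used to split them. This is a minimal sketch, assuming each record is a dict shaped like the entries in the example output of this README; `group_by_engine` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_engine(subtitles: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group recognition records by the value of their `engine` field."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in subtitles:
        # Records without an engine tag are collected under "unknown".
        grouped[record.get("engine", "unknown")].append(record)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        {"timestamp": 1.5, "text": "这是一个测试字幕", "confidence": 0.95, "engine": "CnOCR"},
        {"timestamp": 1.5, "text": "这是一个测试字幕", "confidence": 0.91, "engine": "PaddleOCR"},
    ]
    for engine, records in group_by_engine(sample).items():
        print(engine, len(records))
```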

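## Performance Comparison

| Engine | Chinese recognition | English recognition | Speed | Model size |
|--------|---------------------|---------------------|-------|------------|
| PaddleOCR | Excellent | Excellent | Medium | Large |
| EasyOCR | Good | Excellent | Slower | Large |
| CnOCR | Excellent | Good | Faster | Medium |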

## Troubleshooting

### 1. Installation fails
```bash
# Upgrade pip
pip install --upgrade pip

# Use a mirror in mainland China
pip install "cnocr[ort-cpu]" -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### 2. Model download fails
```bash
# Check the network connection
# Make sure there is enough free disk space
# Re-run the installation script
python install_cnocr.py
```

### 3. Environment variable issues
```bash
# Add this at the top of your shell script, before the extractor runs
export CNOCR_HOME=/root/autodl-tmp/llm/cnocr
```

## Example Output

```json
{
  "video_path": "test_video.mp4",
  "subtitles": [
    {
      "timestamp": 1.5,
      "text": "这是一个测试字幕",
      "confidence": 0.95,
      "bbox": [[10, 20], [200, 20], [200, 50], [10, 50]],
      "engine": "CnOCR"
    }
  ],
  "stats": {
    "total_detections": 150,
    "filtered_detections": 120,
    "unique_texts": 50,
    "average_confidence": 0.87
  }
}
```
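
A result file in this shape can be post-processed with a few lines of Python. The sketch below is illustrative: `load_result`, `filter_by_confidence`, and `mean_confidence` are hypothetical helpers, and they do not reproduce how the extractor itself computes the `stats` block, which this README does not describe.

```python
import json
from typing import Any, Dict, List, Optional


def load_result(path: str) -> Dict[str, Any]:
    """Load a JSON result file written by the extractor."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def filter_by_confidence(result: Dict[str, Any], min_confidence: float) -> List[Dict[str, Any]]:
    """Keep only subtitle records whose confidence is at least min_confidence."""
    return [
        record
        for record in result.get("subtitles", [])
        if record.get("confidence", 0.0) >= min_confidence
    ]


def mean_confidence(records: List[Dict[str, Any]]) -> Optional[float]:
    """Average confidence of the given records, or None for an empty list."""
    if not records:
        return None
    return sum(record["confidence"] for record in records) / len(records)
```

For example, `filter_by_confidence(load_result("results/output.json"), 0.9)` would keep only high-confidence lines; the file name here is a placeholder, since the actual output file name depends on the extractor.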