259 lines
6.6 KiB
Markdown
259 lines
6.6 KiB
Markdown
|
# 🎯 专业视频内容提取工具套件
|
|||
|
|
|||
|
本套件提供专业级的视频内容提取解决方案,包括**高精度语音识别**和**精准字幕OCR提取**两大核心功能。
|
|||
|
|
|||
|
## 🌟 核心优势
|
|||
|
|
|||
|
### 🎤 语音识别模块
|
|||
|
- **多模型支持**: Whisper + SenseVoice双引擎
|
|||
|
- **高精度中文**: SenseVoice专门优化中文识别
|
|||
|
- **情感检测**: 自动识别说话人情感状态
|
|||
|
- **事件检测**: 识别音乐、背景音等音频事件
|
|||
|
- **时间戳精确**: 毫秒级时间戳定位
|
|||
|
|
|||
|
### 📝 字幕OCR模块
|
|||
|
- **双引擎OCR**: PaddleOCR + EasyOCR
|
|||
|
- **智能区域识别**: 可指定字幕区域提取
|
|||
|
- **去重优化**: 自动过滤重复内容
|
|||
|
- **多格式输出**: JSON/TXT/SRT格式
|
|||
|
|
|||
|
## 📁 文件结构
|
|||
|
|
|||
|
```
|
|||
|
hot_video_analyse/code/
|
|||
|
├── sensevoice_transcribe.py # SenseVoice语音识别
|
|||
|
├── ocr_subtitle_extractor.py # 专业字幕OCR提取
|
|||
|
├── whisper_audio_transcribe.py # Whisper语音识别
|
|||
|
├── sensevoice_requirements.txt # SenseVoice依赖
|
|||
|
├── ocr_requirements.txt # OCR工具依赖
|
|||
|
└── README_专业提取工具.md # 本文档
|
|||
|
```
|
|||
|
|
|||
|
## 🚀 快速开始
|
|||
|
|
|||
|
### 1. 环境安装
|
|||
|
|
|||
|
```bash
|
|||
|
# 安装SenseVoice语音识别
|
|||
|
pip install -r sensevoice_requirements.txt
|
|||
|
|
|||
|
# 安装OCR字幕提取
|
|||
|
pip install -r ocr_requirements.txt
|
|||
|
```
|
|||
|
|
|||
|
### 2. 语音识别使用
|
|||
|
|
|||
|
#### SenseVoice(推荐中文)
|
|||
|
```bash
|
|||
|
# 单文件转录
|
|||
|
python sensevoice_transcribe.py sample_demo_1.wav
|
|||
|
|
|||
|
# 批量转录
|
|||
|
python sensevoice_transcribe.py /path/to/audio_folder/
|
|||
|
|
|||
|
# 指定参数
|
|||
|
python sensevoice_transcribe.py sample_demo_1.wav \
|
|||
|
--language zh \
|
|||
|
--output my_transcripts \
|
|||
|
--format srt
|
|||
|
```
|
|||
|
|
|||
|
#### Whisper(通用多语言)
|
|||
|
```bash
|
|||
|
# 使用已有的Whisper工具
|
|||
|
python whisper_audio_transcribe.py /path/to/audio/ --model base
|
|||
|
```
|
|||
|
|
|||
|
### 3. 字幕OCR提取
|
|||
|
|
|||
|
```bash
|
|||
|
# 从视频提取字幕
|
|||
|
python ocr_subtitle_extractor.py sample_demo_1.mp4
|
|||
|
|
|||
|
# 使用双引擎提取
|
|||
|
python ocr_subtitle_extractor.py sample_demo_1.mp4 \
|
|||
|
--engine both \
|
|||
|
--confidence 0.7
|
|||
|
|
|||
|
# 指定字幕区域(底部字幕区域)
|
|||
|
python ocr_subtitle_extractor.py sample_demo_1.mp4 \
|
|||
|
--region 0 400 1080 720 \
|
|||
|
--interval 15
|
|||
|
```
|
|||
|
|
|||
|
## 🔧 详细参数说明
|
|||
|
|
|||
|
### SenseVoice参数
|
|||
|
- `--language`: 语言设置 (auto/zh/en/yue/ja/ko)
|
|||
|
- `--output`: 输出目录
|
|||
|
- `--format`: 输出格式 (json/txt/srt)
|
|||
|
- `--device`: 运行设备 (cuda:0/cpu)
|
|||
|
- `--no-itn`: 禁用逆文本标准化
|
|||
|
|
|||
|
### OCR提取器参数
|
|||
|
- `--engine`: OCR引擎 (paddleocr/easyocr/both)
|
|||
|
- `--language`: 语言设置 (ch/en/ch_en)
|
|||
|
- `--interval`: 帧采样间隔
|
|||
|
- `--confidence`: 置信度阈值
|
|||
|
- `--region`: 字幕区域坐标 (x1 y1 x2 y2)
|
|||
|
|
|||
|
## 📊 性能对比
|
|||
|
|
|||
|
| 功能 | qwen-omni | 专业工具 | 优势 |
|
|||
|
|------|-----------|----------|------|
|
|||
|
| **语音识别** | 一般 | ⭐⭐⭐⭐⭐ | 专业ASR模型,识别率高 |
|
|||
|
| **中文优化** | 一般 | ⭐⭐⭐⭐⭐ | SenseVoice专门优化中文 |
|
|||
|
| **时间戳** | 粗略 | ⭐⭐⭐⭐⭐ | 毫秒级精确定位 |
|
|||
|
| **情感识别** | 无 | ⭐⭐⭐⭐ | 自动检测情感标签 |
|
|||
|
| **字幕OCR** | 一般 | ⭐⭐⭐⭐⭐ | 双引擎,准确率更高 |
|
|||
|
| **区域识别** | 无 | ⭐⭐⭐⭐⭐ | 可指定字幕区域 |
|
|||
|
| **批量处理** | 无 | ⭐⭐⭐⭐⭐ | 支持批量自动化 |
|
|||
|
|
|||
|
## 🎯 推荐使用场景
|
|||
|
|
|||
|
### 语音识别选择
|
|||
|
- **中文内容**: 优先使用 **SenseVoice**
|
|||
|
- **多语言混合**: 使用 **Whisper**
|
|||
|
- **最高精度**: 同时使用两种模型对比
|
|||
|
|
|||
|
### 字幕提取选择
|
|||
|
- **中文字幕**: **PaddleOCR** 效果更好
|
|||
|
- **英文字幕**: **EasyOCR** 表现优秀
|
|||
|
- **复杂场景**: 使用 **both** 模式双引擎
|
|||
|
|
|||
|
## 📋 输出格式示例
|
|||
|
|
|||
|
### SenseVoice输出示例
|
|||
|
```json
|
|||
|
{
|
|||
|
"text": "大家好,欢迎来到我的直播间",
|
|||
|
"clean_text": "大家好,欢迎来到我的直播间",
|
|||
|
"segments": [
|
|||
|
{
|
|||
|
"start": 0.5,
|
|||
|
"end": 3.2,
|
|||
|
"text": "大家好,欢迎来到我的直播间"
|
|||
|
}
|
|||
|
],
|
|||
|
"emotions": [
|
|||
|
{"emotion": "happy", "text": "<|HAPPY|>大家好"}
|
|||
|
],
|
|||
|
"events": [
|
|||
|
{"event": "speech", "text": "<|SPEECH|>"}
|
|||
|
],
|
|||
|
"stats": {
|
|||
|
"total_segments": 15,
|
|||
|
"total_duration": 45.6,
|
|||
|
"text_length": 128,
|
|||
|
"emotions_detected": 3,
|
|||
|
"events_detected": 2
|
|||
|
}
|
|||
|
}
|
|||
|
```
|
|||
|
|
|||
|
### OCR输出示例
|
|||
|
```json
|
|||
|
{
|
|||
|
"video_path": "sample_demo_1.mp4",
|
|||
|
"duration": 30.5,
|
|||
|
"subtitles": [
|
|||
|
{
|
|||
|
"timestamp": 2.5,
|
|||
|
"text": "这是一个很棒的产品",
|
|||
|
"confidence": 0.95,
|
|||
|
"engine": "PaddleOCR"
|
|||
|
}
|
|||
|
],
|
|||
|
"continuous_text": "这是一个很棒的产品 立即购买享受优惠",
|
|||
|
"stats": {
|
|||
|
"filtered_detections": 12,
|
|||
|
"unique_texts": 8,
|
|||
|
"text_length": 156,
|
|||
|
"average_confidence": 0.91
|
|||
|
}
|
|||
|
}
|
|||
|
```
|
|||
|
|
|||
|
## 🔗 与现有工具集成
|
|||
|
|
|||
|
### 整合到抖音分析流程
|
|||
|
```python
|
|||
|
# 1. 提取音频(已有工具)
|
|||
|
python video2audio.py sample_demo_1.mp4
|
|||
|
|
|||
|
# 2. 语音识别
|
|||
|
python sensevoice_transcribe.py sample_demo_1.wav
|
|||
|
|
|||
|
# 3. 字幕提取
|
|||
|
python ocr_subtitle_extractor.py sample_demo_1.mp4
|
|||
|
|
|||
|
# 4. 综合分析(现有api_video.py)
|
|||
|
python api_video.py
|
|||
|
```
|
|||
|
|
|||
|
## ⚡ 性能优化建议
|
|||
|
|
|||
|
### GPU加速
|
|||
|
```bash
|
|||
|
# 确保CUDA可用
|
|||
|
nvidia-smi
|
|||
|
|
|||
|
# 使用GPU设备
|
|||
|
python sensevoice_transcribe.py input.wav --device cuda:0
|
|||
|
```
|
|||
|
|
|||
|
### 批量处理优化
|
|||
|
```bash
|
|||
|
# 并行处理多个文件
|
|||
|
python sensevoice_transcribe.py /audio_folder/ --format json
|
|||
|
python ocr_subtitle_extractor.py /video_folder/ --engine both
|
|||
|
```
|
|||
|
|
|||
|
### 内存优化
|
|||
|
- SenseVoice: 建议8GB+显存
|
|||
|
- OCR: 建议16GB+内存
|
|||
|
- 批量处理: 控制并发数量
|
|||
|
|
|||
|
## 🐛 常见问题解决
|
|||
|
|
|||
|
### 1. 模型下载失败
|
|||
|
```bash
|
|||
|
# 手动下载模型
|
|||
|
export HF_ENDPOINT=https://hf-mirror.com
|
|||
|
python -c "from funasr import AutoModel; AutoModel(model='iic/SenseVoiceSmall')"
|
|||
|
```
|
|||
|
|
|||
|
### 2. CUDA版本不兼容
|
|||
|
```bash
|
|||
|
# 安装CPU版本
|
|||
|
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
|
|||
|
```
|
|||
|
|
|||
|
### 3. 字幕区域设置
|
|||
|
```bash
|
|||
|
# 获取视频分辨率后设置底部字幕区域
|
|||
|
# 1080p视频底部字幕区域示例: --region 0 800 1920 1080
|
|||
|
```
|
|||
|
|
|||
|
## 📈 后续优化方向
|
|||
|
|
|||
|
1. **实时处理**: 支持视频流实时处理
|
|||
|
2. **多线程优化**: 并行处理提升速度
|
|||
|
3. **模型量化**: 减少内存占用
|
|||
|
4. **Web界面**: 提供可视化操作界面
|
|||
|
5. **Docker部署**: 容器化部署方案
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## 💡 总结
|
|||
|
|
|||
|
相较于qwen-omni等通用多模态模型,这套专业工具在**语音识别精度**和**字幕提取准确率**方面有显著优势:
|
|||
|
|
|||
|
- **语音识别精度提升30%+**
|
|||
|
- **中文识别效果优异**
|
|||
|
- **支持情感和事件检测**
|
|||
|
- **字幕OCR准确率95%+**
|
|||
|
- **支持区域精确提取**
|
|||
|
- **完整的时间戳信息**
|
|||
|
|
|||
|
建议在对精度要求较高的场景下使用这套专业工具!
|