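Preface

LLM stands for "Large Language Model".

VLM stands for "Visual Language Model" (also commonly written "Vision-Language Model").

A multimodal large model is an AI model that can process and fuse data from multiple modalities.

A modality is a way of expressing or perceiving something; each source or form of information can be called a modality.

Common modalities include text (Verbal), speech (Vocal), and vision (Visual, which covers images, video, and so on). Others exist, such as touch and smell, but current multimodal large model research focuses mainly on fusing text, speech, and vision.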

Llama

Llama 2 is an open-source large language model (LLM) developed by Meta.

Multimodal Llama 3 models

On July 23, 2024, Meta released the open-source Llama 3.1 405B model.

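The later Llama 3.2 family includes four models:
Llama 3.2 90B Vision (text + image input): the most capable model in the Llama 3.2 family and a good fit for enterprise applications. It is strong at general knowledge, long-form text generation, multilingual translation, coding, math, and advanced reasoning. It also adds image reasoning, so it can handle image understanding and visual reasoning tasks. Typical use cases: image captioning, image-text retrieval, visual grounding, visual question answering and visual reasoning, and document visual question answering.
Llama 3.2 11B Vision (text + image input): well suited to content creation, conversational AI, language understanding, and enterprise applications that need visual reasoning. It performs well at text summarization, sentiment analysis, code generation, and instruction following, and adds image reasoning. Its use cases match those of the 90B version: image captioning, image-text retrieval, visual grounding, visual question answering and visual reasoning, and document visual question answering.
Llama 3.2 3B (text input): designed for applications that need low-latency inference and have limited compute. It is good at text summarization, classification, and translation. Typical use cases: mobile AI writing assistants and customer service applications.
Llama 3.2 1B (text input): the lightest model in the Llama 3.2 family, well suited to retrieval and summarization on edge devices and in mobile apps. Typical use cases: personal information management and multilingual knowledge retrieval.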
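
The trade-off described above (image input versus model size) can be sketched as a small lookup. This is only an illustration: the `Llama32Variant` class and the `pick_variant` helper are hypothetical names introduced here, and the only data used is the parameter count and input type stated in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Llama32Variant:
    name: str
    params_b: int         # parameter count in billions
    accepts_images: bool  # True for the Vision variants


# The four variants listed above
VARIANTS = [
    Llama32Variant("Llama 3.2 1B", 1, False),
    Llama32Variant("Llama 3.2 3B", 3, False),
    Llama32Variant("Llama 3.2 11B Vision", 11, True),
    Llama32Variant("Llama 3.2 90B Vision", 90, True),
]


def pick_variant(needs_images: bool, max_params_b: int):
    """Return the largest variant that fits the size budget, or None."""
    candidates = [
        v for v in VARIANTS
        if v.params_b <= max_params_b and (v.accepts_images or not needs_images)
    ]
    return max(candidates, key=lambda v: v.params_b, default=None)
```

For example, `pick_variant(needs_images=True, max_params_b=20)` selects the 11B Vision model, while an image task with a 5B budget returns `None` because neither small model accepts images.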

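Deploying MiniCPM-V

MiniCPM-V 2.6: the latest and best-performing model in the MiniCPM-V series at the time of writing. It has 8B parameters in total, and its single-image, multi-image, and video understanding surpasses GPT-4V. On single-image understanding it outperforms commercial closed-source models such as GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet, and it further improves on MiniCPM-Llama3-V 2.5 in OCR, trustworthy behavior, multilingual support, and on-device deployment.

Official documentation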

https://gitcode.com/gh_mirrors/mi/MiniCPM-V/overview?utm_source=highlight_word_gitcode&word=minicpm-v

Python environment

Setting up the Python environment

https://www.psvmc.cn/article/2024-09-29-python-miniconda-pipenv.html

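Installing MiniCPM-V

Download the code

cd /data
git clone https://gitee.com/empty-snow/MiniCPM-V.git

Create the virtual environment

conda create -n MiniCPMV python=3.10 -y

Change into the project directory

cd MiniCPM-V

Activate the Python virtual environment

conda activate MiniCPMV

Install the dependencies

# Install the requirements.txt dependencies inside the MiniCPMV environment
pip install -r requirements.txt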
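
After the install, a quick sanity check of the active environment can save a failed run later. The sketch below is a hypothetical helper (`check_environment` is not part of MiniCPM-V). It assumes that `torch` and `transformers` are among the packages pulled in by requirements.txt; adjust the list to match the actual file.

```python
import sys
from importlib.util import find_spec


def check_environment(required_python=(3, 10), packages=("torch", "transformers")):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Compare only major.minor, which is what `conda create python=3.10` pins
    if tuple(sys.version_info[:2]) != tuple(required_python):
        found = ".".join(str(n) for n in sys.version_info[:2])
        wanted = ".".join(str(n) for n in required_python)
        problems.append(f"Python {found} is active, expected {wanted}")
    # find_spec returns None when a top-level package cannot be located
    for name in packages:
        if find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running it inside the activated MiniCPMV environment should print nothing if the interpreter and packages match these assumptions.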

Download the pretrained model

Install ModelScope

pip install modelscope

Create a Python download script

modelscope_download.py

# Download the model through the ModelScope Python API
from modelscope import snapshot_download
model_dir = snapshot_download('openbmb/MiniCPM-V-2_6', cache_dir='./', revision='master')

Download the model

python modelscope_download.py
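
Once the download finishes, it can help to confirm that the files actually landed on disk. The following is a minimal sketch with hypothetical helpers (`summarize_dir`, `has_config`). It assumes the checkpoint ships a `config.json`, as Hugging Face style checkpoints usually do, and that you pass in the `model_dir` value returned by `snapshot_download`.

```python
from pathlib import Path


def summarize_dir(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def has_config(root):
    """True if a config.json exists anywhere under root (assumed completeness marker)."""
    return any(Path(root).rglob("config.json"))
```

A typical use would be to add `print(model_dir, summarize_dir(model_dir), has_config(model_dir))` at the end of modelscope_download.py.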

Run

# Run the web_demo_2.6.py script
python web_demo_2.6.py
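
Besides the web demo, the model can also be queried from a script. The sketch below is not taken from this article: the `ask` and `build_msgs` helpers are hypothetical, and the `model.chat(image=None, msgs=..., tokenizer=...)` call is an assumption based on the usage shown on the MiniCPM-V 2.6 model card. That method comes from the model's remote code, so check it against the version you downloaded. A CUDA GPU is assumed.

```python
def build_msgs(image, question):
    """Build a single-turn message list; the image and the text share one content list."""
    return [{"role": "user", "content": [image, question]}]


def ask(image_path, question, model_path="openbmb/MiniCPM-V-2_6"):
    """Load the model and answer one question about one image (assumes a CUDA GPU)."""
    # Heavy imports stay local so that importing this file remains cheap
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    # chat() is defined by the model's remote code, not by transformers itself
    return model.chat(image=None, msgs=build_msgs(image, question), tokenizer=tokenizer)
```

`model_path` can be either the hub ID or the local directory created by the download step.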

Model training

[
    {
        "id": "0",
        "image": "path/to/image_0.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "<image>\nHow many desserts are on the white plate?"
            },
            {
                "role": "assistant",
                "content": "There are three desserts on the white plate."
            },
            {
                "role": "user",
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant",
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            },
            {
                "role": "user",
                "content": "What is the setting of the image?"
            },
            {
                "role": "assistant",
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]
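
Because a malformed file tends to fail only after training has started, it may be worth validating the data first. The sketch below uses hypothetical helpers (`validate_records`, `validate_file`). The rules it checks, namely the three keys and strictly alternating user/assistant turns, are inferred from the sample above and are not an official schema.

```python
import json


def validate_records(records):
    """Return a list of error strings; an empty list means the records look well formed."""
    errors = []
    for i, rec in enumerate(records):
        # Each sample in the example carries these three keys
        for key in ("id", "image", "conversations"):
            if key not in rec:
                errors.append(f"record {i}: missing '{key}'")
        # Turns in the example alternate, starting with the user
        for j, turn in enumerate(rec.get("conversations", [])):
            expected = "user" if j % 2 == 0 else "assistant"
            if turn.get("role") != expected:
                errors.append(f"record {i}, turn {j}: expected role '{expected}'")
    return errors


def validate_file(path):
    """Parse a JSON file strictly and validate its records."""
    with open(path, encoding="utf-8") as f:
        return validate_records(json.load(f))
```

`json.load` is strict, so single-quoted strings or trailing commas are rejected at this stage rather than in the middle of a training run.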

In the fine-tuning script, set DATA and EVAL_DATA to the training JSON file and the test JSON file respectively:

MODEL="openbmb/MiniCPM-V-2_6"
DATA="path/to/training_data"
EVAL_DATA="path/to/test_data"
LLM_TYPE="qwen2"
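
If all samples start out in a single JSON file, one way to produce the two files that DATA and EVAL_DATA point to is a seeded random split. This is a sketch under assumptions: the helper names, the 10% evaluation fraction, and any file names you pass in are illustrative and not part of the MiniCPM-V scripts.

```python
import json
import random


def split_records(records, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, eval) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one evaluation sample whenever there is more than one record
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_eval:], shuffled[:n_eval]


def write_split(src_path, train_path, eval_path, eval_fraction=0.1, seed=0):
    """Read one JSON list and write the train and eval JSON files."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    train, held_out = split_records(records, eval_fraction, seed)
    for path, part in ((train_path, train), (eval_path, held_out)):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(part, f, ensure_ascii=False, indent=2)
    return len(train), len(held_out)
```

Fixing the seed keeps the split reproducible, so later evaluation runs compare against the same held-out samples.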

Train

sh finetune_ds.sh

Load the fine-tuned model

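from peft import PeftModel
from transformers import AutoModel

# Base model that the LoRA adapter was trained on
model_type = "openbmb/MiniCPM-V-2_6"  # or openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
# Directory that contains the fine-tuned adapter checkpoint
path_to_adapter = "path_to_your_fine_tuned_checkpoint"

# Load the base model; trust_remote_code is needed because the model class ships with the checkpoint
model = AutoModel.from_pretrained(
    model_type,
    trust_remote_code=True,
)

# Attach the LoRA adapter, switch to inference mode, and move the model to the GPU
lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True,
).eval().cuda()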

