分享如何使用vLLM部署DeepSeek。

访客 51分钟前 335 0

近期，许多企业在考虑数据隐私问题时，选择在内部部署私有化的大语言模型。常见的部署工具包括ollama、vllm、xinference、sglang和lm studio。其中，ollama和lm studio仅支持gguf类型量化的大语言模型，而vllm、xinference和sglang则支持pytorch或transformer类型的大模型，这些模型通常可以在huggingface上找到。ollama和lm studio适用于桌面显卡领域的个人电脑部署，而vllm、xinference和sglang则更适合服务器领域的部署。本文将重点介绍如何使用vllm部署和量化deepseek大语言模型，部署环境为4卡nvidia 2080ti，共约48g显存。

下载LLM模型

首先，我们需要下载所需的大语言模型。在国内，通常使用ModelScope下载，因为其速度快且稳定。我们使用ModelScope官方提供的工具modelscope来下载，它支持自动重连和断点续传功能。首先，我们需要切换到conda的base环境，并安装modelscope。

conda activate base
pip install modelscope

登录后复制

然后，我们访问ModelScope，找到要下载的模型，例如DeepSeek V2 Lite模型。

如何使用vLLM部署DeepSeek V2 Lite模型

拷贝模型的限定名称，并使用以下命令将其下载到当前目录。

modelscope download --model deepseek-ai/DeepSeek-V2-Lite-Chat --local_dir .

登录后复制

下载速度很快，约为30MB/s。

如何使用vLLM部署DeepSeek V2 Lite模型

安装vLLM推理引擎

接下来，创建vLLM的虚拟环境并激活。

conda create -n vllm python=3.11
conda activate vllm

登录后复制

配置国内源以加快安装速度。

conda config --show channels
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --set channel_priority flexible

登录后复制

然后，根据官方文档安装vLLM。

pip install vllm

登录后复制

注意，如果使用CUDA 11.8，可以使用以下命令安装vLLM。

# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

登录后复制

开始部署

使用以下命令开始部署DeepSeek V2 Lite Chat模型。

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --port 11434 --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --max-model-len 8192 --trust-remote-code --enforce_eager --dtype=half

登录后复制

需要特别说明的参数包括：

dtype - 数据类型，由于RTX 2080Ti仅支持半精度类型，因此必须指定为half。 max-model-len - 指定上下文长度，vLLM会自动预留KV Cache。虽然DeepSeek V2支持128K上下文，但这会占用大量显存，因此需要逐步尝试找到最佳上下文长度。 gpu-memory-utilization - 指定显存利用率，默认0.9，意味着最大可以使用48*0.9=43.2G显存。 tensor-parallel-size - 张量并行推理，如果单卡显存不足以承载大模型，可以启用此选项，根据显卡数量设置大小。

在尝试部署时，发现8K上下文导致显存不足，无法启动。通过将gpu-memory-utilization增大到0.95，可以启动并支持8K上下文，速度约为每秒15 tokens。

如何使用vLLM部署DeepSeek V2 Lite模型

使用Lora

如果在基础模型上进行微调，可以通过以下方式指定Lora模型。

vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/

登录后复制

使用--enable-lora --lora-modules {name}={lora-path}来指定Lora模型。在使用OAI兼容的接口请求时，必须将模型名称指定为Lora的模型名称。

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sql-lora",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq

登录后复制

量化DeepSeek Lite Chat模型

量化模型时，需要考虑显卡平台支持的量化类型。由于RTX 2080Ti是Turing架构，计算能力为7.5，不支持FP8量化。

如何使用vLLM部署DeepSeek V2 Lite模型

此处使用AWQ进行4bit量化。

pip install autoawq

登录后复制

还需要单独安装一个依赖，否则会报错。

pip install flash_attn

登录后复制

如果安装时找不到nvcc，可以执行以下命令找到nvcc路径并手动设置CUDA_HOME。

which nvcc

登录后复制

然后根据获得的地址手动设置CUDA_HOME并安装。

CUDA_HOME=/usr/local/cuda pip install flash_attn

登录后复制

编译wheel时可能需要较长时间。注意，量化时依赖可能与vLLM不一致，可以考虑建立两个虚拟环境。接下来使用以下代码开始量化。

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
<p>model_path = 'hub/deepseek-ai/DeepSeek-V2-Lite-Chat/'
quant_path = 'hub/deepseek-ai/DeepSeek-V2-Lite-Chat-awq-int4/'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }</p><h1>Load model</h1><p>model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)</p><h1>Quantize</h1><p>model.quantize(tokenizer, quant_config=quant_config)</p><h1>Save quantized model</h1><p>model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

登录后复制

如果显卡支持FP8量化，可以使用AutoFP8进行离线量化。

git clone <a href="https://www.php.cn/link/89b3f18cd4609f9af4d1aa05a3df378e">https://www.php.cn/link/89b3f18cd4609f9af4d1aa05a3df378e</a>
pip install -e AutoFP8

登录后复制

然后使用动态激活规模因子进行离线量化，不损失精度。

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig</p><p>pretrained_model_dir = "hub/deepseek-ai/DeepSeek-V2-Lite-Chat/"
quantized_model_dir = "hub/deepseek-ai/DeepSeek-V2-Lite-Chat-FP8/"</p><h1>Define quantization config with static activation scales</h1><p>quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")</p><h1>For dynamic activation scales, there is no need for calbration examples</h1><p>examples = []</p><h1>Load the model, quantize, and save checkpoint</h1><p>model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

登录后复制

总结

本文主要记录了我在RTX 2080Ti上部署DeepSeek V2 16B模型的过程，希望能为大家提供一个参考。更多的参数设置可以参考vLLM官方文档。此外，DeepSeek V2模型使用的MLA（Multi-head Latent Attention）目前vLLM尚未实现，但sglang最近实现了MLA，速度有了明显提升。下一篇文章我们将尝试使用sglang进行部署。

参考资料

[1] ModelScope: https://www.php.cn/link/6d9814b5207f1d3ff1d50bc3a89ac9b3

[2] DeepSeek V2 Lite模型: https://www.php.cn/link/6d9814b5207f1d3ff1d50bc3a89ac9b3/deepseek-ai/deepseek-v2-lite-chat

[3] 官方文档: https://www.php.cn/link/8596dd1dc67d1200fe0606146fcee1a4

[4] vLLM官方文档: https://www.php.cn/link/37c9c9e3401bacdf3fb42cb447dadb4b

以上就是如何使用vLLM部署DeepSeek V2 Lite模型的详细内容，更多请关注楠楠科技社其它相关文章！

标签： #如何使用 #模型 #vLLM