DeepSeek-V4-Pro Deployment Guide: Full Hardware Configurations for H100 / H200 / B200 / B300 / GB200 / GB300

张开发
2026/4/28 3:43:28 · 15 min read


> Source: official configurations at recipes.vllm.ai

DeepSeek-V4-Pro is the flagship model of the DeepSeek V4 preview series: an MoE architecture with 1.6T total parameters and 49B active parameters, and a checkpoint of roughly 960 GB. Based on the official vLLM recipes, this guide walks through deployment on six mainstream GPU platforms.

## 1. Model Overview

| Metric | Value |
|---|---|
| Total parameters | 1.6 trillion (1600B) |
| Active parameters | 49 billion |
| Context length | up to 1,048,576 tokens (1M) |
| Precision | mixed FP4 / FP8 |
| Checkpoint size | ~960 GB |
| vLLM version | ≥ 0.20.1 |

## 2. Choosing a Docker Image

| Image | CUDA version | Platforms |
|---|---|---|
| vllm/vllm-openai:deepseekv4-cu129 | CUDA 12.9 | H100, H200 |
| vllm/vllm-openai:deepseekv4-cu130 | CUDA 13 | B200, GB200, B300, GB300 |

## 3. H200 Deployment: Single Node, TP8 + EP

Hardware: 1 node × 8× H200 (141 GB × 8 = 1128 GB)

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 800000 \
    --gpu-memory-utilization 0.95 \
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
```

## 4. B200 Deployment: Single Node, TP8 + EP

Hardware: 1 node × 8× B200 (180 GB × 8 = 1440 GB)

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --attention_config.use_fp4_indexer_cache True
```

## 5. GB200 NVL4 Deployment: Multi-Node DEP

Hardware: 2 trays × 4× GB200 = 8 GPUs

Note: a single tray's 768 GB is less than the 960 GB checkpoint, so two trays are required.

Preparation:

```bash
export HEAD_IP=192.168.1.100  # replace with Tray 0's actual IP
```

Tray 0 (Head):

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 8 \
    --data-parallel-size-local 4 \
    --data-parallel-address $HEAD_IP \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}' \
    --attention_config.use_fp4_indexer_cache True
```
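A note on the rank layout: with `--data-parallel-size 8` and `--data-parallel-size-local 4`, each tray hosts a contiguous block of data-parallel ranks, which is why the worker tray passes `--data-parallel-start-rank 4`. A minimal sketch of the arithmetic (`ranks_for_tray` is a hypothetical helper for illustration, not a vLLM tool):

```shell
# Each tray owns DP ranks [tray * local_size, (tray + 1) * local_size).
# Hypothetical helper, not part of the recipe.
ranks_for_tray() {
  tray=$1
  local_size=$2
  start=$(( tray * local_size ))
  echo "tray $tray: ranks $start..$(( start + local_size - 1 ))"
}

ranks_for_tray 0 4   # tray 0: ranks 0..3 (no --data-parallel-start-rank needed)
ranks_for_tray 1 4   # tray 1: ranks 4..7 (--data-parallel-start-rank 4)
```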
Tray 1 (Worker):

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 8 \
    --data-parallel-size-local 4 \
    --data-parallel-address $HEAD_IP \
    --data-parallel-start-rank 4 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}' \
    --attention_config.use_fp4_indexer_cache True
```

## 6. B300 Deployment: Single Node, TP8 + EP

Hardware: 1 node × 8× B300 (268 GB × 8 = 2144 GB)

Note: this is the roomiest single-node option; no multi-node setup is needed.

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --attention_config.use_fp4_indexer_cache True
```

## 7. GB300 NVL4 Deployment: Multi-Node DEP

Hardware: 2 trays × 4× GB300 = 8 GPUs

Tray 0 (Head):

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 8 \
    --data-parallel-size-local 4 \
    --data-parallel-address $HEAD_IP \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}' \
    --attention_config.use_fp4_indexer_cache True
```

Tray 1 (Worker):

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 8 \
    --data-parallel-size-local 4 \
    --data-parallel-address $HEAD_IP \
    --data-parallel-start-rank 4 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}' \
    --attention_config.use_fp4_indexer_cache True
```
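The recurring single-node vs multi-node decision (one GB200 tray's 768 GB, or one H100 node's 640 GB, against the ~960 GB checkpoint) reduces to a capacity check. A rough back-of-the-envelope sketch; the 90% usable-memory figure is our own assumption, not an official vLLM sizing rule:

```shell
# Does total VRAM (GB) fit the ~960 GB checkpoint with ~10% headroom
# for KV cache and activations? Rough heuristic, for illustration only.
fits_checkpoint() {
  total_gb=$1
  usable=$(( total_gb * 90 / 100 ))
  if [ "$usable" -ge 960 ]; then echo "fits"; else echo "needs multi-node"; fi
}

fits_checkpoint 1128   # one H200 node -> fits
fits_checkpoint 768    # one GB200 tray -> needs multi-node
fits_checkpoint 640    # one H100 node -> needs multi-node
```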
## 8. H100 Deployment: Multi-Node DEP

Hardware: 2 nodes × 8× H100 = 16 GPUs

Note: a single node's 640 GB is less than the 960 GB checkpoint, so multi-node is mandatory.

Preparation (run on both machines):

```bash
# 1. Set the head node IP (Node 0's address)
export HEAD_IP=192.168.1.100  # replace with Node 0's actual IP

# 2. Pull the Docker image
docker pull vllm/vllm-openai:deepseekv4-cu129

# 3. Prepare the model directory
mkdir -p ~/.cache/huggingface

# 4. NCCL networking environment
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0  # replace with your actual RDMA NIC
```

Node 0 (Master / Head):

```bash
docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address $HEAD_IP \
    --max-model-len 800000 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 512 \
    --max-num-batched-tokens 512 \
    --no-enable-flashinfer-autotune \
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
```

Node 1 (Worker):

```bash
docker run --gpus all \
  --privileged --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --data-parallel-hybrid-lb \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address $HEAD_IP \
    --data-parallel-start-rank 8 \
    --max-model-len 800000 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 512 \
    --max-num-batched-tokens 512 \
    --no-enable-flashinfer-autotune \
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
```

H100 startup order:

```bash
# Step 1: start Node 0 (master) first
ssh node0
cd /path/to/scripts
./start_master.sh

# Step 2: wait for Node 0 to come up (loading the 960 GB checkpoint takes ~5-10 minutes)
```
```bash
# Step 3: start Node 1 (worker)
ssh node1
cd /path/to/scripts
./start_worker.sh

# Step 4: verify the service
ssh node0
curl http://localhost:8000/v1/models
```

## 9. Platform Comparison

| Platform | Nodes | GPUs/node | Total GPUs | Total VRAM | Strategy | Docker |
|---|---|---|---|---|---|---|
| H200 | 1 | 8 | 8 | 1128 GB | TP8 + EP | cu129 |
| B200 | 1 | 8 | 8 | 1440 GB | TP8 + EP | cu130 |
| GB200 | 2 | 4 | 8 | 1536 GB | Multi-Node DEP | cu130 |
| B300 | 1 | 8 | 8 | 2144 GB | TP8 + EP | cu130 |
| GB300 | 2 | 4 | 8 | 2304 GB | Multi-Node DEP | cu130 |
| H100 | 2 | 8 | 16 | 1280 GB | Multi-Node DEP | cu129 |

Key parameters:

| Parameter | H200/B200/B300 | GB200/GB300 | H100 |
|---|---|---|---|
| --tensor-parallel-size | 8 | - | - |
| --data-parallel-size | - | 8 | 16 |
| --data-parallel-size-local | - | 4 | 8 |
| --data-parallel-start-rank | - | 4 (worker tray) | 8 (worker node) |
| --max-model-len | 800K (H200) / unlimited | unlimited | 800K |
| --no-enable-flashinfer-autotune | ✅ (H200) | - | ✅ |

## 10. References

- vLLM official recipes: https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro
- Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
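For provisioning scripts, the platform-to-image mapping used throughout this guide can be condensed into a small lookup (`image_for` is our own helper name; the cu129/cu130 split follows the image-selection table above):

```shell
# Map a platform name to the Docker image tag used in this guide.
# Hypothetical helper, for illustration only.
image_for() {
  case "$1" in
    H100|H200)             echo "vllm/vllm-openai:deepseekv4-cu129" ;;
    B200|GB200|B300|GB300) echo "vllm/vllm-openai:deepseekv4-cu130" ;;
    *) echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

image_for H200    # vllm/vllm-openai:deepseekv4-cu129
image_for GB300   # vllm/vllm-openai:deepseekv4-cu130
```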
