# Kubernetes and Machine Learning Workloads: A Complete Guide from Distributed Training to Model Deployment

张开发
2026/4/28 12:12:22 · 15 min read

**A hardcore opening.** Fellow engineers, today we're talking about Kubernetes and machine learning workloads. If your model training still runs on a single machine, that's hardly modern: in the cloud-native era, Kubernetes has become the best platform for ML workloads. From distributed training to model deployment, from GPU management to autoscaling, every step needs careful design. Today susu will walk you through best practices for ML workloads on Kubernetes from a hands-on perspective, so your training is both efficient and reliable.

## 1. Types of ML workloads on Kubernetes

- **Model training**: distributed training, hyperparameter tuning
- **Model inference**: online inference, batch inference
- **Data processing**: data preprocessing, feature engineering
- **Model management**: model versioning, model registry

## 2. Preparing the Kubernetes cluster

### 2.1 Installing GPU support

```bash
# Install the NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

# Verify GPU availability (note the escaped dot in the nvidia.com/gpu key)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

### 2.2 Installing the necessary tools

```bash
# Install Kubeflow (the KfDef config is applied with the kfctl CLI, not plain kubectl)
kfctl apply -V -f https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_k8s_istio.v1.2.0.yaml

# Install mpi-operator
helm repo add mpi-operator https://kubeflow.github.io/mpi-operator
helm install mpi-operator mpi-operator/mpi-operator

# Install tf-operator
helm repo add kubeflow https://kubeflow.github.io/helm-charts
helm install tf-operator kubeflow/tf-operator
```

## 3. Distributed training

### 3.1 Distributed TensorFlow training

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-training
  namespace: default
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            command:
            - python
            - /app/train.py
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: training-code
              mountPath: /app
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-data
          - name: training-code
            configMap:
              name: training-code
    PS:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: 1
                memory: 4Gi
```

### 3.2 Distributed PyTorch training

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: pytorch-training
  namespace: default
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpioperator/pytorch:latest
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "3"
            - --bind-to
            - none
            - --map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - python
            - /app/train.py
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpioperator/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: training-code
              mountPath: /app
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-data
          - name: training-code
            configMap:
              name: training-code
```

## 4. Model deployment

### 4.1 Deploying a model service

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-server
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mymodel
        volumeMounts:
        - name: model-storage
          mountPath: /models/mymodel
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: model-serving
  namespace: default
spec:
  selector:
    app: model-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: ClusterIP
```

### 4.2 Deploying a model with Seldon Core

```bash
# Install Seldon Core
helm repo add seldonio https://storage.googleapis.com/seldon-charts
helm install seldon-core seldonio/seldon-core-operator --namespace seldon-system --create-namespace

# Deploy the model
kubectl apply -f model-deployment.yaml
```

```yaml
# model-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
  namespace: default
spec:
  predictors:
  - name: default
    replicas: 3
    graph:
      name: model
      implementation: MODEL_SERVER  # placeholder: use a concrete server type such as TENSORFLOW_SERVER or SKLEARN_SERVER
      modelUri: gs://my-model-bucket/model
      env:
      - name: MODEL_NAME
        value: mymodel
```
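With the Deployment and Service above in place, clients inside the cluster can call TensorFlow Serving's REST predict endpoint at the service's DNS name. Here is a minimal sketch of building such a request in Python — the host `model-serving`, port `8501`, and model name `mymodel` come from the manifests above, while the sample instance values are made up for illustration:

```python
import json

def predict_request(host: str, port: int, model: str, instances: list):
    """Build the URL and JSON body for TF Serving's REST predict API."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body

# Values match the Deployment/Service above: MODEL_NAME=mymodel, port 8501.
url, body = predict_request("model-serving", 8501, "mymodel", [[1.0, 2.0, 3.0]])
print(url)  # http://model-serving:8501/v1/models/mymodel:predict

# To actually send it from inside the cluster (needs network access):
# import urllib.request
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```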
## 5. Autoscaling

### 5.1 Scaling on CPU/memory utilization

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```

### 5.2 Scaling on custom metrics

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests-per-second
      target:
        type: AverageValue
        averageValue: "100"
```

## 6. Data management

### 6.1 Data storage

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard
```

### 6.2 Data preprocessing

Preprocessing runs as a one-off `batch/v1` Job that reads raw data from the PVC above and writes the processed output for training to consume.

## 7. Monitoring and logging

### 7.1 Monitoring training jobs

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: training-jobs
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: training
  endpoints:
  - port: metrics
    interval: 15s
```

### 7.2 Monitoring the model service

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-serving
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: model-serving
  endpoints:
  - port: metrics
    interval: 15s
```

## 8. Best practices

### 8.1 Training jobs

- **Use StatefulSets** for training jobs that need stable storage
- **Set resource limits**: size CPU, memory, and GPU requests sensibly
- **Use node affinity** to schedule training jobs onto suitable nodes
- **Set PodDisruptionBudgets** to keep training jobs stable

### 8.2 Model deployment

- **Use Deployments** for easy horizontal scaling
- **Configure health checks** to keep the service available
- **Use a service mesh** for traffic management and observability
- **Implement blue-green deployments** for seamless model updates

### 8.3 Resource management

- **GPU resource management**: allocate GPU resources deliberately
- **Node pools**: create dedicated pools for different workload types
- **Resource quotas**: set namespace-level resource limits
- **Pod priority**: guarantee resources for critical workloads
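The PodDisruptionBudget practice mentioned above can be sketched as follows — note that the `minAvailable` value and the `app: training` label selector are illustrative assumptions, not from the original manifests; match the selector to your own training pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
  namespace: default
spec:
  minAvailable: 2        # keep at least 2 training pods up through voluntary disruptions
  selector:
    matchLabels:
      app: training      # assumed label on the training pods
```

With this in place, node drains and similar voluntary evictions will not take the training job below its minimum replica count.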
## 9. Hands-on: a complete ML workflow

### 9.1 Data preprocessing

As in section 6.2, preprocessing runs as a `batch/v1` Job, this time in the `ml-workloads` namespace, writing its output to the `processed-data` PVC that the training job below consumes.

### 9.2 Distributed training

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: pytorch-training
  namespace: ml-workloads
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpioperator/pytorch:latest
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "4"
            - --bind-to
            - none
            - --map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - python
            - /app/train.py
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpioperator/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: processed-data
              mountPath: /data
            - name: training-code
              mountPath: /app
          volumes:
          - name: processed-data
            persistentVolumeClaim:
              claimName: processed-data
          - name: training-code
            configMap:
              name: training-code
```

### 9.3 Model deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
  namespace: ml-workloads
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-server
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mymodel
        volumeMounts:
        - name: model-storage
          mountPath: /models/mymodel
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: model-serving
  namespace: ml-workloads
spec:
  selector:
    app: model-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: LoadBalancer
```

### 9.4 Autoscaling

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
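The HPA manifests above scale on average utilization. The core of what the controller computes is a simple ratio; here is a simplified sketch of that rule (the real controller also applies a tolerance band, stabilization windows, and pod-readiness handling, which this sketch omits):

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Simplified HPA rule:
    desired = ceil(currentReplicas * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas] as in the manifests above."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas running at 140% average CPU against the 70% target -> scale out to 6
print(desired_replicas(3, 140, 70))  # 6
# 3 replicas at 20% -> scale in, but never below minReplicas
print(desired_replicas(3, 20, 70))   # 1
```

This is why `maxReplicas` matters for model serving: a traffic spike can otherwise multiply the replica count in a single reconciliation.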
## Best practices checklist

- **Cluster setup**: create dedicated node pools for ML workloads; install GPU drivers and the device plugin; provision enough storage capacity
- **Training jobs**: use a distributed training framework; set resource limits sensibly; manage stateful training jobs with StatefulSets; persist training data
- **Model deployment**: deploy model services with Deployments; configure liveness and readiness probes; implement autoscaling; manage traffic with a service mesh
- **Data management**: manage data with PersistentVolumeClaims; automate data preprocessing; consider object storage services
- **Monitoring and logging**: track training progress and resource usage; monitor model-service performance and availability; centralize logs
- **Resource management**: allocate GPU resources deliberately; use node affinity and anti-affinity; set resource quotas and limits
- **Security**: restrict container privileges; keep sensitive data in Secrets; configure network policies

## Summary

Kubernetes has become an ideal platform for machine learning workloads. After working through this article you should have a handle on:

- configuring and managing distributed training
- best practices for model deployment
- implementing autoscaling
- data management and processing
- monitoring and logging
- resource management and security configuration

Remember: ML workloads on Kubernetes need tuning for your actual requirements. In production, combine the characteristics of your models with your business needs to choose the right deployment strategy, so your ML workloads run efficiently and reliably.

**susu's closing notes**: GPU resources are precious, so allocate and use them carefully; distributed training can dramatically speed up model training; model deployment has to balance performance and availability; data management is a key link in the ML workflow; monitoring and logs are essential for troubleshooting; and never skip security configuration, especially when handling sensitive data. If this was useful, leave a like before you go — see you next time!
