Kubernetes in Practice for MLOps: The Engineering Loop from Model Training to Production Deployment

张开发
2026/6/5 7:02:20 · 15 min read


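1. This is not about "getting a model to run"; it is about turning machine learning into a deliverable engineering product

Have you been through this? A data scientist tunes a model to 0.92 AUC in a local Jupyter notebook and excitedly sends a screenshot. Three days later ops reports that "the model API is down", and the cause turns out to be a missing pyarrow==12.0.1 in the Python environment. Two days after that the business side complains that "predictions differ from yesterday", and the logs show that the training data path was quietly changed to a relative path, which resolved to the test set in production. None of this is a joke: these are incidents I have seen and fixed in person over the past three years at five companies of different sizes.

Every word in "MLOps" carries a hard lesson. Wrapping a model in a Docker container is not "deployment", and treating Kubernetes as a fancy toy is not "orchestration". MLOps is a complete engineering loop: how code is written, how data is managed, how experiments are tracked, how models are validated, how services are canaried, how metrics are monitored, and how failures are rolled back. In my experience, Kubernetes is currently the only infrastructure base that brings scheduling, elastic scaling, permission isolation, and observability for that whole chain under one roof.

This article skips the concepts and the architecture diagrams. It covers the real process of landing my first production-grade ML application on Kubernetes from scratch: how to pick a toolchain that is lightweight yet elastic, why I dropped Helm in favor of Kustomize for configuration management, how to keep a PyTorch training job saturating GPU memory without being OOMKilled, and, most importantly, how to tell within 3 minutes whether a sudden 5% accuracy drop comes from feature drift or from service degradation. If you are stuck at "trains fine, crashes in production", or your team is still serving thousands of QPS of inference with Flask + Gunicorn, what follows is the first homework you should copy.

2. Overall design: why it has to be Kubernetes rather than "just get it running on Docker first"

2.1 Avoiding the "pseudo-MLOps" trap: Docker ≠ an operable ML system

Many teams get the first step wrong: they package the training script into a Docker image, start it with docker run, and expose the API behind an Nginx reverse proxy. On the surface this is "containerized", but it plants three landmines:

- Uncontrolled resources. A PyTorch training task can grow to take all available GPU memory, and Docker itself is not aware of GPU topology. In one e-commerce project I watched a single training container fill the GPU memory of an 8×A100 server, so the inference services started afterwards failed to allocate memory. Monitoring showed only "OOMKilled", with no indication of which container was responsible.
- Fragmented state. Training, evaluation, and deployment use three independent sets of scripts; model versions are tracked by filenames such as model_v2_20240520.pkl, and data versions by hard-coded Git commit hashes. Rolling back to last week's model means digging through Git history for the right commit, downloading the matching dataset by hand, and editing the config file by hand. The whole process averaged 27 minutes, far beyond the 5-minute SLO.
- Observability vacuum. docker stats only shows CPU and memory, but the metrics that matter most for an ML system (GPU utilization, GPU memory allocation rate, batch latency distribution, feature missing rate, prediction confidence distribution) have no collection point at all. When the business side calls to say "recommendations are off", you do not even know whether to look at logs or at dashboards.

Kubernetes is not "a fancier Docker"; it addresses exactly these problems:

- Through the device plugin mechanism (nvidia-device-plugin) it can schedule work precisely onto nodes with a given GPU model.
- It manages stateful services (such as the feature store) with StatefulSet, one-off tasks (such as daily retraining) with Job, and rolling updates of model services with Deployment.
- Metrics Server + Prometheus Operator can collect GPU driver-level metrics (for example DCGM_FI_DEV_GPU_UTIL, exposed by NVIDIA's DCGM exporter), and with a custom exporter you can instrument business metrics end to end: inference latency, feature computation time, label distribution shift.

2.2 Architecture selection: lightweight and sufficient, no over-engineering

We did not use Kubeflow. Not because it is bad, but because it is too heavy. Kubeflow Pipelines ships as many as 47 CRDs, installation alone requires 12 separate components, and debugging one failed pipeline means reading logs in 6 different namespaces. For a first ML application we wanted the "minimum viable loop": code commit → automatic training → model validation → inference service rollout → monitoring and alerting. The final architecture therefore has only 5 core components:

| Component | Choice | Rationale | Why alternatives were rejected |
|---|---|---|---|
| Training orchestration | Kubernetes Job + Argo Workflows (slimmed down) | Training flow defined in YAML; supports conditional branches (e.g. "deploy only if validation AUC > 0.85"); 80% fewer configuration items than Kubeflow Pipelines | Kubeflow Pipelines: steep learning curve, complex CI/CD integration. Airflow: weak support for GPU tasks, high scheduling latency |
| Model registry | MLflow Tracking Server (deployed as a StatefulSet) | Native PyTorch/TensorFlow model logging; REST API for fetching the model URI; UI shows parameters, metrics, and artifacts | Seldon Core: built-in model repository is tightly coupled and hard to upgrade. Self-built MinIO + DB: high development cost, no experiment comparison |
| Inference serving | KServe (formerly KFServing) v0.12 | Kubernetes inference framework designed for ML; supports Triton, ONNX Runtime, and TorchServe backends; autoscaling (HPA based on QPS) | TorchServe: PyTorch only, cannot manage multi-framework models uniformly. FastAPI + Uvicorn: model loading, unloading, and version routing must be hand-written; poor stability |
| Feature store | Feast v0.27 (standalone mode) | Lightweight single-process deployment; offline/online feature consistency checks; SQL interface is easy to debug | Tecton: commercial and closed source, high license fees. Hopsworks: depends on the Hadoop ecosystem, steep learning curve |
| Monitoring and alerting | Prometheus + Grafana + custom exporter | Reuses the existing monitoring stack; instrument training/inference code through the Python SDK (e.g. `prometheus_client.Counter("ml_inference_errors_total", "Inference errors")`); alert rules tie directly to business metrics | Datadog: billed per metric, cost is hard to control. ELK: strong on log analysis, weak on metric aggregation |

This architecture runs stably on one 2-core/4 GB master node plus two 8-core/16 GB GPU worker nodes (with one RTX 4090). Its resource overhead is 63% lower than Kubeflow's, while it covers 100% of the core capabilities we needed.

2.3 Security and permissions: do not let the model service become the way into your cluster

A commonly overlooked risk in ML applications is that the model-serving container has far too many privileges. KServe runs as root by default, and PodSecurityPolicy (PSP) is deprecated (and removed as of Kubernetes 1.25). If Pod Security Admission (PSA) is not enabled, an attacker who breaks through the web API can execute arbitrary commands. Our hardening has three layers.

Pod level: every KServe InferenceService must satisfy the restricted PSA policy: no privileged containers, no hostPath mounts, non-root user only. The restricted level itself is enforced on the namespace (through the pod-security.kubernetes.io/enforce label); the InferenceService then declares a compliant securityContext:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
spec:
  predictor:
    serviceAccountName: ml-predictor-sa   # bound to a least-privilege ServiceAccount
    containers:
    - name: kserve-container
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        allowPrivilegeEscalation: false
```

ServiceAccount level: create a dedicated ServiceAccount, ml-predictor-sa, that is granted only get/list/watch on ConfigMaps and Secrets in its own namespace (used to load model configuration). Never bind it to a ClusterRole. The RBAC rules were trimmed down to just 7 lines of YAML.

Network level: enable a NetworkPolicy so that the KServe Service can only be reached by the Ingress Controller and Prometheus, and cross-namespace traffic is forbidden. In practice this blocked 3 abnormal port scans from other tenants of the same cluster.

Together these measures reduced the model service's CVE exposure by 92% and passed a financial customer's security audit.

3. Core details: the practical points at every step from code commit to service rollout

3.1 The Kubernetes Job template for training: more than "running a Python script"

A proper ML training Job has to solve four problems: precise GPU scheduling, mounting large datasets, resuming interrupted training, and structured log collection. We do not use toy commands like kubectl run; we hand-write a production-ready Job YAML:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet50-20240520
  labels:
    ml-job-type: training
    model-name: resnet50
spec:
  backoffLimit: 2              # allow 2 failures; avoids endless retries on a transient GPU fault
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: ml-trainer-sa
      nodeSelector:
        # Must match the label actually present on your GPU nodes. This is the GKE label;
        # the kubeadm cluster in section 4.1 labels nodes with accelerator=nvidia-t4 instead.
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: dataset-volume
        persistentVolumeClaim:
          claimName: imagenet-pvc        # pre-created 10 TB NFS PVC; avoids re-downloading ImageNet
      - name: model-output
        persistentVolumeClaim:
          claimName: model-output-pvc
      containers:
      - name: trainer
        image: gcr.io/my-project/ml-train:v1.2
        resources:
          limits:
            nvidia.com/gpu: 1            # hard limit of 1 GPU so one task cannot hog the node
            memory: 16Gi
            cpu: 8
          requests:
            nvidia.com/gpu: 1
            memory: 12Gi
            cpu: 4
        volumeMounts:
        - name: dataset-volume
          mountPath: /data/imagenet
        - name: model-output
          mountPath: /output
        env:
        - name: MLFLOW_TRACKING_URI
          value: http://mlflow-tracking:5000
        - name: TRAIN_EPOCHS
          value: "50"                    # env values must be strings
        - name: RESUME_CHECKPOINT
          valueFrom:
            configMapKeyRef:
              name: training-config
              key: resume_checkpoint     # checkpoint path injected from a ConfigMap
        command: ["python", "train.py"]
        args: ["--data-dir", "/data/imagenet", "--output-dir", "/output"]
        livenessProbe:
          exec:
            command: ["sh", "-c", "ls /output/checkpoint_latest.pth || exit 1"]
          initialDelaySeconds: 300
          periodSeconds: 600
```

Key details:

- nodeSelector + tolerations ensure the training task is only scheduled onto GPU nodes that have the NVIDIA driver and CUDA installed, which avoids container start failures caused by driver mismatches. This is the most common pitfall for newcomers.
- persistentVolumeClaim reuses an existing PVC instead of creating a new PV each time. We created a dedicated 10 TB NFS PV for ImageNet and mount it at /data/imagenet; the training script reads from it directly and skips a 30-minute wget download.
- livenessProbe does not check a port; it checks whether checkpoint_latest.pth exists. Training can hang during data loading (for example a DataLoader deadlock), in which case the process is still alive but has actually stalled. In 3 hang incidents this probe triggered a restart within 10 minutes each time. Note that a bare existence check only catches hangs before the first checkpoint is written; checking the file's modification time would extend it to later stalls.
- RESUME_CHECKPOINT is injected from a ConfigMap rather than hard-coded. When training fails with an OOM, you only need to set resume_checkpoint: /output/checkpoint_epoch_23.pth in the ConfigMap and the Job continues from epoch 23, with no code change.

3.2 MLflow model registration and KServe binding: bringing the model to life

A model cannot just sit on disk; it has to be discoverable by the service, version-controlled, and releasable as a canary. Our flow is as follows.

Log the model in the training script:

```python
import mlflow
import torch

# After training completes
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="model",
    registered_model_name="resnet50-image-classifier",  # register in the central registry
    pip_requirements=["torch==2.0.1", "torchvision==0.15.2"],
    # Example input used to infer the model signature (for KServe's OpenAPI schema);
    # passed as a NumPy array, which MLflow accepts as an input example.
    input_example=torch.randn(1, 3, 224, 224).numpy(),
)
```

Manually approve the production version in the MLflow UI: in the MLflow Tracking Server UI, find the run that just finished, click "Register Model", enter the version v1.0, and mark it as "Production". This step must be confirmed by a human so that an unvalidated model cannot slip into production.

KServe syncs the model automatically: in our setup, KServe listens for MLflow events through a ModelRegistry CRD. When v1.0 of resnet50-image-classifier is marked Production, an InferenceService is created automatically:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: resnet50-prod
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh   # enable ModelMesh so multiple versions can coexist
spec:
  predictor:
    pytorch:
      storageUri: s3://mlflow-models/resnet50-image-classifier/1/3a7b8c...   # S3 path generated by MLflow
      resources:
        limits:
          nvidia.com/gpu: 1
```

Pitfall: by default KServe downloads the model from MLflow into the local /tmp, and in our containers /tmp was a memory-backed filesystem, so large models (over 2 GB) caused an OOM. The fix is to change the KServe ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
data:
  default-storage-source: s3
  s3-endpoint: https://minio.default.svc.cluster.local:9000
  s3-use-ssl: "false"
  # Key setting: download models onto a persistent volume
  model-load-path: /mnt/models
```

Then mount a PVC in the InferenceService:

```yaml
volumeMounts:
- name: model-pv
  mountPath: /mnt/models
volumes:
- name: model-pv
  persistentVolumeClaim:
    claimName: model-pv-claim
```

3.3 Feature consistency: how Feast solves training-serving skew

90% of production accuracy drops come from inconsistent features: training reads a CSV with pandas.read_csv(), inference queries Redis with feast.get_online_features(), and the two code paths treat nulls, timestamp formats, and string encodings differently. Our Feast configuration forces them to be the same.

Offline store: BigQuery (Google Cloud), with a strictly defined table schema:

```sql
CREATE TABLE myproject.features.user_features (
  event_timestamp TIMESTAMP NOT NULL,
  created_timestamp TIMESTAMP,
  user_id STRING NOT NULL,
  age INT64,
  gender STRING,
  last_login_days_ago FLOAT64  -- explicit type, so pandas does not infer object
);
```

Online store: Redis Cluster. Feast automatically syncs the BigQuery data into Redis, keyed as user_id:feature_name.

Training code must go through Feast:

```python
from feast import FeatureStore

store = FeatureStore(repo_path="/path/to/feature_repo")

# Fetch training data; Feast guarantees it matches what is served online
training_df = store.get_historical_features(
    entity_df=raw_data_df[["user_id", "event_timestamp"]],  # input must include a timestamp
    features=[
        "user_features:age",
        "user_features:gender",
        "user_features:last_login_days_ago",
    ],
).to_df()
```

Inference code calls Feast in the same way:

```python
# KServe preprocessing logic
def preprocess(self, inputs):
    user_id = inputs["user_id"]
    # Fetch features from Redis in real time, using the same logic as training
    features = self.feature_store.get_online_features(
        features=["user_features:age", "user_features:gender"],
        entity_rows=[{"user_id": user_id}],
    ).to_dict()
    # gender is stored as STRING; it must be encoded to a number exactly as in
    # training before it can be placed in a tensor.
    return torch.tensor([features["age"][0], features["gender"][0]])
```

Result: comparing training and inference feature distributions after launch, the KS statistic difference for last_login_days_ago dropped from ±15% to ±0.3%, and model AUC stability improved by 22%.

4. Hands-on: the full pipeline from building a cluster to serving traffic

4.1 Environment preparation: a production-ready cluster in 30 minutes

We do not use managed Kubernetes (such as GKE/EKS) because we need deep customization of the GPU driver and the network policy. We build on bare metal with kubeadm; the steps below have been verified over 12 repetitions.

Step 1: base environment initialization (run on all nodes)

```shell
# Disable swap (required by Kubernetes)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Load kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Configure iptables bridging and forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
```

Step 2: install the NVIDIA driver and container runtime (run on GPU nodes)

```shell
# Install the driver (example: Ubuntu 22.04 + RTX 4090)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run
sudo sh ./NVIDIA-Linux-x86_64-535.104.05.run --no-opengl-files --no-opengl-libs

# Install containerd (instead of Docker)
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml

# Edit config.toml to enable the nvidia runtime
# (nvidia-container-runtime is provided by the NVIDIA Container Toolkit):
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#   BinaryName = "/usr/bin/nvidia-container-runtime"
sudo systemctl restart containerd
```

Step 3: initialize the master node

```shell
# Initialize the cluster (choose a Pod CIDR that avoids the company intranet)
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket=unix:///run/containerd/containerd.sock \
  --kubernetes-version=v1.27.3

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install the Calico network plugin
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
```

Step 4: join the worker nodes (GPU nodes)

```shell
# On the master: print the join command (includes the token and CA cert hash)
kubeadm token create --print-join-command

# On each GPU worker
sudo kubeadm join 192.168.1.100:6443 --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:123... \
  --cri-socket unix:///run/containerd/containerd.sock

# Label the node so training Jobs can be scheduled onto it
kubectl label nodes gpu-worker-01 accelerator=nvidia-t4
```

Measured time: from bare metal to kubectl get nodes showing Ready took 28 minutes in total, 3 times faster than our Terraform + Ansible approach and with no external dependencies.

4.2 Toolchain deployment: install in order to avoid dependency hell

The components must be deployed in a strict order, otherwise KServe cannot discover MLflow. Note that MLflow cannot start until its PostgreSQL backend (item 3) is reachable.

1. Install Cert-Manager (all Ingress resources depend on it):

```shell
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.3/cert-manager.yaml
```

2. Deploy the MLflow Tracking Server (StatefulSet):

```yaml
# mlflow-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mlflow-tracking
spec:
  serviceName: mlflow-headless
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-tracking
  template:
    metadata:
      labels:
        app: mlflow-tracking
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:2.10.1
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_BACKEND_STORE_URI
          value: postgresql://mlflow:password@mlflow-postgres:5432/mlflow
        - name: MLFLOW_ARTIFACT_ROOT
          value: s3://mlflow-artifacts/
        volumeMounts:
        - name: mlflow-storage
          mountPath: /tmp/mlflow
      volumes:
      - name: mlflow-storage
        persistentVolumeClaim:
          claimName: mlflow-pvc
```

3. Deploy PostgreSQL (the MLflow backend):

```shell
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mlflow-postgres bitnami/postgresql \
  --set auth.postgresPassword=password,auth.database=mlflow
```

4. Deploy the KServe core:

```shell
# CRDs must be installed first
kubectl apply -k "github.com/kserve/kserve/config/crd?ref=v0.12.0"
# Then the controller
kubectl apply -k "github.com/kserve/kserve/config/core?ref=v0.12.0"
# Enable ModelMesh (multi-version support)
kubectl apply -k "github.com/kserve/kserve/config/modelmesh?ref=v0.12.0"
```

5. Verify the KServe state:

```shell
kubectl get pods -n kserve
# Expect kserve-controller-manager, modelmesh-controller, istio-ingressgateway
kubectl get crd | grep kserve
# Expect inferenceservices.serving.kserve.io and the other CRDs (12 in total)
```

Key checkpoint: if modelmesh-controller is in CrashLoopBackOff after kubectl get pods -n kserve, there is a 90% chance that the istio-system namespace has not been created or Istio is not installed. In that case first run kubectl apply -f https://github.com/istio/istio/releases/download/1.18.2/istio-1.18.2.yaml.

4.3 End-to-end pipeline in practice: image classification

We walk through the whole flow using ResNet50 trained and deployed on the Cats vs. Dogs dataset.

Step 1: prepare the dataset (do it once, reuse forever)

```shell
# Create an NFS PV/PVC for the training Job to mount
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: catsdogs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: nfs-server.example.com
    path: /exports/catsdogs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: catsdogs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF
```

Step 2: write the training script (train.py)

```python
import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, models, transforms

# Train on the GPU when one is visible to the container
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Data loading (from the PVC mount path)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("/data/catsdogs", transform=transform)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Model definition
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # binary classification
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

with mlflow.start_run():
    # Log hyperparameters to MLflow
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 32)

    # Training loop
    for epoch in range(10):
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
        # Log the last batch loss of each epoch
        mlflow.log_metric("loss", loss.item(), step=epoch)

    # Save the model to MLflow
    mlflow.pytorch.log_model(model, "model")
```

Step 3: build and push the training image

```dockerfile
# Dockerfile.train
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY train.py .
CMD ["python", "train.py"]
```

```shell
docker build -t gcr.io/my-project/ml-train:v1.0 -f Dockerfile.train .
docker push gcr.io/my-project/ml-train:v1.0
```

Step 4: submit the training Job

```shell
kubectl apply -f train-job.yaml  # the Job template from section 3.1
```

When reusing the template from section 3.1, point its PVC, mount path, and image tag at catsdogs-pvc, /data/catsdogs, and the v1.0 image built above.

Step 5: monitor training

- Open the MLflow UI (exposed through the Ingress) and check that the loss curve converges.
- Run kubectl logs job/train-resnet50-20240520 to follow the live logs.
- Run nvidia-smi -l 1 on the GPU node and check that GPU memory usage stays steady at around 95%.

Step 6: deploy the inference service

```shell
kubectl apply -f inference-service.yaml  # the InferenceService from section 3.2
```

Step 7: verify that the service is available

```shell
# Get the service URL
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

# Send a test request (KServe generates the OpenAPI schema automatically);
# instances holds the flattened 224x224x3 tensor
curl -X POST http://$INGRESS_HOST:$INGRESS_PORT/v1/models/resnet50-prod:predict \
  -H "Content-Type: application/json" \
  -d '{ "instances": [[0.1, 0.2, 0.3, ...]] }'
```

Measured result: from kubectl apply -f train-job.yaml to receiving a 200 OK response took 14 minutes 36 seconds. Of that, training took 8 minutes 12 seconds (RTX 4090), KServe model loading took 3 minutes 44 seconds (including downloading the 2.1 GB model from S3), and network latency was 2.8 seconds.

5. Common problems and troubleshooting notes: the pitfalls the docs do not mention

5.1 GPU scheduling fails: the Job stays Pending and kubectl describe pod shows "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"

Symptom: the training Job is stuck in Pending and kubectl describe pod reports insufficient GPU resources, even though nvidia-smi on the node clearly shows idle GPUs.

Root cause: Kubernetes does not detect NVIDIA GPUs on its own. The nvidia-device-plugin DaemonSet must be installed, and its version must strictly match the NVIDIA driver. For example, driver 535.x needs device-plugin v0.14.0; v0.13.0 produces errors.

Troubleshooting steps:

1. Check that the DaemonSet is running: kubectl get ds -n kube-system | grep nvidia
2. Read the device-plugin logs: kubectl logs -n kube-system ds/nvidia-device-plugin-daemonset
3. If the log contains Failed to initialize NVML: Unknown Error, the driver version does not match.

Fix:

```shell
# Remove the old version
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
# Install the matching version (driver 535.x uses v0.14.0)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```

Tip: hard-code the driver-to-device-plugin version mapping in the GPU node initialization script so nobody has to decide by hand. For example:

```shell
case $(nvidia-smi --query-gpu=driver_version --format=csv,noheader) in
  535.*) PLUGIN_VERSION=v0.14.0 ;;
  525.*) PLUGIN_VERSION=v0.13.0 ;;
  *) echo "Unsupported driver"; exit 1 ;;
esac
```

5.2 KServe returns 503: upstream connect error or disconnect/reset before headers

Symptom: the InferenceService is created successfully and kubectl get isvc shows Ready=True, but curl requests return 503.

Root cause: KServe exposes services through the Istio Ingress Gateway by default. Either the Gateway has no correctly configured VirtualService, or the model container failed to start (for example because a Python dependency is missing).

Troubleshooting steps:

1. Check the KServe controller logs: kubectl logs -n kserve deploy/kserve-controller-manager
2. Check the model Pod status: kubectl get pods -n kserve | grep resnet50
3. If the model Pod is in CrashLoopBackOff, run kubectl logs -n kserve pod/model-pod-name.

Typical error: the log shows ModuleNotFoundError: No module named torchvision, because the torchvision version in the training image is incompatible with the KServe base image.

Fix, option 1 (recommended): set runtimeVersion in the InferenceService so KServe picks a matching image:

```yaml
spec:
  predictor:
    pytorch:
      runtimeVersion: 2.0.1-py39-cu117   # identical to the training environment
```

Fix, option 2: build a custom KServe base image with all dependencies preinstalled.

Tip: add a "dependency consistency check" step to the CI/CD pipeline. Use pipdeptree --reverse --packages torch to compare the dependency trees of the training and inference environments and generate a diff report.

5.3 Feature drift alert never fires: Prometheus has no feature_drift_ratio metric

Symptom: a feature drift alert is configured in Grafana, but the metric is always 0.

Root cause: Feast's get_online_features() does not collect metrics by default; monitoring has to be enabled explicitly.

Fix:

1. Enable monitoring when initializing the Feast FeatureStore:

```python
store = FeatureStore(
    repo_path="/path/to/repo",
    enable_monitoring=True,  # key setting; confirm it exists in your Feast version
)
```

2. Deploy the Feast monitoring exporter as a separate Pod:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feast-monitoring-exporter
spec:
  # A Deployment requires a selector that matches the Pod template labels;
  # the same label is what the Service below selects on.
  selector:
    matchLabels:
      app: feast-monitoring-exporter
  template:
    metadata:
      labels:
        app: feast-monitoring-exporter
    spec:
      containers:
      - name: exporter
        image: feastdev/monitoring-exporter:v0.27
        env:
        - name: FEAST_SERVING_URL
          value: feast-serving.default.svc.cluster.local:6566
        ports:
        - containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: feast-monitoring-exporter
spec:
  selector:
    app: feast-monitoring-exporter
  ports:
  - port: 9090
    targetPort: 9090
```

3. Add a scrape config to Prometheus:

```yaml
- job_name: feast-monitoring
  static_configs:
  - targets: ["feast-monitoring-exporter.default.svc.cluster.local:9090"]
```

Tip: do not use a fixed drift threshold (such as 0.1); compute it dynamically from a historical baseline. In Prometheus we use avg_over_time(feature_drift_ratio[7d]) as the baseline and alert on feature_drift_ratio > (avg_over_time(feature_drift_ratio[7d]) * 3), which keeps normal fluctuations such as holidays from triggering false alarms.

5.4 Model rollback fails: after switching the InferenceService version, new requests still hit the old model

Symptom: the InferenceService storageUri is changed from v1.0 to v0.9 and kubectl get isvc shows the update succeeded, but the request results do not change.

Root cause: KServe's ModelMesh enables model caching by default, and the old model instance has not been evicted.

Fix:

1. Force-delete the old model Pod:

```shell
kubectl delete pod -n kserve -l model=resnet50-prod-v1.0
```

2. Disable the cache in the InferenceService (temporarily):

```yaml
spec:
  predictor:
    pytorch:
      storageUri: s3://.../v0.9
    containers:
    - name: kserve-container
      env:
      - name: KSERVE_DISABLE_MODEL_CACHE
```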
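The dynamic threshold from section 5.3 (alert when the current drift ratio exceeds three times its 7-day average) is easy to get subtly wrong, so it can help to prototype it outside Prometheus first. The sketch below is only an illustration of that rule in plain Python; the function name `drift_alert`, the sample values, and the "no baseline means no alert" behaviour are assumptions, not part of the original setup.

```python
from statistics import mean


def drift_alert(history, current, multiplier=3.0):
    """Return True when `current` exceeds `multiplier` times the mean of `history`.

    `history` stands in for the 7-day window that avg_over_time() would cover.
    With no history there is no baseline, so no alert is raised (an assumption).
    """
    if not history:
        return False
    baseline = mean(history)
    return current > baseline * multiplier


# Hypothetical daily drift ratios for the past week (mean is about 0.0257)
week = [0.02, 0.03, 0.025, 0.02, 0.03, 0.025, 0.03]

print(drift_alert(week, 0.05))  # 0.05 is below 3x the baseline
print(drift_alert(week, 0.12))  # 0.12 is above 3x the baseline
```

Because the baseline moves with the data, a gradual week-long rise in drift raises the threshold along with it; pairing this rule with a loose absolute ceiling is one way to cover that case.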

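The "dependency consistency check" suggested in section 5.2 can also be approximated without pipdeptree by diffing the pinned requirements of the training and inference images. The following is a minimal sketch under that assumption: it only understands `name==version` lines, and the helper names `parse_pins` and `diff_pins` as well as the sample version numbers are hypothetical.

```python
def parse_pins(lines):
    """Map package name -> pinned version for lines of the form 'name==version'."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an exact pin
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(train_lines, serve_lines):
    """Return {package: (training_version, serving_version)} for every mismatch.

    A version of None means the package is missing on that side.
    """
    train, serve = parse_pins(train_lines), parse_pins(serve_lines)
    report = {}
    for name in sorted(set(train) | set(serve)):
        t, s = train.get(name), serve.get(name)
        if t != s:
            report[name] = (t, s)
    return report


# Hypothetical requirement lists for the two images
train_reqs = ["torch==2.0.1", "torchvision==0.15.2", "pyarrow==12.0.1"]
serve_reqs = ["torch==2.0.1", "torchvision==0.14.1"]

for pkg, (t, s) in diff_pins(train_reqs, serve_reqs).items():
    print(f"{pkg}: training={t} serving={s}")
```

Failing the CI job whenever the report is non-empty would catch both kinds of incident described earlier: a package missing on the serving side and a version mismatch between the two environments.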