Why can static 3DGS plus trajectory replay train end-to-end autonomous driving with reinforcement learning?

张开发

• 2026/6/6 4:16:52 • 15 min read


We usually assume that in a static 3DGS scene the background is fixed and the obstacles replayed from recorded trajectories cannot interact with the ego vehicle. Yet the following two papers still train with reinforcement learning (RL):

• RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
• ParkingWorld: End-to-End Autonomous Parking Reinforcement Learning from Corrective Experience in 3DG

I take RAD's reward model as the example to analyze.

3.4 Reward Modeling

The reward is the source of the training signal and determines the direction in which RL optimizes. The reward function is designed to guide the ego vehicle's behavior by penalizing unsafe behavior and encouraging agreement with the expert trajectory. It consists of four reward components: (1) collision with dynamic obstacles, (2) collision with static obstacles, (3) positional deviation from the expert trajectory, and (4) heading deviation from the expert trajectory:

$$R = \{r_{dc}, r_{sc}, r_{pd}, r_{hd}\}. \quad (4)$$

As shown in Figure 4, these reward components are triggered under specific conditions. In the 3DGS environment, a dynamic collision is detected when the ego vehicle's bounding box overlaps the annotated bounding box of a dynamic obstacle, which triggers the negative reward $r_{dc}$. Likewise, when the ego vehicle's bounding box overlaps the Gaussians of a static obstacle, a static collision is identified, which yields the negative reward $r_{sc}$. Positional deviation is measured as the Euclidean distance between the ego vehicle's current position and the nearest point on the expert trajectory. A deviation beyond the preset threshold $d_{max}$ produces the negative reward $r_{pd}$. Heading deviation is computed as the angular difference between the ego vehicle's current heading $\psi_t$ and the matched heading $\psi_{expert}$ of the expert trajectory. A deviation beyond the threshold $\psi_{max}$ leads to the negative reward $r_{hd}$.

Any one of these events (dynamic collision, static collision, excessive positional deviation, or excessive heading deviation) immediately terminates the episode, because after such an event the 3DGS environment typically produces noisy sensor data, which is harmful to RL training.

With this reward function, it is entirely possible to run RL training in a Static 3DGS Scene + Trajectory Replay + RL Agent framework. However, it is important to understand what kind of RL problem you are actually solving.

What your environment really is

Your environment dynamics are:

• Dynamic vehicles: fixed replay
• Pedestrians: fixed replay
• Traffic: fixed replay
• Ego: controlled by RL

So the transition

$$s_{t+1} = f(s_t, a_t)$$

still exists. The ego vehicle's future state depends on its actions. The difference is that the other agents do not react to the ego. They follow prerecorded trajectories.

This is a valid MDP

Many papers call this:

• Open-loop traffic replay
• Log replay simulation
• Reactive ego / non-reactive world

The RL agent can still learn to steer, brake, and accelerate, because:

    action → ego trajectory changes → reward changes

For example:

    Replay vehicle: --------------------
    Ego: accelerate
    Result: collision
    Reward: r_dc < 0

The policy receives a training signal.

Why your reward design works

Your reward contains:

• $r_{dc}$: dynamic collision
• $r_{sc}$: static collision
• $r_{pd}$: position deviation
• $r_{hd}$: heading deviation

These define a constrained corridor around the expert trajectory. Effectively the RL agent learns to stay safe, stay near the expert, and maintain the correct heading. This is very similar to:

• imitation RL
• offline RL fine-tuning
• autonomous-driving RL benchmarks

What the RL agent can learn

It can learn:

✓ smoother steering
✓ throttle control
✓ brake timing
✓ lane keeping
✓ trajectory tracking
✓ collision avoidance against replayed traffic

What it cannot learn (RL training on 3DGS replay has limitations that contradict physical common sense)

Suppose the replayed car turns left and your ego decides to stop completely. In reality, other vehicles react, pedestrians react, and traffic evolves differently. In replay, all agents continue exactly as recorded. Therefore the RL agent never sees counterfactual futures.

Example

The dataset contains:

    Car A: x = 10, 11, 12, 13

recorded under the assumption that the human driver accelerates. During RL, the ego brakes hard. In the real world, Car A might also brake. In replay, Car A still moves 10 → 11 → 12 → 13. This is physically inconsistent, yet RL can still train.

Why many autonomous-driving papers accept this

Because the objective is often to learn a driving policy, not to build a perfect simulator. Replay-based environments provide:

• real sensor data
• realistic traffic
• easy reward computation
• stable training

without needing:

• behavior prediction
• traffic simulation
• world models

The limitation

The strongest limitation is that there is no agent interaction. Your environment is approximately World(t) instead of World(t, EgoAction). A true world model learns

$$World_{t+1} = F(World_t, EgoAction_t)$$

which enables:

• negotiation
• yielding
• merging
• interactive driving

In fact, your quoted text already reveals this

Notice:

"Any one of these events (dynamic collision, static collision, excessive positional deviation, or excessive heading deviation) immediately terminates the episode, because after such an event the 3DGS environment typically produces noisy sensor data, which is harmful to RL training."

This is a strong clue that the environment is fundamentally a rendering/replay system, not a generative world model. When the ego leaves the distribution of the recorded trajectory (large deviation, collision, off-road), the rendered views become unreliable. Therefore the episode terminates. World-model simulators usually try to continue generating plausible futures after such deviations.

Conclusion

With the reward function you showed, RL training in a Static 3DGS + Trajectory Replay environment is completely feasible and is already an active research direction. What you are learning is essentially policy learning inside a replay-based simulator rather than policy learning inside an interactive world model.

The main trade-off is:

| Feature | 3DGS Replay | World Model |
| --- | --- | --- |
| Photorealism | Excellent | Often lower |
| Easy to build | Yes | No |
| RL training | Yes | Yes |
| Interactive traffic | No | Yes |
| Counterfactual futures | No | Yes |
| Closed-loop ego control | Yes | Yes |
| Multi-agent reaction | No | Yes |

So the answer is yes: RL can be trained successfully in your proposed setup, and a world model is not strictly necessary. The world model becomes valuable when you want other agents and the environment to react realistically to the ego vehicle's actions rather than simply replaying recorded trajectories.
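
To make the four reward terms and the log-replay dynamics discussed in this article concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not RAD's implementation: the `Box` class, the axis-aligned overlap test (standing in for both the annotated-box check and the Gaussian overlap check), the thresholds `d_max` and `psi_max`, the penalty value of -1.0, and the helper names `reward_step` and `rollout` are all hypothetical. The replayed car is indexed by time only, so it never reacts to the ego.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned footprint (an assumed simplification of a bounding box)."""
    x: float
    y: float
    half_len: float = 2.0
    half_wid: float = 1.0


def overlaps(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes intersect."""
    return (abs(a.x - b.x) < a.half_len + b.half_len
            and abs(a.y - b.y) < a.half_wid + b.half_wid)


def wrap(angle: float) -> float:
    """Wrap an angle in radians to [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def reward_step(ego, ego_heading, dynamic, static, expert,
                d_max=2.0, psi_max=math.radians(40.0), penalty=-1.0):
    """Return the four reward terms and a termination flag.

    expert is a list of (x, y, heading) waypoints. Thresholds and the
    penalty value are illustrative assumptions.
    """
    # Nearest expert waypoint by Euclidean distance; its heading is used
    # as the matched expert heading.
    ex, ey, epsi = min(expert,
                       key=lambda p: math.hypot(ego.x - p[0], ego.y - p[1]))
    rewards = {
        "r_dc": penalty if any(overlaps(ego, b) for b in dynamic) else 0.0,
        "r_sc": penalty if any(overlaps(ego, b) for b in static) else 0.0,
        "r_pd": penalty if math.hypot(ego.x - ex, ego.y - ey) > d_max else 0.0,
        "r_hd": penalty if abs(wrap(ego_heading - epsi)) > psi_max else 0.0,
    }
    # Any triggered term ends the episode immediately.
    done = any(v < 0.0 for v in rewards.values())
    return rewards, done


def rollout(ego_speeds, replay_xs, expert):
    """Drive the ego along y = 0 against one replayed car.

    Returns (step, rewards) at termination, or (None, None) if the
    episode runs to the end without any term being triggered.
    """
    ego = Box(x=0.0, y=0.0)
    for t, speed in enumerate(ego_speeds):
        ego.x += speed                      # only the ego responds to the action
        other = Box(x=replay_xs[t], y=0.0)  # log replay: depends on t only
        rewards, done = reward_step(ego, 0.0, [other], [], expert)
        if done:
            return t, rewards
    return None, None


expert_path = [(float(x), 0.0, 0.0) for x in range(20)]
car_a = [10.0, 11.0, 12.0, 13.0]  # Car A from the example above

cautious = rollout([1.0, 1.0, 1.0, 1.0], car_a, expert_path)
aggressive = rollout([4.0, 4.0, 4.0, 4.0], car_a, expert_path)
```

In this toy setup the cautious ego never reaches Car A, while the aggressive ego overlaps it at the second step and receives a negative `r_dc`. Car A's positions are identical in both rollouts, which is exactly the non-reactive behavior the article describes.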
