New Algorithm UBP2 Uses Uncertainty to Learn Robot Rewards from Preferences

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

7 min read · 18 June 2026

Researchers have developed UBP2, a preference-based reinforcement learning method that actively guides exploration by balancing expected reward with model uncertainty. This approach allows robots to learn manipulation tasks from limited human preference feedback more efficiently than existing model-free and non-optimistic model-based methods.

What the Researchers Built

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning algorithm designed to learn robot behaviors from pairwise preference comparisons rather than explicit numerical rewards. The method addresses a critical bottleneck in preference-based RL: how to collect the most informative data when the number of queries a human can answer is limited.

The core innovation is an optimistic exploration strategy that uses three separate deep ensemble models—one for dynamics (predicting next states), one for the reward function (inferred from preferences), and one for the value function. During the feedback phase, UBP2 plans trajectories using a unified score that combines the expected cumulative return with an uncertainty bonus derived from all three ensembles. This encourages the robot to visit states where it is uncertain about the dynamics, the reward, or the eventual value, thereby collecting data that is most useful for learning.

Once the preference budget is exhausted, the system switches to a standard learned policy that executes actions quickly without further planning. UBP2 also includes an optimistic query selection strategy: it shows human trainers pairs of segments that have both high predicted reward and high reward-model uncertainty, ensuring that each query resolves key ambiguities.

Figure: UBP2 algorithm pseudocode showing the interaction loop with planning and learning steps.

Key Results

On a suite of five Meta-World manipulation tasks (including door open, button press, and assembly) using only proprioceptive observations, UBP2 consistently matched or exceeded the success rates of both model-free and non-optimistic model-based preference-based RL baselines while requiring fewer environment interactions. The method achieved earlier task success than PEBBLE (model-free) and MBP (non-optimistic model-based) in all five tasks.

The theoretical analysis establishes finite-horizon regret bounds that grow sublinearly in the number of episodes, with explicit dependence on the maximum information gain of the learned dynamics and reward models. Sublinear regret means the average per-episode gap to the optimal policy shrinks toward zero as episodes accumulate, so under standard smoothness assumptions UBP2's exploration is provably efficient.

When extended to high-dimensional visual observations (using DinoV2 encodings), UBP2 outperformed the non-optimistic model-based baseline on both Walker Walk and Cheetah Run tasks, while matching or exceeding model-free methods on Walker Walk. On Cheetah Run, model-free methods still performed best, suggesting vision-based dynamics models remain challenging.

How It Works

UBP2 operates in two phases: a feedback-available planning phase and a feedback-exhausted execution phase. During the first phase, every action selection involves solving a short-horizon model predictive control problem. The planner evaluates candidate action sequences by simulating trajectories through the learned dynamics model and computing a score that is the sum of predicted rewards plus an uncertainty bonus from all three ensembles:

Planner Objective = Predicted Cumulative Reward + α × (Uncertainty from Dynamics + Uncertainty from Reward + Uncertainty from Value)

The uncertainty is measured as the variance across ensemble members. By planning optimistically—rewarding actions that lead to high-uncertainty regions—UBP2 automatically balances exploitation (going for known high-reward states) with exploration (gathering data in uncertain parts of the state space).
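
The planner objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `optimistic_score`, the array shapes, and the choice to sum variances over the horizon and state dimensions are all assumptions made for the example.

```python
import numpy as np

def optimistic_score(rewards, next_states, values, alpha=1.0):
    """Score one candidate action sequence from ensemble predictions.

    rewards:     (E_r, H)    per-member reward predictions along the rollout
    next_states: (E_d, H, D) per-member next-state predictions
    values:      (E_v,)      per-member value estimates at the final state
    (Shapes and aggregation are assumptions for this sketch.)
    """
    # Exploitation term: ensemble-mean reward summed over the horizon.
    expected_return = rewards.mean(axis=0).sum()
    # Exploration terms: variance across ensemble members (axis 0).
    reward_unc = rewards.var(axis=0).sum()
    dynamics_unc = next_states.var(axis=0).sum()
    value_unc = values.var()
    return expected_return + alpha * (dynamics_unc + reward_unc + value_unc)

# Two toy candidates with the same mean reward (1.0 per step, horizon 4).
horizon, state_dim = 4, 2
agree = np.ones((3, horizon))                      # reward members agree
disagree = np.array([[0.0] * horizon,
                     [1.0] * horizon,
                     [2.0] * horizon])             # reward members disagree
states = np.zeros((3, horizon, state_dim))         # no dynamics disagreement
vals = np.zeros(3)                                 # no value disagreement

score_known = optimistic_score(agree, states, vals, alpha=0.5)
score_uncertain = optimistic_score(disagree, states, vals, alpha=0.5)
print(score_known, score_uncertain)
```

Both candidates have the same expected return, but the one on which the reward ensemble disagrees receives the higher score, which is what steers the planner toward uncertain regions. Setting `alpha=0` would recover a non-optimistic planner.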

Preference queries are generated by comparing pairs of trajectory segments. Instead of random pairs, UBP2 selects pairs that are both high in predicted reward and high in reward-model uncertainty. This ensures each human query targets the most informative comparisons, accelerating reward learning.
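
One way to picture this selection rule is the sketch below. The article does not give the exact scoring formula, so the per-segment score (ensemble mean plus `beta` times ensemble variance), the summation over the two segments of a pair, and the names `select_query_pairs` and `beta` are hypothetical choices for illustration only.

```python
from itertools import combinations

import numpy as np

def select_query_pairs(seg_returns, n_queries, beta=1.0):
    """Pick segment pairs with high predicted return and high disagreement.

    seg_returns: (E, N) predicted return of N segments from E reward-ensemble
                 members. The scoring rule is an assumption of this sketch.
    """
    mean = seg_returns.mean(axis=0)          # predicted reward per segment
    var = seg_returns.var(axis=0)            # reward-model uncertainty
    seg_score = mean + beta * var
    pairs = list(combinations(range(seg_returns.shape[1]), 2))
    pair_scores = [seg_score[i] + seg_score[j] for i, j in pairs]
    best = np.argsort(pair_scores)[::-1][:n_queries]   # highest scores first
    return [pairs[k] for k in best]

# Four candidate segments scored by a three-member reward ensemble.
seg_returns = np.array([[1.0, 0.0, 2.0, 0.5],
                        [1.0, 0.0, 4.0, 0.5],
                        [1.0, 0.0, 3.0, 3.5]])
chosen = select_query_pairs(seg_returns, n_queries=2)
print(chosen)
```

In this toy data, segments 2 and 3 combine high predicted return with ensemble disagreement, so they form the first pair shown to the trainer, while the low-reward, zero-uncertainty segment 1 is not selected.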

The dynamics model uses an ensemble of probabilistic neural networks, each predicting the next state distribution. The reward model is similar but trained directly on preference comparisons via a Bradley-Terry loss. The value model is an ensemble of deep Q-networks learned from imagined rollouts under the predicted reward.
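
For readers unfamiliar with the Bradley-Terry loss, the following sketch shows its standard form: the probability that segment A is preferred over segment B is the sigmoid of the difference in their summed predicted rewards, and the loss is the negative log-likelihood of the human labels. The function name and array layout are assumptions, and a real reward model would compute the segment returns with a neural network rather than take them as inputs.

```python
import numpy as np

def bradley_terry_loss(returns_a, returns_b, prefs):
    """Mean negative log-likelihood of preference labels.

    returns_a, returns_b: (N,) summed predicted reward of each segment
    prefs:                (N,) 1.0 if segment A was preferred, 0.0 if B
    """
    logits = returns_a - returns_b
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably with logaddexp.
    log_p_a = -np.logaddexp(0.0, -logits)
    log_p_b = -np.logaddexp(0.0, logits)
    return -np.mean(prefs * log_p_a + (1.0 - prefs) * log_p_b)

# The model rates A higher by 2.0; compare a label that agrees with one that does not.
ret_a, ret_b = np.array([3.0]), np.array([1.0])
loss_agree = bradley_terry_loss(ret_a, ret_b, np.array([1.0]))
loss_disagree = bradley_terry_loss(ret_a, ret_b, np.array([0.0]))
print(loss_agree, loss_disagree)
```

The loss is small when the predicted returns rank the segments the way the human did and large otherwise, so minimizing it pushes the reward model toward the trainer's preferences.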

After the query budget is used up, the planning component is disabled. The agent then follows a policy that acts greedily with respect to the learned value function, which was itself trained on rollouts from the dynamics and reward models, so actions are selected quickly without further expensive planning.
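
The hand-off between the two phases can be expressed as a simple dispatch on the remaining query budget. The function and argument names here are hypothetical, and `planner` and `policy` stand in for whatever callables an implementation provides.

```python
def select_action(obs, queries_used, query_budget, planner, policy):
    """Plan while feedback is available, then fall back to the fast policy."""
    if queries_used < query_budget:
        return planner(obs)   # feedback-available phase: optimistic planning
    return policy(obs)        # feedback-exhausted phase: reactive policy

# Stand-in callables that only report which branch was taken.
planner = lambda obs: "planned action"
policy = lambda obs: "policy action"

during_feedback = select_action(None, 10, 100, planner, policy)
after_feedback = select_action(None, 100, 100, planner, policy)
print(during_feedback, after_feedback)
```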

| Component | Model Type | Uncertainty Source | Training Signal |
| --- | --- | --- | --- |
| Dynamics | Deep ensemble (probabilistic) | Variance across ensemble | Ground-truth state transitions |
| Reward | Deep ensemble (probabilistic) | Variance across ensemble | Preference comparisons |
| Value | Deep ensemble (Q-function) | Variance across ensemble | Rollouts under learned reward |

Figure: Comparison of success rates across tasks for UBP2 and baselines.

Why This Matters for Robotics

Preference-based RL is a natural fit for robotics because many tasks have hard-to-specify reward functions. Rather than engineering a complex reward—or requiring users to give numerical scores—a trainer can simply say "I prefer the trajectory on the left." UBP2's uncertainty-driven query selection reduces the number of such comparisons needed, making it practical for real-world deployment.

The method's ability to switch from planning to policy execution after queries are exhausted is also practical: during training, the robot explores widely; after training, it runs a fast, reactive policy. This decoupling could be adopted in warehouses or assembly lines where human preference feedback is costly to collect but final execution must be rapid.

UBP2's use of three separate uncertainty estimates is noteworthy. Most prior work considers uncertainty in only the reward or only the dynamics; including all three sources leads to more targeted exploration. For robot arms learning pick-and-place or door-opening, this could reduce the number of queries required compared with current baseline methods.

Limitations and Open Questions

The theoretical analysis assumes the dynamics and reward models are well-calibrated Gaussian processes, but in practice UBP2 uses deep ensembles. While deep ensembles often produce reliable uncertainty estimates, they are not as theoretically grounded as GPs. The authors note that the preference-learning error is not fully characterized in the regret bound, making it difficult to guarantee how many queries are truly needed.

In visual domains, UBP2's performance lagged behind model-free methods on the Cheetah Run task, suggesting that learned vision-based dynamics remain a weak point. Because the visual experiments already rely on pre-trained DinoV2 encodings, future work may need better latent representations or dynamics models built on top of such encoders.

Frequently Asked Questions

What exactly is preference-based reinforcement learning? Instead of giving the robot a numeric reward signal, a human compares two short video clips of the robot's behavior and indicates which one is preferable. The algorithm infers a reward function from these comparisons.

How does UBP2 use uncertainty to plan better? UBP2 adds an uncertainty bonus to the predicted return during planning. This encourages the robot to visit states where it is uncertain about dynamics, reward, or value, collecting data that most reduces overall uncertainty.

What kinds of robots or tasks can UBP2 be applied to? The method was tested on simulated Meta-World manipulation tasks like opening doors and pressing buttons using proprioceptive observations, and on the Walker Walk and Cheetah Run tasks using camera images. It could be adapted to real robotic arms, mobile manipulators, or any control task where a human can compare two behaviors.

How does UBP2 compare to simpler preference-based methods like PEBBLE? Across five Meta-World tasks, UBP2 matched or exceeded the baselines' success rates, reached task success earlier, and required fewer environment interactions. Its uncertainty-guided planning is the key advantage over non-optimistic baselines like MBP and model-free methods like PEBBLE.

Conclusion

UBP2 introduces a principled way to combine uncertainty from dynamics, reward, and value models into a single planning objective for preference-based RL. By actively seeking informative data during the feedback phase and switching to fast execution afterward, it offers a practical path toward sample-efficient robot learning from human preferences.
