Summary. In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on. An aspirational goal is to construct self-improving robots: robots that can learn and improve on their own, from autonomous interaction with minimal human supervision or oversight. Such robots could collect and train on much larger datasets, and thus learn more robust and performant policies. MEDAL++ is an autonomous reinforcement learning algorithm that trains a forward policy to do the task and a backward policy to undo the task towards states visited by an expert. Starting with a small set of demonstrations collected by an expert, the forward and backward policies interact with the environment in a cyclic fashion, switching control after a fixed number of steps. Chaining the forward and backward policies allows the robot to self-improve, minimizing the need for humans to reset the environment after every episode. Importantly, MEDAL++ learns end-to-end from high-dimensional visual inputs and learns the reward function from the expert demonstrations, bypassing the need for reward engineering. In contrast to prior work, this allows MEDAL++ to be applied in the real world, where it improves success rates by 30-70% over behavior cloning policies. Overall, MEDAL++ takes a step towards simple and general self-improving robotic systems.
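To make the training procedure concrete, below is a minimal Python sketch of the forward/backward chaining loop described in the summary. All interfaces here (`env`, `policy.act`, `policy.update`, the reward models, the replay buffers, and the 200-step horizon) are illustrative assumptions, not the actual MEDAL++ implementation.

```python
# Minimal sketch of autonomous forward/backward training.
# All object interfaces below are hypothetical placeholders.

HORIZON = 200  # steps before control switches between policies (assumed value)


def rollout(env, policy, reward_model, buffer, obs, horizon=HORIZON):
    """Run one policy for a fixed number of steps, storing rewarded
    transitions and performing off-policy updates along the way."""
    for _ in range(horizon):
        action = policy.act(obs)        # act directly on visual observations
        next_obs = env.step(action)
        # The reward is predicted by a model learned from the expert
        # demonstrations, so no reward engineering is needed.
        reward = reward_model(next_obs)
        buffer.add(obs, action, reward, next_obs)
        policy.update(buffer)           # off-policy RL update
        obs = next_obs
    return obs


def autonomous_training(env, forward, backward, fwd_reward, bwd_reward,
                        fwd_buffer, bwd_buffer, num_cycles):
    """Chain the forward policy (do the task) and the backward policy
    (undo it towards expert-visited states) so the robot practices
    autonomously, without a human reset after every episode."""
    obs = env.reset()  # a single reset at the start of training
    for _ in range(num_cycles):
        obs = rollout(env, forward, fwd_reward, fwd_buffer, obs)
        obs = rollout(env, backward, bwd_reward, bwd_buffer, obs)
```

The key design choice this sketch illustrates is that the backward policy returns the environment to states the forward policy can practice from, so a single human-provided reset can be amortized over many forward attempts.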
This website features autonomous training and evaluation videos of MEDAL++ on three manipulation tasks using the Franka Panda arm: cloth hanging, peg insertion, and bowl covering.
This task requires the robot to grasp a cloth and hang it on a fixed hook. The cloth itself can start in any location and in an arbitrary shape.
An annotated segment of training using MEDAL++, showing how forward and backward policies interact to enable the robot to practice autonomously.
A timelapse of training, showing the diverse set of states visited by the robot, along with failures and successes of the policy.
This task requires the robot to insert a peg into the goal location, marked by a green boundary.
An annotated segment of training using MEDAL++, showing how forward and backward policies interact to enable the robot to practice autonomously.
A timelapse of training, showing the diverse set of states visited by the robot, along with failures and successes of the policy.
This task requires the robot to cover a bowl using a cloth. The cloth itself can start in any location and in an arbitrary shape.
An annotated segment of training using MEDAL++, showing how forward and backward policies interact to enable the robot to practice autonomously.
A timelapse of training, showing the diverse set of states visited by the robot, along with failures and successes of the policy.
Template credits: NeRF in the Palm of Your Hand