
Teaching Robots to Dance

Modern industrial robotics increasingly resembles choreography rather than solo performance. In automotive welding [2], aerospace assembly [3], or high-throughput manufacturing [4,5,6], multiple robotic arms must work simultaneously within tight, obstacle-filled workcells. Each arm must decide what to do, when to do it, and how to move—while never colliding with its neighbors. This deceptively simple description hides one of the most difficult problems in robotics: the joint optimization of task allocation, scheduling, and collision-free motion planning.


In their 2025 Science Robotics paper, Lai et al. introduce RoboBallet [1], a learning-based system that addresses this problem head-on by combining graph neural networks (GNNs) with deep reinforcement learning (RL) [7]. Rather than decomposing the problem into separate stages or relying on hand-crafted heuristics, RoboBallet learns to control many robots simultaneously, generating coordinated trajectories in real time for dense, industrial-scale settings. The result is a planner that trades theoretical completeness for scalability, speed, and practical usefulness—precisely the trade-off that real factories increasingly demand.

They train the model on environments with randomized robot placement, randomly generated obstacles, and randomly generated tasks.

Why Multi-Robot Planning Is So Hard


Classical robotics approaches excel when problems are small and well-structured. Sampling-based motion planners like RRT or RRT-Connect can find collision-free paths in high-dimensional spaces, and combinatorial solvers can optimize schedules or assignments under simplifying assumptions. The trouble begins when these components are combined.


In a dense multi-robot workcell, the planner must solve at least three intertwined subproblems:

  1. Task allocation: which robot performs which task.

  2. Task scheduling: the order in which each robot executes its assigned tasks.

  3. Motion planning: generating collision-free joint-space trajectories for all robots simultaneously.

Each of these problems is computationally hard on its own. Together, they become intractable at realistic scales. Theoretical results show that task and motion planning is PSPACE-complete, meaning worst-case complexity grows exponentially with the number of robots, tasks, and obstacles. As a result, state-of-the-art algorithmic approaches typically handle only a handful of robots and tasks, and often rely on iterative decomposition or strong heuristics.


Industry has responded pragmatically: human experts manually design trajectories, insert interlocks, and tweak schedules until everything works. This process can take hundreds or thousands of engineering hours, and even minor changes—new tasks, new obstacles, or a robot failure—can require extensive rework. RoboBallet is motivated by the question: Can we learn the intuition that expert humans use, and apply it at machine speed?


Reframing Planning as Control


The key conceptual shift in RoboBallet is to abandon the idea of planning as a one-shot combinatorial search. Instead, the authors frame the problem as continuous control over time. At every 100 ms timestep, a learned policy outputs joint velocity commands for all robots. These commands are integrated forward in a kinematic simulator, subject to velocity limits and hard collision checks.
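The control formulation above can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `policy` and `collides` are hypothetical stand-ins for the learned GNN policy and the simulator's collision checker.

```python
import numpy as np

DT = 0.1  # 100 ms control timestep, as in the paper

def step(q, policy, v_max, collides):
    """One control step: the policy maps the current joint state to joint
    velocity commands, which are clipped to velocity limits and integrated
    forward kinematically; motions that would collide are rejected."""
    v = np.clip(policy(q), -v_max, v_max)  # joint velocity commands
    q_next = q + DT * v                    # forward Euler integration
    if collides(q_next):                   # hard collision check
        return q                           # reject the motion, hold pose
    return q_next

# toy usage: two 7-DoF arms stacked into a single 14-dim state vector
q = np.zeros(14)
q = step(q, policy=lambda q: np.ones_like(q), v_max=0.5, collides=lambda q: False)
```

Note that the policy never needs to emit an explicit plan: allocation and scheduling are whatever behavior this loop produces over time.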


In this formulation, task allocation, scheduling, inverse kinematics selection, and collision avoidance are no longer explicit modules. They emerge from the learned control policy. A robot that starts moving toward a task has, in effect, allocated that task to itself. A robot that waits or detours is implicitly negotiating schedule and priority with others.


This approach shifts computational burden from online planning to offline training. During training, the policy interacts with millions of procedurally generated environments. At test time, inference is extremely fast—fast enough to support real-time replanning and optimization.

Training and evaluation pipeline

Graphs as the Language of Coordination


Naively applying deep reinforcement learning to multi-robot planning would almost certainly fail to scale. As the number of robots, tasks, and obstacles increases, the dimensionality of the state space grows combinatorially. A flat state representation—where all robot states, task descriptors, and obstacle parameters are concatenated into a single vector—forces the policy network to relearn essentially the same interactions over and over again for every new configuration. The result is poor data efficiency, brittle generalization, and models whose size and training requirements explode with scene complexity.


RoboBallet’s central architectural insight is to recognize that multi-robot coordination is fundamentally a relational problem, and that graphs provide a natural language for expressing those relations. Rather than treating the environment as an unstructured collection of numbers, RoboBallet represents each state as a graph whose structure mirrors the physical and functional structure of the workcell.


Each environment state is encoded as a graph with three distinct types of nodes:


  • Robot nodes, which encode each robot’s internal state, including joint configurations, joint velocities, and remaining dwell time when a robot is executing a task.

  • Task nodes, which represent desired end-effector poses and track whether a task has already been completed.

  • Obstacle nodes, which correspond to geometric collision primitives derived from the environment, abstracting complex geometry into manageable components.

Edges define how information flows between these entities. Robot-to-robot edges allow the policy to reason about proximity, potential collisions, and coordination between arms sharing the same workspace. Task-to-robot edges encode relative pose information between tasks and robot end effectors, providing the spatial context needed to decide which robot should approach which task and from where. Obstacle-to-robot edges similarly encode spatial relationships that are critical for collision avoidance.


Crucially, edges are directional, reflecting how information should propagate. Tasks and obstacles send information to robots, while robots exchange information with one another bidirectionally. This design ensures that action decisions—joint velocity commands—are computed at the robot nodes, informed by all relevant relational context.
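The connectivity pattern described above can be sketched as a simple edge-list builder. Node and edge feature encodings are elided, and the data structure is illustrative rather than the paper's actual graph representation:

```python
from itertools import permutations

def build_edges(n_robots, n_tasks, n_obstacles):
    """Directed edge lists for the workcell graph: tasks and obstacles send
    information to robots, while robots exchange messages bidirectionally.
    Nodes are (type, index) pairs; an edge is a (src, dst) tuple."""
    robots = [("robot", i) for i in range(n_robots)]
    tasks = [("task", i) for i in range(n_tasks)]
    obstacles = [("obstacle", i) for i in range(n_obstacles)]
    edges = []
    edges += [(t, r) for t in tasks for r in robots]       # task -> robot
    edges += [(o, r) for o in obstacles for r in robots]   # obstacle -> robot
    edges += list(permutations(robots, 2))                 # robot <-> robot
    return edges

edges = build_edges(n_robots=2, n_tasks=3, n_obstacles=1)
```

Because the same message function is applied to every edge of a given type, this structure is exactly what lets one trained network handle any number of robots, tasks, and obstacles.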


This graph-based formulation introduces strong inductive biases that dramatically improve scalability and generalization. The policy does not learn separate behaviors for “robot 1 approaching task 3” versus “robot 5 approaching task 17.” Instead, it learns a single reusable interaction pattern: how any robot should behave relative to any nearby task or obstacle. These learned interaction rules are shared across all nodes and edges of the same type, allowing the model to generalize seamlessly to scenes with different numbers of robots, tasks, or obstacles.

Obstacle mesh decomposition pipeline.

From a systems perspective, this has two major consequences. First, model complexity remains constant as the environment grows. Adding more robots or tasks increases the size of the graph, but not the number of parameters in the network. Second, computation scales with interactions rather than entities. The cost of inference is dominated by processing edges—robot–robot, robot–task, robot–obstacle—which grows linearly with tasks and obstacles and quadratically with robots, a manageable trade-off given physical limits on robot density.
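The scaling claim can be made concrete with a small edge-counting function (the obstacle count below is illustrative; the paper reports "dozens" of primitives):

```python
def edge_count(n_robots, n_tasks, n_obstacles):
    """Directed edges in the workcell graph: the task->robot and
    obstacle->robot terms grow linearly with tasks and obstacles, while
    the bidirectional robot<->robot term grows quadratically with robots."""
    return (n_robots * n_tasks              # task -> robot
            + n_robots * n_obstacles        # obstacle -> robot
            + n_robots * (n_robots - 1))    # robot <-> robot, both directions

# doubling tasks adds edges linearly: one new edge per robot per new task
growth = edge_count(8, 80, 30) - edge_count(8, 40, 30)
```

Since physical workcells cap how many arms can usefully share a space, the quadratic robot term stays small in practice.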


Conceptually, this mirrors how expert human planners think. Humans do not reason about the entire joint configuration space explicitly; they reason locally and relationally—this robot is too close to that one, this task is better handled by the nearest free arm, this obstacle constrains approach direction. By embedding these relational assumptions directly into the network architecture, RoboBallet transforms the curse of combinatorial complexity into a source of leverage.


In short, graphs are not just a convenient data structure in RoboBallet—they are the mechanism that makes learning-based coordination feasible at industrial scales. Without this structured representation, deep RL would drown in dimensionality. With it, coordination becomes a reusable skill rather than a brittle optimization problem.


Learning with Reinforcement and Hindsight


RoboBallet is trained using a modified version of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art off-policy reinforcement learning algorithm designed for continuous control. TD3 belongs to the family of deterministic actor–critic methods, in which the policy directly outputs continuous actions—in this case, joint velocity commands for every joint of every robot at each timestep. This choice is particularly well suited to robotics, where smooth, high-dimensional control signals are required and discretizing the action space would be impractical.
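The "twin" in TD3 refers to its clipped double-Q target, which bootstraps from the smaller of two critic estimates to curb overestimation. A minimal sketch, with scalars standing in for critic network outputs:

```python
def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """TD3's clipped double-Q target: bootstrap from the MINIMUM of the two
    target critics' estimates of the next state-action value, so that a
    single over-optimistic critic cannot inflate the learning target."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return r + bootstrap

y = td3_target(r=1.0, q1_next=5.0, q2_next=4.0)  # uses the smaller critic
```

The other TD3 ingredients (delayed actor updates, target-policy smoothing) are omitted here for brevity.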


The learning signal itself is intentionally sparse and minimalist. Rather than shaping the reward to guide robots incrementally toward tasks, RoboBallet only rewards meaningful progress: completing tasks increases the score, while attempting motions that would lead to collisions incurs a penalty. All other behavior—how robots approach tasks, how they negotiate space with one another, and how they sequence their actions over time—is left for the policy to discover. This design choice avoids embedding human assumptions about “good” intermediate behavior into the reward function, which can bias learning and reduce generality.


However, sparse rewards come with a well-known drawback: early in training, the agent rarely experiences success, making it difficult to learn anything at all. In a complex, high-dimensional environment like a multi-robot workcell, random exploration is exceedingly unlikely to result in completed tasks, especially when collisions are disallowed by the simulator.


RoboBallet addresses this challenge using Hindsight Experience Replay (HER). Instead of discarding failed episodes as useless, HER reframes them as alternative successes. After an episode ends, the system retrospectively selects some of the end-effector poses actually reached by the robots and treats them as if they were the original task goals. Under this reinterpretation, trajectories that failed to reach the intended targets become successful demonstrations of reaching some goal.
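The relabeling idea can be sketched with HER's "final" strategy, in which the pose actually reached at the end of an episode is substituted for the original goal (the paper's exact relabeling variant may differ):

```python
def her_relabel(episode):
    """Hindsight relabeling: pretend the pose actually reached at the end
    of the episode was the goal all along, turning a failed rollout into a
    successful demonstration of reaching *some* pose. `episode` is a list
    of (state, action, achieved_pose) tuples; the sparse reward fires only
    on the step that attains the relabeled goal."""
    new_goal = episode[-1][2]  # pose actually reached
    return [(s, a, new_goal, 1.0 if p == new_goal else 0.0)
            for (s, a, p) in episode]

episode = [("s0", "a0", "p0"), ("s1", "a1", "p1"), ("s2", "a2", "p2")]
relabeled = her_relabel(episode)  # final step now earns the sparse reward
```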


This mechanism is especially powerful in RoboBallet’s setting. Because robots are constantly moving and interacting, almost every episode contains rich trajectories through configuration space. HER converts these trajectories into dense learning signals without altering the environment dynamics or manually shaping rewards. In effect, the agent learns from what it did, rather than being punished for what it failed to do.

Crucially, this approach keeps the reward function task-agnostic. The learning algorithm does not need domain-specific heuristics such as distance-to-goal rewards, collision margins, or hand-tuned weighting between competing objectives. This simplicity improves robustness and makes it easier to extend the framework to new task types or environments. If the definition of “success” changes—different task poses, different dwell times, or additional constraints—the same learning machinery can be reused with minimal modification.


The combination of TD3 and HER yields a training process that is both stable and scalable. TD3 mitigates overestimation bias and stabilizes learning in continuous action spaces, while HER ensures that even unsuccessful rollouts contribute useful gradients. Together, they allow RoboBallet to converge reliably on effective coordination policies, even in large scenes with many robots, many tasks, and complex obstacle layouts—precisely the regimes where traditional planning algorithms and naive reinforcement learning approaches tend to fail.


Scaling, Quality, and Generalization at Industrial Scale


The most striking results in the RoboBallet paper concern its ability to operate at industrial scale, a regime that has long been out of reach for joint task-and-motion planning systems. The authors train and evaluate their model in environments with up to eight 7-DoF robotic arms, 40 shared reaching tasks, and dozens of obstacle primitives—a combination that would render classical planners computationally infeasible. Crucially, RoboBallet not only functions in this setting, but does so efficiently, producing coordinated, collision-free trajectories in seconds.


Several complementary scaling properties enable this performance. At training time, increasing the number of tasks from 10 to 40 does not lead to an exponential increase in the number of training steps required for convergence. While each step becomes more computationally expensive—due to the larger number of graph edges—the overall learning dynamics remain stable. This indicates that the model is not memorizing specific task configurations, but rather learning reusable coordination patterns that transfer across problem sizes.

Execution time comparison between optimized and default layouts on the 10 evaluation configurations (each with 10 tasks)

Scaling with the number of robots is more demanding, as robot-to-robot interactions grow quadratically. However, this scaling is largely manageable in practice. Physical workcells impose natural limits on robot density: beyond a certain point, adding more robots yields diminishing returns because their workspaces overlap excessively. Within these realistic constraints, RoboBallet maintains tractable computational costs while continuing to benefit from parallelism in task execution.


At inference time, the system’s performance is particularly notable. Even in the largest configurations tested, each planning step takes on the order of 0.3 milliseconds on a single NVIDIA A100 GPU, enabling planning at more than 300× real-time for a 10 Hz control loop. On a single CPU core, inference remains faster than real time. This speed is not merely a technical curiosity—it is what allows RoboBallet to be used for rapid replanning, optimization, and interactive deployment rather than as an offline planning tool.
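The ">300× real-time" figure follows from simple arithmetic:

```python
# A 10 Hz control loop allots 100 ms of wall-clock time per step; at the
# reported ~0.3 ms of inference per step on an A100, the planner runs
# roughly 333x faster than real time.
control_period_ms = 100.0   # 10 Hz control loop
inference_ms = 0.3          # reported per-step inference cost
realtime_factor = control_period_ms / inference_ms
```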


Scalability alone, however, would be of limited value if it came at the cost of poor solution quality. Learning-based planners are often criticized for lacking the theoretical guarantees of classical methods, such as probabilistic completeness or asymptotic optimality, and RoboBallet is no exception. To assess whether this trade-off is acceptable in practice, the authors compare RoboBallet against an exhaustive baseline in a simplified setting with four robots and tightly constrained tasks. The baseline enumerates all possible task schedules, samples multiple inverse-kinematics solutions per task, and relies on RRT-Connect for motion planning—at the cost of orders of magnitude more computation.


Despite this advantage, the baseline only modestly outperforms RoboBallet in trajectory duration. When sufficient IK diversity is available, RoboBallet produces trajectories that are competitive in execution time, even though it plans all robots simultaneously and explicitly accounts for inter-robot interactions. This result highlights a central theme of the paper: at real-world scales, consistent, near-optimal solutions delivered quickly are often more valuable than theoretically optimal solutions that are computationally impractical.


Perhaps most importantly, these properties extend beyond the environments seen during training. RoboBallet is trained entirely in simulation using randomized robot placements, obstacle geometries, and task distributions, yet it generalizes zero-shot to hand-designed workcells and to real robotic hardware. The authors demonstrate this by executing RoboBallet-generated trajectories on a physical setup with four Franka Panda robots. With minimal post-processing—primarily spline smoothing and conservative collision margins—the robots execute coordinated motions successfully.

This sim-to-real transfer is enabled by two design choices: the kinematic focus of the model, which avoids reliance on fragile dynamics assumptions, and the enforcement of collision constraints during training, which ensures that unsafe behaviors are never learned. Together, they show that RoboBallet is not merely a scalable research prototype, but a practical system capable of bridging the gap between simulation and industrial deployment.


Enabling New Capabilities


Perhaps the most compelling aspect of RoboBallet is not what it replaces, but what it enables.


Workcell Layout Optimization


Because planning is fast, RoboBallet can be embedded inside an outer optimization loop. The authors use a black-box optimizer to search over robot placements, discovering layouts that reduce total execution time by up to 33%. This turns layout design from a manual art into a data-driven optimization problem.
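The outer loop can be sketched with random local search standing in for the paper's black-box optimizer; the layout encoding (a flat list of base coordinates) and `plan_time` (layout → planned execution time, obtained by running the fast planner) are illustrative assumptions:

```python
import random

def optimize_layout(plan_time, initial_layout, iters=100, step=0.1, seed=0):
    """Black-box layout optimization: because a full multi-robot plan can be
    generated in seconds, each candidate robot placement is scored simply by
    planning in it and measuring total execution time."""
    rng = random.Random(seed)
    best, best_t = list(initial_layout), plan_time(initial_layout)
    for _ in range(iters):
        cand = [x + rng.uniform(-step, step) for x in best]  # perturb placements
        t = plan_time(cand)
        if t < best_t:                                       # keep improvements
            best, best_t = cand, t
    return best, best_t

# toy objective: "execution time" minimized when both bases sit at the origin
layout, t = optimize_layout(lambda xs: sum(x * x for x in xs), [1.0, -1.0])
```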


Fault Tolerance and Replanning


In traditional setups, planning for robot failures requires pre-computed backup trajectories. With RoboBallet, alternative plans can be generated on the fly when a robot goes offline, reducing downtime and engineering overhead.


Toward Online and Adaptive Planning


With inference speeds far exceeding real-time requirements, RoboBallet opens the door to perception-driven replanning, dynamic task insertion, and adaptive manufacturing pipelines—capabilities that are impractical with slow, manual planners.

The authors are careful to acknowledge limitations. RoboBallet focuses on reaching tasks with fully specified end-effector poses and no inter-task dependencies. Real industrial workflows often involve precedence constraints, heterogeneous robot capabilities, or complex manipulation like pick-and-place.


However, the graph-based formulation is flexible. Task dependencies could be encoded as task-to-task edges; incompatible robot-task pairs could be removed by masking edges. More expressive obstacle representations or attention mechanisms could further improve scalability.


The paper positions RoboBallet not as a final solution, but as a proof of principle: structured learning architectures can tame combinatorial robotics problems that defeat classical methods.


Conclusion: From Planning to Choreography


RoboBallet represents a philosophical shift in robotics. Instead of explicitly solving planning subproblems, it learns to behave like a skilled human planner—coordinating, anticipating, and adapting through experience. By combining graph neural networks with reinforcement learning, the system scales to problem sizes that matter in industry, while remaining fast enough to unlock entirely new workflows.


In dense robotic environments, the future may belong less to planners that search exhaustively, and more to policies that have learned the dance.


References


[1] Lai, M., Go, K., Li, Z., Kröger, T., Schaal, S., Allen, K., & Scholz, J. (2025). RoboBallet: Planning for multirobot reaching with graph neural networks and reinforcement learning. Science Robotics, 10(106), eads1204.


[2] Pellegrinelli, S., Pedrocchi, N., Tosatti, L. M., Fischer, A., & Tolio, T. (2017). Multi-robot spot-welding cells for car-body assembly: Design and motion planning. Robotics and Computer-Integrated Manufacturing, 44, 97-116.


[3] Yamada, Y., Nagamatsu, S., & Sato, Y. (1995, May). Development of multi-arm robots for automobile assembly. In Proceedings of 1995 IEEE International Conference on Robotics and Automation (Vol. 3, pp. 2224-2229). IEEE.


[4] Chen, H., Fuhlbrigge, T., & Li, X. (2008, August). Automated industrial robot path planning for spray painting process: a review. In 2008 IEEE International Conference on Automation Science and Engineering (pp. 522-527). IEEE.


[5] Hartmann, V. N., Orthey, A., Driess, D., Oguz, O. S., & Toussaint, M. (2022). Long-horizon multi-robot rearrangement planning for construction assembly. IEEE Transactions on Robotics, 39(1), 239-252.


[6] Xian, Z., Lertkultanon, P., & Pham, Q. C. (2017). Closed-chain manipulation of large objects by multi-arm robotic systems. IEEE Robotics and Automation Letters, 2(4), 1832-1839.


[7] Training-Efficient RL, Transcendent AI
