RFMs in Warehouses: Jacks-of-all-Trades or Aces?

Phase 0: Testing RFMs in Simulated Production Environments of Warehouse Induction Robots

If you’re at all like the AIML team here at Plus One Robotics, you’ve been following the excitingly rapid advances of large, deep, generative neural networks applied to speech, text, and images.  LLMs (large language models) and their relatives have been making astonishing progress in multiple domains, and it’s quite easy these days to generate high-quality text, images, and even videos.

But generating similarly high-fidelity real-world behaviors, i.e., robot motions, is still elusive.  In fact, ChatGPT sums up the difficulties as follows:

“Bringing LLMs to robotics means closing a huge semantic gap between language and action, and dealing with the physical, temporal, and safety constraints of the real world. It’s a deep multidisciplinary challenge involving NLP, robotics, control theory, computer vision, and human-computer interaction.”

Many projects across academia and industry are focused on bridging this gap by collecting massive amounts of real-world data, exploring simulation as a shortcut, and devising new architectures to support real-time robot behavior.

The promise of these Vision-Language-Action (VLA) models, a prominent family of robot foundation models (RFMs), is one, mostly, of generalizability.  The end goal is a robot (often humanoid) that can accomplish multiple different tasks in a variety of environments, such as cleaning Airbnbs it’s never seen before.  An added advantage is that it can do so under the direction of a non-expert human, using natural language.  These robots promise to be a sort of ‘jack of all trades’, able to help a variety of different people do a variety of different things, in a variety of different environments.  A true general-purpose robot.

However, what if you only need to do one thing, in one specific environment?

It’s a common saying in this field that “machine learning is the second best way to do anything,” implying that engineered, specialized approaches outperform learned ones.  But is that really true?  And, even if it is, does it matter?  That is, even if ML-based techniques aren’t as good as engineered ones (for some task metric), it’s quite possible that they’re good enough, and possibly better in other ways (cheaper, easier to maintain, etc.).  And, of course, there’s the possibility that ML-based techniques are actually better at performing the desired task than anything we can engineer, given the wide variation of inputs inherent in the real world.

Here at Plus One Robotics, we’re arguably one of the leaders in high-speed dual-arm parcel induction for warehouses.  Our InductOne system routinely achieves 2,500-3,200 inducted items per hour.  And this task itself is of broad interest: it is one of the test tasks for some of the ongoing RFM work, which has demonstrated impressive performance in package manipulation (flattening and flipping), as well as a healthy 875 parcels per hour.

These are clearly different systems, and a direct comparison between them on a single metric is hardly fair.  A generalizable humanoid and a specialized induction cell promise very different things, and fit into different markets.  However, we’re curious if the technology behind the former can be applied to the latter and, in some way, improve it.

So, our question is: “Can we apply robot foundation model technology to our specialized system and outperform our custom software?”

As discussed, there are multiple ways to measure performance, so we’ll be considering a few different metrics:

  1. Task performance.  In our warehouse applications, the main metric our clients care about is throughput: how many packages are successfully processed in a unit of time (see the sketch after this list).
  2. Entitlement.  The other major metric in our domain is how many different sorts of packages the system can handle.  We currently use an adaptive ML approach with HITL (human-in-the-loop) learning to detect novel objects and adjust the system’s autonomy.  We can apply the same technique to RFM-based approaches, but it may not work as well given the size of the models.  This metric is akin to generalizing over different parcels.
  3. Robustness.  As with most things in the physical world, robot behaviors do not always go according to plan.  How well the system recovers from failures (dropped packages, missed picks, multiple items stuck together, etc.) is important.  Sometimes, the correct recovery is to reject the problem and let a human handle it later.  This metric is akin to generalizing over different scenarios.
  4. Applicability.  This is true generalization, across different applications.  How easily can the system be applied beyond induction to other tasks in the warehouse, such as depal, truck unloading, etc.?
  5. Setup & Maintenance.  Any system in production needs to be set up, configured, and maintained, which takes time and skill.  The easier the system is to deploy, the better.  Similarly, the less code and complexity in the system, the better.
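To make the first two metrics concrete, here’s a minimal sketch (in Python) of how an evaluation run might tally them.  All names and structures here are illustrative assumptions, not our actual evaluation code:

```python
from dataclasses import dataclass, field

@dataclass
class InductionEvalLog:
    """Illustrative tally of events from a timed evaluation run."""
    duration_hours: float = 1.0
    successes: int = 0
    attempts_by_category: dict = field(default_factory=dict)   # e.g. {"poly bag": 120}
    successes_by_category: dict = field(default_factory=dict)

    def throughput(self) -> float:
        """Metric 1: successfully inducted parcels per hour."""
        return self.successes / self.duration_hours

    def entitlement(self, bar: float = 0.95) -> list:
        """Metric 2: parcel categories handled at or above a success-rate bar."""
        return [
            cat for cat, n in self.attempts_by_category.items()
            if n > 0 and self.successes_by_category.get(cat, 0) / n >= bar
        ]
```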

We’re currently finishing up the groundwork needed to research this question.  We have a simulated environment reminiscent of warehouse induction, and a hand-coded policy to perform the induction task.  We’ve made some simplifications, such as focusing on a single arm, in order to speed up our iterations.  You can see the simulator in action below.
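For a sense of what a hand-coded policy like this looks like, here’s a rough sketch: a small state machine over pick-and-place phases.  The interfaces (the IK solver, pose offsets, arrival checks) are hypothetical placeholders, not our production API:

```python
from enum import Enum, auto

class Phase(Enum):
    APPROACH = auto()   # move to a hover pose above the target parcel
    DESCEND = auto()    # lower until the suction cup contacts the parcel
    LIFT = auto()       # suction on, raise the parcel clear of the pile
    TRANSFER = auto()   # carry the parcel over the induct point
    RELEASE = auto()    # suction off, deposit onto the sorter

class ScriptedInductionPolicy:
    """Minimal scripted pick-and-place policy (illustrative only)."""

    def __init__(self, ik_solver, induct_pose):
        self.ik = ik_solver          # assumed: maps a Cartesian pose to joint angles
        self.induct_pose = induct_pose
        self.phase = Phase.APPROACH

    def act(self, parcel_pose, arm_state):
        """Return (joint_target, suction_on) for one control step."""
        if self.phase is Phase.APPROACH:
            target, suction = parcel_pose.offset(z=0.10), False  # hover 10 cm up
        elif self.phase is Phase.DESCEND:
            target, suction = parcel_pose, False
        elif self.phase is Phase.LIFT:
            target, suction = parcel_pose.offset(z=0.20), True
        elif self.phase is Phase.TRANSFER:
            target, suction = self.induct_pose, True
        else:  # Phase.RELEASE
            target, suction = self.induct_pose, False
        if arm_state.at(target):     # advance to the next phase on arrival
            self.phase = Phase(self.phase.value % len(Phase) + 1)
        return self.ik.solve(target), suction
```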

Our experiments also extend to running multiple simulation environments simultaneously, so we can collect the large amounts of data required to train these massive foundation models much faster.
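As a sketch of the pattern (assuming, hypothetically, that our simulator is wrapped and registered as a Gymnasium environment called “InductionSim-v0”), parallel data collection might look like this:

```python
import gymnasium as gym

NUM_ENVS = 16  # number of parallel simulator instances

# "InductionSim-v0" is a hypothetical env id, standing in for the real simulator.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("InductionSim-v0") for _ in range(NUM_ENVS)]
)

obs, info = envs.reset(seed=0)
dataset = []  # (observation, action) pairs for later foundation-model training

for _ in range(1_000):
    # In practice the actions would come from the scripted policy above;
    # random sampling stands in for it here.
    actions = envs.action_space.sample()
    next_obs, rewards, terminated, truncated, infos = envs.step(actions)
    dataset.append((obs, actions))
    obs = next_obs

envs.close()
```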

This simulator generates the same sorts of data that we collect from our real systems, namely RGB images and arm joint angles, and accepts the same outputs we generate: joint trajectories and suction-cup control (sketched below).  Its purpose is to test our training/evaluation framework and provide a lightweight means to do preliminary comparisons between different RFM architectures and trained models.  It is not an attempt to do sim2real transfer.  Once we’ve identified some promising RFMs, we’ll train, test, and compare them on our real robot systems.
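For reference, that interface maps naturally onto observation and action spaces like the following; the exact shapes, names, and joint count are illustrative assumptions:

```python
import numpy as np
from gymnasium import spaces

NUM_JOINTS = 6  # assumed joint count for the single-arm setup

# Inputs the model sees: what we record from the real cells.
observation_space = spaces.Dict({
    "rgb": spaces.Box(0, 255, shape=(480, 640, 3), dtype=np.uint8),   # workspace camera
    "joint_angles": spaces.Box(-np.pi, np.pi, shape=(NUM_JOINTS,), dtype=np.float32),
})

# Outputs the model produces: what our stack sends to the real cells.
action_space = spaces.Dict({
    "joint_targets": spaces.Box(-np.pi, np.pi, shape=(NUM_JOINTS,), dtype=np.float32),
    "suction_on": spaces.Discrete(2),  # vacuum gripper off/on
})
```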

Feel free to follow along as we attempt to apply this exciting research to our demanding production domain.  I’m sure we’ll all learn something on this journey!

About the Author:
Dr. Grollman is a Staff Engineer at Plus One Robotics, where the motto is “Robots Work, People Rule.” With over a decade of experience in productizing Human-Robot Interaction (Vecna Technologies, Sphero Inc, Misty Robotics), his work centers on applying academic research to commercial problems. At Plus One Robotics, his focus is building systems that enable robots and humans to continuously learn from each other. A self-described ‘full stack roboticist,’ Dr. Grollman was a postdoctoral fellow at EPFL, received his Ph.D. and Sc.M. in Computer Science from Brown University, and his B.S. in Electrical Engineering and Computer Science from Yale.