Generalist vs Specialist: Testing AI Models in the Warehouse

What happens when you take cutting-edge robotic AI models and drop them into a real-world problem?

We’re trying to find out.

Our team is building AI systems for high-speed parcel induction: picking up packages and placing them into bins, fast. The task itself is simple and repeatable. But the real world is not. Lighting shifts. Glossy parcels throw glare. Camera angles vary between warehouse installs. No two deployments are ever quite the same.

That’s why we’re comparing two types of AI models:

  • Specialists: trained from scratch on our exact task and setup. Precise, but fixed.
  • Generalists: large pretrained "foundation" models meant to generalize across tasks and robots with minimal fine-tuning. Flexible, but unproven.

The question isn’t just which performs better in simulation. It’s:
Which models will actually hold up when things get messy?

The Models: Foundation vs Focus

This phase of testing compares two types of robotic AI models—each representing a different path to deployment.

Generalist Models (Robotic Foundation Models)

Generalist models are pretrained across many environments, tasks, and robot types. They’re designed to generalize—to adapt quickly to new deployments with minimal additional training. The promise is faster scalability: rather than training a new model from scratch for every warehouse or robot variation, a generalist can be fine-tuned on a small amount of data and reused.

In theory, this approach reduces long-term engineering effort. If fine-tuning is effective, these models could enable faster rollouts, easier retraining, and simpler maintenance as new cells come online.

Specialist Models

Specialists are trained entirely from scratch using only data from our specific task and hardware setup. They carry no priors from other tasks or robots and rely only on what they’ve seen during training. That makes them well-matched to their environment, with the potential for higher precision and consistency.

This approach often involves more upfront effort, but it can yield strong performance in structured settings where the conditions are known and repeatable.

Why This Comparison Matters

Our broader goal is to identify which models offer the best tradeoff between performance and long-term flexibility. We’re not just comparing accuracy—we’re evaluating how each type of model fits into the practical realities of warehouse deployment:

  • Can we adapt a model across different robots or camera setups without full retraining?
  • How much data and engineering effort does each approach require over time?
  • What scales with us as we grow?

By testing generalists and specialists side-by-side, we aim to better understand where each excels—and how to build systems that are both robust and maintainable.

Model Intuition and Behavior

So far, we’ve tested four models in simulation:

  • ACT (trained from scratch)
  • VQ-BET (trained from scratch)
  • Diffusion Policy (trained from scratch)
  • Smol-VLA (tested both from scratch and fine-tuned)

Here’s how they behave—and what those differences mean in practice.

ACT (Action Chunking with Transformers)

Plans motion in segments, like a human breaking a task into steps. This structure leads to smooth, deliberate movement. In sim, ACT was the most successful model overall—fast inference, precise picks, and reliable control.
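To make the chunking idea concrete, here’s a minimal sketch (illustrative only, not our training or deployment code) of how a chunked policy is executed: one inference call predicts a short horizon of actions, and the controller plays them out before querying the policy again. The `env`, `policy`, and chunk size here are placeholders.

```python
# Minimal sketch of action-chunked control (illustrative; `env` and `policy`
# are placeholders, and real ACT can also ensemble overlapping chunks).
import numpy as np

CHUNK_SIZE = 8  # actions predicted per inference call (illustrative value)

def run_episode(env, policy, max_steps=200):
    obs = env.reset()
    for _ in range(0, max_steps, CHUNK_SIZE):
        # One forward pass yields a whole chunk of future actions.
        action_chunk = policy.predict_chunk(obs)   # shape: (CHUNK_SIZE, action_dim)
        for action in action_chunk:
            obs, done = env.step(action)
            if done:
                return True
    return False
```

Because the policy is queried once per chunk rather than once per step, inference cost stays low and the resulting motion tends to be smooth rather than jittery.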

VQ-BET (Vector-Quantized Behavior Transformer)

Discretizes robot actions into a learned “vocabulary” of behaviors. This results in compact, efficient policies. Despite being limited to a single camera view due to its architecture, it performed consistently across test conditions, showing a useful degree of input flexibility.
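The quantization idea is easier to see in code. Here’s an illustrative sketch, not the actual VQ-BET implementation (which learns its codebook, e.g. with residual vector quantization, and adds a continuous correction on top): a continuous action is snapped to its nearest codebook entry, so the policy only has to choose a discrete code.

```python
# Illustrative sketch of vector-quantizing actions (not the real VQ-BET code).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 7))  # 32 "behavior words" for 7-DoF actions (made-up sizes)

def quantize(action):
    """Return the index of the nearest codebook entry and its reconstruction."""
    distances = np.linalg.norm(codebook - action, axis=1)
    idx = int(np.argmin(distances))
    return idx, codebook[idx]

action = rng.normal(size=7)              # a continuous robot action
code, reconstruction = quantize(action)  # the policy predicts `code`; the robot runs the decoded action
```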

Diffusion Policy

Samples trajectories by refining noise into motion over time. It’s powerful and expressive but slower and less reliable in this setting. Without stronger constraints, its sampling process didn’t consistently align with the demands of the task.
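For intuition, here is a heavily simplified sketch of the sampling loop (conceptual only; a real diffusion policy uses a learned noise-prediction network and a properly derived noise schedule): start from Gaussian noise over a short action trajectory and refine it step by step. `denoiser` is a placeholder for the learned network.

```python
# Conceptual sketch of diffusion-style action sampling (not a real scheduler).
import numpy as np

def sample_trajectory(denoiser, obs, horizon=16, action_dim=7, steps=50):
    traj = np.random.randn(horizon, action_dim)          # start from pure noise
    for t in reversed(range(steps)):
        # The network estimates the noise present at level t, conditioned on the
        # observation; a simplified update nudges the trajectory toward the data
        # distribution. Real schedulers use carefully derived coefficients.
        predicted_noise = denoiser(traj, obs, t)
        traj = traj - (1.0 / steps) * predicted_noise
    return traj                                          # a denoised action trajectory
```

Each refinement step costs a forward pass, which is part of why diffusion policies tend to be slower at inference time than single-pass models like ACT.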

Smol-VLA (Vision-Language-Action)

We tested Smol-VLA both from scratch and fine-tuned. Neither version succeeded, but fine-tuning improved behavior: the model moved toward the correct area and sometimes aligned with the package, but never completed a pick. The scratch-trained version rarely left the home pose.

Our Setup

These results come from a controlled simulation environment built using the Genesis platform (see the Jack of All Trades Blog for details). The goal wasn’t to replicate the messiness of the real world—just to establish a clean, consistent baseline.

The scenario is simple: two parcels spawn in a pickup zone. The robot sees the workspace and must move one parcel to a fixed drop location.

We collected over 40,000 demonstrations using a three-camera setup:

  • A wrist-mounted end-effector camera
  • A top-down camera
  • A third external view for context

Most models, including ACT and Diffusion Policy, used all three views. VQ-BET used only one, due to architectural constraints, but still performed well. That tolerance for a reduced camera setup is a promising sign.

Only about 13,000 demonstrations were usable—the rest were discarded after we discovered that the initial camera placement made the task visually ambiguous. Once the framing was corrected, training stabilized.
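For readers curious what one of those demonstrations looks like, here is a hypothetical sketch of a single record and the filtering step. The field names and layout labels are illustrative; they are not our actual data schema.

```python
# Hypothetical shape of one demonstration record (field names are illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class Demonstration:
    wrist_rgb: np.ndarray      # (T, H, W, 3) wrist-mounted end-effector camera
    top_rgb: np.ndarray        # (T, H, W, 3) top-down camera
    context_rgb: np.ndarray    # (T, H, W, 3) third external view for context
    joint_states: np.ndarray   # (T, dof) robot joint positions
    actions: np.ndarray        # (T, dof) commanded actions
    camera_layout: str         # e.g. "initial" vs. "corrected" framing (illustrative labels)

def usable(demos):
    # Keep only demos recorded after the camera framing was corrected.
    return [d for d in demos if d.camera_layout == "corrected"]
```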

What’s Next: From Sim to Reality

So far, we’ve evaluated only four models in simulation. The remaining generalist models are next:

  • OpenHelix
  • Pi-0
  • GR00T
  • Gemini

Each of these is built with generalization in mind—from flow matching to dense attention to real-time inference. We’re converting the dataset and preparing to fine-tune them in simulation using the same Genesis-based task.

But the real test comes after that.

Once we identify the most promising models—whether generalist or specialist—we’ll retrain them on real-world data collected from operational warehouse cells. That’s where we’ll introduce the full complexity of production:

  • Lighting variation
  • Occlusion and sensor drift
  • Unseen packages and clutter
  • Camera misalignment
  • Actuation delay and hardware noise

That’s when we’ll find out which models truly adapt, and which ones require tight environmental assumptions to succeed.

Why It Matters

If generalist models adapt well, they could enable faster deployment, simpler maintenance, and lower tech debt as our system scales. But if specialists continue to outperform, we may need tighter pipelines—more retraining, but more control.

Our goal isn’t just top-line performance. We’re measuring:

  • How easily models adapt to new setups
  • How much tuning they require
  • How robust they are to operational variation
  • And ultimately: which ones are easiest to maintain and scale

We'll keep sharing what we learn—through analysis, videos, and real-world validation.

Next up: more generalist evaluations, real-world retraining, and answers to the question we started with:

Which models will actually hold up when it counts?

About the Author:
Gilberto Briscoe-Martinez is a PhD student at the University of Colorado - Boulder. His work investigates how learned models can adapt to varying morphologies, particularly as robots accumulate wear and tear over time. He interned at Plus One Robotics in the summer of 2025, when he carried out the work described here.