My First Week at Haptic: Making a 3B Robot Policy Faster

A one-week POC: fine-tuning Pi0.5 to 100% on LIBERO-Spatial, then distilling its 10-step action head down to a single forward pass with SnapFlow.

David Alves · 23 APR 2026
[Illustration: blueprint-style view of a 3B-parameter vision-language-action model being compressed: a large neural network on the left, an arrow labeled "distill" through a funnel, and a smaller, faster student model on the right running on an edge device.]

TL;DR

  • I just joined Haptic as CTO, and on day one my first project was the thing robotics teams keep asking us for: fine-tune and distill an open VLA so it fits their hardware.
  • Why robotics teams care. A frontier open VLA might run at 10 Hz on a rented A100 and at 1 Hz on the edge chip their robot actually ships with. Closing that distance, from "works on a datacenter GPU" to "works on a $2k edge box at real-time rates," is often the difference between a research result and a shippable robot.
  • I took a 3.3B-parameter open VLA model (Pi0.5, Physical Intelligence), fine-tuned it on LIBERO-Spatial, and distilled its 10-step flow-matching action head down to a single forward pass using the SnapFlow recipe.
  • Fine-tuning: 100% success on LIBERO-Spatial (20/20), beating the published ~97.5% baseline. Distillation: 95% at 1 inference step, equal to the 10-step teacher. Zero accuracy cost for a 10× reduction in denoising steps.
  • Building in the open. We memorialized all the pain points we hit for the community.
  • Next: shrinking the VLM backbone itself for edge deployment on a Jetson Orin Nano. The action-head distillation alone isn't enough when your target device is memory-bandwidth-bound.
  • We're always looking for more design partners!

01 — Why Make Robot Models Faster

Robotics teams sit in an awkward spot today. The best general-purpose policies (Pi0, Pi0.5, OpenVLA) are multi-billion-parameter VLAs that need a datacenter GPU to run at interactive rates. Smaller open alternatives like SmolVLA do exist at ~500M params, but at real cost to generality; if you want SOTA on an arbitrary embodiment, you're still running a multi-billion-parameter model. Actual robots live behind 8 GB edge chips, USB cables, and 20 Hz control loops. The gap between "works on an A100" and "works on a robot" is where most deployments die.

A quick primer, if you're not deep in VLAs

What is Pi0.5? A 3.3-billion-parameter foundation model for robot control, open-sourced by Physical Intelligence. Think of it as GPT for arm motions: feed it a camera image plus a natural-language instruction (put the mug on the plate), and it outputs the next couple of seconds of motion. Pretrained on an enormous, diverse mix of real-robot data, it already recognizes common objects, has a physical sense of basic verbs, and produces plausible trajectories. It's the closest thing the open community has to a general-purpose robot brain today.

What is LIBERO? The obstacle course we evaluate on. A simulated Franka arm doing short tabletop manipulation tasks (pick up the mug, move the bowl, put X on Y) across 40 tasks in four suites. Every serious VLA paper reports LIBERO numbers; it's the field's standard test kitchen. The published Pi0.5 checkpoint scores about 97.5% success on LIBERO-Spatial, the suite we focused on this week.

Why match the baseline before changing anything? Same principle as measure-twice-cut-once: if we rerun the published recipe and get the published number, then any later difference we see comes from our change, not from an off-by-one in how we built the test harness or fed the model an image at the wrong resolution. This step cost us a day and caught two silent bugs that would have poisoned the rest of the week.

What does fine-tune mean here? Pi0.5 is like a new-hire mechanical engineer: they already know how to use a screwdriver, what a bolt is, how to follow instructions. Fine-tuning is walking them through your factory's specific quirks: which screws go where, how your particular arm moves, what your task's conventions are. Technically, a small number of additional training updates on your task's data, adjusting a narrow slice of the model while keeping the general-knowledge backbone frozen. It's the difference between knows robotics and knows your robotics.
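In code, the "adjust a narrow slice, keep the backbone frozen" idea is only a few lines. A minimal PyTorch sketch with toy stand-in modules (the names `backbone` and `action_expert` are illustrative, not LeRobot's actual classes):

```python
import torch
import torch.nn as nn

# Toy stand-ins: in the real run, `backbone` plays the role of the 3B
# PaliGemma VLM and `action_expert` the 300M flow-matching head.
backbone = nn.Linear(8, 8)
action_expert = nn.Linear(8, 4)

# Freeze the backbone: no gradients, no optimizer state for it.
for p in backbone.parameters():
    p.requires_grad_(False)

# Only the expert's parameters are handed to the optimizer.
opt = torch.optim.AdamW(
    [p for p in action_expert.parameters() if p.requires_grad], lr=1e-4
)

x = torch.randn(2, 8)
loss = action_expert(backbone(x)).pow(2).mean()
loss.backward()  # gradients flow into the expert only
```

The payoff is practical: the optimizer state and gradient memory scale with the 300M expert, not the 3B backbone, which is what makes an 8-GPU fine-tune of a 3.3B model cheap.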

What is SnapFlow distillation? Out of the box, Pi0.5 produces each action by running 10 forward passes through its action head. It iterates toward an answer, similar in spirit to how image diffusion models work. Accurate but slow. SnapFlow is a training recipe that teaches the same model to produce the same answer in a single pass. Picture a painter who needs 10 brushstrokes to finish a portrait versus one who nails the same portrait in one stroke. Same quality, ~10× less compute per action.
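A toy numerical illustration of the 10-step vs. 1-step distinction (pure Python, not the real policy): we integrate a simple ODE, dx/dt = −x, whose endpoint we know exactly. The multi-step sampler takes 10 small Euler steps; the "student" makes one jump using the average velocity over the whole interval, the quantity a MeanFlow/SnapFlow-style head learns to predict.

```python
import math

def v(x, t):
    # Toy velocity field: dx/dt = -x, so the exact endpoint is x(0) * e^-1.
    return -x

def euler_sample(x0, steps):
    """Multi-step Euler integration, like the 10-NFE teacher."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

def one_step_sample(x0):
    """A 1-NFE 'student': one call that jumps to the endpoint using the
    average velocity over [0, 1], i.e. (x(1) - x(0)) / 1."""
    avg_velocity = x0 * (math.exp(-1.0) - 1.0)
    return x0 + avg_velocity

x0 = 1.0
print(euler_sample(x0, 10))   # ~0.349: close to the exact 0.3679, but biased
print(one_step_sample(x0))    # 0.3679: same endpoint, one function call
```

The point of the toy: if a network can be trained to output the average velocity instead of the instantaneous one, the whole iterative loop collapses into a single evaluation with no integration error.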

Two levers, one pitch

There are two ways to close the A100-to-robot gap:

  1. Make the policy smaller (distillation into a specialized student, smaller backbones, quantization).
  2. Make each inference cheaper (fewer denoising steps, architecture compression, layer pruning).

Haptic's thesis is that a system that does this repeatedly and reliably — removing the hidden engineering tax between papers and robots — is worth building. The pitch, if it works: you hand us a robot, a task, and a dataset; we hand back a smaller, faster, cheaper model that runs on your actual hardware.

Before we sell anything, we needed to prove to ourselves that the pipeline works end-to-end. So my first week was a one-person, one-week POC to make the claim stand on its own.

02 — Prior Art (and Who We're Building On)

None of the ingredients are original. The bet is on whether one team can wire them together reliably enough to sell.

  • Pi0.5 (Physical Intelligence, 2504.16054). The open VLA we start from. PaliGemma-3B backbone plus 300M flow-matching action expert. Apache-2.0. State-of-the-art open weights on LIBERO.
  • SnapFlow (Luan, Li, Zhao, Zhang, Wu, Ma, 2604.05656). Self-distillation recipe that compresses 10-step Euler flow matching into a single forward pass. Published three weeks ago. The paper's core claim (1-NFE matches 10-NFE on LIBERO) is what we validated independently, on our hardware, on our fine-tune.
  • LIBERO (Liu et al, 2023). The benchmark. 40 tasks across 4 suites on a Franka arm in MuJoCo. Every published VLA reports LIBERO numbers, so every claim about distillation quality has an external comparison point.
  • LeRobot v0.5 (HuggingFace). The training and evaluation stack. Pi0.5 is upstreamed. We use lerobot-train and lerobot-eval end-to-end with monkey-patches for the SnapFlow-specific bits.
  • Flow matching foundations: Lipman et al, 2023; rectified flow, Liu et al, 2209.03003; the consistency-model lineage (Song et al, 2023; MeanFlow, Geng et al, 2025) which SnapFlow extends.

If any of these authors have thoughts on applying their work to production robotics deployments, we'd love to hear them.

03 — What We Did

The plan was three phases across three hardware tiers:

| Phase | Hardware | Wall-clock |
| --- | --- | --- |
| Reproduce the published Pi0.5 baseline | DGX Spark (GB10 Blackwell, on desk) | 14 min eval |
| Fine-tune Pi0.5 on LIBERO-Spatial | 8× A100-80G (Azure) | ~3.6 h train + eval |
| SnapFlow distillation from teacher to 1-step student | 1× A100-80G | ~20 min train + 7 min eval |

Phase 1, baseline reproduction. The published lerobot/pi05-libero checkpoint ran at 95% success on LIBERO-Spatial (19/20) on Spark, matching the published number within sampling noise. This confirms our harness is sane before we change anything.

Phase 2, fine-tuning. Starting from lerobot/pi05_libero_base (the pretrained, not-yet-task-adapted checkpoint), we ran LeRobot's train_expert_only=true recipe on 8× A100, freezing the 3B PaliGemma backbone and fine-tuning only the 300M action expert. The training run hit 95% at step 2000 (~1.8 wall-clock hours) and coasted up to 100% at step 4000 (~3.6 hours). Beats the published baseline; costs about two-thirds of the published recipe.

Phase 3, distillation. Using the published lerobot/pi05-libero as teacher (not our own FT, to avoid coupling distillation results to our fine-tune), we added SnapFlow's zero-initialized target-time projection head to the action expert and trained the mixed flow-matching plus shortcut loss from the paper. 1000 steps at batch size 4 on a single A100, 20 minutes of wall-clock. At 1 inference step, the student is functionally identical to the teacher under our eval protocol.
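For intuition, here is a toy sketch of the shape of that mixed objective, pieced together from the shortcut-model lineage the paper builds on. The module, the zero-initialized target-time projection, and the half-step bootstrapping below are our paraphrase under stated assumptions, not SnapFlow's actual code:

```python
import torch
import torch.nn as nn

class ToyExpert(nn.Module):
    """Toy 'action expert' predicting velocity given (x, t, target_t).
    The extra target-time input goes through a zero-initialized projection,
    so at init the model reproduces its ordinary flow-matching output."""
    def __init__(self, dim=4):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)
        self.target_proj = nn.Linear(1, dim)
        nn.init.zeros_(self.target_proj.weight)  # zero-init: a no-op at start
        nn.init.zeros_(self.target_proj.bias)

    def forward(self, x, t, target_t):
        return self.net(torch.cat([x, t], dim=-1)) + self.target_proj(target_t)

model = ToyExpert()
x0, x1 = torch.randn(8, 4), torch.randn(8, 4)  # noise, data
t = torch.rand(8, 1) * 0.5                     # keep t + d inside [0, 1]
xt = (1 - t) * x0 + t * x1                     # linear interpolation

# Loss 1: ordinary flow matching (target_t == t, the teacher objective).
fm_loss = (model(xt, t, t) - (x1 - x0)).pow(2).mean()

# Loss 2: shortcut self-consistency: one big jump of size d should match
# two half-jumps taken with the current model (stop-gradient target).
d = torch.full_like(t, 0.5)
with torch.no_grad():
    v1 = model(xt, t, t + d / 2)
    x_mid = xt + (d / 2) * v1
    v2 = model(x_mid, t + d / 2, t + d)
    target_v = (v1 + v2) / 2
shortcut_loss = (model(xt, t, t + d) - target_v).pow(2).mean()

loss = fm_loss + shortcut_loss
```

The zero-init matters operationally: at step 0 the student is exactly the teacher, so the distillation run starts from 95% success rather than from scratch, which is why 1000 steps on one A100 suffices.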

04 — Pain Points (the Install Layer)

As part of building in the open, we always like to share our pain points.

A week-long project is not seven days of machine learning. It's about three days of ML and four days of plumbing. The plumbing bugs, in order of severity:

  1. Silently redirected checkpoint. lerobot/pi05_libero_finetuned HTTP-307-redirects to lerobot/pi05_libero_finetuned_v044, a checkpoint explicitly marked incompatible with LeRobot v0.5. The published docs still point at the old name. Cost: ~36 hours of "why is my eval returning 0%?" debugging.
  2. transformers>=5.4 silently breaks Pi0.5. LeRobot's dependency constraint is transformers>=5.3.0,<6.0.0, which uv naturally resolves to the latest compatible (5.5.4 at the time). That version's vision-tower embedding change doesn't error, it just makes all rollouts fail at 0%. Fix: pin transformers==5.3.0. Findable only via a closed GitHub issue.
  3. MuJoCo needs MUJOCO_GL=egl on Linux. Not in the install docs. Symptom: CUDA device visible, environment constructs, rollouts produce no action. Silently.
  4. LeRobot's --resume=true has two latent bugs. Optimizer preset is skipped when resuming, and checkpoint_path is not auto-populated from output_dir. Both are one-line fixes; I'll file a PR upstream. Customer-facing implication: plan never to resume, finish in one go or re-run from scratch.
  5. Mid-training eval triggers NCCL watchdog. MuJoCo rollouts on one rank can exceed NCCL's default 480 s collective timeout when eval runs at step N; the watchdog tears the whole DDP cluster down. Bumping TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC to 3600 helped but didn't fully solve it. Real fix is probably to run eval only on rank 0 behind a no-op all-reduce for the rest.
  6. rsync --delete-excluded destroyed an overnight training run. The flag does exactly what it says, deletes destination paths matching the exclude list, but the name evokes skip in most people's muscle memory. Lost a 5-hour SnapFlow checkpoint to this. Recovered by re-running on A100 in 20 minutes and getting identical numbers. Replaced the raw rsync invocation with a two-mode wrapper script (push and pull) that does not pass that flag.
  7. Azure A100 instances put the OS on a small root partition. ~/.cache/huggingface (17 GB for Pi0.5 weights alone) fills it within minutes. Fix: symlink to /mnt/. Obvious in retrospect; not called out in LeRobot docs.
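If you're reproducing this, the three environment-level fixes (items 2, 3, and 7) boil down to a few lines. Paths are from our Azure boxes; adjust to yours:

```shell
# (2) Pin transformers below the version whose vision-tower change
#     silently breaks Pi0.5 rollouts.
pip install "transformers==5.3.0"

# (3) Headless MuJoCo rendering on Linux needs the EGL backend.
export MUJOCO_GL=egl

# (7) Keep the 17 GB HuggingFace cache off the small root partition.
mkdir -p /mnt/hf-cache
export HF_HOME=/mnt/hf-cache
```

Put the exports in the shell profile of every training box; all three failures are silent, so you won't get a second warning.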

None of these are hard problems once you know about them. The difficulty is knowing they exist. Surfacing this tax is part of what the pipeline takes on.

05 — Results

Everything backed by eval_info.json artifacts committed to our repo.

Fine-tuning (LIBERO-Spatial, 20 closed-loop episodes per gate):

| Checkpoint | Success | Notes |
| --- | --- | --- |
| lerobot/pi05-libero (published) | 95.0% (19/20) | baseline reproduction |
| Our FT @ step 2000 | 95.0% (19/20) | matches published at 1/3 of the published training budget |
| Our FT @ step 4000 | 100.0% (20/20) | +2.5 pp over the published ~97.5% baseline |

Small sample size (20 episodes per gate); we expect variance across seeds.

SnapFlow distillation (teacher = lerobot/pi05-libero, same 20-episode protocol):

| Inference steps | Success | Eval wall-clock |
| --- | --- | --- |
| 10 (Euler) | 95.0% (19/20) | 220 s on A100 |
| 1 (SnapFlow shortcut) | 95.0% (19/20) | 170 s on A100 |

The student at 1-NFE and the student at 10-NFE miss the same single episode (task 7, episode 1). Zero accuracy delta for a 10× reduction in action-expert forward passes.

End-to-end latency (same distilled student, three hardware tiers):

| Hardware | 10-NFE | 1-NFE | Speedup | VLM prefix (derived) |
| --- | --- | --- | --- | --- |
| A800-80G (SnapFlow paper) | 274 ms | 83 ms | 3.3× | ~60 ms |
| A100-80G (ours) | 361 ms | 107 ms | 3.37× | ~79 ms |
| DGX Spark GB10 (ours) | 1,299 ms | 798 ms | 1.63× | ~742 ms |

The SnapFlow method reproduces at datacenter-GPU bandwidth tiers (A800 and A100 give matching E2E speedups). It partially reproduces on Spark: the action-head compression still happens, but the VLM prefix comes to dominate the forward and the E2E speedup is cut roughly in half. This is consistent with the VLM forward being bandwidth-bound at batch size 1: A100's HBM2e is ~15× faster than Spark's LPDDR5X, which would scale the VLM prefix nearly linearly.
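The "VLM prefix (derived)" column comes from a simple linear decomposition of the two measured latencies, assuming E2E(n) = prefix + n × step_cost (a simplifying model; it ignores any per-call overhead that varies with step count):

```python
def decompose(e2e_10nfe_ms, e2e_1nfe_ms):
    """Back out the fixed VLM-prefix cost and the per-step action-expert
    cost from the 10-NFE and 1-NFE end-to-end latencies."""
    step_cost = (e2e_10nfe_ms - e2e_1nfe_ms) / 9.0  # 9 extra steps
    prefix = e2e_1nfe_ms - step_cost                 # 1-NFE = prefix + 1 step
    return prefix, step_cost

for name, t10, t1 in [("A100", 361, 107), ("Spark", 1299, 798)]:
    prefix, step = decompose(t10, t1)
    print(f"{name}: VLM prefix ~{prefix:.0f} ms, per-step ~{step:.0f} ms")
```

On these numbers the prefix ratio comes out to ~742/79 ≈ 9.4×, broadly tracking (though somewhat below) the ~15× memory-bandwidth gap between HBM2e and LPDDR5X.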

The operational consequence: on edge hardware, compressing the VLM is the latency lever. Action-head distillation alone doesn't get a 3B-parameter policy to interactive rates on memory-bound silicon.

06 — Pain Points & Opportunities

We're all about building in public, so here is a list of pain points we encountered and are looking into. As always, we want people to fix them, do them better than us, and grow the ecosystem.

Pain points in physical AI today

  • The "works on A100, fails on robot" gap is where deployments die. A frontier VLA runs at 10 Hz on a datacenter GPU and 1 Hz on the edge chip the robot actually ships with.
  • Open VLAs are too big for real hardware. Pi0.5 is 3.3B parameters. SmolVLA is ~500M but gives up generality. No "just right" tier exists.
  • Memory bandwidth, not FLOPs, is the edge bottleneck. At batch size 1, the VLM forward is bandwidth-bound. A100's HBM2e is ~15× faster than Spark's LPDDR5X, and that scales the VLM prefix nearly linearly.
  • Action-head distillation is solved; backbone compression isn't. SnapFlow gives a 10× reduction on the action head. On memory-bound silicon the VLM prefix dominates, so end-to-end speedup collapses (3.37× on A100, 1.63× on Spark).
  • The install layer eats half of every project. A week of VLA work is ~3 days of ML and ~4 days of plumbing. Silent checkpoint redirects, dependency resolution landing on broken minor versions, undocumented env vars, latent bugs in --resume.

Opportunities for physical AI

  • Shrink the VLM backbone, not just the action head. Cross-arch distillation from a 3B teacher into a ~650M student (Gemma-270M + SigLIP-B/16) plus INT8 quantization, deployed on a Jetson Orin Nano. Haptic's week-two work.
  • Automated re-distillation pipelines. Every time a new SOTA VLA drops, every robotics team reruns fine-tune + distill + quantize + deploy. Standardize it.
  • Upstream fixes to LeRobot and friends. The --resume bugs, the DDP-eval NCCL watchdog issue, the MUJOCO_GL docs gap. One-line fixes that every user currently re-discovers.
  • Task-specialist distillation. Most customers don't need a generalist. They need one task reliably and cheaply on their specific hardware.

07 — What's Next

Week two's goal (in progress) is shrinking the VLM. A Pi0.5-variant student with a trimmed PaliGemma backbone (Gemma-270M plus SigLIP-B/16 instead of Gemma-2.6B plus SigLIP-400M, ~650M total params) distilled cross-arch from our 100%-FT teacher, quantized to INT8, and deployed on a Jetson Orin Nano that arrives Friday. If that lands within target accuracy after quantization, the "fits a real robot's budget" gate closes.
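As a rough sketch of the quantization step: PyTorch's dynamic INT8 API applied to a linear-heavy toy module. This is an illustration of the mechanics, not our deployment path; the real Orin Nano pipeline may go through TensorRT or static quantization instead:

```python
import torch
import torch.nn as nn

# Toy linear-heavy module standing in for the ~650M student's MLP blocks.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 16))

# Dynamic quantization: weights stored as INT8, activations quantized on
# the fly at inference time. Only the nn.Linear layers are converted.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
y = qmodel(x)  # same interface, roughly 4x smaller linear weights
```

Dynamic quantization is the low-effort baseline; on bandwidth-bound silicon the INT8 weight traffic alone is where most of the win comes from, which is exactly the regime the Spark numbers above put us in.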

We're always looking for more design partners! If you have a VLA that works on an A100+ but isn't ready for your robot, a customer task you want to specialize a generalist for and ship to real hardware, or if you simply want to have an automated fine-tuning/distilling pipeline that runs whenever a new SOTA VLA model pops up, let's talk.


Thanks to the authors of Pi0.5, SnapFlow, LIBERO, and LeRobot v0.5 for the open weights, reproducible recipes, and fast issue responses. Most of the week was spent building on your work.