PaLM-E: An Embodied Multimodal Language Model
Pith reviewed 2026-05-10 22:25 UTC · model grok-4.3
The pith
One large model can plan robotic actions, answer visual questions, and caption images across different robot bodies by interleaving sensor data with language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose embodied language models that directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation
What carries the argument
Multi-modal sentences that interleave visual, continuous state estimation, and textual encodings, trained end-to-end with a pre-trained language model.
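To make the interleaving idea concrete, here is a minimal sketch of how text tokens, image patch features, and a state vector might be mapped into one shared embedding sequence. Everything in it is an assumption for illustration: the tiny vocabulary, the embedding width `EMBED_DIM`, the random projection matrices `W_img` and `W_state`, and the function `build_multimodal_sentence` are hypothetical stand-ins for trained encoders, not the paper's implementation.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical embedding width, chosen small for readability

rng = np.random.default_rng(0)

# Hypothetical lookup table standing in for a language model's token embeddings.
vocab = {"Pick": 0, "up": 1, "the": 2, "block": 3, "at": 4}
token_table = rng.normal(size=(len(vocab), EMBED_DIM))

# Hypothetical linear projections standing in for trained encoders.
W_img = rng.normal(size=(16, EMBED_DIM))   # 16-d patch feature -> embedding
W_state = rng.normal(size=(3, EMBED_DIM))  # 3-d state vector -> embedding


def build_multimodal_sentence(segments):
    """Embed each (kind, payload) segment and concatenate them in input order."""
    rows = []
    for kind, payload in segments:
        if kind == "text":
            ids = [vocab[word] for word in payload.split()]
            rows.append(token_table[ids])                # one row per word
        elif kind == "image":
            rows.append(np.asarray(payload) @ W_img)     # one row per patch
        elif kind == "state":
            rows.append(np.asarray(payload)[None, :] @ W_state)  # a single row
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return np.concatenate(rows, axis=0)


patches = rng.normal(size=(4, 16))  # four made-up patch features
sentence = build_multimodal_sentence([
    ("text", "Pick up the block"),
    ("image", patches),
    ("text", "at"),
    ("state", [0.1, 0.2, 0.3]),
])
print(sentence.shape)  # (10, 8): 4 words + 4 patches + 1 word + 1 state row
```

The point of the sketch is only that all three modalities end up as rows of the same width, so a sequence model can consume them in the order they appear in the prompt.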
If this is right
- Can perform sequential robotic manipulation planning from varied observation modalities.
- Solves visual question answering and captioning as part of the same model.
- Shows positive transfer when trained jointly on internet-scale language, vision, and embodied data.
- Larger versions retain general language capabilities while reaching state-of-the-art on visual-language benchmarks like OK-VQA.
- Works on multiple different robot embodiments without task-specific redesign.
Where Pith is reading between the lines
- The same model could potentially interpret natural-language instructions while continuously updating its internal state from live sensors.
- Joint training on internet data may allow embodied systems to improve by scaling rather than by hand-designing new modules for each domain.
- The interleaving approach might extend to other continuous signals such as audio or force feedback in future embodiments.
- Physical deployment on unstructured environments would test whether the learned grounding survives real sensor noise and longer task horizons.
Load-bearing premise
End-to-end training of interleaved visual, state, and text encodings with a pre-trained language model will create robust grounding between words and real-world percepts that generalizes across tasks, modalities, and robot embodiments without extra engineering.
What would settle it
A new robot embodiment or sensor type where the single trained model performs no better than separately engineered models, or shows no benefit from the joint language-vision-robotics training.
Original abstract
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PaLM-E, an embodied multimodal language model that directly incorporates real-world continuous sensor modalities (visual, state estimation) into a pre-trained PaLM LLM via interleaved input encodings. These encodings are trained end-to-end alongside the LLM on multiple tasks including sequential robotic manipulation planning, visual question answering, and captioning. The central claims are that a single model can address diverse embodied reasoning tasks across observation modalities and robot embodiments, exhibits positive transfer from joint training on internet-scale language/vision/visual-language data, and that the 562B-parameter variant achieves state-of-the-art on OK-VQA while retaining generalist language capabilities.
Significance. If the empirical results hold under rigorous scrutiny, this would be a significant contribution to embodied AI and multimodal learning. It provides evidence that scaling and joint training across internet-scale and embodied domains can produce generalist models capable of grounded reasoning without task-specific engineering, potentially influencing future work on bridging LLMs with robotics and real-world perception.
Major comments (3)
- [Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.
- [Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.
- [Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.
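The first major comment asks for the number of runs and error bars. As a minimal sketch of what that reporting could involve, the hypothetical helper `summarize_runs` below computes the mean, sample standard deviation, and standard error of per-run success rates. The input values are invented for illustration and are not results from the paper.

```python
import math
import statistics


def summarize_runs(success_rates):
    """Return (mean, sample std, standard error) over independent runs."""
    n = len(success_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample standard deviation (n - 1)
    sem = std / math.sqrt(n)               # standard error of the mean
    return mean, std, sem


# Invented per-run success rates, for illustration only.
mean, std, sem = summarize_runs([0.8, 0.9, 1.0])
print(f"{mean:.3f} +/- {sem:.3f} (std {std:.3f}, n=3)")
```

Reporting the standard error alongside the run count lets a reader judge whether a gap between two methods exceeds run-to-run variation.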
Minor comments (3)
- [Abstract] The abstract and introduction use the term 'positive transfer' without a precise definition or quantitative metric (e.g., improvement over single-task training) in the summary paragraph.
- [Figure 1] Figure 1 (model diagram) would benefit from explicit callouts showing how continuous state values are converted to tokens and interleaved in the input sequence.
- [Section 3] Notation for the multimodal sentence construction (e.g., how visual patches and state vectors are denoted) is introduced without a consolidated table of symbols.
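The first minor comment suggests quantifying positive transfer as improvement over single-task training. One simple way to express that, sketched here under the assumption that each task has a single scalar success metric, is a per-task difference between the jointly trained and single-task models. The function `transfer_gains`, the task names, and all numbers are hypothetical.

```python
def transfer_gains(single_task, joint):
    """Per-task gain of joint training over single-task training.

    Both arguments map task name -> success metric. A positive value means
    the jointly trained model did better on that task.
    """
    mismatched = set(single_task) ^ set(joint)
    if mismatched:
        raise ValueError(f"tasks missing from one side: {sorted(mismatched)}")
    return {task: joint[task] - single_task[task] for task in single_task}


# Invented numbers, for illustration only.
single = {"task_a": 0.50, "task_b": 0.75}
joint = {"task_a": 0.75, "task_b": 0.50}
gains = transfer_gains(single, joint)
print(gains)  # task_a shows positive transfer, task_b negative
```

A definition of this kind would also make negative transfer visible, which an aggregate success rate can hide.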
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested revisions.
- Referee: [Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.
  Authors: We agree that greater transparency in the experimental protocol is needed to support the claims of outperformance and positive transfer. In the revised manuscript, we will expand Section 4 to include: (i) explicit descriptions of all data splits for robotic manipulation and embodied tasks, (ii) confirmation that baselines share the same PaLM backbone and implementation details, (iii) the number of independent runs performed, and (iv) error bars or standard deviations on all reported success rates. These additions will allow readers to better assess the reliability of the results.
  Revision: yes
- Referee: [Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.
  Authors: We acknowledge that the current description of state encoding in Section 3.2 lacks sufficient technical detail. We will revise this section to explicitly specify the discretization scheme applied to continuous state estimates (including binning method and resulting vocabulary size), the embedding dimensionality, and the normalization steps performed prior to interleaving with visual and text tokens. These clarifications will more rigorously support the grounding mechanism across embodiments.
  Revision: yes
- Referee: [Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.
  Authors: We partially concur. Table 2 already compares against the primary multimodal models available at the time of submission. To strengthen the presentation, we will add an ablation that directly compares PaLM-E variants trained with and without the embodied robotics data, thereby isolating the contribution of joint training beyond scale and visual-language pretraining. We will also incorporate any additional recent baselines that have appeared since submission, while noting that exhaustive coverage of every concurrent work is inherently limited by publication timelines.
  Revision: partial
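The response to the Section 3.2 comment promises a discretization scheme with a binning method, a vocabulary size, and normalization steps. As a rough illustration of what such a specification would pin down, the sketch below normalizes each state dimension to [0, 1] and assigns it to one of `num_bins` uniform bins. This is generic uniform binning chosen for illustration; the function `discretize_state`, its bounds, and the bin count are assumptions and are not taken from the paper or the rebuttal.

```python
import numpy as np


def discretize_state(state, low, high, num_bins):
    """Map continuous values to integer bin ids in [0, num_bins - 1].

    Each value is min-max normalized with the given bounds, clipped to
    [0, 1], and then assigned to one of num_bins equal-width bins.
    """
    state = np.asarray(state, dtype=float)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    normalized = np.clip((state - low) / (high - low), 0.0, 1.0)
    # A value exactly at the upper bound would land in bin num_bins,
    # so it is folded into the last valid bin.
    return np.minimum((normalized * num_bins).astype(int), num_bins - 1)


# Hypothetical 3-d state with per-dimension bounds of [0, 1].
bins = discretize_state([0.0, 0.5, 1.0], low=0.0, high=1.0, num_bins=4)
print(bins.tolist())  # [0, 2, 3]
```

With a scheme like this, the vocabulary size added per state dimension is `num_bins`, which is the kind of detail the referee asks the authors to state explicitly.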
Circularity Check
No significant circularity detected
The paper presents an empirical architecture and training procedure for PaLM-E, with all central claims (task performance, cross-modal transfer, embodiment generalization) resting on reported experimental results from new end-to-end training and evaluation rather than any closed-form derivation or self-referential definition. The pre-trained PaLM component is invoked as an external starting point whose parameters are not redefined or fitted inside the present work; no equation, prediction, or uniqueness claim reduces by construction to quantities already present in the inputs or prior self-citations. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: End-to-end training on interleaved multimodal inputs will establish effective grounding between language and percepts.
Forward citations
Cited by 60 Pith papers
-
From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards
EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.
-
Adapting Generalist Robot Policies with Semantic Reinforcement Learning
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
-
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
-
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
-
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination
Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.
-
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-sh...
-
Colosseum V2: Benchmarking Generalization for Vision Language Action Models
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
-
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from...
-
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
-
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
-
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
-
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
-
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
-
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
The paper introduces the MICL scenario for MLLMs with modality and task shifts and proposes MoInCL using pseudo-target generation and instruction-based distillation, reporting gains over continual learning baselines o...
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
RT-H: Action Hierarchies Using Language
RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability...
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
Path Planning in Physically Viable World Models
A physically viable world model augments 3D Gaussian splats with physics simulation to assess robot route feasibility under simulated terrain changes like flooding, revealing failures not visible in static maps.
-
ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models
ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories on trigger while preserving surface coherence.
-
Automating the Design of Embodied Agent Architectures
Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.
-
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
-
FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models
FADE attenuates FFN outputs in LVLMs based on layer-wise information flow analysis to mitigate hallucinations, shown effective on POPE, CHAIR, and MME benchmarks.
-
Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
A plug-in Q-Synth Module plus Attention-to-Point Decoder converts text-conditioned attention in frozen MLLMs into point heatmaps, improving part-level grounding accuracy on multiple datasets.
-
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and ...
-
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation
SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.
-
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation
SSI-Policy learns a robot-agnostic RGB-only scene interface from video to improve vision-language manipulation policies by 15% on LIBERO with only 10 demos per task.
-
Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs
Introduces temporally grounded EgoSGs to convert long egocentric videos into compact symbolic text for MLLM-based VQA, claiming SOTA results on HD-EPIC without subsampling.
-
Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs
Egocentric Scene Graphs convert long videos into short structured text so MLLMs can answer questions about entire sequences, achieving SOTA on HD-EPIC VQA.
-
Steering Vision-Language Models with Joint Sparse Autoencoders
JSAE jointly factorizes pooled vision and language activations in VLMs into aligned interpretable features, revealing layer-dependent asymmetry in additive steering versus suppression on three models.
-
Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.
-
SCOPE: Evolving Symbolic World for Planning in Open-Ended Environments
SCOPE is a self-adaptive symbolic planning framework that refines plans and evolves symbolic world models via simulator feedback and distilled knowledge to improve long-horizon planning in open-ended embodied environments.
-
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation
FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.
-
Robot Critics that Sweat the Small Stuff
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
-
BIT-Nav: Brain-Inspired Trajectory Memory for Embodied Navigation
BIT-Nav augments VLMs with a Bi-GRU trajectory embedding projected as one memory token to supply structured motion history at constant token cost.
-
Vesta: A Generalist Embodied Reasoning Model
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
-
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.
-
Guava: An Effective and Universal Harness for Embodied Manipulation
Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified d...
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Ahn, M., Brohan, A., Brown, N., Chebotar, Y ., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. Do as i can, not as i say: Ground- ing language in robotic affordances. arXiv preprint arXiv:2204.01691,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198,
work page internal anchor Pith review arXiv
-
[3]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817,
work page internal anchor Pith review arXiv
-
[5]
D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,
work page 1901
-
[6]
All you may need for vqa are image captions.arXiv preprint arXiv:2205.01883,
URL https://arxiv.org/ abs/2205.01883. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a. Chen, T., Saxena, S., Li, L., Fleet, D. J., and Hinton, G. Pix2seq: A language modeling framework...
-
[7]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794,
work page internal anchor Pith review arXiv
-
[8]
PaLM: Scaling Language Modeling with Pathways
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
work page internal anchor Pith review arXiv
-
[9]
International Conference on Machine Learning (ICML) , year=
PaLM-E: An Embodied Multimodal Language Model Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442,
-
[10]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for lan- guage understanding. arXiv preprint arXiv:1810.04805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[12]
Improving alignment of dialogue agents via targeted human judgements
Glaese, A., McAleese, N., Trebacz, M., Aslanides, J., Firoiu, V ., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375,
-
[13]
Instruction-driven history-aware policies for robotic manipulations
Guhur, P.-L., Chen, S., Garcia, R., Tapaswi, M., Laptev, I., and Schmid, C. Instruction-driven history-aware policies for robotic manipulations. arXiv preprint arXiv:2209.04899,
-
[14]
Language models are general-purpose interfaces. ArXiv, abs/2206.06336, 2022
Hao, Y., Song, H., Dong, L., Huang, S., Chi, Z., Wang, W., Ma, S., and Wei, F. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336,
-
[15]
Huang, C., Mees, O., Zeng, A., and Burgard, W. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022a.
Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022b.
Huang, W., Xia, F., Xiao, T., Chan, H...
-
[16]
Vima: General robot manipulation with multimodal prompts, 2023
Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., and Fan, L. Vima: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094,
-
[17]
Large Language Models are Zero-Shot Reasoners
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916,
-
[18]
The Power of Scale for Parameter-Efficient Prompt Tuning
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691,
-
[19]
Solving Quantitative Reasoning Problems with Language Models
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858,
-
[20]
VisualBERT: A Simple and Performant Baseline for Vision and Language
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557,
-
[21]
Trocr: Transformer-based optical character recognition with pre-trained models
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., and Wei, F. Trocr: Transformer-based optical character recognition with pre-trained models. arXiv preprint arXiv:2109.10282,
-
[22]
Pre-Trained Language Models for Interactive Decision-Making
Li, S., Puig, X., Du, Y., Wang, C., Akyurek, E., Torralba, A., Andreas, J., and Mordatch, I. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771,
-
[23]
Code as Policies: Language Model Programs for Embodied Control
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753,
-
[24]
Pretrained Transformers as universal computation engines
Lu, K., Grover, A., Abbeel, P., and Mordatch, I. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 1,
-
[25]
Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648,
-
[26]
Interactive language: Talking to robots in real time
Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407,
-
[27]
Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling
Nottingham, K., Ammanabrolu, P., Suhr, A., Choi, Y., Hajishirzi, H., Singh, S., and Fox, R. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050,
-
[29]
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175,
-
[30]
Tokenlearner: What can 8 learned tokens do for images and videos?
Ryoo, M. S., Piergiovanni, A., Arnab, A., Dehghani, M., and Angelova, A. Tokenlearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297,
-
[31]
Sajjadi, M. S. M., Duckworth, D., Mahendran, A., van Steenkiste, S., Pavetić, F., Lučić, M., Guibas, L. J., Greff, K., and Kipf, T. Object Scene Representation Transformer. NeurIPS, 2022a. URL https://osrt-paper.github.io/.
Sajjadi, M. S. M., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lučić, M., Duckworth, D., Dosovitsk...
-
[32]
Skill Induction and Planning with Latent Language
Sharma, P., Torralba, A., and Andreas, J. Skill induction and planning with latent language. arXiv preprint arXiv:2110.01517,
-
[33]
Shridhar, M., Manuelli, L., and Fox, D. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906. PMLR, 2022a.
Shridhar, M., Manuelli, L., and Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. arXiv preprint arXiv:2209.05451, 2022b.
Silva, A., Moorman, N., Silva, W., Zaidi, Z., Gopal...
-
[34]
ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. ProgPrompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302,
-
[35]
LaMDA: Language Models for Dialog Applications
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239,
-
[36]
Wang, Z., Cai, S., Liu, A., Ma, X., and Liang, Y. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560,
-
[37]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903,
-
[38]
Robotic skill acquisition via instruction augmentation with vision-language models
Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S., and Tompson, J. Robotic skill acquisition via instruction augmentation with vision-language models. arXiv preprint arXiv:2211.11736,
-
[39]
Zellers, R., Holtzman, A., Peters, M., Mottaghi, R., Kembhavi, A., Farhadi, A., and Choi, Y. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. arXiv preprint arXiv:2106.00188, 2021a.
Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. Merlot: Multimodal neural script knowledge models. Ad...
-
[40]
Zhang, Y. and Chai, J. Hierarchical task learning from language instructions with unified transformers and self-monitoring. arXiv preprint arXiv:2106.03427,
-
[41]
Dataset | Sampling frequency | Ratio
... | 1 | 0.5
Wikipedia text | 1 | 0.5
(robot) Mobile Manipulator, real | 6 | 3.1
(robot) Language Table (Lynch et al., 2022), sim and real | 8 | 4.2
(robot) TAMP, sim | 3 | 1.6
Table 6: Dataset sampling frequency and ratio for the “full mixture” referred to in experiments.
Figure 8: Two TAMP environment test examples. Left with 6 objects (training data contains 3-5 objects), ri...
-
[42]
utilizes oracle, one-step affordance functions.
B.2. Interactive Language Table
We use the Language-Table real-world tabletop setup and simulated environment from Interactive Language (Lynch et al., 2022).
Data collection. For each task, given the long horizon instruction, we prompt a labeler to enter a short horizon command every 4 seconds. We pass the s...
-
[43]
0.60 0.67 0.63
PaLM-E-12B trained on; from scratch; LLM+ViT pretrain; LLM frozen
Single robot n/a 0.67 0.35 0.46
Single robot 0.90 0.69 0.78
Full mixture 0.95 0.80 0.87
Full mixture 0.92 0.88 0.91
Table 10: Mobile manipulation environment: affordance prediction, showing individual precision and recall scores.
E. Image Attribution
The image of the New York Kn...