Researchers on Google’s Robotics team have open-sourced Code as Policies (CaP), a robot-control method that uses a large language model (LLM) to generate robot-control code from natural-language instructions.
CaP employs a hierarchical prompting technique for code generation that outperforms previous methods on the HumanEval code-generation benchmark.
Technique and Experiments
Instead of generating a sequence of high-level steps or policies for the robot to invoke, CaP directly generates the Python code for those policies.
The Google team developed a set of prompting techniques that improved code generation, including a new hierarchical prompting method.
This method achieved a new state-of-the-art score of 39.8 percent pass@1 on the HumanEval benchmark.
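HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled programs passes a problem's unit tests. The article does not say how the 39.8 percent figure was computed, so the following is only a minimal sketch of the commonly used estimator. The sample counts in it are made up for illustration and are not real benchmark data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled programs, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a passing one.
        return 1.0
    # 1 minus the chance that a random size-k subset contains only failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems with 10 samples each (illustrative only).
example_results = [(10, 3), (10, 0), (10, 10)]
score = benchmark_score(example_results, k=1)
print(f"pass@1 = {score:.3f}")
```

With k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.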
Remarks from the Google Team
According to the Google team:
Code as Policies is a step towards robots that can modify their behavior and expand their capabilities according to the situation. This can be enabling, but the flexibility also carries risk: synthesized code (unless manually checked before it runs) may lead to unintended behavior on physical hardware.
We can mitigate these risks with built-in safety checks that limit the control primitives the system can access. However, more work is needed to ensure that new combinations of known primitives are equally safe.
We encourage broad discussions on how to minimize these dangers while maximizing the benefits for more general-purpose robotics.
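To make the idea of limiting what generated code may do more concrete, here is a minimal sketch of a static check that rejects a generated program if it calls anything outside an allowlist. This is not the team's implementation: the primitive names (`get_obj_pos`, `move_to`, `pick`, `place`) and the allowlist itself are hypothetical, and a check like this is not a sandbox. It only inspects call names and cannot tell whether an allowed combination of primitives is physically safe.

```python
import ast

# Hypothetical allowlist of perception and control primitives plus a few builtins.
ALLOWED_CALLS = {"get_obj_pos", "move_to", "pick", "place", "range", "len"}


def disallowed_calls(source: str, allowed: set[str] = ALLOWED_CALLS) -> set[str]:
    """Return the calls (and imports) in `source` that the allowlist does not cover."""
    tree = ast.parse(source)
    # Functions the program defines for itself are treated as callable.
    local_defs = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    bad = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bad.add("<import>")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                if node.func.id not in allowed and node.func.id not in local_defs:
                    bad.add(node.func.id)
            else:
                # Attribute or computed calls such as os.system(...) are rejected.
                bad.add(ast.unparse(node.func))
    return bad


safe_code = "pos = get_obj_pos('block')\nmove_to(pos)\npick('block')\n"
unsafe_code = "import os\nos.system('shutdown now')\nmove_to((0, 0))\n"

print(disallowed_calls(safe_code))
print(disallowed_calls(unsafe_code))
```

A program would be executed only when `disallowed_calls` returns an empty set.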
LLMs
LLMs have been found to possess general knowledge about many topics and can solve a wide variety of NLP problems. However, they can also produce answers that, though logical, may not be useful for controlling robots.
For example, if someone asks for help cleaning up after they’ve spilled their drink, an LLM might reply “You could try vacuuming”.
A few months ago, we wrote an article about Google’s SayCan system, which uses an LLM to produce a high-level action plan for a robot. To improve the quality of the plan, SayCan introduces a value function that indicates how likely the plan is to succeed given the current state of the world.
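The way a value function can steer an LLM away from plausible but unhelpful answers can be sketched in a few lines. The sketch assumes each candidate skill has two scores: how relevant the LLM considers it to the request, and how likely the value function says it is to succeed right now. The skill names and all numbers are made up for illustration.

```python
def choose_skill(llm_scores: dict[str, float], value_scores: dict[str, float]) -> str:
    """Pick the skill with the highest product of LLM relevance and success estimate."""
    combined = {
        skill: llm_scores[skill] * value_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Made-up scores for the request "I spilled my drink, can you help?"
llm_scores = {"use a vacuum cleaner": 0.5, "find a sponge": 0.3, "go to the table": 0.2}
# Made-up success estimates: no vacuum cleaner is available, but a sponge is in view.
value_scores = {"use a vacuum cleaner": 0.05, "find a sponge": 0.8, "go to the table": 0.9}

best = choose_skill(llm_scores, value_scores)
print(best)
```

Here the LLM alone would prefer the vacuum cleaner, while the combined score selects the sponge because the robot can actually carry that step out.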
Generation of Language Model Programs
The key component of CaP is the generation of language model programs (LMPs), which map natural-language instructions from a human to code that executes on a robot, takes perceptual input from its sensors, and invokes control APIs.
LMPs are generated by an LLM from few-shot prompts. They may include high-level control structures (such as loops and conditional statements) and hierarchically generated functions.
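The following sketch shows what this might look like in practice: a few-shot prompt is assembled from instruction and code pairs, an LLM completes it, and the resulting code runs against perception and control functions. Everything here is an assumption for illustration. `fake_llm` is a stand-in that returns canned text instead of calling a model, and the API names (`get_obj_names`, `get_obj_pos`, `put_first_on_second`) and the simulated scene are hypothetical.

```python
# Hypothetical few-shot examples: each instruction is paired with the code that solves it.
FEW_SHOT_EXAMPLES = [
    (
        "put the red block on the bowl",
        "put_first_on_second('red block', get_obj_pos('bowl'))",
    ),
]


def build_prompt(instruction: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Format examples as '# instruction' comments followed by code, then the new request."""
    parts = [f"# {text}\n{code}" for text, code in examples]
    parts.append(f"# {instruction}\n")
    return "\n\n".join(parts)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned completion with a loop and a conditional."""
    return (
        "for name in get_obj_names():\n"
        "    if 'block' in name:\n"
        "        put_first_on_second(name, get_obj_pos('bowl'))\n"
    )


def run_lmp(instruction: str, api: dict) -> str:
    """Generate code for the instruction and execute it with only `api` in scope."""
    code = fake_llm(build_prompt(instruction))
    exec(code, dict(api))  # a real system should vet generated code before running it
    return code


# Simulated scene and robot API (hypothetical).
positions = {"red block": (0.2, 0.1), "blue block": (0.4, 0.3), "bowl": (0.0, 0.0)}
moves = []
api = {
    "get_obj_names": lambda: list(positions),
    "get_obj_pos": lambda name: positions[name],
    "put_first_on_second": lambda obj, target: moves.append((obj, target)),
}

run_lmp("put all the blocks in the bowl", api)
print(moves)
```

The generated program, rather than a fixed list of steps, decides which objects to move by querying perception at run time.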
In the hierarchical case, the LLM first generates high-level code that calls functions that are not yet defined. CaP parses that code into an abstract syntax tree (AST), finds the undefined functions, and then prompts an LLM to generate the body of each one.
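One way to picture hierarchically generated functions together with the AST step: the high-level program may call helpers that do not exist yet, and parsing the program lets the system find those names and request a definition for each. The sketch below uses Python's standard `ast` module for that. `fake_function_llm` is a stand-in that returns canned definitions, and the helper and API names are hypothetical.

```python
import ast
import builtins


def undefined_functions(source: str, known: set[str]) -> list[str]:
    """Return names called in `source` that are not builtins, not in `known`, and not defined there."""
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name in defined or name in known or hasattr(builtins, name):
                continue
            if name not in missing:
                missing.append(name)
    return missing


def fake_function_llm(name: str) -> str:
    """Stand-in for an LLM prompted to write one function; returns canned code."""
    canned = {
        "get_closest_block": (
            "def get_closest_block(point):\n"
            "    return min(get_block_names(), key=lambda n: dist(get_obj_pos(n), point))\n"
        ),
        "dist": (
            "def dist(a, b):\n"
            "    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5\n"
        ),
    }
    return canned[name]


def generate_hierarchically(source: str, known: set[str]) -> str:
    """Generate every missing helper, including helpers of helpers, and prepend them."""
    known = set(known)
    helpers = []
    pending = undefined_functions(source, known)
    while pending:
        name = pending.pop(0)
        if name in known:
            continue
        body = fake_function_llm(name)
        helpers.append(body)
        known.add(name)
        pending.extend(undefined_functions(body, known))
    return "\n".join(helpers + [source])


high_level = "block = get_closest_block((0, 0))\npick(block)\n"
robot_api = {"pick", "get_block_names", "get_obj_pos"}
program = generate_hierarchically(high_level, robot_api)

# Run the completed program against a simulated API (hypothetical).
picked = []
scene = {"a": (1.0, 1.0), "b": (0.1, 0.0)}
namespace = {
    "pick": picked.append,
    "get_block_names": lambda: list(scene),
    "get_obj_pos": lambda name: scene[name],
}
exec(program, namespace)
print(picked)
```

Because `get_closest_block` itself calls the undefined `dist`, the loop makes a second request, which is what makes the generation hierarchical.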
Evaluation
Google evaluated CaP on multiple benchmarks and tasks. Besides HumanEval, the researchers developed a new code-generation benchmark, RoboCodeGen, specifically for robotics problems.
They also used CaP to control physical robots on several real-world tasks: mobile-robot navigation and manipulation in a kitchen environment, and shape drawing, pick-and-place, and tabletop manipulation with a robot arm.
CaP’s Issues with Building Complex Structures
In response to a question about CaP’s problems with building complex structures out of blocks, Google researcher Jacky Liang replied:
CaP works best when the new commands and the prompt are at similar abstraction levels. Building complex structures is akin to moving “a few levels up” the abstraction hierarchy, which greedy LLM decoding struggles with. It should be possible, but may need better ways to prompt.
Where Is the Code Available?
Code for reproducing the paper’s experiments is available on GitHub. An interactive demo of the code-generation technique is available on HuggingFace.
Do you agree? Let us know.