With the rapid advancements in AI and robotics, the development of highly intelligent robots capable of seamlessly interacting with the physical environment is becoming increasingly achievable. As the next AI wave, embodied AI innovations promise to revolutionize various industries and significantly impact human life.
Although promising progress has been made, generalist robots and embodied AI are still in their infancy. The team is thrilled to envision this future and is committed to developing cutting-edge foundational robotics models to accelerate its arrival.
Microsoft Research Asia would like to invest in a collaborative effort to explore embodied AI and large action models. We believe that research on embodied AI will build a solid foundation for robot intelligence, open promising prospects for its development, and benefit human society. If you are an aspiring researcher with a zeal for exploring embodied AI and large action models, we invite you to apply to the Microsoft Research Asia StarTrack Scholars Program. Applications are now open for the 2025 program. For more details and to submit your registration, visit our official website: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
Building a foundation action model for general robots
Embodied AI is more than a simple fusion of robots with LLMs or VLMs. Beyond language intelligence and cognitive abilities, action intelligence is essential for executing plans and engaging with the physical world. This form of intelligence diverges significantly from language intelligence. For instance, it necessitates dense, dexterous actions and demands high levels of spatial and physical awareness. Our research is focused on creating a new generation of foundational action models that enhance spatial and physical proficiencies in perception, reasoning, and action.

Spatial intelligence as the key to action capabilities
The fact that robots operate in a 3D physical world poses unique requirements and challenges. The team believes that spatial intelligence is crucial for developing robust action capabilities. By leveraging advanced 3D computer vision techniques, we aim to provide robots with a deep understanding of their environments. Our work includes pioneering methods for 3D human-object interaction reconstruction, enabling robots to learn from humans and to navigate and manipulate objects with unprecedented generalization and precision.
Crafting optimal model architectures
Model architecture design is crucial for advancing the capabilities of embodied AI. The modality of action differs fundamentally from language and vision: actions are continuous and dense, and they require precision in both time and space. Moreover, robotic actions frequently require timely execution to interact effectively with dynamic environments. Effective model architectures must accommodate these demands by providing robust frameworks that integrate real-time sensory feedback with decision-making processes. This integration ensures that robots can adapt their actions swiftly to changes in their surroundings, maintaining high levels of performance in complex and unpredictable settings.

From spatial reasoning to physical mastery
As robots transition from static observers to dynamic participants in the physical world, achieving physical mastery becomes a pivotal goal. This transformation requires moving beyond mere spatial reasoning to encompass the nuanced skills necessary for interacting with complex environments. Our journey from spatial reasoning to physical mastery involves integrating human-like multimodal sensor fusion into our models. For example, tactile and force sensors are crucial for tasks requiring delicate manipulation and feedback. By embedding these capabilities, we aim to enable robots to perform complex tasks with human-like dexterity and adaptability.
Potential research topics for the StarTrack Scholars Program
The team invites scholars to explore a range of exciting research topics within the 2025 StarTrack Scholars Program, including but not limited to:
- Vision-Language-Action model architecture design
- 3D human-object-environment reconstruction and understanding
- Multimodal-sensory intelligence
- World models and neural robot simulators
- Dexterous hand manipulation and reinforcement learning
- Object manipulation benchmarks
The Microsoft Research Asia StarTrack Scholars Program advocates an open attitude, encouraging dialogue and joint experimentation with researchers from various disciplines to discover viable solutions. Visit our official website to learn more: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
Theme Team:
Baining Guo, Distinguished Scientist with Microsoft Research
Jiaolong Yang, Principal Research Manager, Microsoft Research Asia
Lily Sun, Director, Accelerator Microsoft Research Asia
Yaobo Liang, Senior Researcher, Microsoft Research Asia
Bei Liu, Senior Researcher, Microsoft Research Asia
Jianlong Fu, Senior Research Manager, Microsoft Research Asia
Yu Deng, Senior Researcher, Microsoft Research Asia
Fangyun Wei, Senior Research SDE, Microsoft Research Asia
Lin Luo, Research SDE 2, Microsoft Research Asia
Xi Chen, Research SDE 2, Microsoft Research Asia
If you have any questions, please email Ms. Yanxuan Wu, program manager of the Microsoft Research Asia StarTrack Scholars Program, at [email protected]