Microsoft is proud to sponsor the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), a leading global forum for machine learning and AI.
The event gathers researchers, industry leaders, and practitioners to exchange ideas, address challenges, and advance innovations to shape the future of AI. Lidong Zhou, managing director of Microsoft Research Asia, will be one of this year’s keynote speakers.
More than 100 papers by Microsoft researchers and collaborators have been accepted at NeurIPS 2024, including five oral presentations and 19 spotlight sessions. While these research projects cover a broad range of topics, a shared theme ties them together: advancing the efficiency, scalability, and robustness of machine learning models while addressing real-world challenges like human-centric interaction and cultural considerations.
Visit us at Booth #445
NeurIPS oral presentations

Not All Tokens Are What You Need for Pretraining
Recipient of the “Best Paper Runner-Up Award”
Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Jian Jiao, Nan Duan, Weizhu Chen
Rho-1 is a new language model that uses selective language modeling. Unlike traditional language models that predict every next token, Rho-1 selectively trains on tokens aligned with the desired distribution. This involves scoring pretraining tokens using a reference model and then training the language model with a focused loss on tokens with higher scores.
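To make the idea concrete, here is a minimal sketch of selective language modeling in PyTorch, assuming logits of shape (batch, seq, vocab); the keep ratio and the excess-loss scoring rule below are illustrative defaults, not the paper’s exact recipe:

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Train only on tokens a reference model scores as informative.

    Tokens are ranked by excess loss (training loss minus reference loss),
    and the loss is averaged over the top `keep_ratio` fraction. Both the
    ratio and the scoring rule are illustrative assumptions.
    """
    # Per-token cross-entropy for the trained model and the frozen reference.
    ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    with torch.no_grad():
        ref_ce = F.cross_entropy(ref_logits.transpose(1, 2), labels,
                                 reduction="none")
        score = ce.detach() - ref_ce          # high score = worth training on
        k = max(1, int(keep_ratio * score.numel()))
        thresh = score.flatten().topk(k).values.min()
        mask = (score >= thresh).float()      # keep only high-scoring tokens
    return (ce * mask).sum() / mask.sum()
```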
Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity
Philip Amortila, Dylan J. Foster, Nan Jiang, Akshay Krishnamurthy, Zakaria Mhammedi
This research investigates reinforcement learning under general latent dynamics, demonstrating that traditional function approximation becomes intractable with rich observations unless latent pushforward coverability is present. The authors also developed efficient reductions to adapt latent Markov decision process (MDP) algorithms for complex observations, providing a foundation for a unified statistical and algorithmic theory for reinforcement learning under latent dynamics.
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Pranjal Chitale, et al.
CVQA is a culturally diverse, multilingual visual question-answering benchmark that involves native speakers and cultural experts in the data collection process. It includes culturally driven images and questions from 30 countries across four continents, covering 31 languages and 13 scripts, and provides a total of 10k questions. While it is a challenging benchmark for current state-of-the-art multimodal large language models (MLLMs), it is also a tool to assess cultural capabilities and biases in these models.
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo
VASA is a framework for generating lifelike talking faces with visual affective skills (VAS) from a static image and audio clip. The premier model, VASA-1, synchronizes lip movements with speech while capturing facial nuances and natural head motions, enabled by a holistic facial dynamics and head movement generation model and an expressive face latent space built from video data.
You Only Cache Once: Decoder-Decoder Architectures for Language Models
Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei
You only cache once (YOCO) is a decoder-decoder architecture for LLMs that reduces GPU memory usage by caching key-value pairs only once, while retaining global attention. A self-decoder encodes key-value caches that are reused by a cross-decoder that leverages cross-attention. This enables YOCO to speed up the prefill stage through a computation flow that allows early exit without altering the final output.
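The sketch below illustrates the decoder-decoder shape of this design, assuming PyTorch; layer counts, sizes, and module names are illustrative, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class YOCOSketch(nn.Module):
    """Decoder-decoder sketch: key-value pairs are cached once and
    reused by every cross-decoder layer. Dimensions are illustrative."""

    def __init__(self, d_model=256, n_heads=4, n_self=2, n_cross=2):
        super().__init__()
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_self))
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # produces the one cache
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_cross))

    def forward(self, x):
        # Self-decoder: encode the sequence (causal masking omitted here).
        h = x
        for layer in self.self_layers:
            h = layer(h)
        # Cache key-value pairs exactly once ("you only cache once").
        k, v = self.kv_proj(h).chunk(2, dim=-1)
        # Cross-decoder: every layer reuses the same cached K/V via
        # cross-attention, so GPU memory stays flat in depth.
        for attn in self.cross_attn:
            h, _ = attn(h, k, v, need_weights=False)
        return h
```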
NeurIPS spotlight sessions
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models
Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, Xing Xie, Steven Euijong Whang
To thoroughly analyze LLMs, the authors propose ERBench, which automatically converts any relational database into a benchmark based on the entity-relationship model.
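The core trick is that a functional dependency in the database pins down a single verifiable answer. Here is a minimal sketch of that conversion, with a hypothetical movie schema and question template that are not ERBench’s exact format:

```python
def record_to_qa(record, key, attr, template):
    """Turn one relational record into an automatically verifiable QA item.

    The functional dependency key -> attr guarantees a unique correct
    answer, so model output can be checked without human labels.
    Schema and template here are hypothetical examples.
    """
    question = template.format(**{key: record[key]})
    return {"question": question, "answer": record[attr]}

# Hypothetical movie table: title functionally determines director.
row = {"title": "Inception", "director": "Christopher Nolan"}
qa = record_to_qa(row, key="title", attr="director",
                  template="Who directed the movie {title}?")
# {'question': 'Who directed the movie Inception?',
#  'answer': 'Christopher Nolan'}
```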
A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning
Arthur Juliani, Jordan Ash
The authors conduct extensive experiments on plasticity loss in on-policy deep reinforcement learning and evaluate various mitigation methods.
Advancing Spiking Neural Networks for Sequential Modeling through Central Pattern Generators
Changze Lv, Dongqi Han, Yansen Wang, Xiaoqing Zheng, Xuanjing Huang, Dongsheng Li
CPG-PE is a novel positional encoding (PE) technique for spiking neural networks inspired by central pattern generators in the human brain.
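As a rough intuition for the idea, phase-shifted periodic spike trains can encode position by which channels fire when. The sketch below is an assumption-laden illustration in PyTorch, not the paper’s exact encoding; the thresholding rule and period are arbitrary choices:

```python
import torch

def cpg_positional_spikes(seq_len, n_channels, period=32.0):
    """Phase-shifted periodic spike trains as positional codes.

    Mimics central pattern generators: each channel fires rhythmically
    with its own phase offset, so a position is identified by the set of
    currently active channels. Illustrative only.
    """
    t = torch.arange(seq_len).float().unsqueeze(1)         # (T, 1)
    phase = torch.arange(n_channels).float() / n_channels  # (C,)
    wave = torch.sin(2 * torch.pi * (t / period + phase))  # (T, C)
    return (wave > 0).float()                              # binary spikes

spikes = cpg_positional_spikes(seq_len=16, n_channels=8)
```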
Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability
Fan Chen, Dylan J. Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, Yunbei Xu
The authors develop a unified framework for lower-bound methods in statistical estimation and interactive decision making, bringing these previously distinct methodologies under a single view.
BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning
Xiao Yang, Xu Yang, Weiqing Liu, Lewen Wang, Jiang Bian
To enhance efficiency, the authors reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the Karush–Kuhn–Tucker (KKT) matrix.
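For intuition, here is the textbook implicit-differentiation setup that this line of work builds on, written for the equality-constrained case (the paper’s formulation also covers inequalities and exploits the structure of this system to decouple the backward pass):

```latex
% Forward QP and its KKT optimality conditions:
\min_{z}\ \tfrac{1}{2} z^\top Q z + p^\top z
\quad \text{s.t.}\quad A z = b,
\qquad
\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} z^\star \\ \lambda^\star \end{bmatrix}
=
\begin{bmatrix} -p \\ b \end{bmatrix}.

% Differentiating the KKT system yields another linear problem in the
% same KKT matrix, which is what the backward pass must solve:
\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} \mathrm{d}z \\ \mathrm{d}\lambda \end{bmatrix}
=
\begin{bmatrix} -\,\partial \ell / \partial z^\star \\ 0 \end{bmatrix}.
```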
Compositional Generalization Across Distributional Shifts with Sparse Tree Operations
Paul Smolensky, Jianfeng Gao, Roland Fernandez
This work investigates a unified neurosymbolic system in which the network’s transformations can be interpreted simultaneously as symbolic and neural computation, extending an existing unified neurosymbolic architecture with sparse tree operations.
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Edoardo Debenedetti, Javier Rando, Daniel Paleka, Fineas Silaghi, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Ahmed Salem, Rui Wen, Giovanni Cherubin, Santiago Zanella-Béguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramer, Sahar Abdelnabi, Lea Schönherr
This report summarizes insights from a capture-the-flag competition at IEEE SaTML 2024, which highlighted the challenges in defending large language model systems against malicious message attacks.
Diffusion for World Modeling: Visual Details Matter in Atari
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret
This work presents DIAMOND (diffusion as a model of environment dreams), an open-source reinforcement learning agent trained in a diffusion world model.
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
Peter Alexander Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark
DISCOVERYWORLD is an open-source virtual environment for developing and benchmarking an agent’s ability to perform complete scientific discovery cycles, with 120 tasks spanning a range of topics.
Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
This research introduces an efficient approach to adversarial training by computing attacks in the LLM’s continuous embedding space rather than over discrete tokens.
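A minimal sketch of the continuous-attack step, assuming PyTorch and a `model` that maps input embeddings and labels to a scalar loss; the step size, radius, and interface are illustrative assumptions:

```python
import torch

def continuous_embedding_attack(model, embeds, labels,
                                eps=0.1, steps=5, lr=0.01):
    """Adversarial perturbation in the LLM's continuous embedding space.

    Instead of searching over discrete tokens, ascend the loss with
    respect to the input embeddings and clamp to an eps-ball.
    `model(embeds, labels) -> scalar loss` is an assumed interface.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(embeds + delta, labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()   # signed gradient ascent step
            delta.clamp_(-eps, eps)     # stay inside the eps-ball
    return (embeds + delta).detach()
```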
Generalized Linear Bandits with Limited Adaptivity
Ayush Sawarni, Nirjhar Das, Siddharth Barman, Gaurav Sinha
This paper studies the generalized linear contextual bandit problem under limited adaptivity and introduces two algorithms, B-GLinCB and RS-GLinCB, to address two prevalent settings.
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann
This research introduces Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions.
Identifying Equivalent Training Dynamics
William T. Redman, Juan M. Bello-Rivas, M. Fonoberova, Ryan Mohr, I. Kevrekidis, Igor Mezić
Using advances in Koopman operator theory, the authors developed a framework for identifying conjugate and nonconjugate training dynamics.
Implicit Curriculum in Procgen Made Explicit
Kaixin Wang, Xinchao Wang
This work investigates the learning process under multi-level training in Procgen, which exhibits a gradual shift from easy to hard contexts, suggesting an implicit curriculum in multi-level training.
Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning
Dylan J. Foster, Adam Block, Dipendra Misra
The authors show they can achieve horizon-independent sample complexity in offline imitation learning when the range of the cumulative payoffs and an appropriate notion of supervised learning complexity for the policy class are controlled.
MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
MInference is a sparse computation method designed to accelerate the pre-filling stage of long-sequence processing. It identifies three recurring patterns in long-context attention matrices that can be exploited for efficient sparse computation on GPUs.
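As one illustration, a “vertical-slash” style mask keeps a few global key columns plus a local band along the diagonal. The sketch below is a static, simplified version; MInference selects the columns and band dynamically per head, which this does not capture:

```python
import torch

def vertical_slash_mask(seq_len, n_vertical=4, slash_width=64):
    """Static sketch of a vertical-slash sparse attention mask.

    Every query attends to a few global key columns ("verticals") plus a
    causal diagonal band ("slash"). Column choice and band width here are
    fixed and illustrative.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, :n_vertical] = True                     # global vertical stripes
    idx = torch.arange(seq_len)
    band = idx.unsqueeze(1) - idx.unsqueeze(0)      # band[i, j] = i - j
    mask |= (band >= 0) & (band < slash_width)      # local diagonal band
    mask &= idx.unsqueeze(1) >= idx.unsqueeze(0)    # enforce causality
    return mask
```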
The Power of Resets in Online Reinforcement Learning
Zakaria Mhammedi, Dylan J. Foster, Alexander Rakhlin
This study explores the potential of simulators through reinforcement learning with local simulator access, an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training.
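A minimal sketch of this protocol, assuming a Gym-style environment that additionally exposes a `set_state` method; all interface names here are illustrative, not the paper’s notation:

```python
import random

def rollout_with_local_resets(env, policy, visited, reset_prob=0.5):
    """One episode under the local-simulator-access protocol.

    With probability `reset_prob`, restart from a previously observed
    state (assumes the simulator supports `set_state`); otherwise reset
    normally. The env/policy interfaces are assumptions.
    """
    if visited and random.random() < reset_prob:
        state = random.choice(visited)
        env.set_state(state)            # local reset to an observed state
    else:
        state = env.reset()
    done, trajectory = False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        visited.append(next_state)      # grow the pool of resettable states
        state = next_state
    return trajectory
```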
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
This research introduces VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks through a hierarchical process, allowing for identification of the specific levels at which they may fail.
Voila-A: Aligning Vision-Language Models with User’s Gaze Attention
Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma
The authors incorporate gaze information, which can be readily collected by AR and VR devices, and propose a novel approach to gaze alignment that enhances the interpretability and effectiveness of vision-language models in real-world applications.
Microsoft at ML4H 2024
Co-located with NeurIPS is the AHLI Machine Learning for Health (ML4H) Symposium, an event that unites machine learning researchers, clinicians, and healthcare data experts to advance AI applications in healthcare. Microsoft’s contribution of four papers to this symposium underscores its commitment to improving medical imaging and clinical workflows through AI, focusing on accuracy, efficiency, and interpretability.