Xiaofeng Gao

Ph.D. Candidate in Statistics at UCLA

Boelter Hall 9401
580 Portola Plaza
University of California, Los Angeles
Los Angeles, CA, 90095
Email: xfgao at ucla dot edu
[Google Scholar]   [GitHub]


I'm an incoming Applied Scientist at Amazon. My research lies at the intersection of Robotics, Computer Vision, Machine Learning, and Cognitive Science, with a focus on Human-Machine Interaction and Explainable AI. I received my PhD in Statistics from the University of California, Los Angeles, under the supervision of Prof. Song-Chun Zhu.

During my PhD, I also worked closely with Prof. Hongjing Lu (UCLA), Prof. Gaurav Sukhatme (USC & Amazon Alexa AI), and Dr. Tianmin Shu (MIT). Before that, I obtained a bachelor's degree in Electronic Engineering from Fudan University.


07/2022: Our paper on in-situ bidirectional human-robot value alignment is published in Science Robotics.

07/2022: Our paper on Dialogue-Enabled Agents for Embodied Instruction Following is accepted by RA-L.

06/2022: I defended my PhD dissertation.

04/2022: Our paper on the effects of AR-based human-machine interfaces on drivers' situational awareness has been accepted by IV 2022.

02/2022: Our paper on robot capability calibration is covered by TechXplore.

01/2022: Our paper "Show Me What You Can Do: Capability Calibration on Reachable Workspace for Human-Robot Collaboration" has been accepted by IEEE Robotics and Automation Letters.


    (* indicates equal contribution)
  • Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions

    Tianmin Shu, Xiaofeng Gao, Michael S. Ryoo, Song-Chun Zhu
    IEEE International Conference on Robotics and Automation (ICRA), 2017

    PDF Website
In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable real-time motion inference for human-robot interaction (HRI). Based on Gibbs sampling, our weakly supervised grammar learning can automatically construct a hierarchical representation of an interaction with long-term joint sub-tasks of both agents and short-term atomic actions of individual agents. Based on a new RGB-D video dataset with rich instances of human interactions, our experiments with Baxter simulation, human evaluation, and a real Baxter test demonstrate that the model learned from limited training data successfully generates human-like behaviors in unseen scenarios and outperforms both baselines.
  • VRKitchen: an Interactive 3D Environment for Learning Real Life Cooking Tasks

    Xiaofeng Gao, Ran Gong, Tianmin Shu, Xu Xie, Shu Wang, Song-Chun Zhu
    ICML workshop on Reinforcement Learning for Real Life, 2019

    PDF Website
    One of the main challenges of applying reinforcement learning to real world applications is the lack of realistic and standardized environments for training and testing AI agents. In this work, we design and implement a virtual reality (VR) system, VRKitchen, with integrated functions which i) enable embodied agents to perform real life cooking tasks involving a wide range of object manipulations and state changes, and ii) allow human teachers to provide demonstrations for training agents. We also provide standardized evaluation benchmarks and data collection tools to facilitate a broad use in research on learning real life tasks. Video demos, code, and data will be available on the project website: sites.google.com/view/vr-kitchen.
  • Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks

    Xiaofeng Gao*, Ran Gong*, Yizhou Zhao, Shu Wang, Tianmin Shu, Song-Chun Zhu
    IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), 2020

    PDF Website Talk Slides
Human collaborators can effectively communicate with their partners to finish a common task by inferring each other's mental states (e.g., goals, beliefs, and desires). Such mind-aware communication minimizes the discrepancy among collaborators' mental states, and is crucial to success in human ad-hoc teaming. We believe that robots collaborating with human users should demonstrate similar pedagogic behavior. Thus, in this paper, we propose a novel explainable AI (XAI) framework for achieving human-like communication in human-robot collaborations, where the robot builds a hierarchical mind model of the human user and generates explanations of its own mind as a form of communication, based on its online Bayesian inference of the user's mental state. To evaluate our framework, we conduct a user study on a real-time human-robot cooking task. Experimental results show that the generated explanations of our approach significantly improve the collaboration performance and user perception of the robot.
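The online Bayesian inference of the user's mental state described above can be illustrated with a minimal sketch. Everything below is hypothetical (the goals, the likelihood table, and the observed actions are invented for illustration and are not the paper's actual model): the robot maintains a belief over candidate user goals and updates it after each observed action.

```python
# Minimal Bayesian goal-inference sketch (illustrative only; the goals,
# likelihood table, and observations below are hypothetical).

def update_posterior(prior, likelihood, observation):
    """One Bayes update: P(goal | obs) is proportional to P(obs | goal) * P(goal)."""
    unnormalized = {g: prior[g] * likelihood[g][observation] for g in prior}
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}

# The robot starts with a uniform belief over what the human is trying to cook.
belief = {"make_salad": 0.5, "make_soup": 0.5}

# P(observed action | goal): chopping vegetables is likelier under the salad goal.
likelihood = {
    "make_salad": {"chop_vegetables": 0.8, "boil_water": 0.2},
    "make_soup": {"chop_vegetables": 0.3, "boil_water": 0.7},
}

# After repeatedly seeing the human chop vegetables, the belief
# concentrates on the salad goal.
for action in ["chop_vegetables", "chop_vegetables"]:
    belief = update_posterior(belief, likelihood, action)
```

In the full framework the robot would also generate explanations from this belief; the sketch covers only the inference step.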
  • Predicting Task-Driven Attention via Integrating Bottom-Up Stimulus and Top-Down Guidance

    Zhixiong Nan, Jingjing Jiang, Xiaofeng Gao, Sanping Zhou, Weiliang Zuo, Ping Wei, Nanning Zheng
    IEEE Transactions on Image Processing (TIP), 2021

Task-free attention has gained intensive interest in the computer vision community, while relatively few works focus on task-driven attention (TDAttention). Thus, this paper addresses the problem of TDAttention prediction in daily scenarios where a human is performing a task. Motivated by the cognition mechanism that human attention allocation is jointly controlled by top-down guidance and bottom-up stimulus, this paper proposes a cognitively-explanatory deep neural network model to predict TDAttention. Given an image sequence, bottom-up features, such as human pose and motion, are first extracted. At the same time, the coarse-grained task information and fine-grained task information are embedded as a top-down feature. The bottom-up features are then fused with the top-down feature to guide the model to predict TDAttention. Two public datasets are re-annotated to make them qualified for TDAttention prediction, and our model is extensively compared with other models on the two datasets. In addition, ablation studies are conducted to evaluate the individual modules in our model. Experiment results demonstrate the effectiveness of our model.
  • Show Me What You Can Do: Capability Calibration on Reachable Workspace for Human-Robot Collaboration

    Xiaofeng Gao, Luyao Yuan, Tianmin Shu, Hongjing Lu, Song-Chun Zhu
    IEEE Robotics and Automation Letters (RA-L), 2022

    PDF Website Talk
Aligning humans' assessment of what a robot can do with its true capability is crucial for establishing a common ground between human and robot partners when they collaborate on a joint task. In this work, we propose an approach to calibrate humans' estimate of a robot's reachable workspace through a small number of demonstrations before collaboration. We develop a novel motion planning method, REMP (Reachability-Expressive Motion Planning), which jointly optimizes the physical cost and the expressiveness of robot motion to reveal the robot's motion capability to a human observer. Our experiments with human participants demonstrate that a short calibration using REMP can effectively bridge the gap between what a non-expert user thinks a robot can reach and the ground truth. We show that this calibration procedure not only results in better user perception, but also promotes more efficient human-robot collaborations in a subsequent joint task.
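The core trade-off in REMP (jointly optimizing physical cost against expressiveness) can be sketched as scoring candidate trajectories with a weighted objective. This is a toy illustration, not the actual REMP formulation: the candidate trajectories, their costs, and the expressiveness scores below are all made up.

```python
# Toy sketch of trading off physical cost against expressiveness when
# selecting a robot trajectory (illustrative; not the actual REMP method).

def select_trajectory(candidates, weight=0.5):
    """Pick the trajectory maximizing expressiveness minus weighted physical cost."""
    return max(candidates, key=lambda t: t["expressiveness"] - weight * t["cost"])

# Hypothetical candidates: cost stands in for joint effort; expressiveness
# stands in for how much of the reachable workspace the motion reveals.
candidates = [
    {"name": "shortest_path", "cost": 1.0, "expressiveness": 0.2},
    {"name": "sweeping_arc",  "cost": 2.0, "expressiveness": 1.5},
    {"name": "full_flourish", "cost": 5.0, "expressiveness": 1.8},
]

# A moderately expressive arc wins: it reveals more workspace than the
# shortest path without the effort of the most elaborate motion.
best = select_trajectory(candidates, weight=0.5)
```

The weight controls how much the planner is willing to deviate from the cheapest motion in order to be legible to the observer.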
  • Effects of Augmented-Reality-Based Assisting Interfaces on Drivers' Object-Wise Situational Awareness in Highly Autonomous Vehicles

    Xiaofeng Gao, Xingwei Wu, Samson Ho, Teruhisa Misu, Kumar Akash
    IEEE Intelligent Vehicles Symposium (IV), 2022

    PDF Talk Slides
    Although partially autonomous driving (AD) systems are already available in production vehicles, drivers are still required to maintain a sufficient level of situational awareness (SA) during driving. Previous studies have shown that providing information about the AD's capability using user interfaces can improve the driver's SA. However, displaying too much information increases the driver's workload and can distract or overwhelm the driver. Therefore, to design an efficient user interface (UI), it is necessary to understand its effect under different circumstances. In this paper, we focus on a UI based on augmented reality (AR), which can highlight potential hazards on the road. To understand the effect of highlighting on drivers' SA for objects with different types and locations under various traffic densities, we conducted an in-person experiment with 20 participants on a driving simulator. Our study results show that the effects of highlighting on drivers' SA varied by traffic densities, object locations and object types. We believe our study can provide guidance in selecting which object to highlight for the AR-based driver-assistance interface to optimize SA for drivers driving and monitoring partially autonomous vehicles.
  • DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following

    Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, Gaurav S. Sukhatme
    IEEE Robotics and Automation Letters (RA-L), 2022

    PDF Code&Data
Language-guided Embodied AI benchmarks requiring an agent to navigate an environment and manipulate objects typically allow one-way communication: the human user gives a natural language command to the agent, and the agent can only follow the command passively. We present DialFRED, a dialogue-enabled embodied instruction following benchmark based on the ALFRED benchmark. DialFRED allows an agent to actively ask questions to the human user; the additional information in the user's response is used by the agent to better complete its task. We release a human-annotated dataset with 53K task-relevant questions and answers and an oracle to answer questions. To solve DialFRED, we propose a questioner-performer framework wherein the questioner is pre-trained with the human-annotated data and fine-tuned with reinforcement learning. We make DialFRED publicly available and encourage researchers to propose and evaluate their solutions to building dialogue-enabled embodied agents.
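The questioner-performer idea, asking the user only when the instruction leaves the agent uncertain, can be sketched as a confidence-thresholded policy. Note this is purely illustrative: DialFRED's questioner is a learned model fine-tuned with reinforcement learning, not a fixed threshold, and the interpretations and confidences below are hypothetical.

```python
# Sketch of an ask-when-uncertain policy (illustrative; DialFRED's questioner
# is a learned model, not a fixed-threshold rule like this).

def should_ask(interpretation_confidences, threshold=0.75):
    """Ask a clarifying question when no interpretation is confident enough."""
    return max(interpretation_confidences.values()) < threshold

# Hypothetical parse of "put the cup away": two plausible target locations.
ambiguous = {"cabinet": 0.55, "sink": 0.45}
clear = {"cabinet": 0.90, "sink": 0.10}

assert should_ask(ambiguous)   # no reading is confident: ask the user
assert not should_ask(clear)   # one reading dominates: act without asking
```

The benefit measured in the paper comes from learning when asking actually improves task completion, rather than hand-tuning a threshold like this one.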
  • In-situ bidirectional human-robot value alignment

    Luyao Yuan*, Xiaofeng Gao*, Zilong Zheng*, Mark Edmonds, Ying Nian Wu, Federico Rossano, Hongjing Lu, Yixin Zhu, Song-Chun Zhu
    Science Robotics, 2022

    Paper Code&Data
    A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems.