HoloAssist is a large-scale egocentric human interaction dataset,
where two people collaboratively complete physical manipulation tasks. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.
Contact-aware Skeletal Action Recognition (CaSAR) uses novel representations
of hand-object interaction that encompass spatial information:
1) contact points where the hand joints meet the objects,
2) distant points where the hand joints are far from the object
and largely uninvolved in the current action.
Our framework learns, for each frame of the action sequence, how the hands touch
or stay away from the objects, and uses this information to predict the action class.
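The split into contact and distant points can be illustrated with a minimal sketch: label each hand joint by its nearest distance to the object's point cloud. The function name, array shapes, and the 1 cm threshold below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def split_contact_distant(hand_joints, object_points, contact_thresh=0.01):
    """Label each hand joint as a contact point or a distant point by its
    nearest distance to the object point cloud (threshold is an assumption).

    hand_joints:   (J, 3) array of 3D joint positions for one frame
    object_points: (P, 3) array of object surface points
    Returns a boolean mask of shape (J,): True = contact point.
    """
    # Pairwise distances between every joint and every object point
    dists = np.linalg.norm(hand_joints[:, None, :] - object_points[None, :, :],
                           axis=-1)
    nearest = dists.min(axis=1)       # distance from each joint to the object
    return nearest <= contact_thresh  # contact if within the threshold

# Toy example: one joint touching the object surface, one far away
joints = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
obj = np.array([[0.0, 0.0, 0.005], [0.0, 0.1, 0.0]])
mask = split_contact_distant(joints, obj)
```

Applying this mask per frame yields the two joint subsets whose trajectories the action classifier can treat differently.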
EgoBody is a large-scale dataset of accurate 3D human body shape,
pose and motion of people interacting in 3D scenes, with multi-modal streams from
third-person and egocentric views, captured by Azure Kinects and a HoloLens2.
Given two interacting subjects, we leverage a lightweight multi-camera rig to
reconstruct their 3D shape and pose over time.
We propose a skeletal self-supervised learning approach that uses alignment as a pretext task.
Our approach to alignment relies on a context-aware attention model that incorporates spatial
and temporal context within and across sequences and a contrastive learning formulation that
relies on 4D skeletal augmentations. Pose data provides a valuable cue for alignment and
downstream tasks, such as phase classification and phase progression, as it is robust
to different camera angles and changes in the background, while being efficient for real-time applications.
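The alignment pretext task can be sketched as an InfoNCE-style contrastive loss over per-frame embeddings of two sequences, where temporally corresponding frames are positives. This is a generic illustration under that assumption, not the paper's exact context-aware attention formulation.

```python
import numpy as np

def alignment_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style sketch of alignment as a pretext task.

    emb_a, emb_b: (T, D) per-frame embeddings of two aligned sequences;
    frame t in emb_a is the positive for frame t in emb_b.
    """
    # L2-normalize so dot products are cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (T, T) similarity matrix
    # Cross-entropy with the diagonal as the matching (positive) pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_matched = alignment_contrastive_loss(x, x)  # identical sequences
loss_random = alignment_contrastive_loss(x, rng.normal(size=(8, 16)))
```

Minimizing this loss pulls corresponding frames together across views while pushing non-corresponding frames apart, which is what makes the learned embeddings useful for phase classification and phase progression.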
In this paper, we propose a method to collect a dataset of two hands manipulating objects for
first person interaction recognition. We provide a rich set of annotations including action labels,
object classes, 3D left & right hand poses, 6D object poses, camera poses and scene point clouds.
We further propose the first method to jointly recognize the 3D poses of two hands manipulating
objects and a novel topology-aware graph convolutional network for recognizing hand-object interactions.
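At its core, a graph convolution over a hand-object graph averages features over each node's neighbors and projects them; the sketch below shows one plain message-passing step as an illustration, not the paper's topology-aware architecture. The chain graph and feature dimensions are invented for the example.

```python
import numpy as np

def graph_conv(X, A, W):
    """One plain graph-convolution step: X' = D^-1 (A + I) X W.

    X: (N, F) node features, A: (N, N) adjacency, W: (F, F_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees
    return (A_hat / deg) @ X @ W            # average neighbors, project

# Tiny 3-node chain graph (e.g. wrist - knuckle - fingertip)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [0.0]])  # feature only at the first node
W = np.eye(1)
out = graph_conv(X, A, W)
```

Stacking such layers propagates information along the skeleton and across hand-object edges, which is the mechanism an interaction-recognition head builds on.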
We propose a sensor-equipped food container, the Smart Refrigerator, which recognizes foods and
monitors their status. We demonstrate its performance in food detection and show that
automatic monitoring of food intake can provide intuitive feedback to users.