My research interests include Egocentric Vision, Action Recognition, Contextual AI, Hand-Object Interaction, Video Understanding, AR/VR, Multi-modal Learning, Vision-Language Models, and Self-supervised Learning.
Previously, I did my PhD under the supervision of Prof. Marc Pollefeys at ETH Zurich, and I earned my Master's degree from UCLA. I received my Bachelor's in Electrical Engineering from Yonsei University, Seoul, Korea.
If you are interested in semester projects (ETHZ), master's theses (ETHZ), 4YP projects (Oxford), or personal projects related to action recognition, egocentric vision, video understanding, and hand-object interaction that could lead to publications, feel free to email me. We can discuss potentially exciting projects.
We introduce EgoWorld, a novel two-stage framework that reconstructs the egocentric view from rich exocentric observations, including depth maps, 3D hand poses, and textual descriptions.
We introduce JEGAL, a Joint Embedding space for Gestures, Audio and Language. Our semantic gesture representations can be used to perform multiple downstream tasks such as cross-modal retrieval, spotting gestured words, and identifying who is speaking solely using gestures.
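To give a flavour of the downstream use, cross-modal retrieval in such a joint embedding space amounts to ranking candidates by cosine similarity. The sketch below is only illustrative: the function names, shapes, and random tensors standing in for encoder outputs are assumptions, not JEGAL's actual API.

```python
# Illustrative only: cross-modal retrieval in a shared embedding space.
# Assumes gesture/text encoders are already trained; none of these names
# come from JEGAL itself.
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return indices of the top-k gallery items closest to the query.

    query_emb:    (D,)   embedding of, e.g., a word or speech segment
    gallery_embs: (N, D) embeddings of, e.g., N gesture clips
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                      # cosine similarities, shape (N,)
    return sims.topk(k).indices       # best-matching gesture clips

# Toy usage with random tensors standing in for encoder outputs.
top_clips = retrieve(torch.randn(256), torch.randn(1000, 256), k=5)
```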
We introduce EgoPressure, a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact.
We propose a novel framework that overcomes the limitations of existing methods through sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. Coupled with our proposed dual augmentation technique, this enhances the network's ability to learn generalizable and discriminative representations.
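For orientation, the sketch below shows a generic soft-nearest-neighbour alignment between two embedded sequences; it is not the paper's implicit clustering objective, and the shapes and temperature value are assumptions.

```python
# Generic frame-to-frame soft alignment between two embedded sequences
# (soft nearest neighbours; not the paper's implicit clustering objective).
import torch
import torch.nn.functional as F

def soft_align(seq_a: torch.Tensor, seq_b: torch.Tensor, temperature: float = 0.1):
    """seq_a: (Ta, D) and seq_b: (Tb, D) frame embeddings.

    Returns a (Ta, Tb) matrix whose rows softly assign each frame of
    sequence A to the frames of sequence B.
    """
    a = F.normalize(seq_a, dim=-1)
    b = F.normalize(seq_b, dim=-1)
    sims = a @ b.t() / temperature     # (Ta, Tb) scaled cosine similarities
    return sims.softmax(dim=-1)        # row-wise soft alignment

# Example: align a 64-frame sequence to a 70-frame sequence (128-D features).
alignment = soft_align(torch.randn(64, 128), torch.randn(70, 128))
```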
HoloAssist is a large-scale egocentric human interaction dataset,
where two people collaboratively complete physical manipulation tasks. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.
Contact-aware Skeletal Action Recognition (CaSAR) uses novel representations of hand-object interaction that encompass spatial information:
1) contact points, where the hand joints meet the objects, and
2) distant points, where the hand joints are far away from the object and barely involved in the current action.
Our framework learns how the hands touch or stay away from the objects in each frame of the action sequence and uses this information to predict the action class.
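A minimal way to obtain such a contact/distant split is to threshold each joint's distance to the object's point cloud, as sketched below; the 1 cm threshold and tensor shapes are illustrative assumptions rather than CaSAR's exact values.

```python
# Illustrative contact/distant split: classify each hand joint by its
# distance to the object point cloud (threshold and shapes are assumed).
import torch

def split_contact_points(hand_joints: torch.Tensor,
                         object_points: torch.Tensor,
                         threshold: float = 0.01):
    """hand_joints: (J, 3) 3D joint positions, object_points: (P, 3), in metres.

    Returns boolean masks (contact, distant) over the J joints.
    """
    dists = torch.cdist(hand_joints, object_points)   # (J, P) pairwise distances
    min_dist = dists.min(dim=1).values                # closest object point per joint
    contact = min_dist < threshold
    return contact, ~contact

# Example: 21 joints of one hand against a 2048-point object cloud.
contact_mask, distant_mask = split_contact_points(torch.rand(21, 3),
                                                  torch.rand(2048, 3))
```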
EgoBody is a large-scale dataset of accurate 3D human body shape, pose, and motion of humans interacting in 3D scenes, with multi-modal streams from third-person and egocentric views, captured by Azure Kinect cameras and a HoloLens 2. Given two interacting subjects, we leverage a lightweight multi-camera rig to reconstruct their 3D shape and pose over time.
We propose a skeletal self-supervised learning approach that uses alignment as a pretext task. Our approach to alignment relies on a context-aware attention model that incorporates spatial and temporal context within and across sequences, and on a contrastive learning formulation based on 4D skeletal augmentations. Pose data provides a valuable cue for alignment and downstream tasks, such as phase classification and phase progression, as it is robust to different camera angles and background changes, while remaining efficient for real-time processing.
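As an example of what a 4D skeletal augmentation can look like, the snippet below applies a random rotation about the vertical axis and a random temporal crop to a pose sequence. It illustrates the general idea only, not the paper's exact augmentation recipe; all ranges and shapes are assumptions.

```python
# Example spatio-temporal (4D) skeletal augmentation: random rotation about
# the vertical axis plus a random temporal crop (generic illustration only).
import math
import torch

def augment_skeleton(seq: torch.Tensor, crop_ratio: float = 0.8) -> torch.Tensor:
    """seq: (T, J, 3) sequence of J 3D joints over T frames."""
    # Random rotation around the y (vertical) axis.
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, 0.0, s],
                        [0.0, 1.0, 0.0],
                        [-s, 0.0, c]])
    seq = seq @ rot.t()

    # Random contiguous temporal crop.
    t = seq.shape[0]
    crop_len = max(1, int(t * crop_ratio))
    start = torch.randint(0, t - crop_len + 1, (1,)).item()
    return seq[start:start + crop_len]

# Two augmented views of the same sequence form a positive pair for the
# contrastive objective.
pose_seq = torch.randn(100, 25, 3)
view_a, view_b = augment_skeleton(pose_seq), augment_skeleton(pose_seq)
```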
In this paper, we propose a method to collect a dataset of two hands manipulating objects for first-person interaction recognition. We provide a rich set of annotations, including action labels, object classes, 3D left and right hand poses, 6D object poses, camera poses, and scene point clouds. We further propose the first method to jointly recognize the 3D poses of two hands manipulating objects, as well as a novel topology-aware graph convolutional network for recognizing hand-object interactions.
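The topology-aware network itself is described in the paper; for orientation, the sketch below shows only the basic graph-convolution building block over a hand-object skeleton graph with a normalized adjacency matrix. All graph and feature sizes are assumed for illustration.

```python
# Basic graph convolution over a hand-object skeleton graph (normalized-
# adjacency message passing). This is only the generic building block,
# not the paper's topology-aware network; sizes are assumptions.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalize A + I so each node mixes its features
        # with those of its skeleton neighbours.
        a_hat = adjacency + torch.eye(adjacency.shape[0])
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5).diag()
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, nodes, in_dim) per-node features, e.g. 3D joint coordinates.
        return torch.relu(self.linear(self.a_norm @ x))

# Toy usage: 42 hand joints (two hands) plus 1 object node, 3D inputs.
num_nodes = 43
adj = (torch.rand(num_nodes, num_nodes) > 0.8).float()
adj = ((adj + adj.t()) > 0).float()              # symmetric toy adjacency
layer = GraphConv(3, 64, adj)
out = layer(torch.randn(8, num_nodes, 3))        # -> (8, 43, 64)
```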
We propose Smart Refrigerator, a sensor-equipped food container that recognizes foods and monitors their status. We demonstrate its performance in food detection and suggest that automatic monitoring of food intake can provide intuitive feedback to users.
Mentoring
Alyssa Chan (2024-2025, master's thesis / 4YP, "Recognising British Sign Language in Video using Deep Learning"), MSc student at Oxford
Junho Park (2024-2025, collaboration project, "EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations"), AI researcher at LG; led the EgoWorld submission
Seokjun Kim (2024-2025, collaboration project, "Teleoperating Robots in Virtual Reality"), MSc student at ETH Zurich / Neuromeka -> now PhD student at Georgia Tech
Minsung Kang (2024-2025, semester project, "LLM-Driven Data Augmentation and Classification for Mistake Detection in Egocentric Videos"), MSc student at ETH Zurich
Tavis Siebert (2024-2025, 3DV project, "Gaze-Guided Scene Graphs for Egocentric Action Prediction"), MSc student at ETH Zurich
Dennis Baumann & Christopher Bennewitz (2024-2025, 3DV project, "Contact-Aware Action Recognition"), MSc students at ETH Zurich
Yiming Zhao (2023-2024, master's thesis, "EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision"), MSc student at ETH Zurich; led the CVPR '25 paper EgoPressure (highlight)
Boub Mischa (2023-2024, semester project, "Text-Enhanced Few-Shot Learning for Egocentric Video Action Recognition"), MSc student at ETH Zurich -> now co-founder at Swiss Engineering Partners AG
Junan Lin & Zhichao Sun & Enjie Cao (2023, 3DV project, "CaSAR: Contact-aware Skeletal Action Recognition"), MSc students at ETH Zurich; led the CaSAR technical report
Aashish Singh (2023, semester project, "Object Pose Estimation and Tracking in Mixed Reality Applications"), MSc student at ETH Zurich
Yanik Künzi (2023, bachelor's thesis, "Point Cloud-Based Tutorials for the HoloLens2"), BSc student at ETH Zurich; led the main software development for the project's repository