My research interests are Egocentric Vision, Action Recognition, Contextual AI, Hand-object Interaction, Video Understanding, AR/VR, Multi-modal Learning, Vision-Language Models and Self-supervised Learning.
Previously, I completed my PhD under the supervision of
Prof. Marc Pollefeys at ETH Zurich.
I earned my Master's degree from UCLA and
received my Bachelor's in
Electrical Engineering from Yonsei University, Seoul, Korea.
If you are interested in semester projects (ETHZ), master's theses (ETHZ, Oxford), or personal projects related to action recognition,
egocentric vision, video understanding, and hand-object interaction that could lead to publications, feel free to email
me. I have some projects listed here, but we can also discuss other potential exciting projects.
We introduce EgoPressure, a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact.
We introduce JEGAL, a Joint Embedding space for Gestures, Audio and Language. Our semantic gesture representations can be used to perform multiple downstream tasks such as cross-modal retrieval, spotting gestured words, and identifying who is speaking solely using gestures.
We propose a novel framework for sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. Coupled with our proposed dual augmentation technique, this enhances the network's ability to learn generalizable and discriminative representations.
HoloAssist is a large-scale egocentric human interaction dataset,
where two people collaboratively complete physical manipulation tasks. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.
Contact-aware Skeletal Action Recognition (CaSAR) uses novel representations
of hand-object interaction that encompass spatial information:
1) contact points, where the hand joints meet the objects, and
2) distant points, where the hand joints are far away
from the object and barely involved in the current action.
Our framework learns how the hands touch or stay away from the objects
in each frame of the action sequence and uses this information to predict the action class.
EgoBody is a large-scale dataset of accurate 3D human body shape,
pose and motion of humans interacting in 3D scenes, with multi-modal streams from
third-person and egocentric views, captured by Azure Kinects and a HoloLens 2.
Given two interacting subjects, we leverage a lightweight multi-camera rig to
reconstruct their 3D shape and pose over time.
We propose a skeletal self-supervised learning approach that uses alignment as a pretext task.
Our approach to alignment relies on a context-aware attention model that incorporates spatial
and temporal context within and across sequences, and a contrastive learning formulation based
on 4D skeletal augmentations. Pose data provides a valuable cue for alignment and
downstream tasks, such as phase classification and phase progression, as it is robust
to different camera angles and changes in the background, while being efficient for real-time
processing.
We propose a method to collect a dataset of two hands manipulating objects for
first-person interaction recognition. We provide a rich set of annotations, including action labels,
object classes, 3D left & right hand poses, 6D object poses, camera poses and scene point clouds.
We further propose the first method to jointly recognize the 3D poses of two hands manipulating
objects and a novel topology-aware graph convolutional network for recognizing hand-object interactions.
We propose Smart Refrigerator, a sensor-equipped food container that recognizes foods and
monitors their status. We demonstrate its performance in food detection and suggest that
automatic monitoring of food intake can provide intuitive feedback to users.
Postdoc Mobility Fellowship, Swiss National Science Foundation (155K USD) 2024
Distinguished Paper Award (HoloAssist), EgoVis @ CVPR'24 2024
Grant, Swiss National Science Foundation, “Beyond Frozen Worlds: Capturing functional 3D Digital
Twins from the Real World”. Role: Project Conceptualization; PI: Prof. Marc Pollefeys (2M USD) 2023
Scholarship, Korean Government Scholarship from NIIED (150K USD) 2018
Scholarship, Yonsei International Foundation 2016
IBM Innovation Prize, Startup Weekend, Technology Competition 2015
Best Technology Prize, Internet of Things (IoT) Hackathon by the government of Korea 2014
Best Laboratory Intern, Yonsei Institute of Information and Communication Technology 2014
Scholarship, Yonsei University Foundation, Korean Telecom Group Foundation 2014, 2011, 2010
Creative Prize, Startup Competition, Yonsei University 2014
Talks
2025/01: Video Understanding Team @ Naver
2024/12: GSDS @ Seoul National University
2024/07: Y.Sato Lab @ University of Tokyo
2024/05: Visual Intelligence Lab @ Yonsei University
2024/01: CVLAB @ Korea University
2023/11: Intelligent Robotics Laboratory @ University of Birmingham
2023/08: NVIDIA Research, Taiwan
2023/06: Microsoft Mixed Reality & AI Lab, Zurich
2023/06: AIoT Lab, Seoul National University
2023/05: KASE Open Seminar, ETH Zurich
2022/03: Context-Aware Sequence Alignment using 4D Skeletal Augmentation. Applied Machine Learning Days (AMLD) @ EPFL & Swiss JRC [Link]
2021/10: H2O: Two Hands Manipulating Objects for First Person Interaction Recognition.
ICCV 2021 Workshop on
Egocentric Perception, Interaction and Computing (EPIC) [Link|Video]
2021/04: H2O: Two Hands Manipulating Objects for First Person Interaction Recognition.
Swiss Joint Research Center
(JRC) Workshop 2021 [Link|Video]