In real-world datasets, particularly TV recordings, videos often contain multiple people and multiple camera angles, which poses significant challenges for gesture recognition and retrieval. Besides being of interest to linguists, gesture retrieval is a new and challenging application for multimedia retrieval. In this paper, we propose a method for spatio-temporal gesture retrieval based on visual and pose information, which can retrieve similar gestures in multi-person scenes through continuous shots. Attention-aware features extracted from human pose keypoints, together with a sophisticated pre-processing module, reduce the susceptibility of gesture retrieval to background noise and occlusion. We evaluated our method on a subset of the NewsScape Dataset. Our experimental results demonstrate the effectiveness of the proposed method in retrieving similar gestures in occluded scenes, as measured by the quality of the top 5 results.
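To make the idea of pose-based top-5 retrieval concrete, the following is a minimal illustrative sketch and not the method proposed in the paper. It assumes each gesture clip is a list of frames, each frame a list of 2D keypoints, and that all clips have the same number of frames and keypoints. The function names (`normalize_pose`, `gesture_descriptor`, `retrieve_top_k`) are hypothetical, and plain cosine similarity over normalized keypoints stands in for the attention-aware features, which the abstract does not specify.

```python
import math


def normalize_pose(keypoints):
    """Center one frame's keypoints on their centroid and scale to unit norm.

    This removes the person's position and apparent size in the frame.
    """
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centered = [(x - cx, y - cy) for x, y in keypoints]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered)) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def gesture_descriptor(frames):
    """Flatten a clip (list of frames) into one spatio-temporal vector."""
    vec = []
    for keypoints in frames:
        for x, y in normalize_pose(keypoints):
            vec.extend((x, y))
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def retrieve_top_k(query, gallery, k=5):
    """Return up to k (similarity, clip_name) pairs, most similar first.

    `gallery` maps a clip name to its frames (hypothetical data layout).
    """
    q = gesture_descriptor(query)
    scored = [
        (cosine_similarity(q, gesture_descriptor(frames)), name)
        for name, frames in gallery.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Because each frame is centered and rescaled before comparison, a clip that shows the same gesture at a different position or size in the image would rank at the top of the returned list. Handling occlusion, multiple people, and shot boundaries would need the pre-processing the abstract mentions, which this sketch does not attempt.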