Oral Session 3A

December 22, 10:45 AM to 12:45 PM

Chair: Aditya Nigam

15 MERANet: Facial Micro-Expression Recognition using 3D Residual Attention Network
December 22, 10:45:00 to 11:00:00
Authors: Gajjala Viswanatha Reddy (IIIT Sri City); Sai Prasanna Teja Reddy (The University of Chicago); Snehasis Mukherjee (Shiv Nadar University, Greater Noida, India)*; Shiv Ram Dubey (Indian Institute of Information Technology, Sri City, Chittoor)
Abstract: Micro-expression has emerged as a promising modality in affective computing due to its high objectivity in emotion detection. Despite the higher recognition accuracy provided by deep learning models, there is still significant scope for improvement in micro-expression recognition techniques. The presence of micro-expressions in small local regions of the face, as well as the limited size of available databases, continues to limit the accuracy of micro-expression recognition. In this work, we propose a facial micro-expression recognition model using a 3D residual attention network, named MERANet, to tackle these challenges. The proposed model takes advantage of spatial-temporal attention and channel attention together to learn deeper, fine-grained, subtle features for the classification of emotions. Further, the proposed model encompasses both spatial and temporal information simultaneously using 3D kernels and residual connections. Moreover, the channel features and spatio-temporal features are re-calibrated using the channel and spatio-temporal attentions, respectively, in each residual module. Our attention mechanism enables the model to learn to focus on different facial areas of interest. The experiments are conducted on benchmark facial micro-expression datasets. Superior performance is observed compared to the state of the art for facial micro-expression recognition on benchmark data.
Presenting Author: Snehasis Mukherjee
Lab/Author homepage: https://cse.snu.edu.in/people/faculty/snehasis_mukherjee
Paper: https://doi.org/10.1145/3490035.3490260
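The channel re-calibration mentioned in the abstract can be illustrated with a generic squeeze-and-gate sketch. This is not MERANet's actual attention module: the function name `channel_attention`, the single scalar weight and bias per channel, and the flat-list feature layout are all assumptions made to keep the example small.

```python
import math

def channel_attention(feature_map, weights, biases):
    """Rescale each channel by a sigmoid gate computed from its mean activation.

    feature_map: list of channels, each a flat list of activations.
    weights, biases: one hypothetical scalar per channel (stand-ins for learned parameters).
    """
    gated = []
    for channel, w, b in zip(feature_map, weights, biases):
        squeezed = sum(channel) / len(channel)               # global average pooling
        gate = 1.0 / (1.0 + math.exp(-(w * squeezed + b)))   # sigmoid gate in (0, 1)
        gated.append([gate * v for v in channel])
    return gated

fmap = [[1.0, 3.0], [0.0, 0.0]]
out = channel_attention(fmap, weights=[1.0, 1.0], biases=[0.0, 0.0])
```

Channels with a larger mean activation receive a gate closer to 1 and are attenuated less, which is the general idea behind channel-wise re-calibration.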
22 CT-DANN: Co-teaching Meets DANN For Wild Unsupervised Domain Adaptation
December 22, 11:00:00 to 11:15:00
Authors: Rahul Bansal (Flipkart); Soma Biswas (Indian Institute of Science, Bangalore)*
Abstract: Unsupervised domain adaptation aims at leveraging supervision from an annotated source domain for performing tasks like classification/segmentation on an unsupervised target domain. However, a large enough related dataset with clean annotations may not always be available in real scenarios, since annotations are usually obtained from crowdsourcing and are thus noisy. Here, we consider a more realistic and challenging setting, wild unsupervised domain adaptation (WUDA), where the source domain annotations can be noisy. Standard domain adaptation approaches, which directly use these noisy source labels and the unlabeled targets for the domain adaptation task, perform poorly due to severe negative transfer from the noisy source domain. In this work, we propose a novel end-to-end framework, termed CT-DANN (Co-teaching meets DANN), which seamlessly integrates a state-of-the-art approach for handling noisy labels (Co-teaching) with a standard domain adaptation framework (DANN). CT-DANN effectively utilizes all the source samples after accounting for both their noisy labels and their transferability with respect to the target domain. Extensive experiments on three benchmark datasets with different types and levels of noise, and comparison with a state-of-the-art WUDA approach, demonstrate the effectiveness of the proposed framework.
Presenting Author: Soma Biswas
Paper: https://doi.org/10.1145/3490035.3490262
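As a rough illustration of the Co-teaching component named in the abstract, the sketch below shows only the small-loss exchange step: each of two networks picks the samples it finds easiest (lowest loss), and those samples are used to update its peer. The DANN part and the transferability weighting of CT-DANN are not shown, and `keep_ratio` and the loss values are illustrative assumptions.

```python
def small_loss_indices(losses, keep_ratio):
    """Indices of the lowest-loss samples, treated as likely clean."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])

def co_teaching_exchange(losses_a, losses_b, keep_ratio):
    """Return (samples to update network A with, samples to update network B with).

    Each network is trained on the samples selected by its peer.
    """
    picked_by_a = small_loss_indices(losses_a, keep_ratio)
    picked_by_b = small_loss_indices(losses_b, keep_ratio)
    return picked_by_b, picked_by_a

# Hypothetical per-sample losses for one mini-batch of five source samples.
losses_a = [0.1, 2.3, 0.2, 0.4, 1.9]
losses_b = [0.2, 0.3, 0.1, 2.5, 2.1]
train_a_on, train_b_on = co_teaching_exchange(losses_a, losses_b, keep_ratio=0.6)
```

Because the two networks disagree on some samples, each one sees a slightly different filtered batch, which is what limits the accumulation of errors from a single network's own selection.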
49 NTU-X: An Enhanced Large-scale Dataset for Improving Pose-based Recognition of Subtle Human Actions
December 22, 11:15:00 to 11:30:00
Authors: Neel Trivedi (IIIT-Hyderabad)*; Anirudh Thatipelli (International Institute of Information Technology, Hyderabad); Ravi Kiran Sarvadevabhatla (IIIT Hyderabad)
Abstract: The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models. Despite this bottleneck, the community's efforts seem to be invested only in coming up with novel architectures. To specifically address this bottleneck, we introduce two new pose-based human action datasets, NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU-RGBD. In addition to the 25 body joints for each skeleton as in NTU-RGBD, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify the state-of-the-art approaches to enable training using the introduced datasets. Our results demonstrate the effectiveness of these NTU-X datasets in overcoming the aforementioned bottleneck and in improving state-of-the-art performance, both overall and on the previously worst-performing action categories.
Presenting Author: Neel Trivedi
Lab/Author homepage: https://skeleton.iiit.ac.in/ntux
Code: https://github.com/skelemoa/ntu-x
Paper: https://doi.org/10.1145/3490035.3490270
51 Towards Interpretable Facial Emotion Recognition
December 22, 11:30:00 to 11:45:00
Authors: Sarthak Malik (Indian Institute of Technology Roorkee); Puneet Kumar (Indian Institute of Technology Roorkee)*; Balasubramanian Raman (Indian Institute of Technology Roorkee)
Abstract: In this paper, an interpretable deep-learning-based system is proposed for facial emotion recognition. A novel approach to interpreting the proposed system's results, Divide & Conquer based Shapley additive explanations (DnCShap), has also been developed. The proposed approach computes 'Shapley values' that denote the contribution of each image feature towards a particular prediction. The Divide and Conquer algorithm has been incorporated to compute the Shapley values in linear time instead of the exponential time taken by existing interpretability approaches. The experiments performed on four facial emotion recognition datasets, i.e., FER-13, FERG, JAFFE, and CK+, yielded emotion classification accuracies of 62.62%, 99.68%, 91.97%, and 99.67%, respectively. The results show that DnCShap consistently identifies the highly relevant facial features for emotion classification across the datasets.
Presenting Author: Puneet Kumar
Lab/Author homepage: https://balarsgroup.github.io/
Code: https://github.com/MIntelligence-Group/InterpretableFER
Paper: https://doi.org/10.1145/3490035.3490271
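For readers unfamiliar with Shapley values, the sketch below computes them exactly by enumerating every subset of features, which is the exponential-time baseline the abstract contrasts with. It is not DnCShap; the three "regions" and the additive toy scoring function `toy_score` are hypothetical stand-ins for image features and a model's prediction score.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all subsets (exponential in len(players))."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_score(regions):
    """Hypothetical prediction score as a function of which face regions are visible."""
    score = 0.0
    if "mouth" in regions:
        score += 0.5
    if "eyes" in regions:
        score += 0.3
    if "mouth" in regions and "eyes" in regions:
        score += 0.2  # interaction term, shared equally by the two regions
    return score

phi = shapley_values(["mouth", "eyes", "forehead"], toy_score)
```

In this toy game the attributions sum to the full score of 1.0, and the region that never changes the score ("forehead") receives zero, two standard properties of Shapley values.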
118 Handling Ambiguous Annotations for Facial Expression Recognition in the Wild
December 22, 11:45:00 to 12:00:00
Authors: Darshan Gera (SSSIHL)*; Vikas G N (Student); Balasubramanian S (SSSIHL)
Abstract: Annotation ambiguity due to the subjectivity of annotators, crowd-sourcing, inter-class similarity, and poor quality of facial expression images has been a key challenge for robust Facial Expression Recognition (FER). Recent deep learning (DL) solutions for this problem select clean samples for training by using two or more networks simultaneously. Based on the observation that wrongly annotated samples have inconsistent predictions compared to clean samples when transformed using different augmentations, we propose a simple and effective single-network FER framework that is robust to noisy annotations. Specifically, we qualify an image as clean (correctly labeled) if the Jensen-Shannon (JS) divergence between its ground truth distribution and the predicted distribution for its weakly augmented version is smaller than a threshold. The threshold is dynamically tuned. The qualified clean samples facilitate supervision during training. Further, to learn hard samples (correctly labeled but difficult to classify), we enforce consistency between the predicted distributions of the weakly and strongly augmented versions of every training image through a consistency loss. Comprehensive experiments on FER datasets like RAFDB, FERPlus, curated FEC, and AffectNet, in the presence of both synthetic and real noisy annotation settings, demonstrate the robustness of the proposed method. The source code is publicly available at https://github.com/1980x/HandlingAmbigiousFERAnnotations.
Presenting Author: Vikas G.N
Code: https://github.com/1980x/HandlingAmbigiousFERAnnotations
Dataset: https://github.com/1980x/CCT/tree/main/FECdataset
Paper: https://doi.org/10.1145/3490035.3490289
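The clean-sample test described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the fixed `threshold` value is a placeholder (the abstract says the threshold is tuned dynamically), and the example distributions are made up.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_clean(label_dist, pred_dist, threshold):
    """Qualify a sample as clean if prediction and label distributions are close."""
    return js_divergence(label_dist, pred_dist) < threshold

one_hot = [0.0, 1.0, 0.0]        # annotated label as a distribution
agree = [0.05, 0.90, 0.05]       # prediction consistent with the label
disagree = [0.80, 0.10, 0.10]    # prediction contradicting the label
threshold = 0.2                  # hypothetical fixed value for illustration
```

With these inputs, `is_clean(one_hot, agree, threshold)` holds while `is_clean(one_hot, disagree, threshold)` does not, so only the first sample would contribute to the supervised loss.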
169 Selective Mixing and Voting Network for Semi-supervised Domain Generalization
December 22, 12:00:00 to 12:15:00
Authors: Ahmad Arfeen (IISc Bangalore); Titir Dutta (Indian Institute of Science, Bangalore)*; Soma Biswas (Indian Institute of Science, Bangalore)
Abstract: Domain generalization (DG) addresses the problem of generalizing classification performance across any unknown domain by leveraging training samples from multiple source domains. Currently, the training process of state-of-the-art DG methods depends on a large amount of labeled data. This restricts the application of the models in many real-world scenarios, where collecting such a dataset is an expensive and difficult task. Thus, in this paper, we explore the problem of Semi-supervised Domain Generalization (SSDG), where the training set contains only a small amount of labeled data in addition to a large amount of unlabeled data from multiple domains. This is relatively unexplored in the literature and poses a considerable challenge to state-of-the-art DG models, since their performance degrades under such conditions. To address this scenario, we propose a novel Selective Mixing and Voting Network (SMV-Net), which effectively extracts useful knowledge from the unlabeled training data available to the model. Specifically, we first propose a mixing strategy on selected unlabeled samples, for which the model is confident about the predicted class labels, to achieve a domain-invariant representation of the data that generalizes effectively across any unseen domain. Second, we propose a voting module, which can further comment on the prediction of the test samples using references from the small labeled training set, despite their domain gap. Finally, we introduce a test-time mixing strategy to re-examine the top class predictions and re-order them if required to further boost the classification performance. Extensive experiments on two popular DG datasets demonstrate the usefulness of the proposed framework.
Presenting Author: Ahmad Arfeen
Paper: https://doi.org/10.1145/3490035.3490303
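The idea of mixing only confidently predicted unlabeled samples can be sketched generically as below. The abstract does not give SMV-Net's selection rule or mixing formula, so the confidence `threshold`, the convex combination in `mix`, and the coefficient `lam` are assumptions in the style of mixup, and the feature vectors are toy values.

```python
def select_confident(probs, threshold):
    """Return (index, pseudo_label) for samples whose top class probability reaches the threshold."""
    picked = []
    for i, p in enumerate(probs):
        top = max(range(len(p)), key=lambda c: p[c])
        if p[top] >= threshold:
            picked.append((i, top))
    return picked

def mix(x_a, x_b, lam):
    """Convex combination of two feature vectors (mixup-style)."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]

# Hypothetical unlabeled samples and the model's predicted class probabilities.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probs = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]

confident = select_confident(probs, threshold=0.9)   # the third sample is left out
(i, _), (j, _) = confident
mixed = mix(features[i], features[j], lam=0.7)
```

Only the two confidently predicted samples take part in the mixing, so an unreliable pseudo-label does not contaminate the mixed training example.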
185 Realistic Talking Face Animation with Speech-Induced Head Motion
December 22, 12:15:00 to 12:30:00
Authors: Sandika Biswas (Tata Consultancy Services)*; Sanjana Sinha (TCS); Dipanjan Das (TCS); Brojeshwar Bhowmick (Tata Consultancy Services)
Abstract: Recent advances in talking face generation from speech have mostly focused on lip synchronization and realistic facial movements such as eye blinks and eyebrow motions, but do not generate meaningful head motions according to the speech. This results in a lack of realism, especially in long speech. A few recent methods try to animate the head motions, but they mostly rely on a short driving head motion video. In general, the prediction of head motion is largely dependent upon the prosodic information of the speech in the current time window. In this paper, we propose a method for generating speech-driven realistic talking face animation that has speech-coherent head motions with accurate lip sync, natural eye blinks, and high-fidelity texture. In particular, we propose an attention-based GAN network to identify the audio most highly correlated with the speaker's head motion and to learn the relationship between the prosodic information of the speech and the corresponding head motions. Experimental results show that our animations are significantly better in terms of output video quality, realism of head movements, lip sync, and eye blinks when compared to state-of-the-art methods, both qualitatively and quantitatively. Moreover, our user study shows that our speech-coherent head motions make the animation more appealing to users.
Presenting Author: Sandika Biswas
Paper: https://doi.org/10.1145/3490035.3490305
193 Discriminative Multiscale CNN Network for Smartphone Based Robust Gait Recognition
December 22, 12:30:00 to 12:45:00
Authors: Sonia Das (National Institute of Technology Rourkela)*; Sukadev Meher (NIT Rourkela); Upendra Kumar Sahoo (National Institute of Technology, Rourkela)
Abstract: Smartphone-based gait recognition is an interesting research topic in surveillance. Its goal is to recognize a target user from their walking pattern using the inertial signal. However, the performance in realistic scenarios is unsatisfactory due to several covariate factors, such as carrying conditions, different surface types, different shoes, different clothes, and unconstrained placement of the mobile phone during walking, all of which affect the gait sample data captured by the sensors. Recently, many traditional single-scale CNN networks have been employed for sensor-based gait recognition. However, these are limited to classifying normal gait samples without covariate factors. To address these challenges, in this paper, a novel discriminative Multiscale CNN network (DMSCNN) is designed to introduce both local and global feature extraction procedures for improving classification accuracy. First, the proposed network discovers the coarse-grained features (local features) using multiscale CNN analysis to handle different covariate-based variation effects, and highlights the significance of local features with respect to class-specific samples by incorporating a class-specific weight update network in order to find discriminative local features. These local features are then fused to obtain global features for improving the overall recognition rate. Experiments are performed to evaluate the robustness of the proposed model using four benchmark datasets. The results show that the proposed model achieves higher identification accuracy than other state-of-the-art methods.
Presenting Author: Sonia Das
Paper: https://doi.org/10.1145/3490035.3490308
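The notion of multiscale analysis of an inertial signal can be illustrated by convolving the same 1D signal with kernels of several lengths. This is only a sketch of the general idea, not the DMSCNN architecture: the averaging kernels stand in for learned filters, and the kernel sizes and the toy accelerometer trace are assumptions.

```python
def conv1d_valid(signal, kernel):
    """1D sliding-window filter (cross-correlation, as in CNN layers) without padding."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def multiscale_features(signal, kernel_sizes):
    """Apply one filter per scale; an averaging filter stands in for learned weights."""
    feats = {}
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        feats[k] = conv1d_valid(signal, kernel)
    return feats

# Hypothetical accelerometer trace with a period of four samples.
accel = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
feats = multiscale_features(accel, kernel_sizes=[2, 4])
```

The short kernel keeps the step-to-step oscillation, while the kernel that spans a full period averages it away, which shows why different scales expose different aspects of the same walking pattern.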
