Audio self-supervised learning: A surveySimilar to humans' cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.
A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical InterviewsThe high prevalence of depression in society has given rise to a need for new digital tools that can aid its early detection. Among other effects, depression impacts the use of language. Seeking to exploit this, this work focuses on the detection of depressed and non-depressed individuals through the analysis of linguistic information extracted from transcripts of clinical interviews with a virtual agent. Specifically, we investigated the advantages of employing hierarchical attention-based networks for this task. Using Global Vectors (GloVe) pretrained word embedding models to extract low-level representations of the words, we compared hierarchical local-global attention networks and hierarchical contextual attention networks. We performed our experiments on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WoZ) dataset, which contains audio, visual, and linguistic information acquired from participants during a clinical session. Our results using the DAIC-WoZ test set indicate that hierarchical contextual attention networks are the most suitable configuration to detect depression from transcripts. The configuration achieves an Unweighted Average Recall (UAR) of .66 using the test set, surpassing our baseline, a Recurrent Neural Network that does not use attention.
A Novel Attention-Based Gated Recurrent Unit and its Efficacy in Speech Emotion RecognitionNotwithstanding the significant advancements in the field of deep learning, the basic long short-term memory (LSTM) or Gated Recurrent Unit (GRU) units have largely remained unchanged and unexplored. There are several possibilities in advancing the state-of-art by rightly adapting and enhancing the various elements of these units. Activation functions are one such key element. In this work, we explore using diverse activation functions within GRU and bi-directional GRU (BiGRU) cells in the context of speech emotion recognition (SER). We also propose a novel Attention ReLU GRU (AR-GRU) that employs attention-based Rectified Linear Unit (AReLU) activation within GRU and BiGRU cells. We demonstrate the effectiveness of AR-GRU on one exemplary application using the recently proposed network for SER namely Interaction-Aware Attention Network (IAAN). Our proposed method utilising AR-GRU within this network yields significant performance gain and achieves an unweighted accuracy of 68.3% (2% over the baseline) and weighted accuracy of 66.9 % (2.2 % absolute over the baseline) in four class emotion recognition on the IEMOCAP database.
MuSe 2020 Challenge and WorkshopMultimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
Face mask recognition from audio: The MASC database and an overview on the mask challenge