The ICSI Meeting Corpus
Adam Janin, Don Baron, Jane A. Edwards et al.|2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).|2003
We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).
The SuperSID project: exploiting high-level information for high-accuracy speaker recognition
D.A. Reynolds, W.D. Andrews, Jessica K. Campbell et al.|2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).|2003
The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide-ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.
Speaker normalization on conversational telephone speech
This paper reports on a simplified system for determining vocal tract normalization. Such normalization has led to significant gains in recognition accuracy by reducing variability among speakers and allowing the pooling of training data and the construction of sharper models. But standard methods for determining the warp scale have been extremely cumbersome, generally requiring multiple recognition passes. We present a new system for warp scale selection which uses a simple generic voiced speech model to rapidly select appropriate frequency scales. The selection is sufficiently streamlined that it can be moved completely into the front-end processing. Using this system on a standard test of the Switchboard Corpus, we have achieved relative reductions in word error rates of 12% over unnormalized gender-independent models and 6% over our best unnormalized gender-dependent models.
The ICSI Meeting Project: Resources and Research
This paper provides a progress report on ICSI’s Meeting Project, including both the data collected and annotated as part of the project and the research lines such materials support. We include a general description of the official “ICSI Meeting Corpus”, as currently available through the Linguistic Data Consortium, discuss some of the existing and planned annotations which augment the basic transcripts provided there, and describe several research efforts that make use of these materials. The corpus supports wide-ranging efforts, from low-level processing of the audio signal (including automatic speech transcription, speaker tracking, and work on far-field acoustics) to higher-level analyses of meeting structure, content, and interactions (such as topic and sentence segmentation, and automatic detection of dialogue acts and meeting “hot spots”).
Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System
Xavier Anguera, Chuck Wooters, Barbara Peskin et al.|Lecture Notes in Computer Science|2006