Alana: Social Dialogue using an Ensemble Model and a Ranker trained on User Feedback
Ioannis Papaioannou, Amanda Cercas Curry, Jose L. Part et al.|Edinburgh Napier Research Repository (Edinburgh Napier University)|2017
We describe our Alexa prize system (called ‘Alana’) which consists of an ensemble of bots, combining rule-based and machine learning systems, and using a contextual ranking mechanism to choose system responses. This paper reports on the version of the system developed and evaluated in the semi-finals of the competition (i.e. up to 15 August 2017), but not on subsequent enhancements. The ranker for this system was trained on real user feedback received during the competition, where we address the problem of how to train on noisy and sparse feedback. In order to avoid initial problems of inappropriate and boring utterances coming from big datasets such as Reddit and Twitter, we later focussed on ‘clean’ data sources such as news and facts. We report on experiments with different ranking functions and versions of our NewsBot. We find that a multiturn news strategy is beneficial, and that a ranker trained on the ratings feedback from users is also effective. Our system continuously improved using the data gathered over the course of the competition (1 July – 15 August). Our final user score (averaged user rating over the whole semi-finals period) was 3.12, and we achieved 3.3 for the averaged user rating over the last week of the semi-finals (8-15 August 2017). We were also able to achieve long dialogues (average 10.7 turns) during the competition period. In subsequent weeks, after the end of the semi-final competition, we have achieved our highest scores of 3.52 (daily average, 18th October), 3.45 (weekly average on 23 and 24 October), and average dialogue lengths of 14.6 turns (1 October), and median dialogue length of 2.25 minutes (average for 7 days on 10th October).
A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI
Automatic Speech Recognition (ASR) systems are increasingly powerful and more accurate, but also more numerous with several options existing currently as a service (e.g. Google, IBM, and Microsoft). Currently the most stringent standards for such systems are set within the context of their use in, and for, Conversational AI technology. These systems are expected to operate incrementally in real-time, be responsive, stable, and robust to the pervasive yet peculiar characteristics of conversational speech such as disfluencies and overlaps. In this paper we evaluate the most popular of such systems with metrics and experiments designed with these standards in mind. We also evaluate the speaker diarization (SD) capabilities of the same systems which will be particularly important for dialogue systems designed to handle multi-party interaction. We found that Microsoft has the leading incremental ASR system which preserves disfluent materials and IBM has the leading incremental SD system in addition to the ASR that is most robust to speech overlaps. Google strikes a balance between the two but none of these systems are yet suitable to reliably handle natural spontaneous conversations in real-time.
Learning how to learn: An adaptive dialogue agent for incrementally learning visually grounded word meanings
Yanchao Yu, Arash Eshghi, Oliver Lemon|Edinburgh Napier Research Repository (Edinburgh Napier University)|2017
We present an optimised multi-modal dialogue agent for interactive learning of visually grounded word meanings from a human tutor, trained on real human-human tutoring data. Within a life-long interactive learning period, the agent, trained using Reinforcement Learning (RL), must be able to handle natural conversations with human users, and achieve good learning performance (i.e. accuracy) while minimising human effort in the learning process. We train and evaluate this system in interaction with a simulated human tutor, which is built on the BURCHAK corpus – a Human-Human Dialogue dataset for the visual learning task. The results show that: 1) The learned policy can coherently interact with the simulated user to achieve the goal of the task (i.e. learning visual attributes of objects, e.g. colour and shape); and 2) it finds a better trade-off between classifier accuracy and tutoring costs than hand-crafted rule-based policies, including ones with dynamic policies.
Information density and overlap in spoken dialogue
Incrementally Learning Semantic Attributes through Dialogue Interaction
Enabling a robot to properly interact with users plays a key role in the effective deployment of robotic platforms in domestic environments. Robots must be able to rely on interaction to improve their behaviour and adaptively understand their operational world. Semantic mapping is the task of building a representation of the environment that can be enhanced through interaction with the user. In this task, a proper and effective acquisition of semantic attributes of targeted entities is essential for the task accomplishment itself. In this paper, we focus on the problem of learning dialogue policies to support semantic attribute acquisition, so that the effort required by humans in providing knowledge to the robot through dialogue is minimized. To this end, we design our Dialogue Manager as a multi-objective Markov Decision Process, solving the optimisation problem through Reinforcement Learning. The Dialogue Manager interfaces with an online incremental visual classifier, based on a Load-Balancing Self-Organizing Incremental Neural Network (LB-SOINN). Experiments in a simulated scenario show the effectiveness of the proposed solution, suggesting that perceptual information can be properly exploited to reduce human tutoring cost. Moreover, a dialogue policy trained on a small amount of data generalises well to larger datasets, and so the proposed online scheme, as well as the real-time nature of the processing, are suited for an extensive deployment in real scenarios. To this end, this paper provides a demonstration of the complete system on a real robot.