Tutorial descriptions

Sunday, 20 August - Morning Tutorials, 09:00-12:30

Statistical Parametric Speech Processing: Solving Problems with the Model-based Approach

Organizers:

  • Mads Græsbøll Christensen, Aalborg University
  • Assistant Prof. Jesper Rindom Jensen, Aalborg University
  • Assistant Prof. Jesper Kjær Nielsen, Aalborg University

Abstract:

Parametric speech models have been around for many years but have always had their detractors. Two common arguments against such models are that it is too difficult to find their parameters and that the models do not take the complicated nature of real signals into account. In recent years, significant advances have been made in speech models and in robust and computationally efficient estimation using statistical principles, and it has been demonstrated that, regardless of any deficiencies in the model, the parametric methods outperform the more commonly used non-parametric methods (e.g., autocorrelation-based methods) for problems like pitch estimation. The application of these principles, however, extends well beyond that problem. In this tutorial, state-of-the-art parametric speech models and statistical estimators for finding their parameters will be presented and their pros and cons discussed. The merits of the statistical, parametric approach to speech modeling will be demonstrated via a number of well-known problems in speech, audio, and acoustic signal processing. Examples of such problems are pitch estimation for non-stationary speech, distortionless speech enhancement, noise statistics estimation, speech segmentation, multi-channel modeling, and model-based localization and beamforming with microphone arrays.
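For readers unfamiliar with the non-parametric baseline mentioned above, a minimal autocorrelation-based pitch estimator can be sketched as follows. The signal, sampling rate, and search range are illustrative assumptions, not material from the tutorial:

```python
import numpy as np

def autocorr_pitch(x, fs, f_min=60.0, f_max=400.0):
    """Estimate pitch (Hz) of a frame x by picking the autocorrelation
    peak within the lag range corresponding to [f_min, f_max]."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lag_lo = int(fs / f_max)                          # shortest lag searched
    lag_hi = int(fs / f_min)                          # longest lag searched
    lag = lag_lo + np.argmax(r[lag_lo:lag_hi + 1])
    return fs / lag

# Toy check on a synthetic harmonic signal with a 200 Hz fundamental
fs = 8000
t = np.arange(int(0.05 * fs)) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
print(round(autocorr_pitch(x, fs)))  # prints 200
```

The parametric methods discussed in the tutorial instead fit an explicit harmonic model to the signal, which is what gives them their robustness on non-stationary and noisy speech.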

Deep Learning for Dialogue Systems

Organizers:

  • Yun-Nung Chen, National Taiwan University, Taipei, Taiwan
  • Asli Celikyilmaz, Microsoft Research, Redmond, WA
  • Dilek Hakkani-Tur, Google Research, Mountain View, CA

Abstract:

Over the past decade, goal-oriented spoken dialogue systems (SDS) have become the most prominent component of today’s virtual personal assistants (VPAs). VPAs such as Microsoft’s Cortana, Apple’s Siri, Amazon Alexa, Google Home, and Facebook’s M incorporate SDS modules in various devices, allowing users to speak naturally in order to finish tasks more efficiently. Traditional conversational systems have rather complex and/or modular pipelines. The advance of deep learning technologies has recently given rise to the application of neural models to dialogue modeling. Nevertheless, applying deep learning technologies to building robust and scalable dialogue systems remains a challenging task and an open research area, as it requires a deep understanding of the classic pipelines as well as detailed knowledge of benchmark models from both prior work and recent state-of-the-art work. This tutorial therefore provides an overview of dialogue system development, describes the most recent research on building dialogue systems, and summarizes the remaining challenges. Its goal is to give the audience a view of the developing trends in dialogue systems and a roadmap for getting started with the related work.

Insights from qualitative research: an introduction to the phonetics of talk-in-interaction

Organizers:

  • Richard Ogden, Department of Language & Linguistic Science, Centre for Advanced Studies in Language & Communication, University of York, UK
  • Jan Gorisch, Department of Pragmatics, Institute for the German Language (IDS), Mannheim, Germany
  • Gareth Walker, School of English, University of Sheffield, UK
  • Meg Zellers, Department of Linguistics: English, University of Stuttgart, Germany

Abstract:

This tutorial will provide an overview of the methods and findings of Conversation Analysis (CA)
through hands-on analysis of conversational data, exploring how qualitative analysis can inform
quantitative analyses of speech. Analysis will focus on how speakers in conversation use the phonetic shape of their talk to provide recognisable places for others to take turns and which features are recognised as providing such opportunities. The tutorial will be led by experts working at the interface of CA and phonetics.

Real-world ambulatory monitoring of vocal behavior

Organizer:

  • Daryush D. Mehta, Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital

Abstract:

Many of us take verbal communication for granted. Individuals suffering from voice disorders
experience significant communication disabilities with far-reaching social, professional, and personal consequences. This tutorial provides an overview of long-term, ambulatory monitoring of daily voice use and in-depth discussions of interdisciplinary research spanning biomedical technology, signal processing, machine learning, and clinical voice assessment. Innovations in mobile and wearable sensor technologies continue to aid in the quantification of vocal behavior that can be used to provide real-time monitoring and biofeedback to facilitate the prevention, diagnosis, and treatment of behaviorally based voice disorders.

Creating Speech Databases of Less-Resourced Languages: A CLARIN Hands-On Tutorial

Organizers:

  • Christoph Draxler, Institute of Phonetics and Speech Communication, Ludwig Maximilian University Munich, Germany
  • Florian Schiel, Institute of Phonetics and Speech Communication, Ludwig Maximilian University Munich, Germany
  • Thomas Kisler, Institute of Phonetics and Speech Communication, Ludwig Maximilian University Munich, Germany

Abstract:

The creation of speech databases for spoken language research and development, especially for less-resourced languages, is a time-consuming and largely manual task. In this tutorial we present a workflow comprising the specification, recording, transcription, segmentation, and publication of spoken language data. We will demonstrate how to use a) semi-automatic tools and b) crowdsourcing wherever possible to speed up the process. We will conclude by showing how such speech databases may be employed to adapt existing tools and services to new languages, thus facilitating access to these languages.

Sunday, 20 August - Afternoon Tutorials, 13:30-17:00

Deep Learning for Text-to-Speech Synthesis, using the Merlin toolkit

Organizers:

  • Simon King, Centre for Speech Technology Research, University of Edinburgh, UK
  • Oliver Watts, Centre for Speech Technology Research, University of Edinburgh, UK
  • Srikanth Ronanki, Centre for Speech Technology Research, University of Edinburgh, UK
  • Zhizheng Wu, Apple Inc, USA

Abstract:

This tutorial will combine the theory and practical application of Deep Neural Networks (DNNs) for Text-to-Speech (TTS). It will illustrate how DNNs are rapidly advancing the performance of all areas of TTS, including waveform generation and text processing, using a variety of model architectures. We will link the theory to implementation with the Open Source Merlin toolkit.
http://www.cstr.ed.ac.uk/projects/merlin
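The core idea behind DNN-based statistical parametric TTS is a network that regresses acoustic parameters from linguistic input features. The following toy numpy sketch trains a one-hidden-layer network of this shape on synthetic data; it does not use the Merlin toolkit or its data formats, and all sizes, data, and hyperparameters are assumptions for illustration only:

```python
import numpy as np

# Toy stand-in for DNN acoustic modelling: regress "acoustic" targets
# from "linguistic" feature vectors with one hidden layer.
rng = np.random.default_rng(0)
n, d_in, d_h, d_out = 200, 10, 32, 4

X = rng.standard_normal((n, d_in))          # pseudo linguistic features
true_W = rng.standard_normal((d_in, d_out))
Y = np.tanh(X @ true_W)                     # synthetic acoustic targets

W1 = rng.standard_normal((d_in, d_h)) * 0.1
b1 = np.zeros(d_h)
W2 = rng.standard_normal((d_h, d_out)) * 0.1
b2 = np.zeros(d_out)
lr = 0.05

losses = []
for _ in range(300):
    h = np.tanh(X @ W1 + b1)                # hidden layer
    pred = h @ W2 + b2                      # predicted acoustic frame
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the mean-squared-error loss
    g_pred = 2 * err / n
    g_h = g_pred @ W2.T * (1 - h ** 2)
    W2 -= lr * (h.T @ g_pred)
    b2 -= lr * g_pred.sum(axis=0)
    W1 -= lr * (X.T @ g_h)
    b1 -= lr * g_h.sum(axis=0)
```

In a real system such as Merlin, the inputs are rich linguistic context features, the outputs are vocoder parameters, and the architectures are far deeper; the regression principle, however, is the same.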

Computational modeling of language acquisition

Organizers:

  • Naomi Feldman, University of Maryland, MD
  • Emmanuel Dupoux, Ecole des Hautes Etudes en Sciences Sociales, France
  • Okko Räsänen, Aalto University, Finland

Abstract:

Children learn their native language simply by interacting with their environment. Computational modeling of language acquisition aims to understand the information processing principles underlying the human capability to learn spoken languages without formal instruction. In addition to its basic scientific value, understanding of human language acquisition may aid in the development of more advanced spoken language capabilities for machines. The goal of this tutorial is to introduce participants to the basics of computational cognitive modeling, especially in the context of learning linguistic structures from real acoustic speech without labeled training data, and to provide an overview of the ongoing state-of-the-art research in the area.
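As a toy illustration of the unsupervised setting described above, discovering categories from unlabeled observations, one can cluster frame-level features without any labels. The two Gaussian clouds below are a synthetic stand-in for real speech features, an assumption purely for illustration:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and
    centroid update, a minimal example of label-free learning."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated "phone-like" feature clouds (synthetic stand-ins)
rng = np.random.default_rng(1)
a = rng.normal([0.0, 0.0], 0.3, size=(100, 2))
b = rng.normal([3.0, 3.0], 0.3, size=(100, 2))
X = np.vstack([a, b])
labels, centers = kmeans(X, k=2)
```

Real acoustic-unit discovery faces speaker variability, coarticulation, and unknown category counts, which is precisely why the modeling work covered in the tutorial goes far beyond this sketch.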

Latest Advances in Computational Speech and Audio Analysis: Big Data, Deep Learning, and Whatnots

Organizers:

  • Björn W. Schuller, Imperial College London, U.K. & University of Passau, Germany & audEERING GmbH, Germany
  • Nicholas Cummins, University of Passau, Germany

Abstract:

Conventional speech-based recognition and classification systems learn from information captured in hand-engineered features. These features have been purposely designed and meticulously refined over decades to capture certain aspects of speech production, acoustic properties, or phonetic information inherent in speech. However, the feature representation paradigm is currently changing: the advent of newer learning paradigms such as deep neural networks and marked increases in computing power have resulted in a shift away from hand-crafted feature representations – representations can now be determined by the system itself during the learning process, albeit often at the cost of requiring large(r) amounts of data. At the same time, speech and audio analysis is becoming broader and increasingly holistic, targeting the simultaneous extraction of a broad range of aspects inherent in the signal. In this regard, this tutorial will cover the most important aspects related to the latest advances around “big data” and “deep learning”, to name but two of the major aspects in recent computational speech and audio analysis; from new feature representation paradigms through to the tools needed to collect the big data required to fully harness and realise their potential. Besides covering these topics on a theoretical level, this tutorial will feature hands-on experience in which participants will receive training to use relevant state-of-the-art toolkits. These include scripts for end-to-end learning, openSMILE and openXBOW for feature representations, CURRENNT and others for deep learning, and openCoSy and iHEARu-PLAY for rapid learning data acquisition by efficient social media mining and its annotation by gamified dynamic cooperative crowd-learning.
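To make the notion of hand-engineered features concrete, the sketch below computes two classic frame-level descriptors, short-time energy and zero-crossing rate. The signals, frame length, and hop size are illustrative assumptions; toolkits such as openSMILE compute far richer descriptor sets:

```python
import numpy as np

def frame_features(x, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame --
    two classic hand-engineered speech descriptors."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        f = x[start:start + frame_len]
        energy = float(np.mean(f ** 2))
        # Fraction of adjacent sample pairs whose sign changes
        zcr = float(np.mean(np.abs(np.diff(np.sign(f))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

# A low-frequency tone has a much lower zero-crossing rate than noise
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)
noise = rng.standard_normal(fs)
```

A deep-learning pipeline of the kind surveyed in the tutorial would instead learn such representations directly from the waveform or spectrogram during training.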

Modelling Situated Multi-modal Interaction with the Furhat Robot Head

Organizers:

  • Gabriel Skantze, KTH Speech Music and Hearing & Furhat Robotics, Sweden
  • André Pereira, Furhat Robotics, Sweden

Abstract:

Spoken face-to-face communication is likely to be the most important means of interaction with robots in the future. In addition to speech technology, this also requires the use of visual information in the form of facial expressions, lip movement and gaze. Human-robot interaction is also naturally situated, which means that the situation in which the interaction takes place is of importance. In such settings, there might be several speakers involved (multi-party interaction), and there might be objects in the shared space that can be referred to. Recent years have seen an increased interest in modeling such communication for human-robot interaction.

In this tutorial, we will start by providing the theoretical background of spoken face-to-face interaction and how it applies to human-robot interaction. We will then go through the state of the art in the different technologies needed and how this kind of interaction can be modeled. To make the tutorial as concrete as possible, we will use the Furhat social robot platform to show how different interaction patterns can be implemented and (depending on the number of participants) to give hands-on exercises on programming human-robot interaction for a social robot.