Machine Learning for Auditory Hierarchy

William Coleman, Technological University Dublin

Document Type: Thesis, Ph.D.

Abstract

Audio content is consumed today in a plethora of ways: on stereo headphones, via a home cinema system, in the car, or on a smart speaker. The content may be delivered as an MP3, WAV, FLAC, AIFF or OGG file, or via any number of other audio and video streaming options. The content itself may be a game, music, or drama and current affairs broadcasting.

Audio content is predominantly delivered as a stereo audio file containing a static, pre-formed mix. The content creator makes volume, position and effects decisions, generally for presentation on stereo speakers, but ultimately has no control over how the content will be consumed. This leads to poor listener experience when, for example, a feature film is mixed such that the dialogue sits at a low level relative to the sound effects. Consumers may complain that they must turn the volume up to hear the words, then back down again because the effects are too loud. Addressing this problem requires a television mix optimised for the stereo speakers used in the vast majority of homes, which is not always available.

The concept of object-based audio envisages content delivery not as a fixed mix but as a series of auditory objects, each of which can be flexibly controlled on its own. This would increase the flexibility available to creators, allowing them to design sound mixes for multiple consumption paradigms. A package of audio content could then be provided with a menu of mix configurations, giving consumers the option of choosing which to use. Object-based audio could also be used to automate content decisions in an informed manner for different scenarios. If a television mix is required for a film where none is available, a model could be applied to generate an appropriate mix which balances dialogue and effects levels. If it became necessary to reduce the amount of data transmitted, variable compression could be applied to objects, selectively reducing file sizes: the most important objects could be reproduced at the highest quality with no compression, while less critical objects could be heavily compressed and rendered at lower quality. From these examples, it follows that an ability to predict the importance of auditory objects would be useful, as it would permit the selective treatment of assets for both creative and delivery strategies.
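
As a minimal sketch of this idea in Python, an object-based package might carry an importance score per auditory object and use it to drive both mixing and compression decisions. All names, thresholds and gain values below are illustrative assumptions, not a format or model from the thesis.

    from dataclasses import dataclass, field

    @dataclass
    class AudioObject:
        """One auditory object in an object-based delivery package."""
        name: str                # e.g. "dialogue", "effects", "music"
        samples: list = field(default_factory=list)  # placeholder audio buffer
        importance: float = 0.5  # predicted hierarchy score in [0, 1]

    def television_mix_gains(objects, dialogue_boost_db=6.0):
        """Sketch of an automated 'television mix': boost high-importance
        objects (typically dialogue) and attenuate the rest."""
        return {obj.name: dialogue_boost_db if obj.importance > 0.8 else -3.0
                for obj in objects}

    def bitrate_kbps(obj):
        """Sketch of importance-driven variable compression: the most
        important objects receive the highest-quality encoding."""
        if obj.importance > 0.8:
            return 128   # highest quality, minimal compression
        return 64 if obj.importance > 0.4 else 32

    objects = [AudioObject("dialogue", importance=0.9),
               AudioObject("effects", importance=0.3)]
    print(television_mix_gains(objects))  # {'dialogue': 6.0, 'effects': -3.0}
    print(bitrate_kbps(objects[1]))       # 32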

This thesis provides a research roadmap for a machine learning investigation of auditory hierarchy, and thus serves two communities. For those from a machine learning background, it introduces perceptual auditory theory and gives insight into how humans perceive sound. For those from an audio background, it provides insight into common machine learning methods and best practices. To begin, perceptual audio research is reviewed and a theory of auditory hierarchy is offered, outlining the factors relevant to hierarchical classification in the context of modern media consumption paradigms. A review of audio machine learning research is then presented, which frames hierarchical prediction as a problem complicated by the subjective nature of the labelling task, distinct from prediction problems such as environmental sound classification, where correct identification of a sound yields an objective label. The nature of auditory hierarchy is then explored through a number of experiments. The machine learning techniques employed are exploratory and provide insight into the performance of common methods, with the intention of illuminating a problem area which has to date received little attention from the machine learning community. It is hoped that the experiments described in this work will thus inform further applications of machine learning methods to auditory hierarchy.

The first experiment described in this work is a perceptual labelling task, which investigates the inherent sound hierarchy within a small corpus of isolated sounds. A subsequent machine learning analysis produces promising results, achieving a foreground recall score of 93.3%, but the size of the dataset used is noted as an issue, highlighting the requirement for a larger corpus of hierarchically labelled sounds. For this reason, Active Learning methods for minimising the manual effort required to label large numbers of experimental stimuli are investigated. It is found that labels can be predicted to a high degree of accuracy (95.5% of the total possible) by selecting just a small percentage (1.7%) of the most informative instances for manual labelling. This method is then used in tandem with data augmentation to build a corpus of 100,000 instances with hierarchical labels. The performance of Support Vector Machine (SVM) and Convolutional Neural Network (CNN) algorithms on a sound hierarchy prediction task, using different feature representations, is then presented.
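
A minimal sketch of the Active Learning idea, using least-confidence uncertainty sampling with a scikit-learn classifier, is given below. The classifier choice, the query budget, and the use of pre-existing pool labels as a stand-in for a human annotator are assumptions for illustration, not the thesis's actual pipeline.

    import numpy as np
    from sklearn.svm import SVC

    def active_learning_loop(X_pool, y_pool, X_seed, y_seed, budget=50):
        """Iteratively query labels for the pool instances the current
        model is least certain about, rather than labelling everything."""
        X_train, y_train = X_seed.copy(), y_seed.copy()
        pool_idx = np.arange(len(X_pool))
        model = SVC(probability=True)
        for _ in range(budget):
            model.fit(X_train, y_train)
            probs = model.predict_proba(X_pool[pool_idx])
            # Least-confidence criterion: query the instance whose highest
            # class probability is lowest.
            query = pool_idx[np.argmin(probs.max(axis=1))]
            # A human annotator would supply this label in practice.
            X_train = np.vstack([X_train, X_pool[query][None]])
            y_train = np.append(y_train, y_pool[query])
            pool_idx = pool_idx[pool_idx != query]
        return model

The remaining pool instances can then be labelled by the trained model, which is how a small fraction of manual labels can be leveraged into a much larger labelled corpus.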

In this case the CNN performs best (82.2% average class accuracy), but it is noted that this is not greatly superior to the performance of an SVM (77.5%) trained on a smaller dataset. This is an interesting result, as it suggests that the manual effort required to label datasets large enough for deep learning algorithms may not be justified for every application.
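
To illustrate why the smaller-data SVM baseline remains attractive, a conventional pipeline of this kind can be assembled in a few lines with scikit-learn. The random feature matrix stands in for per-sound feature vectors (e.g. summarised spectral features); the data, labels and hyperparameters are placeholders, not the experiment reported above.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder data: one 40-dimensional feature vector per sound, with a
    # binary foreground/background hierarchy label.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 40))
    y = rng.integers(0, 2, size=500)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    # Balanced accuracy corresponds to the average class accuracy reported above.
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print(f"mean balanced accuracy: {scores.mean():.3f}")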