Latent Force Models for Sound

William J. Wilkinson | @wil_j_wil

This webpage accompanies the DAFx 2017 submission "Latent Force Models for Sound: Learning Modal Synthesis Parameters and Excitation Functions from Audio Recordings".

Here we present some interactive examples to help demonstrate our work. Below is a synthesis model capable of morphing between two sounds, such as a clarinet and an oboe (Figure 5 in the paper). Use the slider to adjust the morph position, draw the latent input function as you please (mind your ears!), and press the button to play. You can also listen to the original recordings and load the corresponding learnt latent functions.

Use these buttons to switch between models

How it works:

Latent force modelling tracks correlation between signals by assuming they come about as a result of a common input function passing through some input-output process. If this input-output process ("the model") represents the system's physical behaviour, then it's possible to gain an accurate mapping from a high-dimensional set of observed signals to the unobserved (latent) input function. Inclusion of such physical mechanisms has the added bonus of providing us with an intuitive way to interact with the sound and perform resynthesis.

We perform sinusoidal analysis on an audio recording using the SPEAR software, then apply a simple peak-picking algorithm to extract the most prominent modes (often the harmonics) of the signal. The following data represent 6 modes of a clarinet note:

From now on we assume the frequencies to be fixed and focus on modelling the amplitude behaviour. Drawing on our knowledge of the physical production of sound, analysis of real audio data, and existing research on modal sound synthesis, we define a model for how the amplitudes of the $M$ modes behave over time:

$$\frac{dx_i(t)}{dt} + D_i\,\text{Re}\left\{ x_i^\gamma(t) \right\} = S_{i}\,g(u(t)), \hspace{1cm} i=1,\ldots,M$$

where $x_i$ is the amplitude of the $i^\text{th}$ mode, $u$ is the latent input function, and $g(u)=\log(1+e^{u})$ is the "softplus" rectification function that forces the input to be positive. $\gamma$ is a user-defined parameter that alters the "linearity" of the decay of a sinusoidal mode. $S_i$ and $D_i$ are our physical parameters, relating to the stiffness and damping of the $i^\text{th}$ mode.
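To make the amplitude model concrete, here is a minimal sketch that integrates it numerically for a given input $u$. The forward Euler scheme, the step size `dt`, and the names `softplus` and `simulate_modes` are assumptions made for illustration; the page does not say how the equation is discretised.

```python
import numpy as np

def softplus(u):
    # g(u) = log(1 + exp(u)), evaluated in a numerically stable way
    return np.logaddexp(0.0, u)

def simulate_modes(u, S, D, gamma=1.0, dt=1e-3, x0=None):
    """Forward Euler integration of dx_i/dt = S_i g(u) - D_i Re{x_i^gamma}.

    u : (N,) latent input samples; S, D : (M,) per-mode parameters.
    Returns an (N, M) array of mode amplitudes.
    """
    u = np.asarray(u, dtype=float)
    S = np.asarray(S, dtype=float)
    D = np.asarray(D, dtype=float)
    x = np.zeros_like(S) if x0 is None else np.asarray(x0, dtype=float).copy()
    g = softplus(u)
    out = np.empty((u.size, S.size))
    for n in range(u.size):
        out[n] = x
        # Real part of the (possibly complex) power, as in the model equation
        decay = D * np.real(x.astype(complex) ** gamma)
        x = x + dt * (S * g[n] - decay)
    return out
```

With $\gamma = 1$ and a constant input, each mode settles at $S_i\,g(u)/D_i$. With no input, each mode decays at a rate set by its own $D_i$.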

Our aim is to learn the model parameters $\{S_i,D_i\}_{i=1}^M$ from our data whilst simultaneously inferring the latent input behaviour $u$. To do so we constrain the possible values of $u$ by assuming it is the output of a Gaussian process (a type of random process that ensures some level of smoothness by assuming that any finite collection of its values follows a multivariate Gaussian distribution). Having done so, we make an initial guess at our parameters. Given this initial guess, we plug our data into the model and run a Kalman filtering algorithm, which gives us both an estimate of the distribution of $u$ and the overall likelihood of our data given the guessed parameters. The parameters are then iteratively optimised by maximising this likelihood with gradient-based optimisation (equivalently, gradient descent on the negative log-likelihood).
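To give a feel for what the Gaussian process assumption means for $u$, the sketch below draws a few random functions from a GP prior. It is only an illustration: the squared-exponential kernel, its hyperparameters, and the direct Cholesky sampling are assumptions chosen for brevity, whereas the inference described above uses Kalman filtering.

```python
import numpy as np

def se_kernel(t1, t2, variance=1.0, lengthscale=0.1):
    # Squared-exponential covariance (an assumed kernel choice)
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(t, n_samples=3, variance=1.0, lengthscale=0.1, seed=0):
    """Draw n_samples functions from a zero-mean GP prior at the times t."""
    t = np.asarray(t, dtype=float)
    rng = np.random.default_rng(seed)
    K = se_kernel(t, t, variance, lengthscale)
    K = K + 1e-6 * np.eye(t.size)  # small jitter for numerical stability
    L = np.linalg.cholesky(K)
    # If z ~ N(0, I) then L z ~ N(0, K)
    return (L @ rng.standard_normal((t.size, n_samples))).T

t = np.linspace(0.0, 1.0, 100)
u_samples = sample_gp_prior(t)  # shape (3, 100), each row a smooth function
```

A larger `lengthscale` gives slower-varying draws, which is the sense in which the prior limits how quickly $u$ can change.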

Below is a plot of the predicted values of $u$ for the clarinet recording, as well as the rectified function $g(u)$ which is then used as the input to the model to reproduce the observed outputs:

We can see that the reproduced modes capture much of the behaviour present in the original data, but that they are smoothed and some of the variance is not captured. Importantly, a different damping rate has been learnt for each mode, so the modes decay to zero at different rates in the absence of input. This is a feature that could not be learnt with a simpler dimensionality reduction technique such as PCA.

Now that we have learnt our model, we can easily reproduce the sound of the modes by applying the amplitude curves to a set of sinusoids at the appropriate frequencies. This operation can easily be run in real time. We can also alter the sound by replacing the learnt input function with some user-drawn input (as with the interactive example above). Furthermore, we can morph between models learnt from different recordings by pairing up the modes and linearly interpolating the parameters $S_i$ and $D_i$ and logarithmically interpolating the sinusoids' frequency values.
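The resynthesis and morphing steps can be sketched as follows. The function names, the 44.1 kHz sample rate, and the zero initial phases are assumptions made for illustration. The amplitude envelopes are taken to be already sampled at audio rate and the modes already paired between the two models.

```python
import numpy as np

def synthesise(amps, freqs, sr=44100):
    """Sum of sinusoids: amps is (N, M) envelopes, freqs is (M,) in Hz."""
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    t = np.arange(amps.shape[0]) / sr
    # Zero initial phase for every mode (an assumption)
    return np.sum(amps * np.sin(2 * np.pi * freqs[None, :] * t[:, None]), axis=1)

def morph_parameters(S_a, D_a, f_a, S_b, D_b, f_b, alpha):
    """Interpolate between models a (alpha=0) and b (alpha=1)."""
    S_a, D_a, f_a = (np.asarray(v, dtype=float) for v in (S_a, D_a, f_a))
    S_b, D_b, f_b = (np.asarray(v, dtype=float) for v in (S_b, D_b, f_b))
    S = (1 - alpha) * S_a + alpha * S_b  # linear interpolation
    D = (1 - alpha) * D_a + alpha * D_b  # linear interpolation
    # Logarithmic interpolation of frequency: linear in log-frequency
    f = np.exp((1 - alpha) * np.log(f_a) + alpha * np.log(f_b))
    return S, D, f
```

Interpolating in log-frequency means the halfway point between two frequencies is their geometric mean, so the morph moves by equal musical intervals rather than equal numbers of hertz.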