Inheritance Model

A generative model for raw audio that uses a decaying memory as a dynamic context representation.

This project is part of my Master's thesis: "Compositional Approach to Machine Learning for Generating Electroacoustic Music."

The goal is for the model to learn the timbre and instrumentation of the training data (the microstructure) and to be able to infer a different compositional structure (the macrostructure); the sound quality here is secondary.

Visit the GitHub repository to look at the code.

The Model Architecture

The design of the model is inspired by the human auditory system. It consists of two main layers: a memory layer and a motoric layer. Despite the names, they are not programmed to mimic exactly how the human memory and motor systems work.

In the memory layer, information about all previous timesteps is processed using a GRU.

memory layer:



The input to this layer is, unfortunately, not the full-resolution sequence of all previous timesteps: if it were, the input sequence would grow with every timestep and consume too much computer memory. Instead, the history is compressed into a fixed-length memory buffer of n samples whose resolution decays from the most recent samples to the earliest ones. For instance, if we have a sequence of 20 samples containing the values {0, 1, 2, …, 18, 19}, and the length of the buffer is 10 samples, then the memory buffer will contain the values {0, 5, 9, 12, 14, 15, 16, 17, 18, 19}. This behavior is borrowed from transience, one of Schacter's Seven Sins of Memory: a retrieval failure caused by the passage of time, in which the memory is encoded correctly at first but fades as time goes by.
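
To make the decaying resolution concrete, here is a minimal sketch of one indexing rule that reproduces the example above: keep the most recent samples at full resolution and widen the gaps between the oldest slots, one unit at a time, until the buffer spans the whole history. The function name and the exact rule are illustrative assumptions, not necessarily what the repository implements.

```python
def decaying_indices(n_samples, buffer_len):
    """Pick `buffer_len` indices out of `n_samples`, dense near the newest
    sample and increasingly sparse toward the oldest one (transience)."""
    if n_samples <= buffer_len:
        return list(range(n_samples))
    if buffer_len == 1:
        return [n_samples - 1]          # only room for the newest sample
    # gaps[i] is the distance between buffer slot i and slot i + 1,
    # with slot 0 holding the oldest retained sample.
    gaps = [1] * (buffer_len - 1)
    deficit = (n_samples - 1) - sum(gaps)
    # Widen an expanding prefix of the oldest gaps, one unit at a time,
    # so the oldest part of the history gets the coarsest resolution.
    width = 1
    while deficit > 0:
        for i in range(min(width, len(gaps))):
            if deficit == 0:
                break
            gaps[i] += 1
            deficit -= 1
        width += 1
    # Convert the gaps into absolute sample indices, starting at the oldest.
    indices = [0]
    for g in gaps:
        indices.append(indices[-1] + g)
    return indices

print(decaying_indices(20, 10))  # [0, 5, 9, 12, 14, 15, 16, 17, 18, 19]
```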

The motoric layer is where the encoding of the memory is processed to condition the next-sample prediction. This layer runs at a different rate from the memory layer: for every update of the n-sample memory, the motoric layer runs m times. This means that the next m steps of the motoric layer are all conditioned on the same vector from the memory layer. The rate of the motoric layer corresponds to the chunk size mentioned in the previous section. Two GRUs are stacked in this layer, and their summed outputs form the probability distribution over the next sample.
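
Put together, the two layers could look roughly like the PyTorch sketch below. The class name, hidden sizes, the 256-way mu-law output, the way the memory vector is broadcast over the chunk, and the final linear projection are all assumptions on my part; see the repository for the actual code.

```python
import torch
import torch.nn as nn

class InheritanceModel(nn.Module):
    """Sketch: a memory GRU over the decaying buffer conditions a motoric
    stack of two GRUs that predicts the next 8-bit mu-law sample."""

    def __init__(self, hidden=256, quant_levels=256):
        super().__init__()
        self.embed = nn.Embedding(quant_levels, hidden)
        self.memory_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.motoric_rnn1 = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.motoric_rnn2 = nn.GRU(hidden, hidden, batch_first=True)
        self.to_logits = nn.Linear(hidden, quant_levels)

    def forward(self, memory_buffer, chunk):
        # memory_buffer: (batch, n) decaying-resolution history (mu-law codes)
        # chunk:         (batch, m) most recent samples seen by the motoric layer
        _, h_mem = self.memory_rnn(self.embed(memory_buffer))    # (1, batch, hidden)
        # Broadcast the same memory vector over all m motoric steps.
        cond = h_mem[-1].unsqueeze(1).expand(-1, chunk.size(1), -1)
        x = torch.cat([self.embed(chunk), cond], dim=-1)
        out1, _ = self.motoric_rnn1(x)
        out2, _ = self.motoric_rnn2(out1)
        # Sum the two stacked GRU outputs, then project to next-sample logits.
        return self.to_logits(out1 + out2)                       # (batch, m, quant_levels)
```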

motoric layer:



Training Data

The training data is preprocessed to 8-bit (mu-law) audio at 16 kHz due to limited computational power.
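
A minimal preprocessing sketch, assuming librosa for loading and resampling; the exact pipeline used for the thesis may differ.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, mu=255):
    """Load audio at 16 kHz and quantize it to 8-bit mu-law codes (0..255)."""
    audio, _ = librosa.load(path, sr=sr, mono=True)   # float32 in [-1, 1]
    audio = np.clip(audio, -1.0, 1.0)
    # mu-law companding, then uniform quantization to 2^8 levels
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    quantized = ((companded + 1.0) / 2.0 * mu + 0.5).astype(np.int64)
    return quantized
```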

The following are excerpts from the training data:

Senar


Music, Female, Hipster


Note that each audio track lasts 10 minutes and is used to train one example model, which produces one example output.

Output

The models are each set to output 4-minute audio samples. The results then have to be cut to find the right beginning and ending points. Selected output samples from the models trained on:

Senar


Music, Female, Hipster


The results are lightly post-processed with a bit of reverb and equalization.
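
For reference, the generation step boils down to an autoregressive sampling loop. The following is a naive (and slow) sketch that reuses the illustrative decaying_indices and InheritanceModel from above; the buffer and chunk sizes and the silence seed are arbitrary assumptions.

```python
import torch

N_BUFFER, M_CHUNK = 512, 64           # assumed sizes, not from the thesis
TARGET_LEN = 4 * 60 * 16000           # 4 minutes at 16 kHz

model = InheritanceModel().eval()     # sketched in the architecture section
history = [128] * M_CHUNK             # 128 is the mu-law code for silence

with torch.no_grad():
    while len(history) < TARGET_LEN:
        # Compress the full history into the fixed decaying-resolution buffer.
        idx = decaying_indices(len(history), N_BUFFER)
        buffer = torch.tensor([[history[i] for i in idx]])
        chunk = torch.tensor([history[-M_CHUNK:]])
        logits = model(buffer, chunk)[0, -1]          # next-sample distribution
        sample = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        history.append(sample)

# `history` now holds mu-law codes; expand them and write a wav file to listen.
```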

Evaluation of the Outcome

Overall, the system still has a lot of limitations due to limited computational resources.

The system did not learn how to end a composition. That’s why I had to find the right offset to cut the generated sample.

When using the system, the compositional process does not feel the way composing music used to feel: there is no immediate auditory feedback while developing the system.

However, the musical outcome sounds promising. Compositionally, it produces intelligible microstructure and provides a sense of temporal development. Musically, it resembles a nostalgic event.