The design of the model is inspired by the human auditory system. It consists of two main layers: a memory layer and a motoric layer. Despite the names, they are not programmed to mimic exactly how human memory and motor systems work.
In the memory layer, information about all previous timesteps is processed using a GRU.
[Figure: memory layer]
The input to this layer is, unfortunately, not the full resolution of all previous timesteps; if it were, the input sequence would grow with every timestep and consume too much computer memory. Instead, the input is compressed into a fixed n-sample memory buffer whose resolution decays from the most recent samples to the earliest ones. For instance, if we have a sequence of 20 samples containing the values {0, 1, 2, …, 18, 19}, and the length of the buffer is 10 samples, then the memory buffer will contain the values {0, 5, 9, 12, 14, 15, 16, 17, 18, 19}. This behavior is borrowed from one of Schacter's Seven Sins of Memory, transience: a retrieval failure that happens with the passage of time, where the memory was encoded correctly at first but fades over time.
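The text does not spell out how the buffer indices are chosen, so the following is a minimal Python sketch of one schedule consistent with the example above: unit strides near the most recent samples, with the stride growing by one per step toward the oldest. The helper name buffer_indices is hypothetical, not the actual implementation.

```python
def buffer_indices(t, n):
    """Pick n of t sample indices: dense near the present, sparse near 0.

    buffer_indices(20, 10) -> [0, 5, 9, 12, 14, 15, 16, 17, 18, 19],
    matching the example in the text.
    """
    if t <= n:
        return list(range(t))
    idx, pos = [t - 1], t - 1
    for r in range(n - 1, 0, -1):   # backward strides still to take
        if r == 1:
            s = pos                 # final stride lands exactly on index 0
        else:
            # smallest stride s such that the remaining r-1 strides,
            # each growing by at most one per step, can still reach 0
            s = 1
            while pos - s > (r - 1) * s + r * (r - 1) // 2:
                s += 1
        pos -= s
        idx.append(pos)
    return idx[::-1]
```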
The motoric layer is where the encoding of the memory is processed to condition the next-sample prediction. This layer runs at a different rate from the memory layer: for every n-sample memory, the motoric layer runs m times. This means the next m steps of the motoric layer are all conditioned on the same vector from the memory layer. The rate of the motoric layer corresponds to the chunk size mentioned in the previous section. Two GRUs are stacked in this layer, and the sum of both cells' outputs represents the probability distribution of the next sample.
[Figure: motoric layer]
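To make the two-rate interaction concrete, here is a hedged sketch in PyTorch (an assumption; the text does not name the framework). The layer sizes, the sample embedding, and conditioning by concatenation are all my guesses; the text only specifies two stacked GRUs whose summed output gives the distribution over the next sample.

```python
import torch
import torch.nn as nn

class MotoricLayer(nn.Module):
    """Sketch of the motoric layer: for each memory vector, run m
    sample-level steps, each conditioned on that same vector."""

    def __init__(self, n_classes=256, emb_dim=64, hidden=256, mem_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.gru1 = nn.GRUCell(emb_dim + mem_dim, hidden)
        self.gru2 = nn.GRUCell(hidden, hidden)
        self.proj1 = nn.Linear(hidden, n_classes)  # one head per cell,
        self.proj2 = nn.Linear(hidden, n_classes)  # summed at the output

    def forward(self, chunk, memory_vec, h1, h2):
        # chunk: (batch, m) int64 mu-law codes for one chunk
        # memory_vec: (batch, mem_dim) encoding from the memory layer
        logits = []
        for i in range(chunk.size(1)):
            x = torch.cat([self.embed(chunk[:, i]), memory_vec], dim=-1)
            h1 = self.gru1(x, h1)
            h2 = self.gru2(h1, h2)
            # "the sum of both cells" as logits over the next sample
            logits.append(self.proj1(h1) + self.proj2(h2))
        return torch.stack(logits, dim=1), h1, h2
```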
The training data is preprocessed to 8-bit (mu-law) and 16 kHz due to limited computational power.
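For reference, mu-law companding for 8-bit quantization follows a standard formula; below is a NumPy sketch, assuming the audio is already normalized to [-1, 1]. The helper names are illustrative, and resampling to 16 kHz is left to whatever audio tooling the pipeline uses.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Float audio in [-1, 1] -> 8-bit mu-law codes in 0..255."""
    x = np.clip(audio, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.uint8)

def mu_law_decode(codes, mu=255):
    """Invert the companding back to float audio in [-1, 1]."""
    y = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```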
The following are excerpts from the training data:
Senar
Music, Female, Hipster
Note that each audio track lasts 10 minutes and is used to train one example model, which produces one example output.
Overall, the system still has many limitations due to constrained computational resources.
The system did not learn how to end a composition, which is why I had to find the right offset at which to cut the generated sample.
When using the system, the compositional process does not feel like composing music traditionally does: there is no immediate auditory feedback while the system is being developed.
However, the musical outcome sounds promising. Compositionally, it produces intelligible microstructure and provides a sense of temporal development. Musically, it resembles a nostalgic event.