One of the difficult problems in using machine learning to generate sequences, such as melodies, is creating long-term structure. Long-term structure comes very naturally to people, but it’s very hard for machines. Basic machine learning systems can generate a short melody that stays in key, but they have trouble generating a longer melody that follows a chord progression, or follows a multi-bar song structure of verses and choruses. Likewise, they can produce a screenplay with grammatically correct sentences, but not one with a compelling plot line. Without long-term structure, the content produced by recurrent neural networks (RNNs) often seems wandering and random.
But what if these RNN models could recognize and reproduce longer-term structure? Could they produce content that feels more meaningful – more human? Today we’re open-sourcing two new Magenta models, Lookback RNN and Attention RNN, both of which aim to improve RNNs’ ability to learn longer-term structures. We hope you’ll join us in exploring how they might produce better songs and stories.
Lookback RNN introduces custom inputs and labels. The custom inputs allow the model to more easily recognize patterns that occur across 1 and 2 bars. They also help the model recognize patterns related to where in the measure an event occurs. The custom labels make it easier for the model to repeat sequences of notes without having to store them in the RNN’s cell state. The type of RNN cell used in this model is an LSTM.
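One way to picture the custom labels is as two extra classes appended to the ordinary event vocabulary: a "repeat what happened 1 bar ago" label and a "repeat what happened 2 bars ago" label, so the model can echo an established melody with a single cheap prediction instead of regenerating every note. The sketch below illustrates that idea; the names, sizes, and the precedence between the two repeat labels are our assumptions, not Magenta's actual implementation.

```python
# Hypothetical sketch of the Lookback RNN's custom labels: two extra
# classes after the ordinary event labels let the model say "repeat"
# instead of re-predicting every note of an echoed phrase.

NUM_EVENTS = 40        # note-off, no-event, and one note-on per pitch (illustrative)
STEPS_PER_BAR = 16     # sixteenth-note steps per 4/4 measure (illustrative)
REPEAT_2_BARS = NUM_EVENTS        # special label: copy the event from 2 bars ago
REPEAT_1_BAR = NUM_EVENTS + 1     # special label: copy the event from 1 bar ago

def lookback_label(events, step):
    """Training label for events[step]: prefer a 'repeat' label when the
    target simply echoes the event from 1 or 2 bars earlier."""
    target = events[step]
    if step >= 2 * STEPS_PER_BAR and target == events[step - 2 * STEPS_PER_BAR]:
        return REPEAT_2_BARS
    if step >= STEPS_PER_BAR and target == events[step - STEPS_PER_BAR]:
        return REPEAT_1_BAR
    return target       # a genuinely new event keeps its ordinary label
```

Because repeated phrases collapse into these special labels, the network does not need to hold the whole repeated sequence in its LSTM cell state, which is the benefit described above.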
In our introductory model, Basic RNN, the input to the model was a one-hot vector of the previous event, and the label was the target next event. The possible events were note-off (turn off any currently playing note), no event (if a note is playing, continue sustaining it, otherwise continue silence), and a note-on event for each pitch (which also turns off any other note that might be playing). In Lookback RNN, we add the following additional information to the input vector:
- In addition to the previous event, we also input the events from 1 and 2 bars ago. This allows the model to more easily recognize patterns that occur across 1 and 2 bars, such as mirrored or contrasting melodies.
- We also input whether the last event was repeating the event from 1 or 2 bars before it. This signals whether the last event was creating something new or just repeating an already established melody, which allows the model to more easily recognize patterns associated with being in a repetitive or non-repetitive state.
- We also input the current position within the measure (as done previously by Daniel Johnson), allowing the model to more easily learn patterns associated with music in 4/4 time. These inputs are 5 values that can be thought of as a binary step clock.
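The pieces above can be sketched as a single function that assembles the Lookback input vector: the one-hot previous event from Basic RNN, the one-hot events from 1 and 2 bars ago, the two repeat flags, and the step clock. Everything here is an illustrative assumption rather than Magenta's actual code: the pitch range, sixteenth-note steps, 0/1 clock bits, and a clock that counts over a two-bar (32-step) cycle so that 5 bits suffice.

```python
# Hypothetical sketch of assembling the Lookback RNN input vector.
# Sizes and encodings are illustrative assumptions, not Magenta's code.

NUM_PITCHES = 38                 # illustrative pitch range
NUM_EVENTS = 2 + NUM_PITCHES     # note-off, no-event, one note-on per pitch
STEPS_PER_BAR = 16               # sixteenth-note steps per 4/4 measure

def one_hot(index, size):
    """Basic RNN input: a one-hot vector over the event vocabulary."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def lookback_input(events, step):
    """Input vector for predicting the event at `step`, given the
    event indices generated so far in `events`."""
    zeros = [0.0] * NUM_EVENTS
    prev = one_hot(events[step - 1], NUM_EVENTS)
    # Events from 1 and 2 bars ago (zeros when the melody is too short).
    bar_1 = (one_hot(events[step - STEPS_PER_BAR], NUM_EVENTS)
             if step >= STEPS_PER_BAR else zeros)
    bar_2 = (one_hot(events[step - 2 * STEPS_PER_BAR], NUM_EVENTS)
             if step >= 2 * STEPS_PER_BAR else zeros)
    # Flags: was the last event a repeat of the event 1 or 2 bars before it?
    rep_1 = float(step > STEPS_PER_BAR and
                  events[step - 1] == events[step - 1 - STEPS_PER_BAR])
    rep_2 = float(step > 2 * STEPS_PER_BAR and
                  events[step - 1] == events[step - 1 - 2 * STEPS_PER_BAR])
    # Position as a 5-bit binary step clock, most significant bit first.
    pos = step % (2 * STEPS_PER_BAR)
    clock = [float((pos >> b) & 1) for b in range(4, -1, -1)]
    return prev + bar_1 + bar_2 + [rep_1, rep_2] + clock
```

Concatenating these pieces gives the model direct access to bar-aligned history at every step, rather than forcing the LSTM to carry that context in its cell state.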