Final Project Update


Our goal in this update was to see whether we could produce coherent music using GANs and to create an image encoding that captures polyphonic music. We learned that sparse data, such as the one-hot vector encoding we were using, makes poor input for GANs and produces very poor results.


For the second part of this project we wanted to continue exploring image encodings of music data. Previously, we used a monophonic one-hot encoding to convert the MusicXML files, but this time we also explored a polyphonic encoding. Polyphonic encoding allows the generated music to contain multiple simultaneous sounds: monophonic encoding can represent a simple melody, while polyphonic encoding can represent songs with chords or bass lines accompanying the melody. Additionally, we expanded our previous work with a new method of music generation. In the last update, we used Markov chains to produce new music with fairly good results; this time we decided to explore GANs.

Because the GANs we use are built on convolutional neural networks, they can model relationships between pixels that Markov chains cannot: Markov chains only look at previous states to determine the next note, while GANs can discover functional relationships across an entire set of notes.


New Polyphonic Encoding: Our previous encoding scheme represented a song as an image where columns represented discrete notes and rows represented pitch, with one row dedicated to encoding rests. Each column was one-hot, representing one note at one pitch at one time interval, and the color of the single non-black pixel indicated the length of the note. This approach works well as a simple encoding scheme, but it has one crucial downside: it cannot represent multiple notes played at the same time, which occurs frequently in music through chords or multiple instruments harmonizing.

To address this problem, we developed an alternative "polyphonic" (multi-note) encoding scheme to represent a song visually. Here's a sample song:

In this encoding, rows still represent pitch. The major change is the function of each column: in our previous encoding, each column contained a single note, with the pixel color representing that note's length. In the new encoding, each column represents a single sixteenth note. A horizontal line of yellow pixels represents a note held for multiple timesteps: two pixels represent an eighth note and four a quarter note. A red pixel marks the end of a note. Since each column now spans a uniform slice of time, we can represent multiple concurrent notes; notice how, within a given column, multiple pitches are playing simultaneously. This also gives each song a more interesting and less constrained visual character, which helps the generative CV approaches we use later.
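To make the scheme concrete, here is a minimal sketch of the polyphonic encoder. The names, the 25-row pitch range, and the convention that a note's final sixteenth is the red end marker are our illustrative assumptions, not necessarily the exact code in our pipeline:

```python
import numpy as np

N_PITCHES = 25          # two-octave pitch range (assumed size)
YELLOW = (255, 255, 0)  # note is sounding and held
RED = (255, 0, 0)       # final sixteenth of a note (end marker)

def encode_polyphonic(notes, n_steps):
    """Encode (pitch_row, start_step, duration_in_16ths) tuples as an image.

    Each column is one sixteenth note; concurrent notes simply occupy
    multiple rows within the same column.
    """
    img = np.zeros((N_PITCHES, n_steps, 3), dtype=np.uint8)
    for pitch, start, dur in notes:
        for t in range(start, start + dur - 1):   # held portion: yellow
            img[pitch, t] = YELLOW
        img[pitch, start + dur - 1] = RED         # end-of-note marker
    return img

# A quarter-note (4 sixteenths) and a concurrent eighth-note (2 sixteenths).
img = encode_polyphonic([(0, 0, 4), (4, 0, 2)], n_steps=8)
```

Decoding reverses the process: scan each row for runs of yellow pixels terminated by a red pixel and emit a note whose duration is the run length in sixteenths.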


GANs are an unsupervised machine learning technique consisting of two neural networks. One network, the generator, generates data, while the other, the discriminator, decides whether a given sample comes from the training data or from the generator.
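Formally, the generator G and discriminator D play the minimax game introduced by Goodfellow et al.:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

Here D(x) is the discriminator's estimated probability that x came from the training data, and G(z) maps a noise vector z drawn from the prior p_z to a generated image; training alternates gradient steps on D and G.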

GANs have had many interesting uses in the past few years, from transferring a painter's style onto an image to generating realistic headshots of people. In the same spirit of using existing images to create novel ones, we hoped that GANs would help us create novel images based on the encoded music images they trained on.

We used Deep Convolutional Generative Adversarial Networks, or DCGANs, to create our images. DCGANs are most famous for generating photorealistic images of faces. They are also notable for using convolutional neural networks (CNNs), which have become a popular approach for computer vision and deep learning problems.

We hoped that by training a DCGAN on a large dataset of input images, the generator would approximate the true data distribution and produce new images similar to our inputs. We could then discretize and decode the resulting images to get a generated song.
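A minimal sketch of the discretization step, assuming a simplified monophonic palette of black, blue, and red (intensity variations omitted); the function simply snaps each generated pixel to its nearest valid color:

```python
import numpy as np

# Hypothetical palette: black background, blue notes, red rests.
PALETTE = np.array([[0, 0, 0], [0, 0, 255], [255, 0, 0]], dtype=float)

def discretize(img):
    """Snap each pixel of a generated image to its nearest palette color."""
    h, w, _ = img.shape
    flat = img.reshape(-1, 1, 3).astype(float)
    dists = np.linalg.norm(flat - PALETTE[None, :, :], axis=2)  # (h*w, 3)
    nearest = dists.argmin(axis=1)
    return PALETTE[nearest].reshape(h, w, 3).astype(np.uint8)

noisy = np.array([[[10, 5, 20], [30, 20, 240]]], dtype=np.uint8)  # 1x2 image
clean = discretize(noisy)   # near-black pixel -> black, bluish pixel -> blue
```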

Experiment and Results

We trained several DCGANs and experimented with varying the image encoding scheme and network sizes in the generator and discriminator.

Our data consisted of 800 classical songs downloaded from MuseScore, a site where users upload transcriptions of music. These songs were converted with both the monophonic and polyphonic encoding schemes. The image representations were then tiled into 24 x 24 images, as most open-source implementations of GANs are built to support square images.
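The tiling step can be sketched as follows; `tile_image` is a hypothetical helper, and we assume the encoding is already 24 rows tall:

```python
import numpy as np

def tile_image(song_img, size=24):
    """Cut a (size x W) song image into consecutive (size x size) tiles.

    Trailing columns that don't fill a complete tile are dropped.
    """
    n_tiles = song_img.shape[1] // size
    return [song_img[:, i * size:(i + 1) * size] for i in range(n_tiles)]

song = np.zeros((24, 100, 3), dtype=np.uint8)
tiles = tile_image(song)   # 4 complete tiles; the last 4 columns are dropped
```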

GIF of GAN training

For the monophonic encoding scheme, we generated approximately 10,000 training images. Training the DCGAN on these images converged fairly quickly, after about 2,000 iterations. The DCGAN picked up that the primary colors in the input images were black, red, and blue, and the generated images consisted mainly of these colors. However, the training images were extremely sparse, since each column was a one-hot vector representing a sequence of notes and rests. The GAN failed to replicate this sparsity, and the outputs contain a lot of noise. In addition, tweaking the network sizes in the generator and discriminator did not significantly improve the outputs. Thus, while the GAN somewhat successfully mimicked the color distribution of the input images, it failed to replicate the underlying image structure, namely that only one note can be played at a time.

We also experimented with training the DCGAN on images created via the polyphonic encoding scheme. Here, the training dataset consisted of 24,000 images. The loss functions of the generator and discriminator took more iterations to converge in this case, likely because the polyphonic images were more feature-rich than the monophonic ones; the larger number of training images likely contributed as well. The GAN also failed to pick up on some patterns in the input images. For example, in our training images, rows of yellow pixels are always terminated by a red pixel, but this is not evident in the generated images.

For both sets of training data, we ran into two issues: failure to converge and mode collapse.

The first issue occurs when the GAN reaches a Nash equilibrium and, as a result, fails to converge further: the actions of the discriminator or generator no longer matter because the outcome no longer changes. Mode collapse occurs when only a few modes of the data, or in our case a single mode, are generated; the generator learns to continuously fool the discriminator by focusing on that one mode. Within the context of our data, mode collapse meant that, despite additional training iterations, the exact same output was generated.

It is also likely that our GAN results were poor because GANs tend to work well on natural images. Research projects with GANs often involve images taken with a camera, such as photos of people or landscapes. Our input images are not natural, and our encoding and decoding methods impose many assumptions and rules on the images that the GAN was unable to pick up on.

In addition, we trained the GANs on 24 x 24 images because it was most straightforward to run the convolutional networks in the generator and discriminator on square inputs. However, better results might have been achieved by training on longer sequences, for example by modifying our GAN to take in 24 x 48 or 24 x 96 images.

Overall, our GAN results were poor due to issues with mode collapse and the rigidity of our image encoding methods.

Conclusion and Future Work

Overall, we built two encoding schemes for MusicXML files, one monophonic and one polyphonic. After building a corpus of thousands of input images for each encoding, we fed the inputs into GANs and variable-order Markov models to generate outputs that we hoped would share the style of the inputs. We had varying results, which you can listen to through the SoundCloud links, but overall the project provided a great opportunity to learn about various image processing techniques, specifically in the field of image generation.

We had fun experimenting with different models to create music. To further explore this problem, it would be interesting to see whether other deep learning approaches like variational autoencoders or PixelCNN would produce better results. It would also be interesting to explore other input forms and see if, for example, a Char-RNN that treats notes as characters performs better or worse, though that approach would not fit this project as well, since it would take in music embedded as text and no images would be involved.


Midterm Project Update

By Ehsan Asdar (easdar3), Ashwini Iyer (aiyer65), Matthew Kaufer (mkaufer3), Nidhi Palwayi (spalwayi3), and Kexin Zhang (kzhang323)


Our project focuses on music generation through image processing techniques. We developed a method for encoding songs in images and trained variable order Markov models on these image representations. Our experiments produced the best results when the model was trained on a few songs with a distinct style.

Sample of song embedded in image


Novel music generation is a large area of active research. For this project, we wanted to experiment with the generative computer vision approaches we have been learning in this course (along with more advanced deep-learning-based CV algorithms) to see if they can also serve as effective methods of generating music. To apply these CV approaches, we created an experimentation pipeline that converts input MusicXML files into images, generates new songs from these image-encoded inputs, and then converts the output images back into audio for playback. The generated music is evaluated by its ability to sustain musical patterns throughout the song.


Generative music has several creative applications. Producers and musicians can employ generative techniques to synthesize new music samples and develop new songs based on similar work. For example, given a trained model, one can generate different segments of music given previous note histories.

Existing Approaches

Some existing works that we researched include WaveNet, projects from Google's Magenta, and CycleGAN audio. While our idea is not unique, our project is still meaningful, since we plan to explore image representations of songs and experiment with a variety of techniques for music generation.


To generate music, we developed the following procedures:

  1. A program to download MusicXML files in bulk
  2. A program to monophonically encode songs as images
  3. A pipeline to train variable order Markov models on given image representations of songs
  4. A pipeline to generate song images of variable length from a given variable order Markov model, mimicking texture synthesis
  5. A process to convert the generated images to audio

Collecting Data

For our input data, we wrote a program to search and download MusicXML songs in bulk from the MuseScore API. We downloaded a total of 800 songs that were composed for a solo pianist. Our song corpus included traditional classical music, movie scores from John Williams, and songs from Undertale, a video game with a distinctive soundtrack.

Image Representation

First, we developed an image representation for songs. The idea is that songs can be encoded as images where the columns represent time and the rows represent notes.

We considered two formats for the input audio: MIDI and MusicXML files. MIDI files encode timing as millisecond offsets between notes, while MusicXML preserves each note's musical value: for instance, MIDI might say a note lasted 100 milliseconds, while MusicXML would say the note was a quarter note. Both represent pitch as an integer.
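To make the contrast concrete, here is a hedged sketch of the quantization that MIDI would force on us: snapping a duration in milliseconds to the nearest common note value at a given tempo (`NOTE_VALUES` and `quantize` are illustrative names, not code from our pipeline):

```python
# Note values measured in quarter-note beats.
NOTE_VALUES = {"sixteenth": 0.25, "eighth": 0.5, "quarter": 1.0,
               "half": 2.0, "whole": 4.0}

def quantize(duration_ms, bpm):
    """Snap a millisecond duration to the nearest named note value."""
    beat_ms = 60_000 / bpm                  # length of one quarter note in ms
    beats = duration_ms / beat_ms
    return min(NOTE_VALUES, key=lambda name: abs(NOTE_VALUES[name] - beats))

# At 120 bpm a quarter note lasts 500 ms, so 480 ms snaps to a quarter note.
value = quantize(480, 120)
```

MusicXML sidesteps this entirely, since the note value is stored directly.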

Although we initially planned to use MIDI files, we found that they did not let us easily distinguish between different note durations (quarter notes, half notes, etc.). With MIDI files, we would need to manually quantize note durations, leading to a large state space for the Markov chain and the potential for quantization errors. MusicXML files, however, carry information about the notes themselves, which makes the image representation more compact and accurate. The rows of the image encode the note, over a range of two octaves, while the width of the image represents the temporal sequence of notes in the song's melody. Each column is a one-hot vector, and we used the RGB channels of the "on" pixels to represent the type and duration of the note. In particular, blue pixels represent notes and red pixels represent rests, and the intensity of the blue or red channel is scaled to represent the duration of the note.
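A minimal sketch of the monophonic encoder; the row count, the dedicated rest row, and the linear intensity scaling are illustrative assumptions about the scheme, not the exact code we ran:

```python
import numpy as np

N_ROWS = 25     # two octaves of pitches plus one row for rests (assumption)
REST_ROW = 24

def encode_monophonic(events, max_duration=4.0):
    """Encode a melody as an image with one one-hot column per note.

    `events` is a list of (pitch_row_or_None, duration_in_beats); None means
    a rest. Blue encodes notes, red encodes rests, and the channel intensity
    scales linearly with duration.
    """
    img = np.zeros((N_ROWS, len(events), 3), dtype=np.uint8)
    for col, (pitch, duration) in enumerate(events):
        intensity = int(255 * min(duration, max_duration) / max_duration)
        if pitch is None:
            img[REST_ROW, col] = (intensity, 0, 0)   # red rest
        else:
            img[pitch, col] = (0, 0, intensity)      # blue note
    return img

melody = encode_monophonic([(0, 1.0), (4, 2.0), (None, 1.0)])
```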

One other thing to note is that all input songs were normalized to the key of C: if songs of different keys were fed into the Markov model, transitions between notes could fall out of key and sound dissonant.
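Normalization can be sketched as a simple pitch shift by the interval between the song's key and C (key detection is omitted, and the hypothetical `KEY_OFFSETS` table covers only the natural major keys for brevity):

```python
# Semitone offsets of each major key's tonic above C.
KEY_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def transpose_to_c(pitches, key):
    """Shift MIDI pitch numbers down by the key's offset from C."""
    return [p - KEY_OFFSETS[key] for p in pitches]

# A G-major triad (G4, B4, D5) becomes a C-major triad (C4, E4, G4).
triad = transpose_to_c([67, 71, 74], "G")
```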

Training and Running Markov Models

With our image representations of songs, we can treat sequences of notes in the image as textures and use them as inputs for a Markov chain, a common texture synthesis technique. We figured this would be a good approach for generative music, since songs often have patterns and repetition that a Markov model can represent well. We opted for a variable-order model instead of a classical n-order Markov model, since patterns in music usually rely on some amount of musical context. The variable-order model affords more flexibility: the number of past states considered can vary up to a maximal order, which is useful because the number of past notes worth considering varies throughout a piece of music and is not fixed.

We used the Python vomm library to train variable-order Markov models and generate new song images from the inputs. The inputs are the columns of the song images, and the model predicts next values, which are also image columns. These predicted columns form the image representation of the output song, which can then be decoded to construct a MIDI file.
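The idea behind the variable-order model can be sketched with a simplified stand-in for vomm: count next-symbol frequencies for every context up to a maximal order, then predict by backing off from the longest matching context. Here symbols are characters for readability; in our pipeline they would be image columns:

```python
from collections import Counter, defaultdict

def train_vomm(sequence, max_order=3):
    """Count next-symbol frequencies for every context up to max_order."""
    counts = defaultdict(Counter)
    for i in range(len(sequence)):
        for order in range(1, max_order + 1):
            if i - order >= 0:
                context = tuple(sequence[i - order:i])
                counts[context][sequence[i]] += 1
    return counts

def predict(counts, history, max_order=3):
    """Predict the next symbol, backing off from the longest seen context."""
    for order in range(min(max_order, len(history)), 0, -1):
        context = tuple(history[-order:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

model = train_vomm(list("abcabcabd"))
next_symbol = predict(model, list("ab"))   # "ab" is usually followed by "c"
```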

Experiments and Results

We tested our generation pipeline on several training sets. On smaller training sets of songs with distinct, repetitive phrases, the variable-order Markov model performed well: the generated pieces tended to audibly mimic motifs from the input pieces. For example, when we trained a model with a variable order of five on a small set of Star Wars soundtrack songs, the Markov model generated music that incorporated common melodic lines from the inputs. With a small set of songs from Undertale, the model produced similar results.

However, when we used larger training datasets, the results were poor. For example, when we trained models on sets of 20 and 100 classical songs, the generated music sounded disjointed and mangled.

This makes sense: a smaller set of distinct songs has more repeated patterns and a more recognizable motif that the Markov chain can represent well. With larger training sets, there are fewer small common patterns for a listener to pick out, and since the Markov model can only look so far back in time, it is unable to capture the overall essence of all the pieces.

Conclusion and Future Work

Our next steps include experimenting with other approaches, such as GANs or recurrent neural networks, and comparing the results to the Markov model results. In addition, we plan to add improvements to our current approach. Notably, our method for encoding music into images can only represent pieces monophonically, which means that harmony notes and underlying chords are discarded. We plan to develop and implement an alternative encoding scheme that supports polyphonic pieces, which may yield more interesting results.

Overall, our work so far has shown that we can effectively represent songs as images, and that Markov models can produce fairly good results on small, distinctive sets of songs.


carykh. "3 Neural Nets Battle to Produce the Best Jazz Music." YouTube, 7 June 2017.

Project Proposal

By Ehsan Asdar (easdar3), Ashwini Iyer (aiyer65), Matthew Kaufer (mkaufer3), Nidhi Palwayi (spalwayi3), and Kexin Zhang (kzhang323)

Problem Statement

Generative music is a very interesting field – who doesn’t want to hear a new Mozart composition? We propose two methods of generating music through image generation, Markov Models and GANs, and will compare their results.

Think of a piece of music: it has notes that vary in pitch, and each note has a quantizable start and end time. Consider embedding a musical piece with three instruments into an image. The columns of the image can represent temporal moments, while the rows can represent pitches. The RGB channels can each represent an individual instrument (you could splurge and do four instruments if you used the alpha channel too).
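A minimal sketch of that embedding, with the dimensions and names as hypothetical assumptions:

```python
import numpy as np

N_PITCHES, N_STEPS = 48, 64  # hypothetical image dimensions

def embed_three_instruments(tracks):
    """Embed three instruments into one image, one RGB channel each.

    `tracks` is a list of three lists of (pitch_row, time_col) events;
    channel 0 (red) is the first instrument, and so on.
    """
    img = np.zeros((N_PITCHES, N_STEPS, 3), dtype=np.uint8)
    for channel, events in enumerate(tracks):
        for pitch, t in events:
            img[pitch, t, channel] = 255
    return img

# Three instruments each playing one note at time 0.
img = embed_three_instruments([[(10, 0)], [(14, 0)], [(17, 0)]])
```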

Now that we have some means of embedding music as images, we can treat the lines of notes in the images as textures.

Since we know something about texture generation, we can apply these methods to generating music. This is nice, since Markov Models are inherently repetitive, and could create some of the repetition that music entails.

However, Markov Models are so 19xx: anyone who's anyone uses deep learning. Here, we will use both Markov Models and GANs to generate music embedded into images, and then compare the results.


  • Download lots of classical MIDI files
    • Transpose them into the same key, for the purposes of the Markov Models
    • Pieces should only have a single melody and an underlying chord progression
  • Create a program to convert these MIDI files to images
  • Train several Markov Models of different resolutions from the input MIDI images
  • Train a GAN from the input MIDI images
  • Convert the outputs of the Markov Models and the GAN from image back to MIDI
  • Listen to the MIDI files for a qualitative analysis
  • Look at the tonality of pieces and the beat distribution to analyze musicality of generated pieces

Experiments and Results

The experimentation involves generating multiple Markov Models of different resolutions, as well as architecting and training a GAN. We will also experiment with the length of the generated output to see if longer musical phrases fall apart more quickly (playing with the curse of dimensionality).

We’ll use a dataset of classical music written by Antonio Vivaldi in 4/4, since his music has a distinctive style and composition that will allow us to draw parallels between the generated and original music (more on our evaluation criteria later).

The code that will be written is:

  • A converter from MIDI to our image encoding
  • A converter from our image encoding back to MIDI
  • A way to train Markov Models with these image representations of MIDI
  • A way to generate images from these Markov Models
  • A GAN to be trained with these image representations of MIDI
  • Analysis software to calculate a generated song’s
    • Tonality
    • Contour
    • Contour similarity across measures
    • Empty space

Some code we’re not going to write from scratch:

  • TensorFlow- or PyTorch-based implementation of GANs

Experiments will involve

  • Generating music through the various Markov Models
    • Varying the size of the Markov Models
    • Varying the length of the images generated
  • Generating music through the GAN
    • Varying the size of the images generated/discriminated by the GAN

Our evaluation criteria involves measuring stylistic similarity of the generated music to that of the original classical music dataset. We’ll write code that compares music on the metrics of tonality, contour, and consistency of time signature. We will also qualitatively analyze the outputs.


  • To get the MIDI files we will use these websites: and


  • This YouTuber trained different deep learning networks to create jazz music: