CS 184: Computer Graphics and Imaging, Spring 2018

Final Project: Music Visualizer

Final Report

Jarry Xiao, Jacky Lu, Lucy Chen


Abstract

For our final project, we created a real-time music visualizer, which extracts various elements of a particular song and displays them in a visually appealing way. We focused primarily on representing the strength of the constituent frequencies and overall volume of the left and right channels for a particular input audio signal, and we created a couple different visualizations to do so. Our first visualization was a particle simulator, which we adapted from an existing fire simulation. For this visual, we chose to represent the constituent frequencies of an input audio signal as a set of fire pillars (thirty-two per channel) where frequencies increase from the center outward. The width of each fire pillar varies with the magnitudes of the frequencies it contains, while its height corresponds to the volume of that range of frequencies. The hue of each flame also varies; however, this depends on how the frequency bins are distributed amongst the fire pillars. Additionally, we made a blob visualization which utilizes displacement mapping to represent these elements. For this visual, we rendered a sphere object and map the discrete fourier transform onto the sphere by scaling each vertex along its normal. We use a variety of schemes to map the 128 length fourier transform onto a sphere with varying number of vertices, creating interesting movements and shapes in three dimensions.

Technical Approach

It's natural to think about music in terms of beats and frequencies, but to visualize how these musical components change over time, it can be useful to think about music across both the time and frequency domains. The Fourier transform is a mathematical operation that is based on the premise that any real (or complex) valued function can be represented as a linear combination of sinusoids. The Fourier transform can transform can "extract" the coefficients of the sinusoids that compose any real-valued function (often times this function is referred to as a "time-domain" signal). Each coefficient can be represented as:

$H(\omega) = \int_{-\infty}^\infty f(t)e^{-i\omega t} dt$ where $f(t)$ represents the time domain signal.

However, because we are working with digital music files, we use the Discrete Fourier Transform instead (this means we no longer need an infinite number of sinuouds to represent our time domain signal). Due to a breakthrough in computation, we can compute the DFT of a discrete time-domain signal in $O(N\log N)$, where $N$ represents the length of the signal, using the Fast Fourier Transform. For a time domain signal of length $N$, the $kth$ coefficient can be expressed as:

$X[k] = \sum_{n=0}^{N-1} x[n]e^{\frac{2\pi ki}{N}n}$.

Instead of storing the coefficients directly in an array, what is often done (as was done in our project) was to store the magnitudes of each coefficient $|X[k]|$. This is slightly more space efficient because it can be shown that $X[k]$ and $X[N-1-k]$ are complex conjugates, which further implies that they have the same magnitude. Therefore, you only need to store 50% of the actual computed magnitudes if you don't care about the complex component of each coefficient. In our project, we use these magnitudes to determine the state of the music that is playing.

To compute the DFT for a mp3 files, we opted to use Javascript's WebAudioAPI. This package essentially allows you to create a dependency graph for audio processing. One of our first ideas for music visualization was to be able to see the effect of stereo in different songs. Therefore, we use what is called a Splitter node to examine both the left and right channels of the mp3 file. Then we create a ScriptNode which streams in the song one chunk at a time. We then run an FFT via an AnalyserNode on the raw data from both channels of this audio chunk and pass that data on to our visualizer. In the process, we also implemented a functionality to play/pause and stop the current song. We do this by keeping track of the relative time at which the track was last paused and seeking to that point in the track if play is ever called again. Additionally, we compute the average volume of each audio channel by computing the average of the magnitudes output by the FFT for that channel, giving us a concrete value we can use later in our visualization. Thus, our process combines elements of both the time and frequency domain for dynamic visualization of audio.

In order to visualize these effects, we created two different representations: a fire particle simulation and blob visualization.

Particle Simulation:
Our particle-based visualization is partially inspired by a fire simulator we discovered online (linked in the references). The original fire simulation generated a single flame based on a variety of factors, of which we retained the variance in the initial positions of the particles as well as the base speed at which they travel and die out. The particles are removed when they have reached certain threshold conditions. Because our goal was to visualize the processed frequency data, we made a number of major alterations to the code. Instead of a singular flame, we split the particle generators into sixty-four separate points spread evenly across the screen (thirty-two per channel) to visualize the frequency data. Each pillar corresponds to a particular range of frequencies in our DFT chunk. For each of these 64 locations, we generate around 30 particles from the base of that screen, and also give each particle an initial velocity. In order to maintain a blurring effect, we make this velocity a random variable, where the variance is a function of channel's volume. Additionally, the radius of each particle at some bin is a direct function of the magnitude of the corresponding frequency. As a result the fire pillar corresponding to a dominant frequency range and audio channel will be both taller and wider as a result. One other feature that is influeced by magnitude is the height of the pillar–as the magnitude increases, the longer the particles will survive, which means that the pillar will appear taller. The way we implement this is by changing the color transparency over time as a function of magnitude. When a particle is completely transparent it is marked for removal. Because particles corresponding to high magnitudes stay opaque for longer, their pillars appear taller. This is interesting from a frequency analysis perspective because it indicates correlation between the frequency components within short time intervals. Visually, we observe this effect as a continuous change in a motion and color that seems to "follow" the beat of the music. Without the correlation between sequential DFT's, the visualization becomes less aesthetically interesting.

We also created two different variants of the fire simulation by dividing the 128 total frequency bins generated by the DFT among each set of fire pillars differently. The first variant splits them evenly, with each pillar representing a group of four consecutive bins. For both variants, the left bins represent the left audio channels and the right bins represent the right audio channels. In each channel, the lowest frequencies are located in the middle of the screen and the highest frequencies are near the edge. For this visualization, we also modified how the color of each generated particle was determined. Instead of semi-randomly determining the color as the original simulation did, we base each particle’s color off the magnitude of the corresponding frequency component. This value is then interpolated into a color range specified by either the left or right audio channel. The data type of our DFT coefficeints is uint8, which means that the maximum value in the DFT is 255. Therefore we scale each magnitude by $\frac{1}{255}$ and then use that fraction to interpolate into the range of HSV colors. For the left channel, we used values between 0 and 180, and for the right channel, we used values between -180 and 0. Visually, this is still very interesting, because though there is no notion of assign colors to frequencies, a continuous spectrum will usually appear on screen. This is again due to correlation, but this time it is correlation among different components of the DFT. In most audio data, low frequencies will dominate the frequency spectrum. In general, these frequencies become much less prominent as they increase. As a result, even though the color is not a function of frequency, we will witness a color gradient across frequencies in this visualization. Abrupt color changes represent fast changes in the frequency magnitude over time. This leads to effects like "pulsing" along with the beat.

While the naturally occurring color gradient is an interesting visualization of correlation in music, the large magnitude of the lowest frequences tend to dominate the visuals. To prevent this it would be nice to find a way to amplify the higher frequencies. The second variant was inspired by the way frequencies are actually grouped musically, where each octave up represents a doubling in frequency. As such, this one splits the bins logarithmically, with fewer bins grouped together at lower frequencies and more bins grouped at higher. We achieve this split by first computing the total “width” of the buckets by taking the base-2 logarithm of the number of frequency bins produced by the FFT (128 here) then dividing the result by 32 (the number of fire pillars per channel) to get the “width” per pillar. Now that we have the lower and upper bounds on each pillar, we can distribute the FFT’s frequency bins by taking the base-2 logarithm of their bin number and placing them accordingly, with the zeroth bin always placed in the first fire pillar. Because songs have a tendency to be dominated by lower frequencies, resulting in a large amount of red in the final visualization as these frequencies are less grouped. As such, we also modified the color calculations on particle generation to simply determine a particle’s color based on its corresponding frequency range’s distance from the lowest frequency range present, providing better overall contrast at the cost of representing signal strength through only the size and speed of each particle.

Blob Visualization:

Our blob simulation was inspired by the shader portion of project 3 as well as various projects found online. Our goal was to create a 3D visualization using a 2D waveform, which proved to be both challenging and an interesting exploration of web graphics. Because of the open ended nature of our project, we ended up taking steps in many directions at once. A couple ideas we had were generating terrains using the FFT waveform, so the user could 'travel' through the music in a way, a simple 3D wave generation that could help the user visualize the physical wavelengths of the music as well as a physical simulation in the form of some kind of mass-spring system that could experience forces based on the beat of the music. We wanted to create a 3D object that could distort and change based on the input music, and we eventually settled on the idea of manipulating a 3D object's geometry to create an interesting visualization. Because our first simulation very clearly visualizes the magnitude of each wavelength, we decided that a more abstract visualization would be more apt for our second attempt at audio visualization.

Eventually we settled on the use of a sphere as our "base" object to manipulate. We chose a sphere for a couple reasons: their geometries are simple and fairly predictable, and are unique in that their radial distance is uniform in all normal directions. Thus it made a good baseline for us to "graph" our FFT. We came up with a few schemes to perform vertex manipulation based on the FFT, and we ran into the problem of mapping a very specific 128 length spectrum to a 3D object with a much different number of vertices.

Our first naive implementation was to physically graph each point to a wavelength in the FFT. We did this by scaling a vertex in its normal direction by a value directly proportional to the FFT at a certain wavelength. To fit the sphere, we created an interpolation scheme - we generated intermediate values in between each entry in our FFT and this scheme created a fairly unique visualization that resembles a spiral. Our next (succesful) visualization involved performing the opposite; instead of generating new FFT entries, we downsampled the FFT into only 32 samples and scaled each vertex in each horizontal ring in the geometry by the same FFT value. It was here we found that the vast majority of songs are dominated by lower frequencies, and are incredibly sparse at higher frequencies. We took the technical constraint of being unable to match the number of vertices and number of entries in our FFT (even with our interpolation schemes) as an opportunity to filter the extremely low and high frequencies from our FFT. Our result was a much more interesting visualization that wasn't dominated by a single value and left certain parts of our sphere 'blank'. As a final visualization, we scaled the vertices along a single axis based on the values of the FFT in an attempt to create "wings" that emanated from our sphere visualization. With the rendering scheme we chose, we thought it created a cool effect that we chose to leave in our visualization.

Our other large challenge was creating the lighting and effects we wanted in this visualization. After discovering THREEjs postprocessing libraries, we added a bloom effect to our visualization as well as a bright red lighting that would contrast very starkly to the empty black background. In addition, we implemented a couple additional features. The user can toggle and toggle off rendering and choose to display the rendered object or just the mesh of the object. The user is able to toggle this as well as the different visualizations as the song is playing, and see the differences between each visualization live. In addition, the user can pan, zoom and rotate the object with mouse inputs.

Results

Fire Simulation
- Frequency bins evenly split among fire pillars

Song Used: Jazz by Mick Jenkins

- Frequency bins logarithmically split among fire pillars

Song Used: Someday My Prince Will Come by Miles Davis


Blob Visualization


Song Used: Rough Soul (YUNG BAE Family Cookout Version) by Goldlink

References

- Original fire particle simulation
- Web Audio API Nodes: AnalyserNode, ScriptProcessorNode, ChannelSplitterNode

Contributions

Jarry wrote the code to process the audio files and set up the Audio Computation graph in Javascript. He also made the naive first visualization (seen in the first milestone report). For the more graphically intensive part of the project, he drew inspiration from the fire simulation found online decided to use the frequency analsis from the audio to power the particle simulation. He worked on getting the particle simulation to correspond to the music, and he also made the particles change properties after being exposed to different frequencies to provide dynamic visual effects.

Lucy took the particle simulator after Jarry's changes and created the variant that uses a logarithmic bin-splitting algorithm instead of dividing the frequency bins amongst the fire pillars evenly (along with minor changes to particle generation to account for color contrast issues). She also implemented the song selection feature in the UI and tried to do some initial work on integrating shaders after the audio processing (this was ultimately scrapped in favor of the particle simulation).

Jacky helped tune the particle simulation and added parameters that would help the simulation appear to bounce with the music. He also created the blob visualization and created the various interpolation schemes from which the FFT was sampled onto the sphere. In addition, he built and styled the UI so that each simulation could run on the same site and integrated user inputs with functionality in the visualizations.

Link to final project video:

Link to final demo