Learning How to Speak Drum and BAIss

Matthew Reed
11 min read · Feb 7, 2023


The story of music and AI is a new and rapidly progressing one. But let’s start at the very beginning.

One of the earliest uses of AI in music was genre classification. The first approach involved a simple feature extractor that would convert a song or a portion of sound into an n-dimensional vector, where each number represents some measurable quantity of the audio. For instance, one dimension might represent the spectral centroid (the center of mass of the frequency spectrum), while others might represent various MFCCs (Mel-Frequency Cepstral Coefficients).

My first experiment involved training simple KNN (k-nearest neighbors) models of varying complexity on 1,000 audio samples spanning 10 genres to see how well each performs at genre classification.

Model 0: I started with a 1-dimensional feature vector containing just the centroid. This model had an accuracy of about 17%.

Model 1: Then I added flux and RMS for a total of 3 dimensions, and it performed much better, with an accuracy of about 30%.

Model 2: Model 2 was 8 dimensions, 5 of which were MFCCs. This model had an accuracy of about 34%.

Model 3: Model 3 expanded to 20 MFCCs for a total of 23 dimensions. This model had an accuracy of about 43%.

Model 4: Model 4 was the same as Model 3, with 2 more dimensions added for 25% roll-off and 75% roll-off. The accuracy was also around 43%.
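To make this concrete, here is a minimal ChucK sketch of the Model 1 feature chain (centroid, flux, and RMS). This is an illustrative reconstruction rather than my exact extraction code; a real experiment extracts one averaged vector per labeled clip and hands the whole set to a KNN object.

// minimal sketch of the Model 1 feature chain (centroid + flux + RMS)
adc => FFT fft => blackhole;
FeatureCollector combo => blackhole;
fft =^ Centroid centroid =^ combo;
fft =^ Flux flux =^ combo;
fft =^ RMS rms =^ combo;
4096 => fft.size;
Windowing.hann(fft.size()) => fft.window;

// let one window of audio accumulate, then analyze everything upstream
fft.size()::samp => now;
combo.upchuck();
<<< "centroid:", combo.fval(0), "flux:", combo.fval(1), "rms:", combo.fval(2) >>>;

// with one vector per labeled training clip, training is a single call:
// knn.train( featureVectors, genreLabels ); // KNN2, as in the code below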

The second experiment was finding a way to use this classification system to generate new music. I wanted to find a way to convert my voice into a drum and bass track.

So far, my project kinda does that.

Meet dnb-synthesis-mic.ck:

// input: pre-extracted model files
string DRUM_FEATURES_FILE;
string BASS_FEATURES_FILE;
// if we have arguments, override the filenames
if( me.args() > 1 )
{
me.arg(0) => DRUM_FEATURES_FILE;
me.arg(1) => BASS_FEATURES_FILE;
}
else
{
// print usage
<<< "usage: chuck mosaic-synth-mic.ck:INPUT", "" >>>;
<<< " |- INPUT: drum model file : bass model file", "" >>>;
}
//------------------------------------------------------------------------------
// unit analyzer network: *** this must match the features in the features file
//------------------------------------------------------------------------------
// audio input into a FFT
adc => FFT fft;
// a thing for collecting multiple features into one vector
FeatureCollector combo => blackhole;
// add spectral feature: Centroid
fft =^ Centroid centroid =^ combo;
// add spectral feature: Flux
fft =^ Flux flux =^ combo;
// add spectral feature: RMS
fft =^ RMS rms =^ combo;
// add spectral feature: MFCC
fft =^ MFCC mfcc =^ combo;


//-----------------------------------------------------------------------------
// setting analysis parameters -- these should also match what was used during extraction
//-----------------------------------------------------------------------------
// set number of coefficients in MFCC (how many we get out)
// 13 is a commonly used value; using 20 here to match the extraction
20 => mfcc.numCoeffs;
// set number of mel filters in MFCC
10 => mfcc.numFilters;

// do one .upchuck() so FeatureCollector knows how many total dimensions there are
combo.upchuck();
// get number of total feature dimensions
combo.fvals().size() => int NUM_DIMENSIONS;

// set FFT size
// 4096 => fft.size;
15207 => fft.size;
// set window type and size
Windowing.hann(fft.size()) => fft.window;
// our hop size (how often to perform analysis)
// (fft.size()/2)::samp => dur HOP;
(fft.size())::samp => dur HOP;
// how many frames to aggregate before averaging?
// (this does not need to match extraction; might play with this number)
4 => int NUM_FRAMES;
// how much time to aggregate features for each file
fft.size()::samp * NUM_FRAMES => dur EXTRACT_TIME;


//------------------------------------------------------------------------------
// unit generator network: for real-time sound synthesis
//------------------------------------------------------------------------------
// how many max at any time?
2 => int NUM_VOICES_BASS;
2 => int NUM_VOICES_DRUMS;
// a number of audio buffers to cycle between (plus ADSR envelopes for the bass)
SndBuf buffers_bass[NUM_VOICES_BASS];
SndBuf buffers_drums[NUM_VOICES_DRUMS];
ADSR envs[NUM_VOICES_BASS];
// set parameters
for( int i; i < NUM_VOICES_BASS; i++ )
{
// connect audio
// buffers_bass[i] => envs[i] => pans[i] => dac;
buffers_bass[i] => NRev rev => Pan2 pan => dac;
0.8 => buffers_bass[i].gain;
Math.random2f(-.75,.75) => pan.pan;
Math.random2f(0,.5) => rev.mix;
// set chunk size (how much to load at a time)
// this is important when reading from large files
// if this is not set, SndBuf.read() will load the entire file immediately
fft.size() => buffers_bass[i].chunks;

// randomize pan => pans[i].pan;
// set envelope parameters
envs[i].set( EXTRACT_TIME, EXTRACT_TIME/2, 1, EXTRACT_TIME );
}
for( int i; i < NUM_VOICES_DRUMS; i++ )
{
// connect audio
// buffers_bass[i] => envs[i] => pans[i] => dac;
buffers_drums[i] => Pan2 panR => dac;
// 0.5 => panR.pan;
// set chunk size (how much to load at a time)
// this is important when reading from large files
// if this is not set, SndBuf.read() will load the entire file immediately
fft.size() => buffers_drums[i].chunks;

}

//------------------------------------------------------------------------------
// load feature data; read important global values like numPoints and numCoeffs
//------------------------------------------------------------------------------
// values to be read from file
0 => int numPointsDrums; // number of points in data
0 => int numPointsBass;
0 => int numCoeffs; // number of dimensions in data
// file read PART 1: read over the file to get numPoints and numCoeffs
<<< "LOADING FILES" >>>;
loadFile( DRUM_FEATURES_FILE, 1 ) @=> FileIO @ fin_drum;
loadFile( BASS_FEATURES_FILE, 0 ) @=> FileIO @ fin_bass;
<<< "LOADED FILES", numPointsBass, numPointsDrums >>>;
// check
if( !fin_drum.good() ) me.exit();
if( !fin_bass.good() ) me.exit();
// check dimension at least
if( numCoeffs != NUM_DIMENSIONS )
{
// error
<<< "[error] expecting:", NUM_DIMENSIONS, "dimensions; but features file has:", numCoeffs >>>;
// stop
me.exit();
}


//------------------------------------------------------------------------------
// each Point corresponds to one line in the input file, which is one audio window
//------------------------------------------------------------------------------
class AudioWindow
{
// unique point index (use this to lookup feature vector)
int uid;
// which file did this come from (in the files array)
int fileIndex;
// starting time in that file (in seconds)
float windowTime;

// set
fun void set( int id, int fi, float wt )
{
id => uid;
fi => fileIndex;
wt => windowTime;
}
}

// array of all points in model file
AudioWindow windows[numPointsBass + numPointsDrums];
// unique filenames; we will append to this
string files[0];
// map of filenames loaded
int filename2state[0];
// feature vectors of data points
float inFeaturesBass[numPointsBass][numCoeffs];
float inFeaturesDrums[numPointsDrums][numCoeffs];
// generate array of unique indices
int uids_bass[numPointsBass]; for( int i; i < numPointsBass; i++ ) i => uids_bass[i];
int uids_drums[numPointsDrums]; for( int i; i < numPointsDrums; i++ ) i => uids_drums[i];

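// uid of the window each voice is currently playing (-1 = none yet)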
int uids_playing[NUM_VOICES_BASS + NUM_VOICES_DRUMS]; for( int i; i < uids_playing.size(); i++ ) -1 => uids_playing[i];

// use this for new input
float features[NUM_FRAMES][numCoeffs];
// average values of coefficients across frames
float featureMean[numCoeffs];


//------------------------------------------------------------------------------
// read the data
//------------------------------------------------------------------------------
readData( fin_drum, 1 );
readData( fin_bass, 0 );

//------------------------------------------------------------------------------
// set up our KNN object to use for classification
// (KNN2 is a fancier version of the KNN object)
// -- run KNN2.help(); in a separate program to see its available functions --
//------------------------------------------------------------------------------
KNN2 knn_drums;
KNN2 knn_bass;
// k nearest neighbors
2 => int K;
// results vector (indices of k nearest points)
int knnResultDrums[K];
int knnResultBass[K];
// knn train
knn_drums.train( inFeaturesDrums, uids_drums );
knn_bass.train( inFeaturesBass, uids_bass );


// used to rotate sound buffers
0 => int which_bass;
0 => int which_drums;



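// play the drum and bass windows returned by the two KNN searches;
// the drum half is skipped if that window is already looping in a voice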
fun void synthesize_both( int uid_drums, int uid_bass, int loop_num)
{
if (checkIfLooping(uid_drums, which_drums) == 0) {
buffers_drums[which_drums] @=> SndBuf @ sound;
// increment and wrap if needed
which_drums++; if( which_drums >= buffers_drums.size() ) 0 => which_drums;

// get a reference to the audio fragment to synthesize
windows[uid_drums] @=> AudioWindow @ win;
// get filename
// chout <= files[0];
files[win.fileIndex] => string filename;
<<< filename, win.fileIndex, uid_drums >>>;
// load into sound buffer
filename => sound.read;
chout <= filename <= " ";
sound.loop(1);
chout <= "synthsizing drum window:";
chout <= win.uid <= "["
<= win.fileIndex <= ":"
<= win.windowTime <= ":POSITION="
<= sound.pos() <= "]";
chout <= IO.newline();


} else {
chout <= "ALREADY PLAYING" <= IO.newline();
}


// if (checkIfLooping(uid_bass, which_bass + NUM_VOICES_BASS) == 0) {

buffers_bass[which_bass] @=> SndBuf @ sound;
envs[which_bass] @=> ADSR @ envelope;
which_bass++; if( which_bass >= buffers_bass.size() ) 0 => which_bass;

windows[uid_bass + numPointsDrums] @=> AudioWindow @ win;
files[win.fileIndex] => string filename;
filename => sound.read;
chout <= filename <= " ";
0 => sound.pos;

chout <= "synthsizing bass window:";
chout <= win.uid <= "["
<= win.fileIndex <= ":"
<= win.windowTime <= ":POSITION="
<= sound.pos() <= "]";
chout <= IO.newline();

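// gate the bass with its ADSR: hold for a fixed 30000 samples, then release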
envelope.keyOn();
30000::samp => now;
envelope.keyOff();
envelope.releaseTime() => now;

// } else {
// chout <= "ALREADY PLAYING" <= IO.newline();
// chout <= uid_drums, which_bass, NUM_VOICES_BASS;
// }
}

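// return 1 if uid is already playing in some voice; otherwise
// record it in slot whichIndex and return 0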
fun int checkIfLooping(int uid, int whichIndex) {
for (0 => int i; i < uids_playing.size(); i++) {
if (uids_playing[i] == uid) {
return 1;
}
}
uid => uids_playing[whichIndex];
return 0;
}

//------------------------------------------------------------------------------
// real-time similarity retrieval loop
//------------------------------------------------------------------------------
0 => int loop_num;
while( true )
{
// aggregate features over a period of time
for( int frame; frame < NUM_FRAMES; frame++ )
{
//-------------------------------------------------------------
// a single upchuck() will trigger analysis on everything
// connected upstream from combo via the upchuck operator (=^)
// the total number of output dimensions is the sum of
// dimensions of all the connected unit analyzers
//-------------------------------------------------------------
combo.upchuck();
// get features
for( int d; d < NUM_DIMENSIONS; d++)
{
// store them in current frame
combo.fval(d) => features[frame][d];
}
// advance time (note: the hop is hard-coded here; HOP above goes unused)
2 * 15206::samp => now;
}

// compute means for each coefficient across frames
for( int d; d < NUM_DIMENSIONS; d++ )
{
// zero out
0.0 => featureMean[d];
// loop over frames
for( int j; j < NUM_FRAMES; j++ )
{
// add
features[j][d] +=> featureMean[d];
}
// average
NUM_FRAMES /=> featureMean[d];
}

//-------------------------------------------------
// search using KNN2; results are filled into knnResultDrums and
// knnResultBass, which hold the indices of the k nearest points
//-------------------------------------------------
knn_bass.search( featureMean, K, knnResultBass );
knn_drums.search( featureMean, K, knnResultDrums );

// SYNTHESIZE THIS
// spork ~ synthesize_both( knnResultDrums[Math.random2(0,knnResultDrums.size()-1)],
// knnResultBass[Math.random2(0,knnResultBass.size()-1)],
// loop_num);
spork ~ synthesize_both( knnResultDrums[0],knnResultBass[0],loop_num);
loop_num++;
// if (loop_num % 1 == 0) {
// spork ~ synthesize_bass( knnResultBass[Math.random2(0,knnResultBass.size()-1)]);
// }
// if (loop_num % 4 == 0) {
// spork ~ synthesize_drums( knnResultDrums[Math.random2(0,knnResultDrums.size()-1)]);
// }
// 15207::samp => now;
}
//------------------------------------------------------------------------------
// end of real-time similarity retrieval loop
//------------------------------------------------------------------------------




//------------------------------------------------------------------------------
// function: load data file
//------------------------------------------------------------------------------
fun FileIO loadFile( string filepath , int isDrums)
{
// reset
if (isDrums == 1) {
0 => numPointsDrums;
} else {
0 => numPointsBass;
}
0 => numCoeffs;

// load data
FileIO fio;
if( !fio.open( filepath, FileIO.READ ) )
{
// error
<<< "cannot open file:", filepath >>>;
// close
fio.close();
// return
return fio;
}

string str;
string line;
// read through the file, counting non-empty lines
while( fio.more() )
{
// read each line
fio.readLine().trim() => str;
// check if empty line
if( str != "" )
{
if (isDrums == 1) {
numPointsDrums++;
} else {
numPointsBass++;
}
str => line;
}
}

// a string tokenizer
StringTokenizer tokenizer;
// set to last non-empty line
tokenizer.set( line );
// start negative (to account for the filePath and windowTime fields)
-2 => numCoeffs;
// count the remaining tokens (the feature dimensions)
while( tokenizer.more() )
{
tokenizer.next();
numCoeffs++;
}

// see if we made it past the initial fields
if( numCoeffs < 0 ) 0 => numCoeffs;

// check
if( (isDrums == 1 && numPointsDrums == 0) || (isDrums == 0 && numPointsBass == 0) || numCoeffs <= 0 )
{
<<< "no data in file:", filepath >>>;
fio.close();
return fio;
}

// print
<<< "# of drum data points:", numPointsDrums, " # of bass data points: ", numPointsBass, "dimensions:", numCoeffs >>>;

// done for now
return fio;
}


//------------------------------------------------------------------------------
// function: read the data
//------------------------------------------------------------------------------
fun void readData( FileIO fio, int isDrums )
{
// rewind the file reader
fio.seek( 0 );

// a line
string line;
// a string tokenizer
StringTokenizer tokenizer;

// points index
0 => int index;
// file index
0 => int fileIndex;
// file name
string filename;
// window start time
float windowTime;
// coefficient
int c;

// read and parse each non-empty line
while( fio.more() )
{
// read each line
fio.readLine().trim() => line;
// check if empty line
if( line != "" )
{
// tokenize the current line
tokenizer.set( line );
// file name
tokenizer.next() => filename;
// window start time
tokenizer.next() => Std.atof => windowTime;
// have we seen this filename yet?
if( filename2state[filename] == 0 )
{
// append
filename => string sss;
files << sss;
// new id
files.size() => filename2state[filename];
}
// get fileindex
filename2state[filename]-1 => fileIndex;
// set
if (isDrums == 1) {
windows[index].set( index, fileIndex, windowTime );
} else {
windows[index + numPointsDrums].set( index, fileIndex, windowTime );
}

// zero out
0 => c;
// for each dimension in the data
repeat( numCoeffs )
{
// read next coefficient
if (isDrums == 0) {
tokenizer.next() => Std.atof => inFeaturesBass[index][c];
} else {
tokenizer.next() => Std.atof => inFeaturesDrums[index][c];
}
// increment
c++;
}

// increment global index
index++;
}
}
}

This monstrosity of a file converts your microphone’s input into a drum and bass track: it periodically runs two KNN searches over a library of drum and bass samples and plays the nearest matches in time, so everything stays synchronized.

Essentially, the program takes the input from the microphone, finds the dnb samples (1 drum and 1 bass) whose feature vectors are most similar to the input, then plays them. Here it is kinda working:

I was impressed by the program's robustness and how well it mimicked what my voice sounded like. Next, I had to think about improvements that I could make to the model that I had already created.

One missing feature was the ability to tweak hyperparameters while the program was running. I wanted to be able to turn certain bass and drum tracks on and off so I could better create shifts in dynamics throughout a performance. For this, I added the ability to increase or decrease the number of bass and drum tracks played simultaneously. There could be, for example, 1 bass track running while 4 drum tracks are running.

My original program was also limited by its fixed, relatively long synth window. So, in part two, I added the ability to change the synth window length for the bass and drum tracks individually. Now you can synthesize the bass and drums at different rates while keeping them in sync.

Controls (on keyboard), with a rough wiring sketch after the list:

number of bass tracks: w increases, and s decreases

number of drum tracks: e increases, and d decreases

bass synth window size: r doubles it, and f halves it

drum synth window size: t doubles it, and g halves it
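The part-two file isn’t shown here, but a minimal sketch of how such keyboard controls can be wired up in ChucK, using its KBHit class, might look like the following. The numBassVoices and numDrumVoices names are illustrative stand-ins for globals the synthesis loop would read, not variables from my actual file.

KBHit kb;
// illustrative globals that the synthesis loop would read
2 => int numBassVoices;
2 => int numDrumVoices;

while( true )
{
    // wait for a keystroke, then drain any queued characters
    kb => now;
    while( kb.more() )
    {
        kb.getchar() => int c;
        if( c == 119 ) numBassVoices++; // 'w'
        else if( c == 115 && numBassVoices > 0 ) numBassVoices--; // 's'
        else if( c == 101 ) numDrumVoices++; // 'e'
        else if( c == 100 && numDrumVoices > 0 ) numDrumVoices--; // 'd'
        <<< "bass voices:", numBassVoices, "drum voices:", numDrumVoices >>>;
    }
}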

The video below demonstrates the sonic results of the final version. The video in the background is mostly irrelevant; it just shows my friend messing around with the program. The text flying across the screen shows how often each drum and bass window is synthesized.

Many things could be improved about this program; however, I do not have infinite time. If I did, I would add envelopes to the various sounds to remove any pops while it is playing, and I would also add a low pass filter that could be controlled with your voice (I think that would sound pretty cool since it would sound like it was coming out of your mouth).
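For the curious, here is a rough sketch of the voice-controlled low-pass filter idea: track the spectral centroid of the mic input and map it to the cutoff of an LPF on the output. The scaling is ad hoc, and a built-in looping sample stands in for the dnb voices.

// hedged sketch: the mic's spectral centroid drives a low-pass cutoff
adc => FFT fft =^ Centroid centroid => blackhole;
1024 => fft.size;
Windowing.hann(fft.size()) => fft.window;

// a built-in looping sample stands in for the drum and bass voices
SndBuf buf => LPF lpf => dac;
"special:dope" => buf.read;
1 => buf.loop;
2 => lpf.Q;

while( true )
{
    centroid.upchuck();
    // centroid comes out roughly normalized; scale to an audible range
    100 + centroid.fval(0) * 8000 => lpf.freq;
    fft.size()::samp => now;
}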

But the most interesting part of this project was how it felt to create music with it. It didn’t feel like playing a smart instrument, but it also didn’t feel like I was playing a dumb one. There were essentially 9 inputs to the instrument: 8 keys and my voice. I had to play around with it for a while before I figured out interesting ways to transition between sections and add dynamics to arrangements. But there was also an aspect of the instrument that was inherently mysterious. Maybe I just don’t know how to play it well enough, but part of me thinks the underlying math it uses is inherently unintuitive. We don’t listen to sounds and hear their spectral centroid or their MFCCs, so in some regards, the sounds that came out of the program were surprising.

When using dnb-synthesis-mic.ck, it felt like I was performing while also experiencing a performance. It was, in some ways, a duet between man and machine. It was a partner dance where I took the lead, but she took me places I didn’t even consider.

It’s not a perfect tool by any means, but I think the experience of using dnb-synthesis-mic.ck perfectly walks the line between incorporating enough AI elements and maintaining control through various knobs and inputs.

I hope you enjoyed my project!
