
In my last post, I used an existing Iris flower data set to train and test a PyBrain Neural Network. With the network in place, I was able to classify an Iris flower into one of three species: Iris setosa, Iris versicolor or Iris virginica.

But can I create my own audio data set to train and test a PyBrain Neural Network? And once the network is in place, will it be able to classify the spoken words Yes and No correctly? Let’s find out!

Create an audio data set

First, we need to create an audio data set.

I will speak the word Yes or No into the microphone attached to my PC. I can then extract some audio properties related to each word which will be used to train and test the neural network.

Here’s a graph of the word Yes spoken into my microphone:


Amplitude is on the Y axis and Time is on the X axis.

But which audio properties shall I use? What if I take eight vertical slices of the graph, from where the word starts to where it finishes? In each of those slices I can calculate the top peak as a percentage of the overall top peak. I end up with the following values to train my network with:

[90, 100, 55, 21, 32, 85, 99, 60]

As you can tell, the array of values mimics the graph – the highest values occur at the beginning and end of the word, whilst in the middle of the word the values drop.
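To make the idea concrete, here's a small standalone sketch (toy integer amplitudes, not real WAV data) of slicing a signal into eight pieces and expressing each slice's top peak as a percentage of the overall top peak:

```python
NUMBER_OF_SLICES = 8

def peak_percentages(data):
    # split the samples into eight equal slices and take the top peak of each
    slice_size = len(data) // NUMBER_OF_SLICES
    top_peaks = [max(data[i * slice_size:(i + 1) * slice_size])
                 for i in range(NUMBER_OF_SLICES)]

    # express each top peak as a percentage of the overall top peak
    overall_top_peak = max(top_peaks)
    return [peak * 100 // overall_top_peak for peak in top_peaks]

# a toy word shape: two samples per slice
data = [5, 9, 8, 10, 4, 5, 1, 2, 2, 3, 7, 8, 9, 10, 5, 6]
print(peak_percentages(data))  # -> [90, 100, 50, 20, 30, 80, 100, 60]
```

Integer arithmetic keeps the percentages exact here; the real audio samples are floats, but the principle is the same.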

Okay, so what about the word No? Here’s the graph:


And here’s the array of top peaks, one for each of the eight vertical slices:

[44, 42, 91, 100, 76, 85, 59, 28]

As you can tell, the highest values occur at the middle of the word.

So now we are ready to create our audio data set. I will speak the word Yes into the microphone 50 times and then speak the word No into the microphone 50 times (what fun!). I will then have a data set to train and test a PyBrain Neural Network.

Let’s take a peek at the Python code to create the data set:

# create neural network data
def create_data(text_to_speech):

    # ask user to say the word Yes or No
    text_to_speech("Say the word Yes or No")

    # save spoken word as wav data
    recognizer = sr.Recognizer()   
    with sr.Microphone() as source:
        print "listening..."
        audio = recognizer.listen(source)

    with open(WAV_FILE, "wb") as f:
        f.write(audio.get_wav_data())

    # get target data (and bail out if not Yes or No)
    target = _get_target(recognizer, audio)
    if target is None: return (None, None)

    # get input data
    input = _get_input()

    return (input,target)

First, the Python Text To Speech package is used to instruct us through the computer speakers to say the word Yes or No.

Using the Python Speech Recognition library, I am able to record my spoken word and save it to a WAV file on my disk.

Finally, the neural network input and target values are obtained from the spoken word.

Here’s the method we use to get the neural network target values:

# get neural network target data
def _get_target(recognizer, audio):
    target = None
    text = None

    # use Google Speech Recognition to resolve audio to Yes or No
    try:
        text = recognizer.recognize_google(audio).lower()
        print text
    except sr.UnknownValueError:
        print "Google Speech Recognition could not understand audio"
    except sr.RequestError:
        print "Could not request results from Google Speech Recognition service"

    if not text: return

    # obtain target data i.e. 0 for Yes, 1 for No
    try:
        target = AUDIO_CLASSES.index(text)
    except ValueError:
        pass # word was not Yes or No

    return target

As you can see, we use the Google Speech Recognition service to determine if the word we uttered was Yes or No. But why are we using Google – I thought we were using our own neural network to classify Yes and No? Let me explain…

Google Speech Recognition service is simply being used to confirm the target of the neural network input values i.e. 0 for Yes, 1 for No. If Google is unable to recognise the word we spoke into the microphone as Yes or No, then the audio properties are not added to our data set. Google is acting as quality assurance for our data set.
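The mapping itself is tiny: the position of the recognised word in a list of classes becomes the target. A minimal standalone sketch of that step (the class list mirrors the post's 0-for-Yes, 1-for-No convention):

```python
AUDIO_CLASSES = ["yes", "no"]

def to_target(text):
    # returns 0 for Yes, 1 for No, or None for any other word
    try:
        return AUDIO_CLASSES.index(text.lower())
    except ValueError:
        return None

print(to_target("Yes"))    # -> 0
print(to_target("No"))     # -> 1
print(to_target("Maybe"))  # -> None
```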

Okay, let’s take a look at the code to generate our neural network input values:

# get neural network input data
def _get_input():
    # load audio data (wav format)
    rate, data = wavfile.read(WAV_FILE)
    data = data / (2.**15)

    # get start and end position of word (word will be either Yes or No)
    threshold_level = np.amax(data) / THRESHOLD_DIVISION
    print "Threshold level: {}".format(threshold_level)

    start_pos = 0
    end_pos = 0
    prev_val = 0
    for idx, val in enumerate(data):
        if start_pos == 0 and val >= threshold_level:
            start_pos = idx
        if prev_val >= threshold_level and val < threshold_level:
            end_pos = idx

        prev_val = val

    print "Start position: {}".format(start_pos)
    print "End position: {}".format(end_pos)

    # get top peak in each slice
    slice_size = (end_pos - start_pos) / NUMBER_OF_SLICES
    print "Slice size: {}".format(slice_size)

    top_peaks = []

    for i in range(NUMBER_OF_SLICES):
        end_pos = start_pos + slice_size

        slice = data[start_pos:end_pos]
        top_peaks.append(np.amax(slice))

        start_pos = end_pos

    print "Top peaks: {}".format(top_peaks)

    # obtain input data as percentages e.g. [47, 100, 74, 70, 32, 16, 35, 41]
    # top peak in each slice will be a percentage of overall top peak
    input = []
    overall_top_peak = np.amax(top_peaks)

    for val in top_peaks:
        input_item = int((val / overall_top_peak) * 100)
        input.append(input_item)

    print "Input: {}".format(input)

    return input

First, we load the WAV file of the word we spoke into the microphone.

Next, we find the start and end position of the word in our audio data (after all, we don’t want to extract properties from all that silence on the graph, before and after the word is spoken). A threshold value is used to help us determine the start and end positions.
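The boundary search can be sketched in plain Python with toy values: the first sample at or above the threshold marks the start of the word, and the last drop back below it marks the end.

```python
def find_word_bounds(data, threshold):
    start_pos = 0
    end_pos = 0
    prev_val = 0
    for idx, val in enumerate(data):
        if start_pos == 0 and val >= threshold:
            start_pos = idx   # first sample above the threshold
        if prev_val >= threshold and val < threshold:
            end_pos = idx     # last crossing back below the threshold
        prev_val = val
    return (start_pos, end_pos)

# silence, a "word", then silence again
data = [0.0, 0.01, 0.4, 0.9, 0.7, 0.5, 0.02, 0.0]
print(find_word_bounds(data, 0.1))  # -> (2, 6)
```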

We then take eight vertical slices of our audio data, and find the top peak (highest value) in each.

Finally, we obtain our neural network input values by calculating the top peak of each slice as a percentage of the overall top peak. Using percentages means that it doesn’t matter whether the word is spoken loudly or quietly – the values will always be relative.
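Here's a quick standalone check of that claim, using toy integer peak values: tripling every amplitude leaves the percentages untouched.

```python
def to_percentages(top_peaks):
    # each top peak as a percentage of the overall top peak
    overall_top_peak = max(top_peaks)
    return [peak * 100 // overall_top_peak for peak in top_peaks]

quiet = [40, 100, 60, 20]
loud = [val * 3 for val in quiet]  # same word, spoken three times louder

print(to_percentages(quiet))  # -> [40, 100, 60, 20]
print(to_percentages(loud))   # -> [40, 100, 60, 20]
```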

All we need now is a way of saving our neural network input and target values to file:

# save input and target data
def save_data(input, target):
    if input is None or target is None: return

    with open(DATA_FILE, "a") as f:
        f.write(str(input) + '[{}]'.format(target) + '\n')

And loading all the neural network inputs and targets:

# load all input and target data
def load_data():
    inputs = []
    targets = []

    with open(DATA_FILE) as f:
        lines = f.readlines()

    for line in lines:
        line_parts = line.rstrip('\n').split('][')

        input = map(int, line_parts[0][1:].split(','))
        target = int(line_parts[1][:1])

        inputs.append(input)
        targets.append(target)

    return (inputs,targets)

Here are some neural network inputs and targets from my data set (the first three rows for the word Yes, the last three rows for the word No):

[81, 96, 100, 65, 13, 22, 37, 28][0]
[52, 67, 100, 85, 78, 95, 92, 50][0]
[87, 90, 100, 43, 16, 23, 30, 29][0]
[39, 60, 73, 95, 100, 89, 60, 30][1]
[46, 87, 100, 87, 91, 82, 78, 41][1]
[39, 66, 87, 100, 93, 88, 89, 55][1]
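Each line of the data file splits back apart on the `][` separator, exactly as load_data does. A standalone sketch of the round trip for one line:

```python
line = "[81, 96, 100, 65, 13, 22, 37, 28][0]\n"

# split into the input part and the target part
line_parts = line.rstrip('\n').split('][')

# strip the leading '[' and parse the comma-separated input values
input = [int(val) for val in line_parts[0][1:].split(',')]

# the target is the single digit before the closing ']'
target = int(line_parts[1][:1])

print(input)   # -> [81, 96, 100, 65, 13, 22, 37, 28]
print(target)  # -> 0
```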

Build a PyBrain Neural Network

We now have an audio data set to train and test our PyBrain Neural Network.

My previous post, Iris Classifier using PyBrain Neural Network, has all the detail on building a neural network. Here are the tweaks I made to train and test the audio data set…

My neural network will have 8 input values (one for each of our top peaks) and 2 output classes (for the classification of our spoken word to Yes or No), plus a hidden layer of 5:

# network constants
INPUT_NEURONS = 8    # one per top peak
HIDDEN_NEURONS = 5   # hidden layer
OUTPUT_NEURONS = 2   # Yes or No

During training, I will use the PyBrain trainUntilConvergence method:

# train with backpropagation until the validation error converges
trainer = BackpropTrainer(network, dataset)
trainer.trainUntilConvergence()

Classify audio data

We now have a PyBrain Neural Network trained and tested on our audio data set.

All that is left to do is add a new Audio Classifier feature to SaltwashAR, the Python Augmented Reality application, and start speaking the words Yes and No through our computer microphone.

But will our neural network successfully classify our spoken words as Yes or No? Let’s take a look at the video:

Marvellous! Sporty Robot asks me to say the word Yes or No. I say the word Yes and it gets correctly classified (as do the following utterances No, No and Yes).

Bear in mind that although Google Speech Recognition is being used to accept our words as valid for classification, it is our own neural network that is determining whether the word is Yes or No. We could do away with Google Speech Recognition during classification, and just keep it as a handy way of gathering targets during the creation of our audio data set.
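During classification, the trained network produces one output activation per class, and the larger activation wins. A minimal sketch of that final step, with made-up activation values standing in for a real PyBrain network output:

```python
AUDIO_CLASSES = ["Yes", "No"]

def classify(output_activations):
    # the index of the strongest activation selects the class
    winner = output_activations.index(max(output_activations))
    return AUDIO_CLASSES[winner]

print(classify([0.92, 0.08]))  # -> Yes
print(classify([0.13, 0.87]))  # -> No
```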

All the Audio Classifier code can be found at SaltwashAR on GitHub. It’s easy to install SaltwashAR, and you can even add your own features! The SaltwashAR Wiki is at hand to help.

Next steps

So what is next for the Audio Classifier feature?

We could use other audio properties – alongside top peaks – to help classify words. Each person will speak the words Yes and No with their own accent and speed, using a particular type of microphone, so a more sophisticated approach is required.

And we can add more words – perhaps the word Maybe could be added to the data set next? And for each word, let’s add it to the data set far more than 50 times.

But it’s a decent start. And all those utterances of Yes and No have left me feeling rather assertive!



If you want to display a graph for the words Yes or No, simply add the following code after reading the WAV file into rate and data variables:

time = np.arange(len(data)) * 1.0 / rate
plt.plot(time, data)
plt.show()

You’ll need the import statement: import matplotlib.pyplot as plt

And the graph code needs to be running in the main application process, not in a thread.