
In my last post, Voice recognition with Python, I wrote some Python code that could detect the difference between the words Yes and No when spoken into a microphone. Here’s the word Yes, plotted:


And here’s the word No:


The code found the highest peak on the graph and then counted all the peaks to the left of it. As you can see, the word Yes has fewer left peaks than the word No. It was a fairly crude method, but worked quite well.

‘But I need her to understand more than Yes and No,’ Arkwood moaned. You see, I promised I would build a robot girlfriend for him. And now I need a more robust method to extend her vocabulary, so that she can swoon at my buddy’s lewd utterances through the microphone.

Okay, let’s take another approach. Consider the following code:

from scipy.io import wavfile
import numpy as np

class AudioAnalysis(object):

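    LEVEL_THRESHOLD = 0.3
    SLICE_SIZE = 2205
    YES_BLOBCOUNT = 2    # the word Yes shows up as two blobs
    NO_BLOBCOUNT = 1     # the word No shows up as one blob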

    def is_yes(self, wave_file):
        # read voice file
        rate, data = wavfile.read(wave_file)
        data = data / (2.**15)

        # initialise variables
        data_len = len(data)
        start_pos = 0
        end_pos = 0
        is_blob = False
        blob_count = 0

        # ignore silence before word: stop at the first sample above the threshold
        for idx, val in enumerate(data):
            if val > self.LEVEL_THRESHOLD:
                end_pos = idx
                break

        # loop through our data, a slice at a time
        while True:
            # index the next slice 
            start_pos = end_pos
            end_pos += self.SLICE_SIZE

            # bail out if no more slices
            if end_pos > data_len:
                break

            # get top 3 values in slice
            top_vals = np.sort(data[start_pos:end_pos])[-3:]

            # if threshold breached, increment the blob count
            if np.all(top_vals > self.LEVEL_THRESHOLD) and not is_blob:
                blob_count += 1
                is_blob = True
            elif np.all(top_vals < self.LEVEL_THRESHOLD) and is_blob:
                is_blob = False

        # classify the word from the number of blobs found
        if blob_count == self.YES_BLOBCOUNT:
            return True
        elif blob_count == self.NO_BLOBCOUNT:
            return False
        else:
            return None

The is_yes method receives an audio wav file, which holds the word that Arkwood has spoken into the microphone attached to my PC. It reads the audio file as per the previous post, assigning the sampling rate and data to variables and scaling the data to a range from -1 to 1.
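To see what that scaling does, here is a minimal sketch. It assumes the recording is 16-bit mono PCM, which is what dividing by 2 to the power of 15 implies. The helper name normalise_int16 and the sample values are made up for illustration.

```python
import numpy as np

def normalise_int16(samples):
    """Scale 16-bit PCM samples into the range -1 to 1 (assumes int16 input)."""
    return np.asarray(samples, dtype=np.float64) / (2.0 ** 15)

# Hypothetical raw samples covering the int16 range
raw = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
scaled = normalise_int16(raw)
print(scaled)
```

The most negative sample maps to exactly -1, while the most positive one lands just short of 1.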

Here’s where things change. If you take a gander at the Yes graph above, you’ll see that it comprises two blobs. The word No, on the other hand, has just the one blob. What would happen if we took vertical slices of our graph – could we use those slices to calculate the number of blobs? Let’s give it a go!

Once we have initialized some variables, we loop through our data until we find the start of the word. After all, we don’t want to be doing our calculations on the sound of silence.
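The search for the start of the word could also be written without an explicit loop. This is only a sketch of one alternative, not the code used in this post: find_word_start is a hypothetical helper and the signal values are invented.

```python
import numpy as np

def find_word_start(data, threshold=0.3):
    """Return the index of the first sample above threshold, or None if all is quiet."""
    above = np.flatnonzero(np.asarray(data) > threshold)
    return int(above[0]) if above.size else None

# Invented signal: near silence, then the word kicks in at index 4
signal = np.array([0.0, 0.01, -0.02, 0.05, 0.4, 0.6, 0.1])
start = find_word_start(signal)
print(start)
```

Returning None for a silent recording keeps the caller from slicing from index 0 by accident.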

Now we drop into a while loop. First up, we take a slice of data (each slice is 2205 samples, which is one twentieth of the sampling rate of 44100).
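A quick sketch of how the slices line up, assuming the same slice size and sampling rate as above. The generator slice_bounds and the start and length values are hypothetical, chosen only to keep the output short.

```python
SAMPLE_RATE = 44100
SLICE_SIZE = 2205

def slice_bounds(start, data_len, slice_size=SLICE_SIZE):
    """Yield (start_pos, end_pos) for each full slice between start and data_len."""
    end = start
    while end + slice_size <= data_len:
        yield end, end + slice_size
        end += slice_size

# Hypothetical word starting at sample 1000 in a recording of 8000 samples
bounds = list(slice_bounds(start=1000, data_len=8000))
slice_seconds = SLICE_SIZE / SAMPLE_RATE

print(bounds)
print(slice_seconds)
```

Each slice covers a twentieth of a second of sound, and any leftover samples that do not fill a whole slice are ignored.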

After checking that we have not come to the end of our data, we sort our slice of data and grab the top 3 values. These top values are the peaks in the vertical slice of graph we are working on – and having 3 top values gives us confidence that we are not dealing with a single blip in the sound, but instead an actual word.
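Here is a small sketch of why three top values are safer than one. The two slices are invented: one holds a single stray spike, the other holds several strong peaks. The helper top_values is a made-up name wrapping the same np.sort call as the main code.

```python
import numpy as np

THRESHOLD = 0.3

def top_values(slice_data, count=3):
    """Return the `count` largest values in a slice, in ascending order."""
    return np.sort(np.asarray(slice_data))[-count:]

blip = np.array([0.01, 0.9, 0.02, -0.03, 0.05])    # one isolated spike
voiced = np.array([0.35, -0.4, 0.5, 0.45, -0.2])   # several strong peaks

blip_is_loud = bool(np.all(top_values(blip) > THRESHOLD))
voiced_is_loud = bool(np.all(top_values(voiced) > THRESHOLD))

print(blip_is_loud, voiced_is_loud)
```

The spike on its own does not count as sound, because its two runners-up sit well below the threshold.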

Finally, we check if all our top values are above a threshold of 0.3. If they are, then we have found a blob! We update our blob count and set is_blob to True (this is important, as we don’t want to recount the blob on the next loop). If all our top values are below the threshold then we are no longer in a blob, and so set is_blob to False (ready for detecting the next blob).
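The whole blob-counting idea can be tried out on made-up data. The sketch below is a standalone function, count_blobs, that follows the same steps as is_yes but takes an array instead of a file. The test signals are synthetic (flat bursts of 0.5 separated by silence) and the slice size is shrunk to 10 so the numbers stay readable; none of this is real voice data.

```python
import numpy as np

def count_blobs(data, threshold=0.3, slice_size=2205):
    """Count runs of loud slices, starting at the first sample above threshold."""
    data = np.asarray(data)
    loud = np.flatnonzero(data > threshold)
    if loud.size == 0:
        return 0  # nothing but silence

    end_pos = int(loud[0])
    in_blob = False
    blobs = 0
    while end_pos + slice_size <= len(data):
        start_pos, end_pos = end_pos, end_pos + slice_size
        top_vals = np.sort(data[start_pos:end_pos])[-3:]
        if np.all(top_vals > threshold) and not in_blob:
            blobs += 1          # entered a new blob
            in_blob = True
        elif np.all(top_vals < threshold) and in_blob:
            in_blob = False     # left the current blob
    return blobs

def burst(n):
    """Synthetic loud stretch of n samples."""
    return np.full(n, 0.5)

def quiet(n):
    """Synthetic silent stretch of n samples."""
    return np.zeros(n)

two_blobs = np.concatenate([quiet(10), burst(20), quiet(20), burst(20), quiet(10)])
one_blob = np.concatenate([quiet(10), burst(40), quiet(20)])

print(count_blobs(two_blobs, slice_size=10))
print(count_blobs(one_blob, slice_size=10))
```

The is_blob flag is what stops a long loud stretch from being counted once per slice.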

That’s it. Our is_yes method will use the blob count to determine whether the word is Yes or No (or neither, in which case it returns the value None).

Time for a demo. I ask Arkwood to say the word Yes into the microphone. The program will output some values, to let me know how it’s got on:



Hurray! We can see from the output that two blobs have been counted, and thus the word Yes has been correctly identified.

Our first slice is at data index 11427, which is roughly a quarter of the way along our Time axis. Its top 3 values put it above the threshold, so we increment our blob count (we’ve found the start of a blob).

Our blob ends at data index 20247, when all our top values fall below the threshold. However, we detect a second blob a couple of slices later, at index 24657.
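As a sanity check, the indexes above should sit a whole number of slices apart. A short sketch of the arithmetic, using only the slice size, sampling rate and indexes quoted in this post:

```python
SAMPLE_RATE = 44100
SLICE_SIZE = 2205

first_blob_start = 11427
first_blob_end = 20247
second_blob_start = 24657

# Number of slices the first blob lasted, and the quiet gap before the second
slices_in_first_blob = (first_blob_end - first_blob_start) // SLICE_SIZE
gap_slices = (second_blob_start - first_blob_end) // SLICE_SIZE

# The same spans expressed in seconds
first_blob_seconds = (first_blob_end - first_blob_start) / SAMPLE_RATE
gap_seconds = (second_blob_start - first_blob_end) / SAMPLE_RATE

print(slices_in_first_blob, gap_slices)
print(first_blob_seconds, gap_seconds)
```

So the first blob spans four slices, followed by two quiet slices before the second blob starts.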

It’s worth noting that our second blob only just scrapes above the threshold of 0.3. Maybe we need to lower the threshold?

Let’s see how the word No fares:



The word No has been correctly identified with one blob count. Perfect.

Our first slice is at data index 12811, almost 40% along the Time axis. We stay above the threshold for four slices before finally dipping at index 21631.

‘Did it work?’ Arkwood asked me, ‘Will I be able to talk dirty to my robot girlfriend?’

Not yet. But things are looking rosy. Previously I was able to use peaks in a graph to separate the words Yes and No. Now I have used vertical slices to tell the words apart.

‘Don’t worry,’ I replied, ‘You will soon seduce your fembot under the covers with a plethora of filthy words.’

Steam shot out of Arkwood’s ears and his face turned crimson. Saliva trickled out the corner of his mouth. Clearly he was thinking about android copulation. Goodness, his bony legs had turned to wobbly jelly.