In my last post, Voice recognition with Python, I wrote some Python code that could detect the difference between the words Yes and No when spoken into a microphone. Here’s the word Yes, plotted:
And here’s the word No:
The code found the highest peak on the graph and then counted all the peaks to the left of it. As you can see, the word Yes has fewer left peaks than the word No. It was a fairly crude method, but it worked quite well.
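For reference, the previous post's approach can be sketched like this. This is a minimal reconstruction, not the original code: it leans on scipy's find_peaks, and the function name and toy signal are my own.

```python
import numpy as np
from scipy.signal import find_peaks

def left_peak_count(data, height=0.3):
    """Count the peaks to the left of the highest peak
    (a sketch of the previous post's crude method)."""
    peaks, _ = find_peaks(data, height=height)
    if len(peaks) == 0:
        return 0
    # index of the tallest peak in the signal
    highest = peaks[np.argmax(data[peaks])]
    return int(np.sum(peaks < highest))

# two bumps: the second is taller, so one peak lies to its left
signal = np.array([0.0, 0.5, 0.0, 0.9, 0.0])
print(left_peak_count(signal))  # 1
```

Fewer left peaks meant Yes, more meant No. It worked, but only just, which is why we need something sturdier.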
‘But I need her to understand more than Yes and No,’ Arkwood moaned. You see, I promised I would build a robot girlfriend for him. And now I need a more robust method to extend her vocabulary, so that she can swoon at my buddy’s lewd utterances through the microphone.
Okay, let’s take another approach. Consider the following code:
```python
from scipy.io import wavfile
import numpy as np


class AudioAnalysis(object):

    LEVEL_THRESHOLD = 0.3
    SLICE_SIZE = 2205
    YES_BLOBCOUNT = 2
    NO_BLOBCOUNT = 1

    def is_yes(self, wave_file):

        # read voice file
        rate, data = wavfile.read(wave_file)
        data = data / (2.**15)

        # initialise variables
        data_len = len(data)
        start_pos = 0
        end_pos = 0
        is_blob = False
        blob_count = 0

        # ignore silence before word
        for idx, val in enumerate(data):
            if val > self.LEVEL_THRESHOLD:
                end_pos = idx
                break

        # loop through our data, a slice at a time
        while True:

            # index the next slice
            start_pos = end_pos
            end_pos += self.SLICE_SIZE

            # bail out if no more slices
            if end_pos > data_len:
                break

            # get top 3 values in slice
            top_vals = np.sort(data[start_pos:end_pos])[-3:]

            # if threshold breached, increment the blob count
            if np.all(top_vals > self.LEVEL_THRESHOLD) and not is_blob:
                blob_count += 1
                is_blob = True
            elif np.all(top_vals < self.LEVEL_THRESHOLD) and is_blob:
                is_blob = False

        if blob_count == self.YES_BLOBCOUNT:
            return True
        elif blob_count == self.NO_BLOBCOUNT:
            return False
        else:
            return None
```
The is_yes method receives a wav audio file, which holds the word that Arkwood has spoken into the microphone attached to my PC. It reads the audio file as per the previous post, assigning the data and sampling rate to variables and converting the data to a range from -1 to 1.
Here’s where things change. If you take a gander at the Yes graph above, you’ll see that it comprises two blobs. The word No, on the other hand, has just the one blob. What would happen if we took vertical slices of our graph – could we use those slices to calculate the number of blobs? Let’s give it a go!
Once we have initialized some variables, we loop through our data until we find the start of the word. After all, we don’t want to be doing our calculations on the sound of silence.
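The silence-skipping loop just finds the first sample above the threshold. As an aside, the same thing can be done without an explicit Python loop; the vectorised version below is my own variation, not part of the post's code:

```python
import numpy as np

LEVEL_THRESHOLD = 0.3

# some "silence" followed by the start of a word
data = np.concatenate([np.zeros(100), np.array([0.1, 0.4, 0.5])])

# loop version, as in the code above
start = 0
for idx, val in enumerate(data):
    if val > LEVEL_THRESHOLD:
        start = idx
        break

# vectorised equivalent: argmax returns the index of the first True
start_vec = int(np.argmax(data > LEVEL_THRESHOLD))

print(start, start_vec)  # both 101
```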
Now we drop into a while loop. First up, we take a slice of data (the size of the slice is 2205, which is a fraction of the sampling rate 44100).
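To put that slice size in context, here's the arithmetic: at a sampling rate of 44100 samples per second, a 2205-sample slice covers 50 milliseconds of audio, so we chop the word into 20 slices per second.

```python
rate = 44100        # samples per second
slice_size = 2205   # samples per slice

slice_duration_ms = slice_size * 1000 / rate
slices_per_second = rate // slice_size

print(slice_duration_ms)   # 50.0
print(slices_per_second)   # 20
```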
After checking that we have not come to the end of our data, we sort our slice of data and grab the top 3 values. These top values are the peaks in the vertical slice of graph we are working on – and having 3 top values gives us confidence that we are not dealing with a single blip in the sound, but instead an actual word.
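Since np.sort orders the slice ascending, indexing with [-3:] grabs the three largest values. A quick demonstration with toy numbers (not real audio) shows why three peaks beat one: a lone spike can't fool us.

```python
import numpy as np

LEVEL_THRESHOLD = 0.3

# a slice with one stray spike but otherwise quiet audio
spiky_slice = np.array([0.01, 0.9, 0.02, 0.01, 0.03])

# a slice where the signal is genuinely loud
loud_slice = np.array([0.45, 0.38, 0.02, 0.51, 0.40])

top_spiky = np.sort(spiky_slice)[-3:]
top_loud = np.sort(loud_slice)[-3:]

print(np.all(top_spiky > LEVEL_THRESHOLD))  # False - a lone blip doesn't qualify
print(np.all(top_loud > LEVEL_THRESHOLD))   # True  - three strong peaks
```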
Finally, we check if all our top values are above a threshold of 0.3. If they are, then we have found a blob! We update our blob count and set is_blob to True (this is important, as we don’t want to recount the blob on the next loop). If all our top values are below the threshold then we are no longer in a blob, and so set is_blob to False (ready for detecting the next blob).
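The is_blob flag makes this a tiny two-state machine: rising above the threshold enters a blob (counted once), falling below it exits, ready for the next one. Here's that logic isolated into a standalone function, fed pre-computed per-slice top values rather than raw audio. The function and the toy slice data are my own simplification:

```python
import numpy as np

LEVEL_THRESHOLD = 0.3

def count_blobs(slice_tops, threshold=LEVEL_THRESHOLD):
    """Count blobs, where slice_tops[i] holds the top values
    of slice i (a simplified stand-in for the real slices)."""
    is_blob = False
    blob_count = 0
    for top_vals in slice_tops:
        if np.all(np.asarray(top_vals) > threshold) and not is_blob:
            blob_count += 1
            is_blob = True
        elif np.all(np.asarray(top_vals) < threshold) and is_blob:
            is_blob = False
    return blob_count

# Yes-like shape: loud, quiet, loud again -> two blobs
yes_like = [[0.5, 0.4, 0.6], [0.1, 0.1, 0.2], [0.4, 0.5, 0.4]]
# No-like shape: one sustained burst -> one blob
no_like = [[0.5, 0.4, 0.6], [0.6, 0.5, 0.4], [0.1, 0.1, 0.1]]

print(count_blobs(yes_like))  # 2
print(count_blobs(no_like))   # 1
```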
That’s it. Our is_yes method will use the blob count to determine whether the word is Yes or No (or neither, in which case it returns the value None).
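We can sanity-check the whole pipeline without a microphone by writing a synthetic wav file with two bursts of sound. The condensed blob_count function below re-runs the same slicing logic as is_yes (returning the raw count rather than True/False/None); the file name and synthetic waveform are made up for the test.

```python
import numpy as np
from scipy.io import wavfile

LEVEL_THRESHOLD = 0.3
SLICE_SIZE = 2205
RATE = 44100

def blob_count(wave_file):
    """Condensed re-run of the is_yes slicing logic,
    returning the raw blob count."""
    rate, data = wavfile.read(wave_file)
    data = data / (2.**15)
    # skip leading silence: index of first sample above threshold
    end_pos = int(np.argmax(data > LEVEL_THRESHOLD))
    is_blob, count = False, 0
    while True:
        start_pos, end_pos = end_pos, end_pos + SLICE_SIZE
        if end_pos > len(data):
            break
        top_vals = np.sort(data[start_pos:end_pos])[-3:]
        if np.all(top_vals > LEVEL_THRESHOLD) and not is_blob:
            count += 1
            is_blob = True
        elif np.all(top_vals < LEVEL_THRESHOLD) and is_blob:
            is_blob = False
    return count

# build a synthetic 'Yes': two loud bursts separated by silence
burst = np.full(SLICE_SIZE, 0.5)
quiet = np.zeros(SLICE_SIZE)
fake_yes = np.concatenate([quiet, burst, quiet, burst])
wavfile.write('fake_yes.wav', RATE, (fake_yes * 2**15).astype(np.int16))

print(blob_count('fake_yes.wav'))  # 2
```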
Time for a demo. I ask Arkwood to say the word Yes into the microphone. The program will output some values, to let me know how it’s got on:
Hurray! We can see from the output that two blobs have been counted, and thus the word Yes has been correctly identified.
Our first slice is at data index 11427, which is roughly a quarter of the way along our Time axis. Its top 3 values put it above the threshold, so we increment our blob count (we’ve found the start of a blob).
Our blob ends at data index 20247, when all our top values fall below the threshold. However, we detect a second blob a couple of slices later, at index 24657.
It’s worth noting that our second blob only just scrapes above the threshold of 0.3. Maybe we need to lower the threshold?
Let’s see how the word No fares:
The word No has been correctly identified with one blob count. Perfect.
Our first slice is at data index 12811, almost 40% along the Time axis. We stay above the threshold for four slices before finally dipping at index 21631.
‘Did it work?’ Arkwood asked me. ‘Will I be able to talk dirty to my robot girlfriend?’
Not yet. But things are looking rosy. Previously I was able to use peaks in a graph to separate the words Yes and No. Now I have used vertical slices to tell the words apart.
‘Don’t worry,’ I replied, ‘You will soon seduce your fembot under the covers with a plethora of filthy words.’
Steam shot out of Arkwood’s ears and his face turned crimson. Saliva trickled out the corner of his mouth. Clearly he was thinking about android copulation. Goodness, his bony legs had turned to wobbly jelly.