I promised Arkwood, my squalid Belgian buddy, that I would build him a robot girlfriend. In my last post, Text To Speech using Python, I wrote some Python code that allowed his girlfriend to speak to him. Today I shall let him speak to her.
Let’s keep things simple to start with. Arkwood will be able to respond to his girlfriend with the words Yes, Maybe or No.
Here’s the word Yes, spoken into a microphone attached to my PC:
Here’s the word Maybe:
And here’s the word No:
But how is my Python code going to tell these three words apart? Well, looking at the graphs, the words Yes and Maybe each have two blobs, whereas the word No has only one. Easy. We can identify the word No by counting the blobs.
But how can we tell the words Yes and Maybe apart? The second blob on the Maybe graph is cone-shaped, so let’s identify the word Maybe using pattern matching. Fantastic. We’ll be able to tell the three words apart.
Here’s the main program, which allows Arkwood to have a conversation with his girlfriend:
    from audioplay import AudioPlay
    from audiorecord import AudioRecord
    from audioanalysis import AudioAnalysis
    from constants import *
    from time import sleep

    audio_play = AudioPlay()
    audio_record = AudioRecord()
    audio_analysis = AudioAnalysis()

    while True:

        # girlfriend's question to Arkwood
        audio_play.text_to_speech("Do you love me")

        # record Arkwood's answer
        voice_file = audio_record.voice()

        # convert Arkwood's answer to text
        answer = audio_analysis.speech_to_text(voice_file)

        # girlfriend's emotional outpouring
        if answer == YES:
            audio_play.text_to_speech("I love you")
        elif answer == MAYBE:
            audio_play.text_to_speech("Please love me")
        elif answer == NO:
            audio_play.text_to_speech("I hate you")

        # give Arkwood a break before nagging him again
        sleep(30)
First up, Arkwood’s mechanical paramour asks him “Do you love me”, harnessing the code from my previous post. We are using Text To Speech to convert words to sound, playing her question through the computer speakers.
Next, Arkwood responds to her question by saying Yes, Maybe or No through the microphone. My post Audio with Python provides detail of how to record snippets of audio.
Now we get to the part that this post is all about – Speech To Text. We are analysing the recorded audio and detecting the word Yes, Maybe or No. More on this later.
Lastly, we check the answer Arkwood has given his girlfriend. If he responds Yes, his android sweetheart says “I love you” through the computer speakers. If he responds Maybe, she says “Please love me”. If he responds No, she says “I hate you”.
Time for a demo…
Cool. Debug output confirms the conversation between our two lovers (albeit the lover that doesn’t smell of cabbage doesn’t actually exist).
Converting speech to text
So how exactly did we analyse the word that Arkwood spoke through the microphone? Let’s take a look at the speech_to_text method of our AudioAnalysis class:
    from scipy.io import wavfile
    from pylab import *
    import numpy as np
    from constants import *

    class AudioAnalysis(object):

        LEVEL_THRESHOLD = 0.3
        SLICE_SIZE = 2205
        MAYBE_PATTERN = [3,2,1]

        def speech_to_text(self, wave_file):

            # read voice file and scale the samples (assumes 16-bit audio)
            rate, data = wavfile.read(wave_file)
            data = data / (2.**15)

            # get blob indices and count
            blob_indices = self._blob_indices(data)
            blob_count = len(blob_indices)

            # get pattern match, using the data from the start of the second blob
            is_pattern_match = False
            if blob_count == 2:
                is_pattern_match = self._is_pattern_match(data[blob_indices[1]:], self.MAYBE_PATTERN)

            # return speech to text
            if blob_count == 1:
                return NO
            elif blob_count == 2 and is_pattern_match:
                return MAYBE
            elif blob_count == 2 and not is_pattern_match:
                return YES
            else:
                return ""
First, we read the audio file that contains the word Arkwood spoke into the microphone.
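The division by 2.**15 is worth a second look. Here is a minimal sketch of that scaling step on its own, assuming the recording holds 16-bit signed samples (so raw values run from -32768 to 32767). The helper name scale_int16 is mine, purely for illustration, and it works on a plain list rather than the array that wavfile.read returns.

```python
def scale_int16(samples):
    """Scale raw 16-bit signed samples so they fall between -1.0 and 1.0."""
    # 2 ** 15 is 32768, one more than the largest positive 16-bit value
    return [s / 2.0 ** 15 for s in samples]


scaled = scale_int16([-32768, 0, 16384, 32767])
print(scaled[:3])  # the loudest negative sample maps to -1.0, half scale to 0.5
```

With the samples on this scale, a LEVEL_THRESHOLD of 0.3 means roughly 30% of full volume.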
Next, we pass the audio data into our _blob_indices method. The method will return the starting index of each blob detected in the word. Our blob count is simply the count of these indices.
If we have found two blobs then we pass our second blob into the _is_pattern_match method, along with the Maybe pattern we want to match on. The method will return True or False, depending on whether the pattern has been matched.
Great! With our blob count and pattern match in hand, we can tell the words Yes, Maybe and No apart. Our speech_to_text method will return the text representation of the word it has detected in the audio file.
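The decision itself boils down to a tiny lookup. As a rough sketch, here it is pulled out into a standalone function (the name classify is hypothetical and is not part of the AudioAnalysis class):

```python
YES, NO, MAYBE = "yes", "no", "maybe"


def classify(blob_count, is_pattern_match):
    """Map a blob count and a pattern-match flag to one of the three words."""
    if blob_count == 1:
        return NO
    if blob_count == 2:
        # two blobs: the cone-shaped pattern separates Maybe from Yes
        return MAYBE if is_pattern_match else YES
    # any other blob count is treated as an unrecognised word
    return ""


print(classify(2, True))
```

Any blob count other than one or two falls through to the empty string, just as in speech_to_text.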
Let’s take a gander at the private _blob_indices method:
    def _blob_indices(self, data):

        # initialize variables
        data_len = len(data)
        start_pos = 0
        end_pos = 0
        is_blob = False
        blob_indices = []

        # ignore silence before word
        for idx, val in enumerate(data):
            if val > self.LEVEL_THRESHOLD:
                end_pos = idx
                break

        # loop through our data, a slice at a time
        while True:

            # index the next slice
            start_pos = end_pos
            end_pos += self.SLICE_SIZE

            # bail out if no more slices
            if end_pos > data_len:
                break

            # get top peak in slice
            current_slice = data[start_pos:end_pos]
            top_peak = np.amax(current_slice)

            # get top peak at left of slice (integer division keeps the index whole)
            left_slice = data[(start_pos - (self.SLICE_SIZE//2)) : (start_pos - 100)]
            left_top_peak = np.amax(left_slice)

            # determine if blob detected
            blob_detected = top_peak > self.LEVEL_THRESHOLD and left_top_peak > self.LEVEL_THRESHOLD

            # if blob detected, record where the blob starts
            if blob_detected and not is_blob:
                blob_indices.append(start_pos)
                is_blob = True
            elif not blob_detected and is_blob:
                is_blob = False

        # return blob indices
        return blob_indices
After initializing some variables, we loop through the audio until we reach a threshold, which lets us find the starting point of the word Arkwood has uttered.
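Skipping the leading silence can be sketched in isolation like so. The function name first_loud_index is made up for this example, and it returns None when nothing crosses the threshold, whereas the method above simply carries on from index 0 in that case.

```python
def first_loud_index(samples, threshold=0.3):
    """Return the index of the first sample above the threshold, or None."""
    return next((idx for idx, val in enumerate(samples) if val > threshold), None)


print(first_loud_index([0.0, 0.1, 0.4, 0.2]))
```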
Next, we analyse the word a vertical slice at a time, until we detect the end of the file.
For each slice, we get the value of its highest peak. We also get the highest peak of a thinner slice to its left. If both these peaks are above the threshold then we can be sure that we have found a blob.
That’s it. We just keep looping through the audio, finding the start of blobs when the threshold is met (and the end of blobs when the threshold is not met).
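To see the slice-and-left-slice idea on a small scale, here is a simplified sketch that runs on a plain list. It is not the method above: the function name find_blob_starts and its parameters are invented for the example, the slices are only a few samples wide, and the left window sits directly against the slice rather than stopping 100 samples short of it.

```python
def find_blob_starts(samples, threshold=0.3, slice_size=4, left_size=2):
    """Return the start index of each run of loud slices (a blob)."""
    blob_starts = []
    in_blob = False
    for start in range(0, len(samples) - slice_size + 1, slice_size):
        current = samples[start:start + slice_size]
        # thin window immediately to the left of the current slice
        left = samples[max(start - left_size, 0):start]
        loud_here = max(current) > threshold
        loud_left = bool(left) and max(left) > threshold
        if loud_here and loud_left:
            if not in_blob:
                blob_starts.append(start)
                in_blob = True
        else:
            in_blob = False
    return blob_starts


two_blobs = [0.5] * 8 + [0.0] * 8 + [0.5] * 8
one_blob = [0.5] * 8
print(find_blob_starts(two_blobs), find_blob_starts(one_blob))
```

In this toy version a single isolated spike never starts a blob, because the window to its left is still quiet.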
Note: this blob detection algorithm is based on code I wrote for my post Voice recognition with Python (Mark II). Using a left slice has offered up better accuracy in blob detection.
What about pattern matching? Here’s the private _is_pattern_match method:
    def _is_pattern_match(self, data, pattern):

        # initialize variables
        data_len = len(data)
        start_pos = 0
        end_pos = 0
        blob_pattern = []

        # loop through our data, a slice at a time
        while True:

            # index the next slice
            start_pos = end_pos
            end_pos += self.SLICE_SIZE

            # bail out if no more slices
            if end_pos > data_len:
                break

            # update blob pattern with top peak in slice
            top_peak = np.amax(data[start_pos:end_pos])
            blob_pattern.append(int(top_peak*10))

        # return true if pattern found in blob, otherwise false
        return ''.join(map(str, pattern)) in ''.join(map(str, blob_pattern))
The method is passed the data from our second blob, along with the Maybe pattern we want to match on. The pattern is [3,2,1].
As with our _blob_indices method, we loop through the data a vertical slice at a time. By detecting the highest peak in each slice, we are able to build up a pattern of our second blob.
All that’s left to do is check whether the blob pattern contains the pattern we want to match on. If it does then we have found the cone-shaped Maybe word. Otherwise we have found the Yes word.
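Here is the same trick boiled down to plain lists, as a sketch rather than the real method. The helper names are hypothetical and the slice size is tiny so the numbers are easy to follow: each slice's top peak is turned into a single digit with int(peak * 10), and the digits are searched for the pattern as a substring.

```python
def peaks_per_slice(samples, slice_size):
    """Top peak of each complete slice, left to right."""
    last_start = len(samples) - slice_size
    return [max(samples[i:i + slice_size]) for i in range(0, last_start + 1, slice_size)]


def to_digits(peaks):
    """Turn each peak into one digit, e.g. 0.35 becomes '3'."""
    return "".join(str(int(p * 10)) for p in peaks)


def contains_pattern(samples, pattern, slice_size):
    """True if the digit pattern appears somewhere in the blob's digits."""
    digits = to_digits(peaks_per_slice(samples, slice_size))
    return "".join(map(str, pattern)) in digits


cone = [0.55, 0.1, 0.35, 0.2, 0.25, 0.1, 0.15, 0.05]   # tapers off like Maybe
flat = [0.55, 0.1, 0.55, 0.2, 0.55, 0.1, 0.55, 0.0]    # stays level
print(contains_pattern(cone, [3, 2, 1], 2), contains_pattern(flat, [3, 2, 1], 2))
```

The cone-shaped list yields the digits 5321, which contain 321. The flat one yields 5555, which do not.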
Putting speech to text through its paces
Well, that’s a whistle-stop tour of the code. My brain hurts now.
Let’s see how each of the graphs performs in our Speech To Text method, by examining some debug output. Here’s the word Yes:
We’ve found two blobs but have not matched the pattern. Yes has been correctly identified!
The word Maybe:
We’ve found two blobs and have matched the pattern. Maybe has been correctly identified!
And the word No:
We’ve found one blob. No has been correctly identified!
So there you have it. Arkwood is now able to whisper sweet nothings into the lobe of a soulless valentine.
We are pattern-matching on specific amplitude values. Going forward, I’d like to match on patterns that are independent of the actual volume.
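One possible way to do that, and this is only a sketch of an idea rather than anything the code above does, is to divide every peak by the loudest peak in the blob before turning it into a digit. The function names and the sample peak values below are invented for illustration.

```python
def absolute_pattern(peaks):
    """Digits based on raw amplitude, as in the current approach."""
    return [int(p * 10) for p in peaks]


def relative_pattern(peaks):
    """Digits based on each peak's size relative to the loudest peak."""
    top = max(peaks)
    if top <= 0:
        return [0] * len(peaks)
    # clamp to 0..9 so every slice still maps to a single digit
    return [min(max(int(p / top * 10), 0), 9) for p in peaks]


loud = [0.9, 0.62, 0.41, 0.23, 0.05]
quiet = [p * 0.5 for p in loud]   # same shape, spoken at half the volume

print(absolute_pattern(loud), absolute_pattern(quiet))
print(relative_pattern(loud), relative_pattern(quiet))
```

The raw digits differ between the loud and quiet versions, while the relative digits come out the same, so a pattern written in relative terms would not care how loudly Arkwood mumbles.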
Here’s the Constants file for our words Yes, Maybe and No:
    # constants
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"