
I promised Arkwood, my squalid Belgian buddy, that I would build him a robot girlfriend. In my last post, Text To Speech using Python, I wrote some Python code that allowed his girlfriend to speak to him. Today I shall let him speak to her.

The conversation

Let’s keep things simple to start with. Arkwood will be able to respond to his girlfriend with the words Yes, Maybe or No.

Here’s the word Yes, spoken into a microphone attached to my PC:


Here’s the word Maybe:


And here’s the word No:


But how is my Python code going to tell apart these three words? Well, looking at the graphs, the words Yes and Maybe have two blobs whereas the word No has one blob. Easy. We can identify the word No by counting the blobs.

But how can we tell apart the words Yes and Maybe? The second blob on the Maybe graph is cone-shaped, so let’s identify the word Maybe using pattern matching. Fantastic. We’ll be able to tell apart the three words.

Here’s the main program, which allows Arkwood to have a conversation with his girlfriend:

from audioplay import AudioPlay
from audiorecord import AudioRecord
from audioanalysis import AudioAnalysis
from constants import *
from time import sleep

audio_play = AudioPlay()
audio_record = AudioRecord()
audio_analysis = AudioAnalysis()

while True:
    # girlfriend's question to Arkwood
    audio_play.text_to_speech("Do you love me")

    # record Arkwood's answer
    voice_file = audio_record.voice()

    # convert Arkwood's answer to text
    answer = audio_analysis.speech_to_text(voice_file)

    # girlfriend's emotional outpouring
    if answer == YES:
        audio_play.text_to_speech("I love you")
    elif answer == MAYBE:
        audio_play.text_to_speech("Please love me")
    elif answer == NO:
        audio_play.text_to_speech("I hate you")

    # give Arkwood a break before nagging him again
    sleep(5)  # pause length (5 seconds) is an assumed value

First up, Arkwood’s mechanical paramour asks him “Do you love me”, harnessing the code from my previous post. We are using Text To Speech to convert words to sound, playing her question through the computer speakers.

Next, Arkwood responds to her question by saying Yes, Maybe or No through the microphone. My post Audio with Python goes into detail on how to record snippets of audio.

Now we get to the part that this post is all about – Speech To Text. We are analysing the recorded audio and detecting the word Yes, Maybe or No. More on this later.

Lastly, we check the answer Arkwood has given his girlfriend. If he responds Yes, his android sweetheart says “I love you” through the computer speakers. If he responds Maybe, she says “Please love me”. If he responds No, she says “I hate you”.

Time for a demo…


Cool. Debug output confirms the conversation between our two lovers (albeit the lover that doesn’t smell of cabbage doesn’t actually exist).

Converting speech to text

So how exactly did we analyse the word that Arkwood spoke through the microphone? Let’s take a look at the speech_to_text method of our AudioAnalysis class:

from scipy.io import wavfile
from pylab import *
from constants import *

class AudioAnalysis(object):
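    SLICE_SIZE = 2205
    LEVEL_THRESHOLD = 0.1  # amplitude cut-off for "loud"; 0.1 is an assumed value, tune it for your microphone
    MAYBE_PATTERN = [3,2,1]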

    def speech_to_text(self, wave_file):
        # read voice file
        rate, data = wavfile.read(wave_file)
        data = data / (2.**15)
        # get blob indices and count
        blob_indices = self._blob_indices(data)
        blob_count = len(blob_indices)

        # get pattern match
        is_pattern_match = False
        if blob_count == 2:
            is_pattern_match = self._is_pattern_match(data[blob_indices[1]:], self.MAYBE_PATTERN)

        # return speech to text
        if blob_count == 1:
            return NO
        elif blob_count == 2 and is_pattern_match:
            return MAYBE
        elif blob_count == 2 and not is_pattern_match:
            return YES

        # no word recognised
        return ""

First, we read the audio file that contains the word Arkwood spoke into the microphone.

Next, we pass the audio data into our _blob_indices method. The method will return the starting index of each blob detected in the word. Our blob count is simply the count of these indices.

If we have found two blobs then we pass our second blob into the _is_pattern_match method, along with the Maybe pattern we want to match on. The method will return True or False, depending on whether the pattern has been matched.

Great! With our blob count and pattern match in hand, we can tell the words Yes, Maybe and No apart. Our speech_to_text method will return the text representation of the word it has detected in the audio file.
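A quick aside on that division by 2.**15. Assuming the recording is 16-bit PCM (the post does not say so explicitly), each sample is an integer between -32768 and 32767, so dividing by 2**15 squeezes it into the range -1.0 to just under 1.0. Here is a minimal sketch of the same read-and-scale step using only Python's standard wave and struct modules rather than scipy. The read_normalised helper and the demo file are made up for illustration.

```python
import os
import struct
import tempfile
import wave


def read_normalised(path):
    """Read a mono 16-bit WAV file and scale its samples into [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "this sketch handles 16-bit samples only"
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # unpack little-endian signed 16-bit integers
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, [s / 2.0 ** 15 for s in samples]


# write a tiny four-sample demo file, then read it back
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(struct.pack("<4h", 0, 16384, -32768, 32767))

rate, data = read_normalised(path)
print(rate, data)
```

If the recording were sampled at 44,100 Hz, as in this demo file, a SLICE_SIZE of 2205 samples would cover 50 milliseconds of audio.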

Blob detection

Let’s take a gander at the private _blob_indices method:

def _blob_indices(self, data):

    # initialize variables
    data_len = len(data)
    start_pos = 0
    end_pos = 0
    is_blob = False
    blob_indices = []

    # ignore silence before word
    for idx, val in enumerate(data):
        if val > self.LEVEL_THRESHOLD:
            end_pos = idx
            break

    # loop through our data, a slice at a time
    while True:
        # index the next slice 
        start_pos = end_pos
        end_pos += self.SLICE_SIZE

        # bail out if no more slices
        if end_pos > data_len:
            break

        # get top peak in slice
        slice = data[start_pos:end_pos]
        top_peak = np.amax(slice) 
        # get top peak at left of slice (floor division keeps the index an integer)
        left_slice = data[(start_pos - (self.SLICE_SIZE // 2)) : (start_pos - 100)]
        left_top_peak = np.amax(left_slice)

        # determine if blob detected
        blob_detected = top_peak > self.LEVEL_THRESHOLD and left_top_peak > self.LEVEL_THRESHOLD

        # if blob detected, record the index where the blob starts
        if blob_detected and not is_blob:
            is_blob = True
            blob_indices.append(start_pos)
        elif not blob_detected and is_blob:
            is_blob = False

    # return blob indices
    return blob_indices

After initializing some variables, we loop through the audio until we reach a threshold, which lets us find the starting point of the word Arkwood has uttered.

Next, we analyse the word a vertical slice at a time, until we detect the end of the file.

For each slice, we get the value of its highest peak. We also get the highest peak of a thinner slice to its left. If both of these peaks are above the threshold, we treat the slice as part of a blob.

That’s it. We just keep looping through the audio, finding the start of blobs when the threshold is met (and the end of blobs when the threshold is not met).

Note: this blob detection algorithm is based on code I wrote for my post Voice recognition with Python (Mark II). Using a left slice has offered up better accuracy in blob detection.
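To see the idea in isolation, here is a stripped-down sketch of slice-based blob detection on a made-up signal. It is not the method above: it uses plain Python lists, skips the left-slice check, and the blob_starts name, slice size and threshold are all invented for the demo.

```python
def blob_starts(samples, slice_size, threshold):
    """Return the start index of each run of loud slices (a "blob")."""
    # skip leading silence: find the first sample above the threshold
    pos = next((i for i, v in enumerate(samples) if v > threshold), len(samples))

    starts = []
    in_blob = False
    # walk the signal one full slice at a time
    while pos + slice_size <= len(samples):
        loud = max(samples[pos:pos + slice_size]) > threshold
        if loud and not in_blob:
            starts.append(pos)  # quiet-to-loud transition: a new blob begins
        in_blob = loud
        pos += slice_size
    return starts


# made-up signals: bursts of sound separated by silence
burst = [0.5] * 40
quiet = [0.0] * 40
two_blobs = quiet + burst + quiet + burst + quiet
one_blob = quiet + burst + quiet

print(blob_starts(two_blobs, slice_size=10, threshold=0.1))  # [40, 120]
print(blob_starts(one_blob, slice_size=10, threshold=0.1))   # [40]
```

Two bursts give two start indices and one burst gives one, which is the blob count that speech_to_text relies on.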

Pattern matching

What about pattern matching? Here’s the private _is_pattern_match method:

def _is_pattern_match(self, data, pattern):
    # initialize variables
    data_len = len(data)
    start_pos = 0
    end_pos = 0
    blob_pattern = []

    # loop through our data, a slice at a time
    while True:
        # index the next slice 
        start_pos = end_pos
        end_pos += self.SLICE_SIZE

        # bail out if no more slices
        if end_pos > data_len:
            break

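        # update blob pattern with top peak in slice
        # (assumption: the peak is scaled to whole tenths, so a peak of 0.3 becomes 3)
        top_peak = np.amax(data[start_pos:end_pos])
        blob_pattern.append(int(top_peak * 10))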

    # return true if pattern found in blob, otherwise false
    return ''.join(map(str, pattern)) in ''.join(map(str, blob_pattern))

The method is passed the data from our second blob, along with the Maybe pattern we want to match on. The pattern is [3,2,1].

As with our _blob_indices method, we loop through the data a vertical slice at a time. By detecting the highest peak in each slice, we are able to build up a pattern of our second blob.

All that’s left to do is check whether the blob pattern contains the pattern we want to match on. If it does then we have found the cone-shaped Maybe word. Otherwise we have found the Yes word.
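Here is a small self-contained sketch of that matching step on made-up data. The helper names are hypothetical, and turning each peak into a whole number of tenths is an assumption about how amplitudes line up with a pattern like [3,2,1]. The sketch compares lists directly rather than joined strings.

```python
def slice_peaks(samples, slice_size):
    """Highest sample in each full slice of the signal."""
    return [max(samples[i:i + slice_size])
            for i in range(0, len(samples) - slice_size + 1, slice_size)]


def quantise(peak, steps=10):
    # assumed quantisation: an amplitude in [0, 1] becomes an integer level
    return int(peak * steps)


def contains_pattern(levels, pattern):
    """True if `pattern` appears as a contiguous run inside `levels`."""
    n = len(pattern)
    return any(levels[i:i + n] == pattern for i in range(len(levels) - n + 1))


# made-up blobs: one tapers off like a cone, the other stays flat
cone = [0.35] * 10 + [0.25] * 10 + [0.15] * 10
flat = [0.35] * 30

cone_levels = [quantise(p) for p in slice_peaks(cone, 10)]
flat_levels = [quantise(p) for p in slice_peaks(flat, 10)]

print(cone_levels, contains_pattern(cone_levels, [3, 2, 1]))  # [3, 2, 1] True
print(flat_levels, contains_pattern(flat_levels, [3, 2, 1]))  # [3, 3, 3] False
```

Comparing lists sidesteps a quirk of joining digits into a string: the levels [3, 21] and [32, 1] both join to "321", so a string search could report a match that is not really there once any level reaches two digits.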

Putting speech to text through its paces

Well, that’s a whistle-stop tour of the code. My brain hurts now.

Let’s see how each of the graphs performs in our Speech To Text method, by examining some debug output. Here’s the word Yes:


We’ve found two blobs but have not matched the pattern. Yes has been correctly identified!

The word Maybe:



We’ve found two blobs and have matched the pattern. Maybe has been correctly identified!

And the word No:


We’ve found one blob. No has been correctly identified!

So there you have it. Arkwood is now able to whisper sweet nothings into the lobe of a soulless valentine.



We are pattern-matching on specific amplitude values. Going forward, I’d like to match on patterns that are independent of the actual volume.
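One possible direction, sketched here purely as an idea and not something the code above does: compare each slice's peak with the peak before it, so that only the shape of the blob matters. The has_falling_run name and the 0.8 ratio are invented for illustration.

```python
def has_falling_run(peaks, run_length=3, max_ratio=0.8):
    """True if `run_length` consecutive peaks each drop to at most
    `max_ratio` of the peak before them."""
    run = 1
    for prev, cur in zip(peaks, peaks[1:]):
        # the prev > 0 check stops a stretch of pure silence counting as a fall
        if prev > 0 and cur <= prev * max_ratio:
            run += 1
            if run >= run_length:
                return True
        else:
            run = 1
    return False


loud_cone = [0.9, 0.6, 0.3]   # a tapering blob, spoken loudly
soft_cone = [0.3, 0.2, 0.1]   # the same shape, spoken softly
flat_blob = [0.5, 0.5, 0.5]   # no taper at all

print(has_falling_run(loud_cone))  # True
print(has_falling_run(soft_cone))  # True
print(has_falling_run(flat_blob))  # False
```

Because the test is a ratio between neighbouring peaks, multiplying the whole recording by a constant leaves the result unchanged.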

Here’s the Constants file for our words Yes, Maybe and No:

# constants
YES = "yes"
NO = "no"
MAYBE = "maybe"