
Arkwood, my morbid Belgian fiend, has long had a hankering for android romance. Indeed, to appease his pent-up frustrations, I built him his very own virtual girlfriend. The fembot made use of Google’s Speech To Text and Text To Speech services. But now Arkwood wants me to bring my development efforts in-house, so that he can get a more, well, custom experience.

Luckily, in my last post, Audio with Python, I began to play about with recording, analysing and playing sound files in Python. So, to get things up and running with his mechanical paramour, let’s see if we can analyse Arkwood’s voice as he whispers sweet nothings to her.

‘I only want to say Yes or No to her,’ Arkwood sniped, whilst watching snooker on the TV set, ‘I don’t have time for her wiffle-waffle.’

Fair enough, it makes my job easier. And everyone knows that other than Yes and No, all words of a language are basically superfluous. Try it for yourself. I functioned quite adequately in society for a whole month using simply Yes and No to all that was asked of me. I did not starve. I was still able to board buses. But, alas, I digress…

Here’s Arkwood saying the word Yes into the microphone attached to my PC, three times. I used the AudioRecord class from the previous post to record the voice samples:




And here’s Arkwood saying No three times:




We can see immediately that the word Yes is a different shape to No. Yes seems to be split into two blobs, and begins with a peak. The word No, on the other hand, is one blob that peaks nearer the middle.

Let’s start by analysing the peaks, to tell the difference between Yes and No. We will grab the highest peak, and then calculate the percentage of peaks to the left of it. Looking at the graphs, the word Yes should have a much lower percentage of left peaks than the word No.

Of course, this is all fairly crude. For a start, the highest peak may be a blip in the recording. And we’re not accounting for the different ways to say Yes and No, by different people. I’m not even sure my old Fostex microphone is up to the job. But it’s just to get the ball rolling.
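One possible way to soften the blip problem, which the code in this post does not do, is to smooth the signal before hunting for the highest peak. The sketch below is only an illustration under my own assumptions: the function name smoothed_peak_index, the 11-sample window and the made-up signal are all hypothetical, and it works on absolute amplitude rather than the raw signed values used in this post.

```python
import numpy as np

def smoothed_peak_index(data, window=11):
    # Moving average of the absolute amplitude (window size is an assumption)
    envelope = np.convolve(np.abs(data), np.ones(window) / window, mode="same")
    return int(np.argmax(envelope))

# Made-up signal: a one-sample blip near the start, then a sustained sound
signal = np.zeros(100)
signal[5] = 0.9        # the blip
signal[60:80] = 0.5    # the sustained part

raw_peak = int(np.argmax(signal))          # lands on the blip
smooth_peak = smoothed_peak_index(signal)  # lands inside the sustained part
print(raw_peak, smooth_peak)
```

The raw maximum falls on the blip, whereas the smoothed maximum falls inside the sustained sound, because a single loud sample contributes little once it is averaged with its quiet neighbours.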

Here’s the code that will be used to analyse the voice file:

from scipy.io import wavfile
from pylab import *

class AudioAnalysis(object):

    # percentage of left peaks below which the word is taken to be Yes
    PERCENTAGE_THRESHOLD = 30

    def is_yes(self, wave_file):

        # read voice file
        rate, data = wavfile.read(wave_file)
        data = data / (2.**15)

        # get peak indices (integer division keeps the slice index whole)
        peak_indices = (-data).argsort()[:len(data) // 5]
        top_peak_index = peak_indices[0]

        # count left peaks
        left_peaks = 0
        for index in peak_indices:
            if index < top_peak_index:
                left_peaks += 1

        # determine if voice is saying Yes or No
        left_peaks_percentage = left_peaks/float(len(peak_indices))*100

        if left_peaks_percentage < self.PERCENTAGE_THRESHOLD:
            return True

        return False

Our AudioAnalysis class has one method, is_yes. Dead simple really – it takes a path to our voice wav file and reads in its data and sampling rate. We then divide the data by 2 to the power of 15 which, for 16-bit samples, puts all values between -1 and 1.
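To see what that division does, here is a tiny sketch on a handful of made-up sample values. It assumes the recording is 16-bit PCM, which is what dividing by 2 to the power of 15 implies; the values themselves are invented for illustration.

```python
import numpy as np

# Made-up 16-bit PCM samples, standing in for what wavfile.read returns
samples = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

# Dividing by 2**15 maps the int16 range onto floats from -1.0 up to just under 1.0
normalised = samples / 2.0 ** 15
print(normalised)
```

The most negative sample becomes exactly -1, while the most positive one lands a hair below 1.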

Next we sort our data in descending order and grab the peaks in the graph (the top 20% of all audio samples by value). We also grab the top peak, whose index is the first element in the array.
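The sorting trick is easier to follow on a toy array. The ten values below are invented, but the two lines that pick the peaks are the same ones used in is_yes. Note that the sort is on the signed values, so a big negative swing such as -0.8 ranks last rather than counting as a peak.

```python
import numpy as np

# Ten made-up samples, already scaled to between -1 and 1
data = np.array([0.1, 0.9, -0.3, 0.4, 0.2, 0.05, 0.7, -0.8, 0.3, 0.6])

# argsort on the negated data lists indices from highest value to lowest;
# the slice keeps the top fifth of them (2 out of 10 here)
peak_indices = (-data).argsort()[:len(data) // 5]
top_peak_index = peak_indices[0]

print(peak_indices, top_peak_index)
```

Here the peaks are the samples at indices 1 and 6 (values 0.9 and 0.7), and the top peak is at index 1.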

Now we loop through all our peaks. If the index of a peak is to the left of our top peak then we increment our left_peaks count.

All that’s left to do is determine if Arkwood has said the word Yes or No. We calculate the percentage of peaks to the left of our highest peak – if it’s below a threshold of 30% then Arkwood has said the word Yes. Otherwise he’s said No.
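The whole idea can be tried out without a microphone. What follows is a rough sketch, not the class from this post: the helper name left_peak_percentage and the two synthetic shapes are my own assumptions, chosen to mimic a sound that peaks early with a long tail and a sound that peaks in the middle.

```python
import numpy as np

def left_peak_percentage(data, top_fraction=0.2):
    """Percentage of the highest-valued samples that sit before the very highest one."""
    count = max(1, int(len(data) * top_fraction))
    peak_indices = np.argsort(-data)[:count]
    top_peak_index = peak_indices[0]
    left_peaks = np.count_nonzero(peak_indices < top_peak_index)
    return 100.0 * left_peaks / len(peak_indices)

# Synthetic stand-ins for the two shapes (not real recordings):
# a sharp attack with a long tail, and a hump that peaks in the middle
yes_like = np.concatenate([np.linspace(0, 1, 100), np.linspace(1, 0, 900)])
no_like = np.concatenate([np.linspace(0, 1, 500), np.linspace(1, 0, 500)])

THRESHOLD = 30  # same 30% threshold as described above
for name, signal in [("yes_like", yes_like), ("no_like", no_like)]:
    percentage = left_peak_percentage(signal)
    print(name, round(percentage, 1), "Yes" if percentage < THRESHOLD else "No")
```

The early-peaking shape should come out at roughly 10% left peaks and the centred shape at roughly 50%, so they fall on opposite sides of the threshold. Real voices are far messier than these triangles, of course.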

Okay, time for a demo. I ask Arkwood to say Yes into the microphone and print out debug values:


Hurray! Our left peaks only account for 16% of all peaks on the graph – and since it’s below the threshold of 30%, we have correctly identified the word Yes. Take a look at the graph again – you can clearly see that most of the peaks happen to the right of the highest peak of 0.89.

Now I ask Arkwood to say the word No:


The highest peak occurs between 0.3 and 0.4 seconds, at a value of 0.78. The percentage of peaks to the left of the highest peak is 55%. The threshold of 30% is breached, and the word No has been correctly identified!

So there you have it. Arkwood can now have the most basic of conversations with his virtual girlfriend – which is just the way he wants it.

‘Don’t you fret my fiend, I will soon have a robotic Marilyn Monroe built to serve your every whim!’ I said.

Arkwood curled his lip and retorted, ‘I don’t want a peroxide! Give me a lush brunette any day.’

I quite forgot. Arkwood had a bad experience with Claire from the fishmongers, and ever since has had it with blondes.



Here’s the main program that will run our AudioRecord and AudioAnalysis classes:

from audiorecord import AudioRecord
from audioanalysis import AudioAnalysis
from time import sleep

audio_record = AudioRecord()
audio_analysis = AudioAnalysis()

while True:
    # girlfriend's question to Arkwood
    print("Do you still love me?")

    # record Arkwood's response
    voice_file = audio_record.voice()

    # inspect response to determine if Yes or No
    is_yes = audio_analysis.is_yes(voice_file)

    # Arkwood's answer to girlfriend
    if is_yes:
        print("Darling, I am madly in love with you!")
    else:
        print("Snooker's on TV. Can't you bother me later?")
    # give Arkwood a break before nagging him again
    # (the pause length is not given here; 5 seconds is an assumed value)
    sleep(5)

And the code to plot our graphs:

time = np.arange(len(data))*1.0/rate      
plt.plot(time, data)
plt.xlabel('Time (sec)')