An almighty crash awoke my bones from slumber. Putting the broadsheet to one side of the armchair, I rushed through to the kitchen with cheeks aflush. Arkwood, my troubled Belgian friend, stood upon the tiled floor with smashed crockery at his feet. Bare feet, I may add.

‘Don’t move!’ I screamed. ‘You will shred your soles.’

‘My soul is already shredded,’ he replied, a smart play on words for someone so forlorn.

I understood. You see, I had promised to build him a robot girlfriend to cast aside his blues. In my last post, Speech To Text using Python, I employed a few techniques to afford him conversation with his android sweetheart.

‘But it’s not enough!’ he wailed. ‘Your program only allows me to say three words to her, and even then it’s flaky.’

Brutal but true. My Python code can recognise just the words Yes, Maybe and No spoken into a microphone attached to my PC. If it’s to be of use to an extended vocabulary then surely it must grow. And weeds be scythed.

So let’s go beyond our time domain analysis of audio wav files, where we plotted Amplitude against Time. Let’s obtain the frequency content of the audio, plotting Power against Frequency.

First up, I ask Arkwood to say the word Yes into the microphone attached to my PC. The microphone in question is my new Rode NT-USB, which is a step up from an old Fostex M521 with a buggered right channel. My Audio with Python post has the Python code to record a snippet of voice to a single channel.

So, now that we have recorded wav files of the words Yes, Maybe and No, how do we get the frequency content? Sam Carcagno’s blog post provides the code along with a smashing explanation. There is also a useful Stack Overflow post, with code from HYRY that matches up.

Sam’s code makes use of the Fast Fourier Transform algorithm, yielding the magnitude of the frequency components.
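For the curious, here is a minimal sketch of the general idea. It is not Sam’s code, merely my own illustration using NumPy, and the function name power_spectrum is made up for the occasion. So that it runs without a recording to hand, I have assumed a pure 440 Hz tone as a stand-in for Arkwood’s voice:

```python
import numpy as np


def power_spectrum(samples, sample_rate):
    """Return (frequencies in Hz, power in dB) for a mono signal.

    Hypothetical helper for illustration only.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)

    # FFT for real input: only the non-negative frequencies come back.
    spectrum = np.fft.rfft(samples)

    # Magnitude of each frequency component, scaled by the number of
    # samples, then squared to give power.
    power = (np.abs(spectrum) / n) ** 2

    # Double the bins whose negative-frequency twin was dropped.
    # The DC bin (and the Nyquist bin when n is even) has no twin.
    if n % 2 == 0:
        power[1:-1] *= 2
    else:
        power[1:] *= 2

    # Frequency in Hz for each bin.
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)

    # Convert to decibels; the tiny floor avoids taking log10 of zero.
    power_db = 10 * np.log10(power + 1e-12)
    return freqs, power_db


# Assumed stand-in for a recorded word: one second of a 440 Hz tone.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

freqs, power_db = power_spectrum(tone, sample_rate)
peak_hz = freqs[np.argmax(power_db)]
print(peak_hz)
```

For a real recording, the samples and sample rate could be loaded from the wav file first, for instance with scipy.io.wavfile.read, and then freqs plotted against power_db to give a Power against Frequency graph.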

Here’s the word Yes spoken three times. Our left-hand side graph shows the Amplitude plotted against Time. On the right-hand side we’ve plotted our frequency content, Power against Frequency:

[Figures: Amplitude against Time and Power against Frequency graphs for the word Yes, recordings 1 to 3]

Here’s the word No spoken three times:

[Figures: Amplitude against Time and Power against Frequency graphs for the word No, recordings 1 to 3]

And here’s the word Maybe spoken three times:

[Figures: Amplitude against Time and Power against Frequency graphs for the word Maybe, recordings 1 to 3]

If any lunatics amongst you want to hear the words being uttered, I’ve cradled them gently in my repository.

‘Can I speak more words to my girlfriend yet?’ Arkwood asked me.

I showed him the frequency graphs, pointing to patterns that could separate the words. I told him that I would be writing some Python code to analyse the frequency domain data, much as I had done with the time domain data.

‘It’s always jam tomorrow with you!’ he snapped, storming out of the kitchen, leaving me to glue the plates. I shall toss them off the roof.