
Arkwood is in a rage. You see, my sordid Belgian buddy is in love with Daphne, the plump spotty girl who works down the chippy. But, alas, she has fallen for Wayne, the deep fat frier.

‘That horrid scoundrel is plying Daphne with free cod and chips,’ Arkwood cried, ‘and now her Facebook page is full of pictures of them together!’

It is a sad tale. So when he asked me to help in removing Wayne from the photographs, I dutifully agreed. I unpacked my Raspberry Pi computer and began writing some Python code.

‘Okay,’ I said to him. ‘Imagine that the following picture is of Wayne and Daphne at a Ramones concert:’

Of course, it’s actually a picture of The Dude from the film, The Big Lebowski, and a Japanese maneki-neko. But let’s pretend it’s Wayne and Daphne.

Here’s what the code does. It loads up the image and then uses voice commands, via a microphone attached to the Raspberry Pi, to edit the image.

I say the word ‘right’ into the microphone and an editing box moves right on the image:


I say ‘right’ again and it moves a little more:


One more ‘right’ and I have the box exactly where I want it:


Now I say ‘cut’ and the code uses OpenCV GrabCut to extract the boxed foreground from the image:


Hurray! Wayne is no longer in the picture.

All that’s left to do is collect the final picture and present it to Arkwood:


‘What do you think?’ I asked him.

‘That’s fantastic!’ he enthused, spittle collecting on his bottom lip. ‘With a single imperial decree, I can remove Wayne from all her Facebook snaps!’

It’s true: using a keyboard and mouse for such things can feel quite pathetic.



Here’s a screenshot of my Raspberry Pi, running the code:


If you click on the image, you’ll see that Google got my command for GrabCut wrong in the first instance. Presumably it thought I said ‘cunt’ rather than ‘cut’.

Here’s the code. First up, the main program:

import cv2
from datetime import datetime
from speech import Speech
from grabcut import GrabCut

speech = Speech()
grabcut = GrabCut()

# voice commands
LEFT = "left"
RIGHT = "right"
GRABCUT = "cut"

# constants
PIXEL_SHIFT = 50 # pixels to move the box per command

# initialise variables
img = cv2.imread('images/wayne_and_daphne.jpg')
img_height, img_width, _ = img.shape
left_bar = 0
right_bar = img_width // 2 # integer division keeps the rectangle coordinates whole
rect = (left_bar,0,right_bar,img_height)

# loop forever
while True:

    # get voice command from Arkwood
    speech.text_to_speech("Please command me")
    command = speech.speech_to_text('/home/pi/PiAUISuite/VoiceCommand/speech-recog.sh')

    # handle left command
    if command == LEFT and left_bar - PIXEL_SHIFT >= 0:
        left_bar -= PIXEL_SHIFT
        right_bar -= PIXEL_SHIFT
        rect = (left_bar,0,right_bar,img_height)

    # handle right command
    elif command == RIGHT and right_bar + PIXEL_SHIFT <= img_width:
        left_bar += PIXEL_SHIFT
        right_bar += PIXEL_SHIFT
        rect = (left_bar,0,right_bar,img_height)

    # handle grabcut
    elif command == GRABCUT:
        img = grabcut.update_image(img,rect)

    # show and save 'edit mode' image, with the editing box drawn on
    editImg = img.copy()
    cv2.rectangle(editImg, (left_bar,0), (right_bar,img_height), (0,255,0), 2)
    cv2.imshow("Image commander",editImg)
    cv2.waitKey(1) # give the window a chance to refresh
    cv2.imwrite('images/{}.jpg'.format(datetime.now().strftime('%Y%m%d_%Hh%Mm%Ss%f')), editImg)

    # save image
    cv2.imwrite('images/just_daphne.jpg', img) 

To start with, we define some constants for our voice commands. We can move the box on the image left and right, and we can request OpenCV GrabCut.

After defining a PIXEL_SHIFT constant, which determines how far the box shifts on each command, we load the original image of Wayne and Daphne and initialise some variables.

Once in the while loop, we use Google’s Text To Speech service to ask Arkwood for his next command, and then use Google’s Speech To Text service to fetch the words he has uttered into the microphone.

If the command matches our LEFT constant, then we shift the box on the image left. Likewise, if the command is RIGHT, we shift the box right. If the command is GRABCUT, we do that instead.
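The boundary checks matter: a ‘left’ command at the image’s left edge, or a ‘right’ at its right edge, is silently ignored, so the box can never wander off the picture. Distilled as a standalone function (the function name and default shift value are mine, for illustration):

```python
def shift_box(left_bar, right_bar, img_width, command, pixel_shift=50):
    """Shift the editing box left or right, refusing to cross the image edges."""
    if command == "left" and left_bar - pixel_shift >= 0:
        left_bar -= pixel_shift
        right_bar -= pixel_shift
    elif command == "right" and right_bar + pixel_shift <= img_width:
        left_bar += pixel_shift
        right_bar += pixel_shift
    return left_bar, right_bar

print(shift_box(0, 320, 640, "right"))  # box moves: (50, 370)
print(shift_box(0, 320, 640, "left"))   # blocked at the left edge: (0, 320)
```

An unrecognised command (or one of Google’s more creative mishearings) simply falls through and leaves the box where it was.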

We take a copy of the image, so that we can draw the box on it and display it in a window for Arkwood (as well as saving it to disk). We also save the original image to disk, minus Wayne.

Here’s the Speech class, which takes care of our communication with Google’s Text To Speech and Speech To Text services:

from subprocess import Popen, PIPE, call
import urllib

class Speech(object):

    # converts speech to text
    def speech_to_text(self, filepath):
        try:
            # utilise PiAUISuite to turn speech into text
            text = Popen(['sudo', filepath], stdout=PIPE).communicate()[0]

            # tidy up text
            text = text.replace('"', '').strip()

            return text
        except:
            print ("Error translating speech")

    # converts text to speech
    def text_to_speech(self, text):
        try:
            # truncate text as google only allows 100 chars
            text = text[:100]

            # encode the text
            query = urllib.quote_plus(text)

            # build endpoint
            endpoint = "http://translate.google.com/translate_tts?tl=en&q=" + query

            # get google to translate and mplayer to play
            call(["mplayer", endpoint], shell=False, stdout=PIPE, stderr=PIPE)
        except:
            print ("Error translating text")

And finally, our GrabCut class:

import numpy as np
import cv2

class GrabCut(object):

    # updates image with grabcut
    def update_image(self, img, rect):

        # mask and models required by the GrabCut algorithm
        mask = np.zeros(img.shape[:2],np.uint8)
        bgdModel = np.zeros((1,65),np.float64)
        fgdModel = np.zeros((1,65),np.float64)

        # run GrabCut, initialised with our rectangle
        cv2.grabCut(img,mask,rect,bgdModel,fgdModel,5,cv2.GC_INIT_WITH_RECT)

        # collapse the mask to binary and zero out the background pixels
        mask2 = np.where((mask==2)|(mask==0),0,1).astype('uint8')
        img = img*mask2[:,:,np.newaxis]

        return img
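For what it’s worth, the mask that cv2.grabCut fills in uses four labels (0 definite background, 1 definite foreground, 2 probable background, 3 probable foreground), and the np.where line collapses those down to a binary mask before multiplying it into the image. A NumPy-only illustration of that last step, using a toy mask:

```python
import numpy as np

# a toy 2x2 GrabCut mask: one pixel of each label
mask = np.array([[0, 1],
                 [2, 3]], dtype=np.uint8)

# labels 0 and 2 (background) become 0; labels 1 and 3 (foreground) become 1
mask2 = np.where((mask == 2) | (mask == 0), 0, 1).astype('uint8')
print(mask2.tolist())  # [[0, 1], [0, 1]]
```

Multiplying the image by this mask leaves foreground pixels untouched and turns everything else black.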

Now, a final word from the Dude.