Lately, I have been experimenting with multimodal interaction on a Raspberry Pi (referred to here as an IoT edge device) driven by deep learning predictions. In this post, I describe how I built a pipeline that combines voice interaction (speech decoding and speech output) with the Raspberry Pi camera to identify fruits.
The FID30 image dataset contains images of 30 fruit categories. I augmented these images with transformations such as a 90-degree rotation and a negative (inverted-color) transformation. Where possible, I also added further images to the dataset and applied the same transformations to them.
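The two augmentations mentioned above can be sketched in a few lines of NumPy. This is only an illustration, assuming that the 90 degree transformation is a rotation and that the negative transformation inverts 8-bit pixel values; the function names and the tiny sample array are hypothetical and not taken from the original pipeline.

```python
import numpy as np

def rotate_90(image: np.ndarray) -> np.ndarray:
    """Rotate an H x W x C image by 90 degrees counter-clockwise."""
    return np.rot90(image, k=1, axes=(0, 1))

def negative(image: np.ndarray) -> np.ndarray:
    """Invert an 8-bit image: each pixel value v becomes 255 - v."""
    return 255 - image

def augment(image: np.ndarray) -> list:
    """Return the original image plus its rotated and negative variants."""
    return [image, rotate_90(image), negative(image)]

# A tiny 2 x 3 RGB array standing in for a fruit photo (assumption).
sample = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)
variants = augment(sample)
```

Each source image yields three training images this way, which enlarges the dataset without collecting new photos.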
Deep learning can be used for object recognition in images and for decoding speech. In this demo, I used a Convolutional Neural Network (CNN) for image analysis: I built the CNN model myself and used CMU Sphinx together with Linux’s espeak for voice interaction.
Specifically, a CNN is a type of feed-forward artificial neural network that is widely applicable to image recognition. The models were built with the NVIDIA DIGITS framework, which uses the Caffe framework running on an NVIDIA GPU. For the network, I chose AlexNet to generate the model; other networks such as GoogLeNet were evaluated too. CNNs typically require a large dataset of labelled images, and the generated models were overfitting. Improving accuracy and reducing the overfitting would require a much larger repository of images. Creating such a repository, with hundreds of thousands of images, is a challenge that is beyond the scope of this post.
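To give a feel for what "feed forward" means in a CNN, here is a minimal NumPy sketch of one convolution, ReLU and max-pooling stage. It is not AlexNet and not what DIGITS or Caffe run internally; the kernel, the random input and all function names are assumptions chosen for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the operation CNN libraries call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Keep positive activations, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows and columns that do not fit are dropped."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Hypothetical 6 x 6 grayscale input and a simple vertical-edge kernel.
image = np.random.default_rng(0).random((6, 6))
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])
features = max_pool(relu(conv2d(image, edge_kernel)))
```

A real network stacks many such stages with learned kernels and ends in fully connected layers that output one score per fruit category.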
IoT Edge Device built using the Raspberry Pi module
The edge node was built from a Raspberry Pi with a $2 microphone, a $34 camera, a $10 USB Wi-Fi adapter and a $12 speaker system. The whole system cost about $100 to build. The software stack is Python with PyAudio, and voice interaction uses models based on CMU PocketSphinx.
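The speaking half of the voice interaction can be sketched by calling the espeak command-line tool from Python. This is a minimal sketch, assuming espeak is installed on the device; the helper names and the default speed are hypothetical, and the original code may have invoked espeak differently.

```python
import shutil
import subprocess

def build_espeak_command(text, words_per_minute=150):
    """Build the espeak argument list; -s sets the speed in words per minute."""
    return ["espeak", "-s", str(words_per_minute), text]

def speak(text):
    """Speak text through espeak; return False if espeak is not installed."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text), check=True)
    return True
```

On the device, `speak("apple")` would announce the predicted label over the speaker.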
Cloud Tech Stack
Both model creation and prediction were done in the cloud. The tech stack included Linux as the OS, Python, Java, MongoDB as the backend storage, and an NVIDIA GPU with the Caffe framework and NVIDIA’s DIGITS for generating the object recognition model.
What you will see in the Demo video
When the IoT device detects the word “this”, it captures an image and uploads it to the cloud (in this demo, my laptop) via a REST interface. On the server, the object recognition module pulls the image, identifies the object and updates the entry via the REST interface. The Raspberry Pi polls the REST interface for the prediction result for the image it just uploaded and announces the object over the speaker.
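The device-side flow described above (trigger word, capture, upload, poll, announce) can be sketched as follows. The camera, REST calls and speech output are passed in as plain callables, because the actual endpoints and payloads are not given in this post; every name here is a hypothetical placeholder.

```python
import time

TRIGGER_WORD = "this"  # the trigger word used in the demo

def heard_trigger(transcript, trigger=TRIGGER_WORD):
    """Return True if the decoded speech contains the trigger word."""
    return trigger in transcript.lower().split()

def poll_for_prediction(fetch_result, attempts=10, delay_seconds=1.0):
    """Call fetch_result() until it returns a label or the attempts run out."""
    for attempt in range(attempts):
        label = fetch_result()
        if label is not None:
            return label
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None

def handle_utterance(transcript, capture, upload, fetch_result, announce,
                     attempts=10, delay_seconds=1.0):
    """Run one capture, upload, poll and announce cycle if the trigger was heard."""
    if not heard_trigger(transcript):
        return None
    image_id = upload(capture())  # upload is assumed to return an id for the entry
    label = poll_for_prediction(lambda: fetch_result(image_id),
                                attempts=attempts, delay_seconds=delay_seconds)
    if label is not None:
        announce(label)
    return label
```

Injecting the callables keeps the loop easy to exercise without a camera or server; on the device they would wrap the camera capture, the REST client and the text-to-speech call.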
All human interaction is with the IoT device, through speaking and listening, and all image recognition is performed on the server. The pipeline could be altered so that an initial prediction is made on the edge device itself, making it a ‘true’ IoT device. This demo shows an end-to-end pipeline with a CNN.
Apple being identified as Apples:
Lemons being identified as Lemons:
Orange being identified as Apricots:
In some cases, the same fruit against a different background was identified incorrectly.
Multimodal Interaction. https://en.wikipedia.org/wiki/Multimodal_interaction
Convolutional Neural Networks. https://en.wikipedia.org/wiki/Convolutional_neural_network
Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.
Škrjanec, M. Automatic fruit recognition using computer vision. BSc thesis (mentor: Matej Kristan), Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, 2013.
Digits Framework. https://developer.nvidia.com/digits
Caffe Framework. http://caffe.berkeleyvision.org/
NVIDIA GeForce GTX. http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-760m