Computing MFCCs voice recognition features on ARM systems

By Ons Wechtati - March 1st,

During my summer internship with Linagora’s R&D team, I was immersed in a world that any technology enthusiast would dream of, especially artificial intelligence.

My mission was to take part in the development of the intelligent agent LINTO, and in particular of its gesture and visual detection module, by building a 360° image dataset for training in order to improve the quality of LINTO's detection.

This article gives both an overview and a detailed description of this project.

What is LINTO?

LINTO is an intelligent Open Source assistant built exclusively on Open Source technologies. It is compatible with the Cloud and free from GAFAM.

It is a personal assistant that can recognize your voice and help you in a professional environment. It also respects your privacy and does not share your information for commercial purposes.

The research project is subsidized by the PIA (Programme d’Investissement d’Avenir) of the French state within the framework of the Grand Challenges of the Digital Age. It brings together technology companies such as LINAGORA and ZELROS and research laboratories such as IRIT, LaaS, CNRS and the computer science research laboratory of the “Ecole Polytechnique”.

LINTO is made primarily to meet the needs expressed by professionals within companies, with the primary goal of reducing tedious and stressful tasks.

To better understand the purpose of this work, take the example of a weekly meeting attended by six members of the work team.

During this meeting, three people speak and four people vote for a new internal regulation.

At the end of the meeting, a report has to be drawn up. This is where LINTO comes in, with its ability to perform voice, facial and gesture detection.

In what follows, we detail the process needed to perform facial and gesture detection through the construction of a dataset of panoramic images.

Construction of a 360° image dataset:

Before proceeding to the annotation step, it is essential to have a set of panoramic images.

These images are extracted from video recordings of several meetings. Before each recording, the following steps are carried out:

  • Checking the ambient light
  • Choosing a suitable location for the camera
Figure: table with a 360° camera
In a second step, each video is split into a set of frames (image.jpg). This is done with a script written in Python.
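
The script itself is not shown here. A minimal sketch of what such a frame-extraction step could look like, assuming OpenCV (`cv2`) is installed, is given below. The function names, the `every_n` sampling parameter and the file naming scheme are hypothetical choices made for illustration, not the project's actual code.

```python
from pathlib import Path


def frame_filename(video_stem: str, index: int) -> str:
    """Build a zero-padded frame name such as 'meeting01_000042.jpg'."""
    return f"{video_stem}_{index:06d}.jpg"


def extract_frames(video_path: str, out_dir: str, every_n: int = 30) -> int:
    """Save every `every_n`-th frame of a video as a JPEG; return how many were saved."""
    # OpenCV is imported here so that frame_filename() stays usable without it.
    import cv2

    if every_n < 1:
        raise ValueError("every_n must be >= 1")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        raise IOError(f"cannot open video: {video_path}")

    stem = Path(video_path).stem
    saved = 0
    index = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of the video (or a read error)
                break
            if index % every_n == 0:
                cv2.imwrite(str(out / frame_filename(stem, index)), frame)
                saved += 1
            index += 1
    finally:
        capture.release()
    return saved
```

Keeping only one frame out of `every_n` avoids filling the dataset with near-identical consecutive images; the right value depends on the frame rate of the recordings.
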
To train our model, the acquired images must then be annotated correctly and carefully.

Image annotation:

Giving eyes to your machine is no longer a dream. Thanks to artificial intelligence technologies, a machine can perceive the objects that make up its environment, that is, analyze, process and understand images. This is called computer vision.

The ability to see and interpret the world underpins cutting-edge technologies such as medical image analysis, optical character recognition, and facial and gesture recognition.

However, none of this is possible without clear, annotated data to train machine learning models.

Indeed, the performance of machine learning and deep learning algorithms depends on the accuracy of the training data on which these models are built.

Machine learning models learn from the data they are given. Once an algorithm has processed enough annotated data, it can begin to recognize automatically, in new unannotated data, the patterns that recur in the annotated data provided to it.

In other words, annotation is a manual task that involves assigning labels or metadata to a dataset. It is a form of data labeling.

Image annotation is used to identify objects and to segment images. It marks the data that the machine learning system is supposed to recognize, and it is an indispensable data pre-processing step in supervised learning.

The type of image annotation used:

There are various annotation methods, and the choice depends on the need. In our case, we used bounding boxes, the simplest and most widely used type of annotation: imaginary boxes drawn on images. The contents of each bounding box are labeled to help a machine learning model recognize them as a distinct type of object.

For this form of annotation, the annotator draws a box around the object to annotate in the image. This box must fit the edges of the object as closely as possible.

Bounding boxes are commonly used for the classification, localization and detection of objects.

Sometimes there is more than one target object in a single image, in which case there should be as many boxes as there are objects to delineate.
Figure: 360° meeting

The image annotation software used:

To be able to annotate images we need annotation software.

“LabelImg” is an open source program developed in Python that allows you to:

  • Draw a 2D rectangle around an identified element, in YOLO or PASCAL VOC format
  • Assign a label to this rectangle
  • Generate .txt (YOLO) or .xml (PASCAL VOC) files
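
The two formats describe the same rectangle differently: PASCAL VOC stores the pixel coordinates of its corners (xmin, ymin, xmax, ymax), while YOLO stores a class index followed by the box center and size, each normalized by the image dimensions. The sketch below illustrates the conversion. The function names are hypothetical, and the class index is assumed to be the position of the label in the class list used by the annotation tool.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert PASCAL VOC pixel corners to YOLO (x_center, y_center, width, height),
    all expressed as fractions of the image size."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def yolo_line(class_id, box, img_w, img_h):
    """Format one object as a line of a YOLO .txt annotation file."""
    xc, yc, w, h = voc_to_yolo(*box, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Illustrative values: a 200 x 200 pixel box in a 1000 x 800 image.
print(yolo_line(1, (100, 200, 300, 400), 1000, 800))
# prints "1 0.200000 0.375000 0.200000 0.250000"
```
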

The different annotation classes:

We started by adding the following new labels to our open source software:



Figures: 360° meeting (three images)

We then annotated the images with the labels speaking, voting and person.

Generation of the XML file:

Once an image has been annotated, LabelImg automatically generates the corresponding XML file.

This file contains:

  • The location of the image that the file describes
  • One or more “object” tags, each corresponding to an annotated object and containing the label used and the x-axis and y-axis coordinates of its bounding box

These coordinates are essential for identifying objects and for training models.
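
As an illustration, the sketch below reads such a file with Python's standard `xml.etree.ElementTree` module. The tag names follow the PASCAL VOC layout that LabelImg writes (`path`, `size`, `object`, `name`, `bndbox`); the sample content, file path and coordinates are invented for the example, and `parse_annotation` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample in the PASCAL VOC layout, for illustration only.
SAMPLE_XML = """<annotation>
  <filename>frame_000042.jpg</filename>
  <path>/data/frames/frame_000042.jpg</path>
  <size><width>1920</width><height>960</height><depth>3</depth></size>
  <object>
    <name>speaking</name>
    <bndbox><xmin>410</xmin><ymin>220</ymin><xmax>640</xmax><ymax>700</ymax></bndbox>
  </object>
  <object>
    <name>voting</name>
    <bndbox><xmin>1100</xmin><ymin>180</ymin><xmax>1320</xmax><ymax>690</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Return the image path, image size and labeled boxes from a VOC annotation."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    objects = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        box = tuple(int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.findtext("name"), box))
    return {
        "path": root.findtext("path"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": objects,
    }


annotation = parse_annotation(SAMPLE_XML)
for label, box in annotation["objects"]:
    print(label, box)
```

For a file on disk, `ET.parse(filename).getroot()` can replace `ET.fromstring(xml_text)`.
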

Annotation quality check:

Each image in the dataset must be carefully and accurately labeled to train a deep learning model to recognize objects in the same way a human can. The higher the quality of the annotation, the better the performance of the model.

It is therefore necessary to check the quality of the annotation before moving on to model training.

This can be done with a Python script in a Jupyter Notebook, which loads the desired image and displays it so that the position of each bounding box can be checked.
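
The verification script is not shown here. A minimal sketch of such a check, assuming the Pillow library is installed, could look like the following. The names `box_is_valid` and `draw_boxes` are hypothetical, and the automatic rejection of boxes that are empty or fall outside the image is an addition made for illustration rather than part of the original procedure.

```python
def box_is_valid(box, img_w, img_h):
    """True when the box has a positive area and lies entirely inside the image."""
    xmin, ymin, xmax, ymax = box
    return 0 <= xmin < xmax <= img_w and 0 <= ymin < ymax <= img_h


def draw_boxes(image_path, objects):
    """Draw each (label, (xmin, ymin, xmax, ymax)) pair on a copy of the image.

    Returns the annotated image and the list of objects whose box was rejected.
    """
    # Pillow is imported here so that box_is_valid() stays usable without it.
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    rejected = []
    for label, box in objects:
        if not box_is_valid(box, image.width, image.height):
            rejected.append((label, box))  # reported instead of drawn
            continue
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0] + 4, box[1] + 4), label, fill="red")
    return image, rejected
```

In a notebook cell, calling `image, rejected = draw_boxes(path, objects)` and then evaluating `image` on the last line displays the annotated picture inline, while `rejected` lists the boxes that need to be redrawn.
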

Although effective, this manual verification remains a repetitive and time-consuming task.

Figure: 360° meeting

Data enrichment (annotation) is the most laborious part of an artificial intelligence project, but it is what allows the models to be trained effectively and therefore improves their performance and accuracy.

Even an annotated dataset of sufficient quantity and quality to train a good model and obtain acceptable results does not guarantee 100% performance.

The annotation work does not stop when the first models are performing well.

Such a project depends on the data it is given, and cannot guarantee the same performance on new cases that could not be anticipated.

Likewise, the need to detect new gestures requires further annotation of targeted examples.