The appearance of “hands-free” vocal interfaces in technology, ranging from device controls to search, marks the beginning of a long-term trend in digital productivity tools and puts speech recognition and transcription (along with natural language understanding) at the forefront of AI research. At LINAGORA Labs, we are continually improving the algorithms behind our speech-to-text engine and speech generation models. The French language is our primary focus, though we are working to expand our offerings to a variety of European languages while maintaining GDPR standards for user privacy and data autonomy.
Our approach to keyword spotting focuses on developing reliable, multi-platform, small-footprint open-source software to detect wake words in streams of spoken language. To this end, we develop a training methodology to easily produce detection models for customized keywords, along with packaged, ready-to-use implementations of these models for target platforms. This training methodology is based on a rigorous comparison of state-of-the-art algorithms for data processing, including neural network architectures, to achieve a balance between runtime performance and detection accuracy.
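A common way to turn a per-frame keyword classifier into a streaming detector is to smooth its posteriors over a short window and fire when the smoothed score crosses a threshold. The sketch below illustrates this idea only; the function name, window size, and threshold are illustrative assumptions, not LinTO's actual parameters.

```python
from collections import deque

def detect_keyword(frame_scores, window=5, threshold=0.8):
    """Return the index of the first frame at which the moving average of
    the last `window` per-frame keyword posteriors reaches `threshold`,
    or None if the keyword is never detected."""
    recent = deque(maxlen=window)            # sliding window of recent scores
    for i, score in enumerate(frame_scores):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return i                         # smoothed score crossed threshold
    return None
```

Smoothing over a window rather than thresholding a single frame is what keeps false-alarm rates low on noisy audio, at the cost of a few frames of detection latency.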
To respond to requests such as “Turn on the lights in the meeting room” and questions such as “How is the weather in Toulouse?”, our LinTO assistant uses high-performance command models tailored to specific business use cases. By curating a small, targeted vocabulary, we optimize the accuracy and computational efficiency of the command model by reducing the size of its language model component. The resulting smaller, customized command models can then be embedded in IoT devices in a way that allows the voice data to remain as close as possible to its source, thus respecting user privacy.
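With a small, curated vocabulary, mapping a recognized utterance to an intent can be as simple as a keyword table. The minimal sketch below assumes a hypothetical two-intent grammar; the intent names and keywords are illustrative and do not reflect LinTO's actual command grammar.

```python
# Hypothetical command grammar: each intent is triggered when all of its
# keywords appear in the recognized utterance.
COMMANDS = {
    "lights_on": {"turn", "lights"},
    "weather": {"weather"},
}

def parse_command(utterance):
    """Return the first intent whose keywords all occur in the utterance,
    or None if no intent matches."""
    words = {w.strip("?!.,") for w in utterance.lower().split()}
    for intent, keywords in COMMANDS.items():
        if keywords <= words:                # all keywords present
            return intent
    return None
```

A real command model would do this matching inside the decoder via a restricted grammar, but the effect is the same: a tiny vocabulary keeps both the search space and the error surface small.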
Available in both streaming and offline versions, our large vocabulary models are designed to transcribe extended open-domain dialogue with a focus on handling spontaneous, multi-party conversations of the sort encountered in business meetings. These interactions pose a number of challenges for advanced speech recognition systems, which are generally trained on grammatically correct text and speech: noisy recording conditions, high levels of disfluency (e.g. hesitations, repetitions, false starts), and overlapping speech. Our LinSTT system exploits a hybrid DNN-HMM speech recognition model, combining a Deep Neural Network acoustic model with Hidden Markov Models for sequence modeling, together with a separate language model.
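In hybrid decoding, competing hypotheses are typically ranked by their acoustic log-likelihood plus a weighted language-model log-probability. The toy sketch below shows only that score combination; the hypothesis texts, scores, and the 0.7 weight are illustrative values, not LinSTT's configuration.

```python
def rescore(hypotheses, lm_weight=0.7):
    """Rank ASR hypotheses by combined score.

    hypotheses: list of (text, acoustic_logp, lm_logp) tuples.
    Returns the list sorted best-first by
    acoustic_logp + lm_weight * lm_logp."""
    def combined(h):
        _, acoustic_logp, lm_logp = h
        return acoustic_logp + lm_weight * lm_logp
    return sorted(hypotheses, key=combined, reverse=True)
```

The language-model weight is what lets a slightly worse acoustic match win when it is far more plausible as language, which matters for disfluent, noisy meeting speech.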
Text-to-speech technology aims to develop artificial voices for speech-based user interactions. Our research and development focuses on creating natural-sounding models using the latest technologies and deploying server-side, embedded text-to-speech services in our products. Additional challenges include achieving a balance between voice quality and speech generation speed while handling large volumes of concurrent API requests, as well as developing voices for a variety of languages.
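One standard way to keep a synthesis server responsive under bursts of concurrent API requests is to cap the number of jobs running at once, queuing the rest. The asyncio sketch below assumes a placeholder `synthesize` coroutine standing in for a real TTS call, and an illustrative limit of two concurrent jobs.

```python
import asyncio

async def synthesize(text, sem):
    """Stand-in for a real TTS call; the semaphore caps concurrency."""
    async with sem:
        await asyncio.sleep(0.01)    # placeholder for actual audio generation
        return f"audio<{text}>"

async def handle_requests(texts, max_concurrent=2):
    """Serve a burst of requests, at most `max_concurrent` at a time."""
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(synthesize(t, sem) for t in texts))
```

Because `asyncio.gather` preserves request order, callers still receive their audio in the order they asked, even though synthesis is throttled internally.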