How Does Voice Translation Work?
Nov 08, 2018

Hishell Translation

A speech translation system typically integrates three software technologies: automatic speech recognition (ASR), machine translation (MT) and speech synthesis (text-to-speech, TTS).
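The way the three stages chain together can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product is built: the function names (`make_speech_translator`, `toy_asr`, `toy_mt`, `toy_tts`) are hypothetical, and the toy stages simply treat bytes as text so the example runs without any real speech engine.

```python
from typing import Callable

# Hypothetical stage signatures. A real system would wrap actual
# ASR, MT and TTS engines behind functions shaped like these.
Recognizer = Callable[[bytes], str]   # audio in language A -> text in language A
Translator = Callable[[str], str]     # text in language A  -> text in language B
Synthesizer = Callable[[str], bytes]  # text in language B  -> audio in language B


def make_speech_translator(
    asr: Recognizer, mt: Translator, tts: Synthesizer
) -> Callable[[bytes], bytes]:
    """Chain the three stages into one speech-to-speech function."""

    def translate_speech(audio_a: bytes) -> bytes:
        text_a = asr(audio_a)  # 1. speech recognition
        text_b = mt(text_a)    # 2. machine translation
        return tts(text_b)     # 3. speech synthesis

    return translate_speech


# Toy stand-ins (assumptions for illustration only): "audio" is just
# UTF-8 encoded text, and the translator knows a single word.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = make_speech_translator(toy_asr, toy_mt, toy_tts)
output_audio = pipeline(b"hello")
```

Keeping each stage behind its own function means any one of them can be swapped out without touching the other two.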

The speaker of language A speaks into a microphone, and the speech recognition module recognizes the utterance. It compares the input with a phonological model built from a large corpus of speech data from multiple speakers. The input is then converted into a string of words using the dictionary and grammar of language A, which are themselves derived from a massive corpus of text in language A.

The machine translation module then translates this string. Early systems replaced every word with a corresponding word in language B. Current systems do not translate word for word; instead, they take the entire context of the input into account to generate an appropriate translation.

The generated translation is sent to the speech synthesis module, which estimates the pronunciation and intonation of the string of words based on a corpus of speech data in language B. Waveforms matching the text are selected from this database, and the speech synthesis module concatenates them and outputs the result.
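The limitation of the early word-for-word approach is easy to see in a small sketch. The dictionary below is a made-up three-entry English-to-Spanish table and `word_for_word` is a hypothetical helper, both chosen purely for illustration.

```python
from typing import Mapping

# Toy English-to-Spanish dictionary (illustrative entries only).
TOY_DICTIONARY = {"the": "el", "red": "rojo", "car": "coche"}


def word_for_word(sentence: str, dictionary: Mapping[str, str]) -> str:
    """Replace each word independently, ignoring all context.

    Words missing from the dictionary are passed through unchanged.
    """
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


result = word_for_word("The red car", TOY_DICTIONARY)
print(result)  # el rojo coche
```

Each word is translated correctly on its own, yet the output keeps English word order, whereas natural Spanish places the adjective after the noun ("el coche rojo"). Handling that kind of reordering is one reason current systems look at the whole input rather than at one word at a time.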

