Translate Your Cat’s Meow!

Siddhant Sancheti
7 min read · Dec 15, 2022


Let's talk QWERTY!

Hello friends! ¡Hola amigos! Bonjour les amis!

Wondering why I just started my very first Medium article with multilingual greetings? That is exactly what this article is about: language translation, and to be more specific, I am quite flabbergasted to share, spoken language translation. You might be a little curious to know what exactly a spoken language is. In layman's terms, these are unwritten, usually regional, languages that lack a standard written form. By definition, "a spoken language is a language produced by articulate sounds or (depending on one's definition) manual gestures, as opposed to a written language" (source: Wikipedia). So, just imagine how challenging it would be to build an A.I. translator for such unwritten, oral languages.

But here is an exciting and exclusive fact. Meta AI has recently announced its direct speech-to-speech translation (S2ST) system aimed primarily at spoken languages, and it outperforms previous S2ST approaches because, in contrast to prior methods, it does not rely on text generation as an intermediate step. Meta AI, to my knowledge, is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs. In a blog post, Meta founder and CEO Mark Zuckerberg said, "Spoken communications can help break down barriers and bring people together wherever they are located — even in the metaverse".

For most people with scripted languages, this may seem like just another translator tool. Still, for the tons of people whose native language is textless, it may be a boon for accessing worldly data and information across the internet. Meta first demonstrated its model on Hokkien, widely spoken within the Chinese and Taiwanese diaspora; because the language lacks a standard written form, researchers transcribed Hokkien speech into a standardized phonetic notation called Tâi-lô to help evaluate the system. In addition to developing a method for evaluating Hokkien-English speech translations, the team created the first Hokkien-English bidirectional speech-to-speech translation benchmark dataset, based on a Hokkien speech corpus called Taiwanese Across Taiwan. This felt like an insane breakthrough to me, as I have always wondered how some of my own native languages might ever get translated with their sentiment intact. This is one of the reasons why I was so enthralled to write this article. But my interest doesn't stop there: being a tech lad, I'm keen to dig into the backend tech, so let's delve into it.

Conventional S2ST employs a cascaded series of steps (speech recognition, then text-to-text translation, and finally conversion of the translated text back to speech), which is computationally expensive and, since it relies on text generation as an intermediate step, infeasible for unwritten spoken languages. Earlier direct models avoid the cascade by converting source speech straight into target speech spectrograms, which are multidimensional continuous-value vector representations of the frequency spectrum. Unlike both of these, the researchers at Meta AI proposed a rather mind-wobbling system built on a speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations instead of spectrograms, derived by clustering self-supervised speech representations. These discrete units, combined with a self-supervised unit-based speech normalization technique, let the researchers bring advances from automatic speech recognition (ASR), machine translation (MT), and end-to-end speech-to-text translation (S2T) to direct S2ST. The improved quality of speech units also allows these masterminds to tackle spoken generative language modeling and emotion conversion, cast as unit-to-unit translation tasks.
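To make the idea of discrete units concrete, here is a minimal sketch of how target speech can be discretized: frame-level representations from a pretrained HuBERT model are clustered with k-means, and each frame is then replaced by the ID of its nearest centroid. The checkpoint name, the use of the final hidden layer, and the number of clusters are illustrative assumptions of mine, not Meta's exact setup.

```python
# Sketch: discretize speech into units via HuBERT features + k-means.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def frame_features(wav_path: str) -> torch.Tensor:
    """Return frame-level HuBERT representations for one utterance."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        return hubert(**inputs).last_hidden_state.squeeze(0)  # (frames, dim)

# Fit k-means on features pooled from many utterances (here: a toy list of files).
feats = torch.cat([frame_features(p) for p in ["a.wav", "b.wav"]], dim=0)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats.numpy())

# Discretize a new utterance: each frame becomes a unit ID in [0, 99].
units = kmeans.predict(frame_features("target_speech.wav").numpy())
print(units[:20])
```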

Hence, the experiments showed that with the combination of discrete unit prediction, joint speech and text training, and beam search, the direct S2ST system matched the performance of a cascaded S2T+TTS system. For this, the study uses the Fisher Spanish-English speech translation corpus, which comprises 139K sentences (about 170 hours) transcribed in both Spanish and English from Spanish-speaking telephone conversations. Meta conducted experiments on Spanish-English (Es-En) and English-Spanish (En-Es) translation with FAIRSEQ.

Using a top-notch internal text-to-speech engine, they produce synthetic target speech with a single female voice as the training target. Compared to a straightforward direct S2ST baseline that predicts spectrogram features, using discrete units results in an improvement of 6.7 BLEU. Additionally, without any human annotations, the best textless direct speech translation model translates target speech with performance on par with cascaded text-based systems. An additional 2.0 BLEU gain was observed when automatically mined S2ST data was used during training. With the proposed S2UT system trained on real data (VoxPopuli S2S data plus automatically mined S2S data), they extended the monolingual unit pre-training to a multilingual setup and performed unlabeled speech encoder and discrete unit decoder pre-training separately. In addition, for Hokkien, Meta AI adopted UnitY for a two-pass decoding mechanism, where the first-pass decoder generates text in a related language (Mandarin) and the second-pass decoder creates units.
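A side note on the numbers above: BLEU for speech output is typically computed as ASR-BLEU, i.e., the generated target speech is first transcribed with an automatic speech recognizer and the transcripts are scored against reference translations. The sketch below shows only the scoring half with sacrebleu; the toy hypotheses stand in for ASR transcripts of the model's output audio, and the exact ASR models and text normalization used in the papers will differ.

```python
# Sketch of the ASR-BLEU scoring step for speech-to-speech translation output.
import sacrebleu

# Transcripts of the system's generated English speech (produced by any ASR model).
hypotheses = [
    "good morning how are you",
    "i will call you tomorrow",
]
# Reference English translations of the Spanish source utterances.
references = [
    "good morning how are you",
    "i will call you back tomorrow",
]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"ASR-BLEU: {bleu.score:.1f}")
```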

The direct S2ST model with discrete units, illustrated below, is made up of the following components (a toy code skeleton follows the figure):

(1) a transformer-based speech-to-unit translation (S2UT) model with a speech encoder and a discrete unit decoder,

(2) auxiliary tasks conditioned on the encoder,

(3) a text CTC decoder conditioned on the discrete unit decoder,

(4) a vocoder separately trained to transform discrete units into a waveform.

S2ST model with discrete units. Source credits: https://arxiv.org/pdf/2107.05604.pdf
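To make the four components above more tangible, here is a toy PyTorch skeleton. It is not the paper's actual implementation: the layer sizes, the convolutional front end, the lack of a causal decoder mask, and the omission of the encoder-side auxiliary tasks are all simplifications of my own.

```python
import torch
import torch.nn as nn

class S2UTModel(nn.Module):
    """(1) Transformer-based S2UT model: speech encoder + discrete unit decoder."""

    def __init__(self, n_units=1000, n_chars=32, d_model=256):
        super().__init__()
        # speech encoder: conv subsampler + Transformer encoder
        self.subsample = nn.Conv1d(80, d_model, kernel_size=5, stride=4, padding=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        # (2) auxiliary tasks conditioned on the encoder are omitted for brevity
        # discrete unit decoder (teacher-forced here, no causal mask, for simplicity)
        self.unit_embed = nn.Embedding(n_units, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.unit_out = nn.Linear(d_model, n_units)
        # (3) text CTC head conditioned on the discrete unit decoder states
        self.ctc_out = nn.Linear(d_model, n_chars)

    def forward(self, fbank, prev_units):
        # fbank: (batch, frames, 80) filterbank features; prev_units: (batch, tgt_len)
        enc = self.encoder(self.subsample(fbank.transpose(1, 2)).transpose(1, 2))
        dec = self.decoder(self.unit_embed(prev_units), enc)
        return self.unit_out(dec), self.ctc_out(dec)

model = S2UTModel()
unit_logits, ctc_logits = model(torch.randn(2, 400, 80), torch.randint(0, 1000, (2, 50)))
print(unit_logits.shape, ctc_logits.shape)  # torch.Size([2, 50, 1000]) torch.Size([2, 50, 32])
# (4) a separately trained unit HiFi-GAN vocoder would then turn the predicted
# unit sequence back into a waveform.
```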

Furthermore, an in-depth illustration of the textless S2ST model is as follows:

Illustration of the textless S2ST model. Source credits: https://arxiv.org/pdf/2112.08352.pdf

The left side is the speech-to-unit translation (S2UT) model with an auxiliary task, while the right part is the unit-based HiFi-GAN vocoder for unit-to-speech conversion. The speech normalizer is applied to generate the norm units used as the target for S2UT training, while the vocoder is trained with orig-units obtained from the HuBERT and k-means models. Only the shaded modules are used during inference.

Finally, the enhanced direct speech-to-speech translation system, which uses self-supervised pre-training and data augmentation, consists of the following integral parts:

1. Speech-to-unit translation (S2UT) model: Here, the target speech is encoded as discrete units with a HuBERT model trained on unlabeled speech, followed by a k-means model (consecutive duplicate units are collapsed into "reduced" units, as sketched after the figure). The resulting direct S2ST system consists of a sequence-to-sequence S2UT model with a speech encoder and a unit decoder, followed by a unit HiFi-GAN vocoder trained separately for unit-to-waveform conversion.
Flowchart for the speech encoder and decoder pretraining and finetuning process. Source credits: https://arxiv.org/pdf/2204.02967.pdf
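One small but important detail is the "reduced" unit sequence: consecutive frames often map to the same cluster, so runs of identical units are collapsed before they are used as decoder targets. A minimal sketch, assuming plain Python lists of unit IDs:

```python
# Collapse runs of identical consecutive units into a single unit ("reduced" units).
from itertools import groupby

def reduce_units(units):
    """De-duplicate consecutive repeated unit IDs."""
    return [u for u, _ in groupby(units)]

print(reduce_units([5, 5, 5, 17, 17, 3, 5, 5]))  # [5, 17, 3, 5]
```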

2. Model pre-training:

i. Encoder pre-training with wav2vec 2.0: wav2vec 2.0 is a self-supervised framework for learning speech representations from unlabeled audio data.

ii. Decoder pre-training with unit mBART: mBART was originally proposed as a denoising autoencoder over text. Here, the reduced discrete units extracted from unlabeled speech are treated as text, and mBART training is applied with a Transformer-based encoder-decoder architecture (a tiny sketch of the unit-level noising follows).
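The unit-level noising can be pictured as mBART-style text infilling applied to unit sequences: contiguous spans are replaced by a single mask token and the model learns to reconstruct the original sequence. The sketch below shows only that corruption step; the mask ratio and span lengths are arbitrary choices of mine, not the paper's hyperparameters.

```python
# Sketch: mBART-style span masking over a sequence of discrete units.
import random

MASK = -1  # stand-in for a dedicated <mask> unit

def mask_spans(units, mask_ratio=0.35, mean_span=10, seed=0):
    """Replace random contiguous spans with a single mask token."""
    rng = random.Random(seed)
    units = list(units)
    n_to_mask = int(len(units) * mask_ratio)
    masked = 0
    while masked < n_to_mask and len(units) > 1:
        span = max(1, min(int(rng.expovariate(1 / mean_span)), len(units) - 1))
        start = rng.randrange(0, len(units) - span)
        units[start:start + span] = [MASK]
        masked += span
    return units

original = [12, 7, 7, 93, 5, 41, 41, 41, 8, 30, 2, 64, 19, 19, 75, 6]
print(mask_spans(original))
# The denoising objective: given the corrupted sequence, reconstruct `original`.
```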

3. Model finetuning: The wav2vec 2.0 encoder and the unit mBART decoder are combined, and several finetuning strategies are studied.

The four fine-tuning strategies (a small PyTorch sketch follows the list):

a. LNA-E: The LayerNorm and self-attention parameters in the encoder and all the parameters in the decoder are fine-tuned.

b. LNA-D: The whole encoder is finetuned, together with the LayerNorm and encoder (cross-)attention parameters in the decoder. The encoder can optionally be frozen for the first k updates.

c. LNA-E, D: Only LNA parameters are finetuned both on the encoder and the decoder side.

d. Full: The whole model is finetuned end-to-end, with the option of freezing the encoder for the first k updates.
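In PyTorch terms, the LNA strategies boil down to toggling requires_grad on parameters whose names match LayerNorm or attention modules. The sketch below shows the idea on torch.nn Transformer modules; the name patterns are assumptions (fairseq uses different module names), so treat it as illustrative only.

```python
# Sketch: realize an LNA-style finetuning strategy by freezing everything
# except parameters whose names match LayerNorm / attention modules.
import torch.nn as nn

def apply_lna(module: nn.Module, patterns):
    """Freeze all parameters except those whose names match the given patterns."""
    for name, param in module.named_parameters():
        param.requires_grad = any(p in name for p in patterns)

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, 4), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(256, 4), num_layers=2)

# e.g. LNA-D: finetune the whole encoder, but only LayerNorm + encoder
# (cross-)attention in the decoder.
for p in encoder.parameters():
    p.requires_grad = True
apply_lna(decoder, patterns=("norm", "multihead_attn"))  # torch names differ from fairseq's

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
print(f"trainable decoder params: {trainable}")
```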

4. Data augmentation: Data augmentation is a common method for increasing the size of the training data. Here, additional speech-to-speech training pairs can be created from existing speech-to-text corpora, for example by machine-translating ASR transcripts and synthesizing the corresponding target speech, as sketched below.
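A hedged sketch of one common augmentation recipe: starting from an ASR corpus of (speech, transcript) pairs, machine-translate the transcript and synthesize target speech with TTS to obtain a pseudo speech-to-speech pair. The translate and synthesize helpers are hypothetical placeholders for whatever MT and TTS systems are available, not a real API.

```python
# Sketch: build pseudo S2ST pairs from an ASR corpus via MT + TTS (hypothetical helpers).
from dataclasses import dataclass

@dataclass
class S2STExample:
    source_wav: str   # path to source-language speech
    target_wav: str   # path to (synthetic) target-language speech

def translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("hypothetical: plug in any MT system")

def synthesize(text: str, lang: str) -> str:
    raise NotImplementedError("hypothetical: plug in any TTS system; returns a wav path")

def augment(asr_corpus, src="es", tgt="en"):
    """Turn (speech, transcript) pairs into pseudo speech-to-speech pairs."""
    for wav_path, transcript in asr_corpus:
        target_text = translate(transcript, src, tgt)
        yield S2STExample(wav_path, synthesize(target_text, tgt))
```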

Experiment Data:

Meta has released the data used to build the model. For modeling target speech in English, Spanish, and French, they train a single mHuBERT model on a 100k subset of VoxPopuli unlabeled speech, which contains 4.5k hours of data from the three languages (En, Es, and Fr). They employ the VoxPopuli ASR dataset and convert its text transcriptions to reference units for training the speech normalizer. TTS data is used to train the HiFi-GAN vocoder, with VAD applied to remove the silence at both ends of the audio.

https://github.com/pytorch/fairseq/blob/main/examples/speech_to_speech/docs/textless_s2st_real_data.md

En: https://huggingface.co/facebook/tts_transformer-en-ljspeech, Es: https://huggingface.co/facebook/tts_transformer-es-css10

References:

  1. “Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation”- https://arxiv.org/pdf/2204.02967.pdf
  2. “Textless Speech-to-Speech Translation on Real Data”- https://arxiv.org/pdf/2112.08352.pdf
  3. “Direct Speech-to-Speech Translation With Discrete Units”- https://arxiv.org/pdf/2107.05604.pdf
  4. https://research.facebook.com/publications/hokkien-direct-speech-to-speech-translation/
  5. https://ai.facebook.com/blog/teaching-ai-to-translate-100s-of-spoken-and-written-languages-in-real-time/
  6. https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/
  7. https://about.fb.com/news/2022/10/hokkien-ai-speech-translation/
  8. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html
