Of the roughly 7,000 known languages in the world, nearly half are primarily oral and have no widely used writing system. These unwritten languages pose a unique problem for modern machine translation systems, which typically convert speech to text, translate the text, and then convert the result back to speech. Meta has now addressed this problem with its latest open-source language AI work.

The work is part of Meta's Universal Speech Translator (UST) project, which aims to develop real-time speech-to-speech translation so that residents of the metaverse can interact with one another more easily. For this project, Meta researchers studied Hokkien (the Minnan dialect), a primarily oral language spoken widely across Asia and one of the most common languages in Taiwan.
Machine learning translation systems usually require large amounts of labeled training data, both written and spoken, which is exactly what unwritten languages like Hokkien lack. To get around this, "Meta used speech-to-unit translation (S2UT) to convert input speech directly into a sequence of acoustic units, an approach Meta previously pioneered," CEO Mark Zuckerberg explained in a blog post Wednesday. "We then generate waveforms from those units. In addition, UnitY was adopted as a two-pass decoding mechanism, where the first-pass decoder generates text in a related language (Mandarin) and the second-pass decoder creates the units."
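The two-pass pipeline Zuckerberg describes can be sketched roughly as follows. This is a minimal illustrative outline, not Meta's actual code: every function here is a hypothetical stand-in for a large neural component (encoder, decoders, and vocoder).

```python
# Hypothetical sketch of a UnitY-style two-pass speech-to-unit pipeline.
# All functions are illustrative stand-ins, not Meta's actual API.

def encode_speech(waveform):
    """Encode raw audio samples into hidden features (stub)."""
    return [f"feat_{i}" for i, _ in enumerate(waveform)]

def first_pass_decode_text(features):
    """First-pass decoder: emit text in a related written language (e.g. Mandarin)."""
    return "<mandarin pivot text>"  # stand-in transcription

def second_pass_decode_units(features, pivot_text):
    """Second-pass decoder: emit discrete acoustic units, conditioned on the text."""
    return [101, 57, 33]  # stand-in unit IDs

def units_to_waveform(units):
    """Vocoder stage: synthesize an output waveform from the unit sequence (stub)."""
    return [u / 128.0 for u in units]

def translate_speech(waveform):
    """Full pipeline: speech -> features -> pivot text -> units -> speech."""
    feats = encode_speech(waveform)
    text = first_pass_decode_text(feats)
    units = second_pass_decode_units(feats, text)
    return units_to_waveform(units)
```

The key design point is that the system never needs written Hokkien: the only text involved is the intermediate Mandarin produced by the first-pass decoder, while the output is synthesized directly from discrete acoustic units.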
" We use Mandarin as the intermediate language to create pseudo-tags, we first translate English (or Minnan dialect as mentioned above) into Mandarin text, and then we translate it into Minnan dialect (or English) and add it to the training data. "Currently, the system allows people who speak Fujian dialect to talk to English speakers, albeit bluntly, and the model can only translate one full sentence at a time. But Zuckerberg believes that the technology can eventually be applied to more languages and will be improved to the point where it provides real-time translation.
Zuckerberg announced that in addition to open-sourcing the model and training data from this project, the company will also release the first speech-to-speech translation benchmark based on a Hokkien speech corpus, as well as "SpeechMatrix, a large corpus of speech-to-speech translations mined with Meta's innovative data-mining technique LASER." These releases will enable other researchers to build their own speech-to-speech translation (S2ST) systems.