Māori Text-to-Speech Model Spurns Big Tech’s Values

New Zealand is a country famed for its dramatic landscapes, but its linguistic landscape is arguably just as interesting. Of its three official languages, only te reo Māori (the Māori language) could be described as indigenous. Although just 4.3 percent of the population speak it fluently, national statistics show that about 30 percent of New Zealanders can speak more than a few words or phrases of the language.

But ask ChatGPT to write te reo Māori and it will oblige, fluently answering your questions in the standardized form of the language taught in schools and broadcast on national television. Claude and Perplexity can do the same. This impressive language performance is built on text and audio produced by Māori communities and academics, which was scraped and ingested without their permission, processed outside New Zealand, and returned to users through interfaces owned by large technology companies. For Māori, that is a problem.

“These companies overseas have the resources to produce AI models that work well,” said Te Taka Keegan, an associate professor at the University of Waikato and co-director of its AI Institute. “But they scraped all of that data with no input from us, and we don’t own the output. Our language is the most important conveyor we have for our knowledge … yet we see technology developed outside of Aotearoa [New Zealand] get more and more control over the transfer of that knowledge.”

Motivated by this need for “sovereign digital systems,” as Keegan calls it, he and Kingsley Eng, his master’s student at the time, set out to develop a high-fidelity synthetic voice (a text-to-speech system, in other words) for a specific dialect of te reo Māori. Every technical decision they made along the way was shaped by a foundational constraint typically ignored by the AI sector: that this synthetic voice, and everything used to build it, must remain owned by the people who speak that dialect. What they produced, they hope, offers a replicable blueprint for other minority language communities around the world.

Challenges in Māori AI Voice Models

AI voice models are predominantly built for English, so applying them to other languages can lead to errors. Te reo Māori has some specific linguistic features, such as the importance of vowel length, that pose additional challenges for AI voice systems.

As an example, the words for ‘cake’ (keke), ‘armpit’ (kēkē) and ‘to creak’ (kekē) differ only in how long the vowel sounds are. Digraphs (two letters making one sound) are also common, and are pronounced differently than they are in English; “wh” is usually pronounced “f.” In the Māori language, inaccurate pronunciation can change the meanings of words.
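To see why vowel length matters to any software that handles the language, consider what happens if a text pipeline discards macrons, as some ASCII-only preprocessing steps do. The short Python sketch below is an illustration, not part of the Waikato project: the helper name strip_macrons is hypothetical, and the only data used are the three example words above.

```python
import unicodedata


def strip_macrons(text: str) -> str:
    """Drop combining marks (such as macrons) after decomposing the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The three example words, keyed by spelling.
words = {"keke": "cake", "kēkē": "armpit", "kekē": "to creak"}

# Normalize to NFC so that a precomposed "ē" and "e" followed by a
# combining macron are treated as the same spelling.
normalized = {unicodedata.normalize("NFC", w): gloss for w, gloss in words.items()}

distinct_with_macrons = len(normalized)
distinct_without_macrons = len({strip_macrons(w) for w in normalized})

print(distinct_with_macrons, distinct_without_macrons)  # prints: 3 1
```

With macrons intact there are three distinct words; once they are stripped, all three collapse into the single spelling "keke", and the difference in meaning is lost before a voice model ever sees the text.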

In addition, te reo Māori is considered a low-resource language, because, compared to a language like English or Chinese, there’s relatively little potential training data in the form of text, datasets, or recorded speech available in digital formats. To address this problem, Keegan recruited Ngaringi Katipa—a translator, educator and language mentor—to be the consenting human voice behind the tool.

“We focused on our local dialect, Waikato-Maniapoto, because it’s in the dialects that you see the real beauty of language. They tie it to a specific place, and sense of identity,” says Keegan.

“We initially just recorded Ngaringi reading passages from books, which gave us 4.5 hours of data,” says Eng, now a machine learning engineer at precision toolmaker Extec. “Later, we expanded the dataset by recording from a comprehensive list of sentences and words—including very rare words—given to us by Te Taka’s brother Peter, who is a Māori linguistics expert.” Once cleaned and processed, the final tally was 7 hours and 45 minutes of recordings.

A Māori Text-to-Speech AI Model

Building a text-to-speech system generally takes one of two approaches to data input. The first is character-based, where raw letters are passed directly to the model. The second is phoneme-based, where text is first converted into a phonetic representation, or a description of how each word sounds, before training begins.

“We tried both, but the phoneme approach was far better,” says Eng. “Giving the model phoneme rules off the bat was like a head start.” Phonemes effectively tell the model what certain groups of letters sound like, “which lets you skip some of the learning,” he says. To provide the model with phoneme rules, the researchers used an open-source tool called eSpeak-NG, which includes a beta Māori ruleset that they adapted further.
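The difference between the two input styles can be sketched in a few lines of Python. This is a toy illustration only: the rule table below is an assumption made for the example, not the eSpeak-NG Māori ruleset, and the function names are hypothetical. It maps the digraph “wh” to “f” (as described earlier), “ng” to the single sound “ŋ”, and marks long vowels with “ː”.

```python
import unicodedata

# Illustrative rules only; a real ruleset is far more detailed.
DIGRAPHS = {"wh": "f", "ng": "ŋ"}
LONG_VOWELS = {"ā": "aː", "ē": "eː", "ī": "iː", "ō": "oː", "ū": "uː"}


def to_characters(word: str) -> list[str]:
    """Character-based input: the model sees raw letters."""
    return list(unicodedata.normalize("NFC", word.lower()))


def to_phonemes(word: str) -> list[str]:
    """Phoneme-style input: digraphs and long vowels become single units."""
    text = unicodedata.normalize("NFC", word.lower())
    phonemes = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # Two letters, one sound: consume both.
            phonemes.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = text[i]
            phonemes.append(LONG_VOWELS.get(ch, ch))
            i += 1
    return phonemes


print(to_characters("whānau"))  # ['w', 'h', 'ā', 'n', 'a', 'u']
print(to_phonemes("whānau"))    # ['f', 'aː', 'n', 'a', 'u']
```

In the character-based form, the model has to discover for itself that “w” followed by “h” is one sound. In the phoneme-style form, that knowledge is supplied up front, which is the head start Eng describes.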

Eng tested three open-source neural architectures (Matcha-TTS, Tacotron2, and Piper) for turning the recordings into a synthetic voice. Piper, which can run offline on a local machine, produced the best results and was chosen for the final build.

Although it was trained on under eight hours of good-quality recordings, considerably less than the hundreds of hours typically suggested for training a text-to-speech model, the final AI voice was effective. The primary metric used in text-to-speech research is word error rate, in which a lower percentage indicates higher accuracy. Keegan and Eng’s AI voice achieved an error rate of 6.78 percent, considered “good” by current industry standards.
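Word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn one transcript into another, divided by the number of words in the reference text. The article does not describe how the Waikato team computed its figure; one common setup is to transcribe the synthesized audio and compare the transcript with the text that was fed to the model. The Python sketch below shows only the generic calculation, with a hypothetical function name and made-up example sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One word of four is missing from the transcript: 1 / 4 = 0.25.
print(word_error_rate("kei te pai ahau", "kei te ahau"))  # 0.25
```

On this scale, the reported 6.78 percent corresponds to a value of 0.0678, or roughly seven word errors for every hundred words of reference text.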

Throughout the development process, a professional Māori language evaluator assessed the voice, rating it in terms of its naturalness, pronunciation accuracy, and expressiveness.

The researchers also invited 68 fluent speakers of te reo Māori to listen to both human and synthesized audio, and asked them to identify which was which. The listeners correctly identified the voices 65 percent of the time. “We were happy with that because some of the listeners were family members of the speaker—they know her voice really well, but a few still got it wrong,” says Keegan.

Māori AI Sovereignty

While Google provided some funding to the Waikato team, Keegan says it came with no conditions attached and no ownership stake claimed. “They said, we’ve heard about your work with preserving languages, and we wanted to support you. Use the grant whatever way you want.” Ultimately, he says, it allowed them to fairly compensate Katipa for her work.

With the tool now ready for use, the question of ownership remains front of mind for Keegan. From a standard intellectual property perspective, the voice belongs to Katipa. From a Māori perspective, Keegan says, it belongs to the collective: “It’s a treasure that’s been handed down through her ancestors; and her role is to protect it for her children and her grandchildren.”

So rather than release the voice model publicly, Keegan is in discussion with the three iwi (tribes) that Katipa affiliates with—Waikato, Maniapoto, and Raukawa. “Guardianship of this needs to sit with them,” Keegan says, “rather than the university.”

To that end, Keegan found a Wellington-based company, Catalyst IT, that gifted website hosting and the computing power needed to run the voice model for a year.

Data sovereignty is a rapidly growing focus in indigenous AI communities. Te Hiku Media, a Māori media organization in New Zealand’s far north, developed an automatic speech recognition system that achieves 92 percent accuracy for te reo Māori and 82 percent accuracy for bilingual speech. The organization released the model under a Kaitiakitanga licence, a legal instrument stipulating that data can only be used for the benefit of the Māori people.

Elsewhere in the world, the Aina project at the Barcelona Supercomputing Center released Matxa, a multi-dialect Catalan text-to-speech system also built on open-source architectures. In Quebec, Michael Running Wolf leads the First Languages AI Reality (FLAIR) initiative, which is working to build speech recognition models for Indigenous languages across North America.

Voice-driven technologies, such as virtual assistants, screen readers, navigation systems, and smart devices, are ubiquitous. For Keegan, these tools can either be a way to “sanitize and colonize our language” or a means to “empower my moko [grandchildren] with their traditional knowledge.” The difference, he says, comes from who develops and owns the technology. “I want my grandchildren and my great-grandchildren to access our knowledge through our own systems. This voice is the first step in achieving that.”

Longer term, his ambition is to use the same open-source, community-owned methodology to build full language models. “It won’t be a te reo Māori large language model,” he says. “It’ll be a Maniapoto large language model, a Tūhoe large language model, et cetera.” Each model would be owned by, and trained on the speech of, the people whose language it speaks.

While that’s a more significant engineering challenge than a text-to-speech system, the Waikato project demonstrates that the necessary infrastructure already exists—efficient training on minimal data, phoneme-based input, open-source tools, and a legal and governance framework for community ownership. “We’ve laid a template so that other iwi throughout the country can do the same thing,” says Keegan. “I am happy to help them do it.”
