Your voice → a Hawaiian computer voice. Here's the whole path.
Open record.html in any browser. 144 short lines — greetings, place names, everyday phrases. About 5 minutes. Each line is one clip, and the text is attached automatically — nothing can get mismatched. No software to install, no studio needed. A quiet room is enough.
When you're done, the recorder bundles everything into a .zip file. Upload it on your speaker page, or email it to [email protected]. That's it from your side.
Each clip goes through automated checks: Is it loud enough? Too loud (clipping)? The right length? Does the audio contain only Hawaiian sounds — no background noise, no English? Any clip that doesn't pass gets flagged. We don't train on bad audio.
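Here is a minimal sketch of what the loudness, clipping, and length checks could look like. The function name `check_clip` and every threshold in it are illustrative assumptions, not the project's real settings, and the Hawaiian-only check (noise and language detection) is left out because it needs a trained model.

```python
import math

def check_clip(samples, sample_rate, min_rms_db=-35.0, clip_level=0.999,
               max_clipped_fraction=0.001, min_seconds=0.5, max_seconds=10.0):
    """Return a list of reasons to flag a clip (an empty list means it passes).

    `samples` is a sequence of floats in [-1.0, 1.0].
    All thresholds are assumed values chosen for illustration.
    """
    if not samples:
        return ["empty clip"]

    problems = []

    # The right length?
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        problems.append("too short")
    elif duration > max_seconds:
        problems.append("too long")

    # Loud enough? Compare the RMS level (in dB relative to full scale).
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    rms_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    if rms_db < min_rms_db:
        problems.append("too quiet")

    # Too loud? Count samples sitting at (or next to) full scale.
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    if clipped / len(samples) > max_clipped_fraction:
        problems.append("clipping")

    return problems
```

A clip that returns an empty list would move on to the next stage; anything else would be flagged with the listed reasons.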
All clips are resampled to 22,050 Hz mono, a common sample rate for speech models. Volume is leveled so no line is quieter than another. The result is a clean, consistent dataset in which every clip shares the same format and loudness.
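The normalization step can be sketched in a few small functions. This is only an illustration under stated assumptions: the helper names, the -20 dB target level, and the peak limit are made up for the example, and the linear-interpolation resampler has no anti-aliasing filter, so a real pipeline would use a proper audio library instead.

```python
import math

TARGET_RATE = 22050  # Hz, the rate named in the text

def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]

def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def level_rms(samples, target_db=-20.0, peak_limit=0.99):
    """Scale a clip to an assumed target RMS level without exceeding peak_limit."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)                   # silence: nothing to scale
    gain = 10 ** (target_db / 20) / rms
    peak = max(abs(s) for s in samples)
    gain = min(gain, peak_limit / peak)        # never push peaks into clipping
    return [s * gain for s in samples]
```

Chained together, a stereo 44,100 Hz recording would go through `to_mono`, then `resample_linear`, then `level_rms`, and come out in the same format and at the same level as every other clip.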
We use VITS, a neural text-to-speech architecture that learns to turn written Hawaiian into spoken audio. The model sees each clip paired with its exact transcript. Over thousands of training steps, it learns: where the text has an ʻokina, the audio has a brief catch (a glottal stop). Where a vowel carries a kahakō, the vowel is held longer. The voice that emerges carries the quality of whoever recorded it, which is why fluent speakers matter.
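Because the model learns from the exact characters in each transcript, the text has to be encoded consistently before training. The sketch below shows one way that could be done; the function names and the list of look-alike characters are assumptions, and the pipe-separated `path|text` line is modeled on the filelists used by the reference VITS code rather than confirmed as this project's format. It also assumes transcripts contain no English apostrophes or quoted speech.

```python
import unicodedata

OKINA = "\u02bb"  # ʻ  MODIFIER LETTER TURNED COMMA

# Characters often typed in place of the ʻokina (assumed, non-exhaustive).
OKINA_LOOKALIKES = {"'": OKINA, "\u2018": OKINA, "`": OKINA}

def normalize_transcript(text):
    """Give every transcript one consistent encoding.

    NFC turns a vowel followed by a combining macron into the single
    precomposed kahakō letter, and look-alike marks become a real ʻokina.
    """
    text = unicodedata.normalize("NFC", text)
    return text.translate(str.maketrans(OKINA_LOOKALIKES)).strip()

def filelist_line(wav_path, text):
    """Build one 'path|transcript' training line (assumed layout)."""
    return f"{wav_path}|{normalize_transcript(text)}"
```

For example, `filelist_line("clips/001.wav", "Hawai'i")` would yield a line whose transcript uses the true ʻokina character, so the model always sees the same symbol for the same sound.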
A fluent Hawaiian speaker listens to samples of the trained model's output and judges: does it sound right? Are the ʻokina and kahakō in the correct places? Is the prosody natural? The model is iteratively improved until it passes human review.
The finished voice is released openly — free for schools, families, apps, and any organization serving ʻōlelo Hawaiʻi. Your name appears in the credits. The voice you helped create becomes part of the infrastructure of the language.
Many TTS projects fail because the audio doesn't match the transcripts: a missing ʻokina in the text teaches the model to drop it. We learned this the hard way in our pilot. The recorder page fixes it structurally: one line = one clip = one transcript, so the text always matches the audio. Everything after that (validation, normalization, training) is built on that foundation.
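The one-clip-per-line rule is easy to verify mechanically. Here is a small sketch of such a check; the function name, the file names, and the idea of a filename-to-text mapping are hypothetical, since the actual contents of the recorder's .zip are not described here.

```python
def pairing_errors(clip_names, transcripts):
    """Report every break in the one-clip-one-transcript rule.

    clip_names:  iterable of audio filenames found in an upload
    transcripts: dict mapping filename -> transcript text
    Returns a list of human-readable problems (empty list means all paired).
    """
    clips = set(clip_names)
    texts = set(transcripts)
    errors = []
    for name in sorted(clips - texts):
        errors.append(f"{name}: clip has no transcript")
    for name in sorted(texts - clips):
        errors.append(f"{name}: transcript has no clip")
    for name in sorted(clips & texts):
        if not transcripts[name].strip():
            errors.append(f"{name}: empty transcript")
    return errors
```

An upload would only move on to validation when this list comes back empty.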
18–24 weeks from first recording to a working voice.