When a student creates a practice plan, they pick one of three modalities:
The modality only controls what the student sees during practice. Both scores (the accuracy score and the perception score) are always computed and stored on the back end, so progress data is comparable across students regardless of how each student configured their plan.
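For illustration only, a stored per-attempt record might look roughly like the sketch below; the field names are hypothetical, not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class AttemptResult:
    """Hypothetical shape of a stored attempt result; the real schema may differ."""
    student_id: str
    item_id: str
    accuracy: float    # 0-100, always computed and stored
    perception: float  # 0-100, always computed and stored
    modality: str      # which score(s) the student saw during practice; display-only
```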
Practice is done in a web browser with microphone access. There is no app to install.
Each calendar day the student is given a single assignment of ten short conversational items, three to twelve words each. Roughly half are items they have struggled with in past sessions and half are new to them. The assignment is materialized once per day per adopted voice; if the student signs in again later that day they continue the same list rather than getting a fresh one.
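A minimal sketch of that once-per-day materialization, assuming hypothetical storage helpers (`find_assignment`, `struggled_items`, `fresh_items`, `save_assignment`); the real selection logic may differ.

```python
import datetime
import random

def get_daily_assignment(store, student_id, voice_id, size=10):
    """Return today's assignment, materializing it at most once per (student, voice, day)."""
    today = datetime.date.today()
    existing = store.find_assignment(student_id, voice_id, today)  # hypothetical lookup
    if existing is not None:
        return existing  # later sign-ins the same day continue this list

    # Roughly half previously-struggled items, half unseen ones (pools assumed large enough).
    half = size // 2
    items = (random.sample(store.struggled_items(student_id, voice_id), half)
             + random.sample(store.fresh_items(student_id, voice_id), size - half))
    random.shuffle(items)
    return store.save_assignment(student_id, voice_id, today, items)
```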
For each item the student listens to a reference rendition, records their own attempt, and sees an immediate scored readout. They get up to three attempts per item (the initial recording plus two retries), and the score sent to the dashboard is the best of the three, ranked by overall accuracy first with the perception score as a tiebreak. Students are expected to use all three attempts unless their accuracy score reaches 70% or higher, or their perception score lands in the top level for the direction they are training toward; in either case they may advance early. A complete ten-item assignment typically takes around five to fifteen minutes, depending on how many retries the student uses per item.
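The best-of-three and advance-early rules reduce to something like the following sketch; the `Attempt` type and the top-perception cutoff are stand-ins for illustration.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    accuracy: float    # 0-100 accuracy score
    perception: float  # 0-100 perception score

def best_attempt(attempts: list[Attempt]) -> Attempt:
    """Score reported to the dashboard: best accuracy, perception breaks ties."""
    return max(attempts, key=lambda a: (a.accuracy, a.perception))

def may_advance_early(attempt: Attempt, top_perception_cutoff: float) -> bool:
    """Advance before using all three attempts once either bar is met.
    top_perception_cutoff stands in for the top level of the perception score
    for the direction the student is training toward (assumed value)."""
    return attempt.accuracy >= 70 or attempt.perception >= top_perception_cutoff
```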
All accuracy scores are 0–100, clamped. Contour comparisons are first phase-resampled onto a shared 100-point grid, and the per-point score is averaged over the grid. Scalar comparisons are evaluated on per-utterance means or medians.
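All of the accuracy formulas below share two primitives: a tolerance-band penalty (full credit inside a tolerance, then a linear deduction per unit of excess, clamped to 0–100) and a point-by-point comparison on the shared grid. The sketch below shows both, using numpy; the helper names are illustrative, not the pipeline's.

```python
import numpy as np

def band_score(diff, tolerance, per_point):
    """100 minus a linear penalty for |diff| beyond `tolerance`, clamped to 0-100."""
    excess = max(0.0, abs(diff) - tolerance)
    return float(np.clip(100.0 - excess / per_point, 0.0, 100.0))

def resample_contour(values, n=100):
    """Phase-resample a contour onto a shared n-point grid over the utterance."""
    values = np.asarray(values, dtype=float)
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(values)), values)

def contour_score(user, target, tolerance, per_point, diff=lambda u, t: abs(u - t)):
    """Average the per-point band score over the shared 100-point grid."""
    u, t = resample_contour(user), resample_contour(target)
    return float(np.mean([band_score(diff(ui, ti), tolerance, per_point)
                          for ui, ti in zip(u, t)]))
```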
Pitch (scalar, median F0):
pitch = 100 − max(0, |1200·log₂(user / target)| − 50) / 2

Resonance (per-formant percent contour, then weighted mean):
resonance_fk = mean over phase i:
100 − max(0, |uᵢ − tᵢ| / |tᵢ| · 100 − 5) / 1
resonance = 0.2·resonance_f1 + 0.4·resonance_f2 + 0.4·resonance_f3

Texture (composite over CPP / H1–H2 / H1–A3 means):
texture_cpp = 100 − max(0, |user − target| − 2) / 0.2
texture_h1h2 = 100 − max(0, |user − target| − 2) / 0.2
texture_h1a3 = 100 − max(0, |user − target| − 3) / 0.3
texture = 0.20·texture_cpp + 0.40·texture_h1h2 + 0.40·texture_h1a3

Prosody (composite over F0 contour, intensity contour, F0 std):
prosody_f0 = mean over phase i:
100 − max(0, |1200·log₂(uᵢ / tᵢ)| − 50) / 2
prosody_int = mean over phase i:
100 − max(0, |z(uᵢ) − z(tᵢ)| − 0.2) / 0.01
prosody_std = 100 − max(0, |user − target| / |target| · 100 − 20) / 1
prosody = 0.40·prosody_f0 + 0.30·prosody_int + 0.30·prosody_std

Overall (weighted mean capped by a weighted geometric mean):
weighted = 0.40·resonance + 0.30·pitch + 0.15·texture + 0.15·prosody
ceiling = pitch^0.35 · resonance^0.65
overall = min(weighted, ceiling)
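Read as code, the accuracy score is roughly the sketch below, reusing the `band_score` and `contour_score` helpers sketched earlier; the feature keys are illustrative, and extraction, voicing detection, and error handling are omitted.

```python
import numpy as np

def accuracy_scores(user, target):
    """Accuracy sub-scores and overall score from already-extracted features.
    `user` / `target` are dicts of features; the keys are illustrative."""
    # Pitch: median F0 compared in cents, 50-cent tolerance, 2 cents per point.
    cents = abs(1200.0 * np.log2(user["f0_median"] / target["f0_median"]))
    pitch = band_score(cents, tolerance=50, per_point=2)

    # Resonance: percent-difference contours for F1-F3, then weighted mean.
    def formant(k):
        return contour_score(user[f"f{k}"], target[f"f{k}"], tolerance=5, per_point=1,
                             diff=lambda u, t: abs(u - t) / abs(t) * 100.0)
    resonance = 0.2 * formant(1) + 0.4 * formant(2) + 0.4 * formant(3)

    # Texture: utterance means of CPP, H1-H2, H1-A3.
    texture = (0.20 * band_score(user["cpp"] - target["cpp"], 2, 0.2)
               + 0.40 * band_score(user["h1h2"] - target["h1h2"], 2, 0.2)
               + 0.40 * band_score(user["h1a3"] - target["h1a3"], 3, 0.3))

    # Prosody: F0 contour in cents, z-scored intensity contour, F0 std in percent.
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()
    prosody_f0 = contour_score(user["f0"], target["f0"], tolerance=50, per_point=2,
                               diff=lambda u, t: abs(1200.0 * np.log2(u / t)))
    prosody_int = contour_score(zscore(user["intensity"]), zscore(target["intensity"]),
                                tolerance=0.2, per_point=0.01)
    prosody_std = band_score(abs(user["f0_std"] - target["f0_std"])
                             / abs(target["f0_std"]) * 100.0, tolerance=20, per_point=1)
    prosody = 0.40 * prosody_f0 + 0.30 * prosody_int + 0.30 * prosody_std

    # Overall: weighted mean, capped by a pitch/resonance weighted geometric mean.
    weighted = 0.40 * resonance + 0.30 * pitch + 0.15 * texture + 0.15 * prosody
    ceiling = (pitch ** 0.35) * (resonance ** 0.65)
    return {"pitch": pitch, "resonance": resonance, "texture": texture,
            "prosody": prosody, "overall": min(weighted, ceiling)}
```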
For the perception score, all sub-scores are again 0–100 and clamped. Each input is mapped through a logistic against masculine and feminine anchor norms drawn from the published voice-modification literature, then combined.

Logistic helper (auto-inverts when feminine < masculine):
σ(x; masculine, feminine) = 100 / (1 + exp( −6·(x − ½(masculine+feminine)) / (feminine − masculine) ))

Pitch femininity (plateau outside 100–200 Hz):
pitch_fem = 0 if F0 ≤ 100 Hz
100 if F0 ≥ 200 Hz
100 / (1 + exp(−0.1·(F0 − 155))) otherwise

Resonance femininity (F1 excluded):
res_fem = 0.45·σ(F2; 1500, 1700) + 0.55·σ(F3; 2500, 2900)

Texture femininity (CPP norms inverted: lower CPP = more feminine):
tex_fem = 0.40·σ(H1−H2; 1, 6)
+ 0.40·σ(H1−A3; 20, 35)
+ 0.20·σ(CPP; 20, 14)

Combination + cap (prosody intentionally excluded):
weighted = 0.40·pitch_fem + 0.50·res_fem + 0.10·tex_fem
ceiling = pitch_fem^0.35 · res_fem^0.65
perception = min(weighted, ceiling)

The anchor norms are explicit approximations and will be refined as the population of recorded attempts grows.
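A matching sketch of the perception score, with the same caveats: the feature keys are illustrative and the anchors are the approximate norms listed above.

```python
import math

def sigma(x, masculine, feminine):
    """Logistic against anchor norms; inverts automatically when feminine < masculine."""
    mid = 0.5 * (masculine + feminine)
    return 100.0 / (1.0 + math.exp(-6.0 * (x - mid) / (feminine - masculine)))

def pitch_femininity(f0):
    """Plateaus at 0 below 100 Hz and at 100 above 200 Hz."""
    if f0 <= 100.0:
        return 0.0
    if f0 >= 200.0:
        return 100.0
    return 100.0 / (1.0 + math.exp(-0.1 * (f0 - 155.0)))

def perception_score(features):
    """Perception (femininity) score from utterance-level features; keys are illustrative."""
    pitch_fem = pitch_femininity(features["f0_median"])
    res_fem = (0.45 * sigma(features["f2_mean"], 1500, 1700)
               + 0.55 * sigma(features["f3_mean"], 2500, 2900))
    tex_fem = (0.40 * sigma(features["h1h2"], 1, 6)
               + 0.40 * sigma(features["h1a3"], 20, 35)
               + 0.20 * sigma(features["cpp"], 20, 14))  # inverted anchors: lower CPP reads more feminine
    weighted = 0.40 * pitch_fem + 0.50 * res_fem + 0.10 * tex_fem
    ceiling = (pitch_fem ** 0.35) * (res_fem ** 0.65)
    return min(weighted, ceiling)
```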
Student recordings are not persisted. The audio body of an analysis request lives only in memory on the analysis service: it is transcoded to a temporary WAV, run through the extraction pipeline, scored, and discarded when the request completes. Only the resulting numerical scores are written back to the database — there is no audio path and no per-frame analysis path stored against student sessions.
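The request lifecycle is roughly the sketch below, reusing the scoring sketches above; the transcoder and extraction pipeline are passed in as stand-in callables, and framework wiring is omitted. The point is that audio exists only as the in-memory body and a temporary WAV that is gone before the response returns.

```python
import tempfile

def handle_analysis_request(audio_bytes, target_features, db, transcode_to_wav, extract_features):
    """Score one attempt; no audio or per-frame analysis is ever persisted."""
    # The request body is transcoded to a temporary WAV for the extraction pipeline.
    with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
        wav.write(transcode_to_wav(audio_bytes))    # injected transcode helper
        wav.flush()
        user_features = extract_features(wav.name)  # injected extraction pipeline
    # The temporary WAV is deleted when the `with` block exits.
    scores = {
        "accuracy": accuracy_scores(user_features, target_features)["overall"],
        "perception": perception_score(user_features),
    }
    db.save_scores(scores)  # only numerical scores are written back
    return scores
```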
The microphone is only read inside an active practice session, and only after the student explicitly clicks the record button on a given item. The browser’s standard microphone permission applies; the page does not request or hold any background audio access.
Catalog reference audio is the one exception: it is stored, because we have to play it back during practice. That audio is not student-generated.