When a student creates a practice plan, they pick one of three modalities:
The modality only controls what the student sees during practice. Both scores (the accuracy score and the perception score) are always computed and stored on the back end, so progress data is comparable across students regardless of how each student configured their plan.
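For illustration only, a stored per-attempt record might look roughly like the sketch below; the field names are hypothetical, not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class AttemptResult:
    """Hypothetical shape of a stored attempt result; the real schema may differ."""
    student_id: str
    item_id: str
    accuracy: float    # 0-100, always computed and stored
    perception: float  # 0-100, always computed and stored
    modality: str      # which score(s) the student saw during practice; display-only
```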
Practice is done in a web browser with microphone access. There is no app to install.
Each calendar day the student is given a single assignment of ten short conversational items, three to twelve words each. Roughly half are items they have struggled with in past sessions and half are new to them. The assignment is materialized once per day per adopted voice; if the student signs in again later that day they continue the same list rather than getting a fresh one.
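A minimal sketch of that once-per-day materialization, assuming hypothetical storage helpers (`find_assignment`, `struggled_items`, `fresh_items`, `save_assignment`); the real selection logic may differ.

```python
import datetime
import random

def get_daily_assignment(store, student_id, voice_id, size=10):
    """Return today's assignment, materializing it at most once per (student, voice, day)."""
    today = datetime.date.today()
    existing = store.find_assignment(student_id, voice_id, today)  # hypothetical lookup
    if existing is not None:
        return existing  # later sign-ins the same day continue this list

    # Roughly half previously-struggled items, half unseen ones (pools assumed large enough).
    half = size // 2
    items = (random.sample(store.struggled_items(student_id, voice_id), half)
             + random.sample(store.fresh_items(student_id, voice_id), size - half))
    random.shuffle(items)
    return store.save_assignment(student_id, voice_id, today, items)
```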
For each item the student listens to a reference rendition, records their own attempt, and sees an immediate scored readout. They get up to three attempts per item (the initial recording plus two retries), and the score sent to the dashboard is the best of the three, ranked by overall accuracy first with the perception score as a tiebreak. Students are expected to use all three attempts unless their accuracy score reaches 70% or higher, or their perception score lands in the top level for the direction they are training toward; in either case they may advance early. A complete ten-item assignment typically takes around five to fifteen minutes, depending on how many retries the student uses per item.
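The best-of-three and advance-early rules reduce to something like the following sketch; the `Attempt` type and the top-perception cutoff are stand-ins for illustration.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    accuracy: float    # 0-100 accuracy score
    perception: float  # 0-100 perception score

def best_attempt(attempts: list[Attempt]) -> Attempt:
    """Score reported to the dashboard: best accuracy, perception breaks ties."""
    return max(attempts, key=lambda a: (a.accuracy, a.perception))

def may_advance_early(attempt: Attempt, top_perception_cutoff: float) -> bool:
    """Advance before using all three attempts once either bar is met.
    top_perception_cutoff stands in for the top level of the perception score
    for the direction the student is training toward (assumed value)."""
    return attempt.accuracy >= 70 or attempt.perception >= top_perception_cutoff
```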
All accuracy scores are 0–100, clamped. Contour comparisons are first phase-resampled onto a shared 100-point grid, and the per-point score is averaged over the grid. Scalar comparisons are evaluated on per-utterance means or medians.
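All of the accuracy formulas below share two primitives: a tolerance-band penalty (full credit inside a tolerance, then a linear deduction per unit of excess, clamped to 0–100) and a point-by-point comparison on the shared grid. The sketch below shows both, using numpy; the helper names are illustrative, not the pipeline's.

```python
import numpy as np

def band_score(diff, tolerance, per_point):
    """100 minus a linear penalty for |diff| beyond `tolerance`, clamped to 0-100."""
    excess = max(0.0, abs(diff) - tolerance)
    return float(np.clip(100.0 - excess / per_point, 0.0, 100.0))

def resample_contour(values, n=100):
    """Phase-resample a contour onto a shared n-point grid over the utterance."""
    values = np.asarray(values, dtype=float)
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(values)), values)

def contour_score(user, target, tolerance, per_point, diff=lambda u, t: abs(u - t)):
    """Average the per-point band score over the shared 100-point grid."""
    u, t = resample_contour(user), resample_contour(target)
    return float(np.mean([band_score(diff(ui, ti), tolerance, per_point)
                          for ui, ti in zip(u, t)]))
```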
Pitch (scalar, median F0):
pitch = 100 − max(0, |1200·log₂(user / target)| − 50) / 2

Resonance (per-formant percent contour, then weighted mean):
resonance_fk = mean over phase i:
100 − max(0, |uᵢ − tᵢ| / |tᵢ| · 100 − 5) / 1
resonance = 0.2·resonance_f1 + 0.4·resonance_f2 + 0.4·resonance_f3

Texture (composite over CPP / H1–H2 / H1–A3 means):
texture_cpp = 100 − max(0, |user − target| − 2) / 0.2
texture_h1h2 = 100 − max(0, |user − target| − 2) / 0.2
texture_h1a3 = 100 − max(0, |user − target| − 3) / 0.3
texture = 0.20·texture_cpp + 0.40·texture_h1h2 + 0.40·texture_h1a3

Prosody (composite over F0 contour, intensity contour, F0 std):
prosody_f0 = mean over phase i:
100 − max(0, |1200·log₂(uᵢ / tᵢ)| − 50) / 2
prosody_int = mean over phase i:
100 − max(0, |z(uᵢ) − z(tᵢ)| − 0.2) / 0.01
prosody_std = 100 − max(0, |user − target| / |target| · 100 − 20) / 1
prosody = 0.40·prosody_f0 + 0.30·prosody_int + 0.30·prosody_std

Overall (weighted mean capped by a weighted geometric mean):
weighted = 0.40·resonance + 0.30·pitch + 0.15·texture + 0.15·prosody
ceiling = pitch^0.35 · resonance^0.65
overall = min(weighted, ceiling)
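Read as code, the accuracy score is roughly the sketch below, reusing the `band_score` and `contour_score` helpers sketched earlier; the feature keys are illustrative, and extraction, voicing detection, and error handling are omitted.

```python
import numpy as np

def accuracy_scores(user, target):
    """Accuracy sub-scores and overall score from already-extracted features.
    `user` / `target` are dicts of features; the keys are illustrative."""
    # Pitch: median F0 compared in cents, 50-cent tolerance, 2 cents per point.
    cents = abs(1200.0 * np.log2(user["f0_median"] / target["f0_median"]))
    pitch = band_score(cents, tolerance=50, per_point=2)

    # Resonance: percent-difference contours for F1-F3, then weighted mean.
    def formant(k):
        return contour_score(user[f"f{k}"], target[f"f{k}"], tolerance=5, per_point=1,
                             diff=lambda u, t: abs(u - t) / abs(t) * 100.0)
    resonance = 0.2 * formant(1) + 0.4 * formant(2) + 0.4 * formant(3)

    # Texture: utterance means of CPP, H1-H2, H1-A3.
    texture = (0.20 * band_score(user["cpp"] - target["cpp"], 2, 0.2)
               + 0.40 * band_score(user["h1h2"] - target["h1h2"], 2, 0.2)
               + 0.40 * band_score(user["h1a3"] - target["h1a3"], 3, 0.3))

    # Prosody: F0 contour in cents, z-scored intensity contour, F0 std in percent.
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()
    prosody_f0 = contour_score(user["f0"], target["f0"], tolerance=50, per_point=2,
                               diff=lambda u, t: abs(1200.0 * np.log2(u / t)))
    prosody_int = contour_score(zscore(user["intensity"]), zscore(target["intensity"]),
                                tolerance=0.2, per_point=0.01)
    prosody_std = band_score(abs(user["f0_std"] - target["f0_std"])
                             / abs(target["f0_std"]) * 100.0, tolerance=20, per_point=1)
    prosody = 0.40 * prosody_f0 + 0.30 * prosody_int + 0.30 * prosody_std

    # Overall: weighted mean, capped by a pitch/resonance weighted geometric mean.
    weighted = 0.40 * resonance + 0.30 * pitch + 0.15 * texture + 0.15 * prosody
    ceiling = (pitch ** 0.35) * (resonance ** 0.65)
    return {"pitch": pitch, "resonance": resonance, "texture": texture,
            "prosody": prosody, "overall": min(weighted, ceiling)}
```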
For the perception score, all sub-scores are again 0–100 and clamped. Each input is mapped through a logistic against masculine and feminine anchor norms drawn from the published voice-modification literature, then combined.

Logistic helper (auto-inverts when feminine < masculine):
σ(x; masculine, feminine) = 100 / (1 + exp( −6·(x − ½(masculine+feminine)) / (feminine − masculine) ))

Pitch femininity (plateau outside 100–200 Hz):
pitch_fem = 0 if F0 ≤ 100 Hz
100 if F0 ≥ 200 Hz
100 / (1 + exp(−0.1·(F0 − 155))) otherwise

Resonance femininity (F1 excluded):
res_fem = 0.45·σ(F2; 1500, 1700) + 0.55·σ(F3; 2500, 2900)

Texture femininity (CPP norms inverted: lower CPP = more feminine):
tex_fem = 0.40·σ(H1−H2; 1, 6)
+ 0.40·σ(H1−A3; 20, 35)
+ 0.20·σ(CPP; 20, 14)

Combination + cap (prosody intentionally excluded):
weighted = 0.40·pitch_fem + 0.50·res_fem + 0.10·tex_fem
ceiling = pitch_fem^0.35 · res_fem^0.65
perception = min(weighted, ceiling)

The anchor norms are explicit approximations and will be refined as the population of recorded attempts grows.
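A matching sketch of the perception score, with the same caveats: the feature keys are illustrative and the anchors are the approximate norms listed above.

```python
import math

def sigma(x, masculine, feminine):
    """Logistic against anchor norms; inverts automatically when feminine < masculine."""
    mid = 0.5 * (masculine + feminine)
    return 100.0 / (1.0 + math.exp(-6.0 * (x - mid) / (feminine - masculine)))

def pitch_femininity(f0):
    """Plateaus at 0 below 100 Hz and at 100 above 200 Hz."""
    if f0 <= 100.0:
        return 0.0
    if f0 >= 200.0:
        return 100.0
    return 100.0 / (1.0 + math.exp(-0.1 * (f0 - 155.0)))

def perception_score(features):
    """Perception (femininity) score from utterance-level features; keys are illustrative."""
    pitch_fem = pitch_femininity(features["f0_median"])
    res_fem = (0.45 * sigma(features["f2_mean"], 1500, 1700)
               + 0.55 * sigma(features["f3_mean"], 2500, 2900))
    tex_fem = (0.40 * sigma(features["h1h2"], 1, 6)
               + 0.40 * sigma(features["h1a3"], 20, 35)
               + 0.20 * sigma(features["cpp"], 20, 14))  # inverted anchors: lower CPP reads more feminine
    weighted = 0.40 * pitch_fem + 0.50 * res_fem + 0.10 * tex_fem
    ceiling = (pitch_fem ** 0.35) * (res_fem ** 0.65)
    return min(weighted, ceiling)
```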
Student recordings are not persisted. The audio body of an analysis request lives only in memory on the analysis service: it is transcoded to a temporary WAV, run through the extraction pipeline, scored, and discarded when the request completes. Only the resulting numerical scores are written back to the database — there is no audio path and no per-frame analysis path stored against student sessions.
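The request lifecycle is roughly the sketch below, reusing the scoring sketches above; the transcoder and extraction pipeline are passed in as stand-in callables, and framework wiring is omitted. The point is that audio exists only as the in-memory body and a temporary WAV that is gone before the response returns.

```python
import tempfile

def handle_analysis_request(audio_bytes, target_features, db, transcode_to_wav, extract_features):
    """Score one attempt; no audio or per-frame analysis is ever persisted."""
    # The request body is transcoded to a temporary WAV for the extraction pipeline.
    with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
        wav.write(transcode_to_wav(audio_bytes))    # injected transcode helper
        wav.flush()
        user_features = extract_features(wav.name)  # injected extraction pipeline
    # The temporary WAV is deleted when the `with` block exits.
    scores = {
        "accuracy": accuracy_scores(user_features, target_features)["overall"],
        "perception": perception_score(user_features),
    }
    db.save_scores(scores)  # only numerical scores are written back
    return scores
```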
The microphone is only read inside an active practice session, and only after the student explicitly clicks the record button on a given item. The browser’s standard microphone permission applies; the page does not request or hold any background audio access.
Catalog reference audio is the one exception: it is stored, because we have to play it back during practice. That audio is not student-generated.