Abstract
Speech editing traditionally focuses on word-level content replacement, limiting its flexibility in realistic scenarios that require modifying fine-grained pronunciation, or even the speaker identity and emotion. We extend speech editing to Speech Attribute Editing (SAE), which treats content, speaker, and emotion as editable attributes within a unified framework. We propose UniSAE, a two-stage architecture that disentangles content editing from acoustic attribute rendering. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that explicitly models phoneme identity, pronunciation variants, and duration, enabling editing at multiple granularities. While DPPG tokens support explicit phoneme- and sub-phoneme-level editing, an autoregressive content model predicts edited DPPG sequences for indirect modifications such as word-level editing. A diffusion-based acoustic decoder then synthesizes speech from the content sequence, conditioned on disentangled speaker and emotion representations. To facilitate robust attribute disentanglement, we further construct UniEditCorpus, a large-scale synthetic emotional speech corpus with counterfactual supervision. Experiments demonstrate effective and composable control of content, speaker identity, and emotion, while supporting fine-grained phonetic editing within a unified framework.
Demo Index
Emotion + Speaker Editing
All prompts are from our proposed UniEditCorpus.
Attributes highlighted in the blue boxes are the information to be extracted from each prompt and preserved in the generated results.
1. Seen-S:
Prompts from the Seen-S test set. Speakers seen, text unseen, emotions seen in the training set.
1.1 Target Emotion: neutral
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | This would provide a significant and welcome boost for local employment. | JLcorpus_female2 | Happy | |
| Speaker Prompt | He is not a pathetic figure. | ESD_0016 | Neutral | |
| Emotion Prompt | She hopes to study in Britain one day. | MEAD_M039 | Neutral |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
1.2 Target Emotion: happy
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | I didn't have the reply. | RAVDESS_15 | Neutral | |
| Speaker Prompt | I can't remember a lot of it. | TESS_YAF | Neutral | |
| Emotion Prompt | It is, however, an unlikely outcome. | MEAD_W018 | Happy |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
1.3 Target Emotion: sad
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | They never gave up hope. | RAVDESS_13 | Angry | |
| Speaker Prompt | Mr Kennedy is not a candidate for the Scottish Parliament. | TESS_YAF | Neutral | |
| Emotion Prompt | Five years ago its return was rejected. | MEAD_M028 | Sad |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
1.4 Target Emotion: angry
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | I didn't have the reply. | MEAD_M005 | Sad | |
| Speaker Prompt | Head injuries are the leading cause of death. | ESD_0013 | Neutral | |
| Emotion Prompt | Who will perform? | MEAD_M022 | Angry |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
1.5 Target Emotion: surprised
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | Meetings will also remain private. | RAVDESS_7 | Angry | |
| Speaker Prompt | They had asked not to be named. | ESD_0016 | Neutral | |
| Emotion Prompt | That decision will be made over the next couple of days. | JLcorpus_female2 | Surprised |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
2. Unseen-S:
Prompts from the Unseen-S test set. Speakers unseen, text unseen, emotions seen in the training set.
2.1 Target Emotion: neutral
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | I am not concerned with protection of the public. | MEAD_M025 | Sad | |
| Speaker Prompt | We never said that we would walk through the third division. | ESD_0019 | Neutral | |
| Emotion Prompt | Having guidelines in advance is helpful. | RAVDESS_1 | Neutral |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
2.2 Target Emotion: happy
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | This would provide a significant and welcome boost for local employment. | MEAD_W016 | Neutral | |
| Speaker Prompt | Mr Kennedy is not a candidate for the Scottish Parliament. | RAVDESS_1 | Neutral | |
| Emotion Prompt | I'd rather be in our position than Rangers. | MEAD_M023 | Happy |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
2.3 Target Emotion: sad
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | They had asked not to be named. | ESD_0017 | Happy | |
| Speaker Prompt | At the time, he was a living legend. | RAVDESS_2 | Neutral | |
| Emotion Prompt | Five years ago its return was rejected. | ESD_0019 | Sad |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
2.4 Target Emotion: angry
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | There were no proposals on the table. | ESD_0012 | Surprised | |
| Speaker Prompt | I would not count on it. | MEAD_M023 | Neutral | |
| Emotion Prompt | Scrutiny by European Parliament is limited. | RAVDESS_1 | Angry |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
2.5 Target Emotion: surprised
| sample | text | Speaker | Emotion | |
|---|---|---|---|---|
| Source Speech | He has a wealth of experience. | ESD_0017 | Neutral | |
| Speaker Prompt | We never said that we would walk through the third division. | MEAD_W011 | Neutral | |
| Emotion Prompt | I think the referee was good. | RAVDESS_1 | Surprised |
| UniSAE | EmoConv-diff | ZEST |
|---|---|---|
Word-level Content Editing
ESD speech data used.
For UniSAE, the source speech, speaker prompt, and emotion prompt are the same here.
1. Insert
The nine the eggs I keep → The nine already the eggs I keep
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
Clear than clear water → Clear than clear even water
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
2. Delete
Chapter ten a warm welcome → Chapter a warm welcome
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
Clear than clear water → Clear than water
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
3. Substitute
Let's make the noise a snake → Let's make the things a snake
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
At the end of four → At the places of four
| source speech | UniSAE | VoiceCraft | SSR-Speech |
|---|---|---|---|
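
The three word-level operations above (insert, delete, substitute) can be recovered automatically from a source/target transcript pair. The following is a minimal sketch using Python's standard `difflib`; the helper name `edit_ops` is hypothetical and is not part of UniSAE. It only illustrates how an edit span could be located before a content model regenerates it, not how UniSAE actually does so.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def edit_ops(source: str, target: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return the word-level edits that turn `source` into `target`.

    Each item is (operation, source_words, target_words), where operation is
    'insert', 'delete', or 'replace' (a substitution). Unchanged spans are skipped.
    Comparison is case-sensitive and splits on whitespace only.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, src[i1:i2], tgt[j1:j2]))
    return ops

if __name__ == "__main__":
    pairs = [
        ("The nine the eggs I keep", "The nine already the eggs I keep"),
        ("Chapter ten a warm welcome", "Chapter a warm welcome"),
        ("At the end of four", "At the places of four"),
    ]
    for source, target in pairs:
        print(source, "->", target, ":", edit_ops(source, target))
```

In this sketch a substitution is reported as `replace`, following `difflib`'s naming.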
Phoneme- and Sub-Phoneme-level Content Editing
ESD speech data used
1. /n/ to /l/: The nine (<n_0>) the eggs I keep → The line (<l_0>) the eggs I keep
| source speech (sample 1) | UniSAE (sample 1) | UniSAE (sample 2) | UniSAE (sample 3) |
|---|---|---|---|
2. Variant: The nine (<n_0>) the eggs I keep → The ine (<n_4>) the eggs I keep
| source speech (sample 1) | UniSAE (sample 1) | UniSAE (sample 2) | UniSAE (sample 3) |
|---|---|---|---|
3. Duration: The nine the eggs I keep → The niiine the eggs I keep
| source speech | UniSAE |
|---|---|
4. /s/ to /sh/: Not much use is it, sam → Not much use is it, sham
| source speech (sample 1) | UniSAE (sample 1) | UniSAE (sample 2) | UniSAE (sample 3) |
|---|---|---|---|
5. /k/ to /g/: The nine the eggs I keep → The nine the eggs I geep
| source speech (sample 1) | UniSAE (sample 1) | UniSAE (sample 2) | UniSAE (sample 3) |
|---|---|---|---|
6. /aa/ to /ey/: Then we all say aha → Then we all say ahey
| source speech (sample 1) | UniSAE (sample 1) | UniSAE (sample 2) | UniSAE (sample 3) |
|---|---|---|---|
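
The phoneme, variant, and duration edits above can be pictured as simple operations on a token sequence. The sketch below is illustrative only: it assumes tokens are written as `<phoneme_variant>` (as in `<n_0>` and `<n_4>` above) and, as a further assumption not stated on this page, that duration is expressed by repeating a token over consecutive frames. The toy sequence for "nine" and the helper `edit_span` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

TOKEN_RE = re.compile(r"<([a-z]+)_(\d+)>")

def parse(token: str) -> Tuple[str, int]:
    """Split a '<phoneme_variant>' token into (phoneme, variant index)."""
    m = TOKEN_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"not a '<phoneme_variant>' token: {token!r}")
    return m.group(1), int(m.group(2))

def edit_span(tokens: List[str], start: int, end: int,
              phoneme: Optional[str] = None,
              variant: Optional[int] = None,
              repeat: int = 1) -> List[str]:
    """Edit tokens[start:end] and leave the rest of the sequence untouched.

    phoneme: new phoneme identity (phoneme-level edit), or None to keep it.
    variant: new variant index (sub-phoneme edit), or None to keep it.
    repeat:  how many copies of each token to emit (assumed duration edit).
    """
    out = list(tokens[:start])
    for tok in tokens[start:end]:
        ph, var = parse(tok)
        new_tok = f"<{phoneme or ph}_{var if variant is None else variant}>"
        out.extend([new_tok] * repeat)
    out.extend(tokens[end:])
    return out

# Hypothetical frame-level tokens for "nine" (n, ay, n); variant indices are made up.
nine = ["<n_0>", "<n_0>", "<ay_0>", "<ay_0>", "<n_0>"]

line = edit_span(nine, 0, 2, phoneme="l")   # /n/ to /l/ on the first phoneme
variant = edit_span(nine, 0, 2, variant=4)  # <n_0> to <n_4>
longer = edit_span(nine, 2, 4, repeat=2)    # stretch the vowel
```

Only the edited span changes in each case, which is what makes such edits local and composable in this toy setting.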
Joint Editing
ESD speech data used
1. And be with you Tom → [Happy] And be with yooooou Tom
| source speech | speaker prompt | emotion prompt | UniSAE |
|---|---|---|---|