UniSAE: Unified Speech Attribute Editing
on Speaker, Emotion and Low-Level Content
via Discrete Phonetic Posteriorgram Modelling

Abstract

Speech editing traditionally focuses on word-level content replacement, limiting its flexibility in realistic scenarios that require modifying fine-grained pronunciation, or even the speaker identity and emotion. We extend speech editing to Speech Attribute Editing (SAE), which treats content, speaker, and emotion as editable attributes within a unified framework. We propose UniSAE, a two-stage architecture that disentangles content editing from acoustic attribute rendering. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that explicitly models phoneme identity, pronunciation variants, and duration, enabling editing at multiple granularities. While DPPG tokens support explicit phoneme- and sub-phoneme-level editing, an autoregressive content model predicts edited DPPG sequences for indirect modifications such as word-level editing. A diffusion-based acoustic decoder then synthesizes the content sequence into speech conditioned on disentangled speaker and emotion representations. To facilitate robust attribute disentanglement, we further construct UniEditCorpus, a large-scale synthetic emotional speech corpus with counterfactual supervision. Experiments demonstrate effective and composable control of content, speaker identity, and emotion, while supporting fine-grained phonetic editing within a unified framework.

Demo Index

Emotion + Speaker Editing

All prompts are from our proposed UniEditCorpus.

Attributes highlighted in blue boxes are the information to be extracted and preserved in the generated results.

1. Seen-S:

Prompts are from the Seen-S test set: speakers and emotions are seen in the training set, while the text is unseen.
1.1 Target Emotion: neutral
Sample | Text | Speaker | Emotion
Source Speech | This would provide a significant and welcome boost for local employment. | JLcorpus_female2 | Happy
Speaker Prompt | He is not a pathetic figure. | ESD_0016 | Neutral
Emotion Prompt | She hopes to study in Britain one day. | MEAD_M039 | Neutral
UniSAE EmoConv-diff ZEST
1.2 Target Emotion: happy
Sample | Text | Speaker | Emotion
Source Speech | I didn't have the reply. | RAVDESS_15 | Neutral
Speaker Prompt | I can't remember a lot of it. | TESS_YAF | Neutral
Emotion Prompt | It is, however, an unlikely outcome. | MEAD_W018 | Happy
UniSAE EmoConv-diff ZEST
1.3 Target Emotion: sad
Sample | Text | Speaker | Emotion
Source Speech | They never gave up hope. | RAVDESS_13 | Angry
Speaker Prompt | Mr Kennedy is not a candidate for the Scottish Parliament. | TESS_YAF | Neutral
Emotion Prompt | Five years ago its return was rejected. | MEAD_M028 | Sad
UniSAE EmoConv-diff ZEST
1.4 Target Emotion: angry
Sample | Text | Speaker | Emotion
Source Speech | I didn't have the reply. | MEAD_M005 | Sad
Speaker Prompt | Head injuries are the leading cause of death. | ESD_0013 | Neutral
Emotion Prompt | Who will perform? | MEAD_M022 | Angry
UniSAE EmoConv-diff ZEST
1.5 Target Emotion: surprised
Sample | Text | Speaker | Emotion
Source Speech | Meetings will also remain private. | RAVDESS_7 | Angry
Speaker Prompt | They had asked not to be named. | ESD_0016 | Neutral
Emotion Prompt | That decision will be made over the next couple of days. | JLcorpus_female2 | Surprised
UniSAE EmoConv-diff ZEST

2. Unseen-S:

Prompts are from the Unseen-S test set: speakers and text are unseen, while the emotions are seen in the training set.
2.1 Target Emotion: neutral
Sample | Text | Speaker | Emotion
Source Speech | I am not concerned with protection of the public. | MEAD_M025 | Sad
Speaker Prompt | We never said that we would walk through the third division. | ESD_0019 | Neutral
Emotion Prompt | Having guidelines in advance is helpful. | RAVDESS_1 | Neutral
UniSAE EmoConv-diff ZEST
2.2 Target Emotion: happy
Sample | Text | Speaker | Emotion
Source Speech | This would provide a significant and welcome boost for local employment. | MEAD_W016 | Neutral
Speaker Prompt | Mr Kennedy is not a candidate for the Scottish Parliament. | RAVDESS_1 | Neutral
Emotion Prompt | I'd rather be in our position than Rangers. | MEAD_M023 | Happy
UniSAE EmoConv-diff ZEST
2.3 Target Emotion: sad
Sample | Text | Speaker | Emotion
Source Speech | They had asked not to be named. | ESD_0017 | Happy
Speaker Prompt | At the time, he was a living legend. | RAVDESS_2 | Neutral
Emotion Prompt | Five years ago its return was rejected. | ESD_0019 | Sad
UniSAE EmoConv-diff ZEST
2.4 Target Emotion: angry
Sample | Text | Speaker | Emotion
Source Speech | There were no proposals on the table. | ESD_0012 | Surprised
Speaker Prompt | I would not count on it. | MEAD_M023 | Neutral
Emotion Prompt | Scrutiny by European Parliament is limited. | RAVDESS_1 | Angry
UniSAE EmoConv-diff ZEST
2.5 Target Emotion: surprised
Sample | Text | Speaker | Emotion
Source Speech | He has a wealth of experience. | ESD_0017 | Neutral
Speaker Prompt | We never said that we would walk through the third division. | MEAD_W011 | Neutral
Emotion Prompt | I think the referee was good. | RAVDESS_1 | Surprised
UniSAE EmoConv-diff ZEST

Word-level Content Editing

ESD speech data used.

For UniSAE, the source speech, speaker prompt, and emotion prompt are the same here.

1. Insert

The nine the eggs I keep → The nine already the eggs I keep
source speech UniSAE VoiceCraft SSR-Speech
Clear than clear water → Clear than clear even water
source speech UniSAE VoiceCraft SSR-Speech

2. Delete

Chapter ten a warm welcome → Chapter a warm welcome
source speech UniSAE VoiceCraft SSR-Speech
Clear than clear water → Clear than water
source speech UniSAE VoiceCraft SSR-Speech

3. Substitute

Let's make the noise a snake → Let's make the things a snake
source speech UniSAE VoiceCraft SSR-Speech
At the end of four → At the places of four
source speech UniSAE VoiceCraft SSR-Speech
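
The three word-level operations above can be recovered mechanically from a source/target transcript pair. The following is a minimal sketch that uses Python's standard `difflib` to label each pair; it is an illustration only and is not part of UniSAE or the baselines. The helper name `word_edits` is hypothetical, and `difflib` reports substitutions under the tag `replace`.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Edit = Tuple[str, List[str], List[str]]


def word_edits(source: str, target: str) -> List[Edit]:
    """Return the non-trivial word-level edits that turn `source` into `target`.

    Each edit is (tag, removed_words, added_words), where tag is one of
    "insert", "delete", or "replace" (difflib's name for a substitution).
    Hypothetical helper for illustration; comparison is case-sensitive.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the edited spans
    ]


# Transcript pairs taken from the demo entries above.
print(word_edits("The nine the eggs I keep", "The nine already the eggs I keep"))
print(word_edits("Chapter ten a warm welcome", "Chapter a warm welcome"))
print(word_edits("At the end of four", "At the places of four"))
```

Whitespace tokenization is a simplification; a real pipeline would likely align at the phoneme or token level before deciding which region of the speech to regenerate.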

Phoneme- and Sub-Phoneme-level Content Editing

ESD speech data used

1. /n/ to /l/: The nine (<n_0>) the eggs I keep → The line (<l_0>) the eggs I keep
source speech (sample 1) UniSAE (sample 1) UniSAE (sample 2) UniSAE (sample 3)
2. Variant: The nine (<n_0>) the eggs I keep → The ine (<n_4>) the eggs I keep
source speech (sample 1) UniSAE (sample 1) UniSAE (sample 2) UniSAE (sample 3)
3. Duration: The nine the eggs I keep → The niiine the eggs I keep
source speech UniSAE
4. /s/ to /sh/: Not much use is it, sam → Not much use is it, sham
source speech (sample 1) UniSAE (sample 1) UniSAE (sample 2) UniSAE (sample 3)
5. /k/ to /p/: The nine the eggs I keep → The nine the eggs I geep
source speech (sample 1) UniSAE (sample 1) UniSAE (sample 2) UniSAE (sample 3)
6. /aa/ to /ey/: Then we all say aha → Then we all say ahey
source speech (sample 1) UniSAE (sample 1) UniSAE (sample 2) UniSAE (sample 3)
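
The phoneme, variant, and duration edits listed above can be pictured as simple operations on a token sequence. The sketch below assumes, purely for illustration, that DPPG tokens are frame-level strings of the form `<phoneme_variant>` (as in `<n_0>` and `<n_4>` above) and that duration is encoded by how many consecutive frames carry a token. The frame layout for "nine", the symbol `ay`, and the helpers `relabel` and `stretch` are hypothetical and not taken from the UniSAE implementation.

```python
import re
from typing import List, Optional

# Assumed token shape: "<phoneme_variant>", e.g. "<n_0>".
TOKEN_RE = re.compile(r"<([a-z]+)_(\d+)>")


def relabel(tokens: List[str], start: int, end: int,
            phoneme: Optional[str] = None,
            variant: Optional[int] = None) -> List[str]:
    """Change phoneme identity and/or variant index for tokens[start:end].

    The number of frames (and hence the duration) is left unchanged.
    """
    out = list(tokens)
    for i in range(start, end):
        match = TOKEN_RE.fullmatch(out[i])
        if match is None:
            raise ValueError(f"unexpected token format: {out[i]!r}")
        old_phoneme, old_variant = match.group(1), int(match.group(2))
        new_phoneme = old_phoneme if phoneme is None else phoneme
        new_variant = old_variant if variant is None else variant
        out[i] = f"<{new_phoneme}_{new_variant}>"
    return out


def stretch(tokens: List[str], start: int, end: int, factor: int) -> List[str]:
    """Lengthen tokens[start:end] by repeating each frame `factor` times."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    repeated = [tok for tok in tokens[start:end] for _ in range(factor)]
    return tokens[:start] + repeated + tokens[end:]


# Hypothetical frame-level tokens for the word "nine".
nine = ["<n_0>", "<n_0>", "<ay_0>", "<ay_0>", "<ay_0>", "<n_0>"]

line = relabel(nine, 0, 2, phoneme="l")    # /n/ to /l/: "nine" -> "line"
weak_n = relabel(nine, 0, 2, variant=4)    # variant edit: <n_0> -> <n_4>
niiine = stretch(nine, 2, 5, factor=3)     # duration edit on the vowel
print(line, weak_n, niiine, sep="\n")
```

Under these assumptions, an edited sequence such as `niiine` would then be passed to the acoustic decoder together with the speaker and emotion prompts.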

Joint Editing

ESD speech data used

1. And be with you Tom → [Happy] And be with yooooou Tom
source speech speaker prompt emotion prompt UniSAE