ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Submitted to Interspeech 2026

Abstract

Speech deepfake detection (SDD) systems perform well on standard benchmark datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.
ProSDD framework diagram
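The core training signal described above is supervised masked prediction over frame-level prosody. As a rough illustration of that idea only — the span length, masking ratio, and L1 loss below are our assumptions, and the paper's actual targets are speaker-conditioned prosodic embeddings rather than raw features — the masking and masked-only loss can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(n_frames, span=5, ratio=0.3):
    """Boolean mask selecting random contiguous spans that cover at most
    `ratio` of the frames (spans may overlap, so coverage can be lower)."""
    mask = np.zeros(n_frames, dtype=bool)
    n_spans = int(n_frames * ratio / span)
    for start in rng.choice(n_frames - span, size=n_spans, replace=False):
        mask[start : start + span] = True
    return mask

def masked_prediction_loss(pred, target, mask):
    """L1 regression loss computed only on the masked frames, so the model
    must reconstruct prosody it cannot see from surrounding context."""
    return np.abs(pred[mask] - target[mask]).mean()
```

In Stage I this loss is driven by real speech only; in Stage II it is optimized jointly with the spoof-classification objective.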

Code Contents

The public repository at github.com/ProSDD/codes contains everything needed to reproduce our results and run ProSDD on new data.

ProSDD Checkpoints

Pretrained ProSDD checkpoints under both training settings: ASVspoof 2019 LA and ASVspoof 2024.

Baseline Checkpoints

We also release pretrained checkpoints for RawNet2, AASIST, and XLSR-SLS trained on ASVspoof 2024, to help the community make better use of 2024-trained models.

Training & Inference Scripts

Full two-stage ProSDD training pipeline and inference scripts.

Supervised Targets

Extracted speaker embeddings (ECAPA-TDNN), used as supervised targets in Stage I and Stage II, are provided for the LibriSpeech, ASVspoof 2019, and ASVspoof 2024 train and dev sets. Because the frame-level prosody files are large, we provide the code used to extract them instead of the files themselves.
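The released extraction code is the authoritative reference for the prosody targets. Purely as an illustration of what frame-level prosody features (pitch, voice activity, energy) look like, here is a minimal numpy sketch — the frame length, hop, VAD threshold, and autocorrelation pitch tracker are all simplifying assumptions, not the paper's pipeline:

```python
import numpy as np

def frame_prosody(wav, sr=16000, frame_len=400, hop=160):
    """Per-frame [log-energy, VAD flag, F0] features.
    F0 is an autocorrelation estimate, set to 0 for unvoiced frames."""
    n_frames = 1 + max(0, len(wav) - frame_len) // hop
    feats = np.zeros((n_frames, 3))
    for i in range(n_frames):
        frame = wav[i * hop : i * hop + frame_len]
        energy = np.sqrt(np.mean(frame ** 2))     # frame RMS energy
        voiced = energy > 0.01                    # crude energy-based VAD
        f0 = 0.0
        if voiced:
            # Autocorrelation peak within a 60-400 Hz search range.
            ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
            lo, hi = sr // 400, sr // 60
            lag = lo + int(np.argmax(ac[lo:hi]))
            f0 = sr / lag
        feats[i] = [np.log(energy + 1e-8), float(voiced), f0]
    return feats
```

For example, on a clean 200 Hz tone every frame is flagged voiced with F0 near 200 Hz.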

Reproducibility Notes

Hyperparameter configurations and training details are provided in the repository to help recreate all results reported in the paper.

Score Files

All score files needed to compute EER for ProSDD under both training settings are available, along with score files for the baselines trained on ASVspoof 2024. We release the 2024-trained baseline scores to make it easier for the community to adopt this recent dataset.
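For recomputing EER from these score files, the metric itself takes only a few lines. This is a generic sketch, not our evaluation code; it assumes higher scores indicate bona fide speech and that scores have already been split by ground-truth label:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: the operating point where the false-acceptance
    rate (spoofs accepted) equals the false-rejection rate (bona fide
    rejected). Assumes higher score means 'more likely bona fide'."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]           # sort by ascending score
    # FRR at each candidate threshold: bona fide trials at or below it.
    frr = np.cumsum(labels) / max(labels.sum(), 1)
    # FAR at each candidate threshold: spoof trials strictly above it.
    far = 1.0 - np.cumsum(1 - labels) / max((1 - labels).sum(), 1)
    i = int(np.argmin(np.abs(frr - far)))         # closest crossing point
    return (frr[i] + far[i]) / 2
```

Perfectly separated scores give an EER of 0; random scores sit near 0.5 (50%).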

Results & Analysis

We report Equal Error Rate (EER ↓) across all benchmarks. ProSDD consistently outperforms all baselines on both standard and emotional/expressive datasets under both training settings.

Dataset names in the column headers are linked to their respective download pages.

(a) Trained on ASVspoof 2019 LA

Model      ASV 2019   ASV 2021   ASV 2024   EmoFake   EmoSpoof
RawNet2        4.60       8.08      40.67     21.71      43.04
AASIST         0.83       8.15      35.53     13.64      31.06
XLSR-SLS       0.56       3.04      25.43      8.84      18.92
ProSDD         0.42       3.87      16.14      3.70       9.54

(b) Trained on ASVspoof 2024

Model      ASV 2019   ASV 2021   ASV 2024   EmoFake   EmoSpoof
RawNet2       24.75      25.59      43.61     49.49      27.13
AASIST        23.16      22.74      25.77     62.71      15.19
XLSR-SLS      27.00      26.54      39.62     58.57      25.92
ProSDD        19.04      18.08       7.38     25.06      11.96

(c) Ablation Study — Trained on ASVspoof 2019 LA

MP = supervised masked prediction objective. "w/o MP" removes masked prediction in both stages. "w/o Stage I" removes real-only prosodic pretraining while retaining MP in Stage II.

Model         ASV 2019   ASV 2021   ASV 2024   EmoFake   EmoSpoof
w/o MP            6.78      25.18      28.12     14.02      10.02
w/o Stage I       5.14       7.83      15.55      6.37      15.02
ProSDD            0.42       3.87      16.14      3.70       9.54

Key Findings