Development of Voice Synthesis Systems with Emphasis on Prosody

A research project on developing a multilingual automatic dubbing system that preserves human-like prosodic characteristics.

Project Overview

This project emerged from an internship in ETIQMEDIA's R&D department, with the goal of creating a multilingual automatic dubbing system that emphasizes natural prosody. Leveraging neural networks for voice synthesis, the project achieved precise synchronization between the dubbed audio and the original video while delivering human-like voice quality.

System Architecture

[Figure: System architecture diagram]

Signal Processing

[Figure: Audio signal with outliers]
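
The project's exact signal-processing chain is not documented here, but as an illustration of handling outliers like those in the figure, below is a minimal sketch using a median filter. It assumes a mono signal and the soundfile/scipy packages; the threshold is arbitrary and not the project's actual method.

```python
# Hypothetical outlier suppression via median filtering; illustrative only,
# not the project's documented signal-processing method.
import numpy as np
import soundfile as sf
from scipy.signal import medfilt

signal, sr = sf.read("source_audio.wav")          # assumes a mono signal
smoothed = medfilt(signal, kernel_size=5)         # local median estimate
residual = np.abs(signal - smoothed)
threshold = 4 * np.std(residual)                  # illustrative cutoff
# Replace samples that deviate strongly from the local median
cleaned = np.where(residual > threshold, smoothed, signal)
sf.write("cleaned_audio.wav", cleaned, sr)
```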

Attention Mechanisms

[Figure: Attention mechanism explanation]
[Figure: Binary attention matrix]
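
To illustrate the idea behind a binary attention matrix, here is a hypothetical sketch that binarizes a soft attention matrix by keeping only the most-attended text token per output frame; the random matrix stands in for attention weights that would, in practice, come from the synthesis model.

```python
# Hypothetical binarization of a soft attention matrix; in practice the
# weights would come from the TTS model's decoder, not torch.randn.
import torch

# (decoder frames, text tokens) soft attention weights
attn = torch.softmax(torch.randn(40, 12), dim=-1)

# Keep only the most-attended token per frame -> binary alignment
binary = torch.zeros_like(attn)
binary.scatter_(1, attn.argmax(dim=-1, keepdim=True), 1.0)

# Frames assigned to each token, usable for duration/synchronization
durations = binary.sum(dim=0)
```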

Key Features

  • Accurate transcription using Whisper (OpenAI)
  • High-quality translations with OPUS MT models
  • Voice synthesis with XTTS v2, ensuring human-like prosody
  • Advanced audio processing and speaker style transfer
  • Modular design for easy maintenance and scalability
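
The following sketch shows how these stages could be chained, assuming the openai-whisper, transformers, and Coqui TTS packages; model names, language pair, and file paths are illustrative, not the project's exact configuration.

```python
# Minimal pipeline sketch: transcribe -> translate -> synthesize.
# Models and paths are illustrative, not the project's actual setup.
import whisper
from transformers import MarianMTModel, MarianTokenizer
from TTS.api import TTS

# 1. Transcription with Whisper (OpenAI)
asr = whisper.load_model("large-v2")
segments = asr.transcribe("source_audio.wav", language="es")["segments"]

# 2. Translation with an OPUS MT model (Spanish -> English here)
mt_name = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt = MarianMTModel.from_pretrained(mt_name)

def translate(text: str) -> str:
    batch = tokenizer([text], return_tensors="pt", padding=True)
    return tokenizer.decode(mt.generate(**batch)[0], skip_special_tokens=True)

# 3. Voice synthesis with XTTS v2, cloning the original speaker's style
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
for i, seg in enumerate(segments):
    tts.tts_to_file(
        text=translate(seg["text"]),
        speaker_wav="source_audio.wav",   # reference audio for style transfer
        language="en",
        file_path=f"dubbed_{i:03d}.wav",
    )
```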

Performance Metrics

  • Whisper transcription word error rate (WER): 9.6% for Spanish, 7.6% for English
  • Translation quality (BLEU): average score of 58
  • Voice naturalness and synchronization validated through subjective listening tests
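
For reference, a minimal sketch of how these metrics can be computed with the common jiwer and sacrebleu packages; the example strings are placeholders, not project data.

```python
# Illustrative WER and BLEU computation; example strings are placeholders.
from jiwer import wer
import sacrebleu

# Word error rate between a reference transcript and the Whisper output
reference = "the system preserves the speaker's prosody"
hypothesis = "the system preserves the speakers prosody"
print(f"WER: {wer(reference, hypothesis):.3f}")

# Corpus-level BLEU between model translations and reference translations
hypotheses = ["the dubbing sounds natural"]
references = [["the dubbing sounds natural"]]  # one reference stream
print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.1f}")
```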

Future Improvements

  • Advanced language models for improved translation and dubbing coherence
  • Implementation of lip synchronization techniques
  • Graphical interfaces for manual dubbing correction and editing

Project Details

Technologies

Python, PyTorch, OpenAI Whisper, Transformers, XTTS v2, CUDA

Evaluation Metrics

BLEU, WER