Text to Speech Using Python

Introduction

Text to Speech (TTS) is a type of assistive technology that reads digital text aloud. Utilizing a combination of natural language processing and digital signal processing, TTS systems convert words from a document or other sources into audible speech. TTS technology is widely used to assist individuals with visual impairments or reading disabilities, improve user engagement, and provide hands-free computing.

In software engineering, TTS can serve multiple purposes across various domains:

TTS Models and Their Set Up

We will introduce three popular TTS models for Python.

gTTS:

gTTS is a Python library and CLI tool to interface with Google Translate’s text-to-speech API.

Price: Free!

Voice Choices: Only the default voice, but most languages are supported.

Available Functions: Only basic text-to-speech conversion.

Set-up

Quickstart
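A minimal sketch of a gTTS quickstart, assuming the library has been installed with `pip install gTTS` and that the machine has network access. The helper name `speak_to_file` and the default file name are illustrative choices, not part of gTTS itself.

```python
def speak_to_file(text, path="hello.mp3", lang="en"):
    """Convert text to speech with gTTS and save the result as an MP3 file."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")

    # Imported here so the module still loads where gTTS is not installed.
    from gtts import gTTS

    tts = gTTS(text=text, lang=lang)  # lang is a language code such as "en" or "fr"
    tts.save(path)                    # requests the audio and writes it to disk
    return path
```

Calling `speak_to_file("Hello, world!")` should leave a `hello.mp3` file in the working directory. The same conversion is available from the command line through the `gtts-cli` tool that ships with the package.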

OpenAI TTS:

OpenAI’s text-to-speech (TTS) technology refers to a suite of artificial intelligence models and tools developed to convert written text into spoken words. This technology is built on advanced machine learning and deep learning principles, making it possible to generate highly realistic and natural-sounding voice outputs. OpenAI’s TTS systems are designed to understand the nuances of language, including intonation, emotion, and context, allowing them to produce speech that closely mimics human-like articulation and expressiveness.

Price: $15.00 / 1M characters

Voice choices: There are 6 voices to choose from (alloy, echo, fable, onyx, nova, and shimmer)

Supported output formats: MP3 (default), Opus, AAC, FLAC, WAV, PCM

Available Functions: Two models are offered: tts-1 (optimized for speed) and tts-1-hd (optimized for quality).

Set-up

API acquisition: An OpenAI API key is required to use the OpenAI TTS module.
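Once a key is available, a call might look like the sketch below. It assumes the official `openai` Python package (`pip install openai`) and that the key is exported as the `OPENAI_API_KEY` environment variable. The function name `synthesize_openai` and the output path are hypothetical.

```python
# The six voices listed above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def synthesize_openai(text, path="speech.mp3", model="tts-1", voice="alloy"):
    """Generate speech with OpenAI TTS and stream it into a file."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")

    # Imported here so the module still loads where openai is not installed.
    from openai import OpenAI

    client = OpenAI()  # reads the API key from OPENAI_API_KEY
    with client.audio.speech.with_streaming_response.create(
        model=model,   # "tts-1" for speed, "tts-1-hd" for quality
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(path)
    return path
```

Switching `model` to `"tts-1-hd"` trades latency for audio quality; everything else in the call stays the same.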

Google Cloud TTS:

Google Cloud Text-to-Speech API is a powerful tool offered by Google Cloud Platform for converting text into natural-sounding speech. It utilizes advanced machine learning techniques to generate high-quality audio output, allowing developers to integrate speech synthesis capabilities into their applications with ease.

Price: Billed by the number of characters. Standard voices cost $4 per 1M characters; other voice types and features cost more. The first 4 million characters each month are free for Standard voices.

Voice choices: Most languages are supported. Only the default voice is included, but you can upgrade to other voices.

Supported output formats: MP3, Linear16, OGG Opus, and a number of other audio formats.

Key features: Custom voices, Long audio synthesis, Text and SSML support, Pitch tuning

Set-up

API acquisition: Before you can begin using Text-to-Speech, you must enable the API in the Google Cloud Platform Console.

Quickstart: Text-to-Speech supports programmatic access. You can access the API in two ways: client libraries and REST.
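For the client-library route, a minimal sketch might look like the following. It assumes the `google-cloud-texttospeech` package is installed and that Application Default Credentials are configured for a project with the API enabled. The function name `synthesize_google` and the output path are illustrative.

```python
def synthesize_google(text, path="output.mp3", language_code="en-US"):
    """Synthesize speech with the Google Cloud client library and save it as MP3."""
    if not text:
        raise ValueError("text must not be empty")

    # Imported here so the module still loads where the library is not installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code,
            ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    # audio_content holds the raw audio bytes in the requested encoding.
    with open(path, "wb") as out:
        out.write(response.audio_content)
    return path
```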

-REST: Google recommends calling this service through the Google-provided client libraries. However, if you need to use your own libraries to call this service, the following information will help you make the API requests.

The service endpoint (base URL) for this API service is https://texttospeech.googleapis.com

A Discovery Document is a machine-readable specification for describing and consuming REST APIs. It is used to build client libraries, IDE plugins, and other tools that interact with Google APIs. The Cloud Text-to-Speech API provides the following Discovery Documents: v1 and v1beta1.

Here is one example, text.synthesize:

POST https://texttospeech.googleapis.com/v1/text:synthesize

Request Body:

{
  "input": {
    object (SynthesisInput)
  },
  "voice": {
    object (VoiceSelectionParams)
  },
  "audioConfig": {
    object (AudioConfig)
  }
}

Response body:

{
  "audioContent": string
}
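The request and response shapes above can be exercised from Python with only the standard library. The sketch below is one way to do it, under two assumptions that are not stated in the text: the caller authenticates with an API key sent in the `X-Goog-Api-Key` header, and MP3 output is requested. The helper names are hypothetical. The `audioContent` string in the response is base64-encoded audio, so it has to be decoded before it is written to disk.

```python
import base64
import json
import urllib.request

ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request_body(text, language_code="en-US", audio_encoding="MP3"):
    """Fill in the SynthesisInput, VoiceSelectionParams, and AudioConfig objects."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": audio_encoding},
    }


def decode_audio(response_body):
    """Turn the base64 'audioContent' string into raw audio bytes."""
    return base64.b64decode(response_body["audioContent"])


def synthesize_rest(text, api_key, path="output.mp3"):
    """POST the request to text:synthesize and save the decoded audio."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request_body(text)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Goog-Api-Key": api_key,  # assumption: API-key authentication
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        audio = decode_audio(json.load(resp))
    with open(path, "wb") as out:
        out.write(audio)
    return path
```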

Comparison Between the Three Models

In summary, the choice between these TTS options depends on the level of customization needed, pricing, ease of integration, and the specific requirements of your project or application. Google Cloud Text-to-Speech API and gTTS are suitable for general-purpose TTS tasks, while OpenAI’s TTS models offer advanced capabilities and natural-sounding speech synthesis at a much higher cost. Both Google Cloud Text-to-Speech API and gTTS have relatively straightforward pricing, and gTTS, being free, may be the simplest to use for basic text-to-speech tasks.
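To make the pricing difference concrete, the list prices quoted earlier can be turned into a rough monthly estimate. This is only a sketch: it uses the figures stated in this article ($15 per 1M characters for OpenAI TTS, $4 per 1M characters for Google Cloud Standard voices with the first 4 million characters free each month, and gTTS free), and actual prices may differ. The function name is illustrative.

```python
PER_MILLION = 1_000_000


def estimate_monthly_cost(characters):
    """Estimate monthly cost in USD for each option, using the prices quoted above."""
    openai_cost = characters / PER_MILLION * 15.00

    # Google Cloud Standard voices: the first 4M characters each month are free.
    billable = max(0, characters - 4 * PER_MILLION)
    google_cost = billable / PER_MILLION * 4.00

    return {
        "gTTS": 0.0,
        "OpenAI TTS": openai_cost,
        "Google Cloud TTS (Standard)": google_cost,
    }
```

For example, `estimate_monthly_cost(10_000_000)` gives 150.0 for OpenAI TTS and 24.0 for Google Cloud Standard voices, while anything up to 4 million characters a month stays free on Google Cloud.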

Reference