Cloud Text-to-Speech API

Name: Cloud Text-to-Speech API
Brand: Cloud Text-to-Speech API

✓ Official Vendor Spec | AI/ML | Speech | OAuth 2.0 | 7 Endpoints | REST

For Agents

Synthesize natural-sounding speech audio from text or SSML in 50+ languages, with long-form output to Cloud Storage and a wide voice catalog.

Quickstart

Get started with Cloud Text-to-Speech API in minutes using your preferred integration method.

# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
  "jentic": {
    "url": "https://api.jentic.com/mcp",
    "auth": "oauth"
  }
}

# Then ask your agent:
"synthesize speech from text"

# → Jentic returns the POST /v1/text:synthesize tool with its parameter schema, and the agent executes it.

Capabilities

What an agent can do with the Cloud Text-to-Speech API.

Synthesize short-form speech audio from plain text or SSML markup

Write long-form synthesized audio to a Cloud Storage bucket via a long-running operation

List available voices, optionally filtered by language code, with each voice's gender and natural sample rate

Apply SSML controls for pacing, pitch, pauses, emphasis, and pronunciation


Use Cases

Patterns agents use the Cloud Text-to-Speech API for, with concrete tasks.

★ IVR and Voice Bot Prompts

Contact centres use Cloud Text-to-Speech to generate dynamic prompts for IVR systems and voice bots. Static greetings are pre-rendered with a Studio voice and cached; per-call dynamic text (caller name, account balance) is synthesised on demand using lower-latency Neural2 voices. Output is returned as base64-encoded LINEAR16 audio, which can be fed into telephony stacks such as Dialogflow CX or Twilio.

POST /v1/text:synthesize with input.text='Welcome back, Maria. Your balance is $312.', voice.languageCode='en-US', voice.name='en-US-Neural2-F', and audioConfig.audioEncoding='LINEAR16'.
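The task above maps onto a single JSON request. A minimal Python sketch using only the standard library, assuming you already hold an OAuth 2.0 access token with the cloud-platform scope (obtaining it is out of scope here); `build_synthesize_body` and `synthesize` are illustrative helper names, not part of any Google SDK:

```python
import base64
import json
import urllib.request

# Public REST endpoint for synchronous synthesis (v1).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_body(text: str, language_code: str, voice_name: str,
                          encoding: str = "LINEAR16") -> dict:
    """Build the JSON body for POST /v1/text:synthesize."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": encoding},
    }


def synthesize(body: dict, access_token: str) -> bytes:
    """Send the request and return the decoded audio bytes.

    Assumes `access_token` is a valid bearer token with the
    cloud-platform scope, obtained elsewhere.
    """
    request = urllib.request.Request(
        SYNTHESIZE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # audioContent arrives base64-encoded; decode it to raw audio bytes.
    return base64.b64decode(payload["audioContent"])
```

For example, `synthesize(build_synthesize_body("Welcome back, Maria. Your balance is $312.", "en-US", "en-US-Neural2-F"), token)` returns the audio bytes. LINEAR16 responses should include a WAV header, so the bytes can usually be written straight to a `.wav` file.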

Audiobook and Long-Form Narration

Publishers convert book chapters and course transcripts into audio using the synthesizeLongAudio endpoint. The endpoint accepts up to 1 million characters per request, runs as a long-running operation, and writes the resulting audio file directly to a Cloud Storage bucket. This avoids the synchronous endpoint's 5,000-character limit and is the recommended path for any content that exceeds it.

POST /v1/{parent=projects/*/locations/*}:synthesizeLongAudio with the chapter text, parent=projects/PROJECT/locations/us-central1, and outputGcsUri='gs://my-bucket/chapter-1.wav'.
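Compared with the synchronous call, the parent resource moves into the URL path and the destination goes in `outputGcsUri`. A sketch that builds the URL and body, assuming LINEAR16 output to match the `.wav` object name; `build_long_audio_request` is a hypothetical helper and the voice is whatever you choose for the project:

```python
API_ROOT = "https://texttospeech.googleapis.com/v1"


def build_long_audio_request(project: str, location: str, text: str,
                             output_gcs_uri: str, language_code: str,
                             voice_name: str) -> tuple[str, dict]:
    """Return (url, body) for POST /v1/{parent}:synthesizeLongAudio."""
    parent = f"projects/{project}/locations/{location}"
    url = f"{API_ROOT}/{parent}:synthesizeLongAudio"
    body = {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        # Assumption: LINEAR16 output, matching the .wav object name.
        "audioConfig": {"audioEncoding": "LINEAR16"},
        # Cloud Storage object the service writes the finished audio to.
        "outputGcsUri": output_gcs_uri,
    }
    return url, body
```

POST the body to the URL with a bearer token; the response is a long-running operation whose `name` you poll until `done` is true.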

Multilingual Marketing Voice-Overs

Marketing teams generate localised voice-overs for product videos by sending the same script translated into target languages and selecting a matching Neural2 or Studio voice per locale. This produces consistent brand voice characteristics across markets in minutes rather than scheduling voice talent per language.

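Iterate over a list of language-text pairs and POST /v1/text:synthesize for each, with voice.languageCode set per locale and voice.name set to a Studio voice available in that locale (check GET /v1/voices first, since Studio voices are not offered for every language).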
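The loop can be sketched as below. Because Studio voices are not offered in every locale, this version picks a voice from a `GET /v1/voices` response instead of hard-coding names, preferring Studio, then Neural2, then any voice for the locale. The helper names, the MP3 encoding, and the tier-matching rule (looking for `-Studio-` or `-Neural2-` in the voice name) are assumptions for illustration:

```python
from typing import Optional


def pick_voice(voices: list[dict], language_code: str,
               preferred_tiers: tuple[str, ...] = ("Studio", "Neural2")) -> Optional[str]:
    """Return a voice name for `language_code`, preferring the listed tiers."""
    # Keep only voices that list this locale in their languageCodes field.
    candidates = sorted(
        v["name"] for v in voices if language_code in v.get("languageCodes", [])
    )
    for tier in preferred_tiers:
        for name in candidates:
            if f"-{tier}-" in name:
                return name
    # Fall back to any voice for the locale, or None if there is none.
    return candidates[0] if candidates else None


def build_localised_requests(scripts: dict[str, str], voices: list[dict]) -> list[dict]:
    """Build one /v1/text:synthesize body per (languageCode, text) pair."""
    bodies = []
    for language_code, text in scripts.items():
        voice_name = pick_voice(voices, language_code)
        if voice_name is None:
            continue  # No voice listed for this locale; skip rather than fail.
        bodies.append({
            "input": {"text": text},
            "voice": {"languageCode": language_code, "name": voice_name},
            "audioConfig": {"audioEncoding": "MP3"},  # assumption: MP3 for video editing
        })
    return bodies
```

`voices` is the list under the `voices` key of the `GET /v1/voices` response; each resulting body is then POSTed to `/v1/text:synthesize`.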

Agent-Generated Spoken Responses via Jentic

An AI agent producing voice replies for a smart-home assistant generates the response text from an LLM, then calls Cloud Text-to-Speech through Jentic to render audio. Jentic isolates the GCP credential, returns the audio bytes, and the agent streams them to the smart speaker.

Through Jentic, search 'synthesize speech from text', load POST /v1/text:synthesize, and execute it with the LLM-generated reply text and a chosen Neural2 voice.

Key Endpoints

7 endpoints. Cloud Text-to-Speech synthesizes natural-sounding speech from text or SSML using Google's neural network voices, including the WaveNet, Neural2, and Studio voice tiers.

| Method | Path | Description |
|---|---|---|
| POST | /v1/text:synthesize | Synthesize speech audio from text or SSML |
| GET | /v1/voices | List available voices, optionally filtered by language code |
| POST | /v1/{+parent}:synthesizeLongAudio | Start a long-running synthesis writing to Cloud Storage |
| GET | /v1/{+name}/operations | List long-running synthesis operations |
| POST | /v1/{+name}:cancel | Cancel an in-flight long-running synthesis |

Why use Jentic?

Three reasons agents route Text-to-Speech access through Jentic.

Credential isolation

Google service-account JSON is stored encrypted in the Jentic vault. Agents call synthesize operations through Jentic and never hold raw service-account keys.

Intent-based discovery

Agents search 'synthesize speech from text' or 'list available voices' and Jentic returns the matching v1 operation with full input schema (input, voice, audioConfig).

Time to first call

Direct Text-to-Speech integration: 1-2 days for service-account setup, voice catalog exploration, and audio handling. Through Jentic: under 30 minutes.

Related APIs

Alternatives and complements available in the Jentic catalogue.

Complementary

Cloud Speech-to-Text API

The reverse direction — transcribe spoken audio into text

Use Speech-to-Text for transcription (audio in, text out); use Text-to-Speech for synthesis (text in, audio out). Pair them for full voice-bot loops.

Complementary

Cloud Translation API

Translate source text before synthesising speech in the target language

Translate first, then call Text-to-Speech with the translated string and a matching languageCode.

Complementary

Google Dialogflow API

Conversational platform that uses Text-to-Speech for spoken agent responses

Dialogflow CX integrates Text-to-Speech for the response leg of a voice agent; call this API directly when building a custom voice stack.

FAQs

Specific to using the Cloud Text-to-Speech API through Jentic.

What authentication does the Cloud Text-to-Speech API use?

OAuth 2.0 with the cloud-platform scope is required. Most production integrations use a Google Cloud service-account credential and exchange it for a short-lived bearer token. Through Jentic, the service-account JSON is stored encrypted in the vault and Jentic mints scoped tokens per call.
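For direct calls outside Jentic, one common way to mint the bearer token is the `google-auth` Python library with Application Default Credentials (for example, a service-account key file referenced by `GOOGLE_APPLICATION_CREDENTIALS`). A minimal sketch, assuming `google-auth` and `requests` are installed; `get_access_token` and `auth_headers` are illustrative helper names:

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def get_access_token() -> str:
    """Exchange Application Default Credentials for a short-lived bearer token."""
    # Imported lazily so the rest of the module works without google-auth installed.
    import google.auth
    import google.auth.transport.requests

    credentials, _project_id = google.auth.default(scopes=[CLOUD_PLATFORM_SCOPE])
    # Refreshing performs the token exchange and populates credentials.token.
    credentials.refresh(google.auth.transport.requests.Request())
    return credentials.token


def auth_headers(token: str) -> dict[str, str]:
    """HTTP headers for a JSON request authorised with a bearer token."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json; charset=utf-8",
    }
```

Tokens are short-lived, so a long-running client should fetch a fresh one rather than caching it indefinitely.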

Can I synthesize long-form audio with the Cloud Text-to-Speech API?

Yes. The synchronous /v1/text:synthesize endpoint caps input at 5,000 characters; for longer content use POST /v1/{parent}:synthesizeLongAudio, which accepts up to 1 million characters per request and writes the resulting audio to a Cloud Storage URI you specify. Track progress via the returned long-running operation.
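Tracking the operation amounts to polling it until `done` is true. A sketch, assuming the operation `name` returned by `synthesizeLongAudio` can be fetched with `GET /v1/{name}` (the operations get method on the same API) and that the access token is obtained elsewhere; `fetch_operation` and `wait_for_operation` are hypothetical helpers, and the poll interval and timeout are arbitrary defaults:

```python
import json
import time
import urllib.request
from typing import Callable

API_ROOT = "https://texttospeech.googleapis.com/v1"


def fetch_operation(name: str, access_token: str) -> dict:
    """GET /v1/{name} for a long-running operation."""
    request = urllib.request.Request(
        f"{API_ROOT}/{name}",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_operation(fetch: Callable[[], dict], poll_seconds: float = 10.0,
                       timeout_seconds: float = 3600.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the operation reports done; raise on error or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        operation = fetch()
        if operation.get("done"):
            # A finished operation carries either "error" or "response".
            if "error" in operation:
                raise RuntimeError(f"synthesis failed: {operation['error']}")
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError("long-audio synthesis did not finish in time")
        sleep(poll_seconds)
```

Usage: `wait_for_operation(lambda: fetch_operation(op_name, token))`. Passing `fetch` and `sleep` as callables keeps the polling logic testable without network access.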

What are the rate limits for the Cloud Text-to-Speech API?

The default per-project quota is 1,000 requests per minute for the synchronous synthesize endpoint, with separate character-per-minute caps that vary by voice tier (Standard voices have higher throughput than Studio). Long-audio synthesis allows a smaller number of concurrent operations. Quota increases are available through the Cloud Console.
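When a quota is exceeded, Google APIs respond with HTTP 429, so a client calling the API directly usually retries with backoff. A generic sketch with exponential backoff and full jitter; treating 429 and 503 as retryable, and the attempt count and base delay, are assumptions rather than documented Text-to-Speech guidance:

```python
import random
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: quota (429) and transient unavailability (503) are worth retrying.
RETRYABLE_STATUS = {429, 503}


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying retryable HTTP errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except urllib.error.HTTPError as err:
            # Give up on non-retryable statuses or after the final attempt.
            if err.code not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Jitter spreads retries from many workers over time, which helps avoid hitting the per-minute quota again in lockstep.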

How do I generate speech audio through Jentic?

Search Jentic for 'synthesize speech from text', load POST /v1/text:synthesize, and execute it with input.text, voice (languageCode and optional name), and audioConfig.audioEncoding (for example MP3, LINEAR16, or OGG_OPUS). The response includes audioContent as a base64-encoded string. Get started at https://app.jentic.com/sign-up.

What voice tiers and languages are supported?

Standard, WaveNet, Neural2, and Studio voice tiers are available across more than 50 languages. Studio voices offer the highest realism and are recommended for narration; Neural2 balances quality and latency for interactive use; WaveNet is a long-standing high-quality tier; Standard is the cheapest and lowest-latency. List all available voices via GET /v1/voices.
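To see which tiers a locale actually offers, you can group the `GET /v1/voices` response by tier. Voice names generally follow a `<language>-<region>-<tier>-<variant>` pattern (e.g. `en-US-Neural2-F`; WaveNet voices are spelled `Wavenet` in names), so the sketch below reads the third hyphen-separated token. That parsing rule is a heuristic, not a documented contract, and newer voice families may not fit it:

```python
from collections import defaultdict


def tier_of(voice_name: str) -> str:
    """Heuristic tier lookup, e.g. 'en-US-Neural2-F' -> 'Neural2'."""
    parts = voice_name.split("-")
    return parts[2] if len(parts) >= 4 else "Unknown"


def voices_by_tier(voices_response: dict) -> dict[str, list[str]]:
    """Group voice names from a GET /v1/voices response by heuristic tier."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for voice in voices_response.get("voices", []):
        grouped[tier_of(voice["name"])].append(voice["name"])
    return {tier: sorted(names) for tier, names in grouped.items()}
```

Calling it on a response fetched with `?languageCode=en-US` gives a quick view of which Standard, Wavenet, Neural2, or Studio voices exist for that locale before you pick one.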

Is the Cloud Text-to-Speech API free?

Pricing is per-character with a free tier each month. Standard voices and WaveNet/Neural2 voices have separate free-tier allotments; Studio voices have their own pricing. SSML markup is included in the character count. See cloud.google.com/text-to-speech/pricing for current rates.