For Agents
Synthesize natural-sounding speech audio from text or SSML in 50+ languages, including long-form Cloud Storage output and a wide voice catalog.
Get started with Cloud Text-to-Speech API in minutes using your preferred integration method.
# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
"jentic": {
"url": "https://api.jentic.com/mcp",
"auth": "oauth"
}
}
# Then ask your agent:
"synthesize speech from text"
# → Jentic returns the POST /v1/text:synthesize tool with parameter schema, agent executes.

What an agent can do with Cloud Text-to-Speech API.
Synthesize short-form speech audio from plain text or SSML markup
Stream long-form audio synthesis output to a Cloud Storage bucket
List available voices filtered by language and gender
Apply SSML controls for pacing, pitch, pauses, emphasis, and pronunciation
Use for: "I need to convert text to speech audio in MP3 format"; "Synthesize a long audiobook chapter to a Cloud Storage bucket"; "List all available English voices"; "Generate speech with SSML for a custom pronunciation".
Not supported: speech recognition, audio editing, and voice cloning of arbitrary speakers. Use this API only for synthesising speech from text with Google's voice catalog.
Cloud Text-to-Speech synthesises natural-sounding speech from text or SSML using Google's neural network voices, including WaveNet, Neural2, and Studio voice tiers. It supports more than 50 languages and locales, dozens of voice variants per language, and SSML controls for pacing, pitch, pauses, and pronunciation. The synchronous synthesize endpoint returns audio for short text; the synthesizeLongAudio endpoint streams long-form audio (audiobooks, course content) to a Cloud Storage bucket via a long-running operation.
Choose between Standard, WaveNet, Neural2, and Studio voice quality tiers
Track long-running synthesis operations through dedicated operation endpoints
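The SSML controls mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name build_ssml and its default pause, rate, and pitch values are assumptions chosen for the example. The tags used (speak, break, prosody) are standard SSML elements, and the request passes the markup in input.ssml instead of input.text.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(greeting, detail, pause_ms=400, rate="slow", pitch="+2st"):
    """Wrap two sentences in SSML with a pause and a prosody change.

    build_ssml is a hypothetical helper; the defaults are illustrative only.
    """
    return (
        "<speak>"
        f"{escape(greeting)}"                      # escape &, <, > in user text
        f'<break time="{int(pause_ms)}ms"/>'       # explicit pause
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(detail)}"
        "</prosody>"
        "</speak>"
    )


ssml = build_ssml("Welcome back, Maria & co.", "Your balance is $312.")

# SSML goes in input.ssml; plain text would go in input.text instead.
body = {
    "input": {"ssml": ssml},
    "voice": {"languageCode": "en-US", "name": "en-US-Neural2-F"},
    "audioConfig": {"audioEncoding": "MP3"},
}
```

Escaping the caller-supplied text matters because characters such as & or < would otherwise produce invalid markup.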
Patterns agents use Cloud Text-to-Speech API for, with concrete tasks.
IVR and Voice Bot Prompts
Contact centres use Cloud Text-to-Speech to generate dynamic prompts for IVR systems and voice bots. Static greetings are pre-rendered with a Studio voice and cached; per-call dynamic text (caller name, account balance) is synthesised on demand using lower-latency Neural2 voices. Output is returned as base64-encoded LINEAR16 audio that integrates with telephony stacks like Dialogflow CX or Twilio.
POST /v1/text:synthesize with input.text='Welcome back, Maria. Your balance is $312.', voice.languageCode='en-US', voice.name='en-US-Neural2-F', and audioConfig.audioEncoding='LINEAR16'.
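As a rough sketch of the request above, the following Python builds the JSON body and decodes the base64 audioContent field of a response. The helper names are hypothetical, and the response here is a stand-in so the example runs offline; a real call would send the body to POST /v1/text:synthesize with a bearer token.

```python
import base64


def build_synthesize_body(text, language_code, voice_name, audio_encoding="LINEAR16"):
    """Build the JSON body for POST /v1/text:synthesize (hypothetical helper)."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": audio_encoding},
    }


def decode_audio(response_json):
    """Turn the base64 audioContent field into raw audio bytes."""
    return base64.b64decode(response_json["audioContent"])


body = build_synthesize_body(
    "Welcome back, Maria. Your balance is $312.", "en-US", "en-US-Neural2-F"
)

# Stand-in for a real API response, used only to exercise decode_audio.
fake_response = {"audioContent": base64.b64encode(b"fake-pcm-bytes").decode("ascii")}
audio_bytes = decode_audio(fake_response)
```

The decoded bytes can then be handed to whatever telephony layer plays the prompt.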
Audiobook and Long-Form Narration
Publishers convert book chapters and course transcripts into audio using the synthesizeLongAudio endpoint. The endpoint accepts up to 1 million characters per request, runs as a long-running operation, and writes the resulting audio file directly to a Cloud Storage bucket. This avoids the synchronous endpoint's 5,000-character limit and is the production-grade path for any content that exceeds it.
POST /v1/{parent=projects/*/locations/*}:synthesizeLongAudio with the chapter text, parent=projects/PROJECT/locations/us-central1, and outputGcsUri='gs://my-bucket/chapter-1.wav'.
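A minimal sketch of how that long-audio request could be assembled in Python. The function name, the gs:// check, and the example voice are assumptions for illustration; PROJECT is the same placeholder used above, and nothing is sent over the network.

```python
def build_long_audio_request(project, location, text, output_gcs_uri,
                             language_code="en-US", voice_name="en-US-Neural2-F"):
    """Return (path, body) for a synthesizeLongAudio call (hypothetical helper)."""
    if not output_gcs_uri.startswith("gs://"):
        raise ValueError("outputGcsUri must be a Cloud Storage URI (gs://...)")
    path = f"/v1/projects/{project}/locations/{location}:synthesizeLongAudio"
    body = {
        "input": {"text": text},
        # The voice is an assumption; pick one from GET /v1/voices in practice.
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "LINEAR16"},
        "outputGcsUri": output_gcs_uri,
    }
    return path, body


path, body = build_long_audio_request(
    "PROJECT", "us-central1", "Chapter 1. It was a quiet morning...",
    "gs://my-bucket/chapter-1.wav",
)
```

The call returns a long-running operation rather than audio, so the caller polls the operation and then reads the file from the bucket.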
Multilingual Marketing Voice-Overs
Marketing teams generate localised voice-overs for product videos by sending the same script translated into target languages and selecting a matching Neural2 or Studio voice per locale. This produces consistent brand voice characteristics across markets in minutes rather than scheduling voice talent per language.
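Iterate over a list of language-text pairs and POST /v1/text:synthesize for each, with voice.languageCode set per locale and voice.name set to a Neural2 or Studio voice available for that locale (for example <lang>-Studio-A, after confirming the name via GET /v1/voices).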
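The loop described above might look like the following Python sketch. The scripts, the helper name, and the MP3 encoding are illustrative assumptions, and the <lang>-Studio-A naming pattern is taken from the task description as a placeholder; actual voice names differ per locale and should be checked against GET /v1/voices.

```python
def build_localised_bodies(scripts, voice_suffix="Studio-A"):
    """Build one text:synthesize body per (language code, script) pair.

    Hypothetical helper; the voice name pattern is a placeholder, not a
    guarantee that such a voice exists for every locale.
    """
    bodies = []
    for language_code, text in scripts.items():
        bodies.append({
            "input": {"text": text},
            "voice": {
                "languageCode": language_code,
                "name": f"{language_code}-{voice_suffix}",
            },
            "audioConfig": {"audioEncoding": "MP3"},
        })
    return bodies


# Example scripts, already translated per market (made up for this sketch).
scripts = {
    "en-US": "Meet the new dashboard.",
    "fr-FR": "Découvrez le nouveau tableau de bord.",
    "de-DE": "Entdecken Sie das neue Dashboard.",
}
bodies = build_localised_bodies(scripts)
```

Each body is then posted to /v1/text:synthesize in turn, and the returned audio is stored alongside its locale.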
Agent-Generated Spoken Responses via Jentic
An AI agent producing voice replies for a smart-home assistant generates the response text from an LLM, then calls Cloud Text-to-Speech through Jentic to render audio. Jentic isolates the GCP credential, returns the audio bytes, and the agent streams them to the smart speaker.
Through Jentic, search 'synthesize speech from text', load POST /v1/text:synthesize, and execute it with the LLM-generated reply text and a chosen Neural2 voice.
7 endpoints. Cloud Text-to-Speech synthesises natural-sounding speech from text or SSML using Google's neural network voices, including WaveNet, Neural2, and Studio voice tiers.
METHOD   PATH                                 DESCRIPTION
POST     /v1/text:synthesize                  Synthesize speech audio from text or SSML
GET      /v1/voices                           List available voices, optionally filtered by language code
POST     /v1/{+parent}:synthesizeLongAudio    Start a long-running synthesis writing to Cloud Storage
GET      /v1/{+name}/operations               List long-running synthesis operations
POST     /v1/{+name}:cancel                   Cancel an in-flight long-running synthesis
Three things that make agents converge on Jentic-routed access.
Credential isolation
Google service-account JSON is stored encrypted in the Jentic vault. Agents call synthesize operations through Jentic and never hold raw service-account keys.
Intent-based discovery
Agents search 'synthesize speech from text' or 'list available voices' and Jentic returns the matching v1 operation with full input schema (input, voice, audioConfig).
Time to first call
Direct Text-to-Speech integration: 1-2 days for service-account setup, voice catalog exploration, and audio handling. Through Jentic: under 30 minutes.
Alternatives and complements available in the Jentic catalogue.
Cloud Speech-to-Text API
The reverse direction — transcribe spoken audio into text
Use Speech-to-Text for transcription (audio in, text out); use Text-to-Speech for synthesis (text in, audio out). Pair them for full voice-bot loops.
Cloud Translation API
Translate source text before synthesising speech in the target language
Translate first, then call Text-to-Speech with the translated string and a matching languageCode.
Google Dialogflow API
Conversational platform that uses Text-to-Speech for spoken agent responses
Dialogflow CX integrates Text-to-Speech for the response leg of a voice agent; call this API directly when building a custom voice stack.
Specific to using Cloud Text-to-Speech API through Jentic.
What authentication does the Cloud Text-to-Speech API use?
OAuth 2.0 with the cloud-platform scope is required. Most production integrations use a Google Cloud service-account credential and exchange it for a short-lived bearer token. Through Jentic, the service-account JSON is stored encrypted in the vault and Jentic mints scoped tokens per call.
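For a direct (non-Jentic) call, the bearer token ends up in the Authorization header. The sketch below only constructs the request object with Python's standard library and does not send it; the token value is a placeholder, the helper name is hypothetical, and how the token is minted from the service account is out of scope here.

```python
import json
import urllib.request

BASE_URL = "https://texttospeech.googleapis.com"


def build_authorized_request(access_token, body):
    """Prepare (but do not send) a POST /v1/text:synthesize request."""
    return urllib.request.Request(
        BASE_URL + "/v1/text:synthesize",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_authorized_request(
    "ACCESS_TOKEN_PLACEHOLDER",  # a short-lived OAuth 2.0 token in practice
    {
        "input": {"text": "Hello"},
        "voice": {"languageCode": "en-US"},
        "audioConfig": {"audioEncoding": "MP3"},
    },
)
```

Passing the prepared request to urllib.request.urlopen would perform the call; through Jentic the agent never handles this header itself.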
Can I synthesize long-form audio with the Cloud Text-to-Speech API?
Yes. The synchronous /v1/text:synthesize endpoint caps input at 5,000 characters; for longer content use POST /v1/{parent}:synthesizeLongAudio, which accepts up to 1 million characters per request and writes the resulting audio to a Cloud Storage URI you specify. Track progress via the returned long-running operation.
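The choice between the two endpoints can be expressed as a small routing function. This sketch uses the limits quoted in the answer above and counts Python string characters, which is an assumption; the function and constant names are hypothetical.

```python
SYNC_LIMIT = 5_000          # limit quoted above for /v1/text:synthesize
LONG_LIMIT = 1_000_000      # limit quoted above for synthesizeLongAudio


def choose_endpoint(text):
    """Pick the endpoint that fits the input length (hypothetical helper)."""
    length = len(text)
    if length <= SYNC_LIMIT:
        return "/v1/text:synthesize"
    if length <= LONG_LIMIT:
        return "/v1/{parent}:synthesizeLongAudio"
    raise ValueError(f"Input of {length} characters exceeds {LONG_LIMIT}; split it first")


short_route = choose_endpoint("Welcome back, Maria.")
long_route = choose_endpoint("word " * 2_000)   # 10,000 characters
```

Inputs beyond the larger limit need to be split into several requests, for example one per chapter.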
What are the rate limits for the Cloud Text-to-Speech API?
The default per-project quota is 1,000 requests per minute for the synchronous synthesize endpoint, with separate character-per-minute caps that vary by voice tier (Standard voices have higher throughput than Studio). Long-audio synthesis allows a smaller number of concurrent operations. Quota increases are available through the Cloud Console.
How do I generate speech audio through Jentic?
Search Jentic for 'synthesize speech from text', load POST /v1/text:synthesize, and execute it with input.text, voice (languageCode and optional name), and audioConfig.audioEncoding (MP3, LINEAR16, OGG_OPUS). The response includes audioContent as base64. Get started at https://app.jentic.com/sign-up.
What voice tiers and languages are supported?
Standard, WaveNet, Neural2, and Studio voice tiers are available across more than 50 languages. Studio voices offer the highest realism and are recommended for narration; Neural2 balances quality and latency for interactive use; WaveNet is a long-standing high-quality tier; Standard is the cheapest and lowest-latency. List all available voices via GET /v1/voices.
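Once the voice list is fetched, filtering by language and gender is a short loop. The sketch below assumes the response carries a voices array whose entries have name, languageCodes, and ssmlGender fields; the sample data is made up so the example runs offline, and the helper name is hypothetical.

```python
def filter_voices(voices_response, language_code=None, gender=None):
    """Return voice names matching an optional language code and gender."""
    names = []
    for voice in voices_response.get("voices", []):
        if language_code and language_code not in voice.get("languageCodes", []):
            continue
        if gender and voice.get("ssmlGender") != gender:
            continue
        names.append(voice["name"])
    return names


# Made-up sample in the assumed shape of a GET /v1/voices response.
sample_response = {
    "voices": [
        {"name": "en-US-Neural2-F", "languageCodes": ["en-US"], "ssmlGender": "FEMALE"},
        {"name": "en-US-Neural2-D", "languageCodes": ["en-US"], "ssmlGender": "MALE"},
        {"name": "fr-FR-Neural2-A", "languageCodes": ["fr-FR"], "ssmlGender": "FEMALE"},
    ]
}

female_en_us = filter_voices(sample_response, language_code="en-US", gender="FEMALE")
all_french = filter_voices(sample_response, language_code="fr-FR")
```

Filtering client-side like this complements the language-code filter that the endpoint itself accepts.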
Is the Cloud Text-to-Speech API free?
Pricing is per-character with a free tier each month. Standard voices and WaveNet/Neural2 voices have separate free-tier allotments; Studio voices have their own pricing. SSML markup is included in the character count. See cloud.google.com/text-to-speech/pricing for current rates.