For Agents
Transcribe speech audio to text with optional phrase-set bias. Supports short synchronous calls and long-running jobs for multi-minute recordings.
Get started with Cloud Speech-to-Text API in minutes using your preferred integration method.
# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
  "jentic": {
    "url": "https://api.jentic.com/mcp",
    "auth": "oauth"
  }
}
# Then ask your agent:
"transcribe an audio file"
# → Jentic returns the POST /v1/speech:recognize tool with its parameter schema, and the agent executes it.
What an agent can do with the Cloud Speech-to-Text API.
Transcribe a short audio clip synchronously and return the recognised text and word-level confidence
Submit a long-running recognition job for audio stored in Cloud Storage and poll for results
Bias the recogniser toward domain vocabulary by attaching a phrase set or custom class
Manage phrase sets — collections of weighted phrases — at the project level
Manage custom classes that group named entities for reuse across phrase sets
Cancel or inspect long-running speech recognition operations
Use for: transcribing an audio file to text; submitting a long audio file for asynchronous transcription; getting the status of a long-running speech recognition job; creating a phrase set to improve recognition of product names.
Not supported: text-to-speech synthesis, speaker diarization training, or live in-browser microphone capture. Use it only for converting recorded or streamed audio to text.
The Cloud Speech-to-Text API converts spoken audio into text using Google's speech recognition models. It supports synchronous recognition for short clips, long-running recognition for multi-minute audio passed by Cloud Storage URI, and adaptation through phrase sets and custom classes that bias the recogniser toward domain-specific vocabulary. Typical inputs are LINEAR16, FLAC, or OGG_OPUS audio with the language code declared on the request.
Patterns agents use the Cloud Speech-to-Text API for, with concrete tasks.
Call Centre Transcription
Transcribe recorded customer support calls into searchable text for QA review. Calls are uploaded to Cloud Storage and submitted via POST /v1/speech:longrunningrecognize, which returns an operation handle. The job runs asynchronously and the resulting transcript is fetched via the operations endpoint, typically within minutes.
POST /v1/speech:longrunningrecognize with audio.uri=gs://calls/abc.flac and config.languageCode=en-US, then poll GET /v1/operations/{name} until done.
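The submit-then-poll flow above can be sketched in Python as follows. This is a minimal illustration, not a client implementation: `get_operation` is a hypothetical caller-supplied function that performs GET /v1/operations/{name} (directly or through Jentic) and returns the parsed JSON, and the polling interval and attempt limit are arbitrary assumptions.

```python
import time
from typing import Callable


def build_long_running_request(gcs_uri: str, language_code: str) -> dict:
    """Build the JSON body for POST /v1/speech:longrunningrecognize."""
    return {
        "config": {"languageCode": language_code},
        "audio": {"uri": gcs_uri},
    }


def poll_operation(
    get_operation: Callable[[str], dict],  # hypothetical: wraps GET /v1/operations/{name}
    name: str,
    interval_s: float = 5.0,   # assumed polling interval
    max_attempts: int = 60,    # assumed upper bound on polls
) -> dict:
    """Poll an operation until it reports done, then return its response."""
    for _ in range(max_attempts):
        operation = get_operation(name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation.get("response", {})
        time.sleep(interval_s)
    raise TimeoutError(f"operation {name} not done after {max_attempts} polls")
```

Passing the HTTP call in as a function keeps the loop independent of how the request is authenticated, which is the part Jentic is described as handling.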
Voice Note Capture in a Field App
A mobile field service app records short voice notes and sends them to the Speech-to-Text API for synchronous transcription. POST /v1/speech:recognize accepts audio inline as base64 or by Cloud Storage URI and returns the transcript in the same response, suitable for clips up to about a minute long.
POST /v1/speech:recognize with audio.content=<base64 LINEAR16> and config.languageCode=en-GB; return the alternatives[0].transcript.
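As a rough sketch of the synchronous call above, the request body and the transcript extraction might look like this in Python. The 16 kHz sample rate is an assumption about the recording, and `top_transcript` is a hypothetical helper that joins the first alternative of every result, since a response can carry more than one result.

```python
import base64


def build_recognize_request(
    audio_bytes: bytes,
    language_code: str,
    sample_rate_hz: int = 16000,  # assumption: the app records at 16 kHz
) -> dict:
    """Build the JSON body for POST /v1/speech:recognize with inline audio."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language_code,
        },
        # Inline audio is sent as a base64 string in audio.content.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


def top_transcript(response: dict) -> str:
    """Join alternatives[0].transcript from each result into one string."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            pieces.append(alternatives[0].get("transcript", ""))
    return " ".join(p.strip() for p in pieces if p.strip())
```

An empty response, which can happen when no speech is detected, yields an empty string rather than an exception.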
Domain-Adapted Medical Dictation
Improve recognition accuracy for clinicians by creating a phrase set containing common drug names, procedures, and anatomical terms. The phrase set is then referenced in each recognise call's adaptation config so the model is biased toward those terms during decoding.
POST /v1/{+parent}/phraseSets with phrases=[{value:'amoxicillin',boost:15},...], then call recognize with config.adaptation.phraseSetReferences=['projects/p/locations/global/phraseSets/meds'].
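The two request bodies involved can be sketched as below. The field names (`phraseSetId`, `phraseSet`, and `phraseSetReferences` for referencing a stored phrase set by resource name) reflect my understanding of the v1 schema and should be treated as assumptions to check against the schema Jentic returns; the helper names are hypothetical.

```python
from typing import Dict, List


def build_phrase_set_body(phrase_set_id: str, boosts: Dict[str, float]) -> dict:
    """Body for POST /v1/{+parent}/phraseSets (field names assumed from the v1 schema)."""
    return {
        "phraseSetId": phrase_set_id,
        "phraseSet": {
            "phrases": [
                {"value": value, "boost": boost} for value, boost in boosts.items()
            ]
        },
    }


def with_adaptation(config: dict, phrase_set_names: List[str]) -> dict:
    """Return a copy of a recognition config that references stored phrase sets."""
    adapted = dict(config)  # shallow copy so the caller's config is not mutated
    adapted["adaptation"] = {"phraseSetReferences": list(phrase_set_names)}
    return adapted
```

For example, `with_adaptation({"languageCode": "en-US"}, ["projects/p/locations/global/phraseSets/meds"])` produces a config that can be sent with either the synchronous or the long-running request.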
AI Agent Voice-Driven Workflow
An AI agent receives a voice message from a user, transcribes it via the Speech-to-Text API through Jentic, and then routes the transcript to its downstream reasoning step. The agent searches for the recognise operation, loads the schema, and executes — Jentic handles auth so the agent never sees the underlying credentials.
Search Jentic for 'transcribe an audio file', execute POST /v1/speech:recognize with the user's audio content and languageCode='en-US', then pass alternatives[0].transcript to the next reasoning step.
11 endpoints. The Cloud Speech-to-Text API converts spoken audio into text using Google's speech recognition models.
METHOD | PATH | DESCRIPTION
POST | /v1/speech:recognize | Synchronously transcribe a short audio clip
POST | /v1/speech:longrunningrecognize | Submit a long audio file for asynchronous transcription
GET | /v1/operations | List long-running recognition operations
GET | /v1/operations/{+name} | Get the status and result of a long-running recognition job
GET | /v1/{+parent}/phraseSets | List phrase sets in a project location
GET | /v1/{+parent}/customClasses | List custom classes in a project location
Three things that make agents converge on Jentic-routed access.
Credential isolation
Speech-to-Text service-account credentials are stored in the Jentic vault (MAXsystem) and exchanged for scoped, short-lived access tokens on each call. Long-lived JSON keys never enter the agent context.
Intent-based discovery
Agents search Jentic with intents like 'transcribe an audio file' or 'submit a long audio for transcription', and Jentic returns the matching speech.recognize or longrunningrecognize operation with its request schema.
Time to first call
Direct Speech-to-Text integration: 1-2 days to wire OAuth, audio encoding choices, and long-running operation polling. Through Jentic: under 30 minutes to discover, load, and execute.
Alternatives and complements available in the Jentic catalogue.
Deepgram API
Deepgram offers fast, streaming-first speech recognition with on-prem options.
Choose Deepgram when low-latency streaming or self-hosted deployment is the priority. Choose Cloud Speech-to-Text when staying inside the Google Cloud trust boundary matters.
AssemblyAI API
AssemblyAI bundles transcription with summarisation, topic detection, and PII redaction.
Choose AssemblyAI when downstream NLP features (entity detection, summaries) are needed in the same call. Choose Cloud Speech-to-Text for raw transcription with phrase-set bias.
Cloud Text-to-Speech API
Text-to-Speech generates audio from text; Speech-to-Text does the reverse.
Use Text-to-Speech to produce a spoken response after Speech-to-Text has parsed the user's voice input — e.g. in a voice agent loop.
Cloud Translation API
Translate the transcribed text into another language for multilingual workflows.
Use Cloud Translation when the spoken input language differs from the language the downstream system expects.
Specific to using the Cloud Speech-to-Text API through Jentic.
What authentication does the Cloud Speech-to-Text API use?
The Cloud Speech-to-Text API uses OAuth 2.0 with the cloud-platform scope. Through Jentic, OAuth credentials are stored in the Jentic vault (MAXsystem) and exchanged for short-lived access tokens, so service-account JSON keys never enter the agent context.
Can I transcribe long audio files with the Speech-to-Text API?
Yes. POST /v1/speech:longrunningrecognize accepts a Cloud Storage URI and returns an operation that can be polled via GET /v1/operations/{name}. This is the recommended path for any audio longer than about 60 seconds, which is beyond what the synchronous recognise endpoint is suited to.
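One way an agent could apply this rule is to measure the clip and pick the endpoint from its duration. The sketch below assumes WAV input so the standard `wave` module can read the length; the 60-second threshold is taken from the "about 60 seconds" guidance above rather than from a documented hard limit, and both function names are hypothetical.

```python
import io
import wave

SYNC_LIMIT_SECONDS = 60.0  # assumption: mirrors the "about 60 seconds" guidance


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_endpoint(duration_s: float) -> str:
    """Pick the synchronous or long-running path based on clip length."""
    if duration_s <= SYNC_LIMIT_SECONDS:
        return "/v1/speech:recognize"
    return "/v1/speech:longrunningrecognize"
```

Audio routed to the long-running path still has to be uploaded to Cloud Storage first, since that endpoint is described here as taking a URI.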
What are the rate limits for the Cloud Speech-to-Text API?
Default project quotas are 900 requests per 60 seconds and 480 hours of audio processed per day, with stricter limits on long-running submissions. Quotas are visible in the Google Cloud Console under IAM & Admin > Quotas and can be raised on request.
How do I improve recognition of brand or technical terms?
Create a phrase set via POST /v1/{+parent}/phraseSets with the target terms and a boost value, then reference the phrase set in the adaptation field of the recognise request. This biases the recogniser toward the supplied vocabulary without retraining a model.
How do I run transcription through Jentic?
Search Jentic for 'transcribe an audio file', load the speech.recognize or speech.longrunningrecognize schema, and execute. Jentic returns the operation result for sync calls and the long-running operation handle for async jobs, which the agent can poll via the operations endpoint.
Is the Cloud Speech-to-Text API free?
Google offers 60 minutes of free transcription per month, after which usage is billed per 15-second increment, with different rates for standard, video, and medical models. Phrase set storage is free; data adaptation usage is billed at standard rates.
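To make the increment rule concrete, the arithmetic can be sketched as below. Two points are assumptions rather than statements from the pricing terms: that rounding up to 15 seconds is applied per request, and that the free 60 minutes are subtracted from the rounded monthly total. No per-minute rate is included because none is given here.

```python
import math
from typing import Iterable

INCREMENT_SECONDS = 15
FREE_SECONDS_PER_MONTH = 60 * 60  # the 60 free minutes, in seconds


def billed_seconds(duration_s: float) -> int:
    """Round one request's audio length up to the next 15-second increment."""
    return math.ceil(duration_s / INCREMENT_SECONDS) * INCREMENT_SECONDS


def monthly_billable_seconds(durations_s: Iterable[float]) -> int:
    """Total rounded seconds for the month, minus the free allowance."""
    total = sum(billed_seconds(d) for d in durations_s)
    return max(0, total - FREE_SECONDS_PER_MONTH)
```

Under these assumptions a 16-second clip counts as 30 seconds, so many very short requests cost more than the same audio sent in fewer, longer clips.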