For Agents
Provision and operate Hadoop and Spark clusters, submit jobs, and run serverless Spark batches in Google Cloud. Lets agents drive analytics workloads without managing infrastructure by hand.
Get started with the Cloud Dataproc API in minutes using your preferred integration method.
# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
  "jentic": {
    "url": "https://api.jentic.com/mcp",
    "auth": "oauth"
  }
}
# Then ask your agent:
"submit a pyspark job to a dataproc cluster"
# → Jentic returns the matching job-submission tool with its parameter schema, and the agent executes it.
What an agent can do with the Cloud Dataproc API.
Provision Dataproc clusters with custom machine types, autoscaling, and initialization actions
Submit Spark, PySpark, Hive, Pig, Presto, and SparkR jobs to a cluster
Run serverless Spark batches without standing up a cluster
Define workflow templates that orchestrate multi-step Dataproc jobs
Use for: submitting a PySpark job to a Dataproc cluster; provisioning a Dataproc cluster with autoscaling enabled; running a serverless Spark batch on a Parquet file in Cloud Storage; diagnosing a failed Dataproc cluster and downloading the logs.
Not supported: Does not run BigQuery SQL, manage non-Dataproc compute, or schedule cross-service workflows. Use it for Dataproc cluster lifecycle, job submission, serverless batches, and workflow templates only.
Cloud Dataproc is Google's managed Hadoop and Spark service for running batch, streaming, and interactive analytics workloads. The API exposes 65 endpoints covering clusters, jobs, workflow templates, autoscaling policies, batches (serverless Spark), and sessions (interactive Jupyter). It supports the full cluster lifecycle (create, start, stop, repair) plus job submission for the Spark, PySpark, Hive, Pig, Presto, and SparkR engines.
Diagnose, repair, start, and stop clusters through the API
Manage interactive Jupyter sessions for exploratory analytics
Patterns agents use the Cloud Dataproc API for, with concrete tasks.
On-Demand Spark Cluster Workloads
Spin up a Dataproc cluster, submit Spark or PySpark jobs against Cloud Storage data, and tear the cluster down once the job finishes. The API exposes cluster create, job submit, and cluster delete operations, which chain into a tight loop suitable for ephemeral analytics. Cluster startup typically takes about 90 seconds; deleting the cluster after the job completes keeps cost bounded to job runtime.
Create cluster 'etl-2026-06-10', submit a PySpark job with mainPythonFileUri gs://acme/jobs/etl.py, wait for state DONE, and delete the cluster
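The create, submit, delete loop described above can be sketched as an ordered plan of REST calls. This is a minimal illustration, not a definitive client: it only builds the requests and sends nothing. The project ID `acme`, the region, and the empty cluster `config` are placeholder assumptions, and the job-submit path is written as `jobs:submit`, the form used in Google's REST reference.

```python
from urllib.parse import quote

# Host named in the text; the "/v1" prefix follows the listed endpoint paths.
BASE = "https://dataproc.googleapis.com/v1"


def ephemeral_cluster_plan(project_id, region, cluster_name, main_python_file_uri):
    """Return ordered (method, url, body) tuples for create -> submit -> delete."""
    region_url = f"{BASE}/projects/{quote(project_id)}/regions/{quote(region)}"
    create = (
        "POST",
        f"{region_url}/clusters",
        # Empty config is a placeholder; a real cluster would set machine types etc.
        {"projectId": project_id, "clusterName": cluster_name, "config": {}},
    )
    submit = (
        "POST",
        f"{region_url}/jobs:submit",
        {
            "job": {
                "placement": {"clusterName": cluster_name},
                "pysparkJob": {"mainPythonFileUri": main_python_file_uri},
            }
        },
    )
    # Issue the delete only after the job has reached a terminal state.
    delete = ("DELETE", f"{region_url}/clusters/{quote(cluster_name)}", None)
    return [create, submit, delete]


plan = ephemeral_cluster_plan(
    "acme", "us-central1", "etl-2026-06-10", "gs://acme/jobs/etl.py"
)
```

Keeping the plan as plain data makes it easy for an agent to log or review each call before anything is executed.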
Serverless Spark Batches
Run Spark workloads without managing cluster infrastructure using Dataproc Batches. POST /v1/{+parent}/batches accepts a Spark, PySpark, SparkR, or SparkSQL payload plus runtime config and a Cloud Storage staging bucket. The service provisions transient infrastructure for the run and bills only for the batch's runtime.
POST /v1/{+parent}/batches with a PySpark batch pointing at gs://acme/jobs/score.py and runtimeConfig version 2.2, then poll the batch resource until state is SUCCEEDED
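A batch submission and its polling loop might look like the sketch below. It assumes the batch body uses the `pysparkBatch.mainPythonFileUri` and `runtimeConfig.version` fields and that `SUCCEEDED`, `FAILED`, and `CANCELLED` are the terminal states. The `get_batch` callable is a hypothetical stand-in for whatever performs the authenticated GET on the batch resource.

```python
import time

# Assumed terminal states for a batch resource.
TERMINAL_BATCH_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def pyspark_batch_body(main_python_file_uri, runtime_version):
    """Build the JSON body for a PySpark batch create request."""
    return {
        "pysparkBatch": {"mainPythonFileUri": main_python_file_uri},
        "runtimeConfig": {"version": runtime_version},
    }


def wait_for_batch(get_batch, interval_s=10.0, max_polls=60, sleep=time.sleep):
    """Poll get_batch() until the batch reaches a terminal state.

    get_batch is a hypothetical callable returning the batch resource as a dict.
    Returns the terminal state name, or raises TimeoutError after max_polls.
    """
    for _ in range(max_polls):
        state = get_batch().get("state")
        if state in TERMINAL_BATCH_STATES:
            return state
        sleep(interval_s)
    raise TimeoutError(f"batch not finished after {max_polls} polls")


body = pyspark_batch_body("gs://acme/jobs/score.py", "2.2")
```

Returning the terminal state rather than a boolean lets the caller tell a `FAILED` run from a `CANCELLED` one.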
Multi-Step Workflow Templates
Define a Dataproc workflow template that creates a managed cluster, runs a sequence of jobs (with dependencies), and deletes the cluster on completion. Workflow templates support parameterised execution so the same template runs against different inputs. Use the instantiate or instantiateInline endpoints to start a run from CI or an agent.
Instantiate workflow template 'nightly-etl' with parameter input_date=2026-06-09 and watch the resulting Operation until it reaches DONE
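As a rough sketch of the instantiate step, the request carries the template parameters as a string map and the result is a long-running Operation. The project ID `acme` and the region are placeholder assumptions. The outcome check relies on the Operation's `done` flag, treating a present `error` field as failure.

```python
def instantiate_request(project_id, region, template_id, parameters):
    """Build (method, url, body) for instantiating a workflow template."""
    name = f"projects/{project_id}/regions/{region}/workflowTemplates/{template_id}"
    url = f"https://dataproc.googleapis.com/v1/{name}:instantiate"
    return ("POST", url, {"parameters": dict(parameters)})


def operation_outcome(operation):
    """Classify a long-running Operation dict as running, failed, or succeeded."""
    if not operation.get("done", False):
        return "running"
    return "failed" if "error" in operation else "succeeded"


request = instantiate_request(
    "acme", "us-central1", "nightly-etl", {"input_date": "2026-06-09"}
)
```

An agent would re-fetch the Operation by its `name` and call `operation_outcome` on each response until it stops returning `running`.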
AI Agent Analytics Workload Operator
An AI agent can run Spark workloads on demand through Jentic without operator-written cluster code. Jentic search returns the matching cluster, job, batch, or workflow operation, the agent loads the schema, and Jentic executes against dataproc.googleapis.com using vault-stored credentials. This collapses the multi-day setup of Dataproc OAuth and operation polling into a single agent run.
Use Jentic to search 'submit a pyspark job to dataproc', load the submit schema, and execute it against the named cluster with the provided main file URI
65 endpoints. Cloud Dataproc is Google's managed Hadoop and Spark service for running batch, streaming, and interactive analytics workloads.
METHOD | PATH | DESCRIPTION
POST | /v1/projects/{projectId}/regions/{region}/clusters | Create a Dataproc cluster
POST | /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:diagnose | Diagnose a cluster and produce a diagnostic tarball
POST | /v1/projects/{projectId}/regions/{region}/jobs:submit | Submit a job to a cluster
POST | /v1/{+parent}/batches | Run a serverless Spark batch
POST | /v1/{+parent}/workflowTemplates | Create a workflow template
POST | /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:stop | Stop a running cluster
Three things that make agents converge on Jentic-routed access.
Credential isolation
Google OAuth client secrets and refresh tokens are stored encrypted in the Jentic vault. Agents receive scoped, short-lived access tokens for dataproc.googleapis.com; raw credentials never enter the agent context.
Intent-based discovery
Agents search Jentic by intent (e.g. 'submit a pyspark job to dataproc') and Jentic returns the matching operation with its input schema, so the agent calls the right endpoint without browsing the discovery doc.
Time to first call
Direct Dataproc integration: 2-5 days for OAuth, cluster-config schema work, and operation polling. Through Jentic: under 1 hour.
Alternatives and complements available in the Jentic catalogue.
Dataflow API
Dataflow runs Apache Beam pipelines; Dataproc runs Spark and Hadoop. Beam vs Spark choice drives the pick.
Choose Dataflow when the workload is Beam or needs streaming with autoscaling. Use Dataproc when the workload is Spark, Hadoop, or Hive.
BigQuery API
BigQuery is the typical sink for Dataproc-derived datasets and the source for SparkSQL reads.
Choose BigQuery for SQL analytics on managed storage. Use Dataproc when the workload requires Spark transformations before landing in BigQuery.
Cloud Storage API
Cloud Storage holds inputs, outputs, and staging artifacts for Dataproc jobs and batches.
Choose Cloud Storage for object-level operations. Use Dataproc to consume those objects from Spark or Hadoop.
Specific to using the Cloud Dataproc API through Jentic.
What authentication does the Cloud Dataproc API use?
The Cloud Dataproc API uses OAuth 2.0 with the cloud-platform scope. Through Jentic, the OAuth client and refresh tokens are stored in the Jentic vault and the agent receives short-lived, scoped access tokens, so raw Google credentials never enter the agent context.
Can I run serverless Spark with the Cloud Dataproc API?
Yes. POST /v1/{+parent}/batches submits a PySpark, Spark, SparkR, or SparkSQL workload with runtime and environment configuration; Dataproc provisions transient infrastructure for the duration of the run and tears it down automatically.
What are the rate limits for the Cloud Dataproc API?
Google enforces standard Cloud quotas on dataproc.googleapis.com: per-project rate limits on read/write calls plus quotas on concurrent clusters, jobs, and batches per region. Quotas are visible in the Cloud Console under IAM & Admin > Quotas, filtered to dataproc.googleapis.com.
How do I submit a PySpark job through Jentic?
Search Jentic for 'submit a pyspark job to dataproc', load the schema for POST /v1/projects/{projectId}/regions/{region}/jobs:submit, and execute with placement.clusterName and pysparkJob.mainPythonFileUri set on the job object. Jentic returns the Job resource with its job ID for status polling.
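Status polling on the returned Job resource could be sketched as follows. The sketch assumes the job ID sits at `reference.jobId`, the state at `status.state`, and that `DONE`, `ERROR`, and `CANCELLED` are terminal. The sample job dict and project ID are made up for illustration.

```python
# Assumed terminal states for a Dataproc job.
TERMINAL_JOB_STATES = {"DONE", "ERROR", "CANCELLED"}


def job_status_url(project_id, region, job):
    """Build the GET URL for re-fetching a submitted job by its ID."""
    job_id = job["reference"]["jobId"]
    return (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/jobs/{job_id}"
    )


def is_finished(job):
    """Return True once the job's state is terminal (success or failure)."""
    return job.get("status", {}).get("state") in TERMINAL_JOB_STATES


# Hypothetical Job resource as it might come back from the submit call.
sample_job = {"reference": {"jobId": "job-123"}, "status": {"state": "PENDING"}}
status_url = job_status_url("acme", "us-central1", sample_job)
```

A caller would GET `status_url` at intervals and stop when `is_finished` returns True, then inspect the state to distinguish `DONE` from `ERROR`.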
Is the Cloud Dataproc API free?
API calls are free; clusters, batches, and sessions are billed for the underlying Compute Engine vCPU and memory plus a per-vCPU-hour Dataproc premium. Batches and sessions are billed per second of runtime only, with no charge while idle.
How do I diagnose a failing cluster?
Call POST /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:diagnose to produce a diagnostic tarball in Cloud Storage. The response contains the gs:// path of the tarball, which holds master and worker logs plus a YARN dump for offline analysis.
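To fetch the tarball afterwards, an agent needs the bucket and object name from the returned gs:// path. The sketch below assumes the completed diagnose Operation carries the path in `response.outputUri`; the sample operation and its bucket name are hypothetical.

```python
from urllib.parse import urlsplit


def split_gcs_uri(uri):
    """Split a gs://bucket/object URI into (bucket, object_name)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


def diagnostic_location(operation):
    """Return (bucket, object_name) from a finished diagnose Operation dict."""
    if not operation.get("done", False):
        raise RuntimeError("diagnose operation has not finished yet")
    # Assumed field name for the tarball path in the operation response.
    return split_gcs_uri(operation["response"]["outputUri"])


# Hypothetical completed operation, for illustration only.
sample_operation = {
    "done": True,
    "response": {"outputUri": "gs://example-staging-bucket/diagnostics/cluster.tar.gz"},
}
bucket, object_name = diagnostic_location(sample_operation)
```

The resulting bucket and object name can then be passed to a Cloud Storage download call.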