Cloud Dataproc API

✓ Official Vendor Spec · Analytics · Data Pipelines · oauth2 · 65 Endpoints · REST

For Agents

Provision and operate Hadoop and Spark clusters, submit jobs, and run serverless Spark batches in Google Cloud. Lets agents drive analytics workloads without managing infrastructure by hand.

Quickstart

Get started with Cloud Dataproc API in minutes using your preferred integration method.

# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
  "jentic": {
    "url": "https://api.jentic.com/mcp",
    "auth": "oauth"
  }
}

# Then ask your agent:
"submit a pyspark job to a dataproc cluster"

# → Jentic returns the POST /v1/projects/{projectId}/regions/{region}/jobs:submit tool with its parameter schema; the agent executes it.

Capabilities

What an agent can do with the Cloud Dataproc API.

Provision Dataproc clusters with custom machine types, autoscaling, and initialization actions

Submit Spark, PySpark, Hive, Pig, Presto, and SparkR jobs to a cluster

Run serverless Spark batches without standing up a cluster

Define workflow templates that orchestrate multi-step Dataproc jobs
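As a rough illustration of the first capability, the sketch below builds a request body for the create-cluster endpoint with a custom worker machine type, an autoscaling policy, and one initialization action. The field names follow the Dataproc v1 Cluster resource as we understand it. The project, policy, and script values passed in are hypothetical placeholders, and the machine types and timeout are illustrative assumptions.

```python
def build_cluster_body(project_id, cluster_name, policy_uri, init_script_uri):
    """Build a Cluster resource body for a create-cluster request."""
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # Machine types below are illustrative assumptions.
            "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-4"},
            # "custom-6-23040" names a custom machine type: 6 vCPUs, 23040 MB RAM.
            "workerConfig": {"numInstances": 2, "machineTypeUri": "custom-6-23040"},
            # The policy must already exist; its URI is supplied by the caller.
            "autoscalingConfig": {"policyUri": policy_uri},
            # Scripts run on each node during startup, in list order.
            "initializationActions": [
                {"executableFile": init_script_uri, "executionTimeout": "600s"}
            ],
        },
    }
```

The returned dictionary is JSON-serialisable and would be sent as the body of POST /v1/projects/{projectId}/regions/{region}/clusters.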

GET STARTED

Start building with Cloud Dataproc API API

Explore with Jentic

View OpenAPI Document

Use for: Submit a PySpark job to a Dataproc cluster; provision a Dataproc cluster with autoscaling enabled; run a serverless Spark batch on a Parquet file in Cloud Storage; diagnose a failed Dataproc cluster and download the logs.

Not supported: Running BigQuery SQL, managing non-Dataproc compute, or scheduling cross-service workflows. Use this API only for Dataproc cluster lifecycle, job submission, serverless batches, and workflow templates.

Use Cases

Patterns agents use the Cloud Dataproc API for, with concrete tasks.

★ On-Demand Spark Cluster Workloads

Spin up a Dataproc cluster, submit Spark or PySpark jobs against Cloud Storage data, and tear the cluster down once the job finishes. The API exposes cluster create, job submit, and cluster delete operations that can be chained in a tight loop suitable for ephemeral analytics. Cluster startup typically takes about 90 seconds; deleting the cluster after the job completes keeps cost bounded to job runtime.

Create cluster 'etl-2026-06-10', submit a PySpark job with mainPythonFileUri gs://acme/jobs/etl.py, wait for state DONE, and delete the cluster
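The create, submit, wait, delete loop in this task could be sketched as follows. The `call(method, url, body)` helper is hypothetical: it stands in for whatever authenticated HTTP client is used and is assumed to return the parsed JSON response. The empty `config` (service defaults) and the polling interval are also assumptions.

```python
import time

BASE = "https://dataproc.googleapis.com/v1"
JOB_TERMINAL_STATES = ("DONE", "ERROR", "CANCELLED")


def run_ephemeral_job(call, project, region, cluster, main_py,
                      poll_seconds=10, sleep=time.sleep):
    """Create a cluster, run one PySpark job, and always delete the cluster.

    `call(method, url, body)` is a hypothetical authenticated HTTP helper
    that returns the decoded JSON response as a dict.
    """
    root = f"{BASE}/projects/{project}/regions/{region}"
    # An empty config asks the service for default settings (assumption).
    call("POST", f"{root}/clusters",
         {"projectId": project, "clusterName": cluster, "config": {}})
    try:
        # Wait until the cluster is ready to accept jobs.
        while True:
            state = call("GET", f"{root}/clusters/{cluster}", None)["status"]["state"]
            if state == "RUNNING":
                break
            if state == "ERROR":
                raise RuntimeError(f"cluster {cluster} failed to start")
            sleep(poll_seconds)
        job = call("POST", f"{root}/jobs:submit", {
            "job": {
                "placement": {"clusterName": cluster},
                "pysparkJob": {"mainPythonFileUri": main_py},
            }
        })
        job_id = job["reference"]["jobId"]
        # Poll the job until it reaches a terminal state.
        while job["status"]["state"] not in JOB_TERMINAL_STATES:
            sleep(poll_seconds)
            job = call("GET", f"{root}/jobs/{job_id}", None)
        return job["status"]["state"]
    finally:
        # Runs on success and on failure, so cost stays bounded.
        call("DELETE", f"{root}/clusters/{cluster}", None)
```

Putting the delete in a `finally` block is a design choice: the cluster is removed even when the job fails or polling raises.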

Serverless Spark Batches

Run Spark workloads without managing cluster infrastructure using Dataproc Batches. POST /v1/{+parent}/batches accepts a Spark, PySpark, SparkR, or SparkSQL payload plus runtime config and a Cloud Storage staging bucket. The service provisions transient infrastructure for the run and bills only for the batch's runtime.

POST /v1/{+parent}/batches with a PySpark batch pointing at gs://acme/jobs/score.py and runtimeConfig version 2.2, then poll the batch resource until state is SUCCEEDED
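A minimal sketch of this task, assuming `{+parent}` expands to `projects/{project}/locations/{region}` and that the batch ID is passed as the `batchId` query parameter. The `get_batch` callable is hypothetical: it stands in for an authenticated GET on the batch resource.

```python
import time

BATCH_TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def build_batch_request(project, region, batch_id, main_py, runtime_version="2.2"):
    """Return (url, body) for creating a PySpark batch."""
    parent = f"projects/{project}/locations/{region}"
    url = f"https://dataproc.googleapis.com/v1/{parent}/batches?batchId={batch_id}"
    body = {
        "pysparkBatch": {"mainPythonFileUri": main_py},
        "runtimeConfig": {"version": runtime_version},
    }
    return url, body


def wait_for_batch(get_batch, poll_seconds=15, sleep=time.sleep):
    """Poll until the batch reaches a terminal state, then return it.

    `get_batch()` is a hypothetical callable returning the Batch resource
    as a dict.
    """
    while True:
        batch = get_batch()
        if batch.get("state") in BATCH_TERMINAL_STATES:
            return batch
        sleep(poll_seconds)
```

Callers should check the returned state: only SUCCEEDED means the workload completed, while FAILED and CANCELLED end polling without a result.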

Multi-Step Workflow Templates

Define a Dataproc workflow template that creates a managed cluster, runs a sequence of jobs (with dependencies), and deletes the cluster on completion. Workflow templates support parameterised execution so the same template runs against different inputs. Use the instantiate or instantiateInline endpoints to start a run from CI or an agent.

Instantiate workflow template 'nightly-etl' with parameter input_date=2026-06-09 and watch the resulting Operation until it reaches DONE
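The instantiate call and the Operation check in this task might look like the sketch below. It assumes a regional template name of the form `projects/{project}/regions/{region}/workflowTemplates/{id}` and the standard long-running Operation shape (`done`, plus either `error` or `response`). The function names are illustrative.

```python
def build_instantiate_request(project, region, template_id, parameters):
    """Return (url, body) for instantiating a workflow template."""
    name = f"projects/{project}/regions/{region}/workflowTemplates/{template_id}"
    url = f"https://dataproc.googleapis.com/v1/{name}:instantiate"
    # Parameter values are strings keyed by the template's parameter names.
    return url, {"parameters": dict(parameters)}


def operation_outcome(operation):
    """Classify a long-running Operation dict as running, failed, or done."""
    if not operation.get("done", False):
        return "running"
    return "failed" if "error" in operation else "done"
```

An agent would re-fetch the Operation by its `name` until `operation_outcome` stops returning "running".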

AI Agent Analytics Workload Operator

An AI agent can run Spark workloads on demand through Jentic without operator-written cluster code. Jentic search returns the matching cluster, job, batch, or workflow operation, the agent loads the schema, and Jentic executes against dataproc.googleapis.com using vault-stored credentials. This collapses the multi-day setup of Dataproc OAuth and operation polling into a single agent run.

Use Jentic to search 'submit a pyspark job to dataproc', load the submit schema, and execute it against the named cluster with the provided main file URI

Key Endpoints

65 endpoints. Cloud Dataproc is Google's managed Hadoop and Spark service for running batch, streaming, and interactive analytics workloads.

| METHOD | PATH | DESCRIPTION |
| --- | --- | --- |
| POST | /v1/projects/{projectId}/regions/{region}/clusters | Create a Dataproc cluster |
| POST | /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:diagnose | Diagnose a cluster and produce a diagnostic tarball |
| POST | /v1/projects/{projectId}/regions/{region}/jobs:submit | Submit a job to a cluster |
| POST | /v1/{+parent}/batches | Run a serverless Spark batch |
| POST | /v1/{+parent}/workflowTemplates | Create a workflow template |
| POST | /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:stop | Stop a running cluster |

Why through Jentic?

Three things that make agents converge on Jentic-routed access.

Credential isolation

Google OAuth client secrets and refresh tokens are stored encrypted in the Jentic vault. Agents receive scoped, short-lived access tokens for dataproc.googleapis.com; raw credentials never enter the agent context.

Intent-based discovery

Agents search Jentic by intent (e.g. 'submit a pyspark job to dataproc') and Jentic returns the matching operation with its input schema, so the agent calls the right endpoint without browsing the discovery doc.

Time to first call

Direct Dataproc integration: 2-5 days for OAuth, cluster-config schema work, and operation polling. Through Jentic: under 1 hour.

Related APIs

Alternatives and complements available in the Jentic catalogue.

Alternative

Dataflow API

Dataflow runs Apache Beam pipelines; Dataproc runs Spark and Hadoop. The choice between Beam and Spark decides which one to use.

Choose Dataflow when the workload is Beam or needs streaming with autoscaling. Use Dataproc when the workload is Spark, Hadoop, or Hive.

Complementary

BigQuery API

BigQuery is the typical sink for Dataproc-derived datasets and the source for SparkSQL reads.

Choose BigQuery for SQL analytics on managed storage. Use Dataproc when the workload requires Spark transformations before landing in BigQuery.

Complementary

Cloud Storage API

Cloud Storage holds inputs, outputs, and staging artifacts for Dataproc jobs and batches.

Choose Cloud Storage for object-level operations. Use Dataproc to consume those objects from Spark or Hadoop.

FAQs

Specific to using the Cloud Dataproc API through Jentic.

What authentication does the Cloud Dataproc API use?

The Cloud Dataproc API uses OAuth 2.0 with the cloud-platform scope. Through Jentic the OAuth client and refresh tokens are stored in the Jentic vault and the agent receives short-lived scoped access tokens, so raw Google credentials never enter the agent context.
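For a direct call outside Jentic, the access token travels as a bearer token in the Authorization header. The sketch below only constructs the request with the standard library and sends nothing. How the token is obtained is out of scope here, and the function name is illustrative.

```python
import urllib.request

# Scope requested when obtaining the access token.
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def authorized_request(url, access_token, method="GET"):
    """Build (but do not send) a request carrying an OAuth 2.0 bearer token."""
    req = urllib.request.Request(url, method=method)
    req.add_header("Authorization", f"Bearer {access_token}")
    req.add_header("Accept", "application/json")
    return req
```

Passing the result to `urllib.request.urlopen` would perform the call; with Jentic the agent never handles the token itself.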

Can I run serverless Spark with the Cloud Dataproc API?

Yes. POST /v1/{+parent}/batches submits a PySpark, Spark, SparkR, or SparkSQL workload with runtime and environment configuration; Dataproc provisions transient infrastructure for the duration of the run and tears it down automatically.

What are the rate limits for the Cloud Dataproc API?

Google enforces standard Cloud quotas on dataproc.googleapis.com: per-project rate limits on read/write calls plus quotas on concurrent clusters, jobs, and batches per region. Quotas are visible in the Cloud Console under IAM & Admin > Quotas, filtered to dataproc.googleapis.com.

How do I submit a PySpark job through Jentic?

Search Jentic for 'submit a pyspark job to dataproc', load the schema for POST /v1/projects/{projectId}/regions/{region}/jobs:submit, and execute with placement.clusterName and pysparkJob.mainPythonFileUri set. Jentic returns the Job resource with its job ID for status polling.
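The two fields named above sit inside a `job` wrapper in the request body, and the job ID comes back under `reference.jobId`. The sketch below shows that shape under those assumptions; the helper names are illustrative and are not part of Jentic or the Dataproc client libraries.

```python
def build_pyspark_submit_body(cluster_name, main_python_file_uri, args=()):
    """Build the request body for submitting a PySpark job."""
    job = {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": main_python_file_uri},
    }
    if args:
        # Optional arguments passed through to the driver script.
        job["pysparkJob"]["args"] = list(args)
    return {"job": job}


def job_id_of(job_resource):
    """Extract the job ID used for status polling from a Job resource."""
    return job_resource["reference"]["jobId"]
```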

Is the Cloud Dataproc API free?

API calls are free; clusters, batches, and sessions are billed by underlying Compute Engine vCPU and memory plus a per-vCPU-hour Dataproc premium. Batches and Sessions are billed per runtime second only, with no charge while idle.

How do I diagnose a failing cluster?

Call POST /v1/projects/{projectId}/regions/{region}/clusters/{clusterName}:diagnose to produce a diagnostic tarball in Cloud Storage. The call returns a long-running Operation; once it completes, its response contains the gs:// path of the tarball, which holds master and worker logs plus a YARN dump for offline analysis.
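Reading the tarball path out of the diagnose result could be sketched as below. It assumes the completed Operation carries the path in `response.outputUri`; verify that field name against the API reference before relying on it.

```python
def diagnostic_tarball_uri(operation):
    """Return the gs:// path from a diagnose Operation, or None if still running."""
    if not operation.get("done"):
        return None
    if "error" in operation:
        raise RuntimeError(operation["error"].get("message", "diagnose failed"))
    # Assumed field name for the tarball location in the diagnose response.
    return operation["response"]["outputUri"]
```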