For Agents
Run predictions on thousands of open-source ML models, train custom versions, and deploy dedicated infrastructure. Supports image, text, audio, and video models with automatic scaling.
Get started with the Replicate API in minutes using your preferred integration method.
# Add to your MCP client config (Claude Desktop, Cursor, Windsurf)
{
  "jentic": {
    "url": "https://api.jentic.com/mcp",
    "auth": "oauth"
  }
}
# Then ask your agent:
"run a prediction on an ML model"
# → Jentic returns the POST /v1/predictions tool with its parameter schema, and the agent executes it.
What an agent can do with the Replicate API.
Run predictions on any public or private model with automatic GPU provisioning
Train custom model versions on your own datasets with configurable hardware
Deploy models to dedicated always-on infrastructure for low-latency production traffic
Browse curated model collections organized by task like text-to-image or speech synthesis
Use for: "I need to run an image generation model on Replicate"; "I want to list available versions of a specific model"; "Get the status of my running prediction"; "Cancel a prediction that is taking too long".
Not supported: model training data storage, vector databases, and real-time streaming inference. Use for batch predictions, model versioning, and deployment management only.
Jentic publishes the only available OpenAPI specification for the Replicate API, keeping it validated and agent-ready. Run open-source ML models in the cloud without managing infrastructure. The specification covers 20 endpoints spanning predictions, model versioning, training, deployments, and collections. Replicate supports thousands of community-contributed models for image generation, language processing, audio synthesis, and video creation, with automatic GPU scaling and pay-per-second billing.
Version and publish models with semantic versioning and hardware requirements
Cancel in-progress predictions and training runs to manage compute costs
Query available hardware options for GPU selection during model deployment
Patterns agents use the Replicate API for, with concrete tasks.
★ AI Agent Model Inference via Jentic
AI agents discover and invoke ML models on Replicate through Jentic's intent-based search. Agents specify what they need (e.g., 'generate an image from text') and Jentic returns matching Replicate operations with input schemas for the specific model version. No SDK setup or model hosting required — agents call POST /v1/predictions with a model version ID and inputs, then poll for results.
Search Jentic for 'run an image generation model', load the POST /v1/predictions schema, and execute with version ID for stable-diffusion and a text prompt input
On-Demand Image Generation
Generate images from text prompts by running predictions against community models like Stable Diffusion, FLUX, and SDXL. Create a prediction via POST /v1/predictions with the model version and prompt, then poll until the output URL is available. Replicate handles GPU provisioning, scaling to zero when idle, and pay-per-second billing so you only pay for actual compute time.
Create a prediction on POST /v1/predictions with a Stable Diffusion XL version ID, input prompt 'a mountain landscape at sunset', and poll GET /v1/predictions/{prediction_id} until status is 'succeeded'
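The create-then-poll flow described above can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the base URL `https://api.replicate.com/v1`, the `"prompt"` input key, and the helper names (`build_prediction_request`, `wait_for_prediction`) are assumptions introduced here, and the version ID is left as a caller-supplied value. Terminal statuses other than 'succeeded' are assumed to be 'failed' and 'canceled'.

```python
import json
import time
import urllib.request

# Assumed base URL; it is not stated in the text above.
API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(token, version_id, prompt):
    """Build (but do not send) the POST /v1/predictions request."""
    # The "prompt" input key is an assumption; each model defines its own inputs.
    body = json.dumps({"version": version_id, "input": {"prompt": prompt}})
    return urllib.request.Request(
        f"{API_BASE}/predictions",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def fetch_json(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_prediction(token, prediction_id, fetch=fetch_json,
                        interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll GET /v1/predictions/{prediction_id} until a terminal status."""
    request = urllib.request.Request(
        f"{API_BASE}/predictions/{prediction_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    for _ in range(max_polls):
        prediction = fetch(request)
        if prediction["status"] in ("succeeded", "failed", "canceled"):
            return prediction
        sleep(interval)
    raise TimeoutError(f"prediction {prediction_id} did not finish in time")
```

A caller would send the built request with `fetch_json`, read the `id` field from the response, and pass it to `wait_for_prediction`. The `fetch` and `sleep` parameters are injectable so the polling loop can be exercised without network access.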
Custom Model Training
Fine-tune open-source models on custom datasets using Replicate's training endpoints. Create a training run via POST /v1/models/{model_owner}/{model_name}/versions/{version_id}/trainings with your training data and hyperparameters. Monitor progress via GET /v1/trainings/{training_id}. The resulting model version can be used for predictions immediately or deployed to dedicated hardware.
Create a training run via POST /v1/models/stability-ai/sdxl/versions/{version_id}/trainings with a dataset URL and 2000 training steps, then poll for completion
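As a rough sketch of the training call above, the helper below assembles the URL and JSON body for a training run. The input keys `input_images` and `max_train_steps` are hypothetical and depend on the model being fine-tuned; the `destination` field (the model that receives the trained version) and the base URL are assumptions not stated in the text, and the version ID is left as a placeholder for the caller to supply.

```python
import json

# Assumed base URL; it is not stated in the text above.
API_BASE = "https://api.replicate.com/v1"


def build_training_request(owner, name, version_id, destination, dataset_url, steps):
    """Return the (url, json_body) pair for a training run without sending it."""
    url = f"{API_BASE}/models/{owner}/{name}/versions/{version_id}/trainings"
    body = {
        # Assumed field: the "owner/name" model that receives the new version.
        "destination": destination,
        "input": {
            "input_images": dataset_url,  # hypothetical input name
            "max_train_steps": steps,     # hypothetical input name
        },
    }
    return url, json.dumps(body)


# VERSION_ID and acme/my-sdxl are placeholders, not real identifiers.
url, body = build_training_request(
    "stability-ai", "sdxl", "VERSION_ID",
    "acme/my-sdxl", "https://example.com/dataset.zip", 2000,
)
```

The returned pair can be sent with any HTTP client using the Bearer token header, after which the training is polled via GET /v1/trainings/{training_id}.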
Production Model Deployment
Deploy models to dedicated always-on infrastructure for consistent low-latency responses via POST /v1/deployments. Unlike on-demand predictions that cold-start from zero, deployments keep models warm on reserved GPUs. Run predictions against deployments via POST /v1/deployments/{deployment_owner}/{deployment_name}/predictions for predictable latency in production applications.
Create a deployment via POST /v1/deployments for a text generation model with min_instances=1, then run a prediction via POST /v1/deployments/{owner}/{name}/predictions
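The deployment task above might look like the following sketch. Only `min_instances` comes from the text; the other body fields (`name`, `model`, `version`, `hardware`, `max_instances`), the hardware SKU `gpu-t4`, the base URL, and the `acme` names are illustrative assumptions that should be checked against the published schema.

```python
# Assumed base URL; it is not stated in the text above.
API_BASE = "https://api.replicate.com/v1"


def build_deployment_body(name, model, version_id, hardware,
                          min_instances=1, max_instances=1):
    """Assemble the JSON body for POST /v1/deployments (assumed field names)."""
    if min_instances < 0 or min_instances > max_instances:
        raise ValueError("need 0 <= min_instances <= max_instances")
    return {
        "name": name,
        "model": model,            # "owner/model-name"
        "version": version_id,
        "hardware": hardware,      # SKU, e.g. one listed by GET /v1/hardware
        "min_instances": min_instances,
        "max_instances": max_instances,
    }


def deployment_prediction_url(owner, name):
    """URL for running a prediction against an existing deployment."""
    return f"{API_BASE}/deployments/{owner}/{name}/predictions"


# Placeholder names; min_instances=1 keeps one instance warm.
body = build_deployment_body("my-llm", "acme/text-model", "VERSION_ID", "gpu-t4")
predict_url = deployment_prediction_url("acme", "my-llm")
```

Keeping `min_instances` at 1 or higher is what avoids the cold start described above; setting it to 0 would let the deployment scale down when idle.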
20 endpoints. Jentic publishes the only available OpenAPI specification for the Replicate API, keeping it validated and agent-ready.
METHOD  PATH                                            DESCRIPTION
POST    /v1/predictions                                 Run a prediction on a model version
GET     /v1/predictions/{prediction_id}                 Get prediction status and output
POST    /v1/predictions/{prediction_id}/cancel          Cancel a running prediction
GET     /v1/models                                      List available models
GET     /v1/models/{model_owner}/{model_name}/versions  List versions of a model
POST    /v1/deployments                                 Create a dedicated model deployment
GET     /v1/collections/{collection_slug}               Get models in a curated collection
GET     /v1/hardware                                    List available hardware options
Three things that make agents converge on Jentic-routed access.
Credential isolation
Replicate Bearer tokens are stored encrypted in the Jentic vault (MAXsystem). Agents receive scoped access tokens — raw API tokens (r8_...) never enter the agent's context or logs.
Intent-based discovery
Agents search by intent (e.g., 'run an image generation model') and Jentic returns matching Replicate operations with input schemas, model version requirements, and hardware options, so the agent can launch predictions without browsing the model catalog.
Time to first call
Direct Replicate integration: 1-2 days for auth, model discovery, async polling, and output handling. Through Jentic: under 1 hour — search for the operation, load schema, execute and poll.
Alternatives and complements available in the Jentic catalogue.
Hugging Face API
Model hub with inference API and broader ecosystem of datasets and spaces
Choose Hugging Face when you need access to the largest model repository, dataset hosting, or prefer their Inference Endpoints for dedicated hosting
Stability AI API
Official Stable Diffusion API with direct vendor support and optimizations
Choose Stability AI when you specifically need Stable Diffusion models with official vendor optimization, upscaling, and inpainting features
OpenAI API
Proprietary LLMs and DALL-E for tasks not covered by open-source models
Use OpenAI alongside Replicate when you need GPT-4o for reasoning tasks or DALL-E 3 for image generation that complements open-source model outputs
Pinecone API
Vector database for storing embeddings generated by Replicate models
Use Pinecone alongside Replicate to store embeddings from open-source embedding models and build retrieval systems
Specific to using the Replicate API through Jentic.
Why is there no official OpenAPI spec for Replicate API?
Replicate does not publish an OpenAPI specification. Jentic generates and maintains this spec so that AI agents and developers can call Replicate API via structured tooling. It is validated against the live API and kept up to date. Get started at https://app.jentic.com/sign-up.
What authentication does the Replicate API use?
The Replicate API uses Bearer token authentication. Pass your API token in the Authorization header as 'Bearer r8_...'. Through Jentic, your Replicate token is stored encrypted in the MAXsystem vault and agents receive scoped access without the raw token entering their context.
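For direct calls (outside Jentic), the header described above could be built as in this sketch. Reading the token from a `REPLICATE_API_TOKEN` environment variable is an assumed convention here, and `redact` is a hypothetical helper that keeps the raw token out of logs.

```python
import os


def auth_headers(token=None):
    """Return request headers carrying the Bearer token."""
    # Fall back to an environment variable (assumed name) if no token is passed.
    token = token or os.environ.get("REPLICATE_API_TOKEN")
    if not token:
        raise ValueError("no Replicate API token provided")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


def redact(token):
    """Show only the 'r8_' style prefix so the secret never reaches logs."""
    return token[:3] + "..."


headers = auth_headers("r8_example_token")
safe_to_log = redact("r8_example_token")
```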
Can I run any open-source model on Replicate?
Yes. Replicate hosts thousands of community-contributed models accessible via POST /v1/predictions. Specify the model version ID and input parameters. Popular models include Stable Diffusion XL, FLUX, LLaMA, and Whisper. You can also push your own models using Cog packaging and run them through the same predictions API.
What are the rate limits for the Replicate API?
Replicate does not enforce strict per-minute rate limits. Instead, concurrency is limited by your plan: free accounts get 1 concurrent prediction, paid accounts scale based on GPU availability. The API returns 429 status codes if you exceed concurrent prediction limits. Deployment endpoints have separate concurrency based on configured instances.
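One way a client might react to the 429 responses mentioned above is to retry with exponential backoff. The sketch below is an illustration only: `call_with_backoff`, its defaults, and the doubling delay schedule are assumptions, not behaviour prescribed by Replicate or Jentic.

```python
import time
import urllib.error


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on HTTP 429, wait base_delay * 2**attempt and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return send()
        except urllib.error.HTTPError as error:
            # Re-raise anything that is not a 429, or a 429 on the last attempt.
            if error.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Here `send` is any zero-argument callable that performs the request with `urllib.request.urlopen`, which raises `HTTPError` for non-2xx responses. The `sleep` parameter is injectable so the retry schedule can be checked without waiting.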
How do I run a prediction on Replicate through Jentic?
Search Jentic for 'run a model prediction on Replicate' to discover the POST /v1/predictions operation. The schema requires a version ID (model version hash) and an input object matching the model's schema. Execute through Jentic's SDK (pip install jentic) and poll the returned prediction URL until status shows 'succeeded'. The output field contains your results.
What is the difference between predictions and deployments?
POST /v1/predictions runs inference on shared, auto-scaling infrastructure that cold-starts from zero — ideal for variable traffic and cost efficiency. POST /v1/deployments creates dedicated always-on GPU instances that stay warm — ideal for production workloads needing consistent sub-second latency. Deployments cost more but eliminate cold-start delays.