Lifecycle of an instance

Offline: When there is no demand for a model, we scale it down to the configured minimum number of instances (0 by default; you can customize this for deployments).
Setting up: When requests start to come in for a model, or the request volume is too high for the model's current scale to handle, we set up an instance (up to a maximum, which you can customize for deployments). This can take a few seconds as we perform setup work like downloading weights.
Active: Once the model instance has finished setting up, it can start processing the queue of requests.
Idle: When there's a gap in requests, the instance will go idle for a few minutes rather than shutting down immediately, so it can stay responsive and avoid needing to set up from scratch every time.
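
As a rough illustration, the four states above can be read as a small state machine. This is only a sketch of the description, not how Replicate implements scaling: the event names (request_arrives, setup_finished, queue_empty, idle_timeout) are hypothetical labels chosen for this example.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "offline"
    SETTING_UP = "setting up"
    ACTIVE = "active"
    IDLE = "idle"


# Hypothetical event names; the transitions mirror the four descriptions above.
TRANSITIONS = {
    (State.OFFLINE, "request_arrives"): State.SETTING_UP,
    (State.SETTING_UP, "setup_finished"): State.ACTIVE,
    (State.ACTIVE, "queue_empty"): State.IDLE,
    (State.IDLE, "request_arrives"): State.ACTIVE,
    (State.IDLE, "idle_timeout"): State.OFFLINE,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def replay(events: list[str], start: State = State.OFFLINE) -> list[State]:
    """Apply events in order and return every state visited, including the start."""
    states = [start]
    for event in events:
        states.append(next_state(states[-1], event))
    return states


if __name__ == "__main__":
    for s in replay(["request_arrives", "setup_finished", "queue_empty", "idle_timeout"]):
        print(s.value)
```

The idle-to-active transition is the one that avoids a new setup phase, which is why an instance stays idle for a while before going offline.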

Public models

When you use a public model on Replicate, you only pay for the time it spends actively processing your requests. Setup and idle time for the model are free.

By default, you share a hardware pool with other customers, meaning your requests enter a shared queue alongside other customer requests. This means you will sometimes encounter cold boots or scaling limits depending on how other customers are using the model.

If you would like more control over how the model is run, you can use a deployment and have your own instances and request queue.

Private models

Unlike public models, most private models (with the exception of fast booting models) run on dedicated hardware and you don't have to share a queue with anyone else. This means you pay for all the time instances of the model are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.
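
The difference between the two billing rules can be sketched as a small function. This is a minimal illustration, assuming you already know how long an instance spent in each state; the InstanceTime type, the billable_seconds function, and the durations are hypothetical and not part of any Replicate API.

```python
from dataclasses import dataclass


@dataclass
class InstanceTime:
    """Seconds spent in each lifecycle state (illustrative values only)."""
    setup_s: float
    active_s: float
    idle_s: float


def billable_seconds(t: InstanceTime, public: bool, fast_booting: bool = False) -> float:
    """Billed seconds under the rules described above."""
    # Public models and fast booting models: only active time
    # spent processing your requests is billed.
    if public or fast_booting:
        return t.active_s
    # Other private models run on dedicated hardware, so all
    # online time is billed: setup, idle, and active.
    return t.setup_s + t.active_s + t.idle_s


if __name__ == "__main__":
    usage = InstanceTime(setup_s=30, active_s=120, idle_s=300)  # made-up durations
    print(billable_seconds(usage, public=True))    # active time only
    print(billable_seconds(usage, public=False))   # setup + active + idle
```

Multiplying the result by the per-second price of the hardware gives the charge; that price depends on the hardware and is not assumed here.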

As with public models, if you would like more control over how a private model is run, you can use a deployment.

Here’s an example of token-based pricing using Meta's Llama 3.1 405B Instruct:

	Text	Token count	Price
Input	Write a limerick about llamas	8	$0.0000760
Output	There once was a llama named Sue,\n Whose favorite color was blue,\n She lived in the Andes,\n With her friends eating candies\n And together they all played kazoo.	43	$0.0004085
Total		51	$0.0004845
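
The arithmetic behind the table can be checked with a few lines of Python. The per-token rate below is not stated anywhere on this page; it is derived from the input row ($0.0000760 for 8 tokens) and should be treated as an assumption for this example only. In this example the output row works out to the same per-token rate.

```python
from decimal import Decimal

# Rate implied by the input row of the table: $0.0000760 / 8 tokens.
# Derived for illustration; not a published price.
PRICE_PER_TOKEN = Decimal("0.0000760") / 8


def token_cost(token_count: int, price_per_token: Decimal = PRICE_PER_TOKEN) -> Decimal:
    """Cost in dollars of `token_count` tokens at a flat per-token rate."""
    return token_count * price_per_token


input_cost = token_cost(8)     # the prompt
output_cost = token_cost(43)   # the generated limerick
total_cost = input_cost + output_cost

if __name__ == "__main__":
    print(f"input:  ${input_cost:.7f}")
    print(f"output: ${output_cost:.7f}")
    print(f"total:  ${total_cost:.7f}")
```

Decimal is used instead of float so that amounts this small are not affected by binary rounding.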

Fast booting models

Sometimes, we're able to optimize how a trained model is run so it boots fast. This works by using a common, shared pool of hardware running the base model. In these cases, we only ever charge you for the time the model is active and processing your requests, regardless of whether it's public or private.

Fast booting versions of models are labeled as such in the model's version list. You can also see which versions support the creation of fast booting models when training.

Deployments

Deployments let you, among other things, control the hardware and scaling parameters of any model. As with private models, we charge for all the time deployment instances are online: the time they spend setting up; the time they spend idle, waiting for requests; and the time they spend active, processing your requests.

In addition to the benefits of having a stable endpoint and graceful rollouts of versions, you might want to use a deployment if, for example:

you want to configure a public model owned by someone else to run on different hardware
you have steady use of a model and want to avoid being impacted by other customers using it
you know your expected request rate and want to avoid cold boots
you have a private model with a consistent, predictable request rate

Note that a well-tuned deployment is usually only marginally more expensive than a public model. You do pay for setup and idle time on deployment instances, but when a deployment is configured correctly, its instances should spend only a fraction as much time setting up or idle as they spend active.
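
One way to make "marginally more expensive" concrete is to compare setup and idle time against active time. The sketch below is an illustration only: the function name and the durations are made up, and it assumes the same per-second hardware price for every state.

```python
def deployment_overhead(setup_s: float, idle_s: float, active_s: float) -> float:
    """Extra billed time (setup + idle) as a fraction of active time."""
    if active_s <= 0:
        raise ValueError("active_s must be positive")
    return (setup_s + idle_s) / active_s


if __name__ == "__main__":
    # Made-up hour of usage: 60 s setting up, 300 s idle, 3600 s active.
    overhead = deployment_overhead(setup_s=60, idle_s=300, active_s=3600)
    print(f"overhead: {overhead:.0%}")
```

A low ratio means the deployment costs close to what active time alone would cost; a high ratio suggests the scaling settings leave instances idle for too long relative to the request rate.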

Failed and canceled runs

For public models, if a run fails, we don't charge you for its time. However, if you cancel a run, we charge you for the time it ran up until that point.

For private models and deployments, failed and canceled runs are billed for the time the instances they ran on were active, as normal.
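
These rules can be summarized in a short function. It is a simplified sketch: the model_kind and status labels are names chosen for this example, and for private models and deployments it treats the run's elapsed time as the active instance time that gets billed (setup and idle time are billed separately and are not modeled here).

```python
def billed_run_seconds(model_kind: str, status: str, elapsed_s: float) -> float:
    """Seconds billed for one run, following the rules described above.

    model_kind: "public", "private", or "deployment" (labels for this sketch).
    status:     "succeeded", "failed", or "canceled".
    elapsed_s:  how long the run was processing before it ended.
    """
    if model_kind not in {"public", "private", "deployment"}:
        raise ValueError(f"unknown model kind: {model_kind}")
    if status not in {"succeeded", "failed", "canceled"}:
        raise ValueError(f"unknown status: {status}")

    # Public models: a failed run is free; a canceled run is billed
    # for the time it ran up until cancellation.
    if model_kind == "public" and status == "failed":
        return 0.0
    # Everything else is billed for the time it ran, as normal.
    return elapsed_s


if __name__ == "__main__":
    print(billed_run_seconds("public", "failed", 12.0))
    print(billed_run_seconds("public", "canceled", 12.0))
    print(billed_run_seconds("deployment", "failed", 12.0))
```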

Hardware

Different models run on different hardware. You'll find the hardware specifications under the "Run time and cost" heading on each model's page. Check out stability-ai/sdxl for an example.

If a model is one you created on Replicate, you can adjust which hardware to use in the model's settings. You can also specify hardware for a deployment.

Billing

At the beginning of each month, we charge you for what you used in the previous month.

The minimum billable unit for an individual run of a public model is 1 second or 1 token.

Sometimes, when your usage exceeds certain thresholds for the first time, or after you change your payment method, we charge you early for some of the month's usage. We do this to help prevent fraudulent use of Replicate.

You can find your current usage and manage your billing settings on your account page.

Free limits

You can try featured models out on Replicate for free, but after a bit you'll be asked to set up billing.

Some features are only available to customers with billing set up.