- We've optimized how weights are loaded into GPU memory for some of the models we maintain, and we're going to open this up to all custom models soon.
- We're going to be distributing images as individual files rather than as image layers, which makes pulling images much more efficient.
Although our cold boots do suck, the comparison in this blog post is comparing apples to oranges because Fly machines are much lower level than Replicate models. It is more like a warm boot.
It seems to be using a stopped Fly machine, which has already pulled the Docker image onto a node. When it starts, all it's doing is starting the Docker container. Creating the Fly machine or scaling it up would take much longer.
On Replicate, the models auto-scale on a cluster. The model could be running anywhere in our cluster so we have to pull the image to that node when it starts.
Something funny seems to be going on with the latency too. Our round-trip latency is about 200ms for a similar model. Would be curious to see the methodology, or maybe something was broken on our end.
But we do acknowledge the problem. It's going to get better soon.
The warm boot numbers for Replicate are also a bit concerning, though. I know that you're contesting the 800ms latency, and saying that a similar model you tested is 200ms — but that's still 30% slower than Fly (155ms). Even if you fix the cold boot problem, it looks like you're still trailing Fly by quite a bit.
I feel like it would be worth a deep dive with your team on what's happening and maybe writing a blog post on what you found?
Also, I'll gently point out that Fly not having to pull Docker images on "cold" boot isn't something your customers think much about, since a stopped Fly machine doesn't accrue additional cost (other than a few cents a month for rootfs storage). If it's roughly the same price, and roughly the same level of effort, and ends up performing the same function for the customer (inference), whether or not it's doing Docker image pulls behind the scenes doesn't matter so much to most customers. Maybe it's worth adding a pricing tier to Replicate that charges a small amount for storage even for unused models, and results in much better cold boot time for those models since you can skip the Docker image pull — or in the future, model file download — and just attach a storage device?
(I know you're also selling the infinitely autoscaling cluster, but I think for a lot of people the tradeoff between finite-autoscaling vs extremely long cold boot times is not going to be in favor of the long cold boots — so paying a small fee for a block storage tier that can be attached quickly for autoscaling up to N instances would probably make a lot of sense, even if scaling to N+1 instances is slow again and/or requires clicking a button or running a CLI command.)
For what it's worth: creating and stopping/starting Fly Machines is the whole point of the API. If you're on-demand creating new Machines, rather than allocating AOT and then starting/stopping them JIT, you're holding it wrong. :)
(There's a lot I can say about why I think a benchmark like this is showing us unusually well! I'm not trying to argue that people should take this benchmark too seriously.)
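The "allocate ahead of time, start/stop just in time" pattern described above could be sketched roughly as follows. This assumes the Fly Machines REST API exposes `start` and `stop` endpoints for an existing Machine under `https://api.machines.dev/v1`. The `MachinePool` helper, the app name, the machine IDs, and the token are hypothetical, and the HTTP sender is injectable so the sketch can be exercised without network access.

```python
import urllib.request

# Assumed base URL of the Fly Machines API.
API_BASE = "https://api.machines.dev/v1"


def machine_action_request(app, machine_id, action, token):
    """Build a POST request that starts or stops an already-created Machine."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{API_BASE}/apps/{app}/machines/{machine_id}/{action}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


class MachinePool:
    """Hypothetical helper: Machines are created ahead of time, started on demand."""

    def __init__(self, app, machine_ids, token, send=urllib.request.urlopen):
        self.app = app
        self.token = token
        self.idle = list(machine_ids)  # pre-created, currently stopped
        self.busy = []                 # started and serving work
        self.send = send               # injectable so tests need no network

    def acquire(self):
        """Start one of the pre-created Machines and hand back its ID."""
        if not self.idle:
            raise RuntimeError("no stopped Machines left; create more ahead of time")
        machine_id = self.idle.pop()
        self.send(machine_action_request(self.app, machine_id, "start", self.token))
        self.busy.append(machine_id)
        return machine_id

    def release(self, machine_id):
        """Stop a Machine and return it to the idle pool."""
        self.send(machine_action_request(self.app, machine_id, "stop", self.token))
        self.busy.remove(machine_id)
        self.idle.append(machine_id)
```

The pool never creates Machines at request time; running out of idle Machines is treated as a capacity-planning error rather than a trigger for on-demand creation.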
Just as a top-level disclaimer, I'm working at one of the companies in "this" space (serverless GPU compute) so take anything I say with a grain of salt.
This is one of the things we (at https://fal.ai) are working very hard to solve. Because of ML workloads and their multi-GB environments (torch, all those cuda/cudnn libraries, and anything else they pull in), it is a real challenge just to get the container to start in a reasonable time frame. We had to write our own shared Python virtual environment runtime using SquashFS, distributed through a peer-to-peer caching system, to bring it down to the sub-second mark.
After the container boots, there is the aspect of storing model weights, which IMHO is less challenging since they are just big blobs of data (compared to Python environments, where there are thousands of smaller files, each of which might be read sequentially and incur a really major latency penalty). Distributing them once we had the system above was super easy since, just like the SquashFS'd virtual environments, they are immutable data blobs.
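A back-of-the-envelope model of why many small files hurt more than one big blob: each file read sequentially pays a fixed latency penalty on top of the raw transfer time. The function and every number below are illustrative assumptions, not measurements from any of the systems discussed.

```python
GB = 10**9


def estimated_read_seconds(total_bytes, file_count, per_file_latency_s, bandwidth_bytes_per_s):
    """Sequential read: one latency penalty per file plus raw transfer time."""
    return file_count * per_file_latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed: 5 GB of data, 1 ms of latency per file, 1 GB/s of bandwidth.
venv_seconds = estimated_read_seconds(5 * GB, 50_000, 0.001, 1 * GB)  # many small files
blob_seconds = estimated_read_seconds(5 * GB, 1, 0.001, 1 * GB)       # one immutable blob

print(f"environment with many files: {venv_seconds:.1f}s")
print(f"single weights blob:         {blob_seconds:.1f}s")
```

Under these assumptions the per-file latency term dominates for the environment, which is the effect packing it into a single image is meant to remove.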
We are also starting to play with GPUDirect on some of our bare metal clusters and hopefully plan to expose it to our customers, which is especially important if your model is 40GB or larger. At that point, you are technically operating at PCIe/SXM speeds, which is ~2-3 seconds for a model of that size.
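As a quick sanity check on those figures, the sustained bandwidth implied by loading a 40GB model in 2-3 seconds can be worked out directly. This is plain arithmetic on the numbers quoted above; the helper names are made up for illustration.

```python
def implied_bandwidth_gb_per_s(model_size_gb, load_seconds):
    """Sustained bandwidth needed to move the model in the given time."""
    return model_size_gb / load_seconds


def load_seconds(model_size_gb, bandwidth_gb_per_s):
    """Time to move the model at a given sustained bandwidth."""
    return model_size_gb / bandwidth_gb_per_s


fast = implied_bandwidth_gb_per_s(40, 2)  # bandwidth needed for a 2 s load
slow = implied_bandwidth_gb_per_s(40, 3)  # bandwidth needed for a 3 s load
print(f"implied sustained bandwidth: {slow:.1f} to {fast:.1f} GB/s")
```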
GPUDirect is amazing, I wish more of our (Graphistry's) customers wanted it. We've had surprisingly few deploys like that, yet it's one of the only ways we can do 100+ GB/s per-node analytics.
I spent a couple months hacking on a dreambooth product that let users train a model on their own photos and then generate new images w/ presets or their own prompts.
The main costs were:
- gpu time for training
- gpu time for inference
- storage costs for the users' models
- egress fees to download model
I ended up using banana.dev and runpod.io for the serverless gpus. Both were great, easy to hook into, and highly customizable.
I spent a bunch of time trying to optimize download speed, egress fees, gpu spot pricing, gpu location, etc.
R2 is cheaper than s3 - free egress! But the download speeds were MUCH worse than s3 - enough that it ended up not even being competitive.
It was frequently cheaper to use more expensive GPUs w/ better location and network speeds. That factored more into the pricing than how long the actual inference took on each instance.
Likewise, if your most important metric is time from boot to starting inference then network access might be the limiting factor.
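The tradeoff above can be sketched with a simple cost model, assuming the download time is billed at the same rate as inference. All prices, speeds, and sizes below are invented for illustration and do not describe any particular provider.

```python
def cost_per_job(price_per_hour, model_size_gb, download_gb_per_s, inference_s):
    """Cost of one job when download time is billed like inference time (assumption)."""
    billed_seconds = model_size_gb / download_gb_per_s + inference_s
    return price_per_hour / 3600 * billed_seconds


# Hypothetical cheap GPU in a poorly connected location: slow download, slower inference.
cheap_gpu = cost_per_job(price_per_hour=0.50, model_size_gb=4, download_gb_per_s=0.02, inference_s=20)

# Hypothetical pricier GPU with a fast network path to the model storage.
fast_gpu = cost_per_job(price_per_hour=1.50, model_size_gb=4, download_gb_per_s=0.5, inference_s=10)

print(f"cheap GPU, slow network: ${cheap_gpu:.4f} per job")
print(f"fast GPU, fast network:  ${fast_gpu:.4f} per job")
```

With these made-up numbers the GPU that costs three times as much per hour is still cheaper per job, because the download dominates the billed time on the slower instance.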
Replicate has really long boot times for custom models: 2-3 minutes if you are lucky and up to 30 minutes if they are having problems.
While we loved the dev experience, we just couldn't make it work with frequently switching models / LoRA weights.
We switched to beam (https://www.beam.cloud) and it's so much better. Their cold start times are consistently small and they provide a caching layer for model files, i.e. volumes, which makes switching between models a breeze.
Beam also has a much better pricing policy. For custom models on Replicate you pay for boot times (which are very long!), so you are paying a lot of $ for a single request.
With beam you only pay for inference and idle time.
Founder of Replicate here. Our cold boots do suck (see my other comment), but you aren't charged for the boot time on Replicate, just the time that your `setup()` function runs.
Incentives are aligned for us to make it better. :)
Was not aware of that. You should probably change the docs to better explain what you are charged for. Right now it says you do get charged for boot time:
“[…] Unlike public models, you’ll pay for boot and idle time in addition to the time it spends processing your requests.”
Apart from boot times, we actually find replicate to be an amazing platform, congrats
Cold start is super bad on azure machine learning endpoints, at least it was when we tried to use it a few months ago. Even before it gets to the environment loading step. Seems like even these results are better than what we got on AML. So it's impressive imo!
I wrote a review about Replicate last week, and the cog I was using, insanely-fast-whisper, had boot times exceeding 4 minutes. I wish there was more we could observe to find out the cause of the slow startup times. I suspected it was dependencies.
I haven’t had as much fun with inference models as I’ve had since finding out Fly GPU servers start and stop at the drop of a hat.
I can literally boot the server for the 10-20s it takes to run a bunch of generations, and have it shut down automatically afterwards. It feels like magic.
Sure, creating the image after a new deployment takes up to two minutes, but once it’s there it’s incredibly fast.
Is the 100 MB model being downloaded from HuggingFace on Fly too?
I ask this because Fly has immutable Docker containers which wouldn't store any data unless you use Fly Volumes. So it could be that Fly is downloading the 100MB model each time it cold-boots.
If that's the case, a multi-stage Dockerfile could help in bundling the model in, and perhaps reducing cold-boot time even further.
Yes, correct, the 100MB is being downloaded on every boot!
I tested it initially because it’s the naivest implementation. The right implementation would bundle it in.
But I ended up primarily reporting timings that stop counting up as soon as control is handed over to user generated code - since that’s the number you care about the most.
Perhaps a good future idea would be to benchmark between bundling it in the Docker image, vs. using Fly Volumes as Simon suggested in a sibling comment.
Why would multi-stage make any difference? Isn't multi-stage just an implementation detail of the resulting image?
AFAIK only build caching could benefit; pulling the image should be the same with or without a multi-stage build.
I am guessing that it would be possible to download the model from HuggingFace in the build step. I'm not sure though. Plus the Cog image is 14GB in size (mentioned in the article), I hope there's a way to reduce that.
Fly recommend you use one of their mountable volumes for large model files rather than building it into the Docker image - you get faster cold starts that way, plus I think the images have size limits.
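A minimal sketch of the volume approach: download the weights only if they are not already present on the mounted volume, so later cold boots skip the download. The mount path, URL, and `ensure_weights` helper are hypothetical; the same function could also run as a build-time step to bake weights into an image.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def ensure_weights(url, cache_dir, filename):
    """Return the local path to the weights, downloading them only on a cache miss."""
    target = Path(cache_dir) / filename
    if target.exists():
        return target  # cache hit: nothing to download on this boot

    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temp file in the same directory, then rename atomically,
    # so an interrupted boot never leaves a half-written weights file behind.
    fd, tmp_path = tempfile.mkstemp(dir=target.parent)
    os.close(fd)
    try:
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(tmp_path, target)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return target
```

At startup this might be called as `ensure_weights(MODEL_URL, "/data/models", "model.bin")`, where `/data` is assumed to be the volume's mount point and `MODEL_URL` is a placeholder.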
Sure, but I still don't understand why multi-stage should make any difference, because in the end it's just flattened into a layer by the copy between stages.
So the size problem is exactly the same: a RUN with curl produces a layer of the same size as a COPY --from=stage.
Am I missing something?
The only benefit I can see is build cache reuse: the download is independent of building the code, so you won't redownload 14GB when you change code. Is that what you had in mind?
Founder of Modal here. We've spent a ton of time on this, including building our own distributed file system optimized for low-latency, high-throughput workloads. We don't use K8s or Docker and built our own custom infrastructure instead.
Cold starting containers quickly is a fascinating problem. We've come a long way but there's still a lot more to do. For GPU-based inference, starting containers isn't enough: you also need to initialize the model on the GPU quickly. We are working on a long list of things that will bring down cold start latency even further.
Is Modal a good solution for running fine-tuned LLMs and Whisper models? If the cold-start time is low we're more than willing to modify our code to use Modal's infra.
Happy to follow up via email but didn't see one in your profile.
Replicate is also very hard to predict costs on. I've found their salespeople are reluctant to make any predictions since things are changing so quickly. So it might take 5 minutes to cold-boot a model for a 2s prediction run, but it's not clear how much you pay for that run.
Replicate created the cog spec and is a fantastic resource for browsing and playing with new models. They are a social destination too.
But Fly.io is nice and simple for Docker side projects. I hope their cog deployments are as smooth, as there are a few extra pieces involved.
I tend to have a preflight script which is run prior to the final docker command. This lets the layers hold the cached weights and avoids dealing with downloads or changing the codebase to load downloaded weights. It would shave off 10s from both providers.
Here's what we're doing:
- Fine-tuned models now boot fast: https://replicate.com/blog/fine-tune-cold-boots
- You can keep models switched on to avoid cold boots: https://replicate.com/docs/deployments