Show HN: GPU-Accelerated Inference Hosting (opens in new tab)

(inferrd.com)

56 pointstheo314y ago34 comments

34 comments

This is nice, and I've wanted this kind of thing repeatedly over the last 5 years! I think you often want to run little bits of CPU-based code in addition to your deep learning graph. So I think a better deployment model might be basically Lambda but with CUDA access... or something like that.

The factors that I think would make this service most valuable are low cost (think, lower than GPU's on AWS or similar, even at scale), high burst capability from cold start (1000QPS is a good target), and of course low cold start delays (< 1s, or .5s).

This led me down a rabbit hole in years past and the technical solution seems to be generally, the ability to swap models in and out of GPU ram very quickly. Possibly using NVIDIA's unified memory subsystem.

theo31OP4y ago

Thank you!

We don't have any cold start delay! In our custom environment, you can do exactly what you are describing (running both CPU and GPU code). We provide you with access to the GPU and the CUDA libraries installed. It's basically lambda (minus the cold start) with GPU access.

We can scale a lot very quickly depending on how much you need.

_hl_4y ago

That's impressive!

Are you willing to talk a bit about how this all works? I assume you host the hardware yourself somewhere, which in the days of AWS et al must be pretty tough to pull off, especially with these specs. Where do you get the hardware from these days with the crypto craze?

1 more reply

37ef_ced34y ago

Or, do your inference using an AVX-512 CPU:

https://NN-512.com (open source, free software, no dependencies)

With batch size 1, NN-512 is easily 2x faster than TensorFlow and does 27 ResNet50 inferences per second on a c5.xlarge instance. For more unusual networks, like DenseNet or ResNeXt, the performance gap is wider.

Even if you allow TensorFlow to use a larger ResNet50 batch size, NN-512 is easily 1.3x faster.

If you need a few dozen inferences per second per server, this is the cheapest way. And you're not depending on a proprietary solution whose parent company could go out of business in a year.

If you need Transformers instead of convolutions, Fabrice Bellard's LibNC is a good solution: https://bellard.org/libnc/

theo31OP4y ago

Oh that's very interesting, how ready for production is it? It only works for TF right?

> If you need a few dozen inferences per second per server, this is the cheapest way. And you're not depending on a proprietary solution whose parent company could go out of business in a year.

Definitely the cheapest way.

We've been in business for more than a year already actually :)

37ef_ced34y ago

NN-512 has no connection to TensorFlow. It is an open source Go program (with no dependencies) that generates C code (with no dependencies). And it's fully ready for production. Similarly, LibNC is stand-alone, and Fabrice Bellard (author of FFmpeg, QEMU, etc.) will release the source to anyone who asks for it.

I'm giving performance comparisons versus TensorFlow, which I consider to be a standard tool.

People who use your proprietary, closed, black-box service are dependent on the well-being of your business. You could vanish tomorrow.

ackbar034y ago

So is this mainly focused on deployment for applications with high-speed inference requirements? I didn't dive into product in detail. I run my own deep-learning based web-app and inference speed optimization is pretty non-trivial. As far as I know production level speed requirements require use of tensorrt which is definitely not hot-start and requires more than a few minutes to load (i'm not too sure what's going on under the hood, not an expert) but has inference speeds of up to x2 or more, so not quite sure what your targeting or if you've actually managed to solve that problem which would be highly impressive

stingraycharles4y ago

I suspect the audience is more about the GPU hosting aspect, effectively making GPU-based applications “serverless”.

To me, adding GPUs into the devops mix typically increases the complexity significantly, and I would definitely pay money to someone who can just take my model, host it, and let them deal with the complexities around it.

theo31OP4y ago

We don't use TensorRT at the moment, but it is something that we are exploring.

johndough4y ago

> Guaranteed < 200ms response time

This sounds confusing to me. Surely it is possible to craft a neural network that takes longer to process?

> Max. model size: X GB

Do you really mean model size or should this also include the size of the intermediate tensors?

The full screen option on the YouTube video is turned off by the way, so it is impossible to read without leaving your website.

Overall, this offer looks quite competitive. Are you planning to offer your service in the EU in the future?

theo31OP4y ago

The response time guaranteed is for a reasonably sized model. Bigger models (> 700 MB) will take a bit longer.

The model size is the zipped size of your model that is uploaded to Inferrd (either through the SDK or the website).

I'll fix the full screen problem right away, thank you for reporting.

We only have servers in the United States at the moment but are looking to have servers all around NA and EU very soon.

johndough4y ago

> The model size is the zipped size of your model that is uploaded to Inferrd (either through the SDK or the website).

Nice to hear!

> We only have servers in the United States at the moment but are looking to have servers all around NA and EU very soon.

Sorry, my question was not quite clear. What I actually wanted to know was more along the lines of being able to use your service in Europe legally. For example, I can not find a privacy policy or a way to get a GDPR data processing agreement.

rootdevelop4y ago

What are the specs of an Nvidia m80?

I’ve never heard of that type before and I wasn’t able to find anything with google.

Furthermore more, the lack of company information (address, company registration nr etc) and the fact that it’s not clear where the servers are located geographically makes me a bit hesitant.

theo31OP4y ago

Sorry that's a typo, they are K80s: https://www.nvidia.com/en-gb/data-center/tesla-k80/

sjnair964y ago

Looks awesome. Do you know if and how you guys support NVIDIA's software. For my project the NVIDIA software I'm using states it needs:

CUDA 11.3.0

cuBLAS 11.5.1.101

cuDNN 8.2.0.41

NCCL 2.9.6

TensorRT 7.2.3.4

Triton Inference Server 2.9.0

I'm new to deploying to production inference so I'm not sure if those are easily portable across such platforms or not really.

theo31OP4y ago

Those frameworks are installed by default in our custom environment. There is no additional setup/configuration required from you.

spullara4y ago

Does it need to reinitialize for each request or is there a warm start / cold start model like lambda? I don't really understand how you can charge per request.

maldeh4y ago

The pricing appears to be static per model with a ceiling on the monthly request count, not charged per request.

Edit: Actually, I didn't spot the free tier of 1000 requests. I wonder how you avoid the problem of a lot of users leaving defunct/disused models running while still keeping them hot - presumably some kind of limit to the model count?

theo31OP4y ago

There is no cold start! We keep your service hot all the time.

spullara4y ago

Well, I guess I know where I am going to host GPT-J-6B then. I don't think it is sustainable.

1 more reply

rubatuga4y ago

I guess GPU loading is quick? Like ~10 seconds?

nextaccountic4y ago

Looking at the examples in the landing page.. so I don't need any kind of authentication to do inference? Anyone can run the models I upload?

theo31OP4y ago

At the moment, no, only the random hash gives some kind of security by obfuscation. More advanced security controls are coming soon.

nextaccountic4y ago

Anyway, here's a must: having a different key for uploading a model and doing inference with it. Or even, there should be a set of keys for each model, with each access logged separately.

nextaccountic4y ago

Also, if someone uses my model, do I pay for it?

gigatexal4y ago

Looks amazing! A 3 line getting started animation? Sold. That’s all I need to see. Very good work folks.

derekhsu4y ago

May I deploy multiple models in the same billing account?

theo31OP4y ago

Yes you absolutely can!

manceraio4y ago

could I run spleeter on it?

theo31OP4y ago

Yes absolutely, you can run almost anything in our custom environment.

inshadows4y ago

White screen without JS

j / k navigate · click thread line to collapse

34 comments

etaioinshrdlu4y ago

theo31OP4y ago

Thank you!

We can scale a lot very quickly depending on how much you need.

_hl_4y ago

That's impressive!

1 more reply

37ef_ced34y ago

Or, do your inference using an AVX-512 CPU:

https://NN-512.com (open source, free software, no dependencies)

Even if you allow TensorFlow to use a larger ResNet50 batch size, NN-512 is easily 1.3x faster.

If you need a few dozen inferences per second per server, this is the cheapest way. And you're not depending on a proprietary solution whose parent company could go out of business in a year.

If you need Transformers instead of convolutions, Fabrice Bellard's LibNC is a good solution: https://bellard.org/libnc/

theo31OP4y ago

Oh that's very interesting, how ready for production is it? It only works for TF right?

> If you need a few dozen inferences per second per server, this is the cheapest way. And you're not depending on a proprietary solution whose parent company could go out of business in a year.

Definitely the cheapest way.

We've been in business for more than a year already actually :)

37ef_ced34y ago

I'm giving performance comparisons versus TensorFlow, which I consider to be a standard tool.

People who use your proprietary, closed, black-box service are dependent on the well-being of your business. You could vanish tomorrow.

ackbar034y ago

stingraycharles4y ago

I suspect the audience is more about the GPU hosting aspect, effectively making GPU-based applications “serverless”.

theo31OP4y ago

We don't use TensorRT at the moment, but it is something that we are exploring.

johndough4y ago

> Guaranteed < 200ms response time

This sounds confusing to me. Surely it is possible to craft a neural network that takes longer to process?

> Max. model size: X GB

Do you really mean model size or should this also include the size of the intermediate tensors?

The full screen option on the YouTube video is turned off by the way, so it is impossible to read without leaving your website.

Overall, this offer looks quite competitive. Are you planning to offer your service in the EU in the future?

theo31OP4y ago

The response time guaranteed is for a reasonably sized model. Bigger models (> 700 MB) will take a bit longer.

The model size is the zipped size of your model that is uploaded to Inferrd (either through the SDK or the website).

I'll fix the full screen problem right away, thank you for reporting.

We only have servers in the United States at the moment but are looking to have servers all around NA and EU very soon.

johndough4y ago

> The model size is the zipped size of your model that is uploaded to Inferrd (either through the SDK or the website).

Nice to hear!

> We only have servers in the United States at the moment but are looking to have servers all around NA and EU very soon.

rootdevelop4y ago

What are the specs of an Nvidia m80?

I’ve never heard of that type before and I wasn’t able to find anything with google.

Furthermore more, the lack of company information (address, company registration nr etc) and the fact that it’s not clear where the servers are located geographically makes me a bit hesitant.

theo31OP4y ago

Sorry that's a typo, they are K80s: https://www.nvidia.com/en-gb/data-center/tesla-k80/

sjnair964y ago

Looks awesome. Do you know if and how you guys support NVIDIA's software. For my project the NVIDIA software I'm using states it needs:

CUDA 11.3.0

cuBLAS 11.5.1.101

cuDNN 8.2.0.41

NCCL 2.9.6

TensorRT 7.2.3.4

Triton Inference Server 2.9.0

I'm new to deploying to production inference so I'm not sure if those are easily portable across such platforms or not really.

theo31OP4y ago

Those frameworks are installed by default in our custom environment. There is no additional setup/configuration required from you.

spullara4y ago

Does it need to reinitialize for each request or is there a warm start / cold start model like lambda? I don't really understand how you can charge per request.

maldeh4y ago

The pricing appears to be static per model with a ceiling on the monthly request count, not charged per request.

theo31OP4y ago

There is no cold start! We keep your service hot all the time.

spullara4y ago

Well, I guess I know where I am going to host GPT-J-6B then. I don't think it is sustainable.

1 more reply

rubatuga4y ago

I guess GPU loading is quick? Like ~10 seconds?

nextaccountic4y ago

Looking at the examples in the landing page.. so I don't need any kind of authentication to do inference? Anyone can run the models I upload?

theo31OP4y ago

At the moment, no, only the random hash gives some kind of security by obfuscation. More advanced security controls are coming soon.

nextaccountic4y ago

Anyway, here's a must: having a different key for uploading a model and doing inference with it. Or even, there should be a set of keys for each model, with each access logged separately.

nextaccountic4y ago

Also, if someone uses my model, do I pay for it?

gigatexal4y ago

Looks amazing! A 3 line getting started animation? Sold. That’s all I need to see. Very good work folks.

derekhsu4y ago

May I deploy multiple models in the same billing account?

theo31OP4y ago

Yes you absolutely can!

manceraio4y ago

could I run spleeter on it?

theo31OP4y ago

Yes absolutely, you can run almost anything in our custom environment.

inshadows4y ago

White screen without JS

j / k navigate · click thread line to collapse