Zero-Shot Text Classification on a low-end CPU-only machine?

8 points | backend-dev-33 | 1y ago | 12 comments

I want to do zero-shot text classification, either with the model [1] (711 MB) or with something similar, and I want high throughput in classification requests per second. Classification will run on low-end hardware: a Hetzner [2] machine without a GPU (Hetzner is great, reliable and cheap; they just do not have GPU machines), something like this:

* CCX13: Dedicated vCPU, 2 vCPU, 8 GB RAM

* CX32: Shared vCPU, 4 vCPU, 8 GB RAM

Now there are multiple options for deploying and serving LLMs:

* lmdeploy

* text-generation-inference

* TensorRT-LLM

* vllm

There are more and more new frameworks for this, and I am a bit lost. Which option would you suggest for deploying the model listed above on hardware without a GPU?

[1] https://huggingface.co/MoritzLaurer/roberta-large-zeroshot-v2.0-c

[2] https://www.hetzner.com/cloud/

12 comments

kkielhofner | 1y ago

The model you linked is not an LLM either by architecture or size.

A few thoughts:

1) TensorRT anything isn’t an option because it requires Nvidia GPUs.

2) The serving frameworks you linked likely don’t support the architecture of this model, and even if they did they have varying levels of support for CPU.

3) I’m not terribly familiar with Hetzner but those instance types seem very low-end.

The model you linked has already been converted to ONNX. Your best bet (probably) is to take the ONNX model and load it in Triton Inference Server. Of course Triton is focused on Nvidia/CUDA but if it doesn’t find an Nvidia GPU it will load the model(s) to CPU. You can then do some performance testing in terms of requests/s but prepare to not be impressed…
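The requests/s testing mentioned above could be sketched roughly as follows. This is a minimal, standard-library-only harness, not part of Triton or any serving framework; `measure_throughput` and the stand-in `fake_classify` are hypothetical names, and the real `classify` callable would wrap whatever runtime ends up serving the model.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    classify: Callable[[Sequence[str]], Sequence[object]],
    texts: Sequence[str],
    batch_size: int = 8,
    warmup_batches: int = 1,
) -> float:
    """Return the number of texts classified per second."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    # Warm-up calls are not timed (lazy initialisation, caches, thread pools).
    for batch in batches[:warmup_batches]:
        classify(batch)
    start = time.perf_counter()
    done = 0
    for batch in batches:
        done += len(classify(batch))
    elapsed = time.perf_counter() - start
    return done / elapsed if elapsed > 0 else float("inf")


def fake_classify(batch: Sequence[str]) -> List[str]:
    """Stand-in for a real model call (assumption: one label per input text)."""
    return ["some-label" for _ in batch]


if __name__ == "__main__":
    sample = ["example text"] * 64
    print(f"{measure_throughput(fake_classify, sample):.1f} texts/s")
```

Running the same harness with several `batch_size` values on the target machine gives a rough idea of where throughput stops improving.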

Then you could look at (probably) int8 quantization of the model via the variety of available approaches (ONNX itself, Intel Neural Compressor, etc.). With Triton specifically you should also look at OpenVINO CPU execution accelerator support. You will need to see if any of these dramatically impact the quality of the model.
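One simple way to check the quality impact is to run the original and the quantized model over the same texts and compare their top labels. The sketch below assumes the two prediction lists have already been collected; `label_agreement` is a hypothetical helper and the sample labels are made up for illustration.

```python
from typing import Sequence


def label_agreement(baseline: Sequence[str], candidate: Sequence[str]) -> float:
    """Fraction of inputs on which both models chose the same top label."""
    if len(baseline) != len(candidate):
        raise ValueError("prediction lists must have the same length")
    if not baseline:
        raise ValueError("need at least one prediction")
    same = sum(1 for a, b in zip(baseline, candidate) if a == b)
    return same / len(baseline)


if __name__ == "__main__":
    # Illustrative data only: top labels from the full-precision and int8 models.
    fp32_labels = ["sports", "politics", "tech", "tech"]
    int8_labels = ["sports", "politics", "tech", "sports"]
    print(label_agreement(fp32_labels, int8_labels))  # 3 of 4 match -> 0.75
```

Agreement against the unquantized model is only a proxy; if labelled data is available, accuracy against the true labels is the better measure.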

Overall I think “good, fast, cheap: pick two” definitely applies here and even implementing what I’ve described is a fairly significant amount of development effort.

backend-dev-33 (OP) | 1y ago

Well, looking at Triton Inference Server + OpenVINO backend [1]... uff... as you said: "significant amount of development effort". Not easy to handle when you do it for the first time.

Is ONNX Runtime + OpenVINO [2] a good idea? It seems easier to install and use: pre-built Docker image and Python package... Not sure about performance (the hardware-related performance improvements are in OpenVINO anyway, right?).

[1] https://github.com/triton-inference-server/openvino_backend

[2] https://onnxruntime.ai/docs/execution-providers/OpenVINO-Exe...
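For the ONNX Runtime + OpenVINO route, provider selection might look roughly like the sketch below. `onnxruntime.get_available_providers()` and the `providers=` argument of `InferenceSession` are real ONNX Runtime calls; `pick_providers` and `open_session` are hypothetical helper names, and the sketch assumes an ONNX Runtime build that includes the OpenVINO execution provider. On a stock build it falls back to the plain CPU provider.

```python
from typing import List, Sequence

# Preference order: OpenVINO first, plain CPU as the fallback.
PREFERRED = ("OpenVINOExecutionProvider", "CPUExecutionProvider")


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Keep the preferred providers that this build actually offers."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} available in {list(available)}")
    return chosen


def open_session(model_path: str):
    """Open an inference session for the ONNX file at `model_path`."""
    # Imported lazily so the helper above works without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Whether the OpenVINO provider is actually faster than the default CPU provider on a 2 to 4 vCPU machine would have to be measured.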

kkielhofner | 1y ago

Hah, it actually gets worse. What I was describing was the Triton ONNX backend with the OpenVINO execution accelerator[0] (not the OpenVINO backend itself). Clear as mud, right?

Your issue here is model performance with the additional challenge of offering it over a network socket across multiple requests and doing so in a performant manner.

Triton does things like dynamic batching[1] where throughput is increased significantly by aggregating disparate requests into one pass through the GPU.
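To illustrate the idea (not Triton's actual implementation), dynamic batching can be sketched with the standard library: a worker thread collects requests until the batch is full or a short wait expires, then makes one model call for the whole batch. `DynamicBatcher`, `run_batch`, `max_batch_size` and `max_wait_s` are hypothetical names chosen for this sketch.

```python
import queue
import threading
import time
from concurrent.futures import Future
from typing import Callable, List, Sequence


class DynamicBatcher:
    """Groups single requests into batches and runs one model call per batch."""

    def __init__(
        self,
        run_batch: Callable[[List[str]], Sequence[object]],
        max_batch_size: int = 8,
        max_wait_s: float = 0.01,
    ) -> None:
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: "queue.Queue" = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, text: str) -> Future:
        """Enqueue one request; the returned future resolves to its result."""
        fut: Future = Future()
        self._queue.put((text, fut))
        return fut

    def _loop(self) -> None:
        while True:
            # Block until at least one request arrives.
            items = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            # Collect more requests until the batch is full or the wait expires.
            while len(items) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    items.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                results = list(self._run_batch([text for text, _ in items]))
                if len(results) != len(items):
                    raise RuntimeError("run_batch must return one result per input")
                for (_, fut), result in zip(items, results):
                    fut.set_result(result)
            except Exception as exc:
                # Propagate the failure to every request in this batch.
                for _, fut in items:
                    fut.set_exception(exc)
```

Each request handler would call `submit(text).result()`; the trade-off is a small added latency (at most `max_wait_s`) in exchange for fewer, larger model calls.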

A Docker container for torch, ONNX, OpenVINO, etc. isn't even natively going to offer a network socket. This is where people try things like rolling their own FastAPI implementation (or something), only to discover it completely falls apart at any kind of load. That's development effort as well, but it's a waste of time.

[0] - https://github.com/triton-inference-server/onnxruntime_backe...

[1] - https://docs.nvidia.com/deeplearning/triton-inference-server...

pilotneko | 1y ago

Hugging Face does maintain a package named Text Embeddings Inference (TEI) with GPU/CPU-optimized container images. While I have only used it for hosting embedding models, it does appear to support RoBERTa-architecture classifiers (specifically sentiment analysis).

https://github.com/huggingface/text-embeddings-inference

You can always run a zero shot pipeline in HF with a simple Flask/FastAPI application.

backend-dev-33 (OP) | 1y ago

Thanks for Text Embeddings Inference (never heard of it before).

> you can always run a zero shot pipeline in HF with a simple Flask/FastAPI application.

Yeah, sometimes you don't see the things that are right in front of your nose. You mean this? https://huggingface.co/docs/api-inference/index

pilotneko | 1y ago

Sorry, life got busy and I haven’t been able to get back to you. I was referring to pipelines in the Transformers package from Hugging Face. https://huggingface.co/docs/transformers/v4.45.2/en/main_cla...

These are essentially function calls for you to run pre-trained models. If you want to continue this conversation elsewhere, feel free to shoot me an e-mail. It’s just my username @ gmail.
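A minimal sketch of such a pipeline call, assuming `transformers` and `torch` are installed and the model can be downloaded: `pipeline("zero-shot-classification", ...)` is the Transformers API, while `build_classifier`, `top_label` and the sample labels are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, Sequence

# Model [1] from the original post.
MODEL_ID = "MoritzLaurer/roberta-large-zeroshot-v2.0-c"


def build_classifier(model_id: str = MODEL_ID) -> Callable[..., Dict[str, Any]]:
    """Create a zero-shot pipeline on CPU (device=-1)."""
    # Imported lazily: requires `pip install transformers torch`.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_id, device=-1)


def top_label(
    classifier: Callable[..., Dict[str, Any]], text: str, labels: Sequence[str]
) -> str:
    """Return the highest-scoring candidate label for `text`."""
    result = classifier(text, candidate_labels=list(labels), multi_label=False)
    # The pipeline returns "labels" sorted by descending score.
    return result["labels"][0]


if __name__ == "__main__":
    clf = build_classifier()  # downloads the model on first use
    print(top_label(clf, "The new GPU ships next month.", ["hardware", "cooking", "sports"]))
```

Loading the pipeline once at process start and reusing it for every request avoids paying the model load time per call.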

Terretta | 1y ago

Have you considered doing it off machine?

https://github.com/GoogleCloudPlatform/cloud-shell-tutorials...

https://cloud.google.com/natural-language/docs/samples/langu...

I'd suggest v2:

https://cloud.google.com/natural-language/docs/classifying-t...

Here are the built-in content categories (which feel consumer-advertising oriented, natch), but it handles other classifications as well:

https://cloud.google.com/natural-language/docs/categories#ca...

backend-dev-33 (OP) | 1y ago

Thanks @Terretta

Well, the categories I use do not overlap at all with the list of 1092 categories in Google Content Categories.

> it handles other classifications as well

Hm... I highly doubt that. First, I do not see an API for uploading a list of MY categories. Second, can somebody with a Google Cloud account try it? I have no account, and creating one requires a credit card...

backend-dev-33 (OP) | 1y ago

UPDATE: how to do the same classification task using some hosting provider with a GPU?

Let us discuss it here -> https://news.ycombinator.com/item?id=41768088

leeeeeepw | 1y ago

SetFit
