Endpoint providers
Anyscale
Fast, cost-efficient, serverless APIs for LLM serving and fine-tuning. It lets you serve and fine-tune open models the same way you would with the OpenAI API.
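Since the endpoint follows the OpenAI API conventions, the standard openai Python client can be pointed at it. A minimal sketch, assuming the base URL and model id shown below (both are assumptions; check Anyscale's documentation for current values):

```python
# Sketch: calling an OpenAI-compatible endpoint (assumed here to be Anyscale Endpoints)
# with the standard openai client. Base URL and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed Anyscale base URL
    api_key="ANYSCALE_API_KEY",                        # your Anyscale API key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model id; use one Anyscale hosts
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```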
Hugging Face Inference Endpoints
Inference Endpoints (dedicated) let you deploy Transformers, Diffusers, or any other model on dedicated, fully managed infrastructure, keeping costs low with a secure, compliant, and flexible production setup.
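Once an endpoint is deployed, it can be queried over HTTP. A minimal sketch using huggingface_hub's InferenceClient, assuming a placeholder endpoint URL and a text-generation model behind it:

```python
# Sketch: querying a deployed Hugging Face Inference Endpoint.
# The endpoint URL below is a placeholder; use the URL shown in your endpoint's dashboard.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://your-endpoint-name.endpoints.huggingface.cloud",  # placeholder URL
    token="hf_...",  # a Hugging Face access token with access to the endpoint
)

# Assumes the endpoint serves a text-generation model.
output = client.text_generation(
    "Explain what a dedicated Inference Endpoint is.",
    max_new_tokens=100,
)
print(output)
```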
Libraries
SkyPilot
A framework for running LLMs, AI workloads, and batch jobs on any cloud, with automatic scheduling across providers and cost savings through spot instances.
vLLM
This is a fast and easy-to-use library for LLM inference and serving, offering:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
- https://vllm.readthedocs.io/en/latest/index.html
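A minimal sketch of offline batched generation with vLLM's Python API; the model id is an assumption, and any Hugging Face causal LM that vLLM supports works:

```python
# Sketch: offline batched generation with vLLM.
# The model id is an assumption; substitute any supported Hugging Face model.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "PagedAttention improves serving throughput because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model so the sketch runs on modest GPUs
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server for online serving; see the docs linked above.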
Truss
The simplest way to serve AI/ML models in production
- Write once, run anywhere: Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
- Fast developer loop: Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
- Support for all Python frameworks: from transformers and diffusers to PyTorch and TensorFlow to TensorRT and Triton, Truss supports models created and served with any framework.
- https://truss.baseten.co/welcome
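The core of a Truss is a small Python class with load and predict hooks. A minimal sketch following that documented pattern; the transformers pipeline used here is an illustrative choice, so treat the details as an assumption and check the docs above:

```python
# Sketch: a minimal Truss model.py, following the load()/predict() pattern.
# The gpt2 text-generation pipeline is an illustrative choice, not a requirement.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Called once when the model server starts; load weights here.
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input):
        # Called per request; model_input comes from the request body.
        prompt = model_input["prompt"]
        return self._pipeline(prompt, max_new_tokens=50)
```

Scaffolding and deployment are handled by the Truss CLI (`truss init`, `truss push`); see the link above for details.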
LangCorn
LangCorn is an API server that enables you to serve LangChain models and pipelines with ease, leveraging the power of FastAPI for a robust and efficient experience.
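A sketch of how a chain is typically exposed through LangCorn, assuming its create_service helper; the module path below is illustrative, and the exact entry points should be checked against the LangCorn README:

```python
# Sketch: exposing a LangChain chain through LangCorn's FastAPI wrapper.
# "my_chains.summarize:chain" is an illustrative module path, not a real package.
from langcorn import create_service

# create_service returns a FastAPI app that maps each chain to an HTTP route.
app = create_service("my_chains.summarize:chain")

# Run with: uvicorn main:app --host 0.0.0.0 --port 8718
```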
Text Generation Inference (TGI)
A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API, and Inference Endpoints.
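A running TGI server exposes an HTTP generation API that can be queried with huggingface_hub's InferenceClient. A minimal sketch, assuming a TGI instance already listening locally (e.g. launched via the official Docker image):

```python
# Sketch: querying a local Text Generation Inference server.
# Assumes a TGI instance is already listening on localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")
print(
    client.text_generation(
        "What does continuous batching do?",
        max_new_tokens=80,
    )
)
```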
🤗 Model Memory Calculator
This tool helps calculate how much vRAM is needed to train or run inference on a large model.
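The underlying arithmetic is roughly bytes-per-parameter times parameter count. A rule-of-thumb sketch; the per-parameter costs below are common approximations, not the calculator's exact formula, and they ignore activations and the KV cache:

```python
# Sketch: rule-of-thumb vRAM estimates. These per-parameter costs are approximations,
# not the exact formula used by the Model Memory Calculator, and exclude activations
# and KV-cache memory.
def estimate_vram_gib(num_params_billion: float) -> dict:
    params = num_params_billion * 1e9
    gib = 1024 ** 3
    return {
        # fp16/bf16 inference: ~2 bytes per parameter for the weights alone
        "inference_fp16_weights": params * 2 / gib,
        # fp32 training with Adam: weights (4) + gradients (4) + optimizer states (8)
        "training_fp32_adam": params * 16 / gib,
    }

print(estimate_vram_gib(7))  # e.g. a 7B-parameter model
```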