Endpoint providers
Anyscale
Fast, cost-efficient, serverless APIs for LLM serving and fine-tuning. It lets you serve and fine-tune open models the same way you would with the OpenAI API.
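Since the endpoint follows the OpenAI API conventions, the standard openai Python client can be pointed at it. A minimal sketch, assuming the base URL and model id shown below (both are assumptions; check Anyscale's documentation for current values):

```python
# Sketch: calling an OpenAI-compatible endpoint (assumed here to be Anyscale Endpoints)
# with the standard openai client. Base URL and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed Anyscale base URL
    api_key="ANYSCALE_API_KEY",                        # your Anyscale API key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model id; use one Anyscale hosts
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```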
Hugging Face Inference Endpoints
Inference Endpoints (dedicated) let you deploy Transformers, Diffusers, or any other model on dedicated, fully managed infrastructure, keeping costs low with a secure, compliant, and flexible production setup.
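Once an endpoint is deployed, it can be queried over HTTP. A minimal sketch using huggingface_hub's InferenceClient, assuming a placeholder endpoint URL and a text-generation model behind it:

```python
# Sketch: querying a deployed Hugging Face Inference Endpoint.
# The endpoint URL below is a placeholder; use the URL shown in your endpoint's dashboard.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://your-endpoint-name.endpoints.huggingface.cloud",  # placeholder URL
    token="hf_...",  # a Hugging Face access token with access to the endpoint
)

# Assumes the endpoint serves a text-generation model.
output = client.text_generation(
    "Explain what a dedicated Inference Endpoint is.",
    max_new_tokens=100,
)
print(output)
```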
Libraries
SkyPilot
A framework for running LLMs, AI workloads, and batch jobs on any cloud, with automatic scheduling across providers and cost savings through spot instances.
vLLM
This is a fast and easy-to-use library for LLM inference and serving, offering:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
- https://vllm.readthedocs.io/en/latest/index.html
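A minimal sketch of offline batched generation with vLLM's Python API; the model id is an assumption, and any Hugging Face causal LM that vLLM supports works:

```python
# Sketch: offline batched generation with vLLM.
# The model id is an assumption; substitute any supported Hugging Face model.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "PagedAttention improves serving throughput because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model so the sketch runs on modest GPUs
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server for online serving; see the docs linked above.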
Truss
The simplest way to serve AI/ML models in production
- Write once, run anywhere: Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
- Fast developer loop: Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
- Support for all Python frameworks: from transformers and diffusers to PyTorch and TensorFlow to TensorRT and Triton, Truss supports models created and served with any framework.
- https://truss.baseten.co/welcome
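The core of a Truss is a small Python class with load and predict hooks. A minimal sketch following that documented pattern; the transformers pipeline used here is an illustrative choice, so treat the details as an assumption and check the docs above:

```python
# Sketch: a minimal Truss model.py, following the load()/predict() pattern.
# The gpt2 text-generation pipeline is an illustrative choice, not a requirement.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Called once when the model server starts; load weights here.
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input):
        # Called per request; model_input comes from the request body.
        prompt = model_input["prompt"]
        return self._pipeline(prompt, max_new_tokens=50)
```

Scaffolding and deployment are handled by the Truss CLI (`truss init`, `truss push`); see the link above for details.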
LangCorn
LangCorn is an API server that enables you to serve LangChain models and pipelines with ease, leveraging the power of FastAPI for a robust and efficient experience.
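A sketch of how a chain is typically exposed through LangCorn, assuming its create_service helper; the module path below is illustrative, and the exact entry points should be checked against the LangCorn README:

```python
# Sketch: exposing a LangChain chain through LangCorn's FastAPI wrapper.
# "my_chains.summarize:chain" is an illustrative module path, not a real package.
from langcorn import create_service

# create_service returns a FastAPI app that maps each chain to an HTTP route.
app = create_service("my_chains.summarize:chain")

# Run with: uvicorn main:app --host 0.0.0.0 --port 8718
```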
Text Generation Inference (TGI)
A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API, and Inference Endpoints.
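A running TGI server exposes an HTTP generation API that can be queried with huggingface_hub's InferenceClient. A minimal sketch, assuming a TGI instance already listening locally (e.g. launched via the official Docker image):

```python
# Sketch: querying a local Text Generation Inference server.
# Assumes a TGI instance is already listening on localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")
print(
    client.text_generation(
        "What does continuous batching do?",
        max_new_tokens=80,
    )
)
```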
🤗 Model Memory Calculator
This tool helps calculate how much vRAM is needed to train or run inference on a large model.
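The underlying arithmetic is roughly bytes-per-parameter times parameter count. A rule-of-thumb sketch; the per-parameter costs below are common approximations, not the calculator's exact formula, and they ignore activations and the KV cache:

```python
# Sketch: rule-of-thumb vRAM estimates. These per-parameter costs are approximations,
# not the exact formula used by the Model Memory Calculator, and exclude activations
# and KV-cache memory.
def estimate_vram_gib(num_params_billion: float) -> dict:
    params = num_params_billion * 1e9
    gib = 1024 ** 3
    return {
        # fp16/bf16 inference: ~2 bytes per parameter for the weights alone
        "inference_fp16_weights": params * 2 / gib,
        # fp32 training with Adam: weights (4) + gradients (4) + optimizer states (8)
        "training_fp32_adam": params * 16 / gib,
    }

print(estimate_vram_gib(7))  # e.g. a 7B-parameter model
```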