Subsections of References

Shell Completion

LocalAI provides shell completion support for bash, zsh, and fish shells. Once installed, tab completion works for all CLI commands, subcommands, and flags.

Generating Completion Scripts

Use the completion subcommand to generate a completion script for your shell:

local-ai completion bash
local-ai completion zsh
local-ai completion fish

Installation

Bash

Add the following to your ~/.bashrc:

source <(local-ai completion bash)

Or install it system-wide:

local-ai completion bash > /etc/bash_completion.d/local-ai

Zsh

Add the following to your ~/.zshrc:

source <(local-ai completion zsh)

Or install it to a completions directory:

local-ai completion zsh > "${fpath[1]}/_local-ai"

If shell completions are not already enabled in your zsh environment, add the following to the beginning of your ~/.zshrc:

autoload -Uz compinit
compinit

Fish

local-ai completion fish | source

Or install it permanently:

local-ai completion fish > ~/.config/fish/completions/local-ai.fish

Usage

After installation, restart your shell or source your shell configuration file. Then type local-ai followed by a tab to see available commands:

$ local-ai <TAB>
run              backends         completion       explorer         models
federated        sound-generation transcript       tts              util

Tab completion also works for subcommands and flags:

$ local-ai models <TAB>
install  list

$ local-ai run --<TAB>
--address          --backends-path    --context-size     --debug            ...

System Info and Version

LocalAI provides endpoints to inspect the running instance, including available backends, loaded models, and version information.

System Information

  • Method: GET
  • Endpoint: /system

Returns available backends and currently loaded models.

Response

| Field | Type | Description |
|---|---|---|
| backends | array | List of available backend names (strings) |
| loaded_models | array | List of currently loaded models |
| loaded_models[].id | string | Model identifier |

Usage

curl http://localhost:8080/system

Example response

{
  "backends": [
    "llama-cpp",
    "huggingface",
    "diffusers",
    "whisper"
  ],
  "loaded_models": [
    {
      "id": "my-llama-model"
    },
    {
      "id": "whisper-1"
    }
  ]
}
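The /system payload is plain JSON, so it is easy to consume from a script. A minimal sketch that extracts the backend and model lists from the example response above:

```python
import json

# Example /system payload, as returned by: curl http://localhost:8080/system
raw = """
{
  "backends": ["llama-cpp", "huggingface", "diffusers", "whisper"],
  "loaded_models": [{"id": "my-llama-model"}, {"id": "whisper-1"}]
}
"""

info = json.loads(raw)
backends = info["backends"]
loaded = [m["id"] for m in info["loaded_models"]]
print(backends)  # ['llama-cpp', 'huggingface', 'diffusers', 'whisper']
print(loaded)    # ['my-llama-model', 'whisper-1']
```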

Version

  • Method: GET
  • Endpoint: /version

Returns the LocalAI version and build commit.

Response

| Field | Type | Description |
|---|---|---|
| version | string | Version string in the format version (commit) |

Usage

curl http://localhost:8080/version

Example response

{
  "version": "2.26.0 (a1b2c3d4)"
}
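Because the version and commit are packed into a single string, clients that need them separately have to split the field. A small sketch (the regex assumes the version (commit) shape documented above):

```python
import re

# Example /version payload
payload = {"version": "2.26.0 (a1b2c3d4)"}

match = re.fullmatch(r"(?P<version>\S+) \((?P<commit>[0-9a-f]+)\)", payload["version"])
assert match is not None
print(match["version"])  # 2.26.0
print(match["commit"])   # a1b2c3d4
```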

Error Responses

| Status Code | Description |
|---|---|
| 500 | Internal server error |

Model compatibility table

Besides llama-based models, LocalAI is also compatible with other model architectures. The tables below list all the backends and the model families compatible with each.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
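For example, a minimal model YAML that pins a specific backend might look like the following (the model name and file are illustrative; see the advanced section for the full schema):

```yaml
name: my-llama-model
backend: llama-cpp
parameters:
  model: my-model.gguf
```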

Text Generation & Language Models

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | LLama, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | yes | GPT and Functions | yes | yes | CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Various GPTs and quantization formats | yes | GPT | no | no | CUDA 12/13, ROCm, Intel |
| transformers | Various GPTs and quantization formats | yes | GPT, embeddings, Audio generation | yes | yes* | CUDA 12/13, ROCm, Intel, CPU |
| MLX | Various LLMs | yes | GPT | no | no | Metal (Apple Silicon) |
| MLX-VLM | Vision-Language Models | yes | Multimodal GPT | no | no | Metal (Apple Silicon) |
| vllm-omni | vLLM Omni multimodal | yes | Multimodal GPT | no | no | CUDA 12/13, ROCm, Intel |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |

Audio & Speech Processing

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| whisper.cpp | whisper | no | Audio transcription | no | no | CUDA 12/13, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | whisper | no | Audio transcription | no | no | CUDA 12/13, ROCm, Intel, CPU |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | CPU |
| coqui | Coqui TTS | no | Audio generation and Voice cloning | no | no | CUDA 12/13, ROCm, Intel, CPU |
| kokoro | Kokoro TTS | no | Text-to-speech | no | no | CUDA 12/13, ROCm, Intel, CPU |
| chatterbox | Chatterbox TTS | no | Text-to-speech | no | no | CUDA 12/13, CPU |
| kitten-tts | Kitten TTS | no | Text-to-speech | no | no | CPU |
| silero-vad with Golang bindings | Silero VAD | no | Voice Activity Detection | no | no | CPU |
| neutts | NeuTTS Air | no | Text-to-speech with voice cloning | no | no | CUDA 12/13, ROCm, CPU |
| vibevoice | VibeVoice-Realtime | no | Real-time text-to-speech with voice cloning | no | no | CUDA 12/13, ROCm, Intel, CPU |
| pocket-tts | Pocket TTS | no | Lightweight CPU-based text-to-speech with voice cloning | no | no | CUDA 12/13, ROCm, Intel, CPU |
| mlx-audio | MLX | no | Text-to-speech | no | no | Metal (Apple Silicon) |
| nemo | NeMo speech models | no | Speech models | no | no | CUDA 12/13, ROCm, Intel, CPU |
| outetts | OuteTTS | no | Text-to-speech with voice cloning | no | no | CUDA 12/13, CPU |
| faster-qwen3-tts | Faster Qwen3 TTS | no | Fast text-to-speech | no | no | CUDA 12/13, ROCm, Intel, CPU |
| qwen-asr | Qwen ASR | no | Automatic speech recognition | no | no | CUDA 12/13, ROCm, Intel, CPU |
| voxcpm | VoxCPM | no | Speech understanding | no | no | CUDA 12/13, Metal, CPU |
| whisperx | WhisperX | no | Enhanced transcription | no | no | CUDA 12/13, ROCm, Intel, CPU |

Image & Video Generation

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| stablediffusion.cpp | stablediffusion-1, stablediffusion-2, stablediffusion-3, flux, PhotoMaker | no | Image | no | no | CUDA 12/13, Intel SYCL, Vulkan, CPU |
| diffusers | SD, various diffusion models,… | no | Image/Video generation | no | no | CUDA 12/13, ROCm, Intel, Metal, CPU |
| transformers-musicgen | MusicGen | no | Audio generation | no | no | CUDA, CPU |

Specialized AI Tasks

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| rfdetr | RF-DETR | no | Object Detection | no | no | CUDA 12/13, Intel, CPU |
| rerankers | Reranking API | no | Reranking | no | no | CUDA 12/13, ROCm, Intel, CPU |
| local-store | Vector database | no | Vector storage | yes | no | CPU |
| huggingface | HuggingFace API models | yes | Various AI tasks | yes | yes | API-based |

Acceleration Support Summary

GPU Acceleration

  • NVIDIA CUDA: CUDA 12.0, CUDA 13.0 support across most backends
  • AMD ROCm: HIP-based acceleration for AMD GPUs
  • Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
  • Vulkan: Cross-platform GPU acceleration
  • Metal: Apple Silicon GPU acceleration (M1/M2/M3+)

Specialized Hardware

  • NVIDIA Jetson (L4T CUDA 12): ARM64 support for embedded AI (AGX Orin, Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier)
  • NVIDIA Jetson (L4T CUDA 13): ARM64 support for embedded AI (DGX Spark)
  • Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
  • Darwin x86: Intel Mac support

CPU Optimization

  • AVX/AVX2/AVX512: Advanced vector extensions for x86
  • Quantization: 4-bit, 5-bit, 8-bit integer quantization support
  • Mixed Precision: F16/F32 mixed precision support
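As a back-of-the-envelope illustration of what quantization buys (weight storage only; activations and runtime overhead are ignored, so treat these as rough numbers):

```python
# Approximate weight storage for a 7B-parameter model at different precisions.
params = 7_000_000_000

def weight_bytes(bits_per_weight: float) -> float:
    """Weight storage in GiB at the given precision (weights only)."""
    return params * bits_per_weight / 8 / 2**30

for name, bits in [("F32", 32), ("F16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_bytes(bits):.1f} GiB")
# F32: 26.1 GiB
# F16: 13.0 GiB
# 8-bit: 6.5 GiB
# 4-bit: 3.3 GiB
```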

Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).

  • * Token streaming for transformers is only available with CUDA and OpenVINO CPU/XPU acceleration.

Architecture

LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers: you can build your own gRPC server and extend LocalAI at runtime, and you can also specify external gRPC servers and/or binaries that LocalAI will manage internally.

LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.


Backstory

As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (initially known as llama-cli) and added an API to it.

But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.

Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid Go's garbage-collection penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don’t forget that I’m all about sharing the love. That’s why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.

As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!

Oh, and let’s not forget the real MVP here: llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!

CLI Reference

Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.

Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.

Global Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| -h, --help | | Show context-sensitive help | |
| --log-level | info | Set the level of logs to output [error,warn,info,debug,trace] | $LOCALAI_LOG_LEVEL |
| --debug | false | DEPRECATED - Use --log-level=debug instead. Enable debug logging | $LOCALAI_DEBUG, $DEBUG |

Storage Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH, $MODELS_PATH |
| --data-path | BASEPATH/data | Path for persistent data (collectiondb, agent state, tasks, jobs). Separates mutable data from configuration | $LOCALAI_DATA_PATH |
| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | $LOCALAI_GENERATED_CONTENT_PATH, $GENERATED_CONTENT_PATH |
| --upload-path | /tmp/localai/upload | Path to store uploads from files API | $LOCALAI_UPLOAD_PATH, $UPLOAD_PATH |
| --localai-config-dir | BASEPATH/configuration | Directory for dynamic loading of certain configuration files (currently runtime_settings.json, api_keys.json, and external_backends.json). See Runtime Settings for web-based configuration. | $LOCALAI_CONFIG_DIR |
| --localai-config-dir-poll-interval | | Time duration to poll the LocalAI Config Dir if your system has broken fsnotify events (example: 1m) | $LOCALAI_CONFIG_DIR_POLL_INTERVAL |
| --models-config-file | | YAML file containing a list of model backend configs (alias: --config-file) | $LOCALAI_MODELS_CONFIG_FILE, $CONFIG_FILE |

Backend Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --backends-path | BASEPATH/backends | Path containing backends used for inferencing | $LOCALAI_BACKENDS_PATH, $BACKENDS_PATH |
| --backends-system-path | /var/lib/local-ai/backends | Path containing system backends used for inferencing | $LOCALAI_BACKENDS_SYSTEM_PATH, $BACKEND_SYSTEM_PATH |
| --external-backends | | A list of external backends to load from gallery on boot | $LOCALAI_EXTERNAL_BACKENDS, $EXTERNAL_BACKENDS |
| --external-grpc-backends | | A list of external gRPC backends (format: BACKEND_NAME:URI) | $LOCALAI_EXTERNAL_GRPC_BACKENDS, $EXTERNAL_GRPC_BACKENDS |
| --backend-galleries | | JSON list of backend galleries | $LOCALAI_BACKEND_GALLERIES, $BACKEND_GALLERIES |
| --autoload-backend-galleries | true | Automatically load backend galleries on startup | $LOCALAI_AUTOLOAD_BACKEND_GALLERIES, $AUTOLOAD_BACKEND_GALLERIES |
| --parallel-requests | false | Enable backends to handle multiple requests in parallel if they support it (e.g.: llama.cpp or vllm) | $LOCALAI_PARALLEL_REQUESTS, $PARALLEL_REQUESTS |
| --max-active-backends | 0 | Maximum number of active backends (loaded models). When exceeded, the least recently used model is evicted. Set to 0 for unlimited, 1 for single-backend mode | $LOCALAI_MAX_ACTIVE_BACKENDS, $MAX_ACTIVE_BACKENDS |
| --single-active-backend | false | DEPRECATED - Use --max-active-backends=1 instead. Allow only one backend to be run at a time | $LOCALAI_SINGLE_ACTIVE_BACKEND, $SINGLE_ACTIVE_BACKEND |
| --preload-backend-only | false | Do not launch the API services, only the preloaded models/backends are started (useful for multi-node setups) | $LOCALAI_PRELOAD_BACKEND_ONLY, $PRELOAD_BACKEND_ONLY |
| --enable-watchdog-idle | false | Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout | $LOCALAI_WATCHDOG_IDLE, $WATCHDOG_IDLE |
| --watchdog-idle-timeout | 15m | Threshold beyond which an idle backend should be stopped | $LOCALAI_WATCHDOG_IDLE_TIMEOUT, $WATCHDOG_IDLE_TIMEOUT |
| --enable-watchdog-busy | false | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | $LOCALAI_WATCHDOG_BUSY, $WATCHDOG_BUSY |
| --watchdog-busy-timeout | 5m | Threshold beyond which a busy backend should be stopped | $LOCALAI_WATCHDOG_BUSY_TIMEOUT, $WATCHDOG_BUSY_TIMEOUT |
| --watchdog-interval | 500ms | Interval between watchdog checks (e.g., 500ms, 5s, 1m) | $LOCALAI_WATCHDOG_INTERVAL, $WATCHDOG_INTERVAL |
| --force-eviction-when-busy | false | Force eviction even when models have active API calls (default: false for safety). Warning: Enabling this can interrupt active requests | $LOCALAI_FORCE_EVICTION_WHEN_BUSY, $FORCE_EVICTION_WHEN_BUSY |
| --lru-eviction-max-retries | 30 | Maximum number of retries when waiting for busy models to become idle before eviction | $LOCALAI_LRU_EVICTION_MAX_RETRIES, $LRU_EVICTION_MAX_RETRIES |
| --lru-eviction-retry-interval | 1s | Interval between retries when waiting for busy models to become idle (e.g., 1s, 2s) | $LOCALAI_LRU_EVICTION_RETRY_INTERVAL, $LRU_EVICTION_RETRY_INTERVAL |

For more information on VRAM management, see VRAM and Memory Management.
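The eviction flags above amount to a bounded retry loop: wait for the busy model to go idle, then evict it, or give up (unless forced). A simplified sketch of that behavior (not LocalAI's actual implementation; the interval is shortened for illustration):

```python
import time

def try_evict(is_idle, max_retries=30, retry_interval=0.001, force=False):
    """Wait for a busy model to go idle before evicting it.

    Mirrors --lru-eviction-max-retries, --lru-eviction-retry-interval,
    and --force-eviction-when-busy (simplified sketch).
    """
    for _ in range(max_retries):
        if is_idle():
            return True          # model went idle: safe to evict
        time.sleep(retry_interval)
    return force                  # still busy: evict only if forced

# A model that becomes idle after three checks is evicted safely:
calls = iter([False, False, False, True])
print(try_evict(lambda: next(calls)))  # True
```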

Models Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --galleries | | JSON list of galleries | $LOCALAI_GALLERIES, $GALLERIES |
| --autoload-galleries | true | Automatically load galleries on startup | $LOCALAI_AUTOLOAD_GALLERIES, $AUTOLOAD_GALLERIES |
| --preload-models | | A list of models to apply in JSON at start | $LOCALAI_PRELOAD_MODELS, $PRELOAD_MODELS |
| --models | | A list of model configuration URLs to load | $LOCALAI_MODELS, $MODELS |
| --preload-models-config | | A list of models to apply at startup. Path to a YAML config file | $LOCALAI_PRELOAD_MODELS_CONFIG, $PRELOAD_MODELS_CONFIG |
| --load-to-memory | | A list of models to load into memory at startup | $LOCALAI_LOAD_TO_MEMORY, $LOAD_TO_MEMORY |

Note: You can also pass model configuration URLs as positional arguments: local-ai run MODEL_URL1 MODEL_URL2 ...

Performance Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --f16 | false | Enable GPU acceleration | $LOCALAI_F16, $F16 |
| -t, --threads | | Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested | $LOCALAI_THREADS, $THREADS |
| --context-size | | Default context size for models | $LOCALAI_CONTEXT_SIZE, $CONTEXT_SIZE |

API Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --address | :8080 | Bind address for the API server | $LOCALAI_ADDRESS, $ADDRESS |
| --cors | false | Enable CORS (Cross-Origin Resource Sharing) | $LOCALAI_CORS, $CORS |
| --cors-allow-origins | | Comma-separated list of allowed CORS origins | $LOCALAI_CORS_ALLOW_ORIGINS, $CORS_ALLOW_ORIGINS |
| --csrf | false | Enable Fiber CSRF middleware | $LOCALAI_CSRF |
| --upload-limit | 15 | Default upload-limit in MB | $LOCALAI_UPLOAD_LIMIT, $UPLOAD_LIMIT |
| --api-keys | | List of API Keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys | $LOCALAI_API_KEY, $API_KEY |
| --disable-webui | false | Disables the web user interface. When set to true, the server will only expose API endpoints without serving the web interface | $LOCALAI_DISABLE_WEBUI, $DISABLE_WEBUI |
| --disable-runtime-settings | false | Disables the runtime settings feature. When set to true, the server will not load runtime settings from the runtime_settings.json file and the settings web interface will be disabled | $LOCALAI_DISABLE_RUNTIME_SETTINGS, $DISABLE_RUNTIME_SETTINGS |
| --disable-gallery-endpoint | false | Disable the gallery endpoints | $LOCALAI_DISABLE_GALLERY_ENDPOINT, $DISABLE_GALLERY_ENDPOINT |
| --disable-metrics-endpoint | false | Disable the /metrics endpoint | $LOCALAI_DISABLE_METRICS_ENDPOINT, $DISABLE_METRICS_ENDPOINT |
| --machine-tag | | If not empty, add that string to Machine-Tag header in each response. Useful to track response from different machines using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG, $MACHINE_TAG |

Hardening Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --disable-predownload-scan | false | If true, disables the best-effort security scanner before downloading any files | $LOCALAI_DISABLE_PREDOWNLOAD_SCAN |
| --opaque-errors | false | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | $LOCALAI_OPAQUE_ERRORS |
| --use-subtle-key-comparison | false | If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | $LOCALAI_SUBTLE_KEY_COMPARISON |
| --disable-api-key-requirement-for-http-get | false | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | $LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET |
| --http-get-exempted-endpoints | ^/$,^/browse/?$,^/talk/?$,^/p2p/?$,^/chat/?$,^/image/?$,^/text2image/?$,^/tts/?$,^/static/.*$,^/swagger.*$ | If --disable-api-key-requirement-for-http-get is overridden to true, this is the list of endpoints to exempt. Only adjust this in case of a security incident or as a result of a personal security posture review | $LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS |
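The default exemption list is a comma-separated set of anchored regular expressions matched against the request path. A quick sketch for checking which paths a given list exempts (how LocalAI applies the patterns internally may differ in detail):

```python
import re

# The documented default value of --http-get-exempted-endpoints:
default = (r"^/$,^/browse/?$,^/talk/?$,^/p2p/?$,^/chat/?$,^/image/?$,"
           r"^/text2image/?$,^/tts/?$,^/static/.*$,^/swagger.*$")

patterns = [re.compile(p) for p in default.split(",")]

def exempted(path: str) -> bool:
    """True if any pattern in the list matches the request path."""
    return any(p.match(path) for p in patterns)

print(exempted("/chat"))        # True
print(exempted("/v1/models"))   # False
```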

P2P Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --p2p | false | Enable P2P mode | $LOCALAI_P2P, $P2P |
| --p2p-dht-interval | 360 | Interval for DHT refresh (used during token generation) | $LOCALAI_P2P_DHT_INTERVAL, $P2P_DHT_INTERVAL |
| --p2p-otp-interval | 9000 | Interval for OTP refresh (used during token generation) | $LOCALAI_P2P_OTP_INTERVAL, $P2P_OTP_INTERVAL |
| --p2ptoken | | Token for P2P mode (optional) | $LOCALAI_P2P_TOKEN, $P2P_TOKEN, $TOKEN |
| --p2p-network-id | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | $LOCALAI_P2P_NETWORK_ID, $P2P_NETWORK_ID |
| --federated | false | Enable federated instance | $LOCALAI_FEDERATED, $FEDERATED |

Other Commands

LocalAI supports several subcommands beyond run:

  • local-ai models - Manage LocalAI models and definitions
  • local-ai backends - Manage LocalAI backends and definitions
  • local-ai tts - Convert text to speech
  • local-ai sound-generation - Generate audio files from text or audio
  • local-ai transcript - Convert audio to text
  • local-ai worker - Run workers to distribute workload (llama.cpp-only)
  • local-ai util - Utility commands
  • local-ai explorer - Run P2P explorer
  • local-ai federated - Run LocalAI in federated mode

Use local-ai <command> --help for more information on each command.

Examples

Basic Usage

./local-ai run

./local-ai run --models-path /path/to/models --address :9090

./local-ai run --f16

Environment Variables

export LOCALAI_MODELS_PATH=/path/to/models
export LOCALAI_ADDRESS=:9090
export LOCALAI_F16=true
./local-ai run

Advanced Configuration

./local-ai run \
  --models model1.yaml model2.yaml \
  --enable-watchdog-idle \
  --watchdog-idle-timeout=10m \
  --p2p \
  --federated

API Error Reference

This page documents the error responses returned by the LocalAI API. LocalAI supports multiple API formats (OpenAI, Anthropic, Open Responses), each with its own error structure.

Error Response Formats

OpenAI-Compatible Format

Most endpoints return errors using the OpenAI-compatible format:

{
  "error": {
    "code": 400,
    "message": "A human-readable description of the error",
    "type": "invalid_request_error",
    "param": null
  }
}

| Field | Type | Description |
|---|---|---|
| code | integer or string | HTTP status code or error code string |
| message | string | Human-readable error description |
| type | string | Error category (e.g., invalid_request_error) |
| param | string or null | The parameter that caused the error, if applicable |

This format is used by: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations, /v1/audio/transcriptions, /models, and other OpenAI-compatible endpoints.
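A client can reduce such a body to a log line with a few dict lookups. A minimal sketch using the example error above:

```python
def describe_error(body: dict) -> str:
    """Turn an OpenAI-style error body into a one-line summary."""
    err = body["error"]
    return f"{err['code']} {err['type']}: {err['message']}"

body = {
    "error": {
        "code": 400,
        "message": "A human-readable description of the error",
        "type": "invalid_request_error",
        "param": None,
    }
}
print(describe_error(body))
# 400 invalid_request_error: A human-readable description of the error
```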

Anthropic Format

The /v1/messages endpoint returns errors in Anthropic’s format:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "A human-readable description of the error"
  }
}

| Field | Type | Description |
|---|---|---|
| type | string | Always "error" for error responses |
| error.type | string | invalid_request_error or api_error |
| error.message | string | Human-readable error description |

Open Responses Format

The /v1/responses endpoint returns errors with this structure:

{
  "error": {
    "type": "invalid_request",
    "message": "A human-readable description of the error",
    "code": "",
    "param": "parameter_name"
  }
}

| Field | Type | Description |
|---|---|---|
| type | string | One of: invalid_request, not_found, server_error, model_error, invalid_request_error |
| message | string | Human-readable error description |
| code | string | Optional error code |
| param | string | The parameter that caused the error, if applicable |
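Clients that talk to several endpoint families may need to detect which of the three formats an error body uses. A rough heuristic sketch (it leans on the OpenAI-format code usually being an integer, which is an assumption rather than a guarantee):

```python
def classify(body: dict) -> str:
    """Heuristically classify an error body among the three formats above."""
    if body.get("type") == "error":
        return "anthropic"        # top-level type marker
    err = body.get("error", {})
    if isinstance(err.get("code"), int):
        return "openai"           # numeric HTTP status in the error object
    return "open-responses"

anthropic = {"type": "error", "error": {"type": "invalid_request_error", "message": "m"}}
openai = {"error": {"code": 400, "message": "m", "type": "invalid_request_error", "param": None}}
openresp = {"error": {"type": "invalid_request", "message": "m", "code": "", "param": ""}}

print(classify(anthropic), classify(openai), classify(openresp))
# anthropic openai open-responses
```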

HTTP Status Codes

| Code | Meaning | When It Occurs |
|---|---|---|
| 400 | Bad Request | Invalid input, missing required fields, malformed JSON |
| 401 | Unauthorized | Missing or invalid API key |
| 404 | Not Found | Model or resource does not exist |
| 409 | Conflict | Resource already exists (e.g., duplicate token) |
| 422 | Unprocessable Entity | Validation failed (e.g., invalid parameter range) |
| 500 | Internal Server Error | Backend inference failure, unexpected server errors |

Global Error Handling

Authentication Errors (401)

When API keys are configured (via LOCALAI_API_KEY or --api-keys), all requests must include a valid key. Keys can be provided through:

  • Authorization: Bearer <key> header
  • x-api-key: <key> header
  • xi-api-key: <key> header
  • token cookie
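Expressed as request headers, the options above look like this (the key value is a placeholder):

```python
API_KEY = "your-key"  # placeholder key

# Equivalent ways to authenticate a request; pick one:
bearer = {"Authorization": f"Bearer {API_KEY}"}
x_api = {"x-api-key": API_KEY}
xi_api = {"xi-api-key": API_KEY}
cookie = {"Cookie": f"token={API_KEY}"}

print(bearer)  # {'Authorization': 'Bearer your-key'}
```

The Authorization header is the standard choice for OpenAI-compatible clients; the others exist for compatibility with Anthropic- and ElevenLabs-style SDKs.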

Example request without a key:

curl http://localhost:8080/v1/models \
  -H "Content-Type: application/json"

Error response:

{
  "error": {
    "code": 401,
    "message": "An authentication key is required",
    "type": "invalid_request_error"
  }
}

The response also includes the header WWW-Authenticate: Bearer.

Request Parsing Errors (400)

All endpoints return a 400 error if the request body cannot be parsed:

{
  "error": {
    "code": 400,
    "message": "failed parsing request body: <details>",
    "type": ""
  }
}

Not Found (404)

Requests to undefined routes return:

{
  "error": {
    "code": 404,
    "message": "Resource not found"
  }
}

Opaque Errors Mode

When LOCALAI_OPAQUE_ERRORS=true is set, all error responses return an empty body with only the HTTP status code. This is a security hardening option that prevents information leaks.
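Clients should therefore not assume an error body is present. A defensive sketch that degrades gracefully when opaque-errors mode strips the details:

```python
import json

def error_message(status: int, body: str) -> str:
    """Derive a message even when opaque-errors mode strips the body."""
    if not body.strip():
        return f"HTTP {status} (opaque errors enabled, no details)"
    try:
        return json.loads(body)["error"]["message"]
    except (ValueError, KeyError):
        return body  # non-JSON or unexpected shape: return it verbatim

print(error_message(500, ""))
# HTTP 500 (opaque errors enabled, no details)
print(error_message(400, '{"error": {"code": 400, "message": "bad input"}}'))
# bad input
```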

Per-Endpoint Error Scenarios

Chat Completions - POST /v1/chat/completions

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

# Missing model field
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'

See also: Text Generation

Completions - POST /v1/completions

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

Embeddings - POST /v1/embeddings

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

See also: Embeddings

Image Generation - POST /v1/images/generations

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

See also: Image Generation

Image Editing (Inpainting) - POST /v1/images/edits

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Missing image file | missing image file |
| 400 | Missing mask file | missing mask file |
| 500 | Storage preparation failure | failed to prepare storage |

Audio Transcription - POST /v1/audio/transcriptions

| Status | Cause | Example Message |
|---|---|---|
| 400 | Missing file field in form data | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

See also: Audio to Text

Text to Speech - POST /v1/audio/speech, POST /tts

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

See also: Text to Audio

ElevenLabs TTS - POST /v1/text-to-speech/:voice-id

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

ElevenLabs Sound Generation - POST /v1/sound-generation

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

Reranking - POST /v1/rerank, POST /jina/v1/rerank

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 422 | top_n less than 1 | top_n - should be greater than or equal to 1 |
| 500 | Backend inference failure | Internal Server Error |

See also: Reranker

Anthropic Messages - POST /v1/messages

| Status | Cause | Error Type | Example Message |
|---|---|---|---|
| 400 | Missing model field | invalid_request_error | model is required |
| 400 | Model not in configuration | invalid_request_error | model configuration not found |
| 400 | Missing or invalid max_tokens | invalid_request_error | max_tokens is required and must be greater than 0 |
| 500 | Backend inference failure | api_error | model inference failed: <details> |
| 500 | Prediction failure | api_error | prediction failed: <details> |

# Missing model field
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}], "max_tokens": 100}'

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "model is required"
  }
}

Open Responses - POST /v1/responses

| Status | Cause | Error Type | Example Message |
|---|---|---|---|
| 400 | Missing model field | invalid_request | model is required |
| 400 | Model not in configuration | invalid_request | model configuration not found |
| 400 | Failed to parse input | invalid_request | failed to parse input: <details> |
| 400 | background=true without store=true | invalid_request_error | background=true requires store=true |
| 404 | Previous response not found | not_found | previous response not found: <id> |
| 500 | Backend inference failure | model_error | model inference failed: <details> |
| 500 | Prediction failure | model_error | prediction failed: <details> |
| 500 | Tool execution failure | model_error | failed to execute tools: <details> |
| 500 | MCP configuration error | server_error | failed to get MCP config: <details> |
| 500 | No MCP servers available | server_error | no working MCP servers found |

# Missing model field
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": "hello"}'

{
  "error": {
    "type": "invalid_request",
    "message": "model is required",
    "code": "",
    "param": ""
  }
}

Open Responses - GET /v1/responses/:id

| Status | Cause | Error Type | Example Message |
|---|---|---|---|
| 400 | Missing response ID | invalid_request_error | response ID is required |
| 404 | Response not found | not_found | response not found: <id> |

Open Responses Events - GET /v1/responses/:id/events

| Status | Cause | Error Type | Example Message |
|---|---|---|---|
| 400 | Missing response ID | invalid_request_error | response ID is required |
| 400 | Response was not created with stream | invalid_request_error | cannot stream a response that was not created with stream=true |
| 400 | Invalid starting_after value | invalid_request_error | starting_after must be an integer |
| 404 | Response not found | not_found | response not found: <id> |
| 500 | Failed to retrieve events | server_error | failed to get events: <details> |

Object Detection - POST /v1/detection

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

See also: Object Detection

Video Generation - POST /v1/video/generations

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

Voice Activity Detection - POST /v1/audio/vad

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |
| 500 | Backend inference failure | Internal Server Error |

Tokenize - POST /v1/tokenize

| Status | Cause | Example Message |
|---|---|---|
| 400 | Invalid or malformed request body | Bad Request |
| 400 | Model not found in configuration | Bad Request |

Models - GET /v1/models, GET /models

| Status | Cause | Example Message |
|---|---|---|
| 500 | Failed to list models | Internal Server Error |

See also: Model Gallery

Handling Errors in Client Code

Python (OpenAI SDK)

from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-key")

try:
    response = client.chat.completions.create(
        model="my-model",
        messages=[{"role": "user", "content": "hello"}],
    )
except APIError as e:
    print(f"Status: {e.status_code}, Message: {e.message}")

curl

# Check HTTP status code
response=$(curl -s -w "\n%{http_code}" http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nonexistent", "messages": [{"role": "user", "content": "hi"}]}')

http_code=$(echo "$response" | tail -n1)
body=$(echo "$response" | sed '$d')

if [ "$http_code" -ne 200 ]; then
  echo "Error $http_code: $body"
fi
These environment variables control error behavior:

| Environment Variable | Description |
|---|---|
| LOCALAI_API_KEY | Comma-separated list of valid API keys |
| LOCALAI_OPAQUE_ERRORS | Set to true to hide error details (returns empty body with status code only) |
| LOCALAI_SUBTLE_KEY_COMPARISON | Use constant-time key comparison for timing-attack resistance |

LocalAI binaries

LocalAI binaries are available for both Linux and MacOS platforms and can be executed directly from your command line. These binaries are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).

macOS Download

You can download the DMG and install the application:

Download LocalAI for macOS

Note: the DMGs are not signed by Apple, so macOS quarantines them on first launch. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked here: https://github.com/mudler/LocalAI/issues/6244

Alternatively, use the following one-liner command in your terminal to download and run LocalAI on Linux or MacOS:

curl -Lo local-ai "https://github.com/mudler/LocalAI/releases/download/v3.12.1/local-ai-$(uname -s)-$(uname -m)" && chmod +x local-ai && ./local-ai
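The one-liner expands the output of uname -s and uname -m into the release asset name. The same expansion, sketched in Python for clarity (release tag taken from the command above):

```python
def asset_url(kernel: str, machine: str, tag: str = "v3.12.1") -> str:
    """Build the release asset URL from uname -s / uname -m values."""
    return (f"https://github.com/mudler/LocalAI/releases/download/"
            f"{tag}/local-ai-{kernel}-{machine}")

print(asset_url("Linux", "x86_64"))
# https://github.com/mudler/LocalAI/releases/download/v3.12.1/local-ai-Linux-x86_64
```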

Or download the binaries directly:

| OS | Link |
|---|---|
| Linux (amd64) | Download |
| Linux (arm64) | Download |
| MacOS (arm64) | Download |

Details

Binaries have limited support compared to container images:

  • Python-based backends (e.g. diffusers or transformers) are not shipped with binaries
  • MacOS and Linux-arm64 binaries do not ship the TTS or stablediffusion-cpp backends
  • Linux binaries do not ship the stablediffusion-cpp backend

Running on Nvidia ARM64

LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, Jetson AGX Orin, and Nvidia DGX Spark. The following instructions will guide you through building and using the LocalAI container for Nvidia ARM64 devices.

Platform Compatibility

  • CUDA 12 L4T images: Compatible with Nvidia AGX Orin and similar platforms (Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier)
  • CUDA 13 L4T images: Compatible with Nvidia DGX Spark

Prerequisites

Pre-built Images

Pre-built images are available on quay.io and dockerhub:

CUDA 12 (for AGX Orin and similar platforms)

docker pull quay.io/go-skynet/local-ai:latest-nvidia-l4t-arm64
# or
docker pull localai/localai:latest-nvidia-l4t-arm64

CUDA 13 (for DGX Spark)

docker pull quay.io/go-skynet/local-ai:latest-nvidia-l4t-arm64-cuda-13
# or
docker pull localai/localai:latest-nvidia-l4t-arm64-cuda-13

Build the container

If you need to build the container yourself, use the following commands:

CUDA 12 (for AGX Orin and similar platforms)

git clone https://github.com/mudler/LocalAI

cd LocalAI

docker build --build-arg SKIP_DRIVERS=true --build-arg BUILD_TYPE=cublas --build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-jetpack:r36.4.0 --build-arg IMAGE_TYPE=core -t quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core .

CUDA 13 (for DGX Spark)

git clone https://github.com/mudler/LocalAI

cd LocalAI

docker build --build-arg SKIP_DRIVERS=false --build-arg BUILD_TYPE=cublas --build-arg CUDA_MAJOR_VERSION=13 --build-arg CUDA_MINOR_VERSION=0 --build-arg BASE_IMAGE=ubuntu:24.04 --build-arg IMAGE_TYPE=core -t quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-cuda-13-core .

Usage

Run the LocalAI container on Nvidia ARM64 devices using the following commands, where /data/models is the directory containing the models:

CUDA 12 (for AGX Orin and similar platforms)

docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:latest-nvidia-l4t-arm64

CUDA 13 (for DGX Spark)

docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:latest-nvidia-l4t-arm64-cuda-13

Note: replace /data/models with the directory containing your models, if it differs.