chore(ai): run llms using containers

Michele Cereda
2026-02-10 20:44:36 +01:00
parent 6659b44667
commit 96e32cbdba
4 changed files with 197 additions and 1 deletion


@@ -16,6 +16,11 @@ The easiest way to get up and running with large language models.
```sh
brew install --cask 'ollama-app' # or just brew install 'ollama'
docker pull 'ollama/ollama'
# Run in containers.
docker run -d -v 'ollama:/root/.ollama' -p '11434:11434' --name 'ollama' 'ollama/ollama'
docker run -d --gpus 'all' 'ollama/ollama'
```
</details>
@@ -34,9 +39,15 @@ ollama pull 'glm-4.7:cloud'
# List pulled models.
ollama list
ollama ls
# Start Ollama.
ollama serve
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
# Run models.
ollama run 'gemma3'
docker exec -it 'ollama' ollama run 'llama3.2'
# Quickly set up a coding tool with Ollama models.
ollama launch
@@ -44,11 +55,38 @@ ollama launch
# Launch models.
ollama launch 'claude' --model 'glm-4.7-flash'
# Only configure models.
# Do *not* launch them.
ollama launch 'claude' --config
# List running models and their resource usage.
ollama ps
# Stop running models.
ollama stop 'gemma3'
# Delete models.
ollama rm 'gemma3'
# Create custom models.
# Requires a Modelfile.
ollama create 'mymodel' -f 'Modelfile'
# Quantize models.
# Requires a Modelfile.
ollama create --quantize 'q4_K_M' 'llama3.2'
# Push models to Ollama.
ollama push 'myuser/mymodel'
# Clone models.
ollama cp 'mymodel' 'myuser/mymodel'
# Sign in to Ollama cloud, or create a new account.
ollama signin
# Sign out from Ollama cloud.
ollama signout
```
</details>
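Once the server is running, Ollama also answers on an HTTP API (port 11434 by default). A minimal sketch, assuming
the `llama3.2` model was pulled earlier:

```sh
# One-shot generation through Ollama's HTTP API.
# 'stream: false' returns a single JSON response instead of a token stream.
curl 'http://localhost:11434/api/generate' -d '{
  "model": "llama3.2",
  "prompt": "Give me a fact about whales",
  "stream": false
}'
```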


@@ -2,6 +2,11 @@
TODO
```sh
# Allow containers to use devices on systems with SELinux.
sudo setsebool container_use_devices=1
```
1. [Further readings](#further-readings)
## Further readings


@@ -14,6 +14,7 @@
1. [Create builders](#create-builders)
1. [Build for specific platforms](#build-for-specific-platforms)
1. [Compose](#compose)
1. [Running LLMs locally](#running-llms-locally)
1. [Best practices](#best-practices)
1. [Troubleshooting](#troubleshooting)
1. [Use environment variables in the ENTRYPOINT](#use-environment-variables-in-the-entrypoint)
@@ -42,6 +43,9 @@ sudo zypper install 'docker'
vim '/etc/docker/daemon.json'
jq -i '."log-level"="info"' '/etc/docker/daemon.json'
jq -i '.dns=["8.8.8.8", "1.1.1.1"]' "${HOME}/.docker/daemon.json"
# Allow containers to use devices on systems with SELinux.
sudo setsebool container_use_devices=1
```
</details>
@@ -499,6 +503,144 @@ mkdir -p '/usr/local/lib/docker/cli-plugins' \
</details>
## Running LLMs locally
Refer to [Run LLMs Locally with Docker: A Quickstart Guide to Model Runner] and [Docker Model Runner].
Docker introduced Model Runner in Docker Desktop 4.40.<br/>
It makes it easy to pull, run, and experiment with LLMs on local machines.
```sh
# Enable in Docker Desktop.
docker desktop enable model-runner
docker desktop enable model-runner --tcp='12434' # enable TCP interaction from host processes
# Install the plugin.
apt install 'docker-model-plugin'
dnf install 'docker-model-plugin'
pacman -S 'docker-model-plugin'
# Verify the installation.
docker model --help
# Stop the current runner.
docker model stop-runner
# Reinstall runners with CUDA GPU support.
docker model reinstall-runner --gpu 'cuda'
# Check that the Model Runner container can access the GPU.
docker exec 'docker-model-runner' nvidia-smi
```
Models are available in Docker Hub under the [ai/](https://hub.docker.com/u/ai) prefix.<br/>
Tags for models distributed by Docker follow the `{model}:{parameters}-{quantization}` scheme.<br/>
Alternatively, they can be downloaded from Hugging Face.
```sh
# Search for model variants.
docker search 'ai/llama2'
# Pull models.
docker model pull 'ai/qwen2.5'
docker model pull 'ai/qwen3-coder:30B'
docker model pull 'ai/smollm2:360M-Q4_K_M' 'ai/llama2:7b-q4'
docker model pull 'some.registry.com/models/mistral:latest'
# Run models.
docker model run 'ai/smollm2:360M-Q4_K_M' 'Give me a fact about whales'
docker model run -d 'ai/qwen3-coder:30B'
docker model run -e 'MODEL_API_KEY=my-secret-key' --gpus 'all' …
docker model run --gpus '0' --gpu-memory '8g' -e 'MODEL_GPU_LAYERS=40' …
docker model run --gpus '0,1,2' --memory '16g' --memory-swap '16g' …
docker model run --no-gpu --cpus '4' …
docker model run -p '3000:8080' …
docker model run -p '127.0.0.1:8080:8080' …
docker model run -p '8080:8080' -p '9090:9090' …
# Distribute models across GPUs.
docker model run --gpus 'all' --tensor-parallel '2' 'ai/llama2-70b'
# Show resource usage.
docker model stats
docker model stats 'llm'
docker model stats --format 'json'
# View models' logs.
docker model logs
docker model logs 'llm' | grep -i 'gpu'
docker model logs -f 'llm'
docker model logs --tail '100' -t 'llm'
```
Model Runner exposes an OpenAI-compatible endpoint at <http://model-runner.docker.internal/engines/v1> for
containers and, if TCP host access was enabled during initialization (here on port 12434), at
<http://localhost:12434/engines/v1> for host processes.<br/>
Use these endpoints to hook up OpenAI-compatible clients or frameworks.
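A minimal sketch from the host, assuming TCP access was enabled on port 12434 and the `ai/smollm2:360M-Q4_K_M` model
was pulled earlier:

```sh
# Send a chat completion request to Model Runner's OpenAI-compatible endpoint.
curl 'http://localhost:12434/engines/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ai/smollm2:360M-Q4_K_M",
    "messages": [{ "role": "user", "content": "Give me a fact about whales" }]
  }'
```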
Executing `docker model run` does **not** spin up containers.<br/>
Instead, it calls an Inference Server API endpoint hosted by Model Runner through Docker Desktop.
The Inference Server runs an inference engine as a native host process, and provides interaction through an
OpenAI-/Ollama-compatible API.<br/>
When requests come in, Model Runner loads the requested model on demand, then runs inference on those requests.
The active model stays in memory until another model is requested, or until a predefined inactivity timeout
(usually 5 minutes) is reached.
Model Runner transparently loads the requested model on demand, provided it has been pulled beforehand and is
locally available. There is no need to execute `docker model run` before interacting with a specific model from host
processes or from within containers.
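E.g., the chat completion sketch above is enough to make a pulled model available; it then appears as running:

```sh
# The model loaded by the first API request is listed here until the
# inactivity timeout evicts it from memory.
docker model ps
```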
Docker Model Runner supports the [llama.cpp], [vLLM], and [Diffusers] inference engines.<br/>
[llama.cpp] is the default one.
```sh
# List downloaded models.
docker model ls
docker model ls --json
docker model ls --openai
docker model ls -q
# List running models.
docker model ps
# Show models' configuration.
docker model inspect 'ai/qwen2.5-coder'
# View models' layers.
docker model history 'ai/llama2'
# Configure models.
docker model configure --context-size '8192' 'ai/qwen2.5-coder'
# Reset model configuration.
docker model configure --context-size '-1' 'ai/qwen2.5-coder'
# Remove models.
docker model rm 'ai/llama2'
docker model rm -f 'ai/llama2'
docker model rm $(docker model ls -q)
# Only remove unused models.
docker model prune
# Print system information.
docker model system info
# Print disk usage.
docker model system df
# Clean up unused resources.
docker model system prune
# Full cleanup (removes all models).
docker model system prune -a
```
Model Runner collects user data.<br/>
Data collection is controlled by the *Send usage statistics* setting in Docker Desktop.
## Best practices
- Use multi-stage `Dockerfile`s when possible to reduce the final image's size.
@@ -568,6 +710,8 @@ Alternatively, keep the exec form but force invoking a shell in it:
- [Unable to reach services behind VPN from docker container]
- [Improve docker volume performance on MacOS with a RAM disk]
- [How to Connect to Localhost Within a Docker Container]
- [Run LLMs Locally with Docker: A Quickstart Guide to Model Runner]
- [Docker Model Runner Cheatsheet: Commands & Examples]
<!--
Reference
@@ -580,15 +724,18 @@ Alternatively, keep the exec form but force invoking a shell in it:
[kaniko]: kaniko.md
[podman]: podman.md
[testcontainers]: testcontainers.md
[vLLM]: ai/vllm.md
<!-- Upstream -->
[building multi-arch images for arm and x86 with docker desktop]: https://www.docker.com/blog/multi-arch-images/
[docker compose]: https://github.com/docker/compose
[docker docs I cannot ping my containers]: https://docs.docker.com/desktop/features/networking/#i-cannot-ping-my-containers
[Docker Model Runner]: https://docs.docker.com/ai/model-runner/
[dockerfile reference]: https://docs.docker.com/reference/dockerfile/
[Exec form ENTRYPOINT example]: https://docs.docker.com/reference/dockerfile/#exec-form-entrypoint-example
[github]: https://github.com/docker
[Multi-stage builds]: https://docs.docker.com/build/building/multi-stage/
[Run LLMs Locally with Docker: A Quickstart Guide to Model Runner]: https://www.docker.com/blog/run-llms-locally/
<!-- Others -->
[amazon-ecr-credential-helper]: https://github.com/awslabs/amazon-ecr-credential-helper
@@ -599,12 +746,15 @@ Alternatively, keep the exec form but force invoking a shell in it:
[configuring dns]: https://dockerlabs.collabnix.com/intermediate/networking/Configuring_DNS.html
[configuring healthcheck in docker-compose]: https://medium.com/@saklani1408/configuring-healthcheck-in-docker-compose-3fa6439ee280
[difference between expose and ports in docker compose]: https://www.baeldung.com/ops/docker-compose-expose-vs-ports
[Diffusers]: https://github.com/huggingface/diffusers
[docker arg, env and .env - a complete guide]: https://vsupalov.com/docker-arg-env-variable-guide/
[docker buildx bake + gitlab ci matrix]: https://teymorian.medium.com/docker-buildx-bake-gitlab-ci-matrix-77edb6b9863f
[Docker Model Runner Cheatsheet: Commands & Examples]: https://www.glukhov.org/post/2025/10/docker-model-runner-cheatsheet/
[getting around docker's host network limitation on mac]: https://medium.com/@lailadahi/getting-around-dockers-host-network-limitation-on-mac-9e4e6bfee44b
[How to Connect to Localhost Within a Docker Container]: https://www.howtogeek.com/devops/how-to-connect-to-localhost-within-a-docker-container/
[how to list the content of a named volume in docker 1.9+?]: https://stackoverflow.com/questions/34803466/how-to-list-the-content-of-a-named-volume-in-docker-1-9
[How to Use a .dockerignore File: A Comprehensive Guide with Examples]: https://hn.mrugesh.dev/how-to-use-a-dockerignore-file-a-comprehensive-guide-with-examples
[improve docker volume performance on macos with a ram disk]: https://thoughts.theden.sh/posts/docker-ramdisk-macos-benchmark/
[llama.cpp]: https://github.com/ggml-org/llama.cpp
[opencontainers image spec]: https://specs.opencontainers.org/image-spec/
[unable to reach services behind vpn from docker container]: https://github.com/docker/for-mac/issues/5322


@@ -34,6 +34,9 @@ cat <<EOF | tee '/etc/containers/registries.conf.d/shortnames.conf'
[aliases]
"orclinx" = "container-registry.oracle.com/os/oraclelinux"
EOF
# Allow containers to use devices on systems with SELinux.
sudo setsebool container_use_devices=1
```
</details>