Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
commit 3f7369d1c1 (parent 8a79cfd077)
@@ -2,6 +2,8 @@ FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 AS deps
 ARG llamacpp_version=b4827
 ARG llamacpp_cuda=OFF
+ARG llamacpp_native=ON
+ARG llamacpp_cpu_arm_arch=native
 ARG cuda_arch=75-real;80-real;86-real;89-real;90-real
 
 WORKDIR /opt/src
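Note: each `ARG` above can be overridden at build time with `--build-arg`. A minimal sketch (the tag and arch values are illustrative; the semicolon-separated `cuda_arch` list must be quoted so the shell does not split it):

```bash
# Illustrative override of the defaults declared above; the image tag
# "tgi-llamacpp" matches the docs below, the arch list is an example.
docker build \
    -t tgi-llamacpp \
    --build-arg llamacpp_cuda=ON \
    --build-arg cuda_arch="80-real;86-real" \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```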
@@ -28,6 +30,8 @@ RUN mkdir -p llama.cpp \
     -DCMAKE_CXX_COMPILER=clang++ \
     -DCMAKE_CUDA_ARCHITECTURES=${cuda_arch} \
     -DGGML_CUDA=${llamacpp_cuda} \
+    -DGGML_NATIVE=${llamacpp_native} \
+    -DGGML_CPU_ARM_ARCH=${llamacpp_cpu_arm_arch} \
     -DLLAMA_BUILD_COMMON=OFF \
     -DLLAMA_BUILD_TESTS=OFF \
     -DLLAMA_BUILD_EXAMPLES=OFF \
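For context, these build arguments map one-to-one onto llama.cpp's CMake options (`llamacpp_native` → `GGML_NATIVE`, `llamacpp_cpu_arm_arch` → `GGML_CPU_ARM_ARCH`). A hedged sketch of the equivalent configure step outside Docker, assuming llama.cpp sources checked out in `./llama.cpp`:

```bash
# Sketch only: mirrors the Dockerfile's cmake flags for a local build.
# The ARM arch value is an example, not a value taken from this commit.
cmake -S llama.cpp -B llama.cpp/build \
    -DGGML_NATIVE=OFF \
    -DGGML_CPU_ARM_ARCH=armv9-a+i8mm \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=OFF
cmake --build llama.cpp/build --parallel
```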
@@ -25,9 +25,12 @@ You will find the best models on [Hugging Face][GGUF].
 
 ## Build Docker image
 
 For optimal performance, the Docker image is compiled with native CPU
-instructions, thus it's highly recommended to execute the container on
-the host used during the build process. Efforts are ongoing to enhance
-portability while maintaining high computational efficiency.
+instructions by default. As a result, it is strongly recommended to run
+the container on the same host architecture used during the build
+process. Efforts are ongoing to improve portability across different
+systems while preserving high computational efficiency.
 
 To build the Docker image, use the following command:
 
 ```bash
 docker build \
@@ -38,11 +41,25 @@ docker build \
 
 ### Build parameters
 
-| Parameter                            | Description                      |
-| ------------------------------------ | -------------------------------- |
-| `--build-arg llamacpp_version=bXXXX` | Specific version of llama.cpp    |
-| `--build-arg llamacpp_cuda=ON`       | Enables CUDA acceleration        |
-| `--build-arg cuda_arch=ARCH`         | Defines target CUDA architecture |
+| Parameter (with --build-arg)              | Description                      |
+| ----------------------------------------- | -------------------------------- |
+| `llamacpp_version=bXXXX`                  | Specific version of llama.cpp    |
+| `llamacpp_cuda=ON`                        | Enables CUDA acceleration        |
+| `llamacpp_native=OFF`                     | Disables automatic CPU detection |
+| `llamacpp_cpu_arm_arch=ARCH[+FEATURE]...` | Specific ARM CPU and features    |
+| `cuda_arch=ARCH`                          | Defines target CUDA architecture |
+
+For example, to target Graviton4 when building on another ARM
+architecture:
+
+```bash
+docker build \
+    -t tgi-llamacpp \
+    --build-arg llamacpp_native=OFF \
+    --build-arg llamacpp_cpu_arm_arch=armv9-a+i8mm \
+    https://github.com/huggingface/text-generation-inference.git \
+    -f Dockerfile_llamacpp
+```
 
 ## Run Docker image
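When choosing an `ARCH[+FEATURE]` value for a cross-build like the one above, it helps to confirm which features the target CPU actually reports. A small sketch for an ARM64 Linux target, assuming the kernel lists supported extensions such as `i8mm` in its `Features` line:

```bash
# Run on the *target* host: print the CPU feature flags and check for
# the i8mm extension used in the Graviton4 example above.
grep -m1 '^Features' /proc/cpuinfo
if grep -m1 '^Features' /proc/cpuinfo | grep -qw i8mm; then
    echo "i8mm supported"
else
    echo "i8mm not supported"
fi
```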