text-generation-inference/server
Antti Kervinen 8863f3728c Fix CPU and memory affinity under external resource management
- Fixes CPU affinity when running inference on CPU and when CPUs are
  externally managed, for instance with taskset, numactl, cgroups, the
  Kubernetes CPU manager, or NRI resource policy plugins.

- Detect external CPU management and trust the external CPU manager
  completely. The external manager is more likely to have the big picture
  of all the other tasks running on the system, their QoS, hardware
  characteristics, and so on (see the detection sketch below).

- For instance, do not even modify memory affinity: the external manager
  may know better which NUMA node has the fastest memory, or which NUMA
  nodes have enough free memory for this inference.

Fixes: #3011

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2025-02-11 12:15:58 +02:00
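
The commit does not spell out how external management is detected; below is a
minimal sketch of one way to do it on Linux. It assumes only the standard
library (os.sched_getaffinity) plus /proc/self/status and
/sys/devices/system/node/online; the helper names and the heuristic are
illustrative, not the actual code in text_generation_server.

```python
import os


def cpus_externally_managed() -> bool:
    # If the set of CPUs this process may run on is smaller than the set of
    # CPUs installed, something external (taskset, numactl, cgroups, the
    # Kubernetes CPU manager, an NRI plugin, ...) has already restricted us.
    allowed = os.sched_getaffinity(0)
    present = os.cpu_count() or len(allowed)
    return len(allowed) < present


def mems_externally_managed() -> bool:
    # Same idea for NUMA memory: if the nodes we may allocate from
    # (Mems_allowed_list in /proc/self/status) differ from the nodes that
    # are online, an external manager has restricted memory placement.
    try:
        with open("/sys/devices/system/node/online") as f:
            online = f.read().strip()
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Mems_allowed_list:"):
                    allowed = line.split(":", 1)[1].strip()
                    return allowed != online
    except OSError:
        pass
    return False  # Could not tell; assume nothing is managing memory.


if cpus_externally_managed() or mems_externally_managed():
    # Trust the external manager: leave CPU and NUMA affinity untouched.
    pass
else:
    # No external manager detected: the server may pin threads itself,
    # e.g. with os.sched_setaffinity(0, chosen_cpus).
    pass
```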
custom_kernels All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
exllama_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
exllamav2_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
tests feat: improve star coder to support multi lora layers (#2883) 2025-01-16 16:23:55 -05:00
text_generation_server Fix CPU and memory affinity under external resource management 2025-02-11 12:15:58 +02:00
.gitignore Impl simple mamba model (#1480) 2024-02-08 10:19:45 +01:00
bounds-from-nix.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
hf-kernels.lock Use kernels from the kernel hub (#2988) 2025-02-10 19:19:25 +01:00
Makefile Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
Makefile-awq chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-eetq Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
Makefile-exllamav2 Upgrading exl2. (#2415) 2024-08-14 11:58:08 +02:00
Makefile-flash-att Hotfixing make install. (#2008) 2024-06-04 23:34:03 +02:00
Makefile-flash-att-v2 Add Flash decoding kernel ROCm (#2855) 2025-01-13 11:12:35 +01:00
Makefile-flashinfer Trying to put back the archlist (to fix the oom). (#2947) 2025-01-24 09:32:17 +01:00
Makefile-lorax-punica Enable multiple LoRa adapters (#2010) 2024-06-25 14:46:27 -04:00
Makefile-selective-scan chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-vllm Update vllm kernels for ROCM (#2826) 2024-12-18 12:44:42 +01:00
pyproject.toml Use kernels from the kernel hub (#2988) 2025-02-10 19:19:25 +01:00
README.md chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
req.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
requirements_cuda.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
requirements_gen.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
requirements_intel.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
requirements_rocm.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
uv.lock Use kernels from the kernel hub (#2988) 2025-02-10 19:19:25 +01:00

Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference

Install

make install

Run

make run-dev