text-generation-inference/server/text_generation_server/models
Antti Kervinen 8863f3728c Fix CPU and memory affinity under external resource management
- Fixes CPU affinity when running inference on CPU and the CPUs are
  externally managed, for instance with taskset, numactl, cgroups, the
  Kubernetes CPU manager, or NRI resource policy plugins.

- Detect external CPU management and trust the external CPU manager
  completely: it is more likely to have the big picture of all the other
  tasks running on the system, their QoS, hardware characteristics, and
  so on. A minimal detection sketch follows the commit message below.

- In that case, do not modify even the memory affinity: the external
  manager may know better which NUMA node has the fastest memory, or
  which NUMA nodes have enough free memory for this inference.

Fixes: #3011

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2025-02-11 12:15:58 +02:00
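
Below is a minimal, hypothetical sketch of one way such external CPU
management could be detected on Linux. It is not the code in
flash_causal_lm.py; it only assumes that the process's allowed-CPU mask
is the signal: if the process is already restricted to a subset of the
online CPUs, an external manager is presumed to be in charge. The helper
names (cpus_externally_managed, apply_affinity_policy) are made up for
illustration, and the heuristic is approximate (offline or
non-contiguous CPU numbering is not handled).

import os

def cpus_externally_managed() -> bool:
    # Heuristic: if the process is already restricted to a subset of the
    # online CPUs (taskset, cgroup cpusets, Kubernetes CPU manager, ...),
    # assume an external manager set that restriction on purpose.
    allowed = os.sched_getaffinity(0)          # CPUs this process may run on
    online = set(range(os.cpu_count() or 1))   # logical CPUs 0..N-1
    return allowed != online

def apply_affinity_policy() -> None:
    if cpus_externally_managed():
        # Trust the external manager completely: leave both the CPU
        # affinity and the NUMA memory binding of this process untouched.
        return
    # ...otherwise the server's own CPU/NUMA pinning policy would go here...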
custom_modeling Use kernels from the kernel hub (#2988) 2025-02-10 19:19:25 +01:00
__init__.py Improve qwen vl impl (#2943) 2025-02-04 12:44:18 -05:00
bloom.py Refactor dead code - Removing all flash_xxx.py files. (#2166) 2024-07-05 10:29:56 +02:00
causal_lm.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
flash_causal_lm.py Fix CPU and memory affinity under external resource management 2025-02-11 12:15:58 +02:00
galactica.py feat: add ruff and resolve issue (#2262) 2024-07-26 10:29:09 -04:00
globals.py Fixing the oom maybe with 2.5.1 change. (#2958) 2025-01-28 10:30:28 +01:00
idefics_causal_lm.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
mamba.py Choosing input/total tokens automatically based on available VRAM? (#2673) 2024-10-28 04:59:49 +01:00
metadata_kernels.py feat: add payload limit (#2726) 2024-11-21 18:20:15 +00:00
mllama_causal_lm.py feat: add triton kernels to decrease latency of large batches (#2687) 2024-10-25 21:10:00 +00:00
model.py Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu (#2815) 2025-01-17 12:04:57 +01:00
pali_gemma.py feat: add ruff and resolve issue (#2262) 2024-07-26 10:29:09 -04:00
seq2seq_lm.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
transformers_flash_causal_lm.py Transformers backend TP fix (#2945) 2025-01-23 18:09:57 +01:00
types.py feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
vlm_causal_lm.py Revert "feat: improve qwen2-vl startup " (#2924) 2025-01-17 12:09:05 -05:00