text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-11 02:10:16 +00:00

Author	SHA1	Message	Date
Wang, Yi	204142153f	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:39:58 +00:00
OlivierDehaene	a994f6aedd	hotfix: update nccl	2024-09-25 05:39:58 +00:00
OlivierDehaene	34c472bd64	chore: update to torch 2.4 (#2259 ) * chore: update to torch 2.4 * remove un-necessary patch * fix	2024-09-25 05:39:14 +00:00
Daniël de Kok	b1077b077c	hotfix: pin numpy (#2289 )	2024-09-25 05:38:48 +00:00
Daniël de Kok	43f49141fd	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-09-25 05:38:48 +00:00
Nicolas Patry	5390973c09	Preparing for release. (#2285 ) * Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.	2024-09-25 05:38:48 +00:00
shaltielshmid	69b67b7add	Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 05:31:31 +00:00
Daniël de Kok	26460f053d	Add support for repacking AWQ weights for GPTQ-Marlin (#2278 ) * Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.	2024-09-25 05:31:31 +00:00
OlivierDehaene	919da25c3b	fix(l4): fix fp8 logic on l4 (#2277 ) * fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89	2024-09-25 05:31:30 +00:00
Nicolas Patry	31eb03dbe2	Fixing mistral nemo. (#2276 )	2024-09-25 05:31:30 +00:00
Nicolas Patry	568cc9f3d0	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-09-25 05:31:08 +00:00
OlivierDehaene	a7515b8af1	fix(server): fix fp8 weight loading (#2268 ) * fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype	2024-09-25 05:31:08 +00:00
Erik Kaunismäki	758a8b8423	legacy warning on text_generation client (#2271 ) Update README.md point to huggingface_hub inference clients instead	2024-09-25 05:30:41 +00:00
icyboy™	a5aee82a69	Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading	2024-09-25 05:30:41 +00:00
OlivierDehaene	d13215da8f	fix(server): fix deepseekv2 loading (#2266 )	2024-09-25 05:30:41 +00:00
OlivierDehaene	85f10ec5c9	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-09-25 05:30:41 +00:00
Daniël de Kok	50149c3800	Add FP8 release test (#2261 )	2024-09-25 05:29:35 +00:00
Daniël de Kok	c1638a56f1	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that support quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192 need an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-09-25 05:27:40 +00:00
drbh	898a892082	fix: adjust default tool choice (#2244 ) * fix: adjust default tool choice * feat: improve tool choice syntax and response parsing/errors * fix: remove dev tests * feat: add ToolChoice to docs	2024-09-25 05:27:40 +00:00
Erik Kaunismäki	8afc17396d	add usage stats to toctree (#2260 ) quick fix	2024-09-25 05:27:40 +00:00
Erik Kaunismäki	66f3de583e	usage stats and crash reports (#2220 ) * draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of whole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>	2024-09-25 05:27:40 +00:00
Daniël de Kok	e658d95c23	Hotfix: pass through model revision in `VlmCausalLM` (#2258 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	990ea793c0	Hotfix: fix MPT after recent refactor (#2257 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	ba0dfb6fb1	Hotfix: various GPT-based model fixes (#2256 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	394f8c7d2b	Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	2dd680b799	Improve the handling of quantized weights (#2250 ) * Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on a conditional in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere.
 * Exclude non-MLP layers when using FP8 quantization with Llama	2024-09-25 05:27:40 +00:00
OlivierDehaene	118ee57f82	fix(server): fix cohere (#2249 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	e0710ccbeb	Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237 ) Fixes #2236.	2024-09-25 05:27:40 +00:00
Daniël de Kok	7177da0df6	`server quantize`: expose groupsize option (#2225 )	2024-09-25 05:27:40 +00:00
Daniël de Kok	e955f7b536	Add support for AWQ-quantized Idefics2 (#2233 ) Fixes #2036.	2024-09-25 05:27:40 +00:00
Hugo Larcher	8a223eb6ac	fix: Remove bitsandbytes installation when running cpu-only install (#2216 ) Remove bitsandbytes installation when running cpu-only install	2024-09-25 05:27:40 +00:00
Erik Kaunismäki	271ebb7e20	fix custom cache dir (#2226 ) * fix to not ignore HUGGINGFACE_HUB_CACHE in cache * delete printlns * delete newlines * maybe fix trailing whitespace	2024-09-25 05:27:40 +00:00
drbh	619eeded47	feat: simple mistral lora integration tests (#2180 ) * feat: simple mistral lora integration tests * fix: include args in docker launcher * fix: disable cuda graphs with lora and warn * fix: adjust docs and precommit issues * fix: re update docs	2024-09-25 05:27:40 +00:00
Daniël de Kok	ee56266044	Use symmetric quantization in the `quantize` subcommand (#2120 ) Packing of asymmetric quantization is broken, all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So instead use symmetric quantization. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.	2024-09-25 05:27:40 +00:00
SeongBeomLEE	dedeb3cfa0	Modifying base in yarn embedding (#2212 )	2024-09-25 05:27:40 +00:00
drbh	5029e7215c	fix: append DONE message to chat stream (#2221 ) * fix: append DONE message to chat stream * fix: update completions endpoint	2024-09-25 05:27:40 +00:00
Daniël de Kok	85c3c5d64f	Add support for FP8 on compute capability >=8.0, <8.9 (#2213 ) Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9. Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>	2024-09-25 05:27:40 +00:00
Daniël de Kok	2a6c3caf1d	Move quantized weight handling out of the `Weights` class (#2194 ) Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.	2024-09-25 05:27:40 +00:00
Nicolas Patry	cc4fceb21d	Updating the self check (#2209 ) * Updating the self check * Fix. * Revert the CLI . * cli. * Space. * Revert cargo update.	2024-09-25 05:27:40 +00:00
Nicolas Patry	591f9f70eb	Adding sanity check to openapi docs.	2024-09-25 05:26:10 +00:00
fxmarty	eaaea91e2b	Fix nccl regression on PyTorch 2.3 upgrade (#2099 ) * fix nccl issue * add note in dockerfile * use v2.22.3 that also fixes @samsamoa's repro * poetry actually can't handle the conflict between torch and nccl * set LD_PRELOAD	2024-09-25 05:22:56 +00:00
drbh	48f1196da8	feat: use model name as adapter id in chat endpoints (#2128 )	2024-09-25 05:21:34 +00:00
Wang, Yi	74edda9c23	update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190 ) update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:21:34 +00:00
Javier Martinez	4a54e41920	fix: python deserialization (#2178 )	2024-09-25 05:21:34 +00:00
Wang, Yi	8dd9b2b135	add doc for intel gpus (#2181 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:21:34 +00:00
Daniël de Kok	540e710c3f	Falcon/DBRX: get correct number of key-value heads (#2205 )	2024-09-25 05:21:34 +00:00
Daniël de Kok	17594916ed	Fix incorrect cache allocation with multi-query (#2203 ) We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.	2024-09-25 05:21:34 +00:00
Daniël de Kok	f11fd699b6	hotfix: Fix number of KV heads (#2202 ) Fix number of KV heads	2024-09-25 05:21:34 +00:00
icyboy™	8e3d1e6c3f	fix dbrx & opt model prefix bug (#2201 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug	2024-09-25 05:21:34 +00:00
Daniël de Kok	508e308088	Consistently take `prefix` in model constructors (#2191 ) * Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes	2024-09-25 05:21:34 +00:00
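
Commit 2a6c3caf1d above describes a `Weights` class that keeps the low-level load operations and delegates quantizer-specific processing to a `WeightsLoader`. The following is only a rough sketch of that delegation pattern, not the code from the repository: the method names (`get_tensor`, `get_weights_row`), the `DefaultWeightsLoader` and `ToyQuantWeightsLoader` classes, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Interface for quantizer-specific weight processing (method name is illustrative)."""

    @abstractmethod
    def get_weights_row(self, weights: "Weights", prefix: str) -> List[float]:
        ...


class DefaultWeightsLoader(WeightsLoader):
    """Hypothetical unquantized loader: returns the stored values unchanged."""

    def get_weights_row(self, weights: "Weights", prefix: str) -> List[float]:
        return weights.get_tensor(f"{prefix}.weight")


class ToyQuantWeightsLoader(WeightsLoader):
    """Hypothetical quantized loader: rescales integer weights by a per-layer scale."""

    def get_weights_row(self, weights: "Weights", prefix: str) -> List[float]:
        # The loader still relies on the low-level access provided by Weights.
        qweight = weights.get_tensor(f"{prefix}.qweight")
        scale = weights.get_tensor(f"{prefix}.scale")[0]
        return [q * scale for q in qweight]


class Weights:
    """Low-level storage access; quantizer-specific loads go through the loader."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader):
        self._tensors = tensors
        self.loader = loader

    def get_tensor(self, name: str) -> List[float]:
        return self._tensors[name]

    def get_weights_row(self, prefix: str) -> List[float]:
        # No per-quantizer conditional here: the loader decides how to process.
        return self.loader.get_weights_row(self, prefix)


plain = Weights({"fc.weight": [1.0, 2.0]}, DefaultWeightsLoader())
quant = Weights({"fc.qweight": [2, 4], "fc.scale": [0.5]}, ToyQuantWeightsLoader())
print(plain.get_weights_row("fc"), quant.get_weights_row("fc"))
```

In this sketch, composition is used instead of subclassing `Weights`, which matches the commit's remark that an inheritance hierarchy was tried first and found less flexible, for example when swapping in a mock weight store in tests.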

1119 Commits