Commit Graph

996 Commits

Author SHA1 Message Date
Daniël de Kok
ba0dfb6fb1 Hotfix: various GPT-based model fixes (#2256) 2024-09-25 05:27:40 +00:00
Daniël de Kok
394f8c7d2b Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-09-25 05:27:40 +00:00
Daniël de Kok
2dd680b799 Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditionals in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers;
  `get_linear` does not need to know how to handle quantized linear
  layers.
- All quantizer weights are strongly typed; we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
2024-09-25 05:27:40 +00:00
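
For illustration, a minimal sketch of the context-manager pattern this commit describes for selectively loading particular layers with a different weight loader. The `Weights` class and `use_loader` method names here are hypothetical, not the exact TGI API:

```python
from contextlib import contextmanager


class Weights:
    """Hypothetical sketch, not the exact TGI API."""

    def __init__(self, loader):
        self.loader = loader  # active weight loader

    @contextmanager
    def use_loader(self, loader):
        # Temporarily swap the active loader, e.g. to load the
        # unquantized vision and connector models of an otherwise
        # AWQ-quantized Idefics2 checkpoint.
        previous, self.loader = self.loader, loader
        try:
            yield
        finally:
            self.loader = previous
```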
OlivierDehaene
118ee57f82 fix(server): fix cohere (#2249) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e0710ccbeb Remove stray quantize argument in get_weights_col_packed_qkv (#2237)
Fixes #2236.
2024-09-25 05:27:40 +00:00
Daniël de Kok
7177da0df6 server quantize: expose groupsize option (#2225) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e955f7b536 Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-09-25 05:27:40 +00:00
Hugo Larcher
8a223eb6ac fix: Remove bitsandbytes installation when running cpu-only install (#2216)
Remove bitsandbytes installation when running cpu-only install
2024-09-25 05:27:40 +00:00
Erik Kaunismäki
271ebb7e20 fix custom cache dir (#2226)
* fix: do not ignore HUGGINGFACE_HUB_CACHE when resolving the cache dir

* delete printlns

* delete newlines

* maybe fix trailing whitespace
2024-09-25 05:27:40 +00:00
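
A minimal sketch of the cache-directory precedence this fix restores, assuming the intended order is: an explicit value, then `HUGGINGFACE_HUB_CACHE`, then the default hub path. The helper name is illustrative:

```python
import os


def resolve_cache_dir(cli_value: str | None = None) -> str:
    # An explicit value wins; otherwise honor HUGGINGFACE_HUB_CACHE;
    # otherwise fall back to the default hub cache location.
    return cli_value or os.environ.get(
        "HUGGINGFACE_HUB_CACHE",
        os.path.expanduser("~/.cache/huggingface/hub"),
    )
```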
drbh
619eeded47 feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-09-25 05:27:40 +00:00
Daniël de Kok
ee56266044 Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So we use
symmetric quantization instead. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
2024-09-25 05:27:40 +00:00
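
A minimal sketch of the described fallback; the helper name and the plain-dict checkpoint representation are illustrative:

```python
def gptq_is_symmetric(tensors: dict) -> bool:
    # If the `gptq_sym` marker tensor is absent, the checkpoint
    # predates this change and is assumed asymmetric (`sym=False`).
    sym = tensors.get("gptq_sym")
    return False if sym is None else bool(sym)
```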
SeongBeomLEE
dedeb3cfa0 Modifying base in yarn embedding (#2212) 2024-09-25 05:27:40 +00:00
drbh
5029e7215c fix: append DONE message to chat stream (#2221)
* fix: append DONE message to chat stream

* fix: update completions endpoint
2024-09-25 05:27:40 +00:00
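
A minimal sketch of the OpenAI-style stream termination this fix adds; the handler name and chunk representation are illustrative:

```python
import json


async def chat_completion_stream(deltas):
    # `deltas` is any async iterable of JSON-serializable chat chunks.
    async for delta in deltas:
        yield f"data: {json.dumps(delta)}\n\n"
    # OpenAI-compatible clients treat this sentinel as end-of-stream.
    yield "data: [DONE]\n\n"
```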
Daniël de Kok
85c3c5d64f Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-09-25 05:27:40 +00:00
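
A minimal sketch of the capability gate this implies; the function and backend names are illustrative:

```python
import torch


def fp8_backend() -> str:
    # Compute capability >= 8.9 (Ada and newer) has native FP8 support;
    # 8.0 to 8.8 (Ampere) can still run FP8 via GPTQ-Marlin kernels.
    major, minor = torch.cuda.get_device_capability()
    cc = 10 * major + minor
    if cc >= 89:
        return "fp8-native"
    if cc >= 80:
        return "fp8-marlin"
    raise RuntimeError("FP8 needs compute capability >= 8.0")
```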
Daniël de Kok
2a6c3caf1d Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy: every higher-level method to load weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-09-25 05:27:40 +00:00
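
A minimal sketch of the delegation described above, with simplified signatures; the real interface lives in the quantizers' respective modules:

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    # Implemented by e.g. Exl2WeightsLoader, GPTQWeightsLoader,
    # and MarlinWeightsLoader.
    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str):
        """Load a column-sharded weight with quantizer-specific handling."""


class Weights:
    def __init__(self, loader: WeightsLoader):
        self.loader = loader

    def get_tensor(self, name: str):
        ...  # low-level checkpoint access lives here

    def get_weights_col(self, prefix: str):
        # Delegate quantizer-specific processing to the loader, which
        # calls back into the low-level operations above.
        return self.loader.get_weights_col(self, prefix)
```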
Nicolas Patry
cc4fceb21d Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
2024-09-25 05:27:40 +00:00
Nicolas Patry
591f9f70eb Adding sanity check to openapi docs. 2024-09-25 05:26:10 +00:00
fxmarty
eaaea91e2b Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
2024-09-25 05:22:56 +00:00
drbh
48f1196da8 feat: use model name as adapter id in chat endpoints (#2128) 2024-09-25 05:21:34 +00:00
Wang, Yi
74edda9c23 update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190)
update to metrics 0.23.0 so it can work with metrics-exporter-prometheus 0.15.1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:21:34 +00:00
Javier Martinez
4a54e41920 fix: python deserialization (#2178) 2024-09-25 05:21:34 +00:00
Wang, Yi
8dd9b2b135 add doc for intel gpus (#2181)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:21:34 +00:00
Daniël de Kok
540e710c3f Falcon/DBRX: get correct number of key-value heads (#2205) 2024-09-25 05:21:34 +00:00
Daniël de Kok
17594916ed Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-09-25 05:21:34 +00:00
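
One plausible shape of such a fix, purely illustrative (the actual patch may differ): guard the per-shard KV head count so it never rounds down to zero.

```python
def shard_kv_heads(num_kv_heads: int, world_size: int) -> int:
    # With multi-query attention there is a single KV head; a naive
    # floor division by the world size would yield 0 heads and thus
    # an empty KV-cache allocation. Replicate the head instead.
    return max(1, num_kv_heads // world_size)
```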
Daniël de Kok
f11fd699b6 hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-09-25 05:21:34 +00:00
icyboy™
8e3d1e6c3f fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-09-25 05:21:34 +00:00
Daniël de Kok
508e308088 Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-09-25 05:21:34 +00:00
Daniël de Kok
54c194dfa6 GPTQ CI improvements (#2151)
* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build
2024-09-25 05:21:03 +00:00
Daniël de Kok
1e7ce69f20 Fix Starcoder2 after refactor (#2189) 2024-09-25 05:20:28 +00:00
Nicolas Patry
e481a9bb9b Hotfixing after refactor. 2024-09-25 05:20:28 +00:00
Nicolas Patry
1b434e8019 Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-09-25 05:20:28 +00:00
Aaron Mihalik
835ad0a923 Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-09-24 04:08:02 +00:00
Nicolas Patry
2e09ebecf6 Preparing patch release. (#2186) 2024-09-24 04:08:02 +00:00
Nicolas Patry
74ddd1265a Version 2.1.1 2024-09-24 04:01:22 +00:00
Nicolas Patry
e93c830e66 Fixing missing object field for regular completions. (#2175)
* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
2024-09-24 04:00:11 +00:00
Nicolas Patry
64989f9439 Fixing the dockerfile warnings. (#2173) 2024-09-24 04:00:11 +00:00
Nicolas Patry
878491cd5b Revert "Fixing missing object field for regular completions."
This reverts commit 2bbb7fa4b2.
2024-09-24 03:59:15 +00:00
Nicolas Patry
b6c8984658 Fixing missing object field for regular completions. 2024-09-24 03:59:15 +00:00
drbh
233e46409a feat: improve update_docs for openapi schema (#2169)
* feat: add pre commit step to force schema update when router changes

* fix: prefer improved update_doc and start server and compare

* fix: adjust typo

* fix: adjust revert typo

* fix: update workflow to use update_doc md command

* feat: improve workflow to check openapi schema too

* fix: adjust timeout for CI

* fix: adjust raise condition and install server in ci

* fix: install protoc before server

* feat: improve update doc and add command to print router schema

* fix: adjust autodoc workflow

* fix: explicitly install protoc and python

* fix: allow trailing space in openapi schema diff
2024-09-24 03:59:15 +00:00
Nicolas Patry
d580215a24 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-09-24 03:58:36 +00:00
Nicolas Patry
bc5a792dc8 Fixing rocm. (#2164) 2024-09-24 03:58:13 +00:00
drbh
e913f3ad2d fix: use the base layers weight in mistral rocm (#2155) 2024-09-24 03:58:13 +00:00
Wang, Yi
71b0189cd5 fix FlashDecoding change's regression in intel platform (#2161)
install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:58:13 +00:00
Nicolas Patry
9b3d3a3690 Fixing graph capture for flash decoding. (#2163) 2024-09-24 03:58:13 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
Nicolas Patry
2b9339c65b Fixing baichuan override. (#2158) 2024-09-24 03:58:13 +00:00
drbh
381c5c02a6 fix: prefer serde structs over custom functions (#2127)
* fix: prefer enum for chat object

* fix: adjust typo

* fix: enum CompletionType not ObjectType

* fix: adjust typo

* feat: leverage serde for conditional deser

* fix: adjust HubTokenizerConfig after rebase

* fix: update create_post_processor logic for token type

* fix: adjust unwrap syntax in template

* Fixing the post processor.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-24 03:57:32 +00:00
Wang, Yi
6265956bc4 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:57:32 +00:00
icyboy™
5b977c3141 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-09-24 03:57:32 +00:00
Daniël de Kok
e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since the
beginning, and incorrectly reporting such a model as symmetric would
select GPTQ-Marlin (which does not support asymmetric quantization).
2024-09-24 03:57:32 +00:00
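
A minimal sketch of the default-selection gate described above; the kernel package name and the helper are hypothetical:

```python
import torch


def can_use_gptq_marlin(bits: int, sym: bool) -> bool:
    # Illustrative gate: kernels importable, GPU recent enough, and a
    # configuration Marlin supports (symmetric, 4- or 8-bit).
    try:
        import marlin_kernels  # hypothetical kernel package name
    except ImportError:
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8 and sym and bits in (4, 8)
```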