Commit Graph

996 Commits

Author SHA1 Message Date
Daniël de Kok
ba0dfb6fb1 Hotfix: various GPT-based model fixes (#2256) 2024-09-25 05:27:40 +00:00
Daniël de Kok
394f8c7d2b Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-09-25 05:27:40 +00:00
Daniël de Kok
2dd680b799 Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditionals in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers;
  `get_linear` does not need to know how to handle quantized linear
  layers.
- All quantizer weights are strongly typed; we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
2024-09-25 05:27:40 +00:00
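
For illustration, a minimal sketch of the context-manager pattern this commit describes for selectively loading particular layers with a different weight loader. The `Weights` class and `use_loader` method names here are hypothetical, not the exact TGI API:

```python
from contextlib import contextmanager


class Weights:
    """Hypothetical sketch, not the exact TGI API."""

    def __init__(self, loader):
        self.loader = loader  # active weight loader

    @contextmanager
    def use_loader(self, loader):
        # Temporarily swap the active loader, e.g. to load the
        # unquantized vision and connector models of an otherwise
        # AWQ-quantized Idefics2 checkpoint.
        previous, self.loader = self.loader, loader
        try:
            yield
        finally:
            self.loader = previous
```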
OlivierDehaene
118ee57f82 fix(server): fix cohere (#2249) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e0710ccbeb Remove stray quantize argument in get_weights_col_packed_qkv (#2237)
Fixes #2236.
2024-09-25 05:27:40 +00:00
Daniël de Kok
7177da0df6 server quantize: expose groupsize option (#2225) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e955f7b536 Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-09-25 05:27:40 +00:00
Hugo Larcher
8a223eb6ac fix: Remove bitsandbytes installation when running cpu-only install (#2216)
Remove bitsandbytes installation when running cpu-only install
2024-09-25 05:27:40 +00:00
Erik Kaunismäki
271ebb7e20 fix custom cache dir (#2226)
* fix: do not ignore HUGGINGFACE_HUB_CACHE when resolving the cache dir

* delete printlns

* delete newlines

* maybe fix trailing whitespace
2024-09-25 05:27:40 +00:00
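
A minimal sketch of the cache-directory precedence this fix restores, assuming the intended order is: an explicit value, then `HUGGINGFACE_HUB_CACHE`, then the default hub path. The helper name is illustrative:

```python
import os


def resolve_cache_dir(cli_value: str | None = None) -> str:
    # An explicit value wins; otherwise honor HUGGINGFACE_HUB_CACHE;
    # otherwise fall back to the default hub cache location.
    return cli_value or os.environ.get(
        "HUGGINGFACE_HUB_CACHE",
        os.path.expanduser("~/.cache/huggingface/hub"),
    )
```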
drbh
619eeded47 feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-09-25 05:27:40 +00:00
Daniël de Kok
ee56266044 Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So we use
symmetric quantization instead. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
2024-09-25 05:27:40 +00:00
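
A minimal sketch of the described fallback; the helper name and the plain-dict checkpoint representation are illustrative:

```python
def gptq_is_symmetric(tensors: dict) -> bool:
    # If the `gptq_sym` marker tensor is absent, the checkpoint
    # predates this change and is assumed asymmetric (`sym=False`).
    sym = tensors.get("gptq_sym")
    return False if sym is None else bool(sym)
```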
SeongBeomLEE
dedeb3cfa0 Modifying base in yarn embedding (#2212) 2024-09-25 05:27:40 +00:00
drbh
5029e7215c fix: append DONE message to chat stream (#2221)
* fix: append DONE message to chat stream

* fix: update completions endpoint
2024-09-25 05:27:40 +00:00
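
A minimal sketch of the OpenAI-style stream termination this fix adds; the handler name and chunk representation are illustrative:

```python
import json


async def chat_completion_stream(deltas):
    # `deltas` is any async iterable of JSON-serializable chat chunks.
    async for delta in deltas:
        yield f"data: {json.dumps(delta)}\n\n"
    # OpenAI-compatible clients treat this sentinel as end-of-stream.
    yield "data: [DONE]\n\n"
```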
Daniël de Kok
85c3c5d64f Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-09-25 05:27:40 +00:00
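
A minimal sketch of the capability gate this implies; the function and backend names are illustrative:

```python
import torch


def fp8_backend() -> str:
    # Compute capability >= 8.9 (Ada and newer) has native FP8 support;
    # 8.0 to 8.8 (Ampere) can still run FP8 via GPTQ-Marlin kernels.
    major, minor = torch.cuda.get_device_capability()
    cc = 10 * major + minor
    if cc >= 89:
        return "fp8-native"
    if cc >= 80:
        return "fp8-marlin"
    raise RuntimeError("FP8 needs compute capability >= 8.0")
```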
Daniël de Kok
2a6c3caf1d Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy: every higher-level method to load weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-09-25 05:27:40 +00:00
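
A minimal sketch of the delegation described above, with simplified signatures; the real interface lives in the quantizers' respective modules:

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    # Implemented by e.g. Exl2WeightsLoader, GPTQWeightsLoader,
    # and MarlinWeightsLoader.
    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str):
        """Load a column-sharded weight with quantizer-specific handling."""


class Weights:
    def __init__(self, loader: WeightsLoader):
        self.loader = loader

    def get_tensor(self, name: str):
        ...  # low-level checkpoint access lives here

    def get_weights_col(self, prefix: str):
        # Delegate quantizer-specific processing to the loader, which
        # calls back into the low-level operations above.
        return self.loader.get_weights_col(self, prefix)
```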
Nicolas Patry
cc4fceb21d Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
2024-09-25 05:27:40 +00:00
Nicolas Patry
591f9f70eb Adding sanity check to openapi docs. 2024-09-25 05:26:10 +00:00
fxmarty
eaaea91e2b Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
2024-09-25 05:22:56 +00:00
drbh
48f1196da8 feat: use model name as adapter id in chat endpoints (#2128) 2024-09-25 05:21:34 +00:00
Wang, Yi
74edda9c23 update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190)
update to metrics 0.23.0 so it can work with metrics-exporter-prometheus 0.15.1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:21:34 +00:00
Javier Martinez
4a54e41920 fix: python deserialization (#2178) 2024-09-25 05:21:34 +00:00
Wang, Yi
8dd9b2b135 add doc for intel gpus (#2181)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:21:34 +00:00
Daniël de Kok
540e710c3f Falcon/DBRX: get correct number of key-value heads (#2205) 2024-09-25 05:21:34 +00:00
Daniël de Kok
17594916ed Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-09-25 05:21:34 +00:00
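
One plausible shape of such a fix, purely illustrative (the actual patch may differ): guard the per-shard KV head count so it never rounds down to zero.

```python
def shard_kv_heads(num_kv_heads: int, world_size: int) -> int:
    # With multi-query attention there is a single KV head; a naive
    # floor division by the world size would yield 0 heads and thus
    # an empty KV-cache allocation. Replicate the head instead.
    return max(1, num_kv_heads // world_size)
```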
Daniël de Kok
f11fd699b6 hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-09-25 05:21:34 +00:00
icyboy™
8e3d1e6c3f fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-09-25 05:21:34 +00:00
Daniël de Kok
508e308088 Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-09-25 05:21:34 +00:00
Daniël de Kok
54c194dfa6 GPTQ CI improvements (#2151)
* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build
2024-09-25 05:21:03 +00:00
Daniël de Kok
1e7ce69f20 Fix Starcoder2 after refactor (#2189) 2024-09-25 05:20:28 +00:00
Nicolas Patry
e481a9bb9b Hotfixing after refactor. 2024-09-25 05:20:28 +00:00
Nicolas Patry
1b434e8019 Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-09-25 05:20:28 +00:00
Aaron Mihalik
835ad0a923 Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-09-24 04:08:02 +00:00
Nicolas Patry
2e09ebecf6 Preparing patch release. (#2186) 2024-09-24 04:08:02 +00:00
Nicolas Patry
74ddd1265a Version 2.1.1 2024-09-24 04:01:22 +00:00
Nicolas Patry
e93c830e66 Fixing missing object field for regular completions. (#2175)
* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
2024-09-24 04:00:11 +00:00
Nicolas Patry
64989f9439 Fixing the dockerfile warnings. (#2173) 2024-09-24 04:00:11 +00:00
Nicolas Patry
878491cd5b Revert "Fixing missing object field for regular completions."
This reverts commit 2bbb7fa4b2.
2024-09-24 03:59:15 +00:00
Nicolas Patry
b6c8984658 Fixing missing object field for regular completions. 2024-09-24 03:59:15 +00:00
drbh
233e46409a feat: improve update_docs for openapi schema (#2169)
* feat: add pre commit step to force schema update when router changes

* fix: prefer improved update_doc and start server and compare

* fix: adjust typo

* fix: adjust revert typo

* fix: update workflow to use update_doc md command

* feat: improve workflow to check openapi schema too

* fix: adjust timeout for CI

* fix: adjust raise condition and install server in ci

* fix: install protoc before server

* feat: improve update doc and add command to print router schema

* fix: adjust autodoc workflow

* fix: explicitly install protoc and python

* fix: allow trailing space in openapi schema diff
2024-09-24 03:59:15 +00:00
Nicolas Patry
d580215a24 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-09-24 03:58:36 +00:00
Nicolas Patry
bc5a792dc8 Fixing rocm. (#2164) 2024-09-24 03:58:13 +00:00
drbh
e913f3ad2d fix: use the base layers weight in mistral rocm (#2155) 2024-09-24 03:58:13 +00:00
Wang, Yi
71b0189cd5 fix FlashDecoding change's regression in intel platform (#2161)
install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:58:13 +00:00
Nicolas Patry
9b3d3a3690 Fixing graph capture for flash decoding. (#2163) 2024-09-24 03:58:13 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
Nicolas Patry
2b9339c65b Fixing baichuan override. (#2158) 2024-09-24 03:58:13 +00:00
drbh
381c5c02a6 fix: prefer serde structs over custom functions (#2127)
* fix: prefer enum for chat object

* fix: adjust typo

* fix: enum CompletionType not ObjectType

* fix: adjust typo

* feat: leverage serde for conditional deser

* fix: adjust HubTokenizerConfig after rebase

* fix: update create_post_processor logic for token type

* fix: adjust unwrap syntax in template

* Fixing the post processor.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-24 03:57:32 +00:00
Wang, Yi
6265956bc4 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:57:32 +00:00
icyboy™
5b977c3141 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-09-24 03:57:32 +00:00
Daniël de Kok
e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since the
beginning, and incorrectly reporting such a model as symmetric would
select GPTQ-Marlin (which does not support asymmetric quantization).
2024-09-24 03:57:32 +00:00
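
A minimal sketch of the default-selection gate described above; the kernel package name and the helper are hypothetical:

```python
import torch


def can_use_gptq_marlin(bits: int, sym: bool) -> bool:
    # Illustrative gate: kernels importable, GPU recent enough, and a
    # configuration Marlin supports (symmetric, 4- or 8-bit).
    try:
        import marlin_kernels  # hypothetical kernel package name
    except ImportError:
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8 and sym and bits in (4, 8)
```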