1350 Commits

Author SHA1 Message Date
Daniël de Kok
8511669cb2
Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting unwieldy: every higher-level method for loading weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-07-09 20:04:03 +02:00
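
The delegation described in the commit above (#2194) can be sketched roughly as follows. This is a minimal illustration, not the actual text-generation-inference code: the method names (`get_tensor`, `get_weights`), the toy loaders, and the use of plain lists in place of tensors are all assumptions made for the example. Only the names `Weights` and `WeightsLoader` come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Hypothetical loader interface for quantizer-specific weight processing."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> Dict[str, object]:
        ...


class Weights:
    """Provides low-level loads and delegates quantizer-specific loads to a loader."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader):
        self._tensors = tensors  # stand-in for a real tensor store
        self._loader = loader

    def get_tensor(self, name: str) -> List[float]:
        # Low-level operation that loaders call back into.
        return self._tensors[name]

    def get_weights(self, prefix: str) -> Dict[str, object]:
        # No conditional over quantizers here: the loader decides what to load.
        return self._loader.get_weights(self, prefix)


class ToyDefaultLoader(WeightsLoader):
    """Illustrative loader for unquantized weights."""

    def get_weights(self, weights: Weights, prefix: str) -> Dict[str, object]:
        return {"kind": "unquantized", "weight": weights.get_tensor(f"{prefix}.weight")}


class ToyGPTQLoader(WeightsLoader):
    """Illustrative loader that gathers the tensors a GPTQ-style layer would need."""

    def get_weights(self, weights: Weights, prefix: str) -> Dict[str, object]:
        return {
            "kind": "gptq",
            "qweight": weights.get_tensor(f"{prefix}.qweight"),
            "scales": weights.get_tensor(f"{prefix}.scales"),
        }


store = {"proj.weight": [1.0], "proj.qweight": [2.0], "proj.scales": [0.5]}
plain = Weights(store, ToyDefaultLoader()).get_weights("proj")
quantized = Weights(store, ToyGPTQLoader()).get_weights("proj")
```

Because the loader is passed in rather than inherited from, the same `Weights` object (or a test mock of it) works with any loader, which matches the flexibility argument made in the commit message.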
Nicolas Patry
4c976fb406
Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI.

* cli.

* Space.

* Revert cargo update.
2024-07-09 17:23:48 +02:00
vinkamath
f5ba9bfd52
Fixed README ToC (#2196)
Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
2024-07-09 11:22:08 +02:00
Nicolas Patry
fe710af25f
Adding sanity check to openapi docs. 2024-07-09 11:13:48 +02:00
Guillaume LEGENDRE
5e2a305880
Fix buildx cache + change runner type (#2176)
* Update build.yaml

* Update build.yaml

* change to S3 cache

* change to CPU Runners

* remove comments
2024-07-08 18:13:32 +02:00
fxmarty
4c50b6d04b
Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
2024-07-08 17:52:10 +02:00
drbh
87ebb6477b
feat: use model name as adapter id in chat endpoints (#2128) 2024-07-08 16:06:49 +02:00
Wang, Yi
58effe78b5
update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190)
update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-08 16:03:59 +02:00
Javier Martinez
16d9e505fd
fix: python deserialization (#2178) 2024-07-08 15:59:16 +02:00
Wang, Yi
07e240ca37
add doc for intel gpus (#2181)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-08 15:57:06 +02:00
Daniël de Kok
5c7c9f1390
Falcon/DBRX: get correct number of key-value heads (#2205) 2024-07-08 13:22:38 +02:00
Daniël de Kok
153fcf7739
Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-07-08 11:19:48 +02:00
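
One plausible way a multi-query model can end up with no cache memory, as described in the commit above (#2203), is integer division of the KV head count by the number of tensor-parallel shards. The sketch below only illustrates that arithmetic. The function names and the cache-size formula are assumptions for the example, and the commit message does not state that this is the exact cause or fix.

```python
def kv_heads_per_shard_naive(num_kv_heads: int, world_size: int) -> int:
    # With a single KV head and more than one shard, this rounds down to 0.
    return num_kv_heads // world_size


def kv_heads_per_shard(num_kv_heads: int, world_size: int) -> int:
    # Assumed fix: a single KV head is replicated on every shard, not divided.
    return num_kv_heads // world_size if num_kv_heads > 1 else num_kv_heads


def kv_cache_elements(num_blocks: int, block_size: int, kv_heads: int, head_size: int) -> int:
    # Factor 2 accounts for separate key and value caches.
    return 2 * num_blocks * block_size * kv_heads * head_size


# Multi-query model (1 KV head) sharded over 2 devices.
broken = kv_cache_elements(4, 16, kv_heads_per_shard_naive(1, 2), 64)
fixed = kv_cache_elements(4, 16, kv_heads_per_shard(1, 2), 64)
```

In this sketch `broken` is 0 because the head dimension collapses to zero, while `fixed` reserves space for the one shared head on each shard.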
Daniël de Kok
cce475a949
hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-07-08 09:52:12 +02:00
icyboy™
521d0d990f
fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-07-08 09:01:14 +02:00
Daniël de Kok
05c094fcfa
Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-07-05 16:07:48 +02:00
Daniël de Kok
67ef0649cf
GPTQ CI improvements (#2151)
* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build
2024-07-05 14:12:16 +02:00
Daniël de Kok
b67d46336e
Fix Starcoder2 after refactor (#2189) 2024-07-05 12:22:45 +02:00
Nicolas Patry
853d4eb9cf
Hotfixing after refactor. 2024-07-05 09:25:29 +00:00
Nicolas Patry
fb2f74e2b9
Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-07-05 10:29:56 +02:00
Aaron Mihalik
c6bcadf883
Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-07-05 09:46:41 +02:00
Nicolas Patry
245d3de948
Preparing patch release. (#2186) 2024-07-04 10:55:33 +02:00
Nicolas Patry
5ad41aa2a6
Fixing missing object field for regular completions. (#2175)
* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
2024-07-03 12:56:27 +02:00
Nicolas Patry
2b3bd1e008
Fixing the dockerfile warnings. (#2173) 2024-07-03 12:48:45 +02:00
Nicolas Patry
be4a4c47f9
Revert "Fixing missing object field for regular completions."
This reverts commit 2bbb7fa4b2.
2024-07-03 10:41:39 +00:00
Nicolas Patry
2bbb7fa4b2
Fixing missing object field for regular completions. 2024-07-03 10:40:22 +00:00
drbh
571530dd9a
feat: improve update_docs for openapi schema (#2169)
* feat: add pre commit step to force schema update when router changes

* fix: prefer improved update_doc and start server and compare

* fix: adjust typo

* fix: adjust revert typo

* fix: update workflow to use update_doc md command

* feat: improve workflow to check openapi schema too

* fix: adjust timeout for CI

* fix: adjust raise condition and install server in ci

* fix: install protoc before server

* feat: improve update doc and add command to print router schema

* fix: adjust autodoc workflow

* fix: explicitly install protoc and python

* fix: allow trailing space in openapi schema diff
2024-07-03 09:53:35 +02:00
Nicolas Patry
0759ec495e
Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-07-02 14:26:47 +02:00
Guillaume LEGENDRE
963b6c6f0f
Ci test (#2124)
* first test with registry mirror

* change push registry

* remove comments

* Move cache to push registry

* fix registry url

* Update .github/workflows/ci_build.yaml

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-02 12:45:38 +02:00
Nicolas Patry
dea9c0dc74
Fixing rocm. (#2164) 2024-07-02 12:01:08 +02:00
drbh
b966bc0d35
fix: use the base layers weight in mistral rocm (#2155) 2024-07-02 11:56:25 +02:00
Wang, Yi
5d97e0c4a3
fix FlashDecoding change's regression in intel platform (#2161)
install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-02 11:56:07 +02:00
Nicolas Patry
022f6515a4
Fixing graph capture for flash decoding. (#2163) 2024-07-02 11:43:07 +02:00
Nicolas Patry
4327210e6b
[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase.

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

HHachweew
Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-07-01 23:28:00 +02:00
Nicolas Patry
4f55f15840
Fixing baichuan override. (#2158) 2024-07-01 23:25:54 +02:00
Nicolas Patry
d0225b1015
GH router. (#2153) 2024-07-01 15:42:26 +02:00
Nicolas Patry
17cebc4506
Fixing test. (#2152) 2024-07-01 15:24:17 +02:00
drbh
9eefb2f672
fix: prefer serde structs over custom functions (#2127)
* fix: prefer enum for chat object

* fix: adjust typo

* fix: enum CompletionType not ObjectType

* fix: adjust typo

* feat: leverage serde for conditional deser

* fix: adjust HubTokenizerConfig after rebase

* fix: update create_post_processor logic for token type

* fix: adjust unwrap syntax in template

* Fixing the post processor.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-01 15:08:05 +02:00
Wang, Yi
5da4cfab1c
refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-01 14:32:54 +02:00
icyboy™
9d0ca503a8
fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-07-01 14:17:22 +02:00
Daniël de Kok
2ce8019480
Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since the
beginning, and incorrectly reporting the model as symmetric would select
GPTQ-Marlin (which does not support asymmetric quantization).
2024-07-01 12:59:12 +02:00
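
The selection rule in the commit above (#2111) combines three conditions: the kernels are installed, the GPU supports them, and the kernels support the quantization configuration. A rough sketch of such a check is shown below. The `GPTQConfig` dataclass, the function name, and the specific supported bit widths and group sizes are illustrative assumptions. The only configuration constraint taken from the commit message is that asymmetric quantization is not supported.

```python
from dataclasses import dataclass


@dataclass
class GPTQConfig:
    """Hypothetical subset of a GPTQ quantizer configuration."""

    bits: int
    groupsize: int
    sym: bool


# Assumed values for illustration only; the real kernel limits may differ.
ASSUMED_SUPPORTED_BITS = (4, 8)
ASSUMED_SUPPORTED_GROUPSIZES = (-1, 32, 64, 128)


def can_use_gptq_marlin(config: GPTQConfig, kernels_installed: bool, gpu_supported: bool) -> bool:
    # From the commit message: asymmetric quantization is not supported.
    config_supported = (
        config.sym
        and config.bits in ASSUMED_SUPPORTED_BITS
        and config.groupsize in ASSUMED_SUPPORTED_GROUPSIZES
    )
    return kernels_installed and gpu_supported and config_supported


# A model from `text-generation-server quantize` is reported with sym=False,
# so it falls back to another GPTQ kernel in this sketch.
quantize_output = GPTQConfig(bits=4, groupsize=128, sym=False)
symmetric_model = GPTQConfig(bits=4, groupsize=128, sym=True)
```

Reporting `sym=False` for models produced by the `quantize` subcommand keeps them off the GPTQ-Marlin path, which is the behavior the commit message asks for.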
drbh
0d97a93c1e
feat: download lora adapter weights from launcher (#2140) 2024-07-01 12:58:49 +02:00
drbh
25f57e2e98
fix: use weights from base_layer (#2141) 2024-07-01 12:58:40 +02:00
Nicolas Patry
b4552f9de9
Fixing clippy. (#2149) 2024-07-01 12:02:19 +02:00
Wang, Yi
6ea570ddfe
fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148)
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-01 11:27:53 +02:00
Nicolas Patry
fb98ab273f
Fixing the CI to also run in release when it's a tag ? (#2138) 2024-06-28 09:31:09 +02:00
drbh
74b0231b19
fix: refactor post_processor logic and add test (#2137)
* fix: refactor post_processor logic and add test

* fix: remove dev comment

* fix: adjust when post_processor is overridden and improve create_post_processor
2024-06-27 23:16:19 +02:00
Nicolas Patry
3ea8259af1
Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-06-27 16:04:20 +02:00
Nicolas Patry
0e4ab6d31c
Fixing malformed rust tokenizers (#2134)
* Fixing malformed rust tokenizers

* Fix for deepseek too.
2024-06-27 16:04:03 +02:00
Daniël de Kok
dd2d91b043
Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-06-27 15:54:35 +02:00
Nicolas Patry
b53b21c63a
Bumping to 2.1 (#2131) 2024-06-27 12:34:43 +02:00