Commit Graph

587 Commits

Author SHA1 Message Date
Daniël de Kok
e955f7b536 Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-09-25 05:27:40 +00:00
Hugo Larcher
8a223eb6ac fix: Remove bitsandbytes installation when running cpu-only install (#2216)
Remove bitsandbytes installation when running cpu-only install
2024-09-25 05:27:40 +00:00
drbh
619eeded47 feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-09-25 05:27:40 +00:00
Daniël de Kok
ee56266044 Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So use
symmetric quantization instead. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
2024-09-25 05:27:40 +00:00
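A minimal sketch of how a loader could honor the `gptq_sym` marker described in #2120, assuming the marker is stored as a scalar tensor in the safetensors checkpoint; the function name is illustrative, not TGI's actual API:

```python
# Sketch only; `infer_gptq_sym` is a hypothetical helper, not TGI's API.
from safetensors import safe_open


def infer_gptq_sym(checkpoint_path: str) -> bool:
    """Return True if the checkpoint was quantized symmetrically."""
    with safe_open(checkpoint_path, framework="pt") as f:
        if "gptq_sym" in f.keys():
            # Newer checkpoints carry an explicit marker tensor.
            return bool(f.get_tensor("gptq_sym").item())
    # Older checkpoints lack the marker; assume asymmetric quantization.
    return False
```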
SeongBeomLEE
dedeb3cfa0 Modifying base in yarn embedding (#2212) 2024-09-25 05:27:40 +00:00
Daniël de Kok
85c3c5d64f Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-09-25 05:27:40 +00:00
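The capability gate from #2213 amounts to a simple range check; a sketch (the function name is an assumption):

```python
# Sketch; gate the FP8 GPTQ-Marlin path on compute capability >= 8.0 and < 8.9
# (Ampere-class GPUs without native FP8 support).
import torch


def use_fp8_marlin_kernels() -> bool:
    major, minor = torch.cuda.get_device_capability()
    capability = major * 10 + minor
    return 80 <= capability < 89
```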
Daniël de Kok
2a6c3caf1d Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy: every higher-level method for loading weights
was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-09-25 05:27:40 +00:00
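A minimal sketch of the delegation pattern described in #2194; the real `WeightsLoader` interface in TGI's quantizer modules carries more context (prefixes, sharding dimensions, quantizer configs), so this only illustrates the shape of the design:

```python
# Simplified sketch of the WeightsLoader/Weights split; method names are
# reduced to a single example and do not match TGI's real signatures.
from abc import ABC, abstractmethod

import torch


class WeightsLoader(ABC):
    """Quantizer-specific weight processing (Exl2, GPTQ, Marlin, ...)."""

    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str) -> torch.Tensor:
        ...


class Weights:
    """Low-level tensor access; quantizer-specific loads are delegated."""

    def __init__(self, tensors: dict, loader: WeightsLoader):
        self._tensors = tensors
        self.loader = loader

    def get_tensor(self, name: str) -> torch.Tensor:
        return self._tensors[name]

    def get_weights_col(self, prefix: str) -> torch.Tensor:
        # Delegate instead of branching on every quantizer in this class.
        return self.loader.get_weights_col(self, prefix)


class DefaultWeightsLoader(WeightsLoader):
    """Unquantized weights: just load the tensor through the low-level API."""

    def get_weights_col(self, weights: Weights, prefix: str) -> torch.Tensor:
        return weights.get_tensor(f"{prefix}.weight")
```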
fxmarty
eaaea91e2b Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
2024-09-25 05:22:56 +00:00
Daniël de Kok
540e710c3f Falcon/DBRX: get correct number of key-value heads (#2205) 2024-09-25 05:21:34 +00:00
Daniël de Kok
17594916ed Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-09-25 05:21:34 +00:00
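An illustration of the failure mode, assuming the root cause was integer division of the KV-head count by the shard count (a plausible reconstruction, not a quote of #2203):

```python
# Plausible reconstruction of the bug, not TGI's actual code.
def kv_heads_per_shard(num_kv_heads: int, num_shards: int) -> int:
    # Buggy: with multi-query attention num_kv_heads == 1, so 1 // 2 == 0 and
    # the KV cache is allocated with zero heads, i.e. no memory at all.
    # return num_kv_heads // num_shards
    # Fixed: keep at least one (replicated) KV head per shard.
    return max(num_kv_heads // num_shards, 1)
```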
Daniël de Kok
f11fd699b6 hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-09-25 05:21:34 +00:00
icyboy™
8e3d1e6c3f fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-09-25 05:21:34 +00:00
Daniël de Kok
508e308088 Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-09-25 05:21:34 +00:00
Daniël de Kok
1e7ce69f20 Fix Starcoder2 after refactor (#2189) 2024-09-25 05:20:28 +00:00
Nicolas Patry
e481a9bb9b Hotfixing after refactor. 2024-09-25 05:20:28 +00:00
Nicolas Patry
1b434e8019 Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-09-25 05:20:28 +00:00
Aaron Mihalik
835ad0a923 Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-09-24 04:08:02 +00:00
Nicolas Patry
d580215a24 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-09-24 03:58:36 +00:00
Nicolas Patry
bc5a792dc8 Fixing rocm. (#2164) 2024-09-24 03:58:13 +00:00
drbh
e913f3ad2d fix: use the base layers weight in mistral rocm (#2155) 2024-09-24 03:58:13 +00:00
Wang, Yi
71b0189cd5 fix FlashDecoding change's regression in intel platform (#2161)
Install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:58:13 +00:00
Nicolas Patry
9b3d3a3690 Fixing graph capture for flash decoding. (#2163) 2024-09-24 03:58:13 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
Nicolas Patry
2b9339c65b Fixing baichuan override. (#2158) 2024-09-24 03:58:13 +00:00
Wang, Yi
6265956bc4 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:57:32 +00:00
icyboy™
5b977c3141 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-09-24 03:57:32 +00:00
Daniël de Kok
e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since the
beginning, and incorrectly reporting such a model as symmetric would make it
use GPTQ-Marlin (which does not support asymmetric quantization).
2024-09-24 03:57:32 +00:00
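A sketch of the gating decision described in #2111; the kernel module name and the helper itself are assumptions, not TGI's real implementation:

```python
# Sketch; "marlin_kernels" as a module name is an assumption.
import importlib.util

import torch


def prefer_gptq_marlin(sym: bool) -> bool:
    kernels_installed = importlib.util.find_spec("marlin_kernels") is not None
    if not (kernels_installed and torch.cuda.is_available()):
        return False
    major, _ = torch.cuda.get_device_capability()
    if major < 8:
        # Marlin kernels require Ampere or newer.
        return False
    # GPTQ-Marlin cannot handle asymmetric quantization, so checkpoints
    # produced with `sym=False` must keep using the regular GPTQ path.
    return sym
```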
drbh
3e02d4fdbf fix: use weights from base_layer (#2141) 2024-09-24 03:57:32 +00:00
Nicolas Patry
bc15e960ea Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-09-24 03:57:07 +00:00
Daniël de Kok
d731866245 Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-09-24 03:56:28 +00:00
Daniël de Kok
4700ea413f Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-09-24 03:55:04 +00:00
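The two conditions from #2102 as a sketch (parameter names follow the commit message; the function is illustrative):

```python
# Illustrative helper; `quantize` and `checkpoint_format` mirror the commit
# message above.
from typing import Optional


def use_marlin_24_kernel(quantize: str, checkpoint_format: Optional[str]) -> bool:
    # The 2:4 sparse kernel is only used for Marlin checkpoints exported in
    # the `marlin_24` format.
    return quantize == "marlin" and checkpoint_format == "marlin_24"
```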
Daniël de Kok
18a8364d94 Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
2024-09-24 03:55:04 +00:00
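A minimal sketch of the fix described in #2117, with the quantized matmul collapsed into a dense weight for brevity: keep the bias as a tensor-or-`None` and add it only when present, instead of forwarding a boolean flag into the linear transformation:

```python
# Sketch of the fix; the packed qweight/qzeros/scales are collapsed into a
# dense `weight` here, since the point is only the bias handling.
from typing import Optional

import torch


class QuantLinear(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, bias: Optional[torch.Tensor]):
        super().__init__()
        self.weight = weight
        # Store the actual bias tensor (or None), never a boolean flag.
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x @ self.weight.t()
        if self.bias is not None:
            # Only add the bias when it really exists; adding `True`/1.0
            # instead is exactly the bug described above.
            out = out + self.bias
        return out
```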
drbh
8a155b2d5b Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support for vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-09-24 03:55:04 +00:00
Wang, Yi
27ae4f7916 fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:52:23 +00:00
Nicolas Patry
d626685039 Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is identical
except for a very few spots.

The biggest number of spots is the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-09-24 03:52:23 +00:00
drbh
1f70bb75e3 feat: add simple tests for weights (#2092)
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
2024-09-24 03:51:26 +00:00
Wang, Yi
0d879fe66e Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-09-24 03:51:26 +00:00
Wang, Yi
e49aed4713 use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-09-24 03:51:26 +00:00
KevinDuffy94
76c6a5ca2a Add OTLP Service Name Environment Variable (#2076)
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
2024-09-24 03:51:26 +00:00
drbh
d930724e82 feat: sort cuda graphs in descending order (#2104) 2024-09-24 03:46:09 +00:00
Daniël de Kok
f0ed8d294f Fix text-generation-server quantize (#2103)
The subcommand did not work due to some broken imports.
2024-09-24 03:46:09 +00:00
Daniël de Kok
c61ef1ce85 Factor out sharding of packed tensors (#2059)
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
2024-09-24 03:46:09 +00:00
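A sketch of sharding a packed (fused) tensor such as a QKV bias, in the spirit of `Weights.get_packed_sharded` from #2059; the signature is simplified and not TGI's exact API:

```python
# Sketch; simplified signature, not TGI's exact API.
import torch


def get_packed_sharded(
    packed: torch.Tensor, block_sizes: list, rank: int, world_size: int, dim: int = 0
) -> torch.Tensor:
    """Shard each fused block (e.g. Q, K, V) for this rank, then re-pack."""
    shards = []
    start = 0
    for size in block_sizes:
        assert size % world_size == 0, "each packed block must divide evenly"
        shard_size = size // world_size
        block = packed.narrow(dim, start, size)
        shards.append(block.narrow(dim, rank * shard_size, shard_size))
        start += size
    return torch.cat(shards, dim=dim)


# Example: a packed QKV bias for 32 query heads and 8 KV heads, head_dim=128.
qkv_bias = torch.randn((32 + 8 + 8) * 128)
local_bias = get_packed_sharded(
    qkv_bias, [32 * 128, 8 * 128, 8 * 128], rank=0, world_size=4
)
```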
Daniël de Kok
38741feff0 Support exl2-quantized Qwen2 models (#2085)
Fixes #2081.
2024-09-24 03:46:09 +00:00
Daniël de Kok
6b2cbd0169 Set maximum grpc message receive size to 2GiB (#2075)
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
2024-09-24 03:44:36 +00:00
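A sketch of raising the receive limit on a Python gRPC server using the standard `grpc.max_receive_message_length` option; the service wiring is omitted and the exact value TGI uses may differ:

```python
# Sketch; service registration is omitted, only the size options are shown.
from concurrent import futures

import grpc

MAX_MESSAGE_SIZE = 2 * 1024 * 1024 * 1024 - 1  # just under 2 GiB (int32 max)

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=1),
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
    ],
)
```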
Daniël de Kok
fb939370a3 Support different image sizes in prefill in VLMs (#2065)
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated; this can fail when the images have different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-09-24 03:43:31 +00:00
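A sketch of the approach from #2065, assuming a `transformers` image processor: one processor call over the whole batch lets it resize/pad everything into a single tensor. The model id and output key are illustrative:

```python
# Sketch; the model id is only an example of a multi-image VLM processor.
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# Images of different sizes, all preprocessed in a single call so the
# processor can resize/pad them into one batched tensor.
images = [Image.open(path).convert("RGB") for path in ["a.png", "b.png"]]
batch = processor(images=images, return_tensors="pt")
pixel_values = batch["pixel_values"]
```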
Tiezhen WANG
b07a2518d9 Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-24 03:43:30 +00:00
Daniël de Kok
f1f28404e7 Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
2024-09-24 03:43:30 +00:00
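A sketch of validating a GPTQ config against the combinations listed in #2052; the constants mirror the commit message, while the function and error messages are illustrative:

```python
# Illustrative validation; constants mirror the commit message above.
GPTQ_MARLIN_BITS = (4, 8)
GPTQ_MARLIN_GROUPSIZES = (-1, 32, 64, 128)


def check_gptq_marlin_config(bits: int, groupsize: int, desc_act: bool) -> None:
    if bits not in GPTQ_MARLIN_BITS:
        raise ValueError(f"GPTQ-Marlin supports bits {GPTQ_MARLIN_BITS}, got {bits}")
    if groupsize not in GPTQ_MARLIN_GROUPSIZES:
        raise ValueError(
            f"GPTQ-Marlin supports group sizes {GPTQ_MARLIN_GROUPSIZES}, got {groupsize}"
        )
    # desc_act (act-order) is supported either way; it only affects how the
    # parameters are repacked into the Marlin format.
```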
OlivierDehaene
2fdad64ece fix(layers): fix SuRotaryEmbedding (#2060)
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
2024-09-24 03:42:29 +00:00
OlivierDehaene
e85e7ac4f9 fix(server): fix OPT implementation (#2061) 2024-09-24 03:42:29 +00:00
fxmarty
eb8b76d1d2 Update LLMM1 bound (#2050)
update commit
2024-09-24 03:42:29 +00:00