Commit Graph

976 Commits

Author SHA1 Message Date
Erik Kaunismäki
07441f5a7a
legacy warning on text_generation client (#2271)
Update README.md

point to huggingface_hub inference clients instead
2024-07-22 12:00:17 +02:00
icyboy™
4e4207224e
Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-07-22 11:31:00 +02:00
OlivierDehaene
f3435bab8c
fix(server): fix deepseekv2 loading (#2266) 2024-07-21 18:48:04 +02:00
OlivierDehaene
53ec0b790b
feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout
2024-07-20 19:02:04 +02:00
Daniël de Kok
e5c1d6d611
Add FP8 release test (#2261) 2024-07-20 10:26:06 +00:00
Adrien
11123a8e99
re-push to internal registry (#2242)
* re-push to internal registry

Signed-off-by: Adrien <adrien@huggingface.co>

* fix name

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip debug

Signed-off-by: Adrien <adrien@huggingface.co>

* add debug

Signed-off-by: Adrien <adrien@huggingface.co>

* should

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* ww

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* ww

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* w

Signed-off-by: Adrien <adrien@huggingface.co>

* revert tests

Signed-off-by: Adrien <adrien@huggingface.co>

* last reverts

Signed-off-by: Adrien <adrien@huggingface.co>

* another one

Signed-off-by: Adrien <adrien@huggingface.co>

---------

Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-20 05:06:40 +00:00
Daniël de Kok
e52be9bba2
Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loads that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
  fork and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
2024-07-19 17:23:20 +02:00
drbh
68a9685f1b
fix: adjust default tool choice (#2244)
* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs
2024-07-19 11:12:02 -04:00
Erik Kaunismäki
40f5dc3ed6
add usage stats to toctree (#2260)
quick fix
2024-07-19 16:34:04 +02:00
Erik Kaunismäki
4c19593a90
usage stats and crash reports (#2220)
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
2024-07-19 16:17:56 +02:00
Daniël de Kok
3f37a66774
Hotfix: pass through model revision in VlmCausalLM (#2258) 2024-07-19 15:59:00 +02:00
Daniël de Kok
3b41e93a09
Hotfix: fix MPT after recent refactor (#2257) 2024-07-19 14:42:35 +02:00
Daniël de Kok
18db78f295
Hotfix: various GPT-based model fixes (#2256) 2024-07-19 14:42:19 +02:00
Daniël de Kok
80adb5be16
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-07-19 12:55:59 +02:00
Daniël de Kok
ba291dad9f
Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overrided by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
2024-07-19 09:37:39 +02:00
OlivierDehaene
1d1b1efa01
fix(server): fix cohere (#2249) 2024-07-18 16:00:13 +02:00
Daniël de Kok
da82c63a4f
Remove stray quantize argument in get_weights_col_packed_qkv (#2237)
Fixes #2236.
2024-07-16 09:30:57 +02:00
Daniël de Kok
2cb1842852
server quantize: expose groupsize option (#2225) 2024-07-16 08:36:05 +02:00
Daniël de Kok
06d0e880e0
Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-07-16 07:58:25 +02:00
Hugo Larcher
0ad7f6f87d
fix: Remove bitsandbytes installation when running cpu-only install (#2216)
Remove bitsandbytes installation when running cpu-only install
2024-07-15 15:34:20 +02:00
Erik Kaunismäki
457fb0a188
fix custom cache dir (#2226)
* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace
2024-07-15 15:17:13 +02:00
drbh
5a65066922
feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-07-15 09:16:15 -04:00
Daniël de Kok
dbb23fbfa8
Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken, all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
2024-07-12 12:20:12 +02:00
SeongBeomLEE
c46eaf707b
[fix] Modifying base in yarn embedding (#2212) 2024-07-12 10:04:51 +02:00
drbh
d789de329a
fix: append DONE message to chat stream (#2221)
* fix: append DONE message to chat stream

* fix: update completions endpoint
2024-07-11 10:42:58 -04:00
Daniël de Kok
cb150eb295
Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-07-11 16:03:26 +02:00
Daniël de Kok
8511669cb2
Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-07-09 20:04:03 +02:00
Nicolas Patry
4c976fb406
Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
2024-07-09 17:23:48 +02:00
vinkamath
f5ba9bfd52
Fixed README ToC (#2196)
Co-authored-by: Vinayak Kamath <Vinayak.Kamath@target.com>
2024-07-09 11:22:08 +02:00
Nicolas Patry
fe710af25f
Adding sanity check to openapi docs. 2024-07-09 11:13:48 +02:00
Guillaume LEGENDRE
5e2a305880
Fix buildx cache + change runner type (#2176)
* Update build.yaml

* Update build.yaml

* change to S3 cache

* change to CPU Runners

* remove comments
2024-07-08 18:13:32 +02:00
fxmarty
4c50b6d04b
Fix nccl regression on PyTorch 2.3 upgrade (#2099)
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
2024-07-08 17:52:10 +02:00
drbh
87ebb6477b
feat: use model name as adapter id in chat endpoints (#2128) 2024-07-08 16:06:49 +02:00
Wang, Yi
58effe78b5
update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190)
update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-08 16:03:59 +02:00
Javier Martinez
16d9e505fd
fix: python deserialization (#2178) 2024-07-08 15:59:16 +02:00
Wang, Yi
07e240ca37
add doc for intel gpus (#2181)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-08 15:57:06 +02:00
Daniël de Kok
5c7c9f1390
Falcon/DBRX: get correct number of key-value heads (#2205) 2024-07-08 13:22:38 +02:00
Daniël de Kok
153fcf7739
Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-07-08 11:19:48 +02:00
Daniël de Kok
cce475a949
hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-07-08 09:52:12 +02:00
icyboy™
521d0d990f
fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-07-08 09:01:14 +02:00
Daniël de Kok
05c094fcfa
Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-07-05 16:07:48 +02:00
Daniël de Kok
67ef0649cf
GPTQ CI improvements (#2151)
* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build
2024-07-05 14:12:16 +02:00
Daniël de Kok
b67d46336e
Fix Starcoder2 after refactor (#2189) 2024-07-05 12:22:45 +02:00
Nicolas Patry
853d4eb9cf
Hotfixing after refactor. 2024-07-05 09:25:29 +00:00
Nicolas Patry
fb2f74e2b9
Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-07-05 10:29:56 +02:00
Aaron Mihalik
c6bcadf883
Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-07-05 09:46:41 +02:00
Nicolas Patry
245d3de948
Preparing patch release. (#2186) 2024-07-04 10:55:33 +02:00
Nicolas Patry
5ad41aa2a6
Fixing missing object field for regular completions. (#2175)
* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
2024-07-03 12:56:27 +02:00
Nicolas Patry
2b3bd1e008
Fixing the dockerfile warnings. (#2173) 2024-07-03 12:48:45 +02:00
Nicolas Patry
be4a4c47f9
Revert "Fixing missing object field for regular completions."
This reverts commit 2bbb7fa4b2.
2024-07-03 10:41:39 +00:00