Commit Graph

560 Commits

drbh
5fc0e0c589 fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-25 06:15:35 +00:00
Nicolas Patry
c6b568b892 Fix Yi tokenization (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds?

* Fix the gh action?

* Fixing the location?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD?

* Forcing 3.11?
2024-09-25 06:15:35 +00:00
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really the flashinfer version?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Wang, Yi
938a7f3c3a hotfix: fix regression of attention api change in intel platform (#2439)
Fix a regression caused by the attention API change; ipex.varlen_attention does not currently support
paged-cache-format KV input.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:13:36 +00:00
drbh
34a6399a50 feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-25 06:13:11 +00:00
Nicolas Patry
a313355d2b Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.
2024-09-25 06:13:11 +00:00
Nicolas Patry
4e1ca8d7bd Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and use FD everywhere.

* Update cargo lock?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade the resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input (since it's super important with the prefixing now).

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to the head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-09-25 06:13:11 +00:00
drbh
7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-09-25 06:10:59 +00:00
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non-release (so CI tests them; this needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds causal to attention params to check when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
in this case, the router switches to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen previously. If a new prefill is a prefix of a previously-seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
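
A minimal Python sketch of the trie bookkeeping described above (illustrative only; the real `RadixAllocator` lives in the Rust router and compresses edges into token-id slices):

    # Illustrative radix-trie prefix matching; not the actual RadixAllocator.
    class Node:
        def __init__(self):
            self.children = {}  # next token id -> child node

    class RadixTrie:
        def __init__(self):
            self.root = Node()

        def insert(self, tokens):
            """Record a prefill so later requests can reuse its prefix."""
            node = self.root
            for t in tokens:
                node = node.children.setdefault(t, Node())

        def prefix_len(self, tokens):
            """Length of the longest previously-seen prefix of `tokens`."""
            node, n = self.root, 0
            for t in tokens:
                if t not in node.children:
                    break
                node, n = node.children[t], n + 1
            return n

    trie = RadixTrie()
    trie.insert([1, 2, 3, 4])
    assert trie.prefix_len([1, 2, 3, 9]) == 3  # request is sent with prefix_len=3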
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
4a16da5d49 Add FlashInfer support (#2354)
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
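
A minimal sketch of that wrapping, assuming a state object exposing the `begin_forward`/`end_forward` methods named above (all other names are illustrative):

    from contextlib import contextmanager
    from contextvars import ContextVar

    _state: ContextVar = ContextVar("flashinfer_state")

    @contextmanager
    def forward_context(state):
        token = _state.set(state)  # make the state visible to attention calls
        state.begin_forward()      # set up reusable per-forward data structures
        try:
            yield
        finally:
            state.end_forward()
            _state.reset(token)

    def attention(q, k, v):
        # No `state` argument threaded through the models: read the context var.
        state = _state.get()
        return state.run(q, k, v)  # `run` is illustrative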
2024-09-25 06:01:59 +00:00
drbh
853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) 2024-09-25 05:55:39 +00:00
drbh
1057f28128 Pr 2337 ci branch (#2379)
* hotfix: fix xpu crash brought by code refactoring; torch.xpu relies on importing ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* re-enable gemma2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix regression in ipex flash attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
Wang, Yi
3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
06b638f310 Pr 2374 ci branch (#2378)
* Update __init__.py

Fix issue with NoneType comparison for max_input_tokens and sliding_window

- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError (a minimal guard is sketched below).
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.

* Update __init__.py

Handle NoneType in the sliding_window comparison to fix the TypeError in __init__.py: the comparison logic now accounts for None values instead of raising.

* fix: syntax/style tweak
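
A minimal sketch of the guard mentioned above (names taken from the commit message):

    def check_window(max_input_tokens, sliding_window):
        # Guarding None avoids:
        # TypeError: '<=' not supported between instances of 'int' and 'NoneType'
        if max_input_tokens is None or sliding_window is None:
            return  # nothing to compare
        if sliding_window <= max_input_tokens:
            raise ValueError("sliding window is smaller than max_input_tokens")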

---------

Co-authored-by: Praz <prazanth2006@gmail.com>
2024-09-25 05:55:39 +00:00
drbh
9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
2024-09-25 05:55:39 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
almersawi
11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2024-09-25 05:55:39 +00:00
drbh
3ccde430d9 fix: prefer original layernorm names for 180B (#2365) 2024-09-25 05:55:39 +00:00
drbh
db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) 2024-09-25 05:55:39 +00:00
drbh
83d1f23fea fix: return the out tensor rather than the function's return value (#2361) 2024-09-25 05:55:39 +00:00
drbh
688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335)
* fix: attempt forward on flash attn2 to check hardware support

* fix: warn window_size_left when using flash attn 1

* fix: prefer version check over test op and avoid window_size_left if not flash attn2

* fix: improve conditional and error message

* fix: update sliding window conditional

* fix: simplify changes and revert model changes

* fix: avoid changing conditional

* fix: typo tweak
2024-09-25 05:55:39 +00:00
Daniël de Kok
48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference between how the output is handled between
attention (output parameter) and paged attention (return value). This
also removes the assumption that the attention implementation can
write to an output tensor (in preparation of FlashInfer).
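
A sketch of the unified convention (illustrative; the real functions also take KV-cache and metadata arguments):

    import torch

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(q)  # output tensor is created inside the function
        # ... run the attention kernel, writing into `out` ...
        return out                 # both attention variants now return the result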
2024-09-25 05:55:39 +00:00
Wang, Yi
d70da59c25 enable HuggingFaceM4/idefics-9b in intel gpu (#2338)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
c73d1d604f Pr 2290 ci run (#2329)
* MODEL_ID propagation fix

* fix: remove global model id

---------

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
2024-09-25 05:55:39 +00:00
Daniël de Kok
468e5c6874 Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300)
The `GPTQWeightsLoader` was structured like this in pseudocode:

if marlin:
  Set up tensors in a way that GPTQ-Marlin expects
else:
  Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really be in the
`marlin` module. So move the former part out to a separate
`GPTQMarlinWeightsLoader`.
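
After the split, the structure is roughly the following (a sketch with illustrative method names, not the actual TGI classes):

    class GPTQWeightsLoader:
        def get_weights(self, weights, prefix):
            # Set up tensors the way ExLlama/GPTQ/AWQ expect.
            ...

    class GPTQMarlinWeightsLoader:
        def get_weights(self, weights, prefix):
            # Marlin-specific layout logic now lives with the marlin code.
            ...

    def loader_for(config):
        # Select the loader once, instead of branching inside every load.
        return GPTQMarlinWeightsLoader() if config.use_marlin else GPTQWeightsLoader()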
2024-09-25 05:55:39 +00:00
Daniël de Kok
247a29f77c server quantize: store quantizer config in standard format (#2299)
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
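
For illustration, a standard-format entry could look like this (field names are assumptions based on common GPTQ checkpoints, not taken from the commit):

    quantization_config = {
        "quant_method": "gptq",
        "bits": 4,
        "group_size": 128,
        "sym": True,
    }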
2024-09-25 05:50:17 +00:00
Erik Kaunismäki
b1d1d26559 patch-error-on-invalid-grammar (#2282)
* quick fix

* allow silent failure

* explicit todo that this is only short term
2024-09-25 05:50:17 +00:00
Daniël de Kok
23a3927eb6 Install Marlin from standalone package (#2320) 2024-09-25 05:50:17 +00:00
drbh
a87791d7c9 feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-09-25 05:46:24 +00:00
Daniël de Kok
fc6d80fdb8 Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) 2024-09-25 05:41:43 +00:00
Daniël de Kok
64ffd642fa Some small fixes for the Torch 2.4.0 update (#2304)
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
2024-09-25 05:40:25 +00:00
drbh
7ebee37641 fix: refactor adapter weight loading and mapping (#2193)
* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: improve adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo
2024-09-25 05:39:58 +00:00
Daniël de Kok
457791f511 Split up layers.marlin into several files (#2292)
The marlin.py file was getting large, so split it up.
2024-09-25 05:39:58 +00:00
Wang, Yi
d93931567d fix of use of unquantized weights in cohere GQA loading, also enable … (#2291)
Fix the use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
Wang, Yi
204142153f fix crash in multi-modal (#2245)
* fix crash in multi-modal

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
Daniël de Kok
43f49141fd Add support for Llama 3 rotary embeddings (#2286)
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
2024-09-25 05:38:48 +00:00
shaltielshmid
69b67b7add Add support for Mistral-Nemo by supporting head_dim through config (#2254)
* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non-standard
key in `MistralConfig` (as defined in transformers).

* Shorter diff.
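
A sketch of the fallback (assumed shape, not the exact TGI code):

    # Prefer an explicit head_dim (e.g. Mistral-Nemo), else derive it.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        head_dim = config.hidden_size // config.num_attention_heads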

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 05:31:31 +00:00
Daniël de Kok
26460f053d Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric quantization
(zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
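
Per the first bullet, the initial opt-in looked like this on the launcher (model id is a placeholder):

    text-generation-launcher --model-id <awq-model-id> --quantize gptq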
2024-09-25 05:31:31 +00:00
OlivierDehaene
919da25c3b fix(l4): fix fp8 logic on l4 (#2277)
* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89
2024-09-25 05:31:30 +00:00
Nicolas Patry
31eb03dbe2 Fixing mistral nemo. (#2276) 2024-09-25 05:31:30 +00:00
Nicolas Patry
568cc9f3d0 Softcapping for gemma2. (#2273)
* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.
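
For context, Gemma 2 softcapping squashes scores smoothly into a bounded range; a minimal sketch, assuming the usual tanh formulation and treating `0.0` as "disabled" per the note above:

    import torch

    def softcap(scores: torch.Tensor, cap: float) -> torch.Tensor:
        if cap == 0.0:  # 0.0 is the null value: no capping
            return scores
        return cap * torch.tanh(scores / cap)  # bounded in (-cap, cap)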
2024-09-25 05:31:08 +00:00
OlivierDehaene
a7515b8af1 fix(server): fix fp8 weight loading (#2268)
* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype
2024-09-25 05:31:08 +00:00
icyboy™
a5aee82a69 Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-09-25 05:30:41 +00:00
OlivierDehaene
d13215da8f fix(server): fix deepseekv2 loading (#2266) 2024-09-25 05:30:41 +00:00
OlivierDehaene
85f10ec5c9 feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout
2024-09-25 05:30:41 +00:00
Daniël de Kok
c1638a56f1 Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection (sketched after this list).
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
  so we need whole-weight loads that support quantized weights. To this
  end, `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- A head size of 192 needs an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
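
A hedged sketch of the grouped top-K selection from the first bullet (shapes and names are illustrative):

    import torch

    def grouped_topk(scores, n_groups, topk_groups, topk):
        n_tokens, n_experts = scores.shape
        # Score each group by its best expert; keep the top `topk_groups` groups.
        group_scores = scores.view(n_tokens, n_groups, -1).max(dim=-1).values
        group_idx = group_scores.topk(topk_groups, dim=-1).indices
        # Mask out experts outside the selected groups, then take the top-k.
        mask = torch.zeros(n_tokens, n_groups, dtype=torch.bool)
        mask.scatter_(1, group_idx, True)
        mask = mask.repeat_interleave(n_experts // n_groups, dim=1)
        return scores.masked_fill(~mask, float("-inf")).topk(topk, dim=-1)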
2024-09-25 05:27:40 +00:00
Daniël de Kok
e658d95c23 Hotfix: pass through model revision in VlmCausalLM (#2258) 2024-09-25 05:27:40 +00:00
Daniël de Kok
990ea793c0 Hotfix: fix MPT after recent refactor (#2257) 2024-09-25 05:27:40 +00:00
Daniël de Kok
ba0dfb6fb1 Hotfix: various GPT-based model fixes (#2256) 2024-09-25 05:27:40 +00:00
Daniël de Kok
394f8c7d2b Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-09-25 05:27:40 +00:00
Daniël de Kok
2dd680b799 Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders (see the sketch
below), which is useful for models like Idefics2 AWQ, which uses a
quantized text model but unquantized vision and connector models.
However, the context manager would be overridden by `get_linear`,
which string-checks `quantizer`. Also, the context manager would
not work with EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
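
A sketch of the loader context manager mentioned above (illustrative names):

    from contextlib import contextmanager

    class Weights:
        def __init__(self, loader):
            self.loader = loader  # the quantizer-specific weights loader

        @contextmanager
        def use_loader(self, loader):
            # Temporarily swap the active loader, e.g. to load the vision
            # tower of an AWQ Idefics2 unquantized.
            previous, self.loader = self.loader, loader
            try:
                yield
            finally:
                self.loader = previous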
2024-09-25 05:27:40 +00:00
OlivierDehaene
118ee57f82 fix(server): fix cohere (#2249) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e0710ccbeb Remove stray quantize argument in get_weights_col_packed_qkv (#2237)
Fixes #2236.
2024-09-25 05:27:40 +00:00
Daniël de Kok
7177da0df6 server quantize: expose groupsize option (#2225) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e955f7b536 Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-09-25 05:27:40 +00:00
drbh
619eeded47 feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-09-25 05:27:40 +00:00
Daniël de Kok
ee56266044 Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead,
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
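
A hedged sketch of how a loader might read the marker (the `gptq_sym` name is from this commit; the rest is illustrative):

    def checkpoint_is_symmetric(tensors) -> bool:
        # `tensors` is a dict-like view of the checkpoint. Models quantized
        # before this change lack the marker, so assume sym=False.
        if "gptq_sym" not in tensors:
            return False
        return bool(tensors["gptq_sym"].item())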
2024-09-25 05:27:40 +00:00
SeongBeomLEE
dedeb3cfa0 Modifying base in yarn embedding (#2212) 2024-09-25 05:27:40 +00:00
Daniël de Kok
85c3c5d64f Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.
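
A hedged sketch of the device gate:

    import torch

    def fp8_backend() -> str:
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return "native-fp8"  # e.g. Ada has hardware FP8 support
        if (major, minor) >= (8, 0):
            return "fp8-marlin"  # FP8 GPTQ-Marlin kernels on older Ampere
        raise ValueError("FP8 needs compute capability >= 8.0")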

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-09-25 05:27:40 +00:00
Daniël de Kok
2a6c3caf1d Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
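
The resulting shape might be pictured like this (a sketch; method names are assumptions, not the exact TGI signatures):

    from abc import ABC, abstractmethod

    class WeightsLoader(ABC):
        # Quantizer-specific weight processing lives behind this interface.
        @abstractmethod
        def get_weights(self, weights, prefix): ...

    class MarlinWeightsLoader(WeightsLoader):
        def get_weights(self, weights, prefix):
            # Delegate raw tensor loads to the low-level `Weights` primitives,
            # then apply Marlin-specific processing here.
            ...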
2024-09-25 05:27:40 +00:00
Daniël de Kok
540e710c3f Falcon/DBRX: get correct number of key-value heads (#2205) 2024-09-25 05:21:34 +00:00
Daniël de Kok
17594916ed Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-09-25 05:21:34 +00:00
Daniël de Kok
f11fd699b6 hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-09-25 05:21:34 +00:00
icyboy™
8e3d1e6c3f fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-09-25 05:21:34 +00:00
Daniël de Kok
508e308088 Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-09-25 05:21:34 +00:00
Daniël de Kok
1e7ce69f20 Fix Starcoder2 after refactor (#2189) 2024-09-25 05:20:28 +00:00
Nicolas Patry
e481a9bb9b Hotfixing after refactor. 2024-09-25 05:20:28 +00:00
Nicolas Patry
1b434e8019 Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-09-25 05:20:28 +00:00
Aaron Mihalik
835ad0a923 Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-09-24 04:08:02 +00:00
Nicolas Patry
d580215a24 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-09-24 03:58:36 +00:00
Nicolas Patry
bc5a792dc8 Fixing rocm. (#2164) 2024-09-24 03:58:13 +00:00
drbh
e913f3ad2d fix: use the base layers weight in mistral rocm (#2155) 2024-09-24 03:58:13 +00:00
Wang, Yi
71b0189cd5 fix FlashDecoding change's regression in intel platform (#2161)
install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:58:13 +00:00
Nicolas Patry
9b3d3a3690 Fixing graph capture for flash decoding. (#2163) 2024-09-24 03:58:13 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

REvert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
Nicolas Patry
2b9339c65b Fixing baichuan override. (#2158) 2024-09-24 03:58:13 +00:00
Wang, Yi
6265956bc4 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:57:32 +00:00
icyboy™
5b977c3141 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-09-24 03:57:32 +00:00
Daniël de Kok
e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since
the beginning, and incorrectly reporting the model as symmetric would
make it use GPTQ-Marlin (which does not support asymmetric quantization).
2024-09-24 03:57:32 +00:00
drbh
3e02d4fdbf fix: use weights from base_layer (#2141) 2024-09-24 03:57:32 +00:00
Nicolas Patry
bc15e960ea Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-09-24 03:57:07 +00:00
Daniël de Kok
d731866245 Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-09-24 03:56:28 +00:00
Daniël de Kok
4700ea413f Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-09-24 03:55:04 +00:00
Daniël de Kok
18a8364d94 Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
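
A hedged sketch of the corrected loading path (tensor names follow common AWQ checkpoints; everything else is illustrative):

    def load_awq_linear(weights, prefix: str, has_bias: bool):
        qweight = weights.get_tensor(f"{prefix}.qweight")
        qzeros = weights.get_tensor(f"{prefix}.qzeros")
        scales = weights.get_tensor(f"{prefix}.scales")
        # Pass the real bias tensor, or None -- never a boolean stand-in.
        bias = weights.get_tensor(f"{prefix}.bias") if has_bias else None
        return qweight, qzeros, scales, bias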
2024-09-24 03:55:04 +00:00
drbh
8a155b2d5b Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-09-24 03:55:04 +00:00
Wang, Yi
27ae4f7916 fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:52:23 +00:00
Nicolas Patry
d626685039 Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most code is exactly the same
except for a very few spots.

Most of those spots are in the kv-cache layout and the flash_xxx.py
files. Since those files should be removed soon and factored away, we
should not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-09-24 03:52:23 +00:00
Wang, Yi
0d879fe66e Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-09-24 03:51:26 +00:00
Wang, Yi
e49aed4713 use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-09-24 03:51:26 +00:00