Commit Graph

648 Commits

Author SHA1 Message Date
Nicolas Patry
df6ea89da9 Fixing exl2 and other quanize tests again. (#2419)
* Fixing exl2 and other quanize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed latet).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds causal to attention params to check when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefil, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
8750dc878e Upgrade fbgemm (#2398)
* Upgrade fbgemm

* Fix fbgemm version
2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backens (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
4a16da5d49 Add FlashInfer support (#2354)
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be contstructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
2024-09-25 06:01:59 +00:00
drbh
853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) 2024-09-25 05:55:39 +00:00
drbh
1057f28128 Pr 2337 ci branch (#2379)
* hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* reable gemma2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix in regression in ipex flashattention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
Wang, Yi
3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
06b638f310 Pr 2374 ci branch (#2378)
* Update __init__.py

Fix issue with NoneType comparison for max_input_tokens and sliding_window

- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError.
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.

* Update __init__.py

Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness.

* fix: syntax/style tweak

---------

Co-authored-by: Praz <prazanth2006@gmail.com>
2024-09-25 05:55:39 +00:00
drbh
9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
2024-09-25 05:55:39 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
almersawi
11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2024-09-25 05:55:39 +00:00
drbh
3ccde430d9 fix: prefer original layernorm names for 180B (#2365) 2024-09-25 05:55:39 +00:00
drbh
db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) 2024-09-25 05:55:39 +00:00
drbh
83d1f23fea fix: return the out tensor rather then the functions return value (#2361) 2024-09-25 05:55:39 +00:00
drbh
688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335)
* fix: attempt forward on flash attn2 to check hardware support

* fix: warn window_size_left when using flash attn 1

* fix: prefer version check over test op and avoid window_size_left if not flash attn2

* fix: improve condtional and error message

* fix: update sliding window conditional

* fix: simplify changes and revert model changes

* fix: avoid changing conditional

* fix: typo tweak
2024-09-25 05:55:39 +00:00
Daniël de Kok
48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference between how the output is handled between
attention (output parameter) and paged attention (return value). This
also removes the assumption that the attention implementation can
write to an output tensor (in preparation of FlashInfer).
2024-09-25 05:55:39 +00:00
Wang, Yi
d70da59c25 enable HuggingFaceM4/idefics-9b in intel gpu (#2338)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
c73d1d604f Pr 2290 ci run (#2329)
* MODEL_ID propagation fix

* fix: remove global model id

---------

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
2024-09-25 05:55:39 +00:00
Daniël de Kok
468e5c6874 Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300)
The `GPTWeightLoader` was structured like this in pseudocode:

if marlin:
  Set up tensors in a way that GPTQ-Marlin expects
else:
  Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPT-Marlin implementation details should really be in the
`marlin` module. So move the former part out to a separate
`GPTQMarlinWeightsLoader`.
2024-09-25 05:55:39 +00:00
Daniël de Kok
247a29f77c server quantize: store quantizer config in standard format (#2299)
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
2024-09-25 05:50:17 +00:00
Erik Kaunismäki
b1d1d26559 patch-error-on-invalid-grammar (#2282)
* quick fix

* allow silent failure

* explicit todo that this is only short term
2024-09-25 05:50:17 +00:00
Daniël de Kok
23a3927eb6 Install Marlin from standalone package (#2320) 2024-09-25 05:50:17 +00:00
drbh
a87791d7c9 feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-09-25 05:46:24 +00:00
Daniël de Kok
fc6d80fdb8 Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) 2024-09-25 05:41:43 +00:00
Adrien
1674f441d0 Fix registry name (#2307) 2024-09-25 05:41:43 +00:00
Daniël de Kok
64ffd642fa Some small fixes for the Torch 2.4.0 update (#2304)
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
2024-09-25 05:40:25 +00:00
drbh
7ebee37641 fix: refactor adapter weight loading and mapping (#2193)
* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: impove adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo
2024-09-25 05:39:58 +00:00
Daniël de Kok
457791f511 Split up layers.marlin into several files (#2292)
The marlin.py file was getting large, split it up.
2024-09-25 05:39:58 +00:00
Wang, Yi
d93931567d fix of use of unquantized weights in cohere GQA loading, also enable … (#2291)
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
Wang, Yi
204142153f fix crash in multi-modal (#2245)
* fix crash in multi-modal

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
OlivierDehaene
a994f6aedd hotfix: update nccl 2024-09-25 05:39:58 +00:00
OlivierDehaene
34c472bd64 chore: update to torch 2.4 (#2259)
* chore: update to torch 2.4

* remove un-necessary patch

* fix
2024-09-25 05:39:14 +00:00
Daniël de Kok
b1077b077c hotfix: pin numpy (#2289) 2024-09-25 05:38:48 +00:00
Daniël de Kok
43f49141fd Add support for Llama 3 rotary embeddings (#2286)
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
2024-09-25 05:38:48 +00:00
shaltielshmid
69b67b7add Add support for Mistral-Nemo by supporting head_dim through config (#2254)
* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non standard
key in mistralConfig (as defined in transformers).

* Shorter diff.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 05:31:31 +00:00
Daniël de Kok
26460f053d Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
symmetric quantization, which GPTQ-Marlin did not suppors. GPTQ-Marlin
has recently added support AWQ repacking and AWQ asymmetric quantization
(zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
2024-09-25 05:31:31 +00:00
OlivierDehaene
919da25c3b fix(l4): fix fp8 logic on l4 (#2277)
* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89
2024-09-25 05:31:30 +00:00
Nicolas Patry
31eb03dbe2 Fixing mistral nemo. (#2276) 2024-09-25 05:31:30 +00:00
Nicolas Patry
568cc9f3d0 Softcapping for gemma2. (#2273)
* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.
2024-09-25 05:31:08 +00:00
OlivierDehaene
a7515b8af1 fix(server): fix fp8 weight loading (#2268)
* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype
2024-09-25 05:31:08 +00:00
icyboy™
a5aee82a69 Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-09-25 05:30:41 +00:00