Commit Graph

1089 Commits

Author SHA1 Message Date
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non-release (so CI tests them; needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes, plus upgrade info in the warning.
2024-09-25 06:08:38 +00:00
Daniël de Kok
e5c39a5545 nix: build router incrementally (#2422) 2024-09-25 06:08:00 +00:00
Funtowicz Morgan
c3401e0b99 More fixes trtllm (#2342)
* (backend) use parking_lot crate for RwLock fairness

* (docker) let's put rust in the TRTLLM folder when building

* (docker) build ompi with SLURM support

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?
2024-09-25 06:08:00 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
Daniël de Kok
bae161ab84 nix: partial incremental build of the router (#2416)
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-09-25 06:06:17 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds `causal` to attention params so it can be checked when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
c5e4c1877b Adding more kernels to flake. (#2411) 2024-09-25 06:06:17 +00:00
Daniël de Kok
eb561bb715 nix: incremental build of the launcher (#2410) 2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
1f8c0f83e3 Pr 2395 ci run (#2406)
* fix(router): Fix appending to message content

* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
18d6be6af4 Updating the flake. (#2404) 2024-09-25 06:06:17 +00:00
drbh
96e8fa37b0 fix: improve completions to send a final chunk with usage details (#2336)
* fix: improve completions to send a final chunk with usage details

* fix: include finish reason string

* fix: remove dev debug trait and unneeded mut

* fix: update openapi schema
2024-09-25 06:06:17 +00:00
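The final usage chunk described in #2336 above can be illustrated with a hypothetical, OpenAI-style payload. The field names follow the OpenAI completions schema; the exact chunk TGI emits is defined by its OpenAPI schema and may differ.

```python
# Hypothetical final streaming chunk for a completion: it carries the
# finish reason string and the token usage details. Illustrative only;
# the exact TGI payload may differ.
final_chunk = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "", "finish_reason": "stop"},
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 34,
        "total_tokens": 46,
    },
}
```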
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Nicolas Patry
6393cdee63 Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen before. If a new prefill is a prefix of a previously seen
prefill, the router will send a request with `prefix_len>0`, which
the backend can use to decide to reuse KV blocks from the
cache rather than recomputing them (see the sketch after this entry).

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
2024-09-25 06:05:08 +00:00
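A minimal sketch of the prefix-matching idea from #2392 above. This is not the router's `RadixAllocator` (which is Rust, block-based, and handles eviction); it is a simplified, hypothetical token-level trie that only shows how a new prefill can be matched against previously seen ones to obtain a `prefix_len`.

```python
# Minimal sketch of prefix matching over token IDs. Hypothetical
# illustration only; the real RadixAllocator tracks KV blocks and
# eviction, which is omitted here.
from typing import Dict, List


class PrefixTrie:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest previously seen prefix."""
        node, length = self, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            length += 1
        return length


trie = PrefixTrie()
trie.insert([1, 2, 3, 4, 5])                    # a prefill seen earlier
prefix_len = trie.longest_prefix([1, 2, 3, 9])  # -> 3
# A prefix_len > 0 lets the backend reuse cached KV blocks for the first
# prefix_len tokens instead of recomputing them.
```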
Wang, Yi
b8efd6d00c CPU docker image (#2367)
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
fbe59c6267 Adding launcher to build. (#2397) 2024-09-25 06:04:51 +00:00
Nicolas Patry
8750dc878e Upgrade fbgemm (#2398)
* Upgrade fbgemm

* Fix fbgemm version
2024-09-25 06:04:51 +00:00
Daniël de Kok
197dd3af12 nix: add router to the devshell (#2396) 2024-09-25 06:04:51 +00:00
Daniël de Kok
bb833389e0 Update flake for 9.0a capability in Torch (#2394) 2024-09-25 06:04:51 +00:00
drbh
959add5e9b feat: add guideline to chat request and template (#2391)
* feat: add guideline to chat request and template

* fix: add template test and update docs
2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
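A hedged sketch of the enum approach in #2385 above: replacing ad-hoc string or boolean flags with a single enum and failing early on unknown values. The names below (`AttentionBackend`, `select_backend`, `ATTENTION_BACKEND`) are placeholders, not TGI's actual identifiers.

```python
# Hypothetical sketch of selecting the attention backend via an enum;
# names and the environment variable are illustrative, not TGI's API.
import os
from enum import Enum


class AttentionBackend(Enum):
    PAGED = "paged"
    FLASHDECODING = "flashdecoding"
    FLASHINFER = "flashinfer"


def select_backend() -> AttentionBackend:
    # Early exit with a clear error instead of silently falling back.
    raw = os.environ.get("ATTENTION_BACKEND", "paged")
    try:
        return AttentionBackend(raw)
    except ValueError as err:
        raise ValueError(
            f"Unknown attention backend {raw!r}; "
            f"expected one of {[b.value for b in AttentionBackend]}"
        ) from err


BACKEND = select_backend()
```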
Daniël de Kok
df719fd527 flake: use rust-overlay (#2390) 2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
e9ba044250 flake: add fmt and clippy (#2389) 2024-09-25 06:03:56 +00:00
Nicolas Patry
afa14b7595 Using HF_HOME instead of CACHE to get token read in addition to models. (#2288) 2024-09-25 06:03:56 +00:00
Daniël de Kok
dc0fa60f55 Add experimental flake (#2384)
Add flake.nix
2024-09-25 06:01:59 +00:00
Daniël de Kok
4a16da5d49 Add FlashInfer support (#2354)
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
2024-09-25 06:01:59 +00:00
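The bookkeeping pattern described in #2354 above (a per-forward `begin_forward`/`end_forward` pair around a state object exposed through a context variable) can be sketched roughly as follows. The class and function names are placeholders, not the actual TGI or FlashInfer API.

```python
# Rough sketch: wrap a model forward call with begin/end bookkeeping and
# expose the state through a context variable so attention code can reach
# it without threading an argument through every model. Names are placeholders.
from contextlib import contextmanager
from contextvars import ContextVar

_forward_state: ContextVar = ContextVar("forward_state", default=None)


class ForwardState:
    """Stand-in for a FlashInfer wrapper; expensive to build, so reused."""

    def begin_forward(self, batch_metadata) -> None:
        # Set up per-forward data structures reused by all attention calls.
        self.metadata = batch_metadata

    def end_forward(self) -> None:
        self.metadata = None


@contextmanager
def use_forward_state(state: ForwardState, batch_metadata):
    token = _forward_state.set(state)    # 1. set the context variable
    state.begin_forward(batch_metadata)  # 2. begin_forward on the state
    try:
        yield                            # 3. run the model forward call
    finally:
        state.end_forward()              # 4. end_forward on the state
        _forward_state.reset(token)      # 5. reset the context variable


def attention(*args, **kwargs):
    state = _forward_state.get()  # attention code reads the current state
    ...
```

Because each CUDA graph size installs its own state before running, a single shared global would not work; the context manager makes the per-call state explicit.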
drbh
6f2a468a64 Pr 2352 ci branch (#2382)
* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow, becoming a large positive number.

In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that function is
```
if Some(batch_requests.len()) == max_size {
    break;
}
```
and it is only evaluated after `batch_requests.len()` has
already become 1, it never prevents more than zero
requests from being batched.

Now we have a cached batch in the server that is larger than
max_batch_size, and `max_size - batch_size as usize`
underflows (see the clamping sketch after this entry).

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0

---------

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-25 06:01:59 +00:00
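A tiny sketch of the clamping idea behind the fix above. The real scheduler is Rust and uses unsigned arithmetic; this hypothetical helper only shows why the remaining budget must be clamped at zero rather than allowed to wrap around.

```python
# Hypothetical illustration: compute how many more requests may join a batch.
# In Rust, `max_batch_size - cached_batch_size` on unsigned integers wraps
# around to a huge value; clamping (saturating subtraction) avoids that.
from typing import Optional


def remaining_batch_budget(
    max_batch_size: Optional[int], cached_batch_size: int
) -> Optional[int]:
    if max_batch_size is None:
        return None  # no limit configured
    return max(max_batch_size - cached_batch_size, 0)


assert remaining_batch_budget(4, 3) == 1
assert remaining_batch_budget(4, 5) == 0   # full: admit nothing, don't wrap
assert remaining_batch_budget(None, 7) is None
```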
Vaibhav Srivastav
b1bc0ecb7f Update Quantization docs and minor doc fix. (#2368)
* Update Quantization docs and minor doc fix.

* update readme with latest quants info

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* up

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-09-25 06:01:59 +00:00
drbh
853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) 2024-09-25 05:55:39 +00:00
drbh
1057f28128 Pr 2337 ci branch (#2379)
* hotfix: fix xpu crash brought in by the code refactor; torch.xpu relies on importing ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* re-enable gemma2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix a regression in ipex flash attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
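The "torch.xpu relies on importing ipex" note above refers to Intel Extension for PyTorch registering the XPU backend when it is imported. A hedged sketch of a guarded import (exact package and version behavior may differ from this setup):

```python
# Hedged sketch: make sure intel_extension_for_pytorch is imported before
# touching torch.xpu, since (in this setup) IPEX registers the XPU backend.
import torch

try:
    import intel_extension_for_pytorch as ipex  # noqa: F401
except ImportError:
    ipex = None

if ipex is not None and hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")
```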
Wang, Yi
3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
06b638f310 Pr 2374 ci branch (#2378)
* Update __init__.py

Fix issue with NoneType comparison for max_input_tokens and sliding_window

- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError.
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.

* Update __init__.py

Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness.

* fix: syntax/style tweak

---------

Co-authored-by: Praz <prazanth2006@gmail.com>
2024-09-25 05:55:39 +00:00
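A minimal sketch of the None-safe comparison described in #2378 above. The parameter names mirror the commit description, but the actual check in `__init__.py` may be structured differently.

```python
# Hypothetical sketch: avoid `TypeError: '<=' not supported between
# instances of 'int' and 'NoneType'` by treating None as "no sliding window"
# (or "no input-token limit") instead of comparing it directly.
from typing import Optional


def sliding_window_limits_input(
    max_input_tokens: Optional[int], sliding_window: Optional[int]
) -> bool:
    if sliding_window is None or max_input_tokens is None:
        return False  # nothing to compare; no restriction applies
    return sliding_window <= max_input_tokens
```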
drbh
9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
2024-09-25 05:55:39 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
almersawi
11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2024-09-25 05:55:39 +00:00
drbh
3ccde430d9 fix: prefer original layernorm names for 180B (#2365) 2024-09-25 05:55:39 +00:00
drbh
db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) 2024-09-25 05:55:39 +00:00
drbh
5400c7155d feat: return the generated text when parsing fails (#2353) 2024-09-25 05:55:39 +00:00
drbh
b4562e1369 feat: prefer stop over eos_token to align with openai finish_reason (#2344) 2024-09-25 05:55:39 +00:00
drbh
88e07f12cc feat: implement a templated endpoint for visibility into chat requests (#2333)
* feat: implement a templated endpoint for visibility into chat requests

* feat: improve to tokenize too

* fix: adjust return type

* feat: simplify prepare_chat_input logic and adjust start stop chars
2024-09-25 05:55:39 +00:00
drbh
83d1f23fea fix: return the out tensor rather than the function's return value (#2361) 2024-09-25 05:55:39 +00:00
drbh
8b0f5feb02 feat: include local lora adapter loading docs (#2359) 2024-09-25 05:55:39 +00:00
drbh
688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335)
* fix: attempt forward on flash attn2 to check hardware support

* fix: warn window_size_left when using flash attn 1

* fix: prefer version check over test op and avoid window_size_left if not flash attn2

* fix: improve conditional and error message

* fix: update sliding window conditional

* fix: simplify changes and revert model changes

* fix: avoid changing conditional

* fix: typo tweak
2024-09-25 05:55:39 +00:00
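A rough sketch of the probe idea from #2335 above: attempt a tiny attention call and fall back to another attention path if the kernel rejects the hardware (the commit later prefers a version check over the test op). The `flash_attn_func` import is the public flash-attn v2 entry point; the shapes and fallback logic here are illustrative.

```python
# Rough sketch (not TGI's exact check): probe whether the installed
# flash-attn v2 kernel supports the current GPU by attempting a tiny
# forward pass, and fall back if it fails.
import torch


def flash_attn_v2_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    try:
        from flash_attn import flash_attn_func  # flash-attn v2 API
        # (batch, seqlen, nheads, headdim) in fp16 on the GPU.
        q = torch.zeros(1, 1, 1, 64, dtype=torch.float16, device="cuda")
        flash_attn_func(q, q, q, causal=True)
        return True
    except Exception:
        # ImportError, unsupported compute capability, etc.
        return False

# Sliding windows (window_size_left) are only honored by flash-attn v2, so a
# caller falling back to v1 would also warn about and ignore that parameter.
```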
Daniël de Kok
48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference between how the output is handled between
attention (output parameter) and paged attention (return value). This
also removes the assumption that the attention implementation can
write to an output tensor (in preparation of FlashInfer).
2024-09-25 05:55:39 +00:00
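A minimal sketch of the unified output handling described in #2343 above: the attention helpers allocate and return their own output tensor instead of writing into a caller-provided buffer. Names and shapes are illustrative only.

```python
# Illustrative only: both helpers create the output tensor themselves and
# return it, so callers no longer pass a preallocated `out` parameter.
import torch


def attention(query, key, value):
    out = torch.empty_like(query)  # output allocated inside the function
    # ... run the attention kernel, writing into `out` ...
    return out


def paged_attention(query, key_cache, value_cache, block_tables):
    out = torch.empty_like(query)  # same convention as `attention`
    # ... run the paged attention kernel, writing into `out` ...
    return out
```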
Daniël de Kok
ccddb30c02 Fix cache block size for flash decoding (#2351)
* Fix cache block size for flash decoding

This seems to have been accidentally dropped during the TRT-LLM
PR rebase.

* Also run CI on changes to `backends`
2024-09-25 05:55:39 +00:00