Commit Graph

560 Commits

drbh
5fc0e0c589 fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-25 06:15:35 +00:00
Nicolas Patry
c6b568b892 Fix Yi tokenization (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds?

* Fix the gh action?

* Fixing the location?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD?

* Forcing 3.11?
2024-09-25 06:15:35 +00:00
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really the flashinfer version?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Wang, Yi
938a7f3c3a hotfix: fix regression of attention api change in intel platform (#2439)
Fix a regression caused by the attention API change; ipex.varlen_attention does not currently support
paged-cache-format KV input.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:13:36 +00:00
drbh
34a6399a50 feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-25 06:13:11 +00:00
Nicolas Patry
a313355d2b Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.
2024-09-25 06:13:11 +00:00
Nicolas Patry
4e1ca8d7bd Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and use FD everywhere.

* Update cargo lock?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade the resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input (since it's super important with the prefixing now).

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to the head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-09-25 06:13:11 +00:00
drbh
7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-09-25 06:10:59 +00:00
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non-release (so CI tests them; this needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds causal to attention params to check when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
in this case, the router switches to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen previously. If a new prefill is a prefix of a previously-seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
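
A minimal Python sketch of the trie bookkeeping described above (illustrative only; the real `RadixAllocator` lives in the Rust router and compresses edges into token-id slices):

    # Illustrative radix-trie prefix matching; not the actual RadixAllocator.
    class Node:
        def __init__(self):
            self.children = {}  # next token id -> child node

    class RadixTrie:
        def __init__(self):
            self.root = Node()

        def insert(self, tokens):
            """Record a prefill so later requests can reuse its prefix."""
            node = self.root
            for t in tokens:
                node = node.children.setdefault(t, Node())

        def prefix_len(self, tokens):
            """Length of the longest previously-seen prefix of `tokens`."""
            node, n = self.root, 0
            for t in tokens:
                if t not in node.children:
                    break
                node, n = node.children[t], n + 1
            return n

    trie = RadixTrie()
    trie.insert([1, 2, 3, 4])
    assert trie.prefix_len([1, 2, 3, 9]) == 3  # request is sent with prefix_len=3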
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
4a16da5d49 Add FlashInfer support (#2354)
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
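
A minimal sketch of that wrapping, assuming a state object exposing the `begin_forward`/`end_forward` methods named above (all other names are illustrative):

    from contextlib import contextmanager
    from contextvars import ContextVar

    _state: ContextVar = ContextVar("flashinfer_state")

    @contextmanager
    def forward_context(state):
        token = _state.set(state)  # make the state visible to attention calls
        state.begin_forward()      # set up reusable per-forward data structures
        try:
            yield
        finally:
            state.end_forward()
            _state.reset(token)

    def attention(q, k, v):
        # No `state` argument threaded through the models: read the context var.
        state = _state.get()
        return state.run(q, k, v)  # `run` is illustrative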
2024-09-25 06:01:59 +00:00
drbh
853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) 2024-09-25 05:55:39 +00:00
drbh
1057f28128 Pr 2337 ci branch (#2379)
* hotfix: fix xpu crash brought by code refactoring; torch.xpu relies on importing ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* re-enable gemma2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix regression in ipex flash attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
Wang, Yi
3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
06b638f310 Pr 2374 ci branch (#2378)
* Update __init__.py

Fix issue with NoneType comparison for max_input_tokens and sliding_window

- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError (a minimal guard is sketched below).
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.

* Update __init__.py

Handle NoneType in the sliding_window comparison to fix the TypeError in __init__.py: the comparison logic now accounts for None values instead of raising.

* fix: syntax/style tweak
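
A minimal sketch of the guard mentioned above (names taken from the commit message):

    def check_window(max_input_tokens, sliding_window):
        # Guarding None avoids:
        # TypeError: '<=' not supported between instances of 'int' and 'NoneType'
        if max_input_tokens is None or sliding_window is None:
            return  # nothing to compare
        if sliding_window <= max_input_tokens:
            raise ValueError("sliding window is smaller than max_input_tokens")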

---------

Co-authored-by: Praz <prazanth2006@gmail.com>
2024-09-25 05:55:39 +00:00
drbh
9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
2024-09-25 05:55:39 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
almersawi
11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2024-09-25 05:55:39 +00:00
drbh
3ccde430d9 fix: prefer original layernorm names for 180B (#2365) 2024-09-25 05:55:39 +00:00
drbh
db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) 2024-09-25 05:55:39 +00:00
drbh
83d1f23fea fix: return the out tensor rather than the function's return value (#2361) 2024-09-25 05:55:39 +00:00
drbh
688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335)
* fix: attempt forward on flash attn2 to check hardware support

* fix: warn window_size_left when using flash attn 1

* fix: prefer version check over test op and avoid window_size_left if not flash attn2

* fix: improve conditional and error message

* fix: update sliding window conditional

* fix: simplify changes and revert model changes

* fix: avoid changing conditional

* fix: typo tweak
2024-09-25 05:55:39 +00:00
Daniël de Kok
48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference between how the output is handled between
attention (output parameter) and paged attention (return value). This
also removes the assumption that the attention implementation can
write to an output tensor (in preparation of FlashInfer).
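
A sketch of the unified convention (illustrative; the real functions also take KV-cache and metadata arguments):

    import torch

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(q)  # output tensor is created inside the function
        # ... run the attention kernel, writing into `out` ...
        return out                 # both attention variants now return the result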
2024-09-25 05:55:39 +00:00
Wang, Yi
d70da59c25 enable HuggingFaceM4/idefics-9b in intel gpu (#2338)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
c73d1d604f Pr 2290 ci run (#2329)
* MODEL_ID propagation fix

* fix: remove global model id

---------

Co-authored-by: root <root@tw031.pit.tensorwave.lan>
2024-09-25 05:55:39 +00:00
Daniël de Kok
468e5c6874 Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300)
The `GPTQWeightsLoader` was structured like this in pseudocode:

if marlin:
  Set up tensors in a way that GPTQ-Marlin expects
else:
  Set up tensors in a way that ExLlama/GPTQ/AWQ expect

However, the GPTQ-Marlin implementation details should really be in the
`marlin` module. So move the former part out to a separate
`GPTQMarlinWeightsLoader`.
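
After the split, the structure is roughly the following (a sketch with illustrative method names, not the actual TGI classes):

    class GPTQWeightsLoader:
        def get_weights(self, weights, prefix):
            # Set up tensors the way ExLlama/GPTQ/AWQ expect.
            ...

    class GPTQMarlinWeightsLoader:
        def get_weights(self, weights, prefix):
            # Marlin-specific layout logic now lives with the marlin code.
            ...

    def loader_for(config):
        # Select the loader once, instead of branching inside every load.
        return GPTQMarlinWeightsLoader() if config.use_marlin else GPTQWeightsLoader()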
2024-09-25 05:55:39 +00:00
Daniël de Kok
247a29f77c server quantize: store quantizer config in standard format (#2299)
- Create `quantization_config` option in the model config.
- Don't store the quantizer config in tensors anymore.
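
For illustration, a standard-format entry could look like this (field names are assumptions based on common GPTQ checkpoints, not taken from the commit):

    quantization_config = {
        "quant_method": "gptq",
        "bits": 4,
        "group_size": 128,
        "sym": True,
    }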
2024-09-25 05:50:17 +00:00
Erik Kaunismäki
b1d1d26559 patch-error-on-invalid-grammar (#2282)
* quick fix

* allow silent failure

* explicit todo that this is only short term
2024-09-25 05:50:17 +00:00
Daniël de Kok
23a3927eb6 Install Marlin from standalone package (#2320) 2024-09-25 05:50:17 +00:00
drbh
a87791d7c9 feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-09-25 05:46:24 +00:00
Daniël de Kok
fc6d80fdb8 Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) 2024-09-25 05:41:43 +00:00
Daniël de Kok
64ffd642fa Some small fixes for the Torch 2.4.0 update (#2304)
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
2024-09-25 05:40:25 +00:00
drbh
7ebee37641 fix: refactor adapter weight loading and mapping (#2193)
* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: improve adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo
2024-09-25 05:39:58 +00:00
Daniël de Kok
457791f511 Split up layers.marlin into several files (#2292)
The marlin.py file was getting large, so split it up.
2024-09-25 05:39:58 +00:00
Wang, Yi
d93931567d fix of use of unquantized weights in cohere GQA loading, also enable … (#2291)
Fix the use of unquantized weights in Cohere GQA loading; also enable the model on the Intel platform.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
Wang, Yi
204142153f fix crash in multi-modal (#2245)
* fix crash in multi-modal

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:39:58 +00:00
Daniël de Kok
43f49141fd Add support for Llama 3 rotary embeddings (#2286)
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
2024-09-25 05:38:48 +00:00
shaltielshmid
69b67b7add Add support for Mistral-Nemo by supporting head_dim through config (#2254)
* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non-standard
key in `MistralConfig` (as defined in transformers).

* Shorter diff.
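
A sketch of the fallback (assumed shape, not the exact TGI code):

    # Prefer an explicit head_dim (e.g. Mistral-Nemo), else derive it.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        head_dim = config.hidden_size // config.num_attention_heads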

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 05:31:31 +00:00
Daniël de Kok
26460f053d Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric quantization
(zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
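
Per the first bullet, the initial opt-in looked like this on the launcher (model id is a placeholder):

    text-generation-launcher --model-id <awq-model-id> --quantize gptq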
2024-09-25 05:31:31 +00:00
OlivierDehaene
919da25c3b fix(l4): fix fp8 logic on l4 (#2277)
* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89
2024-09-25 05:31:30 +00:00
Nicolas Patry
31eb03dbe2 Fixing mistral nemo. (#2276) 2024-09-25 05:31:30 +00:00
Nicolas Patry
568cc9f3d0 Softcapping for gemma2. (#2273)
* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.
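
For context, Gemma 2 softcapping squashes scores smoothly into a bounded range; a minimal sketch, assuming the usual tanh formulation and treating `0.0` as "disabled" per the note above:

    import torch

    def softcap(scores: torch.Tensor, cap: float) -> torch.Tensor:
        if cap == 0.0:  # 0.0 is the null value: no capping
            return scores
        return cap * torch.tanh(scores / cap)  # bounded in (-cap, cap)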
2024-09-25 05:31:08 +00:00
OlivierDehaene
a7515b8af1 fix(server): fix fp8 weight loading (#2268)
* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype
2024-09-25 05:31:08 +00:00
icyboy™
a5aee82a69 Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-09-25 05:30:41 +00:00
OlivierDehaene
d13215da8f fix(server): fix deepseekv2 loading (#2266) 2024-09-25 05:30:41 +00:00
OlivierDehaene
85f10ec5c9 feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout
2024-09-25 05:30:41 +00:00
Daniël de Kok
c1638a56f1 Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection (sketched after this list).
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
  so we need whole-weight loads that support quantized weights. To this
  end, `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- A head size of 192 needs an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
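
A hedged sketch of the grouped top-K selection from the first bullet (shapes and names are illustrative):

    import torch

    def grouped_topk(scores, n_groups, topk_groups, topk):
        n_tokens, n_experts = scores.shape
        # Score each group by its best expert; keep the top `topk_groups` groups.
        group_scores = scores.view(n_tokens, n_groups, -1).max(dim=-1).values
        group_idx = group_scores.topk(topk_groups, dim=-1).indices
        # Mask out experts outside the selected groups, then take the top-k.
        mask = torch.zeros(n_tokens, n_groups, dtype=torch.bool)
        mask.scatter_(1, group_idx, True)
        mask = mask.repeat_interleave(n_experts // n_groups, dim=1)
        return scores.masked_fill(~mask, float("-inf")).topk(topk, dim=-1)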
2024-09-25 05:27:40 +00:00
Daniël de Kok
e658d95c23 Hotfix: pass through model revision in VlmCausalLM (#2258) 2024-09-25 05:27:40 +00:00
Daniël de Kok
990ea793c0 Hotfix: fix MPT after recent refactor (#2257) 2024-09-25 05:27:40 +00:00
Daniël de Kok
ba0dfb6fb1 Hotfix: various GPT-based model fixes (#2256) 2024-09-25 05:27:40 +00:00
Daniël de Kok
394f8c7d2b Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-09-25 05:27:40 +00:00
Daniël de Kok
2dd680b799 Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditional in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders (see the sketch
below), which is useful for models like Idefics2 AWQ, which uses a
quantized text model but unquantized vision and connector models.
However, the context manager would be overridden by `get_linear`,
which string-checks `quantizer`. Also, the context manager would
not work with EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
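
A sketch of the loader context manager mentioned above (illustrative names):

    from contextlib import contextmanager

    class Weights:
        def __init__(self, loader):
            self.loader = loader  # the quantizer-specific weights loader

        @contextmanager
        def use_loader(self, loader):
            # Temporarily swap the active loader, e.g. to load the vision
            # tower of an AWQ Idefics2 unquantized.
            previous, self.loader = self.loader, loader
            try:
                yield
            finally:
                self.loader = previous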
2024-09-25 05:27:40 +00:00
OlivierDehaene
118ee57f82 fix(server): fix cohere (#2249) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e0710ccbeb Remove stray quantize argument in get_weights_col_packed_qkv (#2237)
Fixes #2236.
2024-09-25 05:27:40 +00:00
Daniël de Kok
7177da0df6 server quantize: expose groupsize option (#2225) 2024-09-25 05:27:40 +00:00
Daniël de Kok
e955f7b536 Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-09-25 05:27:40 +00:00
drbh
619eeded47 feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-09-25 05:27:40 +00:00
Daniël de Kok
ee56266044 Use symmetric quantization in the quantize subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead,
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
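
A hedged sketch of how a loader might read the marker (the `gptq_sym` name is from this commit; the rest is illustrative):

    def checkpoint_is_symmetric(tensors) -> bool:
        # `tensors` is a dict-like view of the checkpoint. Models quantized
        # before this change lack the marker, so assume sym=False.
        if "gptq_sym" not in tensors:
            return False
        return bool(tensors["gptq_sym"].item())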
2024-09-25 05:27:40 +00:00
SeongBeomLEE
dedeb3cfa0 Modifying base in yarn embedding (#2212) 2024-09-25 05:27:40 +00:00
Daniël de Kok
85c3c5d64f Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.
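
A hedged sketch of the device gate:

    import torch

    def fp8_backend() -> str:
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return "native-fp8"  # e.g. Ada has hardware FP8 support
        if (major, minor) >= (8, 0):
            return "fp8-marlin"  # FP8 GPTQ-Marlin kernels on older Ampere
        raise ValueError("FP8 needs compute capability >= 8.0")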

Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
2024-09-25 05:27:40 +00:00
Daniël de Kok
2a6c3caf1d Move quantized weight handling out of the Weights class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
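
The resulting shape might be pictured like this (a sketch; method names are assumptions, not the exact TGI signatures):

    from abc import ABC, abstractmethod

    class WeightsLoader(ABC):
        # Quantizer-specific weight processing lives behind this interface.
        @abstractmethod
        def get_weights(self, weights, prefix): ...

    class MarlinWeightsLoader(WeightsLoader):
        def get_weights(self, weights, prefix):
            # Delegate raw tensor loads to the low-level `Weights` primitives,
            # then apply Marlin-specific processing here.
            ...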
2024-09-25 05:27:40 +00:00
Daniël de Kok
540e710c3f Falcon/DBRX: get correct number of key-value heads (#2205) 2024-09-25 05:21:34 +00:00
Daniël de Kok
17594916ed Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-09-25 05:21:34 +00:00
Daniël de Kok
f11fd699b6 hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-09-25 05:21:34 +00:00
icyboy™
8e3d1e6c3f fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-09-25 05:21:34 +00:00
Daniël de Kok
508e308088 Consistently take prefix in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-09-25 05:21:34 +00:00
Daniël de Kok
1e7ce69f20 Fix Starcoder2 after refactor (#2189) 2024-09-25 05:20:28 +00:00
Nicolas Patry
e481a9bb9b Hotfixing after refactor. 2024-09-25 05:20:28 +00:00
Nicolas Patry
1b434e8019 Refactor dead code - Removing all flash_xxx.py files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-09-25 05:20:28 +00:00
Aaron Mihalik
835ad0a923 Adding "longrope" for Phi-3 (#2172) (#2179)
Adding "longrope" for phi-3
2024-09-24 04:08:02 +00:00
Nicolas Patry
d580215a24 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-09-24 03:58:36 +00:00
Nicolas Patry
bc5a792dc8 Fixing rocm. (#2164) 2024-09-24 03:58:13 +00:00
drbh
e913f3ad2d fix: use the base layers weight in mistral rocm (#2155) 2024-09-24 03:58:13 +00:00
Wang, Yi
71b0189cd5 fix FlashDecoding change's regression in intel platform (#2161)
install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:58:13 +00:00
Nicolas Patry
9b3d3a3690 Fixing graph capture for flash decoding. (#2163) 2024-09-24 03:58:13 +00:00
Nicolas Patry
b80bd724e1 Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

REvert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-09-24 03:58:13 +00:00
Nicolas Patry
2b9339c65b Fixing baichuan override. (#2158) 2024-09-24 03:58:13 +00:00
Wang, Yi
6265956bc4 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:57:32 +00:00
icyboy™
5b977c3141 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-09-24 03:57:32 +00:00
Daniël de Kok
e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has used asymmetric quantization since
the beginning, and incorrectly reporting the model as symmetric would
make it use GPTQ-Marlin (which does not support asymmetric quantization).
2024-09-24 03:57:32 +00:00
drbh
3e02d4fdbf fix: use weights from base_layer (#2141) 2024-09-24 03:57:32 +00:00
Nicolas Patry
bc15e960ea Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-09-24 03:57:07 +00:00
Daniël de Kok
d731866245 Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-09-24 03:56:28 +00:00
Daniël de Kok
4700ea413f Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-09-24 03:55:04 +00:00
Daniël de Kok
18a8364d94 Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
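
A hedged sketch of the corrected loading path (tensor names follow common AWQ checkpoints; everything else is illustrative):

    def load_awq_linear(weights, prefix: str, has_bias: bool):
        qweight = weights.get_tensor(f"{prefix}.qweight")
        qzeros = weights.get_tensor(f"{prefix}.qzeros")
        scales = weights.get_tensor(f"{prefix}.scales")
        # Pass the real bias tensor, or None -- never a boolean stand-in.
        bias = weights.get_tensor(f"{prefix}.bias") if has_bias else None
        return qweight, qzeros, scales, bias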
2024-09-24 03:55:04 +00:00
drbh
8a155b2d5b Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-09-24 03:55:04 +00:00
Wang, Yi
27ae4f7916 fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-24 03:52:23 +00:00
Nicolas Patry
d626685039 Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most code is exactly the same
except for a very few spots.

Most of those spots are in the kv-cache layout and the flash_xxx.py
files. Since those files should be removed soon and factored away, we
should not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-09-24 03:52:23 +00:00
Wang, Yi
0d879fe66e Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-09-24 03:51:26 +00:00
Wang, Yi
e49aed4713 use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-09-24 03:51:26 +00:00