text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-05 15:30:19 +00:00

Author	SHA1	Message	Date
Nicolas Patry	2e09ebecf6	Preparing patch release. (#2186 )	2024-09-24 04:08:02 +00:00
Nicolas Patry	74ddd1265a	Version 2.1.1	2024-09-24 04:01:22 +00:00
Nicolas Patry	e93c830e66	Fixing missing `object` field for regular completions. (#2175 ) * Fixing missing `object` field for regular completions. * Fixing docs by re-adding missing `Prompt`.	2024-09-24 04:00:11 +00:00
Nicolas Patry	64989f9439	Fixing the dockerfile warnings. (#2173 )	2024-09-24 04:00:11 +00:00
Nicolas Patry	878491cd5b	Revert "Fixing missing `object` field for regular completions." This reverts commit `2bbb7fa4b2`.	2024-09-24 03:59:15 +00:00
Nicolas Patry	b6c8984658	Fixing missing `object` field for regular completions.	2024-09-24 03:59:15 +00:00
drbh	233e46409a	feat: improve update_docs for openapi schema (#2169 ) * feat: add pre commit step to force schema update when router changes * fix: prefer improved update_doc and start server and compare * fix: adjust typo * fix: adjust revert typo * fix: update workflow to use update_doc md command * feat: improve workflow to check openapi schema too * fix: adjust timeout for CI * fix: adjust raise condition and install server in ci * fix: install protoc before server * feat: improve update doc and add command to print router schema * fix: adjust autodoc workflow * fix: explicitly install protoc and python * fix: allow trailing space in openapi schema diff	2024-09-24 03:59:15 +00:00
Nicolas Patry	d580215a24	Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167 )	2024-09-24 03:58:36 +00:00
Nicolas Patry	bc5a792dc8	Fixing rocm. (#2164 )	2024-09-24 03:58:13 +00:00
drbh	e913f3ad2d	fix: use the base layers weight in mistral rocm (#2155 )	2024-09-24 03:58:13 +00:00
Wang, Yi	71b0189cd5	fix FlashDecoding change's regression in intel platform (#2161 ) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-24 03:58:13 +00:00
Nicolas Patry	9b3d3a3690	Fixing graph capture for flash decoding. (#2163 )	2024-09-24 03:58:13 +00:00
Nicolas Patry	b80bd724e1	Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. REvert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-09-24 03:58:13 +00:00
Nicolas Patry	2b9339c65b	Fixing baichuan override. (#2158 )	2024-09-24 03:58:13 +00:00
drbh	381c5c02a6	fix: prefer serde structs over custom functions (#2127 ) * fix: prefer enum for chat object * fix: adjust typo * fix: enum CompletionType not ObjectType * fix: adjust typo * feat: leverage serde for conditional deser * fix: adjust HubTokenizerConfig after rebase * fix: update create_post_processor logic for token type * fix: adjust unwrap syntax in template * Fixing the post processor. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-24 03:57:32 +00:00
Wang, Yi	6265956bc4	refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132 ) * refine get xpu free memory Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable qwen2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable gemma/gemma2/phi in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-24 03:57:32 +00:00
icyboy™	5b977c3141	fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123 ) https://github.com/huggingface/text-generation-inference/issues/2122	2024-09-24 03:57:32 +00:00
Daniël de Kok	e0d168ba20	Use GPTQ-Marlin for supported GPTQ configurations (#2111 ) GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`. This subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).	2024-09-24 03:57:32 +00:00
drbh	de96056c26	feat: download lora adapter weights from launcher (#2140 )	2024-09-24 03:57:32 +00:00
drbh	3e02d4fdbf	fix: use weights from base_layer (#2141 )	2024-09-24 03:57:32 +00:00
Nicolas Patry	03691f6d34	Fixing clippy. (#2149 )	2024-09-24 03:57:32 +00:00
Wang, Yi	8721b601e3	fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148 ) * fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices] Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-24 03:57:32 +00:00
drbh	69514868ee	fix: refactor post_processor logic and add test (#2137 ) * fix: refactor post_processor logic and add test * fix: remove dev comment * fix: adjust when post_processor is overridden and improve create_post_processor	2024-09-24 03:57:07 +00:00
Nicolas Patry	bc15e960ea	Fixing gemma2. (#2135 ) * Fixing gemma2. * Adding new model.	2024-09-24 03:57:07 +00:00
Nicolas Patry	befe60b566	Fixing malformed rust tokenizers (#2134 ) * Fixing malformed rust tokenizers * Fix for deepseek too.	2024-09-24 03:57:07 +00:00
Daniël de Kok	d731866245	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-09-24 03:56:28 +00:00
Nicolas Patry	11fced79bd	Bumping to 2.1 (#2131 )	2024-09-24 03:56:28 +00:00
Nicolas Patry	7045598b20	Fixing prom leak by upgrading. (#2129 )	2024-09-24 03:55:38 +00:00
drbh	399919d715	fix: simplify kserve endpoint and fix imports (#2119 )	2024-09-24 03:55:04 +00:00
Daniël de Kok	4700ea413f	Add support for Marlin 2:4 sparsity (#2102 ) This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when: * The quantizer is `marlin`; * the quantizer checkpoint format is `marlin_24`. Fixes #2098.	2024-09-24 03:55:04 +00:00
Daniël de Kok	18a8364d94	Support AWQ quantization with bias (#2117 ) When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.	2024-09-24 03:55:04 +00:00
drbh	8a155b2d5b	Enable multiple LoRa adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: prefer lorax's custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-09-24 03:55:04 +00:00
Nicolas Patry	8980bf43d7	Fix CI. (#2118 ) Fix clippy.	2024-09-24 03:53:26 +00:00
Daniël de Kok	136fb7e9b9	Add pytest release marker (#2114 ) * Add pytest release marker Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`. * Mark many models as `release` to speed up CI	2024-09-24 03:52:50 +00:00
Wang, Yi	27ae4f7916	fix cpu and xpu issue (#2116 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-24 03:52:23 +00:00
Nicolas Patry	d626685039	Removing IPEX_AVAIL. (#2115 ) * Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most of the code is identical except for a few spots, most of which are in the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN	2024-09-24 03:52:23 +00:00
drbh	1f70bb75e3	feat: add simple tests for weights (#2092 ) * feat: add simple tests for weights * fix: adjust types and add tests * fix: adjust so all tests pass * feat: improve weight tests * fix: add missing tests and renames * fix: tweak shapes	2024-09-24 03:51:26 +00:00
Wang, Yi	0d879fe66e	Cpu tgi (#1936 ) * add CPU tgi support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>	2024-09-24 03:51:26 +00:00
sunxichen	a9faabc374	fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089 ) Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>	2024-09-24 03:51:26 +00:00
Wang, Yi	e49aed4713	use xpu-smi to dump used memory (#2047 ) * use xpu-smi to dump used memory. XPU uses `ZE_AFFINITY_MASK` to select the card; usage is like `CUDA_VISIBLE_DEVICES`. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/utils/import_utils.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-09-24 03:51:26 +00:00
Jeff	1952a0b03b	corrected Pydantic warning. (#2095 ) * corrected Pydantic warning. * Update clients/python/text_generation/types.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-09-24 03:51:26 +00:00
KevinDuffy94	76c6a5ca2a	Add OTLP Service Name Environment Variable (#2076 ) * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option	2024-09-24 03:51:26 +00:00
Lucain	931ff16c7a	Support `HF_TOKEN` environment variable (#2066 ) * Support HF_TOKEN environment variable * Load test. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-24 03:50:38 +00:00
ur4t	4b25048b75	Fix cargo-chef prepare (#2101 ) * Fix cargo-chef prepare In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will generate a new one first, which might vary a lot and invalidate docker build caches. * Fix Dockerfile_amd and Dockerfile_intel	2024-09-24 03:49:13 +00:00
Nicolas Patry	b6a59e2f91	New runner. Manual squash. (#2110 ) * New runner. Manual squash. * Network host. * Put back trufflehog with proper extension. * No network host ? * Moving buildx install after tailscale ? * 1.79	2024-09-24 03:47:37 +00:00
drbh	d930724e82	feat: sort cuda graphs in descending order (#2104 )	2024-09-24 03:46:09 +00:00
Daniël de Kok	f0ed8d294f	Fix `text-generation-server quantize` (#2103 ) The subcommand did not work due to some broken imports.	2024-09-24 03:46:09 +00:00
Daniël de Kok	c61ef1ce85	Factor out sharding of packed tensors (#2059 ) For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.	2024-09-24 03:46:09 +00:00
Daniël de Kok	38741feff0	Support exl2-quantized Qwen2 models (#2085 ) Fixes #2081.	2024-09-24 03:46:09 +00:00
Daniël de Kok	6b2cbd0169	Set maximum grpc message receive size to 2GiB (#2075 ) * Set maximum grpc message receive size to 2GiB The previous default was 4MiB, which doesn't really work well for multi-modal models. * Update to Rust 1.79.0 * Fixup formatting to make PR pass	2024-09-24 03:44:36 +00:00

964 Commits