This version removes our patches/custom API. Makes it simpler to
get changes from upstream. One of which is that we can enable FP8
KV cache for paged attention as well.
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm` the
former doesn't have sharding support crashing the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
cast/decast the DTensor.
Torch 2.5.1 is required for sharding support.
* Put back the attention impl.
* Revert the flashinfer (this will fails).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): add correctly untar it
* Trying to avoid the random timeout.
* More read timeout ?
* Longer timeout ?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to chekc CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Moving to `uv` instead of `poetry`.
More in the standard, faster, seemingly better lockfile.
* Creating venv if not created.
* Create the venv.
* Fix ?
* Fixing the test by activating the environment ?
* Install system ?
* Add the cli entry point.
* docker install on system
* Monkeying this...
* `--system` is redundant.
* Trying to force-include this pb folder.
* TRying to check that pb is imported correctly.
* Editable install necessary ?
* Non editable?
* Editable it is.
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding blocksize as 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: improve star coder to support multi lora layers
* feat: improve weight that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added later names
* Upgrading bitsandbytes.
Co-Authored-By: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Tighter lock.
---------
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Fix `docker run` in `README.md`
* Add line-break in `docker run` for readability
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
* Add line-break in `docker run` for readability
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
---------
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>