* feat: unroll notify_error if no tool is chosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics
---------
Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Upgrade transformers to 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after that, meaning the size of the batch matters).
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* nix: experimental support for building a Docker image
Run using something like:
```
docker run \
--device nvidia.com/gpu=all \
-it --rm -p 8080:80 \
-v $PWD/data:/data \
-v $PWD/tmp:/tmp \
tgi-docker:latest \
--model-id <model_id>
```
* Example of building the Docker image using Nix inside Docker
* Stream to make the builder image smaller
This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.
* Don't spam journalctl on Linux
* Other dockerfile.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
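The restrictions above can be read as a small validation step. The following is only a sketch of that reading, not the code from this change: `QuantConfig`, `unsupported_reason`, and the `world_size` parameter are hypothetical names, and the field names (`quant_method`, `desc_act`, `group_size`, `sym`) are assumed to mirror a typical GPTQ quantization config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantConfig:
    # Hypothetical container; field names assumed to follow common GPTQ configs.
    quant_method: str  # e.g. "gptq" or "awq"
    desc_act: bool     # activation-order quantization
    group_size: int    # -1 means one group per column
    sym: bool          # True for symmetric quantization


def unsupported_reason(cfg: QuantConfig, world_size: int) -> Optional[str]:
    """Return why a quantized MoE model is rejected, or None if it passes.

    `world_size > 1` stands in for "tensor parallelism is enabled".
    """
    if cfg.quant_method == "awq":
        return "AWQ is not supported for MoE models"
    if not cfg.sym:
        return "asymmetric quantization is not supported"
    if cfg.desc_act and world_size > 1 and cfg.group_size != -1:
        return "desc_act with tensor parallelism requires group_size=-1"
    return None
```

Under these assumptions, a symmetric GPTQ model with `desc_act=True` and `group_size=128` passes on a single shard but is rejected once it is sharded across two or more.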
Remove compute capability lock
We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention on GPUs with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
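The fallback for GPUs with capability < 8 described above can be summarized as a small decision function. This is a sketch under stated assumptions, not the server's actual code: `AttentionPlan`, `plan_for_old_gpu`, and the string labels are hypothetical, and the function deliberately returns `None` for cases these notes do not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionPlan:
    # Hypothetical summary of the choices listed in the notes above.
    backend: str          # which attention kernels to use
    prefix_caching: bool  # whether prefix caching stays enabled
    prefill_input: str    # "key_value" (raw tensors) or "kv_cache"


def plan_for_old_gpu(capability_major: int, can_use_flashinfer: bool) -> Optional[AttentionPlan]:
    """Return the fallback plan, or None when the fallback does not apply."""
    if can_use_flashinfer or capability_major >= 8:
        # Not covered by these notes; the caller keeps its existing choice.
        return None
    return AttentionPlan(
        backend="flash-attn-v1+paged",
        prefix_caching=False,       # disabled whenever paged attention is used
        prefill_input="key_value",  # v1 cannot use block tables
    )
```

For example, `plan_for_old_gpu(7, False)` yields the flash-attn v1 + paged attention plan with prefix caching off, while `plan_for_old_gpu(8, False)` returns `None`.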