* update transformres
* Upgrading the nix deps too.
* Forcing torchvision to be in there.
* Fixing bug in mllama.
* Those tests cannot be run in CI.
* Lint.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* initial changes
* Add support for other vlm
* cleanup comment
* Improve attn_implementation
* Add comments for support of models
* add model
* add model
* fixes and improvements
* update docker
* Add cache position
* Add tests
* remove redundant changes
* remove tr version
* Upgrade doc + fix linting.
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* launcher: correctly get the head dimension for VLMs
For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct or `text_config` when that fails.
* fix: bump org name in gemma3 test
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
* Fixing the tool calling convention.
* Update tehe doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API"
Moving after tool_calls2
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
add in Buffering..
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy ?
Ruff.
Uppgrade the flake for latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Making `tool_calls` a vector.
* Arguments output is a string.
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are ran for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only hwen the neuron backend changes),
which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* make content field optional in chat request
* add tool_calls field to Message struct
* feat: add test and serialize tool messages
* fix: bump utopia, openapi doc version and improve test
* fix: rerun update docs
* fix: suppoer tool call id in template and remove unnecessary changes
* fix: ruff lint remove unused import
* fix: adjust message types in tests
---------
Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* Updating mllama after strftime.
* Town instead village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
* feat: refactor model, improve startup and re enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotaty init path
* fix: simplify get position ids and remove usused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplfy rotary forward
* fix: check existance before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids
* fix: adjust signatures with types
* Trying to avoid the random timeout.
* More read timeout ?
* Longer timeout ?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* feat: improve star coder to support multi lora layers
* feat: improve weight that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added later names
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* h100 better name, and keep factor of 2
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduces the issues (workarounds for now).
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
* Move JSON grammar -> regex grammar conversion to the router
This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:
simple schema time: [5.8293 µs 5.8307 µs 5.8320 µs]
change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
Performance has improved.
complex schema time: [14.875 µs 14.881 µs 14.887 µs]
change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
Performance has improved.
Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
* add OpenAI like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
* Add support for compressed-tensors w8a8 int checkpoints
This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.
Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama| 3|flexible-extract| 8|exact_match |↑ |0.8431|± |0.0100|
| | |strict-match | 8|exact_match |↑ |0.8393|± |0.0101|
|ifeval | 4|none | 0|inst_level_loose_acc |↑ |0.8597|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.8201|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.7967|± |0.0173|
| | |none | 0|prompt_level_strict_acc|↑ |0.7468|± |0.0187|
Which is the same ballpark as vLLM.
As usual, lots of thanks to Neural Magic/vLLM for the kernels.
* Always use dynamic input quantization for w8a8 int
It's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Small fixes
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
quantizers.
- Configurable exclusions for quantization.
This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.
The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
Support for other quantization types will be added in subsequent PRs.
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with meesage and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update ?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Ellide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.