* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API"
Moving after tool_calls2
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
add in Buffering..
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy ?
Ruff.
Uppgrade the flake for latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Making `tool_calls` a vector.
* Arguments output is a string.
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
* Patch rust release.
* Trying to remove the rust-toolchain hardcoded in action.
* Upgrade rust toolchain.
* Put back the toolchain ?
* Fix neuron dockerfile.
* Move to the proper version of Rust.
* 1.85 since the GH action doesn't respect the override.
* Typo.
* Fixing the github action.
* Fixing docker llamacpp.
* Fixing the github action.
* Update clippy.
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
* make content field optional in chat request
* add tool_calls field to Message struct
* feat: add test and serialize tool messages
* fix: bump utopia, openapi doc version and improve test
* fix: rerun update docs
* fix: suppoer tool call id in template and remove unnecessary changes
* fix: ruff lint remove unused import
* fix: adjust message types in tests
---------
Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. That is useful to track usage in case of collaborations for example.
* fix: trufflehog
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* misc(cmake) update dependencies
* feat(hardware) enable new hardware.hpp and unittests
* test(ctest) enable address sanitizer
* feat(backend): initial rewrite of the backend for simplicity
* feat(backend): remove all the logs from hardware.hpp
* feat(backend): added some logging
* feat(backend): enable compiler warning if support for RVO not applying
* feat(backend): missing return statement
* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder
* feat(backend): delete previous backend impl
* feat(backend): more impl
* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility
* feat(backend): allow overriding which Python to use
* feat(backend): fix backend_exception_t -> backend_error_t naming
* feat(backend): impl missing generation_step_t as return value of pull_tokens
* feat(backend): make backend_workspace_t::engines_folder constexpr
* feat(backend): fix main.rs retrieving the tokenizer
* feat(backend): add guard to multiple header definitions
* test(backend): add more unittest
* feat(backend): remove constexpr from par
* feat(backend): remove constexpig
* test(backend): more test coverage
* chore(trtllm): update dependency towards 0.15.0
* effectively cancel the request on the executor
* feat(backend) fix moving backend when pulling
* feat(backend): make sure we can easily cancel request on the executor
* feat(backend): fix missing "0" field access
* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut
* chore: Add doc and CI for TRTLLM (#2799)
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* doc: Formatting
* misc(backend): indent
---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
* Using both value from config as they might not be correct.
* Fixing max_position_embeddings for falcon.
* Simple attempt to fix the healthcheck block allocation.
* Much simpler solution.
* Default value for Backend start_health
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* h100 better name, and keep factor of 2
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduces the issues (workarounds for now).
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
* Move JSON grammar -> regex grammar conversion to the router
This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:
simple schema time: [5.8293 µs 5.8307 µs 5.8320 µs]
change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
Performance has improved.
complex schema time: [14.875 µs 14.881 µs 14.887 µs]
change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
Performance has improved.
Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
* add OpenAI like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
* Remove vLLM dependency for CUDA
This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.
Tested run (since we don't have paged attention in CI):
```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```
* Fix clippy warning
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streamin error to include error_type
* Revert "fix: improve streamin error to include error_type"
This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.
* Reworked the implementation.
* Revert "Reworked the implementation."
This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.
* Small lifting.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with meesage and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update ?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Ellide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
As spotted by @philschmid, the payload was compliant with Vertex AI, but
just partially, since ideally the most compliant version would be with
the generation kwargs flattened to be on the same level as the
`messages`; meaning that Vertex AI would still expect a list of
instances, but each instance would be an OpenAI-compatible instance,
which is more clear; and more aligned with the SageMaker integration
too, so kudos to him for spotting that; and sorry from my end for any
inconvenience @Narsil.
* feat: process token stream before returning to client
* fix: expect content in test
* fix: improve comparison via ruff lint
* fix: return event in all cases
* fix: always send event on error, avoid unwraps, refactor and improve tests
* fix: prefer no_tool over notify_error to improve reponse
* fix: adjust chat input test for no_tool
* fix: adjust test expected content
---------
Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>
* feat: unroll notify_error if no tool is choosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file