* chore: prepare version 3.3.5
* black
* neuron: black
* Update hf-xet in uv lockfile
* Attempt to fix API doc check failure
Add `error_type` where missing.
* Pin redocly version
* Sync redocly with Nix for now
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Disable Cachix pushes
This is not safe until we have sandboxed builds. For TGI alone
this might not be a huge issue, but with Cachix caching disabled
in hf-nix, TGI CI would build all the packages and push them to
our cache.
* fix: bump docs
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* chore(neuron): update to optimum-neuron 0.3.0
Dependencies were changed accordingly, because the Neuron SDK was
updated to v2.24.
* test: sample is not deterministic
Also modify the temperature in the decode test to avoid granite
stopping early.
* test(neuron): adjust expectations after graph changes
* test(neuron): use greedy for stop sequences
---------
Co-authored-by: David Corvoysier <david@huggingface.co>
* chore(neuron): use optimum-neuron 0.2.1
* test(neuron): adjust expectations
Since the latest optimum-neuron uses new modeling code for granite and
qwen, the greedy outputs are slightly different.
* test(neuron): add phi3 and qwen3 tests
* chore(neuron): use optimum-neuron 0.2.2
* chore(neuron): bump version to 0.2.0
* refactor(neuron): use named parameters in inputs helpers
This makes it possible to hide the differences in input parameters
between the two backends.
* refactor(neuron): remove obsolete code paths
* fix(neuron): use neuron_config whenever possible
* fix(neuron): use new cache import path
* fix(neuron): neuron config is not stored in config anymore
* fix(nxd): adapt model retrieval to new APIs
* fix(generator): emulate greedy in sampling parameters
When on-device sampling is enabled, we need to emulate greedy
behaviour by setting top-k=1, top-p=1 and temperature=1.
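The greedy emulation described in this commit can be illustrated with a toy sampler (a minimal sketch; the function name and structure are illustrative, not TGI's actual generator code). With top-k=1 only the argmax token survives filtering, so top-p and temperature become no-ops and sampling collapses to greedy decoding:

```python
import math
import random

def sample_token(logits, top_k, top_p, temperature):
    """Toy sampler showing why top_k=1, top_p=1, temperature=1 is greedy."""
    # Temperature scaling (a no-op at temperature=1).
    scaled = [l / temperature for l in logits]
    # Top-k filtering: keep the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability).
    m = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - m) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p (nucleus) filtering: keep the smallest prefix whose mass
    # reaches top_p. At top_p=1 this keeps everything left after top-k.
    cumulative, nucleus = 0.0, []
    for tok, p in zip(kept, probs):
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return random.choices([t for t, _ in nucleus],
                          weights=[p for _, p in nucleus])[0]

logits = [0.3, 2.1, -0.5, 1.7]
# Only the argmax token (id 1) can be drawn, i.e. greedy decoding.
assert sample_token(logits, top_k=1, top_p=1.0, temperature=1.0) == 1
```

This mirrors the constraint the commit works around: when sampling runs on-device, there is no separate greedy code path, so greedy requests are expressed through the sampling parameters instead.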
* test(neuron): update models and expectations
* feat(neuron): support on-device sampling
* fix(neuron): adapt entrypoint
* tests(neuron): remove obsolete models
* fix(neuron): adjust test expectations for llama on nxd
* Switch to punica-sgmv kernel from the Hub
This also switches (temporarily) to the tgi-nix/kernel-builder merge
branch, bumping up to CUDA 12.8 (same as non-Nix Torch).
* nix: client depends on aiohttp
This probably worked before the nixpkgs bump because a dependency
propagated aiohttp.