* chore(neuron): use optimum-neuron 0.2.1
* test(neuron): adjust expectations
Since the latest optimum-neuron uses new modeling code for granite and
qwen, the greedy outputs are slightly different.
* test(neuron): add phi3 and qwen3 tests
* chore(neuron): use optimum-neuron 0.2.2
* chore(neuron): bump version to 0.2.0
* refactor(neuron): use named parameters in inputs helpers
This allows us to hide the differences between the two backends in terms
of input parameters.
* refactor(neuron): remove obsolete code paths
* fix(neuron): use neuron_config whenever possible
* fix(neuron): use new cache import path
* fix(neuron): neuron config is not stored in config anymore
* fix(nxd): adapt model retrieval to new APIs
* fix(generator): emulate greedy in sampling parameters
When on-device sampling is enabled, we need to emulate greedy decoding
by setting top-k=1, top-p=1, temperature=1 in the sampling parameters.
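The trick above can be sketched with a toy pure-Python sampler (illustrative only, not the actual backend code): with top-k=1 the candidate set collapses to the argmax token, so top-p and temperature become no-ops and sampling reduces to greedy decoding.

```python
import math
import random

def sample_next_token(logits, top_k=0, top_p=1.0, temperature=1.0, rng=None):
    """Toy sampler: temperature scaling, then top-k, then top-p (nucleus).

    temperature must be > 0; top_k=0 means 'no top-k filtering'.
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    # candidates sorted by descending score
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # softmax over the surviving candidates
    m = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # nucleus: keep the smallest prefix whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # draw from the kept candidates
    r = rng.random() * sum(p for _, p in kept)
    for idx, p in kept:
        r -= p
        if r <= 0:
            return idx
    return kept[-1][0]
```

With `top_k=1` only the argmax survives, so the returned token equals greedy decoding regardless of `top_p` and `temperature`.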
* test(neuron): update models and expectations
* feat(neuron): support on-device sampling
* fix(neuron): adapt entrypoint
* test(neuron): remove obsolete models
* fix(neuron): adjust test expectations for llama on nxd
* Switch to punica-sgmv kernel from the Hub
This also switches (temporarily) to the tgi-nix/kernel-builder merge
branch, bumping up to CUDA 12.8 (same as non-Nix Torch).
* nix: client depends on aiohttp
This probably worked before the nixpkgs bump because a dependency
propagated aiohttp.
* Update to Torch 2.7.0
* Try to fix typer/click issue
* Pin click to fix incompatibility with typer
* Fix some test outputs with slight deviations
* Attempt again to sync with CI
* Mamba too
* Fixup mllama
Also switch to `unsloth/Llama-3.2-11B-Vision-Instruct` for testing
from the EU :).
* forward and tokenize chooser use the same shape
Concatenate or filter on CPU tensors to avoid dynamic shapes on HPU.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
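The idea of keeping device shapes static can be illustrated with a hedged sketch (function name and bucket sizes are hypothetical, not the actual HPU backend code): perform filtering/concatenation on host-side lists, then pad the result to the next fixed bucket size so the tensor handed to the device always has one of a few static shapes.

```python
def pad_to_bucket(token_ids, bucket_sizes=(8, 16, 32), pad_id=0):
    """Pad a host-side (CPU) batch to the next bucket size so the
    device-side tensor shape stays static after filtering or
    concatenation (avoids graph recompilation on dynamic shapes)."""
    for bucket in bucket_sizes:
        if len(token_ids) <= bucket:
            return token_ids + [pad_id] * (bucket - len(token_ids))
    raise ValueError("batch exceeds the largest static bucket")
```

Filtering a batch from 5 to 3 requests still yields a length-8 tensor, so the compiled device graph can be reused.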
* use hpu set seed
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* IPEX supports FP8 KV cache
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add KV cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add softcap and sliding window
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* KV scale in paged attention
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
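As a hedged illustration of what a KV-cache scale does (pure-Python sketch with made-up names; real FP8 storage additionally rounds to e4m3/e5m2 representable values): cache entries are divided by a scale and clamped to the FP8 range on write, then multiplied back by the scale on read inside the attention kernel.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_kv(values, scale):
    # store value / scale, clamped to the FP8 dynamic range
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

def dequantize_kv(qvalues, scale):
    # recover the original magnitude on read
    return [q * scale for q in qvalues]
```

Choosing the scale so that typical K/V magnitudes land well inside ±448 is what keeps the round trip nearly lossless.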
* remove Triton installation; it will be installed with Torch
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install xelink lib
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* softcap default -1.0
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
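For context, "softcap" here refers to tanh-based logit soft-capping; a minimal sketch (hypothetical helper, assuming the common convention that a non-positive value such as the -1.0 default disables capping):

```python
import math

def apply_softcap(scores, softcap=-1.0):
    """Soft-cap attention scores: softcap * tanh(score / softcap).

    A non-positive softcap (the -1.0 default) leaves scores untouched,
    which is why -1.0 is a safe 'disabled' sentinel."""
    if softcap <= 0.0:
        return list(scores)
    return [softcap * math.tanh(s / softcap) for s in scores]
```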
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Add json_schema alias for GrammarType
* Add tests for all aliases
* fix: various linter adjustments
* fix: end-of-file-fixer lint
* fix: add test snapshots and avoid docs change
* fix: another end-of-file-fixer lint
* feat: support json_schema grammar constraining and add tests
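The alias can be pictured with a small Python stand-in for the actual deserialization logic (names are illustrative, not the server's real parser): both the existing "json" spelling and the OpenAI-style "json_schema" spelling resolve to the same grammar type.

```python
# Illustrative alias table: "json_schema" deserializes to the same
# grammar variant as "json" (sketch, not the server's real code).
GRAMMAR_TYPE_ALIASES = {
    "json": "json",
    "json_schema": "json",
    "regex": "regex",
}

def parse_grammar(grammar):
    """Resolve a request's grammar field to (canonical type, value)."""
    gtype = GRAMMAR_TYPE_ALIASES.get(grammar["type"])
    if gtype is None:
        raise ValueError(f"unknown grammar type: {grammar['type']!r}")
    return gtype, grammar["value"]
```

Clients sending `{"type": "json_schema", "value": {...}}` therefore get the same constrained decoding as those sending `{"type": "json", ...}`.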
* fix: bump openapi doc with new grammar option
* fix: adjust test payload
* fix: bump test snaps
---------
Co-authored-by: Alex Weston <alexw@alkymi.io>