* chore(neuron): bump version to 0.2.0
* refactor(neuron): use named parameters in inputs helpers
This makes it possible to hide the differences between the two backends in terms of
input parameters.
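As a hedged sketch of the idea (helper names and signatures are made up, not the actual neuron backend API), keyword-only parameters let both backends accept the same call and simply ignore what they do not use:
```python
# Hypothetical input helpers: same named parameters for both backends,
# so callers never have to branch on which backend is active.
def get_inputs_hlo(*, input_ids, attention_mask, seq_ids=None, sampling_params=None):
    # The HLO backend does not use seq_ids or sampling_params.
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def get_inputs_nxd(*, input_ids, attention_mask, seq_ids=None, sampling_params=None):
    # The NxD backend consumes the extra parameters.
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "seq_ids": seq_ids,
        "sampling_params": sampling_params,
    }
```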
* refactor(neuron): remove obsolete code paths
* fix(neuron): use neuron_config whenever possible
* fix(neuron): use new cache import path
* fix(neuron): neuron config is not stored in config anymore
* fix(nxd): adapt model retrieval to new APIs
* fix(generator): emulate greedy in sampling parameters
When on-device sampling is enabled, we need to emulate the greedy
behaviour using top-k=1, top-p=1, temperature=1.
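As a rough sketch (field names are illustrative, not the generator's actual ones), a greedy request maps to sampling parameters that can only ever select the argmax token:
```python
def sampling_params_for(do_sample, top_k, top_p, temperature):
    # Hypothetical helper: with on-device sampling always active,
    # greedy decoding is emulated by constraining the sampler so that
    # only the most likely token can be selected.
    if not do_sample:
        return {"top_k": 1, "top_p": 1.0, "temperature": 1.0}
    return {"top_k": top_k, "top_p": top_p, "temperature": temperature}
```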
* test(neuron): update models and expectations
* feat(neuron): support on-device sampling
* fix(neuron): adapt entrypoint
* test(neuron): remove obsolete models
* fix(neuron): adjust test expectations for llama on nxd
* Switch to punica-sgmv kernel from the Hub
This also switches (temporarily) to the tgi-nix/kernel-builder merge
branch, bumping up to CUDA 12.8 (same as non-Nix Torch).
* nix: client depends on aiohttp
This probably worked before the nixpkgs bump because a dependency
propagated aiohttp.
* Update to Torch 2.7.0
* Try to fix typer/click issue
* Pin click to fix incompatibility with typer
* Fix some test outputs with slight deviations
* Attempt again to sync with CI
* Mamba too
* Fixup mllama
Also switch to `unsloth/Llama-3.2-11B-Vision-Instruct` for testing
from the EU :).
* forward and tokenize chooser use the same shape
Concat and filter are applied to CPU tensors to avoid dynamic shapes on HPU (see the sketch below).
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
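A rough illustration of the pattern (tensor names and sizes are made up): do the variable-shape filtering on CPU, then hand the HPU a fixed-size padded tensor so changing batch composition never triggers recompilation:
```python
import torch

batch = torch.zeros(8, 128, dtype=torch.long)   # padded input_ids batch
keep = torch.tensor([0, 2, 3])                  # surviving request indices

filtered = batch.index_select(0, keep)          # dynamic shape, stays on CPU
padded = torch.zeros(8, 128, dtype=torch.long)  # fixed HPU bucket shape
padded[: filtered.shape[0]] = filtered
# hpu_batch = padded.to("hpu")  # only valid on a Gaudi device
```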
* use hpu set seed
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* IPEX support FP8 kvcache
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add kvcache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add softcap and slidingwindow
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
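Softcapping is usually a tanh squash of the attention logits (a generic sketch of the common formula, e.g. as used by Gemma 2; the actual IPEX kernel interface may differ). The -1.0 default set a few commits below acts as a "disabled" sentinel:
```python
import torch

def apply_softcap(logits: torch.Tensor, softcap: float) -> torch.Tensor:
    if softcap <= 0:  # default -1.0 means "no softcapping"
        return logits
    # Smoothly bound the logits to (-softcap, softcap).
    return softcap * torch.tanh(logits / softcap)
```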
* kv scale in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
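A hedged sketch of what the KV scale does (per-tensor scaling shown for clarity; the real paged-attention op takes the scales as kernel arguments):
```python
import torch

def quantize_kv(kv: torch.Tensor, scale: float) -> torch.Tensor:
    # Scale values into the FP8 representable range before casting.
    return (kv / scale).to(torch.float8_e4m3fn)

def dequantize_kv(kv_fp8: torch.Tensor, scale: float) -> torch.Tensor:
    # The attention kernel folds this multiply in; shown explicitly here.
    return kv_fp8.to(torch.bfloat16) * scale
```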
* remove triton installation; it will be installed with torch
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install xelink lib
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* softcap default -1.0
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Add json_schema alias for GrammarType
* Add tests for all aliases
* fix: various linter adjustments
* fix: end-of-file-fixer lint
* fix: add test snapshots and avoid docs change
* fix: another end-of-file-fixer lint
* feat: support json_schema grammar constraining and add tests
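For illustration, a `/generate` payload using the alias could look like this (the value layout is assumed to match the existing `json` grammar type):
```python
# Hypothetical request body; "json_schema" is the new alias for "json".
payload = {
    "inputs": "Give me the color of a ripe banana as JSON.",
    "parameters": {
        "grammar": {
            "type": "json_schema",
            "value": {
                "type": "object",
                "properties": {"color": {"type": "string"}},
                "required": ["color"],
            },
        },
    },
}
# POST this to the server's /generate endpoint.
```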
* fix: bump openapi doc with new grammar option
* fix: adjust test payload
* fix: bump test snaps
---------
Co-authored-by: Alex Weston <alexw@alkymi.io>
* Add `.DS_Store` file to `.gitignore`
* Skip `{% generation %}` and `{% endgeneration %}`
Custom syntax within the chat template of the Phi4 Reasoning models,
e.g. https://huggingface.co/microsoft/Phi-4-reasoning-plus, which is
AFAIK not handled natively yet, so we skip it for now.
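A minimal sketch of the workaround (where exactly it hooks into the template handling is not shown): the tags only mark assistant spans for training masks, so stripping them leaves the rendered text unchanged:
```python
def strip_generation_tags(template: str) -> str:
    # The markers carry no rendered output, so dropping them is safe.
    return (
        template
        .replace("{% generation %}", "")
        .replace("{% endgeneration %}", "")
    )
```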
* Update explanation on `{% generation %}` and `{% endgeneration %}` removal
* Revert "Add `.DS_Store` file to `.gitignore`"
This reverts commit d64d6d2f7f.
* clean up cuda/rocm code in the hpu backend, enable flat_hpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix TP in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* adjust block table in hpu to improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable all the models, not tested yet
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use tensor cache in hpu graph to avoid replay issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
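This is the usual static-tensor pattern for captured graphs (a generic sketch, not the backend's actual code): a replayed graph reads its inputs by address, so new data must be copied into cached tensors rather than passed as fresh allocations:
```python
import torch

_tensor_cache: dict = {}

def as_static(name: str, value: torch.Tensor) -> torch.Tensor:
    # Reuse the tensor captured by the graph; copy_ keeps its address valid.
    cached = _tensor_cache.get(name)
    if cached is None or cached.shape != value.shape:
        cached = _tensor_cache[name] = value.clone()
    else:
        cached.copy_(value)
    return cached
```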
* add moe support, fix qwen/mistral/mixtral crashes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix phimoe issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* gpt_bigcode can also use pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix incorrect output in qwen2 and idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* warmup prefill
Remove models where pageattn is not used; set the block table to None since it is not used.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
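The warmup idea, sketched generically (bucket sizes and the forward call are placeholders): run one dummy forward per (batch size, sequence length) bucket so every graph shape is compiled before the first real request:
```python
import itertools
import torch

BATCH_BUCKETS = [1, 4, 8]
SEQLEN_BUCKETS = [128, 512, 1024]

def warmup(model_forward):
    # Touch every bucket once so serving never hits a cold graph.
    for bs, seq in itertools.product(BATCH_BUCKETS, SEQLEN_BUCKETS):
        dummy_ids = torch.zeros(bs, seq, dtype=torch.long)
        model_forward(dummy_ids)
```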
* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
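For illustration (shapes and names invented), arithmetic masking computes the same result as `torch.where` without the op that misbehaved under the captured graph:
```python
import torch

a, b = torch.randn(4, 8), torch.randn(4, 8)
mask = torch.rand(4, 8) > 0.5

out_where = torch.where(mask, a, b)
out_masked = a * mask + b * (~mask)  # same result, no torch.where
assert torch.equal(out_where, out_masked)
```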
* LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine logs and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove block_scales, which is not needed anymore
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* make prefill bypass the hpu graph
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix pingpong optimization issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add prometheus port
* fix doc
* add port for trtllm and llamacpp
* Fixing format after rebase.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>