Commit Graph

  • b5e814b93c feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) OlivierDehaene 2024-07-20 17:02:04 +0000
  • ceb4b65fff Add FP8 release test (#2261) Daniël de Kok 2024-07-20 12:26:06 +0200
  • 38aaf4bbc4 re-push to internal registry (#2242) Adrien 2024-07-20 07:06:40 +0200
  • 8f3fa1782e Add support for Deepseek V2 (#2224) Daniël de Kok 2024-07-19 17:23:20 +0200
  • 6bdf8d7d69 fix: adjust default tool choice (#2244) drbh 2024-07-19 11:12:02 -0400
  • 26194add09 add usage stats to toctree (#2260) Erik Kaunismäki 2024-07-19 16:34:04 +0200
  • 4cb09155d0 usage stats and crash reports (#2220) Erik Kaunismäki 2024-07-19 16:17:56 +0200
  • 0f59eb8ff3 Hotfix: pass through model revision in VlmCausalLM (#2258) Daniël de Kok 2024-07-19 15:59:00 +0200
  • ede341d8e4 Hotfix: fix MPT after recent refactor (#2257) Daniël de Kok 2024-07-19 14:42:35 +0200
  • 52bafbae08 Hotfix: various GPT-based model fixes (#2256) Daniël de Kok 2024-07-19 14:42:19 +0200
  • a5b70eeb19 Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) Daniël de Kok 2024-07-19 12:55:59 +0200
  • d36a52f84f Improve the handling of quantized weights (#2250) Daniël de Kok 2024-07-19 09:37:39 +0200
  • 0f8b19db76 fix(server): fix cohere (#2249) OlivierDehaene 2024-07-18 14:00:13 +0000
  • 1c11084e0b Remove stray quantize argument in get_weights_col_packed_qkv (#2237) Daniël de Kok 2024-07-16 09:30:57 +0200
  • f3d709b687 server quantize: expose groupsize option (#2225) Daniël de Kok 2024-07-16 08:36:05 +0200
  • 77478f04ec Add support for AWQ-quantized Idefics2 (#2233) Daniël de Kok 2024-07-16 07:58:25 +0200
  • 3e3b74837b fix: Remove bitsandbytes installation when running cpu-only install (#2216) Hugo Larcher 2024-07-15 15:34:20 +0200
  • 26a7bae06e fix custom cache dir (#2226) Erik Kaunismäki 2024-07-15 15:17:13 +0200
  • 6314c96828 feat: simple mistral lora integration tests (#2180) drbh 2024-07-15 09:16:15 -0400
  • 295d1ade49 Use symmetric quantization in the quantize subcommand (#2120) Daniël de Kok 2024-07-12 12:20:12 +0200
  • 37245dfc16 [fix] Modifying base in yarn embedding (#2212) SeongBeomLEE 2024-07-12 17:04:51 +0900
  • ecffeeb367 fix: append DONE message to chat stream (#2221) drbh 2024-07-11 10:42:58 -0400
  • ab366cc083 Add support for FP8 on compute capability >=8.0, <8.9 (#2213) Daniël de Kok 2024-07-11 16:03:26 +0200
  • b967a21f47 Move quantized weight handling out of the Weights class (#2194) Daniël de Kok 2024-07-09 20:04:03 +0200
  • 6d3c598032 Updating the self check (#2209) Nicolas Patry 2024-07-09 17:23:48 +0200
  • 5caff1c8dd Fixed README ToC (#2196) vinkamath 2024-07-09 02:22:08 -0700
  • a48ab68a28 Adding sanity check to openapi docs. Nicolas Patry 2024-07-09 11:13:48 +0200
  • c329b1cc64 Fix buildx cache + change runner type (#2176) Guillaume LEGENDRE 2024-07-08 18:13:32 +0200
  • 867ac74a56 Fix nccl regression on PyTorch 2.3 upgrade (#2099) fxmarty 2024-07-08 17:52:10 +0200
  • 6ab7ade9bb feat: use model name as adapter id in chat endpoints (#2128) drbh 2024-07-08 10:06:49 -0400
  • 0ad64141b1 update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190) Wang, Yi 2024-07-08 22:03:59 +0800
  • 651824b475 fix: python deserialization (#2178) Javier Martinez 2024-07-08 15:59:16 +0200
  • 2dd053aff1 add doc for intel gpus (#2181) Wang, Yi 2024-07-08 21:57:06 +0800
  • 6dfd8430e0 Falcon/DBRX: get correct number of key-value heads (#2205) Daniël de Kok 2024-07-08 13:22:38 +0200
  • eabcb61d3e Fix incorrect cache allocation with multi-query (#2203) Daniël de Kok 2024-07-08 11:19:48 +0200
  • b2431ca537 hotfix: Fix number of KV heads (#2202) Daniël de Kok 2024-07-08 09:52:12 +0200
  • 361aae7334 fix dbrx & opt model prefix bug (#2201) icyboy™ 2024-07-08 15:01:14 +0800
  • 2640328dba Consistently take prefix in model constructors (#2191) Daniël de Kok 2024-07-05 16:07:48 +0200
  • c9eabadc94 GPTQ CI improvements (#2151) Daniël de Kok 2024-07-05 14:12:16 +0200
  • 76dcc263fd Fix Starcoder2 after refactor (#2189) Daniël de Kok 2024-07-05 12:22:45 +0200
  • 69aa8a55a0 Hotfixing after refactor. Nicolas Patry 2024-07-05 09:25:29 +0000
  • f6982b80b3 Refactor dead code - Removing all flash_xxx.py files. (#2166) Nicolas Patry 2024-07-05 10:29:56 +0200
  • f598d247ec Adding "longrope" for Phi-3 (#2172) (#2179) Aaron Mihalik 2024-07-05 03:46:41 -0400
  • 348b3b4cff Preparing patch release. (#2186) Nicolas Patry 2024-07-04 10:55:33 +0200
  • ccb0e6bee0 Fixing missing object field for regular completions. (#2175) Nicolas Patry 2024-07-03 12:56:27 +0200
  • 8ffc0189a7 Fixing the dockerfile warnings. (#2173) Nicolas Patry 2024-07-03 12:48:45 +0200
  • 3f36200271 Revert "Fixing missing object field for regular completions." Nicolas Patry 2024-07-03 10:41:39 +0000
  • 6a928632db Fixing missing object field for regular completions. Nicolas Patry 2024-07-03 10:40:22 +0000
  • e1476654c5 feat: improve update_docs for openapi schema (#2169) drbh 2024-07-03 03:53:35 -0400
  • 430169ef60 Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) Nicolas Patry 2024-07-02 14:26:47 +0200
  • 793299f594 Ci test (#2124) Guillaume LEGENDRE 2024-07-02 12:45:38 +0200
  • fe32f9a5fe Fixing rocm. (#2164) Nicolas Patry 2024-07-02 12:01:08 +0200
  • de5102f2d3 fix: use the base layers weight in mistral rocm (#2155) drbh 2024-07-02 05:56:25 -0400
  • fe859831eb fix FlashDecoding change's regression in intel platform (#2161) Wang, Yi 2024-07-02 17:56:07 +0800
  • ba906df8e0 Fixing graph capture for flash decoding. (#2163) Nicolas Patry 2024-07-02 11:43:07 +0200
  • bec6a17feb [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) Nicolas Patry 2024-07-01 23:28:00 +0200
  • f037337b60 Fixing baichuan override. (#2158) Nicolas Patry 2024-07-01 23:25:54 +0200
  • ad07af3ab9 GH router. (#2153) Nicolas Patry 2024-07-01 15:42:26 +0200
  • 415ded31ff Fixing test. (#2152) Nicolas Patry 2024-07-01 15:24:17 +0200
  • 89a0c5d378 fix: prefer serde structs over custom functions (#2127) drbh 2024-07-01 09:08:05 -0400
  • dca51e4673 refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132) Wang, Yi 2024-07-01 20:32:54 +0800
  • e940b35357 fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123) icyboy™ 2024-07-01 20:17:22 +0800
  • f92eca247e Use GPTQ-Marlin for supported GPTQ configurations (#2111) Daniël de Kok 2024-07-01 12:59:12 +0200
  • d56dbde1af feat: download lora adapter weights from launcher (#2140) drbh 2024-07-01 06:58:49 -0400
  • 53e146fd3c fix: use weights from base_layer (#2141) drbh 2024-07-01 06:58:40 -0400
  • 35068f4b47 Fixing clippy. (#2149) Nicolas Patry 2024-07-01 12:02:19 +0200
  • 08fede71da fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148) Wang, Yi 2024-07-01 17:27:53 +0800
  • 592ea3f2f8 removed tracing logs Edwinhr716 2024-07-26 00:39:11 +0000
  • 9697d16207 enviroment variable approach Edwinhr716 2024-07-25 23:12:33 +0000
  • c27075d349 added implementation that requires new cli argument Edwinhr716 2024-07-25 22:15:27 +0000
  • 73eadda9dd fix: reject grammars without properties drbh 2024-07-25 17:12:38 +0000
  • 6784d5d0a6 fix: avoid unneeded quantize check drbh 2024-07-25 15:05:26 +0000
  • a10f4010d7 fix: adjust lints and ignore specific rules drbh 2024-07-25 14:50:18 +0000
  • 3905f854ed Fix registry name (#2307) Adrien 2024-07-25 16:06:00 +0200
  • 783e03953a wip Adrien 2024-07-25 15:20:54 +0200
  • c5a982de82 convert header to string KevinDuffy94 2024-07-25 09:16:24 -0400
  • 6b294d7e81 fix Adrien 2024-07-25 15:06:15 +0200
  • c25babcb76 test Adrien 2024-07-25 14:54:01 +0200
  • 17ed42be3a Fixing idefics on g6 tests. (#2306) Nicolas Patry 2024-07-25 14:44:21 +0200
  • 270ec41b09 allow silent failure erikkaum 2024-07-25 14:39:15 +0200
  • 96a24acbd9 fix registry usage Adrien 2024-07-25 14:34:40 +0200
  • 55d4288509 Fixing idefics on g6 tests. Nicolas Patry 2024-07-25 14:00:59 +0200
  • 9256d7c38c Some small fixes for the Torch 2.4.0 update (#2304) Daniël de Kok 2024-07-25 13:34:44 +0200
  • 060b70525b Fix small PaliGemma logprob differences after the torch update Daniël de Kok 2024-07-25 09:44:27 +0000
  • 9c129a4a47 Update poetry lock file Daniël de Kok 2024-07-25 09:40:20 +0000
  • fa9221f28d Fix GPTQ autotune data type to be compatible with Torch 2.4.0 Daniël de Kok 2024-07-25 09:39:42 +0000
  • 6b74f5b413 make sure variable live long enough... Morgan Funtowicz 2024-07-25 10:47:52 +0000
  • 69a5804e51 use std::env::const::ARCH Morgan Funtowicz 2024-07-25 10:44:42 +0000
  • fcbf2fc1ac fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time Morgan Funtowicz 2024-07-25 10:36:55 +0000
  • dda015f2aa add some custom stuff for nccl linkage Morgan Funtowicz 2024-07-25 10:29:51 +0000
  • 0a8c9d3dcf install to decoder_attention target Morgan Funtowicz 2024-07-25 10:21:54 +0000
  • 5afc98a7d7 Snapshot update with vllm paged. use_g6 Nicolas Patry 2024-07-25 12:17:40 +0200
  • 26614057a7 Using g6 instead of g5. (#2281) Nicolas Patry 2024-07-25 11:21:17 +0200
  • e582eed62f Merge branch 'huggingface:main' into model_load_fix_for_qwen2_1.5B Matvey Kolbasov 2024-07-25 10:57:45 +0300
  • 72c97676fd fix: prefer comparing model enum over str drbh 2024-07-24 21:13:07 +0000
  • 9bfa340e34 fix: update lints drbh 2024-07-24 19:44:32 +0000
  • e216e53ea8 fix: improve fbgemm_gpu check and lints drbh 2024-07-24 15:23:10 +0000
  • 382bf59f4f fix: lint and refactor import check and avoid model enum as global names drbh 2024-07-24 14:51:28 +0000
  • 655a9d7ef3 fix: adjust client ruff settings drbh 2024-07-19 17:33:11 +0000
  • 154cf67dad fix: adjust syntax to avoid circular import drbh 2024-07-19 17:24:35 +0000