Commit Graph

  • e0d168ba20 Use GPTQ-Marlin for supported GPTQ configurations (#2111) Daniël de Kok 2024-07-01 12:59:12 +0200
  • de96056c26 feat: download lora adapter weights from launcher (#2140) drbh 2024-07-01 06:58:49 -0400
  • 3e02d4fdbf fix: use weights from base_layer (#2141) drbh 2024-07-01 06:58:40 -0400
  • 03691f6d34 Fixing clippy. (#2149) Nicolas Patry 2024-07-01 12:02:19 +0200
  • 8721b601e3 fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148) Wang, Yi 2024-07-01 17:27:53 +0800
  • 69514868ee fix: refactor post_processor logic and add test (#2137) drbh 2024-06-27 17:16:19 -0400
  • bc15e960ea Fixing gemma2. (#2135) Nicolas Patry 2024-06-27 16:04:20 +0200
  • befe60b566 Fixing malformed rust tokenizers (#2134) Nicolas Patry 2024-06-27 16:04:03 +0200
  • d731866245 Idefics2: sync added image tokens with transformers (#2080) Daniël de Kok 2024-06-27 15:54:35 +0200
  • 11fced79bd Bumping to 2.1 (#2131) Nicolas Patry 2024-06-27 12:34:43 +0200
  • 7045598b20 Fixing prom leak by upgrading. (#2129) Nicolas Patry 2024-06-27 08:08:43 +0200
  • 399919d715 fix: simplify kserve endpoint and fix imports (#2119) drbh 2024-06-25 19:30:10 -0400
  • 4700ea413f Add support for Marlin 2:4 sparsity (#2102) Daniël de Kok 2024-06-25 21:09:42 +0200
  • 18a8364d94 Support AWQ quantization with bias (#2117) Daniël de Kok 2024-06-25 21:09:00 +0200
  • 8a155b2d5b Enable multiple LoRa adapters (#2010) drbh 2024-06-25 14:46:27 -0400
  • 8980bf43d7 Fix CI . (#2118) Nicolas Patry 2024-06-25 17:53:36 +0200
  • 136fb7e9b9 Add pytest release marker (#2114) Daniël de Kok 2024-06-25 16:53:20 +0200
  • 27ae4f7916 fix cpu and xpu issue (#2116) Wang, Yi 2024-06-25 22:47:06 +0800
  • d626685039 Removing IPEX_AVAIL. (#2115) Nicolas Patry 2024-06-25 13:20:57 +0200
  • 1f70bb75e3 feat: add simple tests for weights (#2092) drbh 2024-06-25 06:22:59 -0400
  • 0d879fe66e Cpu tgi (#1936) Wang, Yi 2024-06-25 18:21:29 +0800
  • a9faabc374 fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089) sunxichen 2024-06-25 16:59:50 +0800
  • e49aed4713 use xpu-smi to dump used memory (#2047) Wang, Yi 2024-06-25 16:15:46 +0800
  • 1952a0b03b corrected Pydantic warning. (#2095) Jeff 2024-06-25 04:10:32 -0400
  • 76c6a5ca2a Add OTLP Service Name Environment Variable (#2076) KevinDuffy94 2024-06-25 08:33:01 +0100
  • 931ff16c7a Support HF_TOKEN environment variable (#2066) Lucain 2024-06-25 09:23:12 +0200
  • 4b25048b75 Fix cargo-chef prepare (#2101) ur4t 2024-06-25 00:16:36 +0800
  • a05f3849e4 do not set sliding_window if SUPPORTS_WINDOWING is false Wang, Yi A 2024-09-23 20:48:43 -0700
  • b6a59e2f91 New runner. Manual squash. (#2110) Nicolas Patry 2024-06-24 18:08:34 +0200
  • d930724e82 feat: sort cuda graphs in descending order (#2104) drbh 2024-06-21 14:28:26 -0400
  • f0ed8d294f Fix text-generation-server quantize (#2103) Daniël de Kok 2024-06-21 15:28:51 +0200
  • c61ef1ce85 Factor out sharding of packed tensors (#2059) Daniël de Kok 2024-06-20 09:56:04 +0200
  • 38741feff0 Support exl2-quantized Qwen2 models (#2085) Daniël de Kok 2024-06-20 07:56:16 +0200
  • 6b2cbd0169 Set maximum grpc message receive size to 2GiB (#2075) Daniël de Kok 2024-06-17 16:40:44 +0200
  • b3dadbde06 fix build.rs watch files (#2072) Ziru Niu 2024-06-17 18:10:01 +0800
  • 58c743bc90 Contributing guide & Code of Conduct (#2074) Lysandre Debut 2024-06-17 12:09:31 +0200
  • fb939370a3 Support different image sizes in prefill in VLMs (#2065) Daniël de Kok 2024-06-17 10:49:41 +0200
  • 8ee52e91f4 Adding architecture document (#2044) Alvaro Moran 2024-06-14 15:28:34 +0200
  • b07a2518d9 Update the link for qwen2 (#2068) Tiezhen WANG 2024-06-14 17:59:33 +0800
  • f1f28404e7 Add support for GPTQ Marlin (#2052) Daniël de Kok 2024-06-14 09:45:42 +0200
  • 7ce29b1ef2 implement Open Inference Protocol endpoints (#1942) drbh 2024-06-13 12:51:51 -0400
  • d0a1d50fd3 PR #2049 CI run (#2054) drbh 2024-06-13 11:53:49 -0400
  • 2fdad64ece fix(layers): fix SuRotaryEmbedding (#2060) OlivierDehaene 2024-06-12 18:24:47 +0200
  • e85e7ac4f9 fix(server): fix OPT implementation (#2061) OlivierDehaene 2024-06-12 18:22:20 +0200
  • 99c947452d Support chat response format (#2046) drbh 2024-06-11 10:44:56 -0400
  • eb8b76d1d2 Update LLMM1 bound (#2050) fxmarty 2024-06-11 13:30:29 +0200
  • 5381fa7393 fix(ci): remove unnecessary permissions (#2045) Luc Georges 2024-06-10 18:16:53 +0200
  • ac73317894 feat(ci): add trufflehog secrets detection (#2038) Luc Georges 2024-06-10 17:54:13 +0200
  • 748764efb4 Add Phi-3 medium support (#2039) Daniël de Kok 2024-06-10 09:22:29 +0200
  • 5e035063cf ROCm and sliding windows fixes (#2033) fxmarty 2024-06-10 09:09:50 +0200
  • 93663b4567 server: use chunked inputs Daniël de Kok 2024-05-31 11:51:42 +0000
  • 63cd798a19 Xpu gqa (#2013) Wang, Yi 2024-06-07 01:12:57 +0800
  • 0494677284 Internal runner ? (#2023) Nicolas Patry 2024-06-06 18:51:42 +0200
  • 7aaec2a542 marlin: improve build Daniël de Kok 2024-06-06 11:25:56 +0000
  • e6d8d2e50f marlin: support tp>1 when group_size==-1 Daniël de Kok 2024-06-06 11:51:52 +0000
  • 77ac0f364b Add support for Marlin-quantized models Daniël de Kok 2024-06-05 08:14:40 +0000
  • 9c75591c11 Revert "Less cache misses on cargo build." Nicolas Patry 2024-06-06 10:33:55 +0200
  • 346f77f8ba Less cache misses on cargo build. Nicolas Patry 2024-06-06 10:33:01 +0200
  • 2c1ff79d38 Update __version__ on __init__.py to 0.7.0 (#2017) Andrés Marafioti 2024-06-05 14:51:07 +0200
  • af9d60c985 Fix GPTQWeight import (#2020) Daniël de Kok 2024-06-05 14:49:15 +0200
  • 8ee07f0eae Fixing rocm. (#2021) Nicolas Patry 2024-06-05 14:41:34 +0200
  • 20df9234a9 feat: move allocation logic to rust (#1835) OlivierDehaene 2024-06-05 12:18:38 +0200
  • cdd120ac02 Do not initialize scratch space when there are no ExLlamaV2 layers (#2015) Daniël de Kok 2024-06-05 10:45:47 +0200
  • 353a9669ba Hotfixing make install. (#2008) Nicolas Patry 2024-06-04 23:34:03 +0200
  • ed8913535b Making make install work better by default. (#2004) Nicolas Patry 2024-06-04 19:38:46 +0200
  • 648dd7b8e1 Support GPTQ models with column-packed up/gate tensor (#2006) Daniël de Kok 2024-06-04 19:37:49 +0200
  • 184c89fd55 feat: add SchedulerV3 (#1996) OlivierDehaene 2024-06-04 15:56:56 +0200
  • 63de9ff020 fix: update triton implementation reference (#2002) Emmanuel Ferdman 2024-06-04 15:26:35 +0300
  • 75aed8aed5 Fix Phi-2 with tp>1 (#2003) Daniël de Kok 2024-06-04 14:26:07 +0200
  • d51f2c465f router: send the input as chunks to the backend Daniël de Kok 2024-06-03 07:27:22 +0000
  • 347ecdae3b reable xpu, broken by gptq and setuptool upgrade (#1988) Wang, Yi 2024-06-03 22:07:50 +0800
  • b3b175568f Hotfix GPTQ. Nicolas Patry 2024-06-03 09:32:12 +0000
  • b30b2a6dae Fixing GPTQ imports. (#1994) Nicolas Patry 2024-06-03 10:36:29 +0200
  • 7752f1050b Fixing Phi3. Nicolas Patry 2024-06-01 08:47:00 +0000
  • c46a223a6d single char ` addition for docs (#1989) Nicholas Broad 2024-05-31 09:42:14 -0700
  • d1473fab70 Fixing exl2 scratch buffer. (#1990) Nicolas Patry 2024-05-31 18:01:43 +0200
  • bdc676f65c Purely refactors paged/attention into layers/attention and make hardware differences more obvious with 1 file per hardware. (#1986) Nicolas Patry 2024-05-31 17:57:01 +0200
  • dd2d46d9d1 Update documentation version to 2.0.4 (#1980) fxmarty 2024-05-31 07:03:24 -0700
  • f6c5e078d5 Gemma GPTQ checks: skip logprob checks Daniël de Kok 2024-05-30 07:10:10 +0000
  • 628d6a13da Add support for exl2 quantization Daniël de Kok 2024-05-28 09:51:31 +0000
  • 4dca35fc62 feat: adjust attn weight loading logic (#1975) drbh 2024-05-29 12:42:11 -0400
  • 2b204f0479 Fixing the text part from tokenizer endpoint. (#1967) Nicolas Patry 2024-05-28 16:55:36 +0200
  • 9a1475d816 Fix (non-container) pytest stdout buffering-related lock-up Daniël de Kok 2024-05-28 07:25:14 +0000
  • cbd5d67101 Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959) Nicolas Patry 2024-05-28 14:52:17 +0200
  • e3d4483f9b fix small typo and broken link (#1958) Moritz Laurer 2024-05-27 17:31:06 +0200
  • 1213b6a817 Processor config chat template (#1954) drbh 2024-05-27 10:03:16 -0400
  • 1439b26cd4 Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953) Daniël de Kok 2024-05-27 14:41:28 +0200
  • 742ef9b8e5 Fix (flash) Gemma prefix and enable tests Daniël de Kok 2024-05-24 15:34:42 +0000
  • 479f1953ba Fix seeded output. (#1949) Nicolas Patry 2024-05-24 15:36:13 +0200
  • 92a1e0fbae Aligin the source code with main branch 2.0.4 yuanwu 2024-09-24 03:06:55 +0000
  • 4ac0cd2339 Use Default trait when parameters: null Alvaro Bartolome 2024-09-23 21:23:39 +0200
  • 8ef3da72e1 Fix /vertex payload parsing when MESSAGES_API_ENABLED Alvaro Bartolome 2024-09-23 20:39:20 +0200
  • a50e90e7e2 added v2 OlivierDehaene 2024-09-23 18:49:37 +0200
  • 6e105c8eb8 wip OlivierDehaene 2024-09-23 18:00:59 +0200
  • ae2c85f485 Update docs/source/quicktour.md Aritra Roy Gosthipaty 2024-09-23 16:50:34 +0530
  • f92ff5cdab Update docs/source/quicktour.md Aritra Roy Gosthipaty 2024-09-23 16:50:27 +0530
  • ee9f0b56c5 Update docs/source/quicktour.md Aritra Roy Gosthipaty 2024-09-23 15:47:26 +0530
  • 5b78abee4b chore: adding note for private models in quicktour doc ariG23498 2024-09-23 15:17:56 +0530
  • 9263817c71 nix: remove unused _server.nix file (#2538) Daniël de Kok 2024-09-23 09:43:23 +0200
  • 956f02ed40 Update the link to the Ratatui organization Orhun Parmaksız 2024-09-21 10:15:16 +0200