Commit Graph

  • 22d9249c4a Fix the bug Sadra Barikbin 2024-08-07 23:48:15 +0330
  • 4dc67e4ef3 Version check, doc fixes (#182) Vidya Galli 2024-08-07 13:09:51 -0700
  • 9b71343328 Integrate flash attention for starcoder2 tgi through habana and some fixes, enabling (#198) Abhilash Majumder 2024-08-08 01:36:05 +0530
  • dea2b747d1 Apply suggestions from code review Vaibhav Srivastav 2024-08-07 19:50:02 +0200
  • 3c9a840362 update readme with latest quants info Vaibhav Srivastav 2024-08-07 15:55:10 +0200
  • 76ba66b4f8 Update Quantization docs and minor doc fix. Vaibhav Srivastav 2024-08-07 15:33:38 +0200
  • 4430e123d4 add intel-cpu docker image Wang, Yi A 2024-08-07 00:02:31 -0700
  • 6abcab843d fix in regression in ipex flashattention Wang, Yi A 2024-08-06 22:00:36 -0700
  • 30e70f2ceb Merge branch 'main' into hot_fix_xpu Wang, Yi A 2024-08-06 21:57:25 -0700
  • 3e41ec28c7 add gptj modeling Wang, Yi A 2024-08-01 18:26:04 -0700
  • 133015f408 fix: prefer original layernorm names for 180B (#2365) drbh 2024-08-06 15:25:30 -0400
  • 3280d59f19 fix: prefer original layernorm names for 180B drbh 2024-08-06 19:08:09 +0000
  • a64d407d64 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) drbh 2024-08-06 13:33:22 -0400
  • fecc66d736 fix: default num_ln_in_parallel_attn to one if not supplied drbh 2024-08-06 17:21:10 +0000
  • 1768c00b9f feat: return the generated text when parsing fails (#2353) drbh 2024-08-06 13:10:19 -0400
  • f8a5b381fe feat: prefer stop over eos_token to align with openai finish_reason (#2344) drbh 2024-08-06 13:09:50 -0400
  • e557855558 revert mkl Mohit Sharma 2024-08-06 12:46:35 +0000
  • d61f7e63fa fix clone Mohit Sharma 2024-08-06 12:39:49 +0000
  • f230da8d63 Keeping the benchmark somewhere feature/radix-prefix-cache-bench Daniël de Kok 2024-08-06 12:36:15 +0000
  • e11f5f1c38 feat: implement a templated endpoint for visibility into chat requests (#2333) drbh 2024-08-06 07:51:32 -0400
  • 29b8d19cdf fix: return the out tensor rather then the functions return value (#2361) drbh 2024-08-06 07:49:53 -0400
  • 7865851c02 Test partially overlapping prefills. Daniël de Kok 2024-08-06 11:45:20 +0000
  • 2a255ad719 Initial radix cache tests Daniël de Kok 2024-08-06 10:50:18 +0000
  • 5788c942a5 ix issues Mohit Sharma 2024-08-06 10:29:46 +0000
  • 6486887b43 Add radix cache free, improve allocate Daniël de Kok 2024-08-06 08:57:34 +0000
  • 516b43f006 fix: return the out tensor rather then the functions return value drbh 2024-08-05 19:06:55 +0000
  • 4379f0650a feat: add release and sha tagged images inlcude-latest-release-on-commit-builds-tags drbh 2024-08-05 13:13:52 -0400
  • dd47a3dac4 feat: include local lora adapter loading docs (#2359) drbh 2024-08-05 12:36:44 -0400
  • 9415b90892 First radix allocation bits Daniël de Kok 2024-08-05 15:55:14 +0000
  • 07ede8d8e5 Fix splitting Daniël de Kok 2024-08-05 15:54:56 +0000
  • 6c547a69dc feat: include local lora adapter loading docs drbh 2024-08-05 09:23:28 -0400
  • 215ed3ad52 fix: attempt forward on flash attn2 to check hardware support (#2335) drbh 2024-08-05 09:11:40 -0400
  • 55e6059eb1 update torch Mohit Sharma 2024-08-05 12:47:21 +0000
  • 0ad78d20a5 style Mohit Sharma 2024-08-05 10:12:46 +0000
  • ab2ab2a0aa pre-commit feature/no_repeat_ngram_size_ci erikkaum 2024-08-05 13:01:19 +0200
  • 05a1d1b83a forgot fixture erikkaum 2024-08-05 12:49:24 +0200
  • 5da696046e Block/node eviction Daniël de Kok 2024-08-05 09:46:37 +0000
  • 82d7914761 delete release decorator erikkaum 2024-08-05 11:15:15 +0200
  • c4258e40fe feat: simplify prepare_chat_input logic and adjust start stop chars drbh 2024-08-02 23:35:28 +0000
  • 645a6f8068 fix: typo tweak drbh 2024-08-02 21:48:04 +0000
  • ad942a1d79 fix: avoid changing conditional drbh 2024-08-02 21:46:58 +0000
  • afc0fb5adf fix: simplify changes and revert model changes drbh 2024-08-02 19:01:58 +0000
  • cf27954257 fix: update sliding window conditional drbh 2024-08-02 17:39:08 +0000
  • ce76f4ccc3 Update parent refcounts when inserting a child Daniël de Kok 2024-08-02 15:02:28 +0000
  • aa1c96a7a4 Access times Daniël de Kok 2024-08-02 14:17:56 +0000
  • 9caee1f368 Parent links Daniël de Kok 2024-08-02 13:54:38 +0000
  • 590fc2c58d Double linked data structures are still terrible in Rust. Daniël de Kok 2024-08-02 13:50:56 +0000
  • ed83bfe0ff fix: Fix Poetry path Hugo Larcher 2024-08-02 15:10:15 +0200
  • 690d631d68 fix: Update runners Hugo Larcher 2024-08-02 14:54:34 +0200
  • 28b8a4287d Merge branch 'main' into feat/add-load-test Hugo Larcher 2024-08-02 14:44:37 +0200
  • cd4933cd5a Trie insertion/lookup Daniël de Kok 2024-08-02 11:05:03 +0000
  • c9916107b7 Add FlashInfer support Daniël de Kok 2024-08-01 11:02:22 +0000
  • d34ffc4fe9 Refile the hpu warmup yuanwu 2024-08-02 04:36:59 +0000
  • e9842ceef2 feat: return the generated text when parsing fails drbh 2024-08-01 23:40:23 +0000
  • fe5c19d155 Fix unsigned integer underflow Max de Bayser 2024-08-01 15:53:32 -0300
  • 5b649d67c4 fix: improve condtional and error message drbh 2024-08-01 16:17:29 +0000
  • 060b2db0df add 'mamba' as model config fix/parse-mamba-config erikkaum 2024-08-01 18:16:32 +0200
  • 82240bf44c Update server/text_generation_server/utils/logits_process.py Erik Kaunismäki 2024-08-01 17:38:09 +0200
  • cae28dcbf1 fix: prefer version check over test op and avoid window_size_left if not flash attn2 drbh 2024-08-01 15:06:09 +0000
  • 743d37812d WIP Daniël de Kok 2024-08-01 15:05:51 +0000
  • 47447ef017 Unify attention output handling (#2343) Daniël de Kok 2024-08-01 17:03:28 +0200
  • 5d482d4da2 Port over block allocator interface (with token ids) Daniël de Kok 2024-08-01 13:41:07 +0000
  • 4562c16048 Use a block size of 1 for FlashInfer Daniël de Kok 2024-08-01 11:20:42 +0000
  • 8fb8e1da78 Add FlashInfer support Daniël de Kok 2024-08-01 11:02:22 +0000
  • fe41e13b45 Unify attention output handling Daniël de Kok 2024-07-31 14:53:58 +0000
  • 22fb1be588 Fix cache block size for flash decoding (#2351) Daniël de Kok 2024-08-01 15:38:57 +0200
  • f484bcb552 Also run CI on changes to backends Daniël de Kok 2024-08-01 13:01:01 +0000
  • 278697cf55 Fix cache block size for flash decoding Daniël de Kok 2024-08-01 12:34:34 +0000
  • 0ba10078e8 pre-commit again erikkaum 2024-08-01 13:57:49 +0200
  • f5a6691d0e pre-commit erikkaum 2024-08-01 13:37:45 +0200
  • dab00af971 fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig Islam Almersawi 2024-08-01 14:35:00 +0400
  • 98e790e32a add param in healthcheck erikkaum 2024-08-01 12:10:10 +0200
  • 7186ab8e8e draft of unit integration test erikkaum 2024-07-31 17:29:45 +0200
  • 54b45be38d add missed commit Nathan Brake 2024-07-23 12:44:50 -0400
  • d8d3c4678e run update docs erikkaum 2024-07-25 18:03:21 +0200
  • 72cade84f9 fix pre-commit checks erikkaum 2024-07-25 18:01:52 +0200
  • f6324ffb3a delete the last no repeat processor from warpers erikkaum 2024-07-25 17:31:04 +0200
  • 6353e2417b satisfy compiler erikkaum 2024-07-18 18:04:00 +0200
  • d0eef2b552 make nrns optional Nathan Brake 2024-07-15 13:58:34 +0000
  • 10b940559a update docs Nathan Brake 2024-07-15 13:55:43 +0000
  • ea915ad7d7 Add support for no_repeat_ngram_size Nathan Brake 2024-07-15 13:51:11 +0000
  • 9ab9937414 enable HuggingFaceM4/idefics-9b in intel gpu (#2338) Wang, Yi 2024-08-01 17:08:36 +0800
  • 2c288866a7 Unify attention output handling Daniël de Kok 2024-07-31 14:53:58 +0000
  • c3e874aaf5 fix EleutherAI/gpt-neox-20b does not work in tgi Wang, Yi A 2024-07-31 22:29:12 -0700
  • e4a0bf3b71 fix: allocate tmp based on sgmv kernel if available drbh 2024-07-31 22:15:52 +0000
  • acb41a5e6f (chore) fmt ... why? Morgan Funtowicz 2024-07-31 20:38:30 +0000
  • 40658f4e84 fix: adjust return type drbh 2024-07-30 13:50:10 +0000
  • 26b954dfd3 feat: improve to tokenize too drbh 2024-07-30 13:48:13 +0000
  • 62d7be3727 feat: implement a templated endpoint for visibility into chat requests drbh 2024-07-30 13:06:52 +0000
  • ca8ad2dbee feat: prefer stop over eos_token to align with openai finish_reason drbh 2024-07-31 15:34:48 +0000
  • 7451041ecd refactor usage stats (#2339) Erik Kaunismäki 2024-07-31 16:29:07 +0200
  • f7f61876cf Pr 2290 ci run (#2329) drbh 2024-07-31 10:27:15 -0400
  • 290e7bd173 delete option around usage stats arg erikkaum 2024-07-31 14:51:10 +0200
  • e51171a18d Update router/src/server.rs Erik Kaunismäki 2024-07-31 14:41:10 +0200
  • 6df2557910 (launcher) default new server::run parameters to false for now Morgan Funtowicz 2024-07-31 09:06:52 +0000
  • 81682561bd (docker) build ompi with SLURM support Morgan Funtowicz 2024-07-31 09:06:24 +0000
  • 4ff17caaed (docker) let's put rust in the TRTLLM folder when building Morgan Funtowicz 2024-07-31 09:06:11 +0000
  • 4c1e234266 (backend) use parking_lot crate for RwLock fairness Morgan Funtowicz 2024-07-31 12:30:53 +0000
  • f476b0cc34 fix pre-commit erikkaum 2024-07-31 13:24:52 +0200
  • 34f7dcfd80 Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300) Daniël de Kok 2024-07-31 13:08:41 +0200