Commit Graph

  • b9dffbd512 python now uses v3 OlivierDehaene 2024-06-03 15:50:37 +0200
  • 188c396b88 continue refactoring OlivierDehaene 2024-06-03 15:18:03 +0200
  • dc07ad2691 continue refactoring OlivierDehaene 2024-06-03 15:04:27 +0200
  • ba59da1589 wip OlivierDehaene 2024-06-03 14:23:30 +0200
  • 679c670293 small refactor to make router a bit more agnostic OlivierDehaene 2024-06-03 13:30:46 +0200
  • 4dbb342fe3 small refactor to make router a bit more agnostic OlivierDehaene 2024-06-03 13:30:31 +0200
  • df71aafdcc router: send the input as chunks to the backend Daniël de Kok 2024-06-03 07:27:22 +0000
  • e2855617a1 Whatever. Nicolas Patry 2024-06-03 14:35:46 +0000
  • 8fcbbf9e8b Attempt at yaml. Nicolas Patry 2024-06-03 14:33:06 +0000
  • f3c4d06bae Check temporary. Nicolas Patry 2024-06-03 14:30:35 +0000
  • f92411a57f router: send the input as chunks to the backend Daniël de Kok 2024-06-03 07:27:22 +0000
  • d1d724b027 reable xpu, broken by gptq and setuptool upgrade (#1988) Wang, Yi 2024-06-03 22:07:50 +0800
  • a7c744664c v2.0.1 OlivierDehaene 2024-04-18 17:20:36 +0200
  • 11c16aa64c Upgrading all versions. (#1759) Nicolas Patry 2024-04-18 17:17:40 +0200
  • 918916939f feat: accept list as prompt and use first string (#1702) drbh 2024-04-17 04:41:12 -0400
  • fea0f2f013 fix: bump clients test base url to llama (#1751) drbh 2024-04-16 16:56:47 -0400
  • 65748c7353 Update response type for /v1/chat/completions and /v1/completions (#1747) Lucain 2024-04-16 19:26:32 +0200
  • 2aad5f89bb feat: improve tools to include name and add tests (#1693) drbh 2024-04-16 09:02:46 -0400
  • be4417c310 Fixing CI. (#1748) Nicolas Patry 2024-04-15 18:47:36 +0200
  • 903debac22 Revert comment. Nicolas Patry 2024-06-03 10:08:25 +0000
  • b5d7732922 Fixing tests. Nicolas Patry 2024-06-03 10:06:42 +0000
  • 4de62562a6 Fuse gh action. Nicolas Patry 2024-06-03 10:03:58 +0000
  • 3b082fe5f3 Python 3.9 Nicolas Patry 2024-06-03 09:33:33 +0000
  • 9a59ebcec3 Hotfix GPTQ. Nicolas Patry 2024-06-03 09:32:12 +0000
  • 2eac0951b1 No sccache. Nicolas Patry 2024-06-03 08:59:41 +0000
  • 709d70305d What about /opt ? Morgan Funtowicz 2024-04-30 16:42:01 +0200
  • ed89e464a4 Let's try /usr/bin for sccache for Intel Morgan Funtowicz 2024-04-30 16:38:31 +0200
  • 70690911cb let's see if we really need sudo for intel Morgan Funtowicz 2024-04-30 16:29:55 +0200
  • 740a032ddb Upgrade all the actions deps Morgan Funtowicz 2024-04-30 16:28:21 +0200
  • 6c0b41c037 Let's try with Python 3.8 instead of 3.9 Morgan Funtowicz 2024-04-30 16:26:58 +0200
  • f9786a29ba Upgrade Python setup for intel Morgan Funtowicz 2024-04-30 16:25:03 +0200
  • 49b93d8d18 Let's dispatch Intel XPU on the right runner group Morgan Funtowicz 2024-04-30 16:22:33 +0200
  • 00ffe4fae0 OK let's duplicate the job and dispatch on different labels Morgan Funtowicz 2024-04-30 16:19:59 +0200
  • c0ba3ef92e Enable TGI on XPU tests Morgan Funtowicz 2024-04-30 16:04:40 +0200
  • 1852d107bb remove gptq change Wang, Yi A 2024-06-03 01:45:56 -0700
  • 0b3f71c6f6 Merge branch 'main' into xpu_fix Wang, Yi A 2024-06-03 01:44:38 -0700
  • 9add5d0af5 Fixing GPTQ imports. (#1994) Nicolas Patry 2024-06-03 10:36:29 +0200
  • 2ae9ed20fb Update server/text_generation_server/layers/gptq/__init__.py Nicolas Patry 2024-06-03 10:36:04 +0200
  • 4188a25b15 Update server/text_generation_server/layers/gptq/__init__.py Nicolas Patry 2024-06-03 10:35:50 +0200
  • 9a9b679c33 Fixing indirect GPTQ loads. Nicolas Patry 2024-06-03 08:20:37 +0000
  • ff5ca67f58 WIP maintenance/merge-vlm-input-prep Daniël de Kok 2024-05-31 16:14:27 +0000
  • ebeea9daf8 router: send the input as chunks to the backend Daniël de Kok 2024-05-30 12:31:35 +0000
  • fc52ba61ab router: send the input as chunks to the backend Daniël de Kok 2024-06-03 07:27:22 +0000
  • 09590956a4 Merge branch 'main' into xpu_fix Wang, Yi A 2024-06-02 18:10:14 -0700
  • 799a193b10 Fixing Phi3. fix_phi3 Nicolas Patry 2024-06-01 08:47:00 +0000
  • 08b3eac2ce single char ` addition for docs (#1989) Nicholas Broad 2024-05-31 09:42:14 -0700
  • 64a4b88766 Fixing the CLI. Nicolas Patry 2024-05-31 16:06:34 +0000
  • 5ab4cef67e Fixing exl2 scratch buffer. (#1990) Nicolas Patry 2024-05-31 18:01:43 +0200
  • 06edde9491 Purely refactors paged/attention into layers/attention and make hardware differences more obvious with 1 file per hardware. (#1986) Nicolas Patry 2024-05-31 17:57:01 +0200
  • b0c168d249 Update server/text_generation_server/layers/attention/xpu.py Nicolas Patry 2024-05-31 17:56:08 +0200
  • 5b58262fea Fixing exl2 scratch buffer. Nicolas Patry 2024-05-31 15:18:44 +0000
  • 37f955dd14 single char ` addition Nicholas Broad 2024-05-31 08:08:59 -0700
  • 9c722a4e35 reable xpu, broken by gptq and setuptool upgrade Wang, Yi A 2024-05-31 07:52:31 -0700
  • 659bd67fec Update documentation version to 2.0.4 (#1980) fxmarty 2024-05-31 07:03:24 -0700
  • d44688b6ac Adress comments + fix 2nd path in falcon. Nicolas Patry 2024-05-31 12:43:13 +0000
  • c67539fbcc Update server/text_generation_server/utils/import_utils.py Nicolas Patry 2024-05-31 12:51:35 +0200
  • 91f55ea2b5 Removing flash decoding part so it gets merged. Nicolas Patry 2024-05-31 10:16:30 +0000
  • a4d81d623d update doc fxmarty 2024-05-30 13:38:06 +0200
  • be87c840b8 Update router/src/infer.rs Nicolas Patry 2024-05-30 11:25:37 +0200
  • 13caf958eb Enabling custom block size schedule. Nicolas Patry 2024-05-30 05:17:00 +0000
  • cf59593454 Fixing falcon. Nicolas Patry 2024-05-29 18:34:34 +0000
  • a76e650283 Fix cohere. Nicolas Patry 2024-05-29 17:41:15 +0000
  • daddd2e90b Revamped all this architecture. Nicolas Patry 2024-05-29 17:36:04 +0000
  • 7890cd66f7 Fixing cohere flash decoding. Nicolas Patry 2024-05-29 16:04:36 +0000
  • a6f1603525 Missing cohere. Nicolas Patry 2024-05-29 15:46:53 +0000
  • 50d5c08b15 Router logic knows about page size. Nicolas Patry 2024-05-29 15:37:46 +0000
  • 7a29e82629 Fixing non flash decoding llama path. Nicolas Patry 2024-05-29 12:35:32 +0000
  • 6aeb5a73a1 HHachweew Hack to make other models work. Nicolas Patry 2024-05-29 10:52:09 +0000
  • 6bbc843097 Speedup flashdecoding. Nicolas Patry 2024-05-24 16:10:42 +0000
  • ed96a76d67 REvert changes in modeling. Nicolas Patry 2024-05-24 14:18:00 +0000
  • be8c14be8b Less intrusive. Nicolas Patry 2024-05-24 14:15:33 +0000
  • 8171747e4f Fix after rebase.. Nicolas Patry 2024-05-23 12:42:19 +0000
  • 4fd3065d9c Using flash decoding Nicolas Patry 2024-05-17 08:43:33 +0000
  • 967ced2ff4 Gemma GPTQ checks: skip logprob checks Daniël de Kok 2024-05-30 07:10:10 +0000
  • 36dd16017c Add support for exl2 quantization Daniël de Kok 2024-05-28 09:51:31 +0000
  • 03699839a4 Gemma GPTQ checks: skip logprob checks Daniël de Kok 2024-05-30 07:10:10 +0000
  • 0e8f8726db Warmup all decode buckets (#152) Karol Damaszke 2024-05-29 22:46:55 +0200
  • 7b879fd1d8 Pad next token chooser parameters with empty logits processors (#151) Karol Damaszke 2024-05-29 22:43:56 +0200
  • 3fa24fb217 Add support for exl2 quantization Daniël de Kok 2024-05-28 09:51:31 +0000
  • cbced7f0f9 feat: adjust attn weight loading logic (#1975) drbh 2024-05-29 12:42:11 -0400
  • 3cf4354944 feat: adjust attn weight loading logic drbh 2024-05-29 15:05:57 +0000
  • 129f0ed603 fix: adjust whl names and upload all precompile-kernels-workflow drbh 2024-05-29 09:50:58 -0400
  • 58ac1d7e9b feat: add basic workflow drbh 2024-05-28 22:18:44 -0400
  • 6499b8e213 fix: update workflow trigger again support-pre-compile-kernels drbh 2024-05-28 22:11:21 -0400
  • 4692347140 fix: edit source to build drbh 2024-05-28 22:07:36 -0400
  • b7161c8308 fix: simplify workflow trigger trigger again drbh 2024-05-28 22:04:12 -0400
  • ba47345e1b fix: revert changes and change name drbh 2024-05-28 21:59:47 -0400
  • ca59ef23db fix: force ci to run drbh 2024-05-28 21:54:24 -0400
  • 14088638de feat: precompile kernels drbh 2024-05-28 21:44:21 -0400
  • 1bf32d970f fix: install hf cli before upload pip-installable drbh 2024-05-28 18:04:23 +0000
  • dab44ac1af feat: upload assets to hub rather than github drbh 2024-05-28 12:03:18 -0400
  • da1a0b3412 fix: set cuda arch list prior to vllm build drbh 2024-05-27 09:00:34 -0400
  • ad94f299f4 feat: compile vllm for cuda after flash_attn drbh 2024-05-26 23:21:07 -0400
  • 8253f83034 fix: build kernels inside of repo and move to single dist drbh 2024-05-22 00:34:44 +0000
  • ec8c638d2b feat: cache wheel as build artifact drbh 2024-05-21 22:46:54 +0000
  • 7765aa6ecd fix: adjust skip build typo drbh 2024-05-21 21:56:29 +0000
  • 814e07dffe fix: build proto in CI and avoid rate limit in client test drbh 2024-05-21 21:43:25 +0000
  • 2ee4b9f77f fix: adjust upload command drbh 2024-05-21 20:29:00 +0000
  • 47e19377cb fix: skip redundant login drbh 2024-05-21 20:04:38 +0000
  • 01e68b5acf feat: upload single pre compile to hub drbh 2024-05-21 19:45:49 +0000