Commit Graph

  • 900ac49454 Fixing GTPQ device santacoder. Nicolas Patry 2023-07-20 19:08:33 +0000
  • 7faef69015 Give escape hatch to not use exllama kernels even if available. Nicolas Patry 2023-07-20 17:47:09 +0000
  • 8cf7c89910 Small polish. Nicolas Patry 2023-07-20 17:44:37 +0000
  • 0860394489 Refactored a bit. Nicolas Patry 2023-07-20 17:38:50 +0000
  • f555dabca8 Putting back header inclusion (seems unused but still) simpler_exllama Nicolas Patry 2023-07-20 15:46:51 +0000
  • 5ca0508d02 Simpler exllama Nicolas Patry 2023-07-20 15:36:53 +0000
  • bf94df3c71 fix(server): use mem_get_info to get kv cache size (#664) OlivierDehaene 2023-07-20 17:23:49 +0200
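The `mem_get_info` change above sizes the KV cache from the GPU's actually free memory rather than an estimate. A minimal pure-Python sketch of that sizing logic (the function name, model dimensions, and `safety` margin are illustrative assumptions, not values from the commit; in the server, `free_bytes` would come from `torch.cuda.mem_get_info()[0]`):

```python
def num_kv_cache_blocks(free_bytes: int, num_layers: int, num_heads: int,
                        head_dim: int, block_size: int,
                        dtype_bytes: int = 2, safety: float = 0.9) -> int:
    """How many paged KV-cache blocks fit in the reported free memory.

    One block holds keys *and* values (hence the factor 2) for
    `block_size` tokens across every layer and head.
    """
    bytes_per_block = 2 * num_layers * num_heads * head_dim * block_size * dtype_bytes
    return int(free_bytes * safety) // bytes_per_block

# Illustrative numbers: 8 GiB free, LLaMA-7B-like dimensions, fp16 cache.
blocks = num_kv_cache_blocks(8 * 2**30, num_layers=32, num_heads=32,
                             head_dim=128, block_size=16)
print(blocks)  # 921
```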
  • 08b8eec1d7 fix(server): Fixing non parameters in quantize script bigcode/starcoder was an example. (#661) Nicolas Patry 2023-07-20 16:04:15 +0200
  • 3db5d4a654 fix(server): use mem_get_info to get kv cache size OlivierDehaene 2023-07-20 16:00:15 +0200
  • 362883f259 fix(server): llama v2 GPTQ (#648) fxmarty 2023-07-20 15:02:54 +0200
  • 214c06f510 Add trust_remote_code to quantize script (#647) cdawg 2023-07-20 13:53:08 +0200
  • 929e374753 Fixing quantize script on models with non parameters buffers. Nicolas Patry 2023-07-20 11:16:34 +0000
  • a1859012c4 Merge branch 'main' into patch-2 Dong Shin 2023-07-20 11:37:46 +0900
  • 88d753d79b Merge branch 'huggingface:main' into bnb-4bit krzim 2023-07-19 18:11:41 -0400
  • c52a5d4456 add documentation for 4bit quantization options krzim 2023-07-19 22:10:34 +0000
  • 6bf7090ecd fix per-column quantization Felix Marty 2023-07-19 17:55:41 +0000
  • 2080735e16 taste Felix Marty 2023-07-19 16:35:23 +0000
  • 5882768682 nit Felix Marty 2023-07-19 16:31:46 +0000
  • 2b3f65048d line break cdawg 2023-07-19 17:50:46 +0200
  • d6649411c4 Update quantize.py cdawg 2023-07-19 17:43:49 +0200
  • edfbfdfb3f Merge branch 'main' into gptq-cuda-kernels Félix Marty 2023-07-19 16:58:54 +0200
  • 5a1512c025 docs: Update README.md (#643) Nicolas Patry 2023-07-19 13:39:12 +0200
  • 1c81df15cd docs: Update README.md (#639) Nicolas Patry 2023-07-19 13:38:52 +0200
  • fd851a60be Update README.md Nicolas Patry 2023-07-19 12:09:43 +0200
  • b66b190403 feat(router): ngrok edge (#642) OlivierDehaene 2023-07-19 11:59:58 +0200
  • 42f85addaa feat(router): ngrok edge OlivierDehaene 2023-07-19 11:59:21 +0200
  • df61543e0d Update README.md Nicolas Patry 2023-07-19 10:55:04 +0200
  • fe80f5360c feat(server): auto max_batch_total_tokens for flash att models (#630) OlivierDehaene 2023-07-19 09:31:25 +0200
  • 2934543a59 0.98 OlivierDehaene 2023-07-19 02:06:16 +0200
  • 406b094002 0.985 OlivierDehaene 2023-07-19 01:50:19 +0200
  • 0a02801822 try 0.99 OlivierDehaene 2023-07-19 01:26:42 +0200
  • 7f399cd848 revert OlivierDehaene 2023-07-19 01:15:59 +0200
  • 8793ae5890 add clear cache when batch is finished OlivierDehaene 2023-07-19 01:12:28 +0200
  • 0111869ad0 use less memory OlivierDehaene 2023-07-19 00:42:15 +0200
  • 05d2a77e4c reset peak memory OlivierDehaene 2023-07-19 00:17:49 +0200
  • 99568eef7b add tmate OlivierDehaene 2023-07-18 19:43:48 +0200
  • 45d24bea52 sleep to connect to the CI runner OlivierDehaene 2023-07-18 19:29:14 +0200
  • 5e6ddfd6a4 fix(server): fix llamav2 config (#635) v0.9.3 OlivierDehaene 2023-07-18 18:49:42 +0200
  • 4409bcf893 fix(server): fix llamav2 config OlivierDehaene 2023-07-18 18:46:38 +0200
  • cf83f9b66f v0.9.3 (#634) OlivierDehaene 2023-07-18 18:11:20 +0200
  • 7288cb8640 v0.9.3 OlivierDehaene 2023-07-18 18:11:00 +0200
  • 211b211ec0 feat(server): add support for llamav2 (#633) Nicolas Patry 2023-07-18 18:09:53 +0200
  • 36a9bddde4 use max_memory_reserved OlivierDehaene 2023-07-18 18:06:46 +0200
  • 7a60f4d8c3 Llamav2 Post flashv2 Nicolas Patry 2023-07-18 16:55:58 +0200
  • 1686a7c0dc add syncs OlivierDehaene 2023-07-18 17:03:29 +0200
  • 160a50af77 cleanup OlivierDehaene 2023-07-18 16:18:56 +0200
  • de892fb434 revert back to normal allocator OlivierDehaene 2023-07-18 16:11:18 +0200
  • 79616a8796 add block size parameter OlivierDehaene 2023-07-18 12:45:51 +0200
  • d2e3843588 pad to block size OlivierDehaene 2023-07-18 12:04:38 +0200
  • 086d0c2252 update logs OlivierDehaene 2023-07-18 11:43:11 +0200
  • a6b128b293 fix default value OlivierDehaene 2023-07-18 11:41:10 +0200
  • 4201a8be46 fix default value OlivierDehaene 2023-07-18 11:39:14 +0200
  • b165f8b7b7 feat(server): auto max_batch_total_tokens for flash att models OlivierDehaene 2023-07-18 11:33:49 +0200
  • 3b71c38558 feat(server): flash attention v2 (#624) OlivierDehaene 2023-07-18 16:21:18 +0200
  • 751f26b66c fix dockerfile OlivierDehaene 2023-07-18 15:29:02 +0200
  • d186b13c59 fix OlivierDehaene 2023-07-18 12:36:27 +0200
  • 4d38a1c4ad feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) Nicolas Patry 2023-07-18 12:19:05 +0200
  • bc2f351980 abstraction above flash OlivierDehaene 2023-07-18 10:32:10 +0200
  • f400f2d58b use native grouped attention OlivierDehaene 2023-07-18 09:21:22 +0200
  • 4e0d8b2efb export requirements with bnb krzim 2023-07-17 21:10:01 +0000
  • 432ab71be9 add 4bit bnb quantization krzim 2023-07-17 20:23:23 +0000
  • 8ff7d57443 add AutoModel error message for 4bit quantization krzim 2023-07-17 19:31:39 +0000
  • 9c11372d8f add bnb 4bit to quantization enums krzim 2023-07-17 19:31:11 +0000
  • aded1c161e update bnb requirements krzim 2023-07-17 19:29:05 +0000
  • 44acf72a73 fea(launcher): debug logs (#623) OlivierDehaene 2023-07-17 19:03:07 +0200
  • bc2873246c fix(launcher): Rename b-float16 to bfloat16 in the launcher arg (#621) Nicolas Patry 2023-07-17 18:38:16 +0200
  • e856983781 fea(launcher): debug logs OlivierDehaene 2023-07-17 18:37:52 +0200
  • 2d4b31070e fix OlivierDehaene 2023-07-17 17:39:45 +0200
  • 107fcfe9b6 feat(server): flash attention v2 OlivierDehaene 2023-07-17 17:34:55 +0200
  • 5a68b3f751 Rename b-float16 to bfloat16 in the launcher arg (just more usual). Nicolas Patry 2023-07-17 14:34:58 +0200
  • b4ce728b4f Fix env vars Ian 2023-07-17 04:36:04 +0000
  • 0ec4d8182f Update conditionals for dynamic scaling Ian 2023-07-17 01:17:02 +0000
  • f01c11bd0c Implement scaled and dynamically scaled RoPE Ian 2023-07-01 02:18:03 +0000
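The commit above adds the two common RoPE context-extension variants. A minimal pure-Python sketch of the underlying formulas (following the conventions popularized in HF transformers' RoPE scaling; the function names are illustrative, not the commit's API):

```python
def rope_inv_freq(dim: int, base: float = 10000.0) -> list:
    # One inverse frequency per pair of channels, as in standard RoPE.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def linear_scaled_positions(positions, factor: float) -> list:
    # "Scaled" RoPE (position interpolation): compress positions by `factor`
    # so a longer sequence maps back into the trained position range.
    return [p / factor for p in positions]

def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_trained: int, factor: float) -> float:
    # "Dynamically scaled" (NTK-aware) RoPE: keep the base unchanged inside
    # the trained context, grow it as the sequence length exceeds it.
    if seq_len <= max_trained:
        return base
    return base * ((factor * seq_len / max_trained) - (factor - 1)) ** (dim / (dim - 2))
```

Linear scaling trades resolution at all positions, while the dynamic variant only perturbs frequencies once generation runs past the trained context.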
  • abe4e4b1cc fix: LlamaTokenizerFast to AutoTokenizer at flash_llama.py Dong Shin 2023-07-16 21:09:02 +0900
  • a2cf1bdb2f fix(server): empty_cache when stopped OlivierDehaene 2023-07-15 13:57:31 +0200
  • 3e5165c3ed Directly load GPTBigCode to specified device Yang, Bo 2023-07-15 00:32:46 -0700
  • c58a0c185b v0.9.2 (#616) v0.9.2 OlivierDehaene 2023-07-14 16:31:48 +0200
  • 743a8ea9c9 v0.9.2 OlivierDehaene 2023-07-14 15:39:04 +0200
  • 152085461d Merge pull request #1 from bbc/fix-type-hint Matt Haynes 2023-07-14 12:06:04 +0100
  • 5161d4148e Change type hint for backward compatibility with python <3.9 Ciarán Byrne 2023-07-14 12:00:59 +0100
  • 51e3f84453 Remove unused cert from async client Ciarán Byrne 2023-07-14 12:00:16 +0100
  • 5b9de4a1d3 fix(server): blacklist local files (#609) OlivierDehaene 2023-07-13 21:54:55 +0200
  • c8b077be79 docs: README: Add logo + baseline (#611) Victor Muštar 2023-07-13 21:45:20 +0200
  • 7f18519806 move image header to top Victor Muštar 2023-07-13 20:59:43 +0200
  • abb02d6556 Add logo + baseline Victor Muštar 2023-07-13 20:53:15 +0200
  • 982ce3227b feat(router): explicit warning if revision is not set (#608) OlivierDehaene 2023-07-13 18:59:38 +0200
  • 17aefa4c76 fix(server): blacklist local files OlivierDehaene 2023-07-13 18:55:58 +0200
  • e6b4bfac02 feat(router): explicit warning if revision is not set OlivierDehaene 2023-07-13 18:49:31 +0200
  • 74e6d6e54e fix the usual merge mess Felix Marty 2023-07-13 15:48:55 +0000
  • 9401e10210 Merge branch 'main' into gptq-cuda-kernels Félix Marty 2023-07-13 17:45:52 +0200
  • 0036084294 support all, test llama Felix Marty 2023-07-13 15:41:57 +0000
  • ae6256a17a Add option cert param to client Matt Haynes 2023-07-13 14:23:28 +0100
  • b7327205a6 feat(launcher): add arg validation and drop subprocess (#595) OlivierDehaene 2023-07-13 14:22:37 +0200
  • 2ae65b45a8 fix tests Felix Marty 2023-07-13 10:38:08 +0000
  • 82a7f9eb53 Convert example docker command to use :latest rather than being pegged to 0.9 bealbrown 2023-07-12 23:12:05 -0400
  • 38c2be5926 fix test Felix Marty 2023-07-12 18:31:49 +0000
  • 3628559516 GPTQ Env vars: catch correct type of error (#596) ssmi153 2023-07-13 01:57:46 +0800
  • faa5b52fdc Merge branch 'main' into gptq-cuda-kernels Félix Marty 2023-07-12 18:47:30 +0200
  • 8645fd39e1 tests Felix Marty 2023-07-12 16:42:34 +0000
  • f90c61a340 support bits different than 4 Felix Marty 2023-07-12 16:19:25 +0000