Commit Graph

  • f36c9a68ae
    refine the code according to the review command Wang, Yi A 2024-10-14 21:01:54 -0700
  • 645369bef7
    set kv cache dtype Wang, Yi A 2024-10-08 08:00:06 -0400
  • dd3fb81719
    fix ci failure Wang, Yi A 2024-09-09 23:54:55 -0700
  • 61fe28e8f7
    add gptq and awq int4 support in intel platform Wang, Yi A 2024-08-21 22:47:34 -0700
  • 46b14e6b28
    Remove all references to habana_quantization_toolkit for 1.18 (#229) Thanaji Rao Thakkalapelli 2024-10-18 01:59:59 -0700
  • 21c13ff3a6
    Remove References to torch compile mode in readme (#236) Thanaji Rao Thakkalapelli 2024-10-17 14:07:51 -0700
  • 8ec57558cd
    Break cycle between the attention implementations and KV cache (#2627) Daniël de Kok 2024-10-17 14:54:22 +0200
  • 5f32dea1e2
    fix: prefer inplace softmax to avoid copy (#2661) drbh 2024-10-17 08:49:02 -0400
  • 3e0a82d512
    Update server/text_generation_server/models/flash_causal_lm.py drbh 2024-10-17 08:48:52 -0400
  • 90553c1dd4
    Break cycle between the attention implementations and KV cache Daniël de Kok 2024-10-09 08:32:04 +0000
  • 1b97e084bf
    fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663) oOraph 2024-10-17 11:15:26 +0200
  • 13fe82264b
    Fixing "deadlock" when python prompts for trust_remote_code by always specifiying a value. Nicolas Patry 2024-10-17 10:58:07 +0200
  • 59ea38cbca
    Simplify the attention function (#2609) Daniël de Kok 2024-10-17 10:42:52 +0200
  • 5bbe1ce028
    Support e4m3fn KV cache (#2655) Daniël de Kok 2024-10-17 10:42:16 +0200
  • b240fd139a
    tgi-entrypoint: exec instead of spawning a child process Raphael Glon 2024-10-10 18:26:20 +0200
  • 7822bfd68f
    Fixup flashinfer support Daniël de Kok 2024-10-17 07:56:51 +0000
  • 8d7448de9f
    fix: prefer inplace softmax to avoid copy David Holtz 2024-10-17 02:53:32 +0000
  • 2326f2b875
    Remove References to torch compile mode in readme Thanaji 2024-10-16 22:45:26 +0300
  • 751f1bb815
    Make check more obvious Daniël de Kok 2024-10-16 13:54:57 +0000
  • 07128cc178
    Simplify the attention function Daniël de Kok 2024-10-04 09:42:20 +0000
  • a6a0c97ed9
    feat: prefill chunking (#2600) OlivierDehaene 2024-10-16 12:49:33 +0200
  • 8ae5d4c7d6
    Ignore EOS for benchmark by using TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN (#234) Sun Choi 2024-10-16 02:57:36 -0700
  • 812aa1c01d
    Fix env name Nicolas Patry 2024-10-16 11:18:22 +0200
  • 52eaa1f4d8
    Put back non default simple tests. Nicolas Patry 2024-10-16 11:17:53 +0200
  • ff36b2fb39
    Add simple resolution when user specifies ATTENTION=paged. Nicolas Patry 2024-10-16 10:56:58 +0200
  • 5c72f269b6
    Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). Nicolas Patry 2024-10-16 10:46:03 +0200
  • 594a2b4a3d
    rename OlivierDehaene 2024-10-16 10:23:21 +0200
  • 704a58c807
    Fp8 e4m3_fnuz support for rocm (#2588) Mohit Sharma 2024-10-16 13:24:50 +0530
  • aa92e451a0
    Support e4m3fn KV cache Daniël de Kok 2024-10-16 07:48:10 +0000
  • d07e7f4f62
    Merge pull request #233 from huggingface/fix_sysntax Mandy Li 2024-10-15 14:33:21 -0700
  • 87a1cee32c
    Fix sysntax error in PR 232 Thanaji Rao Thakkalapelli 2024-10-15 13:23:48 -0700
  • e06320f64e
    Enabling Flash Attention support for falcon model (#232) Thanaji Rao Thakkalapelli 2024-10-15 10:50:17 -0700
  • fc41f0784a
    lint fix. Nicolas Patry 2024-10-15 18:46:56 +0200
  • 5c8c5ac81a
    Merge branch 'main' into feat/prefix_chunking Nicolas Patry 2024-10-15 18:28:27 +0200
  • ffe05ccd05
    Rollback to ChatRequest for Vertex AI Chat instead of VertexChat (#2651) Alvaro Bartolome 2024-10-15 18:11:59 +0200
  • fa491e730b
    Fixing dtype + AMD, Ipex targets. Nicolas Patry 2024-10-15 17:56:03 +0200
  • b3917ff695
    fix: add limit to internal stream function too adjust-where-request-max-tokens-is-defaulted David Holtz 2024-10-15 15:14:04 +0000
  • 595640e35c
    fix: enforce default max request tokens in generate_internal David Holtz 2024-10-15 15:08:23 +0000
  • cebd1b47f5
    Rollback to ChatRequest for Vertex AI Chat instead of VertexChat Alvaro Bartolome 2024-10-15 16:41:05 +0200
  • 4fa4da3cb6
    Fixing non blocked attentions Nicolas Patry 2024-10-15 16:12:00 +0200
  • fb4d2080af
    Merge branch 'main' into cpu_perf Wang, Yi 2024-10-15 21:50:15 +0800
  • ce7e356561
    Use flashinfer for Gemma 2. Daniël de Kok 2024-10-15 13:49:32 +0000
  • 689aa26db2
    (nit) improved comment Mohit Sharma 2024-10-15 12:16:03 +0000
  • 1de96279e3
    (review_comments) fix typo and added comments Mohit Sharma 2024-10-15 12:01:12 +0000
  • cf04a43fb1
    Fixing linters. (#2650) Nicolas Patry 2024-10-15 12:43:49 +0200
  • 39b86f7f16
    Fixing linters. Nicolas Patry 2024-10-15 12:26:32 +0200
  • b2b5024ec8
    (bug) update all has_tensor Mohit Sharma 2024-10-15 07:51:03 +0000
  • 64b0337574
    feature: get trace id from req headers kozistr 2024-10-15 15:14:20 +0900
  • 7ca47777aa
    Update mod.rs smith518 2024-10-15 11:22:34 +0530
  • b069d2c131
    refine the code according to the review command Wang, Yi A 2024-10-14 21:01:54 -0700
  • 7c6230c59a
    Merge branch 'main' into gpt_awq_4 Wang, Yi A 2024-10-14 20:28:15 -0700
  • 58848cb471
    feat: enable pytorch xpu support for non-attention models (#2561) Dmitry Rogozhkin 2024-10-14 09:28:49 -0700
  • 7a82ddcbd0
    update ipex to fix incorrect output of mllama in cpu (#2640) Wang, Yi 2024-10-14 22:32:33 +0800
  • 51f5401893
    Clarify gated description and quicktour (#2631) Omar Sanseviero 2024-10-14 16:31:37 +0200
  • 09d73e56ca
    remove docker entrypoint Raphael Glon 2024-10-14 15:52:09 +0200
  • c9e0f36dbc
    Machete WIP feature/machete Daniël de Kok 2024-10-14 07:59:09 +0000
  • 3ea82d008c
    Cpu perf (#2596) Nicolas Patry 2024-10-14 15:34:08 +0200
  • ce28ee88d5
    Small fixes for supported models (#2471) Omar Sanseviero 2024-10-14 15:26:39 +0200
  • 406725e05f
    Updating the doc (we keep the list actually). Nicolas Patry 2024-10-14 15:19:02 +0200
  • 7a7cd5f299
    (review comments) Fix compression_config load, type hints Mohit Sharma 2024-10-14 11:51:11 +0000
  • 0578bd917d
    Fix gpt_bigcode/starcoderbase-3b accuracy issue (#228) Sun Choi 2024-10-14 01:01:55 -0700
  • af546505ad
    add gfx1100 support to AMD pytorch build Drew Paettie 2024-10-12 22:55:49 -0700
  • 4be95899ca
    update ipex to fix incorrect output of mllama in cpu Wang, Yi A 2024-10-12 18:49:59 -0700
  • 0c478846c5
    Fixing intel Supports windowing. (#2637) Nicolas Patry 2024-10-11 21:47:03 +0200
  • fe2f251504
    Fixing intel Supports windowing. Nicolas Patry 2024-10-11 21:28:22 +0200
  • 5e70158b2c
    remove support chunking for paged OlivierDehaene 2024-10-11 15:19:14 +0200
  • b392362e9e
    direct return in clamp like rocm Wang, Yi A 2024-10-10 23:02:56 -0700
  • f213012b08
    Merge branch 'main' into sliding_window Wang, Yi A 2024-10-10 22:58:27 -0700
  • 05d68ae5c2
    add tests Linus Bierhoff 2024-10-10 19:41:51 +0200
  • 2285b0d63e
    add OpenAI like tool_choice for named choice Linus Bierhoff 2024-10-10 18:50:32 +0200
  • f18a460181
    propagate signal from entrypoint to tgi Raphael Glon 2024-10-10 18:26:20 +0200
  • df98299919
    fix cargo tests OlivierDehaene 2024-10-10 16:54:42 +0200
  • 3dbdf63ec5
    Intel ci (#2630) Nicolas Patry 2024-10-10 16:51:57 +0200
  • f923a3fb68
    fix mllama OlivierDehaene 2024-10-10 16:01:18 +0200
  • b7a1280f25
    fix tests OlivierDehaene 2024-10-10 14:52:09 +0200
  • f85a308ef1
    remove debugging lines OlivierDehaene 2024-10-09 20:05:39 +0200
  • d361197aab
    omfg OlivierDehaene 2024-10-09 20:04:06 +0200
  • d73c5c634d
    max input length OlivierDehaene 2024-10-09 19:39:14 +0200
  • 57f55fe834
    idk at this point OlivierDehaene 2024-10-09 19:17:18 +0200
  • 3ace1b2f8d
    fix logprobs? OlivierDehaene 2024-10-09 17:33:15 +0200
  • 08953c5975
    fix launcher OlivierDehaene 2024-10-08 19:23:45 +0200
  • ea4b739a9f
    fix prefill logprobs OlivierDehaene 2024-10-07 17:12:31 +0200
  • 3924b87a04
    rename to cache and input lengths OlivierDehaene 2024-10-07 15:14:03 +0200
  • 8188deac22
    fix vlm and seq2seq OlivierDehaene 2024-10-07 15:08:30 +0200
  • 460e830444
    fix benchmarker OlivierDehaene 2024-10-07 14:45:52 +0200
  • 4ddea01c6e
    remove log OlivierDehaene 2024-10-07 12:11:50 +0200
  • c8a033b636
    feedback loop OlivierDehaene 2024-10-07 12:02:25 +0200
  • ff4155dfea
    fix slot_filtering_indices OlivierDehaene 2024-10-02 19:16:36 +0200
  • b49978ff67
    re-create slots OlivierDehaene 2024-10-02 14:17:26 +0200
  • 4db5e7dde6
    re-create slots OlivierDehaene 2024-10-02 14:10:33 +0200
  • 7f9abde3f8
    load tested OlivierDehaene 2024-10-02 12:59:44 +0200
  • 34f5dc525e
    working OlivierDehaene 2024-10-01 09:51:34 +0200
  • 173bc99ab3
    add prepare_for_prefill OlivierDehaene 2024-09-30 17:58:14 +0200
  • 0e31619893
    current OlivierDehaene 2024-09-30 11:03:13 +0200
  • 962ccfd5b7
    wip, no filter, no concat OlivierDehaene 2024-09-26 17:10:00 +0200
  • a85f5ebecd
    fix filter and concat OlivierDehaene 2024-09-25 15:34:08 +0200
  • e4f9110e14
    maybe patching vlms? OlivierDehaene 2024-09-25 14:54:59 +0200
  • 838756eb18
    refactor to use prefix/postfix namming + fix all_input_ids_tensor OlivierDehaene 2024-09-25 14:40:47 +0200
  • de043b53c4
    rollback OlivierDehaene 2024-09-25 13:57:18 +0200
  • 7169cbae6d
    wip OlivierDehaene 2024-09-20 14:25:51 +0200