Commit Graph

  • 8a4df6e181
    Only n_heads / process_group.size() are necessary. Nicolas Patry 2024-08-28 16:34:58 +0200
  • 8d01848370
    Update server tests Nicolas Patry 2024-08-28 15:42:05 +0200
  • 12325564dc
    Put back default pure shell. Nicolas Patry 2024-08-28 14:54:05 +0200
  • f886747949
    Oops this doesn't belong here. Nicolas Patry 2024-08-28 14:49:00 +0200
  • e6ee67f301
    Truncating left for radix purposes. Nicolas Patry 2024-08-28 10:53:22 +0200
  • 0a60973166
    Fixing the batching tokenization in flash causal lm. Nicolas Patry 2024-08-28 10:34:10 +0200
  • c6f1a61267
    Update the chat test. Nicolas Patry 2024-08-27 23:02:12 +0200
  • 8ac1ffa087
    Removing encoder_decoder (seq2seq). Nicolas Patry 2024-08-27 21:11:49 +0200
  • ccaf1d0030
    Fixing the test. Nicolas Patry 2024-08-27 20:03:50 +0200
  • 2cf1f5c00e
    Fixing the issue with add_special_tokens not being passed around. Nicolas Patry 2024-08-27 20:02:35 +0200
  • e0069a3a26
    Fixing seqlen with the new vlms. Nicolas Patry 2024-08-27 18:16:35 +0200
  • 9dacac3b15
    add_special_tokens is internal only Nicolas Patry 2024-08-27 15:18:47 +0200
  • 55d984d730
    Fixed flashinfer version. Nicolas Patry 2024-08-27 15:00:22 +0200
  • bb9769ed42
    Update all models. Nicolas Patry 2024-08-27 14:46:42 +0200
  • 65b94a69bd
    Fixing prefix caching for flashdecoding. Nicolas Patry 2024-08-27 14:23:51 +0200
  • 7f1816a4e1
    Change add_special_tokens in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) Nicolas Patry 2024-08-27 11:51:29 +0200
  • f1c0735453
    Don't enable prefix caching on VLM just yet. Nicolas Patry 2024-08-27 09:58:19 +0200
  • e30fb25444
    Fixing the default for vlm. Nicolas Patry 2024-08-26 22:45:04 +0200
  • 27b566baa8
    Downgrade some logs. Nicolas Patry 2024-08-26 18:30:19 +0200
  • 26e5037de4
    This seems to be working. Nicolas Patry 2024-08-26 18:27:28 +0200
  • f5182c188c
    Is this enough to make it work ? Nicolas Patry 2024-08-26 17:43:27 +0200
  • 1568e82548
    Override the env in server tests. Nicolas Patry 2024-08-26 15:25:03 +0200
  • 682db34b6a
    Handling debugger. Nicolas Patry 2024-08-26 14:59:27 +0200
  • c53968dc45
    Remove lambda for cleaner function. Nicolas Patry 2024-08-23 15:37:54 +0200
  • 32f6416358
    Upgrade resolution system for less errors in resolution. Nicolas Patry 2024-08-23 15:27:53 +0200
  • 5eb6ea0063
    Tmp Nicolas Patry 2024-08-22 14:34:12 +0200
  • 0bf4eb9683
    Updated flake lock Nicolas Patry 2024-08-21 09:15:10 +0200
  • b80593bfa3
    Apply suggestions from code review Nicolas Patry 2024-08-21 09:03:28 +0200
  • 8d0220a695
    Forgot last default place. Nicolas Patry 2024-08-20 18:17:54 +0200
  • 860b550cdf
    Everywhere 1.80 Nicolas Patry 2024-08-20 15:52:31 +0200
  • 344fee0d44
    Upgrade to 1.80 because of bitstream... Nicolas Patry 2024-08-20 15:43:42 +0200
  • 17c8a5e574
    Update cargo lock ? Nicolas Patry 2024-08-20 15:28:11 +0200
  • ba1ce20ce8
    Updating integration tests with new values with FI/FD. Nicolas Patry 2024-08-20 15:12:41 +0200
  • ffb6841121
    Update lock Nicolas Patry 2024-08-20 12:08:33 +0200
  • f0b35f94b8
    More specific codes. Nicolas Patry 2024-08-20 12:05:40 +0200
  • a6cd5fef23
    Disable prefix caching for lora. Nicolas Patry 2024-08-20 09:14:57 +0200
  • cba59aca03
    Disabling flashinfer/prefix caching on odd head_dim Nicolas Patry 2024-08-19 16:56:06 +0200
  • f55278de2d
    Allowing window_left_size (dummy version). Nicolas Patry 2024-08-17 12:04:21 +0200
  • f2bdc65098
    Using prebuilt. Nicolas Patry 2024-08-17 00:42:51 +0200
  • 9d4c5d39fe
    Include flashinfer in the docker. Nicolas Patry 2024-08-16 23:50:37 +0200
  • 60719babf6
    Making prefix/flashinfer the default and testing the full release tests. Nicolas Patry 2024-08-16 14:16:45 +0200
  • 21187c27c9
    fix: bump minijinja version and add test for llama 3.1 tools (#2463) drbh 2024-08-27 13:31:08 -0400
  • 5e14f5bed7
    fix: add to redocly ignore and lint drbh 2024-08-27 17:01:15 +0000
  • 8bfa11f636
    fix: update docs with new endpoint drbh 2024-08-27 16:59:33 +0000
  • a76bd78486
    fix: revert route typo drbh 2024-08-27 16:34:37 +0000
  • 997d7a102a
    fix: remove unused type import drbh 2024-08-27 16:33:40 +0000
  • b348ab4c55
    Merge branch 'support-openai-models-endpoint' of github.com:huggingface/text-generation-inference into support-openai-models-endpoint drbh 2024-08-27 16:31:50 +0000
  • 1b8f384ce2
    fix: adjust comment typo drbh 2024-08-27 16:26:50 +0000
  • 25a0ea6674
    fix: prefer minijinja native methods and prefer workspace level dependency drbh 2024-08-27 16:25:23 +0000
  • 8fd5639e9f
    fix: support tojson and avoid message indexing issue in template drbh 2024-08-27 15:05:43 +0000
  • 2788d41a76
    Fixing CI. (#2462) Nicolas Patry 2024-08-27 15:33:02 +0200
  • f1a94fb009
    Fixing CI. Nicolas Patry 2024-08-27 15:24:11 +0200
  • fde061ccf8
    Updated docker image version to 2.0.4 (#212) Thanaji Rao Thakkalapelli 2024-08-27 01:14:27 -0700
  • 8398d4f436
    feat: add /v1/models endpoint drbh 2024-08-19 16:00:48 +0000
  • cfa73b5c99
    Pr 2451 ci branch (#2454) drbh 2024-08-26 20:19:38 -0400
  • 57a8038d05
    fix: increase test client timeout for grammar compilation tests drbh 2024-08-26 21:14:32 +0000
  • 20db2c3db8
    feat: avoid skip tool test and avoid empty tool prompts drbh 2024-08-26 19:15:05 +0000
  • 1f72dcf062
    fix: simplify tool grammar logic and improve schema drbh 2024-08-26 17:59:21 +0000
  • 8b45d82897
    fix: adjust non tool template apply drbh 2024-08-25 19:12:59 +0000
  • 1bf0e3b65c
    feat: refactor tool logic to include notify_error in prompt and adjust typing drbh 2024-08-23 21:07:43 +0000
  • 9ea34977ac
    feat: improve default tool serialization and lints drbh 2024-08-23 18:05:40 +0000
  • 2ee98c7c07
    fix[router]: Fix tools not passed in chat template Simone Rossi 2024-08-22 15:48:37 +0000
  • 30be188400
    Fix: don't apply post layernorm in SiglipVisionTransformer (#2459) drbh 2024-08-26 17:04:46 -0400
  • 6256b81baf
    fix: adjust pali gemma for post layer norm and small refactors drbh 2024-08-26 19:35:39 +0000
  • 2985503900
    llava-next Fp8 (#209) yuanwu2017 2024-08-26 22:53:08 +0800
  • 55d60a103c
    Add qwen2 fp8 support (#210) Wang, Chang 2024-08-26 17:02:58 +0800
  • e33db1877c
    Updated Readme to use flash attention for llama (#200) Thanaji Rao Thakkalapelli 2024-08-26 02:01:11 -0700
  • c925bd2872
    Undo disable of hpu graphs for starcoder (#201) Vidya Galli 2024-08-26 01:58:01 -0700
  • 0c3239e710
    Enable quantization with INC (#203) Thanaji Rao Thakkalapelli 2024-08-26 01:55:37 -0700
  • ea48ae169a
    Make prefill time of static benchmark correct (#214) Sun Choi 2024-08-26 01:51:28 -0700
  • a8cead1f92
    Upgrade SynapseAI version to 1.17.0 (#208) yuanwu2017 2024-08-26 16:49:29 +0800
  • b84303e2e9
    Fix: don't apply post layernorm in SiglipVisionTransformer Travis Addair 2024-08-24 23:41:23 -0700
  • f3c5d7d92f
    nix: add default package (#2453) Daniël de Kok 2024-08-23 22:06:22 +0200
  • dd89e0d24c
    nix: add default package Daniël de Kok 2024-08-23 06:22:09 +0000
  • e152cb022b
    fix: also show total memory after full warmup avoid-cuda-graph-during-warmup-if-oom drbh 2024-08-22 17:57:51 +0000
  • 8b4cd2a9fc
    fix: skip cuda graphs that will oom and improve free memory logging drbh 2024-08-22 17:49:17 +0000
  • 9a3e838079
    fix[router]: Fix tools not passed in chat template Simone Rossi 2024-08-22 15:48:37 +0000
  • 0b02d45a05
    add gptq and awq int4 support in intel platform Wang, Yi A 2024-08-21 22:47:34 -0700
  • 0b3384762b
    Update Dockerfile_intel Tyler Titsworth 2024-08-21 15:39:50 -0700
  • 358ceb67dd
    nix: add awq-inference-engine as server dependency (#2442) Daniël de Kok 2024-08-21 22:20:03 +0200
  • c98dbdb8c9
    nix: add awq-inference-engine as server dependency Daniël de Kok 2024-08-21 20:09:39 +0000
  • d33fb9ed2c
    extracting traceparent from header to span fix/op-trace-id erikkaum 2024-08-21 11:28:50 +0200
  • 2652e209e7
    Updated flake lock prefix_default Nicolas Patry 2024-08-21 09:15:10 +0200
  • 3ece76392b
    Apply suggestions from code review Nicolas Patry 2024-08-21 09:03:28 +0200
  • cdbf73eef8
    Forgot last default place. Nicolas Patry 2024-08-20 18:17:54 +0200
  • 3d46783f1a
    Everywhere 1.80 Nicolas Patry 2024-08-20 15:52:31 +0200
  • e2319fa891
    Upgrade to 1.80 because of bitstream... Nicolas Patry 2024-08-20 15:43:42 +0200
  • f628886c0a
    Update cargo lock ? Nicolas Patry 2024-08-20 15:28:11 +0200
  • 2fe5879816
    Updating integration tests with new values with FI/FD. Nicolas Patry 2024-08-20 15:12:41 +0200
  • e48e07c04b
    Update lock Nicolas Patry 2024-08-20 12:08:33 +0200
  • bd0ced354d
    More specific codes. Nicolas Patry 2024-08-20 12:05:40 +0200
  • f5ee062cbd
    Disable prefix caching for lora. Nicolas Patry 2024-08-20 09:14:57 +0200
  • 719d7b4d54
    Disabling flashinfer/prefix caching on odd head_dim Nicolas Patry 2024-08-19 16:56:06 +0200
  • 7857910435
    Allowing window_left_size (dummy version). Nicolas Patry 2024-08-17 12:04:21 +0200
  • 73fd04d60a
    Using prebuilt. Nicolas Patry 2024-08-17 00:42:51 +0200
  • 5336755358
    Include flashinfer in the docker. Nicolas Patry 2024-08-16 23:50:37 +0200
  • 52c813527a
    Making prefix/flashinfer the default and testing the full release tests. Nicolas Patry 2024-08-16 14:16:45 +0200
  • 310778e02a
    Adding eetq to flake. (#2438) Nicolas Patry 2024-08-21 09:06:33 +0200
  • cbbfe8eb2a
    Adding eetq to flake. Nicolas Patry 2024-08-21 09:05:56 +0200
  • 9474415095
    nix: add text-generation-benchmark to pure devshell (#2431) Daniël de Kok 2024-08-21 07:48:13 +0200