Commit Graph

  • 653193a942 Improve support for GPUs with capability < 8 (#2575) Daniël de Kok 2024-09-27 16:19:42 +0200
  • bc28f86903 Fix build with --features google (#2566) Alvaro Bartolome 2024-09-26 11:41:38 +0200
  • 6976cf8c4c Add LoRA adapters support for Gemma2 (#2567) Alvaro Bartolome 2024-09-26 10:54:08 +0200
  • 0817643b58 remove LORA_ADAPTERS_PATH (#2563) Nicholas Broad 2024-09-24 16:20:15 -0700
  • a684a81927 More tensor cores. (#2558) Nicolas Patry 2024-09-24 23:57:26 +0200
  • 97d4bdd685 Cleanup Vertex + Chat (#2553) Nicolas Patry 2024-09-24 23:37:17 +0200
  • 25e0edf337 Hotfixing main. (#2562) Nicolas Patry 2024-09-24 23:00:43 +0200
  • 782130df17 Adding note for private models in quick-tour document (#2548) Aritra Roy Gosthipaty 2024-09-24 18:36:53 +0530
  • 5247f8938d Simplify crossterm imports (#2545) Orhun Parmaksız 2024-09-24 15:57:20 +0300
  • 8c6d3e074f Update the link to the Ratatui organization (#2546) Orhun Parmaksız 2024-09-24 15:51:48 +0300
  • d4f995e718 Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 (#2537) Daniël de Kok 2024-09-24 14:27:06 +0200
  • 32d50c2ea7 Add support for scalar FP8 weight scales (#2550) Daniël de Kok 2024-09-24 13:57:40 +0200
  • 55115ed700 Skip the test let's see if it's always the first tests that fails. Nicolas Patry 2024-10-25 11:00:29 +0200
  • ba5fc7d922 Add support for stop words in TRTLLM (#2678) Funtowicz Morgan 2024-10-25 10:58:34 +0200
  • 68cfc94f40 Hotfixing main (#2556) Nicolas Patry 2024-09-24 11:51:14 +0200
  • 79ac2b741d Micro cleanup. (#2555) Nicolas Patry 2024-09-24 11:19:24 +0200
  • 73e6090d53 chore: Add old V2 backend (#2551) OlivierDehaene 2024-09-24 08:38:17 +0200
  • 9aed9d5f81 nix: remove unused _server.nix file (#2538) Daniël de Kok 2024-09-23 09:43:23 +0200
  • b590310255 Add missing import package yuanwu 2024-10-25 08:52:24 +0000
  • 79690a0d65 Update for new API Nicolas Patry 2024-10-25 10:46:05 +0200
  • a7465ba67d fix kernel OlivierDehaene 2024-10-25 10:37:10 +0200
  • 347f3f51da fix kernel OlivierDehaene 2024-10-24 19:17:31 +0200
  • d1e95ceaff cast to int32 OlivierDehaene 2024-10-24 19:01:40 +0200
  • ea66379e3c feat: add triton kernels to decrease latency of large batches OlivierDehaene 2024-10-24 16:48:46 +0200
  • 8ebe77b3be Simplify the warmup yuanwu 2024-10-24 06:26:48 +0000
  • 84f14a1437 feat(trtllm): detect stop_words from generation_config.json Morgan Funtowicz 2024-10-23 16:05:59 +0200
  • 13a68e223a chore(docker): install transformers Morgan Funtowicz 2024-10-23 15:48:24 +0200
  • 381262337a chore(docker): add mpi to ld_library_path Morgan Funtowicz 2024-10-23 15:48:17 +0200
  • e4f67f70a2 feat(docker): add python3.10 dev to runtime deps Morgan Funtowicz 2024-10-22 23:05:55 +0200
  • 17573d42d8 feat(docker): build with-slurm ompi Morgan Funtowicz 2024-10-22 23:05:45 +0200
  • 50a19aa326 chore(router): minor refactorings Morgan Funtowicz 2024-10-22 23:05:10 +0200
  • b939a0f7d7 chore(trtllm): minor fix Morgan Funtowicz 2024-10-21 23:50:02 +0200
  • cdba16fd23 chore(trtllm): ensure max throughput scheduling policy is selected Morgan Funtowicz 2024-10-21 23:40:54 +0200
  • d659cb0113 chore(trtllm): validate there are enough GPus on the system for the desired model Morgan Funtowicz 2024-10-21 23:40:38 +0200
  • c2bb199fb1 chore(trtllm): minor refactoring Morgan Funtowicz 2024-10-21 23:40:20 +0200
  • 703c26eca7 chore(trtllm): use GetParallelConfig Morgan Funtowicz 2024-10-21 23:39:44 +0200
  • c90680ed30 chore(trtllm): define a macro for SizeType cast Morgan Funtowicz 2024-10-21 23:39:08 +0200
  • 16bb4b670b chore(trtllm): create specific parallelconfig factory and logging init methods Morgan Funtowicz 2024-10-21 23:38:42 +0200
  • 171a5638b1 feat(trtllm): add stop words handling Morgan Funtowicz 2024-10-21 17:00:45 +0200
  • e711947e3e chore(ffi):formatting Morgan Funtowicz 2024-10-21 16:59:30 +0200
  • 17073267c0 feat(post_processing): max_new_tokens is const evaluated now Morgan Funtowicz 2024-10-21 16:57:46 +0200
  • 3af45189b3 chore(looper): cleanup a bit more Morgan Funtowicz 2024-10-21 16:57:26 +0200
  • 7f383bf4dc feat(trtllm): rewrite health to not account for current state Morgan Funtowicz 2024-10-21 15:55:38 +0200
  • c3fb2ecdc0 Merge branch 'main' into auto_length Nicolas Patry 2024-10-25 10:20:00 +0200
  • 123ff3a83e Fixing bad rebase. Nicolas Patry 2024-10-25 09:58:46 +0200
  • 0bd9171556 Avoiding timeout for bloom tests. Nicolas Patry 2024-10-25 09:48:57 +0200
  • db68bd0524 Fixing mt0 test. (#2692) Nicolas Patry 2024-10-25 09:46:39 +0200
  • f16121002c Fixing mt0 test. Nicolas Patry 2024-10-25 09:34:15 +0200
  • cece8635f8 Fixing rocm gptq by using triton code too (renamed cuda into triton). (#2691) Nicolas Patry 2024-10-25 09:17:57 +0200
  • 7dc2adf7e9 Fixing rocm gptq by using triton code too (renamed cuda into triton). Nicolas Patry 2024-10-25 07:26:33 +0200
  • bbbd9a6dd2 Deprecation message. Nicolas Patry 2024-10-16 18:50:33 +0200
  • d4d4321814 Lint. Nicolas Patry 2024-10-16 15:07:01 +0200
  • b07935b04f Ellide lifetime. Nicolas Patry 2024-09-25 21:38:31 +0200
  • f20ef614bd Adding the legacy handle. Nicolas Patry 2024-09-25 14:37:36 +0200
  • cd355d08a9 Fixing mamba by using the transformers version. Nicolas Patry 2024-09-25 03:37:12 +0200
  • 9d7a95b24b Fixing the GIL locking. Nicolas Patry 2024-09-25 01:18:05 +0200
  • c0151cc14a Flake.lock update ? Nicolas Patry 2024-09-24 16:22:17 +0200
  • 5bc1fe84eb Fixing the tests. Nicolas Patry 2024-09-24 15:45:10 +0200
  • b89b9fd016 Remove redundancy. Nicolas Patry 2024-09-17 17:10:30 +0200
  • 9d702bcde3 Handling potential lack of offsets (python tokenizer) Nicolas Patry 2024-09-17 16:56:19 +0200
  • 5ba7805f1c We can have a tokenizer anywhere. Nicolas Patry 2024-09-17 16:16:51 +0200
  • 43df056eee [TENSORRT-LLM] - Implement new looper thread based backend (#2357) Funtowicz Morgan 2024-10-25 07:17:14 +0200
  • 4463856cc7 Fix bad rebase Nicolas Patry 2024-10-25 07:14:41 +0200
  • b4b6322ede Lint. Nicolas Patry 2024-10-25 07:10:34 +0200
  • 01b82b58d2 Merge branch 'main' into trtllm-executor-thread Nicolas Patry 2024-10-25 07:06:35 +0200
  • 84b4a49093 Upgrade outlines to 0.1.1 Alex Weston 2024-10-16 13:58:54 -0400
  • ed87b464b4 Fixing "deadlock" when python prompts for trust_remote_code by always (#2664) Nicolas Patry 2024-10-25 06:39:21 +0200
  • c6281a4893 Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Daniël de Kok 2024-10-24 15:29:33 +0000
  • eab07f746c Add support for FP8 KV cache scales (#2628) Daniël de Kok 2024-10-24 16:36:18 +0200
  • f311643fff fix: improve find_segments via numpy diff drbh 2024-10-24 10:16:34 -0400
  • 14a0df3a38 Fix Phi 3.5 MoE tests (#2684) Daniël de Kok 2024-10-24 15:21:50 +0200
  • 1b914f37e7 flashinfer: reminder to remove contiguous call in the future (#2685) Daniël de Kok 2024-10-24 14:59:56 +0200
  • 996413b8b0 flashinfer: reminder to remove contiguous call in the future Daniël de Kok 2024-10-24 12:42:52 +0000
  • a68fae05e9 can_scale: check that the attention is flashinfer Daniël de Kok 2024-10-24 12:35:30 +0000
  • 9bbbe47c82 Fix Phi 3.5 MoE tests Daniël de Kok 2024-10-24 12:06:21 +0000
  • e3db525917 Fix integration mt0 (transformers update). auto_length Nicolas Patry 2024-10-24 11:54:11 +0200
  • 199973cc3c Simple updates. Nicolas Patry 2024-10-24 11:39:02 +0200
  • 1f18cb6aa6 Update FP8 KV cache test to use checkpoint with scales Daniël de Kok 2024-10-21 11:18:52 +0000
  • ba4ac96399 Add support for FP8 KV cache scales Daniël de Kok 2024-10-09 11:50:21 +0000
  • cacaba64c3 Revert doc text. Nicolas Patry 2024-10-24 10:06:59 +0200
  • 6994fa12f8 Updating logic + non flash. Nicolas Patry 2024-10-24 09:58:05 +0200
  • 10534511ea Much simpler logic after the overhead. Nicolas Patry 2024-10-24 06:55:25 +0200
  • 9cee00eec3 feat(trtllm): detect stop_words from generation_config.json Morgan Funtowicz 2024-10-23 16:05:59 +0200
  • 6376fecc6c chore(docker): install transformers Morgan Funtowicz 2024-10-23 15:48:24 +0200
  • ef0031182e chore(docker): add mpi to ld_library_path Morgan Funtowicz 2024-10-23 15:48:17 +0200
  • 41c2623735 feat: allow any supported payload on /invocations (#2683) OlivierDehaene 2024-10-23 13:26:01 +0200
  • 27ff1871b5 hotfix: fix flashllama OlivierDehaene 2024-10-23 13:22:31 +0200
  • 2c8a51a474 update doc OlivierDehaene 2024-10-23 12:20:20 +0200
  • 03c9388bf7 feat: natively support Granite models (#2682) OlivierDehaene 2024-10-23 12:04:05 +0200
  • 25b97fff49 Update doc OlivierDehaene 2024-10-23 12:03:46 +0200
  • 849d8821ab QuantLinear is rocm compatible. Nicolas Patry 2024-10-23 18:02:50 +0800
  • 70483428ee update openAPI OlivierDehaene 2024-10-23 11:59:41 +0200
  • 09dfff62ff feat: allow any supported payload on /invocations OlivierDehaene 2024-10-23 11:51:13 +0200
  • 82a6cb82e1 fix. Nicolas Patry 2024-10-23 17:26:18 +0800
  • 9897edb842 feat: natively support Granite models OlivierDehaene 2024-10-23 11:08:09 +0200
  • f58eb70ebf Make moe-kernels and marlin-kernels mandatory in CUDA installs (#2632) Daniël de Kok 2024-10-23 11:07:31 +0200
  • b126bf4785 Revert pr 235 as flash attention is not really enabled for gemma (#239) Thanaji Rao Thakkalapelli 2024-10-23 01:58:57 -0700
  • 8686a0fc6d Merge branch 'habana-main' into 2.3.0 yuanwu2017 2024-10-23 16:32:12 +0800
  • 67ee45a270 Pass the max_batch_total_tokens to causal_lm yuanwu 2024-10-10 07:31:50 +0000
  • 5c3efbc763 Attempt #2 Nicolas Patry 2024-10-23 15:23:39 +0800