Commit Graph

  • 73ebbd05f8 Pr 2451 ci branch (#2454) drbh 2024-08-26 20:19:38 -0400
  • 7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459) drbh 2024-08-26 17:04:46 -0400
  • 92ac02e4f2 nix: add default package (#2453) Daniël de Kok 2024-08-23 22:06:22 +0200
  • b7d1adc3e9 nix: add awq-inference-engine as server dependency (#2442) Daniël de Kok 2024-08-21 22:20:03 +0200
  • 6654c2d11b Adding eetq to flake. (#2438) Nicolas Patry 2024-08-21 09:06:33 +0200
  • a5af557359 nix: add text-generation-benchmark to pure devshell (#2431) Daniël de Kok 2024-08-21 07:48:13 +0200
  • 516392d790 nix: add pure server to flake, add both pure and impure devshells (#2430) Daniël de Kok 2024-08-20 22:07:33 +0200
  • 635dde8af9 Prefix caching (#2402) Nicolas Patry 2024-08-20 11:15:30 +0200
  • ddba272a66 nix: update to CUDA 12.4 (#2429) Daniël de Kok 2024-08-19 09:28:38 +0200
  • cd208c5043 All integration tests back everywhere (too many failed CI). (#2428) Nicolas Patry 2024-08-16 21:19:46 +0200
  • 53fdbe617d doc: Add metrics documentation and add a 'Reference' section (#2230) Hugo Larcher 2024-08-16 19:43:30 +0200
  • 11d25a4bd3 FIxing the CI. Nicolas Patry 2024-08-16 14:21:29 +0200
  • 85df9fc2db Further fixes. (#2426) Nicolas Patry 2024-08-16 13:21:44 +0200
  • df0e650891 Improve the Consuming TGI + Streaming docs. (#2412) Vaibhav Srivastav 2024-08-16 12:43:08 +0200
  • 20ed7b598e nix: try to reduce the number of Rust rebuilds (#2424) Daniël de Kok 2024-08-16 10:01:01 +0200
  • f0181ed2d7 Upgrading the tests to match the current workings. (#2423) Nicolas Patry 2024-08-15 13:28:42 +0200
  • df6ea89da9 Fixing exl2 and other quanize tests again. (#2419) Nicolas Patry 2024-08-15 11:12:51 +0200
  • e5c39a5545 nix: build router incrementally (#2422) Daniël de Kok 2024-08-15 10:21:51 +0200
  • c3401e0b99 More fixes trtllm (#2342) Funtowicz Morgan 2024-08-14 12:02:05 +0200
  • 4baa6ff59f Upgrading exl2. (#2415) Nicolas Patry 2024-08-14 11:58:08 +0200
  • bae161ab84 nix: partial incremental build of the router (#2416) Daniël de Kok 2024-08-14 11:06:28 +0200
  • ffc8fb0850 fix: adds causal to attention params (#2408) drbh 2024-08-13 10:19:46 -0400
  • 7a4d831d17 add numa to improve cpu inference perf (#2330) Wang, Yi 2024-08-13 21:33:55 +0800
  • c5e4c1877b Adding more kernels to flake. (#2411) Nicolas Patry 2024-08-13 10:49:18 +0200
  • eb561bb715 nix: incremental build of the launcher (#2410) Daniël de Kok 2024-08-13 10:44:15 +0200
  • 10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) drbh 2024-08-12 17:59:37 -0400
  • 1f8c0f83e3 Pr 2395 ci run (#2406) drbh 2024-08-12 14:38:59 -0400
  • 18d6be6af4 Updating the flake. (#2404) Nicolas Patry 2024-08-12 18:09:16 +0200
  • 96e8fa37b0 fix: improve completions to send a final chunk with usage details (#2336) drbh 2024-08-12 11:26:11 -0400
  • 3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345) drbh 2024-08-12 11:24:32 -0400
  • 8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403) drbh 2024-08-12 10:58:40 -0400
  • 6393cdee63 Keeping the benchmark somewhere (#2401) Nicolas Patry 2024-08-12 15:22:02 +0200
  • f586cc7f0c Add support for prefix caching to the v3 router (#2392) Daniël de Kok 2024-08-12 14:59:17 +0200
  • b8efd6d00c Cpu dockerimage (#2367) Wang, Yi 2024-08-12 20:10:30 +0800
  • 1daaddd072 Fixing import exl2 (#2399) Nicolas Patry 2024-08-12 14:08:59 +0200
  • fbe59c6267 Adding launcher to build. (#2397) Nicolas Patry 2024-08-12 14:08:46 +0200
  • 8750dc878e Upgrade fbgemm (#2398) Nicolas Patry 2024-08-12 14:08:38 +0200
  • 197dd3af12 nix: add router to the devshell (#2396) Daniël de Kok 2024-08-12 09:28:38 +0200
  • bb833389e0 Update flake for 9.0a capability in Torch (#2394) Daniël de Kok 2024-08-09 22:36:51 +0200
  • 959add5e9b feat: add guideline to chat request and template (#2391) drbh 2024-08-09 10:56:45 -0400
  • 849bd93dc3 Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385) Nicolas Patry 2024-08-09 16:41:17 +0200
  • df719fd527 flake: use rust-overlay (#2390) Daniël de Kok 2024-08-09 15:24:21 +0200
  • 1d4a35a23c Update documentation for Supported models (#2386) Vaibhav Srivastav 2024-08-09 15:01:34 +0200
  • e9ba044250 flake: add fmt and clippy (#2389) Daniël de Kok 2024-08-09 14:56:20 +0200
  • afa14b7595 Using HF_HOME instead of CACHE to get token read in addition to models. (#2288) Nicolas Patry 2024-08-09 14:25:44 +0200
  • dc0fa60f55 Add experimental flake (#2384) Daniël de Kok 2024-08-09 12:32:37 +0200
  • 4a16da5d49 Add FlashInfer support (#2354) Daniël de Kok 2024-08-09 11:42:00 +0200
  • 6f2a468a64 Pr 2352 ci branch (#2382) drbh 2024-08-09 04:54:32 -0400
  • b1bc0ecb7f Update Quantization docs and minor doc fix. (#2368) Vaibhav Srivastav 2024-08-08 22:06:57 +0200
  • 853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) drbh 2024-08-08 14:08:56 -0400
  • 1057f28128 Pr 2337 ci branch (#2379) drbh 2024-08-08 12:30:29 -0400
  • 3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346) Wang, Yi 2024-08-09 00:08:52 +0800
  • 06b638f310 Pr 2374 ci branch (#2378) drbh 2024-08-08 11:14:06 -0400
  • 9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371) drbh 2024-08-07 23:14:02 -0400
  • 3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372) drbh 2024-08-07 21:32:37 -0400
  • 11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350) almersawi 2024-08-08 03:45:23 +0400
  • 3ccde430d9 fix: prefer original layernorm names for 180B (#2365) drbh 2024-08-06 15:25:30 -0400
  • db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) drbh 2024-08-06 13:33:22 -0400
  • 5400c7155d feat: return the generated text when parsing fails (#2353) drbh 2024-08-06 13:10:19 -0400
  • b4562e1369 feat: prefer stop over eos_token to align with openai finish_reason (#2344) drbh 2024-08-06 13:09:50 -0400
  • 88e07f12cc feat: implement a templated endpoint for visibility into chat requests (#2333) drbh 2024-08-06 07:51:32 -0400
  • 83d1f23fea fix: return the out tensor rather then the functions return value (#2361) drbh 2024-08-06 07:49:53 -0400
  • 8b0f5feb02 feat: include local lora adapter loading docs (#2359) drbh 2024-08-05 12:36:44 -0400
  • 688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335) drbh 2024-08-05 09:11:40 -0400
  • 48fec7b198 Unify attention output handling (#2343) Daniël de Kok 2024-08-01 17:03:28 +0200
  • ccddb30c02 Fix cache block size for flash decoding (#2351) Daniël de Kok 2024-08-01 15:38:57 +0200
  • d70da59c25 enable HuggingFaceM4/idefics-9b in intel gpu (#2338) Wang, Yi 2024-08-01 17:08:36 +0800
  • 3c4f816ae3 refactor usage stats (#2339) Erik Kaunismäki 2024-07-31 16:29:07 +0200
  • c73d1d604f Pr 2290 ci run (#2329) drbh 2024-07-31 10:27:15 -0400
  • 468e5c6874 Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300) Daniël de Kok 2024-07-31 13:08:41 +0200
  • 120d5773e8 Rebase TRT-llm (#2331) Nicolas Patry 2024-07-31 10:33:10 +0200
  • 247a29f77c server quantize: store quantizer config in standard format (#2299) Daniël de Kok 2024-07-30 15:16:20 +0200
  • bafab73f76 fix: adjust test snapshots and small refactors (#2323) drbh 2024-07-29 11:38:38 -0400
  • b1d1d26559 patch-error-on-invalid-grammar (#2282) Erik Kaunismäki 2024-07-29 16:09:25 +0200
  • a574381cb4 fix: reject grammars without properties (#2309) drbh 2024-07-29 10:07:25 -0400
  • 23a3927eb6 Install Marlin from standalone package (#2320) Daniël de Kok 2024-07-29 15:37:10 +0200
  • 2c1d280fae Run ci api key (#2315) Erik Kaunismäki 2024-07-29 11:14:17 +0200
  • a87791d7c9 feat: add ruff and resolve issue (#2262) drbh 2024-07-26 10:29:09 -0400
  • fc6d80fdb8 Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) Daniël de Kok 2024-07-26 14:57:24 +0200
  • 1674f441d0 Fix registry name (#2307) Adrien 2024-07-25 16:06:00 +0200
  • d5e054342e Fixing idefics on g6 tests. (#2306) Nicolas Patry 2024-07-25 14:44:21 +0200
  • 64ffd642fa Some small fixes for the Torch 2.4.0 update (#2304) Daniël de Kok 2024-07-25 13:34:44 +0200
  • 69db13e5e5 Using g6 instead of g5. (#2281) Nicolas Patry 2024-07-25 11:21:17 +0200
  • 7ebee37641 fix: refactor adapter weight loading and mapping (#2193) drbh 2024-07-24 15:32:14 -0400
  • 457791f511 Split up layers.marlin into several files (#2292) Daniël de Kok 2024-07-24 16:33:26 +0200
  • d93931567d fix of use of unquantized weights in cohere GQA loading, also enable … (#2291) Wang, Yi 2024-07-24 16:44:02 +0800
  • 204142153f fix crash in multi-modal (#2245) Wang, Yi 2024-07-24 16:39:08 +0800
  • a994f6aedd hotfix: update nccl OlivierDehaene 2024-07-23 23:31:28 +0200
  • 34c472bd64 chore: update to torch 2.4 (#2259) OlivierDehaene 2024-07-23 20:39:43 +0000
  • b1077b077c hotfix: pin numpy (#2289) Daniël de Kok 2024-07-23 17:53:19 +0200
  • 43f49141fd Add support for Llama 3 rotary embeddings (#2286) Daniël de Kok 2024-07-23 17:18:54 +0200
  • 5390973c09 Preparing for release. (#2285) Nicolas Patry 2024-07-23 16:20:17 +0200
  • 69b67b7add Add support for Mistral-Nemo by supporting head_dim through config (#2254) shaltielshmid 2024-07-23 16:00:07 +0300
  • 26460f053d Add support for repacking AWQ weights for GPTQ-Marlin (#2278) Daniël de Kok 2024-07-23 13:08:20 +0200
  • 919da25c3b fix(l4): fix fp8 logic on l4 (#2277) OlivierDehaene 2024-07-23 09:24:29 +0000
  • 31eb03dbe2 Fixing mistral nemo. (#2276) Nicolas Patry 2024-07-23 11:16:03 +0200
  • 568cc9f3d0 Softcapping for gemma2. (#2273) Nicolas Patry 2024-07-22 18:27:10 +0200
  • a7515b8af1 fix(server): fix fp8 weight loading (#2268) OlivierDehaene 2024-07-22 15:51:32 +0000
  • 758a8b8423 legacy warning on text_generation client (#2271) Erik Kaunismäki 2024-07-22 12:00:17 +0200
  • a5aee82a69 Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269) icyboy™ 2024-07-22 17:31:00 +0800