Commit Graph

  • 8770b39c20 fix: remove accidentally included guideline from rebase drbh 2024-11-22 13:50:30 -0500
  • 4069955e44 fix: bump test output drbh 2024-11-21 12:04:24 -0500
  • 7486d930f8 fix: remove unneeded launcher args in continue test drbh 2024-11-21 09:09:39 -0500
  • 70066e6d8c fix: remove continue_final_message chat request param David Holtz 2024-11-19 21:24:18 +0000
  • d6280141de fix: bump openapi docs David Holtz 2024-11-08 20:46:58 +0000
  • b2ae92e470 feat: add test for continue final message David Holtz 2024-11-08 19:00:05 +0000
  • c782a78623 feat: support continue_final_message param in chat request drbh 2024-11-07 18:24:58 -0400
  • 2d1e80d248 fix: only use eos_token_id as pad_token_id if int Dmitry Rogozhkin 2024-11-22 10:36:26 -0800
  • d2ed52f531 v2.4.1 v2.4.1 git_v2.4.1 OlivierDehaene 2024-11-22 18:28:39 +0100
  • 780531ec77 chore: prepare 2.4.1 release (#2773) OlivierDehaene 2024-11-22 18:26:15 +0100
  • 7213d30141 fmt OlivierDehaene 2024-11-22 17:39:20 +0100
  • 690702b1ce fix tests OlivierDehaene 2024-11-22 16:09:14 +0100
  • bb87333d19 chore: prepare 2.4.1 release OlivierDehaene 2024-11-22 15:50:44 +0100
  • e87893d38e chore: Update to marlin-kernels 0.3.6 (#2771) Daniël de Kok 2024-11-22 15:44:47 +0100
  • 9025a26cea chore: remove unrelated change to trtllm Morgan Funtowicz 2024-11-22 15:42:09 +0100
  • 862a519fdd misc(doc): rust documentation Morgan Funtowicz 2024-11-22 15:35:55 +0100
  • b9c04b9c07 misc(doc): c++ documentation Morgan Funtowicz 2024-11-22 15:13:54 +0100
  • 4ee2ee58c9 misc(license): update LICENSE Morgan Funtowicz 2024-11-22 14:48:39 +0100
  • afb381033b Update to marlin-kernels 0.3.6 Daniël de Kok 2024-11-22 09:28:34 +0000
  • 2d9465d181 misc(backend): allow rebinding numa core affinity Morgan Funtowicz 2024-11-22 14:02:58 +0100
  • 30ae99631c misc(docker): add numa lib as dependency Morgan Funtowicz 2024-11-22 13:34:52 +0100
  • 5a85661661 feat(backend): rely on multi consumer queue to scheduler workers Morgan Funtowicz 2024-11-22 13:32:56 +0100
  • b6e3ffb037 Merge branch 'main' into feature/get-trace-id-from-req-headers Hyeongchan Kim 2024-11-22 13:25:25 +0900
  • 84eead219a feat(backend): correctly setup llama_context providing n_threads and n_ubatch Morgan Funtowicz 2024-11-21 21:43:50 +0100
  • ab7ccf5bc3 feat: add payload limit (#2726) OlivierDehaene 2024-11-21 19:20:15 +0100
  • e830508c20 update launcher OlivierDehaene 2024-11-21 19:13:35 +0100
  • d5bc6a20bd feat: Add automatic nightly benchmarks (#2591) Hugo Larcher 2024-11-21 18:11:42 +0100
  • d012f229c6 Remove guideline from API (#2762) Lucain 2024-11-21 17:56:38 +0100
  • c5b5b3a11c docs: Add a README section about using Nix (#2767) Daniël de Kok 2024-11-21 17:53:27 +0100
  • faa10ad0bc fix: tweak grammar test response (#2769) drbh 2024-11-21 11:46:00 -0500
  • 8e0c161d0a fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770) OlivierDehaene 2024-11-21 17:37:55 +0100
  • 489675b5e5 entries was wrongly extended for model that did not support chunking OlivierDehaene 2024-11-21 15:24:04 +0100
  • 322565d8f2 fix: tweak grammar test response drbh 2024-11-21 09:13:27 -0500
  • 4cbba33139 Incomplete generation stream fix (#2754) Wang, Yi 2024-11-21 22:06:26 +0800
  • 50c376612c feat(backend): bind thread and memory affinity for thread Morgan Funtowicz 2024-11-21 13:52:38 +0100
  • 3c54488638 nix: downgrade to outlines 0.1.3 (#2768) Daniël de Kok 2024-11-21 13:00:26 +0100
  • 2a68d6db09 nix: downgrade to outlines 0.1.3 Daniël de Kok 2024-11-21 11:21:23 +0000
  • 56e3b65c46 Add a README section about using Nix Daniël de Kok 2024-11-21 08:53:16 +0000
  • 6ee8d6dd3b fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766) drbh 2024-11-20 18:09:39 -0500
  • 5335bf973b feat(backend): multistream inference on CPU Morgan Funtowicz 2024-11-21 00:03:05 +0100
  • 613fa03b63 fix: set outlines version to 0.1.3 to avoid bug drbh 2024-11-20 16:57:08 -0500
  • 07bed530f7 nix: build and cache impure devshells (#2765) Daniël de Kok 2024-11-20 20:56:11 +0100
  • aa46309f8d Fix Nix build, disable pure shell (covered by Nix tests) Daniël de Kok 2024-11-20 19:30:30 +0000
  • 45c6ae6dd3 nix: add poetry to the impure shell Daniël de Kok 2024-11-20 18:59:17 +0000
  • 98db89b8b6 nix: build and cache all devshells Daniël de Kok 2024-11-20 18:43:31 +0000
  • 46a5a7e73e Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758) Daniël de Kok 2024-11-20 18:25:23 +0100
  • 2fda8845a7 nix: update for outlines 0.1.4 (#2764) Daniël de Kok 2024-11-20 18:24:29 +0100
  • 80cfe1b16c nix: update for outlines 0.1.4 Daniël de Kok 2024-11-20 16:17:12 +0000
  • 74a8a820ad Use FP8 KV cache when specified by compressed-tensors Daniël de Kok 2024-11-20 12:31:47 +0000
  • 45013b60a4 Install compressed-tensors in Docker CPU builds Daniël de Kok 2024-11-20 14:17:47 +0000
  • 87004ae711 Remove guideline from API Wauplin 2024-11-20 13:47:59 +0100
  • 5f52e2e38e entries.len() could > batch.size in prefill, so need to filter as well. Wang, Yi A 2024-11-19 23:27:45 -0800
  • bd6e8b3c13 fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) drbh 2024-11-19 15:10:22 -0500
  • 91fe29c1b1 fix: adjust llama MLP name from dense to mlp to correctly apply lora drbh 2024-11-19 14:51:46 -0500
  • 5489406c4a PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645) drbh 2024-11-19 13:31:59 -0500
  • 070af963f8 Add support for wNa16 int 2:4 compressed-tensors checkpoints Daniël de Kok 2024-11-19 13:49:11 +0000
  • 2007a9473a Update to moe-kernels 0.7.0 (#2720) Daniël de Kok 2024-11-19 14:55:29 +0100
  • 2b9d692831 Update to moe-kernels 0.7.0 Daniël de Kok 2024-11-04 15:04:04 +0000
  • b4ec427ad0 Simplify two ipex conditions (#2755) Daniël de Kok 2024-11-19 08:04:23 +0100
  • d49ce00f40 With this change, bucketing/padding of input is applied to health check. (#245) srajabos 2024-11-18 16:38:30 -0500
  • 38cff84a3e feat: support flash attention 2 in qwen2 vl vision blocks (#2721) drbh 2024-11-18 12:46:40 -0500
  • 3c9df21ff8 Add support for compressed-tensors w8a8 int checkpoints (#2745) Daniël de Kok 2024-11-18 17:20:31 +0100
  • c6393c5512 Simplify two ipex conditions Daniël de Kok 2024-11-18 16:18:59 +0000
  • a5ecd6e586 add ipex moe implementation to support Mixtral and PhiMoe (#2707) Wang, Yi 2024-11-19 00:16:55 +0800
  • 70409f09f4 fix: calc max_seqlen once and small refactors David Holtz 2024-11-18 15:34:08 +0000
  • fea62e928f fix: improve find_segments via numpy diff (#2686) drbh 2024-11-18 09:51:06 -0500
  • 05f98efc9d Small fixes Daniël de Kok 2024-11-18 14:49:59 +0000
  • 3eb6c1ccf8 Fix a typo Daniël de Kok 2024-11-18 15:45:52 +0100
  • e0018723fc Use marlin-kernels 0.3.5 Daniël de Kok 2024-11-18 12:43:12 +0000
  • 53b6f6e604 Apply suggestions from code review ipex-moe Wang, Yi 2024-11-18 19:28:07 +0800
  • f76c0ff17f Always use dynamic input quantization for w8a8 int Daniël de Kok 2024-11-18 10:54:51 +0000
  • b2dc10aea5 Add support for compressed-tensors w8a8 int checkpoints Daniël de Kok 2024-11-14 11:00:29 +0000
  • e0e39fa0d9 Merge branch 'main' into moe Wang, Yi 2024-11-18 09:45:05 +0800
  • 52e48739a5 Remove vLLM dependency for CUDA (#2751) Daniël de Kok 2024-11-17 17:34:50 +0100
  • 6489f85269 feat: return streaming errors as an event formatted for openai's client (#2668) drbh 2024-11-15 08:49:19 -0500
  • d8f1203bcb Small lifting. Nicolas Patry 2024-11-15 14:48:23 +0100
  • 110d154777 Fix clippy warning Daniël de Kok 2024-11-15 13:44:26 +0000
  • 5d9613e0c5 Revert "Reworked the implementation." Nicolas Patry 2024-11-15 14:27:16 +0100
  • df72deac26 Reworked the implementation. Nicolas Patry 2024-11-15 20:24:47 +0700
  • 22d205aa47 Revert "fix: improve streamin error to include error_type" drbh 2024-10-25 11:55:44 -0400
  • a9c8c6a0d7 fix: improve streamin error to include error_type David Holtz 2024-10-25 14:35:25 +0000
  • 21378b325b fix: improve stream api error format and add status code drbh 2024-10-22 11:59:14 -0400
  • 0ae84e5473 fix: propagate completions error events to stream drbh 2024-10-22 09:53:15 -0400
  • 84cd8434b0 feat: return streaming errors as an event formatted for openai's client drbh 2024-10-18 14:15:27 -0400
  • dfc00f7fb3 Remove vLLM dependency for CUDA Daniël de Kok 2024-11-15 12:31:30 +0000
  • 34a3bdedc3 Upgrading our deps. (#2750) Nicolas Patry 2024-11-15 21:03:27 +0800
  • b52d6332e4 Fixup. Nicolas Patry 2024-11-15 13:45:22 +0100
  • 8dffe1ca08 fixup. Nicolas Patry 2024-11-15 13:33:47 +0100
  • 1623a56544 Upgrading our deps. Nicolas Patry 2024-11-15 13:26:06 +0100
  • 4580ced091 Upgrade outlines to 0.1.1 (#2742) Alex Weston 2024-11-15 07:22:52 -0500
  • 003eaec0fb fix response type of document for Text Generation Inference (#2743) jito 2024-11-15 21:21:50 +0900
  • 4f4857a4ac Fix: Change embeddings to embedding (#2738) Billel Mokeddem 2024-11-15 16:16:15 +0400
  • f9ee46f740 Fix: Change model_type from ssm to mamba (#2740) Billel Mokeddem 2024-11-15 16:15:36 +0400
  • 8442f1ac85 benchmark: fix prefill throughput (#2741) Daniël de Kok 2024-11-15 13:14:55 +0100
  • ca4f46ddfc nix: update nixpkgs (#2746) Daniël de Kok 2024-11-14 18:48:20 +0100
  • c908aab440 nix: update nixpkgs Daniël de Kok 2024-11-14 16:33:04 +0000
  • 23d2bcf28d misc(build): improve build process Morgan Funtowicz 2024-11-14 09:38:13 +0100
  • 70c90ad933 feat(backend): update llamacpp to 4077 Morgan Funtowicz 2024-11-14 09:04:06 +0100
  • 6f059c4b5d feat(backend): wrap Arc tokenizer to avoid duplicating Morgan Funtowicz 2024-11-14 08:41:38 +0100
  • 57b215467b feat(backend): simplify Rust callback Morgan Funtowicz 2024-11-13 00:22:11 +0100