Author | Commit | Message | Date
Wang, Yi | 3d81a80577 | Fix incorrect setting of max_new_tokens in warmup (#104) | 2024-03-13 16:19:40 +01:00
    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Yao Matrix | 7149ac30e6 | Fix the issue of out of range (#98) | 2024-03-13 10:09:53 +01:00
    Signed-off-by: yuanwu <yuan.wu@intel.com>
    Co-authored-by: yuanwu <yuan.wu@intel.com>
jkaniecki | 602a920ec5 | Update nix version (#102) | 2024-03-11 16:21:04 +01:00
Karol Damaszke | 365f277900 | Clean-up README (#96) | 2024-03-10 22:02:15 +01:00
    Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Karol Damaszke | 8e14780bf4 | Wait 2sec once shard is ready to improve stability (#92) (#94) | 2024-03-04 12:17:24 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Karol Damaszke | 80ae9ead28 | Set MAX_TOTAL_TOKENS automatically (#91) | 2024-03-01 11:25:15 +01:00
    Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Karol Damaszke | a5c788cfe4 | Remove redundant fill op (#83) (#90) | 2024-03-01 01:32:02 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Karol Damaszke | 03c2123244 | Use batched index_copy (#73) (#89) | 2024-02-29 15:45:16 +01:00
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Karol Damaszke | 8f6564ce0e | Heap based router queue (#63) (#88) | 2024-02-29 10:56:26 +01:00
    Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
Karol Damaszke | 7dbf4bf7a4 | Improve tensor slicing performance (#66) (#87) | 2024-02-29 10:48:54 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Karol Damaszke | 3831f1bed5 | Add warmup for shift operation (#59) (#86) | 2024-02-29 09:19:28 +01:00
Karol Damaszke | 022ce1eaaf | Overhead reduction (#58) (#85) | 2024-02-29 09:17:45 +01:00
    Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
Karol Damaszke | 212136dff8 | Log exceptions to debug.log (#52) (#84) | 2024-02-29 09:14:42 +01:00
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Karol Damaszke | c7ccfb87ff | Grouped pad/shift/move operations (#57) (#82) | 2024-02-29 04:16:44 +01:00
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Karol Damaszke | 2122acc60f | Add warmup for all possible shapes for prefill #49 (#81) | 2024-02-28 10:40:13 +01:00
Karol Damaszke | 31bed905d4 | Update habana profiler (#50) (#80) | 2024-02-28 09:57:40 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Karol Damaszke | d31fb62576 | Add more info to high-level profiler events (#46) (#79) | 2024-02-28 09:55:50 +01:00
    Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Karol Damaszke | 941d36f3fd | Enable deferred token generation (#44) (#75) | 2024-02-27 15:46:40 +01:00
    Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
Karol Damaszke | 6248c5610e | Revert "Prefer prefill instead of decode when max_waiting_tokens==0 (#18)" (#45) (#76) | 2024-02-27 11:56:45 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
jkaniecki | 83b059bd27 | Bulk shifting (#40) (#70) | 2024-02-26 17:29:56 +01:00
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
regisss | 8f4aba6ad3 | Update dependencies (#69) | 2024-02-25 13:07:47 +01:00
jkaniecki | c3bd8ef445 | Add Fp8 support (#42) (#71) | 2024-02-23 11:52:28 +01:00
    Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
    Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
    Co-authored-by: Grzegorz Morys <gmorys@habana.ai>
jkaniecki | a490847702 | Sequence bucketing for prefill (#39) (#67) | 2024-02-23 01:52:14 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
jkaniecki | 8eb88a7d75 | Bump rust version (#41) (#68) | 2024-02-22 16:08:34 +01:00
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
jkaniecki | 9ad6086250 | Improve habana profile dev experience (#36) (#65) | 2024-02-22 13:57:45 +01:00
    Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
jkaniecki | f7ef414e38 | Remove unused pad_token_id for filter (#35) (#64) | 2024-02-22 11:24:09 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
jkaniecki | 8f590759e3 | Prefill optimization by allocating space only for the first output token (#34) (#62) | 2024-02-22 04:55:43 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
    Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>
jkaniecki | 80303b469c | Do not limit hpu graphs by default (#32) (#61) | 2024-02-21 15:38:00 +01:00
    Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
jkaniecki | 6b6dec9ea1 | Transparent tokenizer uses explicit int32 (#31) (#60) | 2024-02-21 14:24:41 +01:00
    Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
regisss | a4d3a00d98 | Fix dependencies (#56) | 2024-02-19 10:19:23 +01:00
regisss | dca9ac6508 | Revert "Solve dependency issue" | 2024-02-19 07:28:04 +00:00
    This reverts commit ea2b93dd75.
regisss | ea2b93dd75 | Solve dependency issue | 2024-02-19 07:26:37 +00:00
regisss | 2060bb58bf | Fix trust remote code (#55) | 2024-02-19 07:53:24 +01:00
Karol Damaszke | 2a7a967de3 | Revert prefill optimization and fix accuracy issue in shift operation (#29) | 2024-01-23 15:19:07 +01:00
    Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
    Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
    Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
jkaniecki | ac3bc0e95e | Removed kv_cache from HPU graph output (#19) | 2024-01-19 15:34:13 +01:00
mrs303 | da0f874d49 | Prefer prefill instead of decode when max_waiting_tokens==0 (#18) | 2024-01-19 15:25:40 +01:00
Karol Damaszke | 60f63262db | Prefill optimization by allocating space only for the first token (#17) | 2024-01-19 15:18:35 +01:00
Adam Stachowicz | 0b96da89aa | Make tokenizer optional (#12) | 2024-01-19 15:12:04 +01:00
madamczykhabana | 381ec38cad | Batch bucketing improvements (#15) | 2024-01-17 10:09:27 +01:00
mrs303 | 8523f7ef64 | Deepspeed terminate (#11) | 2024-01-17 09:57:03 +01:00
Krzysztof Laskowski | c459c86f88 | High-level server profiler (#13) | 2024-01-16 09:57:29 +01:00
madamczykhabana | 41c4f4fa41 | Debugging utils (#14) | 2024-01-15 21:05:27 +01:00
Karol Damaszke | a8c5b69e2c | Set default value of LIMIT_HPU_GRAPH to True (#7) | 2024-01-11 14:51:49 +01:00
Harish Subramony | 532e4b8d41 | Readme updates with review comments (#8) | 2024-01-11 10:12:43 +01:00
Harish Subramony | cb8b7610c0 | Update README for proper usage of LIMIT_HPU_GRAPH (#3) | 2024-01-09 14:49:15 -08:00
Karol Damaszke | 252ccde104 | Control prefill and decode batch size separately (#6) | 2024-01-02 18:21:01 +01:00
Karol Damaszke | 1be2d9a8ec | Batch size bucketing (#5) | 2023-12-22 21:53:01 +01:00
jkaniecki | e3dcd7f2c2 | Disable tensor caching in HPU Graph execution (#4) | 2023-12-22 13:51:16 +01:00
Karol Damaszke | b1897acfd6 | Calculate token budget with padding to max_input_length (#2) | 2023-12-11 09:24:27 +01:00
jkaniecki | 6436ae86a1 | Fix for continuous batching (#1) | 2023-12-11 09:24:09 +01:00