Commit Graph

192 Commits

Author SHA1 Message Date
OlivierDehaene
ecb0db45af fix: fix logic if sliding window key is not present in config (#1352) 2024-04-19 14:56:10 +03:00
OlivierDehaene
a95e6d603d feat: relax mistral requirements (#1351)
Close #1253
Close #1279
2024-04-19 14:50:24 +03:00
OlivierDehaene
bb6200503c fix: max_past default value must be -1, not 0 (#1348) 2024-04-19 14:18:05 +03:00
OlivierDehaene
5c9ef069ed feat: add more latency metrics in forward (#1346) 2024-04-19 13:41:34 +03:00
OlivierDehaene
c974437ba7 fix: fix gpt-q params loading 2024-04-19 12:12:50 +03:00
OlivierDehaene
f9b58ac7a1 feat: add quant to mixtral (#1337) 2024-04-18 16:32:50 +03:00
OlivierDehaene
09c556dbd7 v1.3.1 2024-04-18 16:32:07 +03:00
OlivierDehaene
79f268f95a chore: formatting 2024-04-18 16:26:00 +03:00
OlivierDehaene
9aef902982 feat: mixtral (#1328) 2024-04-18 12:39:52 +00:00
Nicolas Patry
a7f52f3812 Speculative (#1308) 2024-04-18 12:39:39 +00:00
Karol Damaszke
30cc78773e Skip server tests of not enabled models (#125)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-09 14:15:41 +02:00
Karol Damaszke
d957e32601 Add Habana copyright header (#122)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-08 18:06:21 +02:00
Karol Damaszke
b0de25a285 Don't set rope_scaling for unsupported models (#115)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-04-02 12:12:02 +02:00
Karol Damaszke
7342baa2eb Add support for rope_scaling and remove is_optimized_for_gaudi (#112)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-29 15:07:32 +01:00
Karol Damaszke
bf5263b88b Disable watermark with FP8 quantization (#114)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-27 13:32:20 +01:00
jkaniecki
56f00a552b Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113) 2024-03-27 11:59:51 +01:00
Karol Damaszke
b45f648483 Add warmup for logits processors (#107)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-18 15:17:47 +01:00
yuanwu2017
a4d5c3f40f Fix the generate_stream crash in concurrent query (#105)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-03-15 10:54:56 +01:00
Karol Damaszke
80ae9ead28 Set MAX_TOTAL_TOKENS automatically (#91)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
2024-03-01 11:25:15 +01:00
Karol Damaszke
a5c788cfe4 Remove redundant fill op (#83) (#90)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-03-01 01:32:02 +01:00
Karol Damaszke
03c2123244 Use batched index_copy (#73) (#89)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 15:45:16 +01:00
Karol Damaszke
7dbf4bf7a4 Improve tensor slicing performance (#66) (#87)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-29 10:48:54 +01:00
Karol Damaszke
3831f1bed5 Add warmup for shift operation (#59) (#86) 2024-02-29 09:19:28 +01:00
Karol Damaszke
022ce1eaaf Overhead reduction (#58) (#85)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
2024-02-29 09:17:45 +01:00
Karol Damaszke
212136dff8 Log exceptions to debug.log (#52) (#84)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 09:14:42 +01:00
Karol Damaszke
c7ccfb87ff Grouped pad/shift/move operations (#57) (#82)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-29 04:16:44 +01:00
Karol Damaszke
2122acc60f Add warmup for all possible shapes for prefill #49 (#81) 2024-02-28 10:40:13 +01:00
Karol Damaszke
31bed905d4 Update habana profiler (#50) (#80)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-28 09:57:40 +01:00
Karol Damaszke
941d36f3fd Enable deferred token generation (#44) (#75)
Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
2024-02-27 15:46:40 +01:00
jkaniecki
83b059bd27 Bulk shifting (#40) (#70)
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
2024-02-26 17:29:56 +01:00
jkaniecki
c3bd8ef445 Add Fp8 support (#42) (#71)
Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>
Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>
Co-authored-by: Grzegorz Morys <gmorys@habana.ai>
2024-02-23 11:52:28 +01:00
jkaniecki
a490847702 Sequence bucketing for prefill (#39) (#67)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-23 01:52:14 +01:00
jkaniecki
9ad6086250 Improve habana profile dev experience (#36) (#65)
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
2024-02-22 13:57:45 +01:00
jkaniecki
8f590759e3 Prefill optimization by allocating space only for the first output token (#34) (#62)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>
2024-02-22 04:55:43 +01:00
jkaniecki
80303b469c Do not limit hpu graphs by default (#32) (#61)
Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
2024-02-21 15:38:00 +01:00
Karol Damaszke
2a7a967de3 Revert prefill optimization and fix accuracy issue in shift operation (#29)
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
2024-01-23 15:19:07 +01:00
jkaniecki
ac3bc0e95e Removed kv_cache from HPU graph output (#19) 2024-01-19 15:34:13 +01:00
Karol Damaszke
60f63262db Prefill optimization by allocating space only for the first token (#17) 2024-01-19 15:18:35 +01:00
Adam Stachowicz
0b96da89aa Make tokenizer optional (#12) 2024-01-19 15:12:04 +01:00
madamczykhabana
381ec38cad Batch bucketing improvements (#15) 2024-01-17 10:09:27 +01:00
mrs303
8523f7ef64 Deepspeed terminate (#11) 2024-01-17 09:57:03 +01:00
madamczykhabana
41c4f4fa41 Debugging utils (#14) 2024-01-15 21:05:27 +01:00
Karol Damaszke
a8c5b69e2c Set default value of LIMIT_HPU_GRAPH to True (#7) 2024-01-11 14:51:49 +01:00
Karol Damaszke
252ccde104 Control prefill and decode batch size separately (#6) 2024-01-02 18:21:01 +01:00
Karol Damaszke
1be2d9a8ec Batch size bucketing (#5) 2023-12-22 21:53:01 +01:00
jkaniecki
e3dcd7f2c2 Disable tensor caching in HPU Graph execution (#4) 2023-12-22 13:51:16 +01:00
Karol Damaszke
6436ae86a1 Fix for continuous batching (#1) 2023-12-11 09:24:09 +01:00
regisss
e5f124b077 Merge tag 'v1.2.0' into v1.2-release 2023-12-06 18:46:16 +01:00
regisss
c09066aeb1 Merge tag 'v1.1.1' into v1.1-release 2023-12-06 09:50:58 +01:00
regisss
cc744ba426 Add changes from Optimum Habana's TGI folder 2023-12-05 11:12:16 +01:00