text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-09-19 00:04:51 +00:00

Author	SHA1	Message	Date
jkaniecki	56f00a552b	Adjust warmup to all possible bucket sizes and decode batch size = 1 (#113)	2024-03-27 11:59:51 +01:00
Karol Damaszke	9796b0e10d	Add simple continuous batching benchmark (#108) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>	2024-03-26 09:17:55 +01:00
regisss	7f58680999	Add `docker pull` command in README (#110)	2024-03-25 15:44:54 +01:00
jkaniecki	2b1581edac	Warmup greedy search in next token chooser (#109)	2024-03-22 23:43:20 +01:00
Wang, Yi	d752317b5f	Correct input_length since habana extend input_length to max_input_length (#103) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-03-18 15:23:13 +01:00
Karol Damaszke	b45f648483	Add warmup for logits processors (#107) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-18 15:17:47 +01:00
jkaniecki	8504f9c41c	Improve README clarity (#106)	2024-03-18 15:15:07 +01:00
yuanwu2017	a4d5c3f40f	Fix the generate_stream crash in concurrent query (#105) Signed-off-by: yuanwu <yuan.wu@intel.com>	2024-03-15 10:54:56 +01:00
Wang, Yi	3d81a80577	Fix incorrect setting of max_new_tokens in warmup (#104) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-03-13 16:19:40 +01:00
Yao Matrix	7149ac30e6	Fix the issue of out of range (#98) Signed-off-by: yuanwu <yuan.wu@intel.com> Co-authored-by: yuanwu <yuan.wu@intel.com>	2024-03-13 10:09:53 +01:00
jkaniecki	602a920ec5	Update nix version (#102)	2024-03-11 16:21:04 +01:00
Karol Damaszke	365f277900	Clean-up README (#96) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-10 22:02:15 +01:00
Karol Damaszke	8e14780bf4	Wait 2sec once shard is ready to improve stability (#92) (#94) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-03-04 12:17:24 +01:00
Karol Damaszke	80ae9ead28	Set MAX_TOTAL_TOKENS automatically (#91) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-03-01 11:25:15 +01:00
Karol Damaszke	a5c788cfe4	Remove redundant fill op (#83) (#90) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-03-01 01:32:02 +01:00
Karol Damaszke	03c2123244	Use batched index_copy (#73) (#89) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 15:45:16 +01:00
Karol Damaszke	8f6564ce0e	Heap based router queue (#63) (#88) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>	2024-02-29 10:56:26 +01:00
Karol Damaszke	7dbf4bf7a4	Improve tensor slicing performance (#66) (#87) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-29 10:48:54 +01:00
Karol Damaszke	3831f1bed5	Add warmup for shift operation (#59) (#86)	2024-02-29 09:19:28 +01:00
Karol Damaszke	022ce1eaaf	Overhead reduction (#58) (#85) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com>	2024-02-29 09:17:45 +01:00
Karol Damaszke	212136dff8	Log exceptions to debug.log (#52) (#84) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 09:14:42 +01:00
Karol Damaszke	c7ccfb87ff	Grouped pad/shift/move operations (#57) (#82) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-29 04:16:44 +01:00
Karol Damaszke	2122acc60f	Add warmup for all possible shapes for prefill #49 (#81)	2024-02-28 10:40:13 +01:00
Karol Damaszke	31bed905d4	Update habana profiler (#50) (#80) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-28 09:57:40 +01:00
Karol Damaszke	d31fb62576	Add more info to high-level profiler events (#46) (#79) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>	2024-02-28 09:55:50 +01:00
Karol Damaszke	941d36f3fd	Enable deferred token generation (#44) (#75) Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>	2024-02-27 15:46:40 +01:00
Karol Damaszke	6248c5610e	Revert "Prefer prefill instead of decode when max_waiting_tokens==0 (#18)" (#45) (#76) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-27 11:56:45 +01:00
jkaniecki	83b059bd27	Bulk shifting (#40) (#70) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-26 17:29:56 +01:00
regisss	8f4aba6ad3	Update dependencies (#69)	2024-02-25 13:07:47 +01:00
jkaniecki	c3bd8ef445	Add Fp8 support (#42) (#71) Co-authored-by: mrs303 <54661797+mrs303@users.noreply.github.com> Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com> Co-authored-by: Grzegorz Morys <gmorys@habana.ai>	2024-02-23 11:52:28 +01:00
jkaniecki	a490847702	Sequence bucketing for prefill (#39) (#67) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-23 01:52:14 +01:00
jkaniecki	8eb88a7d75	Bump rust version (#41) (#68) Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>	2024-02-22 16:08:34 +01:00
jkaniecki	9ad6086250	Improve habana profile dev experience (#36) (#65) Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>	2024-02-22 13:57:45 +01:00
jkaniecki	f7ef414e38	Remove unused pad_token_id for filter (#35) (#64) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-22 11:24:09 +01:00
jkaniecki	8f590759e3	Prefill optimization by allocating space only for the first output token (#34) (#62) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com> Co-authored-by: Karol Damaszke <karol.damaszke@intel.com>	2024-02-22 04:55:43 +01:00
jkaniecki	80303b469c	Do not limit hpu graphs by default (#32) (#61) Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>	2024-02-21 15:38:00 +01:00
jkaniecki	6b6dec9ea1	Transparent tokenizer uses explicit int32 (#31) (#60) Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com>	2024-02-21 14:24:41 +01:00
regisss	a4d3a00d98	Fix dependencies (#56)	2024-02-19 10:19:23 +01:00
regisss	dca9ac6508	Revert "Solve dependency issue" This reverts commit `ea2b93dd75`.	2024-02-19 07:28:04 +00:00
regisss	ea2b93dd75	Solve dependency issue	2024-02-19 07:26:37 +00:00
regisss	2060bb58bf	Fix trust remote code (#55)	2024-02-19 07:53:24 +01:00
Karol Damaszke	2a7a967de3	Revert prefill optimization and fix accuracy issue in shift operation (#29) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com> Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>	2024-01-23 15:19:07 +01:00
jkaniecki	ac3bc0e95e	Removed kv_cache from HPU graph output (#19)	2024-01-19 15:34:13 +01:00
mrs303	da0f874d49	Prefer prefill instead of decode when max_waiting_tokens==0 (#18)	2024-01-19 15:25:40 +01:00
Karol Damaszke	60f63262db	Prefill optimization by allocating space only for the first token (#17)	2024-01-19 15:18:35 +01:00
Adam Stachowicz	0b96da89aa	Make tokenizer optional (#12)	2024-01-19 15:12:04 +01:00
madamczykhabana	381ec38cad	Batch bucketing improvements (#15)	2024-01-17 10:09:27 +01:00
mrs303	8523f7ef64	Deepspeed terminate (#11)	2024-01-17 09:57:03 +01:00
Krzysztof Laskowski	c459c86f88	High-level server profiler (#13)	2024-01-16 09:57:29 +01:00
madamczykhabana	41c4f4fa41	Debugging utils (#14)	2024-01-15 21:05:27 +01:00

597 Commits