text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-04-24 00:12:08 +00:00

Author	SHA1	Message	Date
Morgan Funtowicz	7eec0f704f	chore(backend): minor fixes mostly format	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	a1154b17ec	feat(backend): avoid copy constructor	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	588421833c	misc(backend): missing header <variant>	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	62dba1a878	misc(cmake): use url deps and not git repo	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	52208f5b78	misc(backend): decrease log verbosity in callback	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	1149186794	feat(backend): expose tokenizer to the GenerationContext to decode token	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	1473259f84	feat(backend): add early stopping criteria from TGI stream callback	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	958c72a44a	misc(ffi): remove unused ffi mapping	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	5b7a951389	feat(backend): refactor the callback to handle intermediate and end inference message	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	11c593dc69	feat(backend): make eog clearer on c++ side	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	06424aa9ff	feat(backend): correctly handle the max_new_tokens case for is_eos	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	05ff551950	feat(backend): add number of generated tokens in the callback	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	188442f67d	misc(lint): make clippy happier	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	31d9254776	feat(backend): remove static from inner_fw visitor as it leads to invalid memory locations	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	7b0a56f40f	feat(backend): fix memory leaking on llama_sampler when the decode ends	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	86a2ae6ba2	chore: unsued variables	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	2cdfed94d9	feat(backend): correctly link to shared fmt and spdlog instead of static	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	bd8f0f15e1	feat(backend): fix invalid reference to ctx instead of context in release build	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	3e82f14f57	feat(backend): somewhat generates the final infer response	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	b50dcddbb8	feat(backend): avoid dropping the boxed stream at the end of the callback	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	612f2f939f	feat(backend): bind incoming request to the server	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d4aee42fd8	feat(backend): add logit parameter in the callback fn	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f39edc72ff	feat(backend): add mapping for ignore_eos_token stopping criteria	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	3af2c6837c	misc(offline): match rework	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d52b4c4978	feat(backend): full rework of the backend internal to safer c++	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	6a5f6b0755	misc(offline): update offline tester	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	b98c635781	feat(backend): entirely rewrite backend	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	611590440d	misc(offline): expose more parameters for generate	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	dbc5b7a0f7	misc(offline): link correctly	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	0c1dd0ed2b	feat(llamacpp): wip explosion	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	a316c53255	feat(llamacpp): expose number of threads for the backend when constructing the model	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	179309b364	misc(build): refactor build type detection in cmake	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f0859c247f	misc(build): handle different lib destination folder lib/lib64	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	e4d803c94e	feat(backend): build and link through build.rs	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	355d8a55b4	feat(backend): wip Rust binding	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f9c248657d	chore(backend): minor formatting	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	37faeb34b2	feat(backend): expose frequency and repetition penalties	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d4b5be10f9	feat(backend): minor refactor	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	92bb113653	feat(backend): use llama_token as TokenId type	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	45d5a6a8c5	feat(backend): add some initial decoding steps	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	098c66920d	feat(backend): tell cmake to build llama-common and link to it	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	0911076320	feat(backend): correctly load llama.cpp model from llama api and not gpt2	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	05ad684676	feat(llamacpp): enable cuda	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	fa89d1e613	misc(cmake): wut	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	e4432d36b1	misc(cmake): add parameter to build specific cuda arch	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	52d57dca79	feat(llamacpp): initial end2end build	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	7d1f8a2bd6	feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	aa1fcba59f	feat(llamacpp): initial commit # Conflicts: # Cargo.lock	2024-11-14 08:42:01 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
Funtowicz Morgan	ba5fc7d922	Add support for stop words in TRTLLM (#2678) * feat(trtllm): rewrite health to not account for current state * chore(looper): cleanup a bit more * feat(post_processing): max_new_tokens is const evaluated now * chore(ffi):formatting * feat(trtllm): add stop words handling # Conflicts: # backends/trtllm/lib/backend.cpp * chore(trtllm): create specific parallelconfig factory and logging init methods * chore(trtllm): define a macro for SizeType cast * chore(trtllm): use GetParallelConfig * chore(trtllm): minor refactoring * chore(trtllm): validate there are enough GPus on the system for the desired model * chore(trtllm): ensure max throughput scheduling policy is selected * chore(trtllm): minor fix * chore(router): minor refactorings * feat(docker): build with-slurm ompi * feat(docker): add python3.10 dev to runtime deps * chore(docker): add mpi to ld_library_path * chore(docker): install transformers * feat(trtllm): detect stop_words from generation_config.json	2024-10-25 10:58:34 +02:00

74 Commits