* Fix unsigned integer underflow
Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow becoming a largo positive number.
In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that funcion is
```
if Some(batch_requests.len()) == max_size {
break;
}
```
and it's called after the `batch_requests.len()` has
become 1, it doesn't do anything to prevent more than 0
requests from being batched.
Now we have cached batch in the server that is large than
max_batch_size and `max_size - batch_size as usize`
underflows.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
* fix: update v3 scheduler and ensure max_batch_size > 0
---------
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md
Fixed a typo
* Update lora.md
Fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
- Refactor code to allow supporting multiple versions of the
generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now but allow for
future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that
will be different in the future
- Add SchedulerV2/V3 impl