* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100.
* Adding a few more cards.
* Logprobs cost too much.
* Better name for h100, and keep the factor of 2.
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* Chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
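For readers following the list above, here is a rough sketch of how a prefill token budget could be derived from card throughput and shard count. It is not the code from these commits: the function names, the `budget_seconds` parameter, and the idea of a fixed time budget are all assumptions made for illustration. The one fact it relies on is that NVIDIA datasheets for A100 and H100 quote structured-sparsity throughput at twice the dense figure (for example 624 vs. 312 BF16 TFLOPS on A100), which is why an advertised sparse number has to be halved.

```python
def dense_tflops(advertised_tflops: float, is_sparse_figure: bool) -> float:
    """Return the dense throughput for a card (hypothetical helper).

    Datasheet "with sparsity" numbers are 2x the dense number, so they
    are halved before being used for a dense prefill estimate.
    """
    return advertised_tflops / 2 if is_sparse_figure else advertised_tflops


def max_prefill_tokens(
    flops_per_token: float,
    tflops_per_card: float,
    num_shards: int,
    budget_seconds: float,
) -> int:
    """Estimate how many prompt tokens fit in a compute budget (hypothetical).

    Assumes perfect scaling across shards and 100% utilization, which a
    real deployment will not reach; treat the result as an upper bound.
    """
    available_flops = tflops_per_card * 1e12 * num_shards * budget_seconds
    return int(available_flops // flops_per_token)


# Example with made-up model cost: 2 GFLOPs per token, two 1-TFLOPS shards,
# one second of budget.
tokens = max_prefill_tokens(
    flops_per_token=2e9, tflops_per_card=1.0, num_shards=2, budget_seconds=1.0
)
```

The per-token FLOPs figure is left as an input here; the commits above mention validating that calculation against fvcore rather than hard-coding it.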
Update the Mixtral GPTQ test to use a model with `desc_act=true` and
`group_size!=-1` to ensure that we are checking activation
sorting/non-full K (with tensor parallelism). The `desc_act=false` case
is already checked by the Mixtral AWQ test.
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
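The restrictions above can be read as a small validation rule. The sketch below is illustrative only: `GptqMoeConfig` and `unsupported_reasons` are hypothetical names, not part of this change, and `world_size > 1` is assumed to be what "with tensor parallelism" means. The field names mirror the usual GPTQ quantization config keys (`desc_act`, `group_size`, `sym`).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GptqMoeConfig:
    """Hypothetical container for the settings named in the list above."""

    desc_act: bool
    group_size: int  # -1 means a single group spanning the full K dimension
    sym: bool  # True for symmetric quantization
    quant_method: str = "gptq"


def unsupported_reasons(cfg: GptqMoeConfig, world_size: int) -> List[str]:
    """Return why a config falls outside the supported set (empty if it is fine)."""
    reasons = []
    if cfg.quant_method != "gptq":
        reasons.append(f"quant_method={cfg.quant_method!r} is not supported")
    if not cfg.sym:
        reasons.append("asymmetric quantization is not supported")
    # desc_act is only a problem when sharded, and only with real groups.
    if cfg.desc_act and world_size > 1 and cfg.group_size != -1:
        reasons.append(
            "desc_act with tensor parallelism requires group_size=-1"
        )
    return reasons


# desc_act on two shards with group_size=128 is rejected under these rules.
print(unsupported_reasons(GptqMoeConfig(True, 128, True), world_size=2))
```

Returning a list of reasons rather than a boolean makes it easy to surface every violated restriction in one error message.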