mirror of https://github.com/huggingface/text-generation-inference.git
synced 2025-09-08 19:04:52 +00:00
Disable Cachix pushes (#3312)
* Disable Cachix pushes

  This is not safe until we have sandboxed builds. For TGI alone this might not be a huge issue, but with Cachix caching disabled in hf-nix, TGI CI would build all the packages and push them to our cache.

* fix: bump docs

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
This commit is contained in:
parent 8801ba12cf
commit 06d9d88b95
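The change itself is small: the `authToken` input of the Cachix action is commented out in each workflow, so CI can still pull from the `huggingface` cache but no longer has credentials to push to it. A minimal sketch of what such a pull-only step looks like (the action version and the surrounding step layout here are assumptions, not the exact workflow contents):

```yaml
# Sketch of a pull-only Cachix step in a GitHub Actions workflow.
# Without authToken, the runner has no write access to the cache,
# so nothing it builds can be pushed.
- uses: cachix/cachix-action@v14
  with:
    name: huggingface
    # If you chose signing key for write access
    # authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
  env:
    USER: github_runner
```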
.github/workflows/nix_build.yaml (2 changed lines)

@@ -23,7 +23,7 @@ jobs:
         with:
           name: huggingface
           # If you chose signing key for write access
-          authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
+          # authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
         env:
           USER: github_runner
       - name: Build
.github/workflows/nix_cache.yaml (2 changed lines)

@@ -22,7 +22,7 @@ jobs:
         with:
           name: huggingface
           # If you chose signing key for write access
-          authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
+          #authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
         env:
           USER: github_runner
       - name: Build impure devshell
.github/workflows/nix_tests.yaml (4 changed lines)

@@ -27,9 +27,11 @@ jobs:
         with:
           name: huggingface
           # If you chose signing key for write access
-          authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
+          #authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
+        env:
+          USER: github_runner
       - name: Nix info
         run: nix-shell -p nix-info --run "nix-info -m"
       - name: Build
         run: nix develop .#test --command echo "Ok"
       - name: Pre-commit tests.
@@ -58,8 +58,6 @@ Options:
           Quantization method to use for the model. It is not necessary to specify this option for pre-quantized models, since the quantization method is read from the model configuration.

           Marlin kernels will be used automatically for GPTQ/AWQ models.

-          [env: QUANTIZE=]
-
           Possible values:
           - awq: 4 bit quantization. Requires a specific AWQ quantized model: <https://hf.co/models?search=awq>. Should replace GPTQ models wherever possible because of the better latency

@@ -72,6 +70,8 @@ Options:
           - bitsandbytes-nf4: Bitsandbytes 4bit. Can be applied on any model, will cut the memory requirement by 4x, but it is known that the model will be much slower to run than the native f16
           - bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for you model
           - fp8: [FP8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) (e4m3) works on H100 and above This dtype has native ops should be the fastest if available. This is currently not the fastest because of local unpacking + padding to satisfy matrix multiplication limitations
+
+          [env: QUANTIZE=]

 ```
 ## SPECULATE
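The quantization options above map directly onto the launcher command line or the `QUANTIZE` environment variable. A hypothetical invocation (the model id is only an example) that selects 4-bit AWQ quantization:

```shell
# Illustrative only: pick an AWQ-quantized checkpoint and enable AWQ kernels.
# Equivalently: QUANTIZE=awq text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-AWQ
text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-AWQ --quantize awq
```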
@@ -456,14 +456,14 @@ Options:
 ```shell
       --usage-stats <USAGE_STATS>
           Control if anonymous usage stats are collected. Options are "on", "off" and "no-stack" Defaul is on

-          [env: USAGE_STATS=]
-          [default: on]
-
           Possible values:
           - on: Default option, usage statistics are collected anonymously
           - off: Disables all collection of usage statistics
           - no-stack: Doesn't send the error stack trace or error type, but allows sending a crash event
+
+          [env: USAGE_STATS=]
+          [default: on]

 ```
 ## PAYLOAD_LIMIT
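Likewise, the usage statistics behaviour documented above is selected with `--usage-stats` (or the `USAGE_STATS` environment variable). A hypothetical example that keeps crash reporting but drops stack traces and error types (the model id is a placeholder):

```shell
# Illustrative only: report crash events without stack traces or error types.
text-generation-launcher --model-id bigscience/bloom-560m --usage-stats no-stack
```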