Commit Graph

51 Commits

Author SHA1 Message Date
Nicolas Patry
8e92942a18
Making tool_calls a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Nicolas Patry
3208d1cd1d
Revert "Trying to reduce the logs in the case of errors."
This reverts commit cdf70d6a28.
2025-03-05 20:52:38 +01:00
Nicolas Patry
cdf70d6a28
Trying to reduce the logs in the case of errors. 2025-03-05 20:50:43 +01:00
David Corvoysier
5eec3a8bb6
Avoid running neuron integration tests twice (#3054)
* test(neuron): refactor to prepare batch export

* test(neuron): add helper to batch export models

Also rename fixture file fro clarity.

* ci(neuron): do not run tests twice

* ci(neuron): rename precompilation job

* test(neuron): remove redundant subdirectory

* test(neuron): remove erroneous line

* doc(neuron): update links to installation page

* feat(neuron): cleanup Dockerfile

CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse is not required anymore.

* test(neuron): try to reduce download errors
2025-02-26 12:15:01 +01:00
drbh
b0069e0485
fix: run linters and fix formatting (#3057) 2025-02-25 16:11:34 -05:00
Nicolas Patry
cea9dbc971
You need to seek apparently. (#3049) 2025-02-24 14:58:23 +01:00
David Corvoysier
c00add9c03
Add Neuron backend (#3033)
* feat: add neuron backend

* feat(neuron): add server standalone installation

* feat(neuron): add server and integration tests

* fix(neuron): increase ulimit when building image

The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.

* test(neuron): merge integration tests and fixtures

* test: add --neuron option

* review: do not use latest tag

* review: remove ureq pinned version

* review: --privileged should be the exception

* feat: add neuron case to build ci

* fix(neuron): export models from container in test fixtures

The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are ran for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.

* refactor: remove sagemaker entry-point

The SageMaker image is built differently anyway.

* fix(neuron): avoid using Levenshtein

* test(neuron): use smaller llama model

* feat(neuron): avoid installing CUDA in image

* test(neuron): no error anymore when requesting too many tokens

* ci: doing a precompilation step (with a different token).

* test(neuron): avoid using image sha when exporting models

We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only hwen the neuron backend changes),
  which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
  locally will export the models before the CI uses them.

* test(neuron): added a small script to prune test models

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-24 09:10:05 +01:00
Nicolas Patry
142a49a80d
Simplify logs2. (#3045)
* Simplify logs2.

* Changing the scope from module to session to fix the event_loop issue.
2025-02-21 10:03:40 +01:00
Nicolas Patry
9c89d0070e
Having less logs in case of failure for checking CI more easily. (#3037)
* Having less logs in case of failure for checking CI more easily.

* Cleaning up the versions to uv for the client.

* Ignore entirely the API.
2025-02-19 17:01:33 +01:00
Nicolas Patry
bdb3e488e4
Trying to avoid the random timeout. (#2929)
* Trying to avoid the random timeout.

* More read timeout ?

* Longer timeout ?

* Remove legacy ENV directive.

* Remove the dummy test, only increase the read timeout.

* Wat?
2025-01-21 11:06:10 +01:00
drbh
da5ab46705
Improve vlm support (add idefics3 support) (#2437)
* feat: expand vlm support and add image token logic and tests

* fix: avoid unused perceiver config

* feat: integrate image tokens into inputs embeds

* feat: add simple idefics3 test

* feat: update docs, image token logic and weight names

* fix: improve image processing

* feat: improve prefix for idefics3

* fix: bump idefics3 tests and snapshots

* fix: improve text model loading

* feat: consolidate changes with existing vlms and add support and test for smolvlm

* fix: create new idefic3 file, simplify logic and adjust llama weight loading

* fix: lint with ruff

* fix: clean up idefics 3 and improve prefix handling

* fix: improve typing

* fix: improve prompt_split_image with ref to original impl

* fix: adjust ruff lints and small refactors

* fix: adjust FlashLlamaModel prefix logic
2025-01-09 10:35:32 -05:00
Nicolas Patry
98330df65e
Monkey patching as a desperate measure. (#2704)
* Monkey patching as a desperate measure.

* New snapshot ?
2024-10-28 11:25:13 +01:00
Nicolas Patry
513d19b955
More timeout on docker start ? (#2701)
* More timeout on docker start ?

* Latest upgrade.
2024-10-28 08:57:22 +01:00
OlivierDehaene
a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix namming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Nicolas Patry
3dbdf63ec5
Intel ci (#2630)
* Intel CI ?

* Let's try non sharded gemma.

* Snapshot rename

* Apparently container can be gone already.
2024-10-10 16:51:57 +02:00
Nicolas Patry
43f39f6894
AMD CI (#2589)
* Only run 1 valid test.

* TRying the tailscale action quickly.

* ?

* bash spaces.

* Remove tailscale.

* More quotes.

* mnt2 ?

* Othername to avoid recursive directories.

* Good old tmate.

* Remove tmate.

* Trying a few things.

* Remove some stuff.

* Sleep ?

* Tmp

* busybox

* Launcher tgi

* Starting hello

* Busybox in python

* No device.

* Removing all variables ?

* A un moment donné.

* Tmp

* Tmp2

* DEvice request, no container name

* No device requests

* Without pytest.

* No pytest.

* from env

* Start with devices

* Attemp #1

* Remove stdin messing

* Only 1 test, no container name

* Raw tgi

* Sending args.

* Show pip freeze.

* Start downloading with token

* Giving HIP devices.

* Mount volume + port forward

* Without pytest.

* No token

* Repeated arguments

* Wrong kwarg.

* On 2 GPUs

* Fallback to single shard CI test.

* Testing

* yaml

* Common cache ?

* Trailing slash ?

* Docker volume split.

* Fix docker volume

* Fixing ?

* ?

* Try no devices ?

* Flash llama on intel CPU ?

* Fix nvidia ?

* Temp deactivate intel, activate nvidia ?
2024-10-09 17:50:49 +02:00
Daniël de Kok
2358c2bb54
Add basic FP8 KV cache support (#2603)
* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml
2024-10-04 17:51:48 +02:00
Nicolas Patry
38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Nicolas Patry
a4e3e8c608
Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-11 18:10:40 +02:00
Daniël de Kok
2eb57a15ec
Fix incompatibility with latest syrupy and update in Poetry (#2497) 2024-09-06 11:00:52 +02:00
drbh
cfa73b5c99
Pr 2451 ci branch (#2454)
* fix[router]: Fix tools not passed in chat template

Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-08-26 20:19:38 -04:00
Nicolas Patry
c7ab1810d4
Further fixes. (#2426)
* Further fixes.

* Update the conftest to allow NaN (first logprob).

* Fix the condition.
2024-08-16 13:21:44 +02:00
drbh
bab02ff2bc
feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-07-26 10:29:09 -04:00
drbh
5a65066922
feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-07-15 09:16:15 -04:00
Daniël de Kok
fc9c3153e5
Add pytest release marker (#2114)
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
2024-06-25 16:53:20 +02:00
Lucain
3447c722fd
Support HF_TOKEN environment variable (#2066)
* Support HF_TOKEN environement variable

* Load test.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-06-25 09:23:12 +02:00
Nicolas Patry
480d3b3304
New runner. Manual squash. (#2110)
* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79
2024-06-24 18:08:34 +02:00
Daniël de Kok
36dd16017c Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.
2024-05-30 11:28:05 +02:00
Daniël de Kok
f20463e4e3 Fix (non-container) pytest stdout buffering-related lock-up
Two issues:

1. When one of the stdout/stderr pipe buffers of a process started
   with `subprocess.Popen` is full, the process can get blocked until
   the buffer is drained.
2. Calling `Popen.wait` can deadlock when called before draining
   the pipe buffers (if they are full).

This avoids the issue altogether by giving the child process a
temporary file to write to.
2024-05-28 16:26:11 +02:00
Nicolas Patry
bfddfa5955
Idefics2. (#1756)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-23 23:04:44 +02:00
drbh
06c3d4b1ec
feat: accept list as prompt and use first string (#1702)
This PR allows the `CompletionRequest.prompt` to be sent as a string or
array of strings. When an array is sent the first value will be used if
it's a string; otherwise the according error will be thrown

Fixes:
https://github.com/huggingface/text-generation-inference/issues/1690
Similar to: https://github.com/vllm-project/vllm/pull/323/files
2024-04-17 10:41:12 +02:00
Nicolas Patry
4634b00c2a
Adding Llava-Next (Llava 1.6) with full support. (#1709)
# What does this PR do?

- Changed all models to extract `embed_tokens` in order to enable llava
to separately call the embeddings and the core model layers.
- Added VlmCausalLM to inherit from FlashMistral in order to be
maximally supported. The only added logics sits on top and parses images
into pixel values, preallocates input_ids space for the image
embeddings, and passes them for the model.
- Added Clip for the vision tower.
- Didn't add flash for the vision tower since there's no padding anyway.
- Added heuristic (potentially incomplete) to calculate number of
features *before* calculating the clip patches (allows for easier logic
reuse of the LLM under the hood).


Still needs to be done:

- [x] Implement the image parsing in the controller side, to avoid
downloading n times per TP shard and also refusing requests too large
early and avoid issues where the truncation actually truncates the
image.
- [ ] Make sure it works with quantization properly.
- [x] Make sure it works with TP>1



<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-09 21:32:00 +02:00
Nicolas Patry
99874eae74
Add cuda graphs sizes and make it default. (#1703)
# What does this PR do?

```
text-generation-launcher --model-id XXX # Uses cuda graphs by default
text-generation-launcher --model-id XXX --cuda-graphs "1,2"  #Restrict the number of cuda graphs which saves VRAM
text-generation-launcher --model-id XXX --cuda-graphs "0"  # Disabling it entirely
```
<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-04-04 23:01:56 +02:00
drbh
de6cb15fa5
fix: improve tool type, bump pydantic and outlines (#1650)
This PR resolves a couple 

- [X] adjusts the tool response to align with openai's tools response
type
- [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue
when running tests)
- [X] bump `outlines` version and fix import for new name
2024-03-21 12:45:56 -04:00
drbh
9b6db5f793
Support tools (#1587)
This work in progress PR begins to add support for tools. Tools relies
on grammar support and still has some unsolved challenges. Opening the
PR for visibility and feedback
2024-02-28 11:10:27 +01:00
Nicolas Patry
bf700e7eef
Revamp medusa implementation so that every model can benefit. (#1588)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-02-26 19:49:28 +01:00
OlivierDehaene
c86f58d37c
feat: add support for Gemma (#1583) 2024-02-21 14:15:22 +01:00
drbh
cef0553d59
Outlines guided generation (#1539)
This WIP PR starts to add grammar support via outlines, currently this
PR supports very simple regex grammars and does not optimize for
precompiling or caching grammar fsm's.

todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outline support grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq
```
response
```json
{
  "generated_text": "david@example.com"
}
```

unguided request
```bash
curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq
```
response
```json
{
  "generated_text": "    email = 'david"
}
```
2024-02-15 10:28:10 +01:00
OlivierDehaene
0d794af6a5
feat: experimental support for cuda graphs (#1428)
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-02-12 10:09:29 +01:00
OlivierDehaene
72ee382ded chore: formatting 2023-12-11 14:49:52 +01:00
Nicolas Patry
624800c4de
Make GPTQ test less flaky (#1295)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2023-11-28 21:22:35 +01:00
Nicolas Patry
ba552e1a82
Let each model resolve their own default dtype. (#1287)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2023-11-28 17:54:26 +01:00
OlivierDehaene
35509ff5de
chore: update to torch 2.1.0 (#1182)
Close #1142
2023-11-23 13:38:50 +01:00
Nicolas Patry
d5b5bc750f
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)
Just trying to get the integration tests to pass.


# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
2023-07-21 10:59:00 +02:00
Nicolas Patry
abd58ff82c
feat(server): Rework model loading (#344)
# What does this PR do?

Reworked the loading logic. Idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of handling loading the weights from
multiple files into appropiate tensors (potentially sharded)
- TP layers now are "shells", they contain the code to know what kind of
sharding we need + eventual `all_reduce`. They do not inherit from
linear, but they contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear or GPTq
Linear next.
- All modeling code is explictly made for sharding, process group is
just no-ops for non sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2023-06-08 14:51:52 +02:00
OlivierDehaene
895c5f1562
feat(server): only compute prefill logprobs when asked (#406)
Close #288
2023-06-02 17:12:30 +02:00
OlivierDehaene
b8b950b37c
feat(server): support RefinedWeb models (#379) 2023-05-30 18:25:19 +02:00
OlivierDehaene
942005386a
feat(router): log input/ouput at debug level (#364)
@njhill FYI
2023-05-23 20:47:37 +02:00
OlivierDehaene
5a58226130
fix(server): fix decode token (#334)
Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 23:23:27 +02:00
OlivierDehaene
dbdc587ddd
feat(integration-tests): improve comparison and health checks (#336) 2023-05-16 20:22:11 +02:00