mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-04-19 22:02:06 +00:00

fix typos in docs and add small clarifications (#1790)

# What does this PR do?

Fix some small typos in the docs; add minor clarifications; add guidance
to features on landing page

## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

@OlivierDehaene

2024-04-22 12:15:48 -04:00


# Text Generation Inference

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.

Text Generation Inference implements many optimizations and features, such as:

- Simple launcher to serve the most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPTQ
- Safetensors weight loading
- Watermarking with the method from "A Watermark for Large Language Models"
- Logits warper (temperature scaling, top-p, top-k, repetition penalty)
- Stop sequences
- Log probabilities
- Fine-tuning support: use fine-tuned models for specific tasks to achieve higher accuracy and performance.
- Guidance: enable function calling and tool use by forcing the model to generate structured outputs based on your own predefined output schemas.
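
To make the logits-warper item above more concrete, here is a minimal pure-Python sketch of what temperature scaling, top-k, top-p, and a repetition penalty do to a vector of next-token scores. It is an illustration only and not TGI's implementation: the function name `warp_logits`, its parameters, and the order in which the steps are applied are all assumptions chosen for readability.

```python
import math


def warp_logits(logits, previous_ids=(), temperature=1.0,
                top_k=0, top_p=1.0, repetition_penalty=1.0):
    """Hypothetical helper: return {token_id: probability} after warping.

    `logits` is a list of raw scores indexed by token id. `temperature`
    must be > 0; `top_k=0` and `top_p=1.0` disable those filters.
    """
    scores = list(logits)

    # Repetition penalty: push already generated tokens toward "less likely".
    for tok in set(previous_ids):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix of the sorted tokens whose
    # cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}


if __name__ == "__main__":
    example = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
    print(warp_logits(example, top_k=2))
    print(warp_logits(example, temperature=0.7, top_p=0.9))
```

A sampler would then draw the next token from the returned distribution. Real servers apply these steps to tensors on the GPU and may order or combine them differently.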

Text Generation Inference is used in production by multiple projects, such as:

- Hugging Chat, an open-source interface for open-access models, such as Open Assistant and Llama
- Open Assistant, an open-source community effort to train LLMs in the open
- nat.dev, a playground to explore and compare LLMs
