feat: add vlm docs and simple examples
This commit is contained in:
parent eade737714
commit c5131eee5c
224
docs/source/conceptual/visual-language-models.md
Normal file
@@ -0,0 +1,224 @@
# Vision Language Models (VLM)

## What is VLM?

Vision Language Models (VLMs) are models that consume both visual and textual inputs to generate text.

These models are trained on multimodal data, which includes both images and text.

VLMs can be used for a variety of tasks, such as image captioning, visual question answering, and more.

<div class="flex justify-center">
  <pre>placeholder for architecture diagram</pre>
</div>

With a VLM, you can generate text from an image: for example, a caption, an answer to a question about the image, or a description of its contents. Common tasks include:

- **Image Captioning**: Given an image, generate a caption that describes it.
- **Visual Question Answering (VQA)**: Given an image and a question about it, generate an answer.
- **Visual Dialog**: Given an image and a dialog history, generate the next response in the dialog.
- **Visual Data Extraction**: Given an image, extract structured information from it.

For example, given the image of a cat, a VLM can generate the caption "A cat sitting on a couch" or answer the question "What is the cat doing?" with "The cat is sitting on a couch."

## How to use VLM?

### VLM with Python

To use VLMs with Python, you can use the `huggingface_hub` library. The `InferenceClient` class provides a simple way to interact with the Inference API.

We will use the following image:

<div class="flex justify-center">
  <img
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
    width="400"
  />
</div>

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
# the image is passed by embedding its URL in the prompt with Markdown image syntax
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

# This
# is
# a
# picture
# of
# an
# anth
# rop
# omorphic
# rab
# bit
# in
# a
# space
# suit
# .
```

Images can be passed as URLs or base64-encoded strings. The `InferenceClient` will automatically detect the image format.

We will use the following image:

<div class="flex justify-center">
  <img
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png"
    width="400"
  />
</div>

```python
from huggingface_hub import InferenceClient
import base64
import requests
import io

client = InferenceClient("http://127.0.0.1:3000")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png"
original_image = requests.get(url)

# encode the image to base64 and wrap it in a data URI
image_bytes = io.BytesIO(original_image.content)
image = base64.b64encode(image_bytes.getvalue()).decode("utf-8")
image = f"data:image/png;base64,{image}"

prompt = f"![]({image})What is this a picture of?\n\n"

for token in client.text_generation(prompt, max_new_tokens=10, stream=True):
    print(token)

# This
# is
# a
# picture
# of
# a
# be
# aver
# .
```

If you want additional details, you can add `details=True`. In this case, you get a `TextGenerationStreamResponse` which contains additional information such as the probabilities and the tokens. For the final response in the stream, it also returns the full generated text.
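
For example, a minimal sketch building on the rabbit example above might look like this (attribute names follow recent versions of `huggingface_hub` and may vary slightly between releases):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"

# with details=True, each streamed item also carries token-level metadata
for response in client.text_generation(prompt, max_new_tokens=16, stream=True, details=True):
    print(response.token.text, response.token.logprob)
    # the final response in the stream also contains the full generated text
    if response.generated_text is not None:
        print(response.generated_text)
```
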
### VLM with cURL

To use the `generate_stream` endpoint with curl, add the `-N` flag, which disables curl's default buffering and shows data as it arrives from the server.

```bash
curl -N 127.0.0.1:3000/generate_stream \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' \
    -H 'Content-Type: application/json'

# ...
# data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}
```

### VLM with JavaScript

First, we need to install the `@huggingface/inference` library.

`npm install @huggingface/inference`

If you're using the free Inference API, you can use `HfInference`. If you're using Inference Endpoints, you can use `HfInferenceEndpoint`.

We can create an `HfInferenceEndpoint` by providing our endpoint URL and credential.

```js
import { HfInferenceEndpoint } from "@huggingface/inference";

const hf = new HfInferenceEndpoint("http://127.0.0.1:3000", "hf_YOUR_TOKEN");

const prompt =
  "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n";

const stream = hf.textGenerationStream({
  inputs: prompt,
  parameters: { max_new_tokens: 16, seed: 42 },
});
for await (const r of stream) {
  // write each generated token as it arrives
  process.stdout.write(r.token.text);
}

// This is a picture of an anthropomorphic rabbit in a space suit.
```

## Advantages of VLM in TGI

VLMs in TGI have several advantages: these models can be used in tandem with other features for more complex tasks. For example, you can use VLMs with [Guided Generation](/docs/conceptual/guided-generation) to generate specific JSON data from an image.

<div class="flex justify-center">
  <img
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
    width="400"
  />
</div>

For example, we can extract information from the rabbit image and generate a JSON object with the location, the activity, the number of animals seen, and the animals seen. That would look like this:

```json
{
  "activity": "Standing",
  "animals": ["Rabbit"],
  "animals_seen": 1,
  "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest"
}
```

All we need to do is provide a JSON schema to the VLM and it will generate the JSON object for us.

```bash
curl localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'

# {
#   "generated_text": "{ \"activity\": \"Standing\", \"animals\": [ \"Rabbit\" ], \"animals_seen\": 1, \"location\": \"Rocky surface with mountains in the background and a red light on the rabbit's chest\" }"
# }
```

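The same request can also be sent from Python; a minimal sketch using `requests` against the local TGI server from the examples above:

```python
import requests

# same payload as the curl example above, with the image embedded in the prompt
payload = {
    "inputs": "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {"type": "string"},
                    "activity": {"type": "string"},
                    "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
                    "animals": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["location", "activity", "animals_seen", "animals"],
            },
        },
    },
}

# POST to the /generate endpoint and read the constrained JSON output
response = requests.post("http://localhost:3000/generate", json=payload)
print(response.json()["generated_text"])
```
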
## How does VLM work under the hood?

coming soon...

<pre>placeholder for architecture diagram (image to tokens)</pre>