fix: Include special tokens when tokenizing in front-end

There's currently a discrepancy in tokenization between the router and the python server code: the latter includes special tokens, but the former does not.

This results in a token count mismatch for seq2seq models such as mt0, where the tokenizer emits an EOS token at the end of the input.

This in turn results in unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder input_ids, so I guess it's best to just adjust the router logic.
Nick Hill 2022-12-27 13:27:45 -08:00
parent 611e21cb13
commit 03a62635b2


@@ -131,7 +131,7 @@ fn validation_worker(
         }
         // Get the number of tokens in the input
-        match tokenizer.encode(request.inputs.clone(), false) {
+        match tokenizer.encode(request.inputs.clone(), true) {
             Ok(inputs) => {
                 let input_length = inputs.len();