0 points · dietr1ch · 2y ago

I'd guess that the tokenizer is just different and handles this in a "better" way.


goodside · 2y ago

No. In both tokenizers, Unicode tag-block code points like these are converted into bytes (two tokens per character), which is the fallback for code points too uncommon to warrant a dedicated token.
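To make the byte fallback concrete, here is a minimal sketch (not from the thread) of what tag-block characters look like at the byte level, which is the form a byte-level tokenizer falls back to. The helper names `to_tag_chars` and `utf8_bytes` are made up for illustration. It uses only the Python standard library and does not run either tokenizer, so it shows the raw UTF-8 bytes only; how those bytes are grouped into tokens depends on the vocabulary's merges.

```python
# Start of the Unicode Tags block (U+E0000..U+E007F).
TAG_BASE = 0xE0000

def to_tag_chars(text: str) -> str:
    """Map printable ASCII characters to their tag-block counterparts (hypothetical helper)."""
    return "".join(chr(TAG_BASE + ord(c)) for c in text if 0x20 <= ord(c) <= 0x7E)

def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 bytes of a single character as integers (hypothetical helper)."""
    return list(ch.encode("utf-8"))

hidden = to_tag_chars("Hi")
for ch in hidden:
    # Each tag-block code point lies above U+FFFF, so UTF-8 needs four bytes for it.
    print(f"U+{ord(ch):05X} -> {[hex(b) for b in utf8_bytes(ch)]}")
```

Every character in this block takes four UTF-8 bytes, so a tokenizer with no dedicated token for it has to cover those bytes with whatever byte-level tokens its vocabulary provides.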
