In a collaborative effort between the University of Hong Kong and SambaNova Systems, we introduce EvaByte, a 6.5B-parameter state-of-the-art byte-level language model featuring an improved architecture and powered by EVA -- an efficient attention mechanism designed for scalability and performance. Trained on 1.5T bytes of natural language text, math, and code using the SambaNova SN30 RDU system, EvaByte demonstrates that efficient byte-level processing at scale is not just possible but practically advantageous -- rivaling top open-source tokenizer-based LMs [1, 2, 3] despite using 5x less training data, excelling at coding tasks, and decoding up to 2x faster than tokenizer-based models. Its token-free design also brings added flexibility, avoiding tokenizer quirks while naturally extending to multimodal applications without any architectural tweaks.
To our knowledge, EvaByte is the first open-source, tokenizer-free byte-level model to match the performance of modern tokenizer-based LMs. Check out the model weights and code here:
Tokenization is a fundamental step in modern large language models, deciding how input is represented in Transformers. Although it efficiently compresses raw text into shorter sequences, tokenization comes with its own baggage -- it is an externally trained, detached component that can introduce complex biases and edge-case quirks, like the prompt boundary problem [4, 5, 6, 7, 8], undertrained tokens [9, 10, 11, 12, 13], and even pretraining data mixture leaks [14].
Byte-level modeling inherently eliminates the biases introduced by tokenization, but operating directly on bytes at scale is not easy [15, 16, 17, 18, 19, 20, 21]: byte sequences run several times longer than their tokenized counterparts, which inflates attention costs and slows both training and decoding.
We address these hurdles with a streamlined architecture featuring two improvements: multibyte prediction and the efficient attention mechanism, EVA.
Vanilla byte-level language models typically run much slower than tokenizer-based LMs. By combining multibyte prediction with EVA, however, we achieve a significant speed boost for byte models -- 5-10x faster decoding than vanilla byte-level architectures and up to 2x faster than tokenizer-based LMs -- making byte-level models a practical choice for real-world applications.
We draw inspiration from recent work [22, 23, 24, 25] and equip our model with multiple prediction heads, allowing it to predict several future bytes simultaneously. During training, we average the cross-entropy losses from different output heads as the primary training objective. These heads learn very effectively -- their predictions are often highly accurate and sometimes even outperform the immediate next byte prediction, as shown in the figure below.
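To make the multibyte training objective concrete, here is a minimal PyTorch sketch of several output heads sharing one trunk, with the per-head cross-entropy losses averaged. The module names and hyperparameters are illustrative, not EvaByte's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultibytePredictionHeads(nn.Module):
    """Sketch of multibyte prediction: n_heads output heads share the same trunk
    hidden states, and head i is trained to predict the byte i+1 positions ahead.
    Sizes below are placeholders, not EvaByte's actual configuration."""

    def __init__(self, hidden_size: int = 4096, vocab_size: int = 320, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(n_heads)
        )

    def forward(self, hidden_states: torch.Tensor, byte_ids: torch.Tensor) -> torch.Tensor:
        """hidden_states: (batch, seq_len, hidden_size) from the trunk.
        byte_ids: (batch, seq_len) input byte ids.
        Returns the average of the per-head cross-entropy losses."""
        losses = []
        for i, head in enumerate(self.heads):
            shift = i + 1  # head i predicts the byte `shift` positions ahead
            logits = head(hidden_states[:, :-shift])   # (B, T - shift, V)
            targets = byte_ids[:, shift:]              # (B, T - shift)
            losses.append(F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            ))
        return torch.stack(losses).mean()
```

Because each head is a plain linear projection over a byte-sized vocabulary, the extra parameters and compute are negligible relative to the trunk.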
Multibyte prediction adds almost no training overhead, thanks to the very small byte vocabulary. It greatly speeds up inference, however, via self-speculative decoding: the extra heads are combined through Medusa-like tree attention [24], letting the model predict multiple bytes in a single decoding step.
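The sketch below illustrates the draft-and-verify idea in its simplest linear form (EvaByte's actual decoder uses Medusa-like tree attention to verify several candidate continuations at once). The `model` interface returning per-head logits is hypothetical, and batch size 1 is assumed.

```python
import torch

@torch.no_grad()
def self_speculative_decode_step(model, input_ids: torch.Tensor) -> list:
    """One greedy self-speculative decoding step (simplified, batch size 1).

    Assumes a hypothetical `model(ids)` that returns per-head logits of shape
    (n_heads, batch, seq_len, vocab), where head 0 predicts the next byte and
    head i predicts the byte i+1 positions ahead."""
    logits = model(input_ids)                      # (H, B, T, V)
    # Draft: greedy byte from every head at the last position.
    draft = logits[:, :, -1, :].argmax(dim=-1).T   # (B, H)
    # Verify: one forward pass over the extended sequence; keep the longest
    # prefix of the draft that the next-byte head would also have chosen.
    extended = torch.cat([input_ids, draft], dim=1)
    verify = model(extended)[0].argmax(dim=-1)     # (B, T + H) greedy next-byte ids
    t = input_ids.size(1)
    accepted = []
    for i in range(draft.size(1)):
        # Draft byte i is kept only if head 0, conditioned on everything
        # before it, predicts exactly that byte.
        if verify[0, t + i - 1].item() != draft[0, i].item():
            break
        accepted.append(draft[0, i].item())
    if not accepted:
        # Always make progress: fall back to the next-byte head's prediction.
        accepted.append(verify[0, t - 1].item())
    return accepted
```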
However, multibyte prediction alone is not enough to speed up the byte-level model: the self-attention mechanism quickly becomes the major bottleneck as the context length grows. To address this, we build our model on EVA [26], an improved version of linearized attention [27, 28, 29]. Linearized attention approximates exact self-attention by designing feature maps $\phi(\cdot)$ such that $\exp(\boldsymbol{q}^\top \boldsymbol{k}) \approx \phi(\boldsymbol{q})^\top \phi(\boldsymbol{k})$.
By linearizing the exponential function, one can rearrange the order of computation and achieve linear complexity in sequence length. This approach admits the form of a linear RNN, maintaining a global hidden state. With gating mechanisms and decay coefficients [28, 30, 31, 32], it also connects to recent state-space models like Mamba and Mamba-2 [33, 34]. Conventional linearized attention compresses past tokens into a single global hidden state, unlike standard attention, which explicitly caches every token.
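As a reference point, here is a deliberately slow, loop-based sketch of causal linearized attention with a single global state. It uses the ELU+1 feature map common in the linear-attention literature, which is not necessarily the map EVA uses.

```python
import torch

def elu_feature_map(x: torch.Tensor) -> torch.Tensor:
    # A common choice of feature map phi(.) in the linear-attention literature;
    # EVA's exact feature map may differ.
    return torch.nn.functional.elu(x) + 1.0

def linearized_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal linearized attention as a linear RNN over one global hidden state.
    q, k: (T, d); v: (T, d_v). Returns (T, d_v).
    Cost is O(T * d * d_v) instead of O(T^2 * d) for exact softmax attention."""
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    d, d_v = q.size(-1), v.size(-1)
    state = torch.zeros(d, d_v)      # running sum of phi(k_t) v_t^T
    normalizer = torch.zeros(d)      # running sum of phi(k_t)
    outputs = []
    for t in range(q.size(0)):
        state = state + torch.outer(phi_k[t], v[t])
        normalizer = normalizer + phi_k[t]
        num = phi_q[t] @ state                    # (d_v,)
        den = phi_q[t] @ normalizer + 1e-6        # scalar
        outputs.append(num / den)
    return torch.stack(outputs)
```

The running sums `state` and `normalizer` play the role of the linear RNN's hidden state: every step folds in a new key-value pair and reads out the output with the current query.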
EVA takes a middle ground by distributing the global state into multiple local memory slots. By splitting key-value pairs into consecutive chunks and applying linearization separately on each chunk, EVA maintains a local hidden state for each chunk and aggregates them together to produce the final output. This expands the design space of linearized attention mechanisms, simplifies implementation, and directly benefits from hardware-optimized kernels for standard attention mechanisms.
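The chunk-level idea can be sketched as follows. This illustrates only the memory layout (a single query, causal masking omitted), not the actual EVA kernel.

```python
import torch

def chunked_linearized_attention(q_t: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                 chunk_size: int = 128) -> torch.Tensor:
    """Illustrative sketch of EVA's chunked memory: keys/values are split into
    consecutive chunks, each chunk is linearized into its own local state, and
    the query aggregates the per-chunk contributions instead of reading a
    single global state. q_t: (d,) one query; k: (T, d); v: (T, d_v)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    phi_q = phi(q_t)
    num = torch.zeros(v.size(-1))
    den = torch.tensor(1e-6)
    for start in range(0, k.size(0), chunk_size):
        k_c, v_c = k[start:start + chunk_size], v[start:start + chunk_size]
        local_state = phi(k_c).T @ v_c       # (d, d_v): this chunk's local memory slot
        local_norm = phi(k_c).sum(dim=0)     # (d,)
        num = num + phi_q @ local_state
        den = den + phi_q @ local_norm
    return num / den
```

In this naive form the per-chunk states simply add back up to the single global state above; EVA's gains come from keeping them separate, which allows chunks to be treated non-uniformly and, as noted above, lets the implementation reuse hardware-optimized standard-attention kernels.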
We pretrain EvaByte on a corpus of 1.5T bytes spanning text, math, and code, mainly sourced from Dolma v1.7, The Stack v2, FineWeb-Edu, and DCLM-Baseline. We continually refined the data mix, tweaking the proportions or swapping in new sources mid-flight. After training on 1.2T bytes, we conduct two independent annealing runs (100B and 200B bytes, respectively), during which the learning rate is linearly decayed from 1e-4 to 0; the resulting checkpoints are then merged via model soup [35].
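Merging via model soup amounts to a (weighted) parameter average of the annealed checkpoints. A minimal sketch is below; the file names and uniform weighting are chosen purely for illustration, since the post does not spell out those details.

```python
import torch

def model_soup(checkpoint_paths, weights=None):
    """Average parameters across checkpoints, in the spirit of model soups [35].
    Assumes each checkpoint is a plain state dict of tensors."""
    weights = weights or [1.0 / len(checkpoint_paths)] * len(checkpoint_paths)
    merged = None
    for path, w in zip(checkpoint_paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {name: w * p.float() for name, p in state.items()}
        else:
            for name, p in state.items():
                merged[name] += w * p.float()
    return merged

# Hypothetical usage with the two annealing runs:
# soup = model_soup(["anneal_100B.pt", "anneal_200B.pt"])
```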
EvaByte is trained with a batch size of 8M bytes and a 32K context length on 256 SambaNova SN30-2 RDUs. We observed non-trivial instability during pretraining; for example, generations would occasionally degrade into character-level typos, as in the completion below, which misspells `numbers` as `numbirs`:
```python
from typing import List, Tuple


def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """ For a given list of integers, return a tuple consisting of a sum and a product of all the integers in a list.
    Empty sum should be equal to 0 and empty product should be equal to 1.
    >>> sum_product([])
    (0, 1)
    >>> sum_product([1, 2, 3, 4])
    (10, 24)
    """
    sum = 0
    product = 1
    for number in numbirs:
        sum += numbir
        product *= numbir
    return (sum, product)
```
Other approaches, like freezing embedding parameters or applying weighted averages over prediction heads, offered little improvement.
Let's dive into how EvaByte performs in practice. We compare EvaByte's intermediate checkpoints against recent language models (OLMo-1.7-7B and OLMo-2-7B) trained on roughly the same amount of data, and observe that the EvaByte checkpoint at 1.22T bytes (roughly 0.4T tokens) consistently outperforms them by a large margin.
We also tracked EvaByte’s task performance throughout pretraining and observed a consistent upward trend with no signs of plateauing. Interestingly, EvaByte excels at coding tasks (e.g., HumanEval and MBPP), even though we intentionally reduced the proportion of code data in the later stages of training. One possible reason is that removing tokenization might eliminate domain-specific biases, enabling more efficient parallel learning across domains. A deeper investigation into this behavior is planned for future work.
We take EvaByte a step further with supervised fine-tuning. Following DCLM [2], OLMo-2 [36], TULU 3 [37], and OpenCoder [38], we curate a data mix from Tulu 3, OpenHermes 2.5, and OpenCoder, fine-tune EvaByte for 2 epochs, and achieve results on par with recent open-source LMs.
As mentioned at the beginning, we demonstrate below that byte-level modeling naturally avoids tokenization quirks and edge-case behaviors, such as the prompt boundary problem [4, 5], where tokenizer-based LMs behave inconsistently depending on exactly where the prompt is cut off. EvaByte resolves these cases seamlessly and delivers more predictable results.
Prompt: the HumanEval `longest` function signature and docstring, truncated at eight slightly different boundary points.

EvaByte: outputs from different prompt boundaries converge. Every truncation leads to the same correct continuation, `if not strings:\n        return None\n    longest = strings[0] ...`.

Qwen2.5-7B: different prompt boundaries lead to diverging and unexpected outputs, including a fabricated extra doctest (`>>> longest(['a', 'bb', 'ccc', 'dddd'])`), duplicated docstring quotes (`"""""`), a completion that stops immediately after the docstring, commented-out logic, a switch to a different variable name (`longest_string`), and the malformed condition `if len(strings) == 0 None:`.
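To see where such boundary sensitivity comes from, here is a small, self-contained illustration (not part of the EvaByte release; it uses the GPT-2 tokenizer from Hugging Face `transformers` purely as an example). With a subword tokenizer, the encoding of a truncated prompt is often not a prefix of the encoding of the full prompt, so the model is conditioned on token sequences it rarely saw during training; byte encodings always line up.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer shows the effect

full = "    longest = strings[0]\n"
full_tokens = tok.encode(full)
full_bytes = full.encode("utf-8")

for cut in range(len(full) - 4, len(full)):
    prefix = full[:cut]
    prefix_tokens = tok.encode(prefix)
    # Is the truncated prompt's token sequence a prefix of the full prompt's?
    token_prefix_ok = full_tokens[: len(prefix_tokens)] == prefix_tokens
    # Byte sequences trivially compose: a truncated prompt is a shorter byte string.
    byte_prefix_ok = full_bytes.startswith(prefix.encode("utf-8"))
    print(f"cut={cut:2d}  token-prefix={token_prefix_ok}  byte-prefix={byte_prefix_ok}")
```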
EvaByte also extends flexibly to multimodal tasks, treating image data as just another byte stream under a standard encoding such as JPEG or PNG. We follow prior work [39, 40] and use the JPEG format, which keeps image byte sequences manageable (a 384 x 384 image takes around 10K JPEG bytes). Thanks to byte-level modeling, EvaByte can seamlessly interleave image bytes with text bytes for vision-language training without any architectural tweaks.
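A minimal sketch of what such interleaving could look like, using Pillow to re-encode the image as JPEG; the delimiter bytes and helper below are invented for illustration, as the post does not describe EvaByte's actual multimodal formatting.

```python
import io
from PIL import Image

def interleave_image_and_text(image_path: str, caption: str) -> bytes:
    """Turn an image into plain bytes and splice it into the surrounding text bytes."""
    img = Image.open(image_path).convert("RGB").resize((384, 384))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)   # roughly 10K bytes at 384 x 384
    jpeg_bytes = buf.getvalue()
    # Hypothetical delimiters; the actual markers used by EvaByte are not specified.
    return b"<image>" + jpeg_bytes + b"</image>" + caption.encode("utf-8")
```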
Our multimodal dataset spans around 2T bytes, primarily filtered from OBELICS [41], LLaVA-ReCap [42], and DataComp [43, 44]. After 20K fine-tuning steps on this mix (~3M images), starting from a pretrained text-only checkpoint, EvaByte can zero-shot generate captions and answer basic visual questions (it can also generate images, though a bit fuzzily). EvaByte's multimodal capabilities are actively evolving, with current work addressing specific alignment challenges (such as spatial relationships between computer components); scaling and refining these abilities is one of our most promising research directions.
| Image | Text |
| --- | --- |
| *(image: laptop keyboard)* | Detailed caption: The image shows a close-up view of a laptop keyboard. The keyboard has a black color scheme with white lettering on the keys. The keys are arranged in a standard QWERTY layout, which is commonly used for English language keyboards. The keys are labeled with the standard Western letters and symbols, including the alphabet, numbers, and punctuation marks. The laptop appears to be a modern model, possibly a MacBook, given the design and the visible trackpad to the right of the keyboard. The trackpad is a flat, rectangular surface with a grid pattern, typical of MacBook trackpads. The laptop is open, and the keyboard is the main focus of the image. The background is blurred, but it seems to be an indoor setting with a wooden surface, possibly a table or desk. The lighting in the image is soft, suggesting an indoor environment with ambient lighting. |
| *(image: black dog swimming)* | Detailed caption: The image shows a black dog swimming in a body of water. The dog appears to be a medium to large breed, with a glossy black coat. It is captured mid-swim, with its head above the water and its body partially submerged. The water around the dog is rippling, indicating movement, and the surface of the water reflects the light, suggesting it is a sunny day. There are no visible texts or distinguishing marks that provide additional context about the location or the dog's identity. The style of the image is a candid, real-life photograph, capturing a moment of the dog's activity. |
| *(image: bed with throw pillows)* | Q: How many throw pillows are on the bed? A: three |
| *(image: the Eiffel Tower)* | Q: Which iconic landmark is on the picture? A: The Eiffel Tower |
| *(image: flowers)* | Q: What 2 colors are the flowers? A: red and yellow |
A recent concurrent work, Byte Latent Transformers (BLTs) [45], also explores tokenization-free language models and offers an in-depth analysis of BLTs' behavior at scale. BLTs introduce an elegant framework that first encodes byte sequences into patches and then processes them globally.
The main difference between BLTs and EvaByte lies in the architecture: BLTs use patchification and propose entropy patching to dynamically group bytes. While this approach adjusts compute allocation based on data complexity and reduces context length, it still relies on external models to determine patch boundaries. The majority of compute ends up focused on patch-level modeling, detached from the byte stream, similar to tokenizer-based models.
In contrast, EvaByte keeps things simple: it operates directly on bytes with a flat, Transformer-like model, without invoking external modules or grouping inputs. Empirically, EvaByte achieves better performance than BLTs even with 3-4x fewer training bytes, as shown in the table below. In addition, EvaByte is more flexible and scales easily to multimodal data, whereas BLTs require retraining or swapping out the auxiliary language model used for entropy patching.
We introduce EvaByte, a new family of efficient, scalable, and flexible byte-level language models. The ability to rival tokenization-based LMs with 5x less data while being faster highlights the significant potential of lower-level language modeling within the EvaByte architecture. Future research directions include further refining the model's architecture to improve both its capacity and efficiency, analyzing in depth how lower-level language models scale with increasing sizes and data volume, as well as extending the context length to seamlessly process diverse data types -- images, videos, and audio -- simultaneously.