In a collaborative effort between the University of Hong Kong and SambaNova Systems, we introduce EvaByte, a 6.5B-parameter state-of-the-art byte-level language model featuring an improved architecture and powered by EVA -- an efficient attention mechanism designed for scalability and performance. Trained on 1.5T bytes of natural language text, math, and code using the SambaNova SN30 RDU system, EvaByte demonstrates that efficient byte-level processing at scale is not just possible but practically advantageous -- rivaling top open-source tokenizer-based LMs [1, 2, 3] despite using 5x less training data, excelling at coding tasks, and decoding up to 2x faster than tokenizer-based models. Its token-free design also brings added flexibility, avoiding tokenizer quirks while naturally extending to multimodal applications without any architectural tweaks.
To our knowledge, EvaByte is the first open-source, tokenizer-free byte-level model to match the performance of modern tokenizer-based LMs. Check out the model weights and code here:
Tokenization is a fundamental step in modern large language models, deciding how input is represented in Transformers. Although it efficiently compresses raw text into shorter sequences, tokenization comes with its own baggage -- it is an externally trained, detached component that can introduce complex biases and edge-case quirks, like the prompt boundary problem [4, 5, 6, 7, 8], undertrained tokens [9, 10, 11, 12, 13], and even pretraining data mixture leaks [14].
Byte-level modeling inherently eliminates the biases introduced by tokenization, but operating directly on bytes at scale is not easy [15, 16, 17, 18, 19, 20, 21]: byte sequences run several times longer than their tokenized counterparts, which inflates attention costs and slows both training and decoding.
We address these hurdles with a streamlined architecture featuring two improvements: multibyte prediction and the efficient attention mechanism, EVA.
Vanilla byte-level language models typically run much slower than tokenizer-based LMs. By combining multibyte prediction with EVA, however, we achieve a significant speed boost for byte models -- 5-10x faster decoding than vanilla byte-level architectures and up to 2x faster than tokenizer-based LMs -- making byte-level models a practical choice for real-world applications.
We draw inspiration from recent work [22, 23, 24, 25] and equip our model with multiple prediction heads, allowing it to predict several future bytes simultaneously. During training, we average the cross-entropy losses from different output heads as the primary training objective. These heads learn very effectively -- their predictions are often highly accurate and sometimes even outperform the immediate next byte prediction, as shown in the figure below.
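To make the multibyte training objective concrete, here is a minimal PyTorch sketch of several output heads sharing one trunk, with the per-head cross-entropy losses averaged. The module names and hyperparameters are illustrative, not EvaByte's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultibytePredictionHeads(nn.Module):
    """Sketch of multibyte prediction: n_heads output heads share the same trunk
    hidden states, and head i is trained to predict the byte i+1 positions ahead.
    Sizes below are placeholders, not EvaByte's actual configuration."""

    def __init__(self, hidden_size: int = 4096, vocab_size: int = 320, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(n_heads)
        )

    def forward(self, hidden_states: torch.Tensor, byte_ids: torch.Tensor) -> torch.Tensor:
        """hidden_states: (batch, seq_len, hidden_size) from the trunk.
        byte_ids: (batch, seq_len) input byte ids.
        Returns the average of the per-head cross-entropy losses."""
        losses = []
        for i, head in enumerate(self.heads):
            shift = i + 1  # head i predicts the byte `shift` positions ahead
            logits = head(hidden_states[:, :-shift])   # (B, T - shift, V)
            targets = byte_ids[:, shift:]              # (B, T - shift)
            losses.append(F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            ))
        return torch.stack(losses).mean()
```

Because each head is a plain linear projection over a byte-sized vocabulary, the extra parameters and compute are negligible relative to the trunk.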
Multibyte prediction adds almost no training overhead, thanks to the very small byte vocabulary. It greatly speeds up inference, however, via self-speculative decoding: the extra heads are combined through Medusa-like tree attention [24], letting the model predict multiple bytes in a single decoding step.
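The sketch below illustrates the draft-and-verify idea in its simplest linear form (EvaByte's actual decoder uses Medusa-like tree attention to verify several candidate continuations at once). The `model` interface returning per-head logits is hypothetical, and batch size 1 is assumed.

```python
import torch

@torch.no_grad()
def self_speculative_decode_step(model, input_ids: torch.Tensor) -> list:
    """One greedy self-speculative decoding step (simplified, batch size 1).

    Assumes a hypothetical `model(ids)` that returns per-head logits of shape
    (n_heads, batch, seq_len, vocab), where head 0 predicts the next byte and
    head i predicts the byte i+1 positions ahead."""
    logits = model(input_ids)                      # (H, B, T, V)
    # Draft: greedy byte from every head at the last position.
    draft = logits[:, :, -1, :].argmax(dim=-1).T   # (B, H)
    # Verify: one forward pass over the extended sequence; keep the longest
    # prefix of the draft that the next-byte head would also have chosen.
    extended = torch.cat([input_ids, draft], dim=1)
    verify = model(extended)[0].argmax(dim=-1)     # (B, T + H) greedy next-byte ids
    t = input_ids.size(1)
    accepted = []
    for i in range(draft.size(1)):
        # Draft byte i is kept only if head 0, conditioned on everything
        # before it, predicts exactly that byte.
        if verify[0, t + i - 1].item() != draft[0, i].item():
            break
        accepted.append(draft[0, i].item())
    if not accepted:
        # Always make progress: fall back to the next-byte head's prediction.
        accepted.append(verify[0, t - 1].item())
    return accepted
```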
However, multibyte prediction alone is not enough to speed up the byte-level model: the self-attention mechanism quickly becomes the major bottleneck as the context length grows. To address this, we build our model on EVA [26], an improved version of linearized attention [27, 28, 29]. Linearized attention approximates exact self-attention by designing feature maps $\phi(\cdot)$ such that $\exp(\boldsymbol{q}^\top \boldsymbol{k}) \approx \phi(\boldsymbol{q})^\top \phi(\boldsymbol{k})$.
By linearizing the exponential function, one can rearrange the order of computation and achieve linear complexity in sequence length. This approach admits the form of a linear RNN, maintaining a global hidden state. With gating mechanisms and decay coefficients [28, 30, 31, 32], it also connects to recent state-space models like Mamba and Mamba-2 [33, 34]. Conventional linearized attention compresses past tokens into a single global hidden state, unlike standard attention, which explicitly caches every token.
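As a reference point, here is a deliberately slow, loop-based sketch of causal linearized attention with a single global state. It uses the ELU+1 feature map common in the linear-attention literature, which is not necessarily the map EVA uses.

```python
import torch

def elu_feature_map(x: torch.Tensor) -> torch.Tensor:
    # A common choice of feature map phi(.) in the linear-attention literature;
    # EVA's exact feature map may differ.
    return torch.nn.functional.elu(x) + 1.0

def linearized_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal linearized attention as a linear RNN over one global hidden state.
    q, k: (T, d); v: (T, d_v). Returns (T, d_v).
    Cost is O(T * d * d_v) instead of O(T^2 * d) for exact softmax attention."""
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    d, d_v = q.size(-1), v.size(-1)
    state = torch.zeros(d, d_v)      # running sum of phi(k_t) v_t^T
    normalizer = torch.zeros(d)      # running sum of phi(k_t)
    outputs = []
    for t in range(q.size(0)):
        state = state + torch.outer(phi_k[t], v[t])
        normalizer = normalizer + phi_k[t]
        num = phi_q[t] @ state                    # (d_v,)
        den = phi_q[t] @ normalizer + 1e-6        # scalar
        outputs.append(num / den)
    return torch.stack(outputs)
```

The running sums `state` and `normalizer` play the role of the linear RNN's hidden state: every step folds in a new key-value pair and reads out the output with the current query.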
EVA takes a middle ground by distributing the global state into multiple local memory slots. By splitting key-value pairs into consecutive chunks and applying linearization separately on each chunk, EVA maintains a local hidden state for each chunk and aggregates them together to produce the final output. This expands the design space of linearized attention mechanisms, simplifies implementation, and directly benefits from hardware-optimized kernels for standard attention mechanisms.
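The chunk-level idea can be sketched as follows. This illustrates only the memory layout (a single query, causal masking omitted), not the actual EVA kernel.

```python
import torch

def chunked_linearized_attention(q_t: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                 chunk_size: int = 128) -> torch.Tensor:
    """Illustrative sketch of EVA's chunked memory: keys/values are split into
    consecutive chunks, each chunk is linearized into its own local state, and
    the query aggregates the per-chunk contributions instead of reading a
    single global state. q_t: (d,) one query; k: (T, d); v: (T, d_v)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    phi_q = phi(q_t)
    num = torch.zeros(v.size(-1))
    den = torch.tensor(1e-6)
    for start in range(0, k.size(0), chunk_size):
        k_c, v_c = k[start:start + chunk_size], v[start:start + chunk_size]
        local_state = phi(k_c).T @ v_c       # (d, d_v): this chunk's local memory slot
        local_norm = phi(k_c).sum(dim=0)     # (d,)
        num = num + phi_q @ local_state
        den = den + phi_q @ local_norm
    return num / den
```

In this naive form the per-chunk states simply add back up to the single global state above; EVA's gains come from keeping them separate, which allows chunks to be treated non-uniformly and, as noted above, lets the implementation reuse hardware-optimized standard-attention kernels.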
We pretrain EvaByte on a corpus of 1.5T bytes spanning text, math, and code, mainly sourced from Dolma v1.7, The Stack v2, FineWeb-Edu, and DCLM-Baseline. We continually refined the data mix, tweaking the proportions or swapping in new sources mid-flight. After training on 1.2T bytes, we conduct two independent annealing runs (100B and 200B bytes, respectively), during which the learning rate is linearly decayed from 1e-4 to 0; the resulting checkpoints are then merged via model soup [35].
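Merging via model soup amounts to a (weighted) parameter average of the annealed checkpoints. A minimal sketch is below; the file names and uniform weighting are chosen purely for illustration, since the post does not spell out those details.

```python
import torch

def model_soup(checkpoint_paths, weights=None):
    """Average parameters across checkpoints, in the spirit of model soups [35].
    Assumes each checkpoint is a plain state dict of tensors."""
    weights = weights or [1.0 / len(checkpoint_paths)] * len(checkpoint_paths)
    merged = None
    for path, w in zip(checkpoint_paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {name: w * p.float() for name, p in state.items()}
        else:
            for name, p in state.items():
                merged[name] += w * p.float()
    return merged

# Hypothetical usage with the two annealing runs:
# soup = model_soup(["anneal_100B.pt", "anneal_200B.pt"])
```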
EvaByte is trained with a batch size of 8M bytes and a 32K context length on 256 SambaNova SN30-2 RDUs. We observed non-trivial instability during pretraining; for example, generations would occasionally degrade into character-level typos, as in the completion below, which misspells `numbers` as `numbirs`:
```python
from typing import List, Tuple


def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """ For a given list of integers, return a tuple consisting of a sum and a product of all the integers in a list.
    Empty sum should be equal to 0 and empty product should be equal to 1.
    >>> sum_product([])
    (0, 1)
    >>> sum_product([1, 2, 3, 4])
    (10, 24)
    """
    sum = 0
    product = 1
    for number in numbirs:
        sum += numbir
        product *= numbir
    return (sum, product)
```
Other approaches, like freezing embedding parameters or applying weighted averages over prediction heads, offered little improvement.
Let's dive into how EvaByte performs in practice. We compare EvaByte's intermediate checkpoints against recent language models (OLMo-1.7-7B and OLMo-2-7B) trained on roughly the same amount of data, and observe that the EvaByte checkpoint at 1.22T bytes (roughly 0.4T tokens) consistently outperforms them by a large margin.
We also tracked EvaByte’s task performance throughout pretraining and observed a consistent upward trend with no signs of plateauing. Interestingly, EvaByte excels at coding tasks (e.g., HumanEval and MBPP), even though we intentionally reduced the proportion of code data in the later stages of training. One possible reason is that removing tokenization might eliminate domain-specific biases, enabling more efficient parallel learning across domains. A deeper investigation into this behavior is planned for future work.
We take EvaByte a step further with supervised fine-tuning. Following DCLM [2], OLMo-2 [36], TULU 3 [37], and OpenCoder [38], we curate a data mix from Tulu 3, OpenHermes 2.5, and OpenCoder, fine-tune EvaByte for 2 epochs, and achieve results on par with recent open-source LMs.
As mentioned at the beginning, we demonstrate below that byte-level modeling naturally avoids tokenization quirks and edge-case behaviors, such as the prompt boundary problem [4, 5], where tokenizer-based LMs behave inconsistently depending on exactly where the prompt is cut off. EvaByte resolves these cases seamlessly and delivers more predictable results.
Prompt: the HumanEval `longest` function signature and docstring, truncated at eight slightly different boundary points.

EvaByte: outputs from different prompt boundaries converge. Every truncation leads to the same correct continuation, `if not strings:\n        return None\n    longest = strings[0] ...`.

Qwen2.5-7B: different prompt boundaries lead to diverging and unexpected outputs, including a fabricated extra doctest (`>>> longest(['a', 'bb', 'ccc', 'dddd'])`), duplicated docstring quotes (`"""""`), a completion that stops immediately after the docstring, commented-out logic, a switch to a different variable name (`longest_string`), and the malformed condition `if len(strings) == 0 None:`.
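To see where such boundary sensitivity comes from, here is a small, self-contained illustration (not part of the EvaByte release; it uses the GPT-2 tokenizer from Hugging Face `transformers` purely as an example). With a subword tokenizer, the encoding of a truncated prompt is often not a prefix of the encoding of the full prompt, so the model is conditioned on token sequences it rarely saw during training; byte encodings always line up.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer shows the effect

full = "    longest = strings[0]\n"
full_tokens = tok.encode(full)
full_bytes = full.encode("utf-8")

for cut in range(len(full) - 4, len(full)):
    prefix = full[:cut]
    prefix_tokens = tok.encode(prefix)
    # Is the truncated prompt's token sequence a prefix of the full prompt's?
    token_prefix_ok = full_tokens[: len(prefix_tokens)] == prefix_tokens
    # Byte sequences trivially compose: a truncated prompt is a shorter byte string.
    byte_prefix_ok = full_bytes.startswith(prefix.encode("utf-8"))
    print(f"cut={cut:2d}  token-prefix={token_prefix_ok}  byte-prefix={byte_prefix_ok}")
```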
EvaByte also extends flexibly to multimodal tasks, treating image data as just another byte stream under a standard encoding such as JPEG or PNG. We follow prior work [39, 40] and use the JPEG format, which keeps image byte sequences manageable (a 384 x 384 image takes around 10K JPEG bytes). Thanks to byte-level modeling, EvaByte can seamlessly interleave image bytes with text bytes for vision-language training without any architectural tweaks.
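A minimal sketch of what such interleaving could look like, using Pillow to re-encode the image as JPEG; the delimiter bytes and helper below are invented for illustration, as the post does not describe EvaByte's actual multimodal formatting.

```python
import io
from PIL import Image

def interleave_image_and_text(image_path: str, caption: str) -> bytes:
    """Turn an image into plain bytes and splice it into the surrounding text bytes."""
    img = Image.open(image_path).convert("RGB").resize((384, 384))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)   # roughly 10K bytes at 384 x 384
    jpeg_bytes = buf.getvalue()
    # Hypothetical delimiters; the actual markers used by EvaByte are not specified.
    return b"<image>" + jpeg_bytes + b"</image>" + caption.encode("utf-8")
```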
Our multimodal dataset spans around 2T bytes, primarily filtered from OBELICS [41], LLaVA-ReCap [42], and DataComp [43, 44]. After 20K fine-tuning steps on this mix (~3M images), starting from a pretrained text-only checkpoint, EvaByte can zero-shot generate captions and answer basic visual questions (it can also generate images, though a bit fuzzily). EvaByte's multimodal capabilities are actively evolving, with current work addressing specific alignment challenges (such as spatial relationships between computer components); scaling and refining these abilities is one of our most promising research directions.
| Image | Text |
| --- | --- |
| *(image: laptop keyboard)* | Detailed caption: The image shows a close-up view of a laptop keyboard. The keyboard has a black color scheme with white lettering on the keys. The keys are arranged in a standard QWERTY layout, which is commonly used for English language keyboards. The keys are labeled with the standard Western letters and symbols, including the alphabet, numbers, and punctuation marks. The laptop appears to be a modern model, possibly a MacBook, given the design and the visible trackpad to the right of the keyboard. The trackpad is a flat, rectangular surface with a grid pattern, typical of MacBook trackpads. The laptop is open, and the keyboard is the main focus of the image. The background is blurred, but it seems to be an indoor setting with a wooden surface, possibly a table or desk. The lighting in the image is soft, suggesting an indoor environment with ambient lighting. |
| *(image: black dog swimming)* | Detailed caption: The image shows a black dog swimming in a body of water. The dog appears to be a medium to large breed, with a glossy black coat. It is captured mid-swim, with its head above the water and its body partially submerged. The water around the dog is rippling, indicating movement, and the surface of the water reflects the light, suggesting it is a sunny day. There are no visible texts or distinguishing marks that provide additional context about the location or the dog's identity. The style of the image is a candid, real-life photograph, capturing a moment of the dog's activity. |
| *(image: bed with throw pillows)* | Q: How many throw pillows are on the bed? A: three |
| *(image: the Eiffel Tower)* | Q: Which iconic landmark is on the picture? A: The Eiffel Tower |
| *(image: flowers)* | Q: What 2 colors are the flowers? A: red and yellow |
A recent concurrent work, Byte Latent Transformers (BLTs) [45], also explores tokenization-free language models and offers an in-depth analysis of BLTs' behavior at scale. BLTs introduce an elegant framework that first encodes byte sequences into patches and then processes them globally.
The main difference between BLTs and EvaByte lies in the architecture: BLTs use patchification and propose entropy patching to dynamically group bytes. While this approach adjusts compute allocation based on data complexity and reduces context length, it still relies on external models to determine patch boundaries. The majority of compute ends up focused on patch-level modeling, detached from the byte stream, similar to tokenizer-based models.
In contrast, EvaByte keeps things simple: it operates directly on bytes with a flat, Transformer-like model, without invoking external modules or grouping inputs. Empirically, EvaByte achieves better performance than BLTs even with 3-4x fewer training bytes, as shown in the table below. In addition, EvaByte is more flexible and scales easily to multimodal data, whereas BLTs require retraining or swapping out the auxiliary language model used for entropy patching.
We introduce EvaByte, a new family of efficient, scalable, and flexible byte-level language models. The ability to rival tokenization-based LMs with 5x less data while being faster highlights the significant potential of lower-level language modeling within the EvaByte architecture. Future research directions include further refining the model's architecture to improve both its capacity and efficiency, analyzing in depth how lower-level language models scale with increasing sizes and data volume, as well as extending the context length to seamlessly process diverse data types -- images, videos, and audio -- simultaneously.