Reimagining Tokens: From Glitchy Inputs to Byte-Native Thinking
Glitch tokens, patch‑based language models, and context engineering are rewriting the rules—here’s how to sharpen your insight loop and future‑proof your AI product strategy.
It’s 11 PM, and I’m staring at a weird word Twitter told me to paste into GPT-3: “SolidGoldMagikarp.” The moment I feed it to the model, the output spirals into gibberish. As someone who’s spent years building in the AI space, I recognize this as a glitch token – a kind of Achilles’ heel for language models. Glitch tokens are vocabulary entries that inexplicably trigger anomalous outputs, and “SolidGoldMagikarp” is the canonical example. They’re rare, they’re bizarre, and they highlight a deeper truth: our AI systems are only as robust as the tokens we feed them.
A slew of recent articles I’ve been reading prompted me to reflect on how we’ve long approached tokenization, model design, and scaling – and how rapidly those assumptions are evolving. In this post, I’ll share that reflection and some strategic insights, wearing my hat as a product innovator and wanna-be AI researcher. We’ll journey from the quirks of glitch tokens to cutting-edge byte-level models, exploring new architectures that break old bottlenecks. Along the way, I’ll channel some systems-level thinking (and even a bit of philosophical insight) to help you examine your own mental models as you build AI products.
The Tokenization Hangover: From Clever Shortcuts to Lingering Headaches
If you’ve worked with language models, you know the drill: raw text doesn’t go straight into a neural network. It first gets chopped into tokens – pieces like words or subwords. This was a necessary trick; it gave models a manageable vocabulary to learn. Nearly all popular NLP models have relied on an explicit tokenization step. Even the modern subword algorithms (BPE, WordPiece, SentencePiece) were essentially clever compression schemes for text.
For a while, this looked like a solved problem. Subword tokenizers handled many languages and, because they can always fall back to smaller pieces, never truly ran out of vocabulary. But they weren’t perfect. Any fixed vocabulary can limit a model’s ability to adapt to new jargon or languages. Tokenizers can be brittle – early pipelines broke on emojis or rare Unicode symbols. And then there are the glitch tokens, those odd artifacts of byte-pair encoding that live on the fringe of the vocabulary and confuse models. They’re a symptom of a larger issue: by pre-defining the units of language, we baked a certain worldview (or word-view) into our models. It’s as if we handed the model a dictionary and said “these are your LEGOs – build everything from them.” Most of the time it works, but occasionally the model tries to chew on a piece that doesn’t make sense.
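If you want to poke at those LEGO pieces yourself, a few lines of Python make the point. This is just a peek at one open-source BPE vocabulary (I’m assuming the tiktoken library here); the exact splits you see will depend on which encoding you load, but unusual strings reliably shatter into odd fragments while the raw UTF-8 bytes stay boringly uniform.

```python
# Peek at the "LEGO pieces" a fixed BPE vocabulary hands the model.
# Assumes the open-source tiktoken library; exact splits depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # a GPT-2-era byte-level BPE vocabulary

for text in ["hello world", "SolidGoldMagikarp", "naïve café 🤖"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} subword tokens {pieces}, "
          f"{len(text.encode('utf-8'))} raw bytes")
```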
The deeper I looked, the clearer it became that tokenization is legacy baggage. One paper put it bluntly: explicit tokenization itself is problematic. It’s a holdover from the old NLP pipeline days, like feature engineering’s last stand. What if, instead, we let the model figure out language from scratch? What if we freed it from our pre-packaged vocabulary and let it read text as raw bytes or characters? This idea isn’t new – it’s been percolating for years. Google’s CANINE model (“Character Architecture with No tokenization In Neural Encoders,” published in TACL 2022) did exactly that. It ditched the subword vocab and operated directly on characters, using a clever downsampling scheme to keep sequence lengths in check. The reward? Greater flexibility across languages and no “out-of-vocabulary” issues. CANINE even outperformed a comparable mBERT model on a multilingual QA benchmark, despite having fewer parameters.
Still, going tokenizer-free came with a cost: longer sequences mean more compute. Self-attention in Transformers famously scales quadratically with sequence length. Feed a transformer raw bytes and your sequences get several times longer (“hello” might be one token, but it’s five bytes), and that multiplier gets squared in the attention cost. For years, that brute-force approach just wasn’t worth it. We stuck with subwords as a necessary evil for efficiency’s sake.
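To put rough numbers on it, here’s a back-of-the-envelope sketch. My assumptions: roughly four bytes per subword token (typical for English BPE vocabularies, and it varies by language), and attention cost growing with the square of sequence length.

```python
# Back-of-the-envelope: what raw bytes do to attention cost.
BYTES_PER_TOKEN = 4                      # rough average for English BPE; varies by language
doc_tokens = 2_000                       # a modest document, measured in subword tokens

doc_bytes = doc_tokens * BYTES_PER_TOKEN
attention_blowup = (doc_bytes / doc_tokens) ** 2   # attention cost scales with n^2

print(f"{doc_tokens} tokens ≈ {doc_bytes} bytes")
print(f"naive byte-level attention is ~{attention_blowup:.0f}x more expensive")  # ~16x
```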
But things have changed. Compute is cheaper, and research is showing ways to make models handle longer sequences more gracefully. The hangover from tokenization – the brittleness, the blind spots – is finally pushing us to find a cure. And the cure coming into focus is to go native: treat language as the byte sequence it truly is, and train models to understand it end-to-end. Recent breakthroughs are proving that this is not only feasible, but possibly superior.
Byte-Native Models: Hello, Raw Data – Goodbye, Handcrafted Tokens
A new generation of models is emerging that throws out the tokenizer altogether. These models work directly with raw bytes (or characters), and they’re closing the performance gap with traditional token-based models. In fact, they’re starting to outperform them in some cases, while being more robust.
One landmark is Meta AI’s Byte Latent Transformer (BLT). BLT is a tokenizer-free architecture that matches the performance of token-based large language models, at scale, with significant boosts in efficiency and robustness. How? BLT doesn’t naively feed every single byte into a giant Transformer – that would drown the model in an enormous sequence. Instead, it encodes bytes into dynamically sized “patches” that become the units of computation. Think of patches as a smarter form of token: instead of a fixed 50k-word vocabulary, the model itself decides how to group bytes into meaningful chunks on the fly.
These patches aren’t uniform; BLT dynamically adjusts them. The rule is intuitive: if the next bytes look predictable or low in information, group them into a long patch; if something complex or uncertain is coming, break the patch and let the model pay more attention. In practice, BLT measures the entropy of the incoming byte stream to decide patch boundaries – high entropy (unpredictable content) triggers a new patch. This means the model allocates more compute to the hard parts of the input and less to the easy parts. It’s like a reader slowing down on a dense paragraph but skimming through simple sentences.
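Here’s a toy sketch of that boundary rule. The real BLT uses a small byte-level language model to estimate next-byte entropy; in this illustration a crude bigram count over the input stands in for it, and the threshold and maximum patch length are numbers I made up.

```python
# A toy sketch of entropy-driven patching in the spirit of BLT. A bigram count
# model over the input stands in for the small byte-level LM used in practice.
import math
from collections import Counter, defaultdict

def bigram_entropies(data: bytes) -> list[float]:
    """Entropy (in bits) of the next-byte distribution after each position."""
    following = defaultdict(Counter)
    for a, b in zip(data, data[1:]):
        following[a][b] += 1
    entropies = []
    for a in data[:-1]:
        counts = following[a]
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log2(c / total) for c in counts.values()))
    return entropies + [0.0]

def entropy_patches(data: bytes, threshold: float = 1.5, max_len: int = 16) -> list[bytes]:
    """Start a new patch whenever the next byte looks hard to predict."""
    ents = bigram_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i - 1] > threshold or i - start >= max_len:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"aaaaaaaaaa the quick brown fox aaaaaaaaaa"
for p in entropy_patches(text):
    print(p)   # long patches over the predictable runs, short ones over the surprising part
```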
The results are impressive. For a fixed inference cost (same number of FLOPs), BLT scales better than standard token models by growing both patch length and model size simultaneously. In other words, given the same compute budget, a BLT can handle more data or a bigger model than a vanilla Transformer reading subword tokens. Crucially, BLT’s patches ended up making sequences shorter on average than they would be with subword tokenization, saving compute without losing meaning. And because BLT’s “vocabulary” is just bytes (0–255), it’s inherently more flexible – it can ingest anything from English to emoji to programming code without needing a handcrafted tokenizer. No more mystery tokens lurking at the fringes, waiting to glitch out your model.
BLT isn’t the only byte-native pioneer. Meta AI’s MEGABYTE introduced a multi-scale decoder that can model sequences over a million bytes long. It segments an input into fixed-size byte patches and uses a local Transformer to capture patterns within each patch and a global Transformer to capture patterns across patches. This two-level approach slashes the cost of attention (sub-quadratic scaling) and lets the model handle local detail and global context separately. The payoff: a 1.3B-parameter MEGABYTE model was able to generate million-byte sequences and perform competitively with subword-based models on long-context language tasks. In fact, MEGABYTE achieved state-of-the-art results in domains like image and audio modeling directly from raw data. Together, these results establish the viability of tokenization-free sequence modeling at scale. In plain terms, we can finally train models on raw text (or sound, or pixels) without needing a human-crafted compression step at all.
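To make the two-level idea concrete, here’s a shape-level sketch in PyTorch. It only shows how 64 bytes become 8 global positions plus cheap within-patch attention; the real MEGABYTE uses causal decoder blocks and feeds the global output back into the local model, which I’ve left out, and all the names and dimensions here are mine.

```python
# Shape-level sketch of two-level (local + global) byte modeling. Illustrative
# names and sizes only; not the paper's actual architecture or code.
import torch
import torch.nn as nn

PATCH = 8                 # fixed patch size in bytes
D_LOCAL, D_GLOBAL = 128, 256

byte_embed = nn.Embedding(256, D_LOCAL)               # one embedding per byte value
to_global = nn.Linear(PATCH * D_LOCAL, D_GLOBAL)      # pack a whole patch into one global position
global_layer = nn.TransformerEncoderLayer(D_GLOBAL, nhead=4, batch_first=True)
local_layer = nn.TransformerEncoderLayer(D_LOCAL, nhead=4, batch_first=True)

x = torch.randint(0, 256, (1, 64))                    # a batch of 64 raw bytes
e = byte_embed(x)                                     # (1, 64, D_LOCAL)

patches = e.view(1, 64 // PATCH, PATCH * D_LOCAL)     # (1, 8, PATCH * D_LOCAL)
g = global_layer(to_global(patches))                  # global attention over 8 patches, not 64 bytes

local_in = e.view(64 // PATCH, PATCH, D_LOCAL)        # each patch processed independently
l = local_layer(local_in)                             # cheap attention within each 8-byte patch

print(g.shape, l.shape)                               # (1, 8, 256) and (8, 8, 128)
```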
Another fascinating entrant is MambaByte, which took a different route to the same destination. Instead of Transformers, MambaByte is built on a state-space model (SSM) backbone. State-space models, like the recent S4 architecture, can maintain a fixed-size hidden state regardless of sequence length – neatly avoiding the quadratic blow-up of attention. MambaByte adapted an SSM (the “Mamba” model) to work on byte sequences without any tokenizer. The authors found that this model was competitive with, and even outperformed, state-of-the-art subword Transformers on language modeling tasks. And it did so while inheriting the benefits of token-free input, such as robustness to noise (one can imagine it doesn’t get thrown off by a single odd byte or a glitch token). They even tackled the speed issue: by using a clever speculative decoding trick – drafting with a traditional tokenizer model and then verifying with the byte model – they got a 2.6× inference speedup. In the end, MambaByte made a strong case that state-space models can enable token-free language modeling efficiently.
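The drafting-and-verifying idea is easier to see in a simplified, greedy form. In this sketch, subword_draft and byte_model_next are hypothetical stand-ins for the fast drafter and the byte-level verifier, and the real scheme verifies an entire draft in one parallel pass rather than looping in Python.

```python
# Simplified greedy draft-then-verify. `subword_draft` and `byte_model_next`
# are hypothetical stand-ins; the real method checks the whole draft in one
# parallel pass instead of a Python loop.
def speculative_step(prompt: str, subword_draft, byte_model_next, k: int = 16) -> str:
    draft = subword_draft(prompt, max_new_chars=k)          # fast model guesses ahead
    accepted = ""
    for ch in draft:
        if byte_model_next(prompt + accepted) == ch:        # does the byte model agree?
            accepted += ch                                   # keep the "free" character
        else:
            accepted += byte_model_next(prompt + accepted)   # take the byte model's choice
            break                                            # and stop trusting the draft
    return prompt + accepted
```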
What these efforts have in common is the trend away from hand-crafted tokens and toward native byte/character inputs. This trajectory isn’t just academic; it carries practical significance for product builders. It means future models could handle any language or format you throw at them, out of the box. It means fewer engineering hours spent maintaining custom tokenization pipelines for every new market or data type. And it could mean more robust systems – ones that don’t freak out at an out-of-vocab token or need patchy fixes for every odd corner case. In short, as a product leader, you might soon be able to treat text as just data, and let the model deal with the raw bytes. That opens up a kind of simplicity on the far side of complexity – a chance to focus on higher-level problems because the low-level text processing is finally taken care of.
Beyond Attention: New Architectures to Slay Old Bottlenecks
Dropping tokenization is one side of the coin. The other side is evolving the model architecture itself. After all, if you feed a billion-byte text into a Transformer, you still face the compute bottleneck. So researchers have been busy rethinking how models can handle long sequences and complex inputs more efficiently. This is leading to some inventive architectural trends, which are worth watching if you’re charting a product roadmap.
One approach we touched on is patching – used by BLT and MEGABYTE. This is essentially patch-based inference for text, akin to how vision transformers use image patches. By chunking a long sequence into patches and processing hierarchically, the model avoids blowing up its internal workload. BLT’s twist of dynamic patch sizing (entropy-based) is especially interesting: it’s a form of adaptive computation, allocating resources where needed. That means if your input has a lot of fluff (say, repetitive spaces or easy boilerplate), the model doesn’t waste cycles on it. But if suddenly a DNA sequence or code snippet appears (high entropy, unfamiliar pattern), the model zooms in and spends more compute there. From a systems perspective, this is beautiful – it’s an AI doing resource allocation on the fly, rather than treating every input token equally.
Another trend is dynamic token pooling. Traditionally, Transformers process all tokens at all layers, which is overkill if some tokens are redundant. Dynamic pooling mechanisms attempt to shorten the sequence as it flows through the network. For example, a model might learn to merge or drop less important tokens after the first few layers, reducing the length for deeper layers. Recent research showed that a Transformer with dynamic token pooling can jointly segment and model language, achieving faster and more accurate results than vanilla Transformers within the same compute budget. Essentially, the model itself learns how to compress the text representation as it goes, instead of relying on a fixed tokenizer or a naive pooling like “take every 4 tokens”. This joint segmentation+modeling approach blurs the line between “tokenization” and “modeling” – they happen together, dynamically. For product folks, it hints at models that are more efficient and possibly more interpretable (imagine seeing which parts of a document the model deemed worth “zooming in” on or merging).
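A stripped-down version of the pooling idea looks like this. The boundary scorer here is random noise standing in for a learned boundary predictor (the actual work trains that end-to-end with the language model); the point is simply that runs of token vectors get merged before the deeper layers ever see them.

```python
# Minimal sketch of dynamic pooling: a (here random, in practice learned)
# boundary score decides where segments end, and each segment of token
# vectors is mean-pooled into a single vector for the deeper layers.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 64))               # 12 token vectors, d = 64

boundary_logits = rng.normal(size=12)            # stand-in for a learned boundary head
is_boundary = boundary_logits > 0.5              # a segment ends where the score is high
is_boundary[-1] = True                           # always close the final segment

pooled, start = [], 0
for i, end_here in enumerate(is_boundary):
    if end_here:
        pooled.append(tokens[start : i + 1].mean(axis=0))   # merge the segment
        start = i + 1

pooled = np.stack(pooled)
print(tokens.shape, "->", pooled.shape)          # deeper layers now process fewer positions
```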
And we can’t forget the state-space model (SSM) path, as exemplified by MambaByte. SSMs like S4 sprung from outside NLP – a new way to handle sequences using linear dynamical systems principles. What makes them special is that, unlike attention, their memory cost doesn’t balloon with sequence length. An SSM can, in theory, handle an arbitrarily long input with fixed computational footprint, because it processes the sequence through a recurrence with a fixed state vector size (using a lot of clever math to do so fast). We’re still in early days of seeing SSMs compete with Transformers in pure performance, but their success in token-free modeling is a proof-of-concept that alternative architectures can overcome some Transformer bottlenecks. They might be more resistant to very long sequences or allow streaming data processing in ways Transformers struggle with. For AI product strategy, it means the space of viable model architectures is widening – and if your use case involves very long data streams (say, logging data, genomics, long videos, etc.), keep an eye on these alternatives.
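For intuition, here’s the bare recurrence in NumPy. Real SSMs like S4 and Mamba parameterize these matrices carefully and compute the scan far faster than a Python loop, but the property the paragraph above leans on is visible: the state never grows, no matter how long the stream runs.

```python
# A bare-bones linear state-space recurrence: fixed-size memory regardless of
# sequence length. Toy matrices; real SSMs structure A, B, C very differently.
import numpy as np

d_state, d_in = 16, 8
A = np.eye(d_state) * 0.95                       # toy state transition (slowly decaying memory)
B = np.random.randn(d_state, d_in) * 0.1         # input projection
C = np.random.randn(d_in, d_state) * 0.1         # output projection

h = np.zeros(d_state)                            # the entire memory of the model
for x_t in np.random.randn(100_000, d_in):       # stream 100k steps through it...
    h = A @ h + B @ x_t                          # ...the memory never grows
    y_t = C @ h

print(h.shape)                                   # still (16,)
```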
All these innovations – patches, dynamic pooling, SSMs – are converging toward a common goal: breaking the trade-off between input length and performance. We want models that can read more, and read it efficiently, without needing us to spoon-feed them with pre-digested tokens or to prune context manually. It’s a thrilling area where fundamental research meets practical need. As a leader, one of your jobs is to sense when a research idea is nearing a breakout point for real-world impact. Byte-native Transformers and dynamic sequence modeling might be at that point. The next generation of AI products could very well ride on these capabilities, offering users seamless handling of huge, messy, multilingual data – and doing it faster and cheaper than before.
The Bitter Lesson Revisited: Scaling, Compute, and the Folly of Cuteness
Back in 2019, Rich Sutton wrote The Bitter Lesson, essentially pointing out that in AI, general methods that leverage compute have always won out over domain-specific cleverness. It was a sober reminder that often the best way to improve a system is not by adding more intricate rules or features, but by making it bigger, faster, and more data-hungry. In the context of tokenization and architecture, we’re seeing that play out in real time.
Consider tokenization: It was a clever hack, a human-engineered way to inject linguistic knowledge and efficiency. But the bitter lesson would predict that if we throw enough compute and learning capacity at the problem, a learned approach (like byte-level modeling) will eventually outperform the clever hack. And lo and behold, we now have models like BLT and MEGABYTE proving exactly that – given sufficient scale, letting the model learn from raw bytes works as well as or better than our carefully designed subwords. It’s compute over cognizance. It’s a bit humbling: all the time we spent perfecting tokenizers might eventually be obsolete, replaced by brute-force learning.
The same pattern emerges in architecture. Think of all the exotic architectures researchers tried over the years – gated RNNs, neurosymbolic hybrids, handcrafted parsing modules – many added complexity in hopes of a leap in understanding. Some helped on small scales, but when the dust settled, plain Transformers scaled up with mountains of data steamrolled past most of these bespoke ideas. The Transformer itself succeeded not because it was simpler than an RNN, but because it was more scalable (parallelizable, amenable to big data) – again a victory of scale and compute. Now, even Transformers have limitations (like that pesky quadratic cost), so new general methods like patching or state-spaces are coming in – but notice, these aren’t injecting human linguistic rules; they’re general-purpose strategies to handle more data. Patching bytes based on entropy isn’t a linguist’s idea, it’s an engineer’s way to optimize compute. State-space models didn’t come from analyzing grammar; they came from math that scales in a different way.
One very concrete example of compute over cleverness is in how we boost model performance today. If you want a model to get better at a task, you could hire linguists and domain experts, or… you could train a bigger model on more data, or use techniques like ensembling or retrieval. Nine times out of ten, the latter wins. The “bitter” truth is that scale (in parameters, data, or compute steps) often beats niche optimization. This is not to devalue scientific insight – it’s to recalibrate where insight is applied. Insight in AI has shifted from crafting the perfect rule to crafting the perfect scaling strategy or architecture that can handle scale. Insight is now in finding the simple rules that let brute force shine (like “scale length by patching” or “use byte-level to avoid vocab limits”), rather than micromanaging the model’s internals.
As product and research leaders, we should internalize this lesson. It doesn’t mean we always just buy the bigger GPU cluster – budgets are real – but it means we should be careful about chasing diminishing returns on clever hacks. If your team is spending months tinkering with a special-case feature to improve model X by 5%, consider: could a larger model or more data or a more general technique yield 15% with less fuss? Often, yes. The tokenization saga is a perfect case in point: instead of meticulously updating vocabularies and scripts for each new domain (a very clever but manual process), many teams now just switch to a model that doesn’t need that, and pour compute into training it. The long-term trend is clear: favor approaches that scale with compute and data over those that rely on frozen human insight. This also future-proofs your strategy – because if there’s one thing that ages quickly in AI, it’s our cute bespoke solution when someone else’s scaled-up model leaps ahead.
Yet, the bitter lesson has a sweet coda when paired with human judgment. We still need strategic cleverness – in choosing what to scale and how. The recent advances didn’t happen by accident; researchers recognized where a targeted change (like patching or an SSM) could remove a scaling pain point. They asked, “What if we remove the tokenizer?” – a very insightful question – and then let the scale do its thing. This interplay of insight and scale is where leadership comes in: posing the right questions and then leveraging compute to answer them, rather than manually crafting the full answer ourselves.
Context Is King: Beyond Model Size in the Real World
While model scale has driven incredible progress, another truth has become evident in practice: it’s not just about the model anymore; it’s about the ecosystem around the model. Specifically, how we manage and supply context to our models often matters more for real-world performance and cost than adding a few billion more parameters.
Think about it: GPT-3 was 175B parameters trained on basically the internet. But if you ask it a specific question about, say, your company’s internal data, it might flub it – not for lack of size, but for lack of relevant context. Enter retrieval-augmented generation (RAG) and related techniques. Instead of making the model bigger, we give the model a brainy assistant: a retrieval system that fetches relevant information on the fly. DeepMind’s RETRO model demonstrated this dramatically – a 7.5B parameter model hooked up to a large text database matched the performance of a 175B parameter GPT-3 on the Pile benchmark. In other words, 25× fewer parameters achieved comparable results by smartly bringing in external knowledge. That’s a giant leap in efficiency. The takeaway: sometimes knowledge in a database trumps knowledge baked into weights.
This has huge implications for product strategy. Instead of spending millions training a model from scratch to memorize your entire knowledge base, you might get more mileage by combining a moderately sized model with a fast search index or vector database. You keep the model light and nimble, and let it query for details as needed. It’s like having a small agile team with great internet access versus a giant know-it-all that’s slow and expensive.
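The whole pattern fits in a few lines. In this sketch, embed and llm are hypothetical stand-ins for whatever embedding model and generation endpoint you use; a production system would swap the brute-force cosine search for a vector database and precompute the document embeddings.

```python
# Minimal retrieval-augmented pipeline. `embed` and `llm` are hypothetical
# stand-ins for your embedding model and generation endpoint.
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    doc_vecs = np.stack([embed(d) for d in docs])      # precompute / use a vector DB in practice
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]    # top-k by cosine similarity

def answer(query: str, docs: list[str], embed, llm) -> str:
    context = "\n\n".join(retrieve(query, docs, embed))
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm(prompt)                                 # the model stays small; the knowledge lives in docs
```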
Summarization is another form of context engineering changing the game. Long conversations, documents, or workflows quickly overflow any model’s context window (even though those windows are growing – 4k tokens, 16k, 100k…). Rather than brute force a huge context length at full detail (which is expensive), many applications now use summarization or distillation. They generate summaries of earlier chat messages, or compress a 100-page document into key bullet points, and feed that to the model. It’s a bit meta – using the model to help itself by creating condensed context. Done right, this dramatically cuts cost and latency while preserving relevant information. A summarized context can maintain coherence in a customer support chatbot without needing an exorbitantly large model or context size.
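A rolling-summary helper can be as simple as the sketch below, where llm_summarize is a hypothetical call to whatever model you deploy and the thresholds are arbitrary: keep the last few turns verbatim, compress everything older into one paragraph.

```python
# Rolling summary for a long conversation. `llm_summarize` is a hypothetical
# call to whatever model you deploy; the thresholds are arbitrary.
def build_context(messages: list[str], llm_summarize,
                  keep_recent: int = 6, max_chars: int = 4000) -> str:
    if sum(len(m) for m in messages) <= max_chars or len(messages) <= keep_recent:
        return "\n".join(messages)                     # everything still fits: no summary needed
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = llm_summarize("\n".join(older))          # compress the distant past once
    return ("Summary of earlier conversation:\n" + summary
            + "\n\n" + "\n".join(recent))              # recent turns stay verbatim
```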
And then there’s caching and reuse: If your system answers a question once, cache it. If a particular expensive reasoning step is done, memoize it. Traditional software engineering taught us to optimize repeated computations, and AI systems are no different. Many product teams set up hybrid pipelines where the AI does heavy lifting once and reuses results, or where a smaller model handles easy cases and only escalates to a big model for the hard ones. All these tactics are about being smart with context and computation, not just throwing size at the problem.
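A minimal version of that cache is just a dictionary keyed on a hash of the prompt (plus the model name, since answers differ across models). In production you’d back this with Redis or a database and exclude personalized or time-sensitive prompts; the llm callable and the model name here are placeholders.

```python
# Cache answers keyed on a hash of (model, prompt). `llm` is a hypothetical
# callable; back this with Redis or a database in a real system.
import hashlib

_cache: dict[str, str] = {}

def cached_llm(prompt: str, llm, model: str = "my-deployed-model") -> str:
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm(prompt)      # pay for the expensive call only once
    return _cache[key]
```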
Why am I harping on this? Because it represents a shift in what effective scaling looks like. For a long time, the mantra was “bigger model = better.” Now it’s more nuanced: better data and context management = better. We’re augmenting our large models with retrieval, tools, memory – essentially giving them external support rather than internally growing them endlessly. For product leaders, this means the frontier of improvement might be less about training the next 100B model and more about how you orchestrate information around the model. It’s a bit like the evolution of computing hardware: at some point, adding more cores yields less benefit than improving memory access or storage. In AI, adding parameters yields less benefit than ensuring the model always has the right information at its fingertips.
The savvy strategy is to combine approaches: use sufficiently large models to get general capabilities (thanks, scaling laws), but then use context engineering to ground and specialize those models on the fly. This often yields a better cost-performance tradeoff than a naive scale-up. And practically, it means you can deliver more value to users by making the model smarter (through context) rather than just bigger. As a side effect, these techniques also help with forward-compatibility: if you design your system to fetch knowledge and handle summaries, you’re less tied to any one model. You can swap in a new model later, maybe a byte-native one, without retooling your entire knowledge pipeline.
Insight as a Process: Rethinking Our Mental Models
All this talk of tokens, bytes, and scaling strategies ultimately points back to one thing: how we think and make decisions in this fast-moving field. It’s one thing to adopt a new model or technique; it’s another to cultivate the mindset that consistently spots these opportunities and pitfalls. Here’s where I’ll get a bit philosophical – without going too abstract – to share how I approach “insight” in AI strategy, inspired by some heavy thinkers I admire.
Insight often starts with better questions. Someone asked, “Do we really need a tokenizer?” – and that question led to CANINE, BLT, and beyond. In my own career, the biggest leaps forward came not from knowing the answers, but from bravely asking the questions that others thought were settled. For example, a few years ago everyone assumed “more parameters = better model.” But asking “what if we keep parameters the same and add retrieval?” led to a new line of products and research that changed the game. Encourage a culture of inquiry in your team. Ask things like: What are we assuming that might no longer be true? or Where are we being clever instead of letting the model learn? These questions open doors.
Next is detecting blind spots. We all have them – an area we overlook or an assumption we take for granted. Glitch tokens were a literal blind spot in language models – edge cases tokenizers didn’t handle well. When they first popped up, it was easy to shrug them off as oddities. But those who dug in realized they signaled a deeper issue in how models represent information. In our decision-making, blind spots can be strategic (maybe we’re fixated on model accuracy and ignoring inference cost, or vice versa). One way to reveal blind spots is to listen to diverse perspectives: your research scientists, your engineers, your users – each might see a risk or opportunity you don’t. Another way is to simulate failure: actively imagine scenarios where your plan falls apart, and see what you overlooked. This kind of reflection is like debugging your own thought process – a habit that pays off immensely in a complex field like AI.
A powerful step beyond identifying blind spots is integrating competing positions. In AI, we often see camps form: “All in on Transformers!” vs “New architecture now!”, or “Just scale!” vs “We need more data and retrieval!”. It’s easy to pick a side and develop tunnel vision. But the real breakthroughs often come from integrating the truths on both sides. The saga of model scaling vs. retrieval augmentation is a great example: both more parameters and more context turned out to be important. The best teams aimed to do both, or to balance them in creative ways. As a leader, you should be part diplomat, part synthesizer – able to take the intuition of one approach and marry it with the strength of another. When I lead strategy sessions, I often draw two seemingly opposed ideas from different team members and literally force a fusion: “If we didn’t have to choose, how might we achieve both outcomes?” It’s amazing how often that yields a novel solution – a true insight that feels almost obvious in hindsight.
Under the hood, what I’m really advocating is a self-reflective insight process. It’s an internal game of hypothesis and verification, much like the scientific method, but applied to one’s own thinking. A famous epistemologist once described understanding as a spiral: we experience data, we ask questions, we get insights, we test them, and each time our understanding lifts to a higher integration. In the rush of AI development, taking the time for this reflective spiral is tough, but worth it. It helps us build not just better models, but better mental models of how we approach problems. And those mental models guide countless decisions big and small.
So, as you navigate the rapid changes – tokenization falling out of favor, architectures shifting under your feet, scaling throwing curveballs – remember to step back and think about how you’re thinking. Are you reacting on autopilot (“We’ve always done it this way”)? Are you clinging to a comfortable proxy metric while the world changes around it? Or are you actively questioning, learning, and reframing your approach? Leading in AI requires as much clarity of thought as it does technical know-how.
Building Forward-Compatible Mental Models: Key Takeaways
We’ve covered a lot, so let’s distill a few practical takeaways. These are guiding principles I use to keep my strategy robust and future-proof in the face of AI’s rapid evolution:
Assume Change in Foundations: What is true today (like subword tokenization dominance) might not hold tomorrow. Be ready to pivot when core assumptions (e.g., “we need a tokenizer”) are challenged by new evidence. Design your systems in a modular way so you can swap out pieces (like tokenizers or model backbones) as the tech evolves.
Prioritize Scalable Over Clever: When choosing solutions, favor those that scale with more data/compute over highly specialized tweaks. Many “clever” shortcuts in NLP – from hand-built token rules to custom features – have been outpaced by approaches that let the model learn with more compute. Invest in infrastructure that lets you scale, and in talent that knows how to leverage scale, rather than micro-optimize. As history shows, compute plus general algorithms tend to win.
Leverage Context Engineering: Before defaulting to a larger model to boost performance, exhaust options to enrich the model’s context. Can you retrieve relevant facts? Summarize long inputs? Cache previous results? Often a smaller model with the right context outruns a bigger model blundering in the dark. This not only improves performance but can drastically cut costs. It’s a no-brainer for product pragmatists.
Stay Architecture-Agnostic (to a point): Keep an open mind about new model architectures. Transformers won’t be the end-all forever. Whether it’s state-space models, patch-based hybrids, or something entirely new, be willing to experiment when your use case aligns (e.g., extremely long sequences might warrant an SSM like MambaByte). However, balance this with healthy skepticism – many new ideas fade. The key is to run small experiments early rather than betting the farm, so you’re ready to scale up if a new approach proves itself.
Cultivate an Insight Culture: Encourage your team (and yourself) to question assumptions and learn from anomalies. Glitch tokens, odd errors, outlier data points – these often hide valuable lessons. Create space for reflection in your development cycle: post-mortems, “insight of the week” discussions, scenario planning. The goal is a team that’s not just executing the next task, but also continuously updating its mental model of what works and why.
Integrate, Don’t Isolate: When faced with “X vs Y” debates (e.g., big models vs. smart context, speed vs. accuracy), look for integrative solutions. Often the best path is a hybrid: X and Y in moderation, or X enabled by Y. By avoiding extreme positions, you’ll build systems that are more balanced and resilient to change.
In closing, the landscape of AI product development is one of constant reinvention. Today’s glitch is tomorrow’s insight. We started with a weird token that broke a model, and ended up contemplating architectures that break the mold. The through-line is learning – at scale (for the models) and in depth (for us humans). As you lead teams or products in this space, ground yourself in first principles but be ready to update them. Build mental models that are sturdy yet adaptable, rooted in understanding yet open to new information. That way, whether it’s a paradigm shift from tokens to bytes, or Transformers to something new, or shallow models to deeply integrated systems – you’ll not only adapt, you’ll thrive. Here’s to the insights yet to come, and the exciting uncertainties that will drive them. 🚀
Sources:
Yu et al., “MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers” (2023) – Multi-scale byte Transformer, patches for long sequences.
Pagnoni et al., “Byte Latent Transformer: Patches Scale Better Than Tokens” (2024) – Dynamic byte patching, matches token model performance.
Nawrot et al., “Efficient Transformers with Dynamic Token Pooling” (ACL 2023) – Dynamic segmentation of sequences improves speed and accuracy.
Wang et al., “MambaByte: Token-free Selective State Space Model” (2024) – Byte-level state-space LM, competitive with subword Transformers.
Clark et al., “CANINE: Pre-training an Efficient Tokenization-Free Encoder” (TACL 2022) – Character-level model with downsampling, no fixed vocab.
Li et al., “Glitch Tokens in Large Language Models” (2024) – Study of anomalous tokenizer outputs impacting LLM behavior.
LessWrong Wiki, “Glitch Tokens” (2023) – Definition and examples of glitch tokens causing gibberish output.
Synced AI, “DeepMind’s RETRO vs GPT-3” (2021) – Retrieval-augmented model matches GPT-3 with 25× fewer parameters.
Kevin Rohling, “BLT Deep Dive: Hello bytes, goodbye tokens” (2024) – Accessible overview of BLT’s patching approach and benefits.