The next frontier for large language models (LLMs) is not just about having bigger memory capacity. It’s about improving judgement and ensuring that the right information is used at the right time.
LLMs keep getting bigger, but they are not necessarily getting better. Increased scale gives LLMs more space to process more information, but this does not help models use that information more effectively. As LLMs evolve into agentic systems that can reason and act, the ability to choose what information matters in any given context will determine how capable they really are.
Bigger but not necessarily better
The context window of an LLM, the amount of recent text it can hold and use to shape its next response, has expanded dramatically in recent years: from a few thousand tokens to a few hundred thousand, and in some cases as many as a million. In theory, this should allow LLMs to read and reason across entire documents, sustain longer conversations, and draw on information from multiple sources to produce more coherent answers.
However, Stanford’s 2025 AI Index shows that standard proficiency benchmarks are producing near-identical results across leading LLMs, despite wide differences in model size and memory. This suggests that increased scale alone is not enough to make a meaningful difference to LLM efficacy.
At the same time, larger context windows are costly to use. That spend isn’t necessarily wasted: bigger contexts let LLMs handle longer documents, recall past exchanges, and reason across complex information. But for business ROI, the higher compute bill has to be matched by better outputs.
Nvidia estimates that keeping a 128K token conversation (which is roughly the length of a short book) in an LLM’s working memory can consume about 40 gigabytes of graphics processing unit (GPU) memory. This means that one long chat can max out an entire GPU, which is very costly for potentially only marginal gains in performance.
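Nvidia’s figure is in line with a back-of-the-envelope KV-cache calculation. The sketch below uses an illustrative model configuration (a Llama-3-70B-style setup: 80 layers, 8 key/value heads under grouped-query attention, head dimension 128, 16-bit precision); these parameters are assumptions for the arithmetic, not Nvidia’s exact setup.

```python
# Back-of-the-envelope KV-cache sizing for a long conversation.
# Model parameters below are illustrative assumptions (Llama-3-70B-style).
N_LAYERS = 80        # transformer layers (assumed)
N_KV_HEADS = 8       # key/value heads with grouped-query attention (assumed)
HEAD_DIM = 128       # dimension per attention head (assumed)
BYTES = 2            # fp16/bf16 precision: 2 bytes per value
TOKENS = 128 * 1024  # a 128K-token conversation

# Each token stores one key and one value vector per layer, per KV head.
bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
total_gib = bytes_per_token * TOKENS / 1024**3
print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_gib:.0f} GiB total")
# → 320 KiB per token, 40 GiB total
```

Under these assumptions the cache alone fills roughly 40 GiB of GPU memory, before counting the model weights themselves.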
New data. New problems.
LLMs need the right data to produce answers that are accurate, relevant, and useful. Today, they are being fed more information than ever in a bid to make their responses richer and more precise. This can include recent documents, data from internal knowledge bases, previous chat histories, database records, and live information pulled from APIs or other connected applications.
Each of these sources adds useful information, but they also bring more complexity. The data is often scattered across different systems, updated at different speeds, and stored in different formats, so stitching it all together takes more time and computing power. The crux of the issue, however, is that even with all that data, LLMs aren’t guaranteed to use the right information at the right time.
It’s crucial, then, that LLMs develop better judgement: the ability to choose the right data, at the right moment, to deliver the right answer.
Stanford and Berkeley’s “Lost in the Middle” research shows that when models are flooded with long contexts, they often fail to recall what matters most, particularly information buried in the middle of the prompt. In other words, simply giving LLMs more information doesn’t help if they can’t recognise what’s relevant.
For example, a customer support bot that scrolls through an entire chat history instead of focusing on the last issue you raised is slowed down by the extra information; access to more data does not, by itself, produce better judgement.
The same issue can crop up in enterprise search. Ask an AI assistant for your company’s latest travel policy, and it might pull up five versions — including one from 2019 — because it can’t judge which source is current. The answer looks comprehensive, but it’s not actually useful.
In short, the problem isn’t simply how much data an LLM can access, but how well it manages that data.
The fix: context engineering
If more data alone isn’t the answer, better context is. Context engineering is the practice of deciding what information an LLM needs, when it needs it, and where that information should come from. The aim isn’t to feed models everything, but to help them focus on the right things to produce better outputs.
Getting context engineering right depends on improving performance, relevance, and access. Performance improves when LLMs can reuse work they’ve already done, so time and energy aren’t wasted recomputing answers. Relevance is about helping LLMs narrow their field of view to the data that improves reasoning for a specific task. Access is about ensuring useful data is available, accurate, and secure whenever the model needs it. Taken together, these three elements enable LLMs to make better choices about what to use and when, transforming raw information into meaningful context.
Turning context into capability
Modern data infrastructure is what makes this all possible. Real-time in-memory storage can be used to speed up retrieval so that LLMs can recall useful context in milliseconds. In addition, semantic caching ensures that LLMs can identify when a question has already been answered, avoiding unnecessary compute. Vector search also helps LLMs surface the most relevant information from large stores of data. Together, these techniques are what give LLMs the ability to use the right context at the right moment, rather than simply remembering everything.
For example, a business might use an LLM to help employees find and summarise company compliance policies. Without context engineering, the model risks merging information from outdated or unrelated compliance documents and producing an inaccurate answer to a specific query. With context engineering, however, the model can judge which sources are relevant, filtering for the most recent verified documents before responding. Here, vector search helps the system identify semantically similar content and pinpoint the sections relevant to the compliance query, while real-time retrieval ensures the model draws only on up-to-date documents. Ultimately, the model retrieves only what matters, so its answers are faster and more accurate. Simply put: the model is not remembering more, it’s reasoning better.
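That retrieval step can be sketched as a freshness filter followed by similarity ranking. The document titles, dates, and word-overlap embedding below are invented for illustration; a production system would use learned vector embeddings and a vector database.

```python
import math
from collections import Counter
from datetime import date

def embed(text):
    # Toy bag-of-words vector; a real system would use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical document store with metadata (titles and dates are invented).
docs = [
    {"title": "Compliance Handbook 2019", "updated": date(2019, 3, 1),
     "text": "legacy data retention compliance rules"},
    {"title": "Compliance Handbook 2025", "updated": date(2025, 1, 15),
     "text": "current data retention compliance rules"},
    {"title": "Office Parking Guide", "updated": date(2025, 2, 1),
     "text": "parking permits and office access"},
]

def retrieve(query, cutoff, k=1):
    # Step 1: the freshness filter keeps only up-to-date documents.
    fresh = [d for d in docs if d["updated"] >= cutoff]
    # Step 2: vector search ranks the survivors by semantic similarity.
    qv = embed(query)
    return sorted(fresh, key=lambda d: cosine(qv, embed(d["text"])), reverse=True)[:k]

top = retrieve("data retention compliance rules", cutoff=date(2024, 1, 1))
print(top[0]["title"])  # → Compliance Handbook 2025
```

Filtering on metadata before ranking is what keeps the 2019 handbook out of the answer: similarity alone would score it highly, since its text overlaps the query almost as much as the current version’s.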
The rise of judgement in AI
As AI systems evolve from static models to dynamic agents, the focus is shifting from how much they can remember towards how effectively they can use what they know. Scaling context windows is critical to future performance, but scale alone is not sufficient to make systems intelligent. It’s crucial that LLMs develop better judgement and are able to choose the right data, at the right moment, to deliver the right answer. This balance between scale and judgement will define the next generation of LLMs.
Manvinder Singh
Manvinder Singh is VP of Product Management for AI at Redis where he is responsible for the portfolio of AI offerings including vector search, semantic caching and agent memory. Previously, he spent 10+ years in various AI and Cloud Infrastructure roles at Google.