Wednesday 19 March 2025
Deep learning models have long been praised for their ability to generalise well to unseen data, but a new study sheds light on a phenomenon that complicates this picture. Grokking, as it is called, is a form of delayed generalisation in which a model’s performance on unseen data improves significantly only long after its performance on the training data has converged.
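To make the idea of a delay concrete, the short Python sketch below measures the gap between the epoch at which training accuracy first reaches a threshold and the epoch at which test accuracy does. The function names, the 0.95 threshold and the two accuracy curves are illustrative assumptions, not values or methods taken from the study.

```python
def first_epoch_reaching(curve, threshold):
    """Return the index of the first epoch whose value is >= threshold, or None."""
    for epoch, value in enumerate(curve):
        if value >= threshold:
            return epoch
    return None


def generalisation_delay(train_acc, test_acc, threshold=0.95):
    """Epochs between train and test accuracy first reaching the threshold.

    Returns None if either curve never reaches the threshold.
    """
    train_epoch = first_epoch_reaching(train_acc, threshold)
    test_epoch = first_epoch_reaching(test_acc, threshold)
    if train_epoch is None or test_epoch is None:
        return None
    return test_epoch - train_epoch


# Made-up curves for illustration only (not results from the study).
train_acc = [0.50, 0.80, 0.97, 0.99, 0.99, 0.99, 0.99, 0.99]
test_acc = [0.50, 0.52, 0.55, 0.55, 0.60, 0.75, 0.96, 0.98]
delay = generalisation_delay(train_acc, test_acc)
print(delay)
```

A large positive delay on real training logs would be one simple signature of grokking; a delay near zero would indicate that the model generalises as soon as it fits the training data.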
Researchers have struggled to understand why grokking occurs, with some attributing it to factors such as data sparsity, large initial weights, and high regularisation rates. The study suggests instead that the key to understanding grokking lies in the way data is distributed between training and test sets.
The researchers created two synthetic datasets designed to mimic real-world scenarios, where classes are structured into subclasses with varying distances from each other. They found that by subtly shifting the distribution of these subclasses, they could induce grokking in their models.
In one dataset, the distance between subclasses was carefully controlled, allowing the team to manipulate the model’s ability to generalise. They discovered that even when only a small number of samples were used for training, the model would still exhibit delayed generalisation if the distribution of the test data was sufficiently different from that of the training data.
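A dataset of this kind can be sketched in a few lines of Python. The example below is a minimal illustration, assuming two-dimensional Gaussian subclasses and a hypothetical layout in which classes are separated along one axis and subclasses along the other; the function names, distances, mixing weights and sample counts are all assumptions rather than the study's actual construction.

```python
import random


def make_subclass_centres(n_classes, n_subclasses, class_gap, subclass_gap):
    """Place each class on the x-axis and spread its subclasses along the y-axis."""
    centres = {}
    for c in range(n_classes):
        for s in range(n_subclasses):
            centres[(c, s)] = (c * class_gap, s * subclass_gap)
    return centres


def sample_split(centres, subclass_weights, n_samples, noise, rng):
    """Draw (x, y, class_label) points; subclass_weights sets the subclass mix."""
    n_classes = 1 + max(c for c, _ in centres)
    subclasses = list(range(len(subclass_weights)))
    data = []
    for _ in range(n_samples):
        c = rng.randrange(n_classes)
        s = rng.choices(subclasses, weights=subclass_weights, k=1)[0]
        cx, cy = centres[(c, s)]
        data.append((rng.gauss(cx, noise), rng.gauss(cy, noise), c))
    return data


rng = random.Random(0)
centres = make_subclass_centres(n_classes=2, n_subclasses=2,
                                class_gap=10.0, subclass_gap=3.0)
# Training set dominated by subclass 0; test set dominated by subclass 1.
train = sample_split(centres, [0.9, 0.1], n_samples=200, noise=0.5, rng=rng)
test = sample_split(centres, [0.1, 0.9], n_samples=200, noise=0.5, rng=rng)
```

Changing `subclass_gap` controls how far apart the subclasses sit, while the two weight lists control how strongly the subclass mix differs between training and test data. Both splits carry the same class labels, so the shift lies only in which subclasses the model sees most often.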
The second dataset was more complex, with classes structured into super-classes and subclasses. The researchers found that even without any samples from specific subclasses, the model could still learn to generalise, thanks to the similarity between subclasses within a class.
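The intuition that similarity between sibling subclasses can stand in for missing samples can be illustrated with a toy example. The sketch below swaps the neural network for a plain nearest-centroid classifier and reduces the hierarchy to two levels (classes "A" and "B", each with three subclasses); the layout, names and sample counts are hypothetical and are not the study's setup.

```python
import random


def sample_cluster(centre, n, noise, rng):
    """Draw n two-dimensional points around centre with Gaussian noise."""
    return [(rng.gauss(centre[0], noise), rng.gauss(centre[1], noise))
            for _ in range(n)]


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest_class(point, class_centroids):
    """Label of the class centroid closest to point (squared Euclidean distance)."""
    return min(
        class_centroids,
        key=lambda c: (point[0] - class_centroids[c][0]) ** 2
        + (point[1] - class_centroids[c][1]) ** 2,
    )


rng = random.Random(0)
# Hypothetical layout: subclasses of one class sit close together, classes far apart.
subclass_centres = {
    ("A", 0): (0.0, 0.0), ("A", 1): (0.0, 2.0), ("A", 2): (2.0, 0.0),
    ("B", 0): (20.0, 0.0), ("B", 1): (20.0, 2.0), ("B", 2): (22.0, 0.0),
}
held_out = ("A", 2)  # this subclass contributes no training samples

train_points = {"A": [], "B": []}
for (cls, sub), centre in subclass_centres.items():
    if (cls, sub) != held_out:
        train_points[cls].extend(sample_cluster(centre, 50, 0.5, rng))

class_centroids = {cls: centroid(pts) for cls, pts in train_points.items()}
test_points = sample_cluster(subclass_centres[held_out], 100, 0.5, rng)
accuracy = sum(nearest_class(p, class_centroids) == "A"
               for p in test_points) / len(test_points)
```

Because the held-out subclass lies much closer to its siblings in class "A" than to anything in class "B", its points should be assigned to the correct class even though none of them appeared in training. Moving the held-out centre further from its siblings makes the task harder, which is the kind of knob such a dataset provides.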
These findings have significant implications for our understanding of deep learning models and their ability to generalise. They suggest that delayed generalisation is not just a quirk of certain datasets or models, but rather a fundamental property of neural networks themselves.
The study’s results also highlight the importance of considering data distribution shifts when training models. By acknowledging these shifts and designing more robust models that can adapt to changing distributions, researchers may be able to improve the performance of deep learning algorithms in real-world scenarios.
Furthermore, the discovery of grokking has sparked new avenues of research into the underlying mechanisms of neural networks. By delving deeper into the intricacies of delayed generalisation, scientists may uncover novel insights into how these models learn and adapt.
The findings of this study have far-reaching implications for fields such as computer vision, natural language processing, and robotics, where accurate generalisation is crucial. As researchers continue to unravel the mysteries of grokking, they are one step closer to developing more reliable and effective AI systems.
Cite this article: “Grokking: A New Understanding of Delayed Generalization in Deep Learning”, The Science Archive, 2025.
Deep Learning, Generalisation, Grokking, Delayed Generalisation, Data Distribution, Neural Networks, Machine Learning, Artificial Intelligence, Computer Vision, Natural Language Processing, Robotics







