GROKFAST: A Machine Learning Approach that Accelerates Grokking by Amplifying Slow Gradients
Grokking is a recently observed phenomenon in which a model begins to generalize well long after it has overfitted to the training data. It was first seen in a two-layer Transformer trained on a simple dataset. In grokking, generalization occurs only after many more training iterations than overfitting requires. This demands substantial computational resources, making it…