Neel Nanda argues that the lottery ticket hypothesis may explain why neural networks form sophisticated circuits.
The key insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training, making both training and inference more efficient.
At initialization, the neurons in the subcircuits they're finding [in the multi-prize lottery ticket hypothesis paper] would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected.
do they provide evidence for the lottery ticket conjecture as well?
if the approach of the original LTH paper (first train the dense network, then choose the winning ticket and rewind the weights) and the approach of most later papers (use supermasks to find the winning ticket without training the original network at all) were found to produce almost identical subnetworks, then that would constitute very strong evidence for the conjecture.
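To illustrate the supermask idea - finding a performant subnetwork without ever training the underlying weights - here is a toy sketch of my own. Random search over binary masks stands in for the learned-mask optimization the actual papers use; the data and network are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: labels come from a sparse linear rule.
X = rng.normal(size=(64, 8))
y = (X @ np.array([1.0, -1.0, 1.0, -1.0, 0, 0, 0, 0]) > 0).astype(float)

W = rng.normal(size=(8,))  # frozen initial weights, never trained

def accuracy(mask):
    # Forward pass of the masked network: weights stay at initialization,
    # only the binary mask chooses which connections are active.
    preds = (X @ (W * mask) > 0).astype(float)
    return (preds == y).mean()

# Stand-in for supermask training: random search over binary masks.
best_mask, best_acc = None, 0.0
for _ in range(2000):
    mask = rng.integers(0, 2, size=8)
    acc = accuracy(mask)
    if acc > best_acc:
        best_mask, best_acc = mask, acc

print(best_acc)  # typically well above chance, though W was never trained
```

The point of the sketch is only that a subnetwork of frozen random weights can already classify reasonably well; the real supermask papers optimize the mask with SGD and straight-through estimators rather than random search.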
That space, he argues, is the parameter tangent space of the initial network.
The linear approximation of the right-hand side is then $f(x, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x, \theta_0)$. The set of functions reachable by this approximation as $\Delta\theta$ varies is the parameter tangent space.
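To make the linearization concrete, here is a small numerical sketch of my own, using a toy two-parameter "network" and numerical gradients, showing that $f(x, \theta_0 + \Delta\theta) \approx f(x, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x, \theta_0)$ for small $\Delta\theta$:

```python
import numpy as np

def f(x, theta):
    # Tiny "network": a single tanh unit with two parameters.
    w, b = theta
    return np.tanh(w * x + b)

def grad_theta(x, theta, eps=1e-6):
    # Numerical gradient of f with respect to the parameters.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (f(x, tp) - f(x, tm)) / (2 * eps)
    return g

theta0 = np.array([0.5, -0.2])
delta = np.array([0.01, 0.02])  # a small parameter update
x = 1.3

exact = f(x, theta0 + delta)
# First-order approximation: f(x, theta0) + delta . df/dtheta(x, theta0)
linear = f(x, theta0) + delta @ grad_theta(x, theta0)
print(abs(exact - linear))  # small: the tangent-space approximation is good
```

For larger $\Delta\theta$ the approximation degrades, which is why the tangent-space picture is specifically a claim about the lazy, small-update training regime.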
At initialization, we randomly choose $\theta_0$, which determines the parameter tangent space, i.e. the set of lottery tickets. SGD throws out all lottery tickets that don't perfectly match the data. Out of the multiple remaining lottery tickets that do match the data, SGD just picks one at random.
the following constitutes a more useful mental model than the original lottery ticket hypothesis:
His parameter tangent space version of the hypothesis is mainly based on Mingard et al.'s (2020) finding that the generalization performance of overparameterized neural nets can mostly be explained by the Bayesian models that these networks approximate.
By mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket could be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family
First, they assume that the architectures across which a single winning ticket is transferred come from the same family, such as ResNets. Second, under their current approach, an elastic winning ticket can scale only along the depth dimension.
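The depth-stretching operation can be sketched abstractly. This is my own illustration: a winning ticket is represented as a list of per-layer masks, and the round-robin replication of interior layers is an assumed mapping, not necessarily the paper's exact scheme:

```python
def stretch_ticket(layer_masks, target_depth):
    # Keep the first and last layer masks fixed and replicate the
    # interior masks round-robin until the target depth is reached.
    first, *middle, last = layer_masks
    needed = target_depth - 2
    stretched = [middle[i % len(middle)] for i in range(needed)]
    return [first] + stretched + [last]

ticket = ["m0", "m1", "m2", "m3"]   # per-layer masks of a 4-layer ticket
deeper = stretch_ticket(ticket, 6)  # stretch to a 6-layer net
print(deeper)  # ['m0', 'm1', 'm2', 'm1', 'm2', 'm3']
```

Squeezing into a shallower network would be the inverse operation (dropping interior masks), and both only make sense when the layers being replicated or dropped have identical shapes - hence the same-family restriction.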
the question of what exactly makes lottery tickets special.
However, the specific values of the pruned network's weights may not be that important either - as Zhou et al. showed, the signs of the weights matter more than their exact magnitudes.
Wentworth's position that pruning is just a weird way of doing optimization and changes the functional behavior of the network nodes seems pretty plausible
interplay with phenomena like grokking and double descent seems valuable.
comparing the winning tickets generated through iterative magnitude-based pruning with those generated using at-initialization pruning methods might shed light on whether the lottery ticket conjecture is true.
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
1. Train the full dense network on some classification task.
2. Prune out some fraction of the weights with the smallest magnitude.
3. Reinitialize the remaining weights to their original values.
4. Repeat the same procedure a number of times.
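The iterative magnitude pruning loop just described can be sketched end-to-end on a toy problem. This is my own illustration: logistic regression stands in for the dense network and plain gradient descent for the training step; the data, pruning rate, and round count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: labels generated by a sparse linear rule.
X = rng.normal(size=(200, 20))
true_w = np.concatenate([rng.normal(size=5), np.zeros(15)])
y = (X @ true_w > 0).astype(float)

w_init = rng.normal(scale=0.1, size=20)  # saved so we can rewind later
mask = np.ones(20)

def train(w, mask, steps=300, lr=0.5):
    # Gradient descent on logistic loss, restricted to unmasked weights.
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ (w * mask)))
        grad = X.T @ (p - y) / len(y)
        w = w - lr * grad * mask
    return w

# Iterative magnitude pruning: train, prune 20% of the surviving weights
# by magnitude, rewind the survivors to their initial values, repeat.
for _ in range(4):
    w = train(w_init.copy(), mask)
    alive = np.flatnonzero(mask)
    k = max(1, int(0.2 * len(alive)))
    prune = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[prune] = 0.0
    # "Rewind": the next round restarts from w_init, keeping only the mask.

print(int(mask.sum()))  # number of surviving weights after 4 rounds
```

The surviving mask plus the rewound initial weights together form the candidate winning ticket; in the original paper the final step is to train exactly this sparse subnetwork from its initialization and compare its accuracy to the dense network's.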
We extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket.