Simplicity-Bias Research Project

active

What does 'simplicity' mean for neural networks? At least two things: how complex a property is to represent in an input tensor (a priori), and how complex an internal mechanism must be to detect and exploit it for reward (a posteriori). The project aims to formalize the first with descriptive complexity theory and the second with singular learning theory, and to study how the two interact, with simplicity bias and goal misgeneralization as the motivating phenomena.

what

When several features predict the training signal equally well, networks tend to rely on whichever is “easier” and stay invariant to the rest. This is benign when the easy feature is the right one and dangerous when it’s a proxy — a plausible mechanism behind goal misgeneralization, where a policy keeps its capabilities out-of-distribution but pursues the wrong objective.

The word doing all the work is “easy”, and it hides at least two distinct things:

A priori: representational complexity. How hard is a property to express over the data structure we feed a network, i.e. a tensor? How many entries does it involve, how freely can their positions vary, how much can their values vary? This is a property of the feature together with its representation, before any learning happens.
A posteriori: mechanism complexity. How hard is it for the model to build an internal mechanism that detects the feature and acts on it to obtain reward? This is a property of the function the network must implement and of the optimization that has to find it.
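The three a priori questions (how many entries, how freely positioned, how variable in value) can be turned into crude numbers. The sketch below is only illustrative: the function name a_priori_descriptors, the boolean-mask encoding of "which entries carry the feature", and the use of "number of distinct placements" as a proxy for positional freedom are all assumptions made for this example, not the project's actual measure.

```python
import numpy as np

def a_priori_descriptors(masks, inputs):
    """Crude, hypothetical a priori descriptors of a feature.

    masks  : bool array (n_samples, ...), True where the feature lives.
    inputs : float array of the same shape, the tensors fed to the network.
    Assumes every sample marks at least one entry.
    """
    masks = np.asarray(masks, dtype=bool)
    inputs = np.asarray(inputs, dtype=float)

    # How many entries does the feature involve, on average per sample?
    entry_count = masks.reshape(len(masks), -1).sum(axis=1).mean()

    # How freely is it positioned? Count distinct mask layouts across samples
    # (1 means the feature always sits in the same place).
    n_placements = len({m.tobytes() for m in masks})

    # How much do the values on the feature's entries vary?
    values = inputs[masks]
    value_range = values.max() - values.min()

    return {
        "entry_count": float(entry_count),
        "n_placements": n_placements,
        "value_range": float(value_range),
    }
```

A feature pinned to one entry gives entry_count 1 and n_placements 1, while the same single-entry feature allowed to move gives a larger n_placements at the same entry count, which is the kind of gap the a priori side is meant to quantify more rigorously.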

Both matter, and they interact: the feature has to be expressible in the input, and the network has to assemble machinery that detects it and acts on it. The interesting object is the translation between the two, which is largely what architecture does. A convolutional net makes “shift-invariant local pattern” cheap a posteriori precisely because it bakes in an a priori assumption about input structure: weight sharing absorbs some of the positional cost the same feature would carry for an MLP.
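The convolution-versus-MLP point can be seen in a few lines of NumPy. This is a hand-built toy, not a trained network: PATTERN, make_input, conv_detector, and dense_detector are hypothetical names, and the "dense" detector is just a single template fixed at one position, standing in for what one MLP unit would have to memorize.

```python
import numpy as np

PATTERN = np.array([1.0, -1.0, 1.0])  # the local feature to detect

def make_input(position, length=16):
    """1-D input that is zero except for PATTERN starting at `position`."""
    x = np.zeros(length)
    x[position:position + len(PATTERN)] = PATTERN
    return x

def conv_detector(x):
    """Shared 3-weight filter slid over all positions, then global max pooling."""
    return float(np.correlate(x, PATTERN, mode="valid").max())

def dense_detector(x, trained_position=2):
    """One dense unit whose weights are a template at a single fixed position."""
    w = make_input(trained_position, length=len(x))
    return float(w @ x)

x_seen = make_input(2)     # pattern where the dense unit expects it
x_shifted = make_input(9)  # same pattern, moved

print(conv_detector(x_seen), conv_detector(x_shifted))    # 3.0 3.0
print(dense_detector(x_seen), dense_detector(x_shifted))  # 3.0 0.0
```

The convolutional detector pays for the pattern once (three weights) and gets every position for free; the dense unit would need a separate template per position to match it, which is one concrete reading of the positional cost an architecture can absorb.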

approach

For the a priori side I want to use descriptive complexity theory: represent a feature as a predicate over a tensor in a suitable logical language, and measure its complexity by the logical resources the defining formula needs (fragment, quantifier rank, first- vs. second-order, fixed-point operators). The honest difficulty is that classical descriptive complexity concerns decision problems on finite relational structures, whereas here the objects are predicates over real-valued tensors, so the framework needs adapting. That adaptation is part of the work, and my background in mathematical logic is what makes it tractable.
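To make "logical resources" concrete, here is a small illustration in an assumed, not yet formalized, first-order language with index variables, a tensor-lookup term T[...], a threshold constant θ, and a successor term i+1 on indices. The formula names and the language itself are hypothetical; only the notion of quantifier rank (maximum nesting depth of quantifiers) is standard.

```latex
\begin{align*}
% Fixed-position feature: constants i_0, j_0 name a single entry.
\varphi_{\mathrm{fixed}}(T) &:= T[i_0, j_0] > \theta
  && \text{(quantifier rank 0)} \\
% Local pattern in a 1-D tensor that may start at any index.
\varphi_{\mathrm{edge}}(T) &:= \exists i\, \bigl( T[i] > \theta \,\wedge\, T[i+1] \le \theta \bigr)
  && \text{(quantifier rank 1)} \\
% Position-free feature in a 2-D tensor: the entry may sit anywhere.
\varphi_{\mathrm{any}}(T) &:= \exists i\, \exists j\;\, T[i, j] > \theta
  && \text{(quantifier rank 2)}
\end{align*}
```

Under this reading, freeing a feature's position costs quantifiers, while pinning it costs named constants in the signature. Which of those resources should count, and how to treat the real-valued comparison with θ, is exactly the part of the framework that still needs adapting.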

For the a posteriori side, singular learning theory seems the right language. The cost of a mechanism is captured by how much volume the corresponding solutions occupy in parameter space, as measured by the real log canonical threshold (RLCT) or its local counterpart, the local learning coefficient (LLC). More degenerate solution sets, those with a lower learning coefficient λ, are favored by the Bayesian posterior (and approximately by SGD) because their posterior mass shrinks more slowly as data grows: at equal loss, mass scales like n^(-λ), so the lower-λ solution gains by a factor polynomial in the sample size n. This connects simplicity-as-fewer-effective-parameters to the Solomonoff / algorithmic view that simpler, more general solutions are privileged because there are more of them (the SLT→AIT bridge).
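For reference, the volume statement can be written out using the standard asymptotics from Watanabe's theory. The notation is an assumption for this note: K(w) is the KL divergence from the true distribution to the model at parameter w, w_0 is an optimal parameter, L_n is the empirical negative log-likelihood per sample, λ is the learning coefficient (RLCT), and m is its multiplicity.

```latex
\begin{align*}
% Volume of near-optimal parameters shrinks like a power of the tolerance.
V(\epsilon) &= \operatorname{vol}\{\, w : K(w) < \epsilon \,\}
  \;\sim\; c\,\epsilon^{\lambda}\,(-\log \epsilon)^{m-1}
  && (\epsilon \to 0) \\
% Bayesian free energy (negative log marginal likelihood).
F_n &= n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1) \\
% Posterior odds of two solution regions with equal loss, at leading order.
\frac{p(\text{region } 1 \mid D_n)}{p(\text{region } 2 \mid D_n)}
  &\approx \text{const} \cdot n^{\lambda_2 - \lambda_1}
\end{align*}
```

The last line ignores the log log n and constant terms. It says that when two mechanisms fit the data equally well, the one with the smaller λ takes over the posterior at a rate that is a power of n, with exponent equal to the gap in learning coefficients.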

The empirical hook is a controlled feature-competition setup: a classifier with two features F_s and F_c of matched predictivity, where I can compute or estimate the a priori complexity gap from the descriptive-complexity side and measure reliance and basin geometry (e.g. local-learning-coefficient estimation) on the a posteriori side, across architectures that translate the gap differently.
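A minimal sketch of what such a feature-competition dataset could look like, under stated assumptions: I read F_s as a "simple" feature (the label's sign written at a fixed entry) and F_c as a "complex" one (the same sign written at a random entry among low-amplitude noise). That reading, the 1-D inputs, and every name below (make_competition_data, read_simple, read_complex, reliance) are hypothetical choices for illustration, not the project's actual setup.

```python
import numpy as np

def make_competition_data(n, length=16, correlated=True, seed=0):
    """Inputs carrying two label-bearing features.

    correlated=True : both features agree with the label (training regime,
                      matched predictivity).
    correlated=False: F_c gets an independent label (probe regime).
    """
    rng = np.random.default_rng(seed)
    y_s = rng.integers(0, 2, size=n)
    y_c = y_s.copy() if correlated else rng.integers(0, 2, size=n)

    x = rng.normal(0.0, 0.1, size=(n, length))  # background noise
    x[:, 0] = 2.0 * y_s - 1.0                   # F_s: sign at fixed entry 0
    pos = rng.integers(1, length, size=n)       # F_c: random entry in 1..length-1
    x[np.arange(n), pos] = 2.0 * y_c - 1.0      # F_c: sign at that entry
    return x, y_s, y_c

def read_simple(x):
    """Reference readout that uses only F_s."""
    return (x[:, 0] > 0).astype(int)

def read_complex(x):
    """Reference readout that uses only F_c: find the large entry, read its sign."""
    pos = 1 + np.abs(x[:, 1:]).argmax(axis=1)
    return (x[np.arange(len(x)), pos] > 0).astype(int)

def reliance(pred, y_s, y_c):
    """Agreement with each feature's label on examples where the two conflict."""
    conflict = y_s != y_c
    return (float((pred[conflict] == y_s[conflict]).mean()),
            float((pred[conflict] == y_c[conflict]).mean()))

x_probe, y_s, y_c = make_competition_data(200, correlated=False)
print(reliance(read_simple(x_probe), y_s, y_c))   # (1.0, 0.0)
print(reliance(read_complex(x_probe), y_s, y_c))  # (0.0, 1.0)
```

A trained classifier's predictions would be passed to reliance in place of the two reference readouts; where it lands between (1.0, 0.0) and (0.0, 1.0) is the reliance measurement, to be compared against the a priori gap and against local-learning-coefficient estimates across architectures.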

current status

Early and largely conceptual. There’s a small toy experiment probing the simplest a priori criteria (entry count, positional fixedness, value range); the next steps are formalizing the tensorial logical language properly and connecting its complexity measure to measured solution-space geometry. Open questions: which representational axes the dynamics are actually sensitive to; how cleanly descriptive complexity in an architecture’s “native vocabulary” predicts basin degeneracy; and whether the picture transfers from supervised classification to a reward setting that bears directly on goal misgeneralization.

references

Framing

Hubinger et al., Risks from Learned Optimization in Advanced ML Systems, 2019.
Langosco et al., Goal Misgeneralization in Deep Reinforcement Learning, ICML 2022.
Shah et al., The Pitfalls of Simplicity Bias in Neural Networks, NeurIPS 2020.
Geirhos et al., Shortcut Learning in Deep Neural Networks, Nat. Mach. Intell. 2020.

Representational complexity (a priori)

Immerman, Descriptive Complexity, Springer, 1999.
Libkin, Elements of Finite Model Theory, Springer, 2004.

Mechanism complexity (a posteriori)

Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge, 2009.
Lau, Murfet, Wei, Chen, Quantifying degeneracy in singular models via the learning coefficient, 2023.
Valle-Pérez, Camargo, Louis, Deep Learning Generalises Because the Parameter-Function Map is Biased Towards Simple Functions, ICLR 2019.
Bushnaq, From SLT to AIT: NN generalisation out-of-distribution, LessWrong, 2024.