Why does deep and cheap learning work so well?

Source: Lin et al., 2017

Summary

  • How do properties frequently encountered in physics (e.g. symmetry, locality, compositionality) translate into simple neural networks?
  • Show that when data is generated by a hierarchical (compositional) process, a deep neural network can represent it far more efficiently than a shallow one
    • Prove various “no-flattening theorems” quantifying the cost of flattening a deep network into a shallow one
  • Links: [ website ] [ pdf ]

Background

  • While universal approximation theorems show that neural networks can approximate a very broad class of functions arbitrarily well, they do little to explain the empirical efficiency of deep networks
  • Factors that affect quality of neural networks:
    • Expressibility: What class of functions can be expressed?
    • Efficiency: How many resources are required to approximate a given function?
    • Learnability: How rapidly can the network learn good parameters?
  • Focuses on expressibility and efficiency by addressing:

    How can neural networks approximate functions well in practice, when the set of possible functions is exponentially larger than the set of practically possible networks?

    • The fact that networks of feasible size are useful implies the class of functions we care about is dramatically smaller than all possible functions
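
A back-of-the-envelope count makes this gap concrete; the image size, bit depth, and parameter budget below are illustrative choices of mine, not numbers from the paper:

```python
import math

# Number of distinct inputs: 1000x1000 greyscale images with 256 levels per pixel
pixels, levels = 1_000_000, 256
log2_inputs = pixels * math.log2(levels)   # 8e6 bits to specify one image

# A binary classifier assigns one of 2 labels to each of levels**pixels inputs,
# so log2(#functions) = levels**pixels = 2**log2_inputs -- astronomically large.
log2_log2_functions = log2_inputs          # i.e. log2(log2(#functions)) = 8e6

# A network with 1e9 parameters stored in 32 bits each has at most 2**(3.2e10)
# distinguishable configurations.
log2_networks = 1e9 * 32

print(f"log2(log2(#functions)) ~ {log2_log2_functions:.1e}")   # ~8.0e+06
print(f"log2(#networks)        ~ {log2_networks:.1e}")         # ~3.2e+10
# Even the *logarithm* of the number of functions, 2**8e6, dwarfs 3.2e10.
```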

Methods

  • Interested in probability distribution $p(\textbf{x}|y)$, where $\textbf{x}$ might be the pixels in an image and $y$ the object label (e.g. cat)
  • Probabilities and Hamiltonians
    • Hamiltonian: $H_y(\mathbf{x}) \equiv -\ln p(\mathbf{x}|y)$, with $\mu_y \equiv -\ln p(y)$ for the prior
    • Applying Bayes’ theorem (see paper), the vector of posteriors is $\mathbf{p}(\mathbf{x}) = \frac{1}{N(\mathbf{x})}e^{-\left[\mathbf{H}(\mathbf{x})+\boldsymbol{\mu}\right]}$, where $N(\mathbf{x})$ normalizes over classes
    • Investigate how well $\mathbf{p}(\mathbf{x})$ can be approximated by a neural network
      • $\mathbf{p}(\mathbf{x})$ can be obtained by applying a softmax layer to $-\left[\mathbf{H}(\mathbf{x})+\boldsymbol{\mu}\right]$, so the question becomes how well a network can approximate $\mathbf{H}(\mathbf{x})$
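
As a sanity check on this identity, here is a minimal NumPy sketch (the two-class toy likelihoods and priors are made up for illustration, not taken from the paper) showing that a softmax applied to $-\left[\mathbf{H}(\mathbf{x})+\boldsymbol{\mu}\right]$ reproduces the Bayes posterior $p(y|\mathbf{x})$:

```python
import numpy as np

# Toy setup: two classes y, one observed x, with assumed likelihoods and priors
likelihood = np.array([0.02, 0.08])   # p(x|y) for y = 0, 1 (illustrative numbers)
prior      = np.array([0.7, 0.3])     # p(y)

# Hamiltonian H_y(x) = -ln p(x|y) and mu_y = -ln p(y)
H  = -np.log(likelihood)
mu = -np.log(prior)

def softmax(z):
    z = z - z.max()                   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_via_hamiltonian = softmax(-(H + mu))                         # softmax route
p_via_bayes = likelihood * prior / np.sum(likelihood * prior)  # Bayes directly

print(p_via_hamiltonian)   # [0.368... 0.631...]
print(p_via_bayes)         # identical, up to floating point
```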

Results

  • Nature favors Hamiltonians that are low-order, sparse, symmetric polynomials – which require few continuous parameters to represent
    • Polynomials can be built from additions and multiplications, and a single multiplication can be approximated with just four neurons (see the sketch after this list)
    • Low polynomial order: the Hamiltonians of physics typically have degree ranging from 2 to 4
    • Locality: things are only affected by what’s in their immediate vicinity, which makes the Hamiltonian sparse (few interaction terms)
    • Symmetry: invariance under some transformation (e.g. translation or rotation), which leaves fewer independent parameters
  • Why deep?
    • No-flattening theorems: when a neural network mirrors a hierarchical (compositional) generative process, it cannot be flattened into a shallow network without a substantial loss of efficiency (e.g. exponentially more neurons or far more parameters)
      • E.g. a product of structured matrices is often much cheaper to keep factored than to multiply out into a single dense matrix (see the parameter count after this list)
    • The output of a generative process can contain far more information than the description of the process itself
      • E.g. the instruction “generate a random string of $10^9$ bits” is short, yet its output carries $10^9$ bits of entropy
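
To make the four-neuron claim above concrete, here is a minimal NumPy sketch of the multiplication gadget; softplus is my choice of smooth nonlinearity (the construction only needs $\sigma''(0) \neq 0$, which rules out e.g. the logistic sigmoid, whose second derivative vanishes at 0):

```python
import numpy as np

def sigma(x):
    """Softplus, a smooth nonlinearity with sigma''(0) = 0.25 != 0."""
    return np.log1p(np.exp(x))

SIGMA_PP0 = 0.25  # second derivative of softplus at 0

def approx_mul(u, v, lam=100.0):
    """Approximate u*v with four sigma-neurons.

    Taylor-expanding sigma around 0 gives
        sigma(a+b) + sigma(-a-b) - sigma(a-b) - sigma(-a+b) ~ 4*sigma''(0)*a*b
    for small a, b. Feeding in a = u/lam, b = v/lam and rescaling the output
    by lam**2 makes the error vanish as lam grows.
    """
    a, b = u / lam, v / lam
    s = sigma(a + b) + sigma(-a - b) - sigma(a - b) - sigma(-a + b)
    return lam**2 * s / (4 * SIGMA_PP0)

print(approx_mul(3.0, -7.0))   # ~ -21.0
print(3.0 * -7.0)              # exact: -21.0
```

Each $\sigma$ evaluation is one hidden neuron, so a single multiplication costs four neurons, and low-order polynomials follow by wiring such gates together with additions.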
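
Similarly, a tiny parameter count illustrates the cost of flattening in the purely linear case; the sizes and the rank-$k$ structure are my own example in the spirit of the paper's linear no-flattening discussion, not its proof:

```python
import numpy as np

# An n x n linear map with low-rank (rank-k) structure, kept as two layers
n, k = 1024, 16
B = np.random.randn(k, n)              # first layer:  R^n -> R^k
A = np.random.randn(n, k)              # second layer: R^k -> R^n

factored_params  = A.size + B.size     # 2*n*k = 32,768 parameters when kept deep
flattened        = A @ B               # the equivalent single-layer ("flattened") map
flattened_params = flattened.size      # n*n = 1,048,576 parameters

print(flattened_params / factored_params)   # 32.0: same map, 32x more parameters flattened
```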

Conclusion

  • Success of reasonably sized neural networks hinges on symmetry, locality, and polynomial log-probability in data from the natural world