Ideas:
Kernel methods run out of memory on large datasets, while NNs are compact function classes; the two offer a trade-off between storage and training-time computation.
Exploit the traits of both methods and combine them for nonparametric statistical tests, generative models, message passing, bandit algorithms and other things that need good statistical analysis and flexible models.
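A rough way to see the storage side of this trade-off is to compare the size of a dense kernel (Gram) matrix, which grows quadratically with the number of samples, against the fixed parameter count of a small fully connected network. This is only a back-of-the-envelope sketch: the function names, the 8-byte floats and the layer sizes are assumptions made for illustration, and it ignores low-rank or sparse kernel approximations.

```python
from typing import Sequence


def gram_matrix_bytes(n_samples: int, bytes_per_float: int = 8) -> int:
    """Memory needed to store a dense n x n kernel (Gram) matrix."""
    return n_samples * n_samples * bytes_per_float


def mlp_param_bytes(layer_sizes: Sequence[int], bytes_per_float: int = 8) -> int:
    """Memory needed for the weights and biases of a fully connected net."""
    n_params = sum(
        fan_in * fan_out + fan_out  # weight matrix plus bias vector
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )
    return n_params * bytes_per_float


# Hypothetical architecture, chosen only for illustration.
layers = [100, 512, 512, 1]
for n in (1_000, 100_000, 1_000_000):
    kernel_gb = gram_matrix_bytes(n) / 1e9
    mlp_gb = mlp_param_bytes(layers) / 1e9
    print(f"n={n}: Gram matrix {kernel_gb:.3f} GB, MLP {mlp_gb:.6f} GB")
```

The network's footprint does not depend on `n`, but it pays for that compactness with iterative training.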
A problem that still remains to be solved: how can model decompositions be incorporated efficiently into deep learning?
deep learning + spectral methods ==> How to combine them?
This can be done, e.g., using some of the objective functions from graphical models, such as Conditional Random Fields, structured losses, or anything similar.
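One way to read the CRF suggestion is to let a network produce per-position label scores and train it with the CRF negative log-likelihood. Below is a minimal sketch in plain Python, assuming a linear-chain CRF. The names `emissions` (a T x K table of scores, assumed here to come from some network) and `transitions` (a K x K table) are illustrative and not from these notes.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, labels):
    """Unnormalized log-score of one label sequence."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def log_partition(emissions, transitions):
    """log Z: sum over all label sequences, via the forward algorithm."""
    n_labels = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + logsumexp([alpha[i] + transitions[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, labels):
    """Structured loss: negative log-probability of the gold sequence."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, labels
    )


# Toy scores for T=3 positions and K=2 labels (made-up numbers).
em = [[0.5, -1.0], [0.2, 0.3], [-0.7, 1.1]]
tr = [[0.1, -0.4], [0.6, 0.0]]
print(crf_nll(em, tr, [0, 1, 1]))

# Sanity check: probabilities of all 2**3 label sequences sum to one.
total = sum(math.exp(-crf_nll(em, tr, list(y))) for y in product(range(2), repeat=3))
print(total)
```

In a real model the loss would be differentiated with respect to both tables, so the graphical-model objective trains the network that produces the scores.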
Differences between graphical models and deep learning:
- graphical models are good if you've got a lot of variables and want to know how they depend on each other. They explain a lot about clustering, topic models, Bayesian nonparametrics, causality and message passing
- deep learning is about understanding how to use deep networks efficiently and what their limitations are. Statistical learning theory is needed here to prove theorems about whether your algorithm works or not. You want a guarantee that what you're doing won't go wrong, but you don't really want to use the theorems for parameter tuning.
LSTMs are latent-variable autoregressive models with some fine-tuning to deal with vanishing gradients
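To make that reading concrete, here is a minimal sketch of a scalar LSTM cell (hidden size 1) in plain Python. The hidden state `h` and cell state `c` play the role of the latent variables, the prediction is fed back as the next input (the autoregressive part), and the additive cell update is the piece usually credited with easing vanishing gradients. The parameter layout, gate names and the linear read-out `w_out`, `b_out` are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (w_x, w_h, b); the names are illustrative.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate cell content
    c = f * c_prev + i * g           # additive update: dc/dc_prev = f
    h = o * math.tanh(c)
    return h, c


def rollout(x0, steps, params, w_out=1.0, b_out=0.0):
    """Autoregressive rollout: each prediction becomes the next input."""
    h, c, x = 0.0, 0.0, x0
    outputs = []
    for _ in range(steps):
        h, c = lstm_step(x, h, c, params)
        x = w_out * h + b_out  # the past enters only through (h, c)
        outputs.append(x)
    return outputs


# Made-up parameters, only to show the call pattern.
params = {
    "forget": (0.0, 0.0, 2.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
print(rollout(0.5, 4, params))
```

When the forget gate stays close to 1, the gradient through `c` is carried across steps almost unchanged, which is the sense in which the gating deals with vanishing gradients.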
Adversarial environments are hard to handle
The Master Algorithm by P. Domingos:
At the time of writing, five different tribes (schools or paradigms) have been identified within the machine learning community, each distinguished by its preferred methodology or algorithm.
How do computers discover new knowledge?
- Fill in gaps in existing knowledge
- Emulate the brain
- Simulate evolution
- Systematically reduce uncertainty
- Notice similarities between old and new
Tribe | Origins | Master Algorithm | People
---|---|---|---
Symbolists | Logic, philosophy | Inverse deduction | Tom Mitchell, Steve Muggleton, Ross Quinlan
Connectionists | Neuroscience | Backpropagation | Yann LeCun, Geoffrey Hinton, Yoshua Bengio
Evolutionaries | Evolutionary biology | Genetic programming | John Koza, John Holland, Hod Lipson
Bayesians | Statistics | Probabilistic inference | David Heckerman, Judea Pearl, Michael Jordan
Analogizers | Psychology | Kernel machines | Peter Hart, Vladimir Vapnik, Douglas Hofstadter
Putting pieces together:
- Representation
  - Probabilistic logic (e.g. Markov logic networks)
  - Weighted formulas -> Distribution over states
- Evaluation
  - Posterior probability
  - User-defined objective function
- Optimization
  - Formula discovery: Genetic programming
  - Weight learning: Backpropagation
- Towards a universal learner
- New ideas and tribes are needed ==> ?
Grand unifying theory => unify all 5 learning tribes
Unifying representations => start from symbolists and Bayesians (logic and graphical models); this has been done => Markov Logic Networks
- Start with a FOL (first-order logic) rule if…then…
- Give each rule a weight depending on how strongly you believe it
- Evaluation function: find the candidate in the hypothesis space that maximizes or minimizes the evaluation function. In this case that's just the posterior that Bayesians use. It shouldn't be part of the algorithm; the objective function to optimize should be provided by the user.
- Optimization: how do we find the model that optimizes the evaluation function? Once you have your formulas, you have to come up with weights for them, i.e. backprop.
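The step from weighted formulas to a distribution over states can be sketched on a tiny example. In a Markov logic network, a world's probability is proportional to the exponential of the summed weights of the ground formulas it satisfies. The sketch below enumerates all worlds over two ground atoms; the atoms `Smokes(A)` and `Cancer(A)`, the single rule and its weight 1.5 are assumptions for illustration, and brute-force enumeration only works at toy scale.

```python
import math
from itertools import product


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


# Ground atoms and weighted ground formulas (all values are illustrative).
ATOMS = ["Smokes(A)", "Cancer(A)"]
FORMULAS = [
    # "if Smokes(A) then Cancer(A)", believed with weight 1.5
    (1.5, lambda world: implies(world["Smokes(A)"], world["Cancer(A)"])),
]


def world_distribution(atoms, formulas):
    """Return [(world, probability)] with P(world) ∝ exp(sum of satisfied weights)."""
    worlds = [
        dict(zip(atoms, values))
        for values in product([False, True], repeat=len(atoms))
    ]
    scores = [
        math.exp(sum(weight for weight, holds in formulas if holds(world)))
        for world in worlds
    ]
    z = sum(scores)  # partition function
    return [(world, score / z) for world, score in zip(worlds, scores)]


for world, prob in world_distribution(ATOMS, FORMULAS):
    print(world, round(prob, 4))
```

The world that violates the rule (smokes, no cancer) is not impossible, only less likely; raising the weight pushes its probability toward zero, which recovers the hard logical rule in the limit.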
Different projects:
Project 1: Methods for Semi-supervised Learning and Active Labeling. How can we exploit unlabeled data for a supervised learning problem and how can we identify the most informative subset of examples to be annotated by an expert?
Project 2: Methods for Robust Feature Learning. How can we learn robust features that remain maximally predictive even if the distribution of test data is very different from the distribution of training data?
Project 3: Calibrated Uncertainty Estimation. How can we provide reliable confidence intervals for deep neural network predictions?