Update on 7/16: This course is really hard! It is also poorly organized in places; for example, one video is missing in week 13, and the last two questions in the quiz are themselves wrong. All in all, though, it is still an in-depth course.

These are my notes for the course Neural Networks for Machine Learning from the University of Toronto.

I found Prof. Geoffrey Hinton’s British English a little hard to follow, but he clearly has deep insight into neural networks. The content is of really high quality and helped me a lot in understanding neural networks thoroughly.

Week 1: Introduction
Week 2: The Perceptron learning procedure
Week 3: The backpropagation learning procedure
Week 4: Learning feature vectors for words
Week 5: Object recognition with neural nets.
Week 6: Optimization: How to make the learning go faster
Week 7: Recurrent neural networks
Week 8: More recurrent neural networks
Week 9
Week 10
Week 11
Week 12
Week 13
Week 14
Week 15
Week 16

Week 1: Introduction

Why do we need machine learning?

What are neural networks?

Some simple models of neurons

Linear, Binary Threshold, Logistic Sigmoid, Rectified Linear, Stochastic Binary
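
As a reminder of how these five neuron models differ, here is a minimal sketch in plain Python. The function names and the `total_input` helper are my own, not from the course; each function takes the total input z (bias plus weighted sum of inputs) and returns the neuron's output.

```python
import math
import random


def total_input(weights, inputs, bias):
    """Total input z = b + sum_i x_i * w_i, shared by all the models below."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def linear(z):
    """Linear neuron: the output is just the total input."""
    return z


def binary_threshold(z):
    """Binary threshold neuron: fires (1) if z is at or above zero, else 0."""
    return 1.0 if z >= 0 else 0.0


def logistic(z):
    """Logistic sigmoid neuron: smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rectified_linear(z):
    """Rectified linear neuron: linear above zero, 0 otherwise."""
    return max(0.0, z)


def stochastic_binary(z, rng=random):
    """Stochastic binary neuron: emits 1 with probability logistic(z)."""
    return 1.0 if rng.random() < logistic(z) else 0.0
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic neuron reproducible.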

A simple example of learning

Three types of learning

Supervised Learning
- Regression
- Classification
Unsupervised Learning
- It provides a compact, low-dimensional representation of the input
- It provides an economical high-dimensional representation of the input in terms of learned features
- It finds sensible clusters in the input
Reinforcement Learning

Week 2: The Perceptron learning procedure

Types of neural network architectures

Feed-forward neural networks
Recurrent networks
Symmetrically connected networks

Perceptrons: The first generation of neural networks

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers
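
A minimal sketch of the perceptron learning procedure, under a few assumptions of mine: targets are 0 or 1, the bias is handled as an extra weight on a constant input of 1, and the names `predict` and `train_perceptron` are made up for illustration. If the output is correct, the weights are left alone; if the unit incorrectly outputs 0, the input vector is added to the weights; if it incorrectly outputs 1, the input vector is subtracted.

```python
def predict(weights, x_with_bias):
    """Binary threshold unit: 1 if the weighted sum is >= 0, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x_with_bias))
    return 1 if z >= 0 else 0


def train_perceptron(cases, n_inputs, max_epochs=100):
    """cases is a list of (input_vector, target) pairs with target in {0, 1}."""
    weights = [0.0] * (n_inputs + 1)  # last weight plays the role of the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in cases:
            xb = list(x) + [1.0]  # append the constant bias input
            if predict(weights, xb) != target:
                mistakes += 1
                sign = 1.0 if target == 1 else -1.0
                # add the input vector on a missed 1, subtract it on a false 1
                weights = [w + sign * xi for w, xi in zip(weights, xb)]
        if mistakes == 0:  # every training case is classified correctly
            break
    return weights
```

For example, training on the four cases of logical AND (which is linearly separable) ends with a weight vector that classifies all four correctly. On data that is not linearly separable the loop simply stops after `max_epochs`.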

A geometrical view of perceptrons

Why the learning works

What perceptrons can’t do

Week 3: The backpropagation learning procedure

Learning the weights of a linear neuron

The error surface for a linear neuron

Learning the weights of a logistic output neuron

The backpropagation algorithm

Using the derivatives computed by backpropagation

A large number of different methods have been developed to reduce overfitting:
- Weight-decay
- Weight-sharing
- Early stopping
- Model averaging
- Bayesian fitting of neural nets
- Dropout
- Generative pre-training

Linear hidden units don’t add modeling capacity to the network.
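
A small numeric check of this claim, with made-up weight matrices: a layer of linear hidden units followed by a linear output computes W2(W1 x), which is the same as applying the single matrix (W2 W1) to x. So the hidden layer could be replaced by direct input-to-output weights.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical weights: 2 inputs -> 3 linear hidden units -> 1 linear output.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.5, -2.0]]
x = [2.0, -1.0]

two_layer = matvec(W2, matvec(W1, x))    # pass through the hidden layer
collapsed = matvec(matmul(W2, W1), x)    # one equivalent linear layer
```

Both `two_layer` and `collapsed` hold the same output, which is why a non-linearity in the hidden units is needed to gain modeling capacity.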

Week 4: Learning feature vectors for words

Learning to predict the next word

One Hot

Why One Hot?

A brief diversion into cognitive science

There has been a long debate in cognitive science between two rival theories of what it means to have a concept:

The feature theory: A concept is a set of semantic features.
- This is good for explaining similarities between concepts.
- It’s convenient: a concept is a vector of feature activities.
The structuralist theory: The meaning of a concept lies in its relationships to other concepts.
- So conceptual knowledge is best expressed as a relational graph.
- Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.
These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.
- In the neural network that learns family trees, no explicit inference is required to arrive at the intuitively obvious consequences of the facts that have been explicitly learned.
- The net can “intuit” the answer in a forward pass.
We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just “seeing” the answer with no conscious intervening steps.
- Even when we are using explicit rules, we need to just see which rules to apply.

Another diversion: The softmax output function

Using squared error as the cost function for a logistic unit may not be a good idea: when the output is far from the target, the derivative can be very close to zero, which makes learning very slow.

Instead, we can use the cross-entropy cost function. For a single logistic output it is \(E=-t\log(y)-(1-t)\log(1-y)\); for a softmax output layer it generalizes to \(E=-\sum_j t_j\log(y_j)\).
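
A minimal sketch of a softmax output layer with the cross-entropy cost, in plain Python. The function names are mine, and subtracting the maximum logit is a standard numerical-stability trick I added. The point is that the gradient of the cost with respect to each logit is simply y - t, so it does not vanish when the output is badly wrong.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets, probs):
    """E = -sum_j t_j * log(y_j); terms with t_j = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


def logit_gradient(targets, probs):
    """dE/dz_j = y_j - t_j for softmax combined with cross-entropy."""
    return [p - t for t, p in zip(targets, probs)]
```

With a one-hot target, the gradient for the correct class is y - 1, which stays large in magnitude as long as the predicted probability y is small.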

Neuro-probabilistic language models

Ways to deal with the large number of possible outputs

Week 5: Object recognition with neural nets.

Why object recognition is difficult

Achieving viewpoint invariance

Several different approaches to achieve viewpoint invariance:

Use redundant invariant features
Put a box around the object and use normalized pixels
Use replicated features with pooling. This is the approach taken by convolutional neural networks.
Use a hierarchy of parts that have explicit poses relative to the camera
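
To illustrate the "replicated features with pooling" idea from the list above, here is a toy one-dimensional sketch. The signal, the edge-detecting kernel, and the function names are all made up. The same feature detector is applied at every position, and max pooling then makes the result tolerant of a small shift in the input.

```python
def convolve1d(signal, kernel):
    """Apply the same feature detector (kernel) at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool(values, size):
    """Keep only the strongest response in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


kernel = [-1, 1]                       # responds to a rising edge
original = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 1, 0, 0, 0]     # same pattern moved one step right

features_original = convolve1d(original, kernel)
features_shifted = convolve1d(shifted, kernel)
pooled_original = max_pool(features_original, 4)
pooled_shifted = max_pool(features_shifted, 4)
```

The two feature maps differ because the edge response moves along with the pattern, but the pooled outputs are identical for this one-step shift. A larger shift that crosses a pooling window boundary would change the pooled output, so the invariance is only approximate.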

Convolutional nets for digit recognition

Convolutional nets for object recognition

Graded: Lecture 5 Quiz
Graded: Programming Assignment 2: Learning Word Representations

Week 6: Optimization: How to make the learning go faster

I found nothing worth taking notes on here; much of the content overlaps with the course by Stanford.

Overview of mini-batch gradient descent

A bag of tricks for mini-batch gradient descent

The momentum method

Adaptive learning rates for each connection

Rmsprop: Divide the gradient by a running average of its recent magnitude
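
For my own reference, a minimal sketch of the momentum and rmsprop updates for a single weight. The function names, the default learning rates, and the small `eps` term that guards against division by zero are assumptions of mine. Momentum accumulates a velocity from past gradients; rmsprop keeps a running average of the squared gradient and divides the gradient by its square root.

```python
import math


def momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """The velocity decays by `momentum` and is pushed along -grad."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def rmsprop_step(w, grad, mean_square, lr=0.01, decay=0.9, eps=1e-8):
    """Divide the gradient by a running average of its recent magnitude."""
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(mean_square) + eps)
    return w, mean_square


def minimize(step_fn, w=1.0, steps=500):
    """Toy usage: minimize E(w) = w**2, whose gradient is 2 * w."""
    state = 0.0  # velocity for momentum, mean square for rmsprop
    for _ in range(steps):
        w, state = step_fn(w, 2.0 * w, state)
    return w
```

Calling `minimize(momentum_step)` or `minimize(rmsprop_step, steps=2000)` drives w toward the minimum at 0. With rmsprop the step size stays close to the learning rate regardless of how large the raw gradient is.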

Week 7: Recurrent neural networks

Modeling sequences: A brief overview

Linear dynamical systems and hidden Markov models are stochastic models, whereas recurrent neural networks are deterministic.

Training RNNs with back propagation

A toy example of training an RNN

A recurrent network can emulate a finite state automaton, but it is exponentially more powerful. With N hidden neurons it has 2^N possible binary activity vectors (but only N^2 weights).

Why it is difficult to train an RNN

Four effective ways to learn an RNN

Long Short Term Memory
Hessian Free Optimization
Echo State Network
Good initialization with momentum

Long Short-Term Memory

Week 8: More recurrent neural networks

Video: Modeling character strings with multiplicative connections

Video: Learning to predict the next character using HF

Video: Echo State Networks

Week 9

Preventing overfitting

Approach 1: Get more data
- Almost always the best bet if you have enough compute power to train on more data
Approach 2: Use a model that has the right capacity
- enough to fit the true regularities
- not enough to also fit spurious regularities (if they are weaker)
Approach 3: Average many different models
- Use models with different forms
- Or train the model on different subsets of the training data (this is called “bagging”)
Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.

The capacity can be controlled in many ways

Architecture: Limit the number of hidden layers and the number of units per layer
Early Stopping: Start with small weights and stop the learning before it overfits
Weight decay: Penalize large weights using penalties or constraints on their squared values (L2 penalty) or absolute values (L1 penalty)
Noise: Add noise to the weights or the activities

Typically a combination of these methods is used.
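
As a concrete example of one of these methods, here is a minimal sketch of weight decay. The function names and the 1/2 factor in the L2 penalty are conventions I chose; `lam` is a hypothetical penalty coefficient. Adding an L2 penalty to the cost adds `lam * w` to each weight's gradient, so every update also shrinks the weight toward zero.

```python
def l2_penalty(weights, lam):
    """L2 penalty: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def sgd_step_with_l2(weights, grads, lr, lam):
    """Gradient step on (data cost + L2 penalty).

    The penalty contributes lam * w to each gradient, which pulls
    every weight toward zero in addition to following the data gradient.
    """
    return [w - lr * (g + lam * w) for w, g in zip(weights, grads)]
```

Even with a zero data gradient, `sgd_step_with_l2` moves each weight a little closer to zero, which is where the name "weight decay" comes from.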

Cross-validation: a better way to choose meta parameters

Divide the total dataset into three subsets:

Training data: is used for learning the parameters of the model.
Validation data: is not used for learning but is used to decide what settings of the meta parameters work best.
Test data: is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.

N-fold cross-validation (Easton’s note: This definition of N-fold cross-validation is different from elsewhere)

We could divide the total dataset into one final test set and N other subsets, then train on all but one of those subsets to get N different estimates of the validation error rate.
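
The splitting scheme described above can be sketched roughly as follows. The function names, the 20% test fraction, and the fixed seed are my own choices for illustration. The test set is set aside first, the rest is cut into N folds, and each fold takes one turn as the validation set.

```python
import random


def split_for_cross_validation(cases, n_folds, test_fraction=0.2, seed=0):
    """Hold out a final test set, then cut the remainder into n_folds subsets."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    rest = shuffled[n_test:]
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return test_set, folds


def train_validation_pairs(folds):
    """Yield (training, validation) pairs, holding out one fold at a time."""
    for i, validation in enumerate(folds):
        training = [case
                    for j, fold in enumerate(folds) if j != i
                    for case in fold]
        yield training, validation
```

Training once per pair gives N validation error estimates, and the test set is touched only once at the very end.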

Noise can be used as a regularizer against overfitting; it can be added to the inputs, the outputs, or the activations.

Week 10

Making models differ by changing their training data

Bagging: Train different models on different subsets of the data
Boosting: Train a sequence of low-capacity models. Weight the training cases differently for each model in the sequence.
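
A rough sketch of the bagging half of this, under assumptions of mine: each model gets a bootstrap sample (drawn with replacement, same size as the original data), and the models are combined by majority vote. The function names are made up, and a "model" here is just any callable that maps an input to a class label. Boosting is not shown.

```python
import random


def bootstrap_sample(cases, rng):
    """Draw len(cases) training cases with replacement."""
    return [rng.choice(cases) for _ in cases]


def bagged_predict(models, x):
    """Combine the models' class predictions by majority vote."""
    votes = [model(x) for model in models]
    return max(set(votes), key=votes.count)
```

Each model would be trained on its own `bootstrap_sample(cases, rng)`, so the models see different subsets of the data and make somewhat different errors, which is what makes averaging them worthwhile.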

Week 11

Hopfield nets and Boltzmann machines

Reading: Lecture Slides (and resources)
Video: Hopfield Nets
Video: Dealing with spurious minima
Video: Hopfield nets with hidden units
Video: Using stochastic units to improve search
Video: How a Boltzmann machine models data
Graded: Lecture 11 Quiz

Week 12

Restricted Boltzmann machines (RBMs)

This module deals with Boltzmann machine learning.

Reading: Lecture Slides (and resources)
Video: Boltzmann machine learning
Video: OPTIONAL VIDEO: More efficient ways to get the statistics
Video: Restricted Boltzmann Machines
Video: An example of RBM learning
Video: RBMs for collaborative filtering

Week 13

Stacking RBMs to make Deep Belief Nets

Reading: Lecture Slides (and resources)
Video: The ups and downs of back propagation
Video: Belief Nets
Video: The wake-sleep algorithm
Graded: Programming Assignment 4: Restricted Boltzmann Machines
Graded: Lecture 13 Quiz

Week 14

Deep neural nets with generative pre-training

Reading: Lecture Slides (and resources)
Video: Learning layers of features by stacking RBMs
Video: Discriminative learning for DBNs
Video: What happens during discriminative fine-tuning?
Video: Modeling real-valued data with an RBM
Video: OPTIONAL VIDEO: RBMs are infinite sigmoid belief nets

Week 15

Modeling hierarchical structure with neural nets

Reading: Lecture Slides (and resources)
Video: From PCA to autoencoders
Video: Deep auto encoders
Video: Deep auto encoders for document retrieval
Video: Semantic Hashing
Video: Learning binary codes for image retrieval
Video: Shallow autoencoders for pre-training

Week 16

Recent applications of deep neural nets

Video: OPTIONAL: Learning a joint model of images and captions
Video: OPTIONAL: Hierarchical Coordinate Frames
Video: OPTIONAL: Bayesian optimization of hyper-parameters