Update on 7/16: This course is damn hard! And poorly organized in places; for example, in week 13 one video is missing and the last two questions in the quiz are themselves wrong. But all in all it is still an in-depth course.

These are my notes for the course Neural Networks for Machine Learning by the University of Toronto.

I found Prof. Geoffrey Hinton’s British English a little hard to understand, but he clearly has deep insight into neural networks; the content is of really high quality and helped me a lot to understand neural networks thoroughly.

# Week 1: Introduction

## Some simple models of neurons

Linear, Binary Threshold, Logistic Sigmoid, Rectified Linear, Stochastic Binary
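As a quick reference, the five neuron models can be sketched as activation functions. This is my own minimal NumPy sketch; the function names are mine:

```python
import numpy as np

def linear(z):
    # Output equals the weighted input (plus bias) itself.
    return z

def binary_threshold(z):
    # Output 1 if the total input exceeds 0, else 0.
    return (z > 0).astype(float)

def logistic(z):
    # Smooth, bounded output in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def rectified_linear(z):
    # Linear above zero, zero below (ReLU).
    return np.maximum(0.0, z)

def stochastic_binary(z, rng):
    # Treat the logistic output as the probability of emitting a 1.
    return (rng.random(z.shape) < logistic(z)).astype(float)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z))           # values between 0 and 1; logistic(0) = 0.5
print(rectified_linear(z))   # [0. 0. 2.]
print(stochastic_binary(z, np.random.default_rng(0)))
```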

## Three types of learning

• Supervised Learning
  • Regression
  • Classification
• Unsupervised Learning
  • It provides a compact, low-dimensional representation of the input
  • It provides an economical high-dimensional representation of the input in terms of learned features
  • It finds sensible clusters in the input
• Reinforcement Learning

# Week 2: The Perceptron learning procedure

## Types of neural network architectures

• Feed-forward neural networks
• Recurrent networks
• Symmetrically connected networks

## Perceptrons: The first generation of neural networks

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers.
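A minimal sketch of the classic perceptron learning rule, on a toy example of my own (learning AND, which is linearly separable, so the procedure is guaranteed to converge):

```python
import numpy as np

def train_perceptron(X, t, epochs=20):
    """Perceptron learning: add the input vector to the weights when the
    unit wrongly outputs 0, subtract it when it wrongly outputs 1."""
    X = np.hstack([X, np.ones((len(X), 1))])  # append a bias input of 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = 1.0 if w @ x > 0 else 0.0
            w += (target - y) * x  # no change when the output is correct
    return w

def predict(w, x):
    return 1.0 if w @ np.append(x, 1.0) > 0 else 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)  # AND of the two inputs
w = train_perceptron(X, t)
print([predict(w, x) for x in X])  # [0.0, 0.0, 0.0, 1.0]
```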

# Week 3: The backpropagation learning procedure

## Using the derivatives computed by backpropagation

A large number of different methods have been developed:

• Weight-decay
• Weight-sharing
• Early stopping
• Model averaging
• Bayesian fitting of neural nets
• Dropout
• Generative pre-training

Linear hidden units don’t add modeling capacity to the network.
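A quick numerical check of why (my own NumPy sketch): composing two linear layers is itself a single linear map, so a linear hidden layer adds parameters but no modeling capacity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # input -> linear hidden layer
W2 = rng.standard_normal((2, 4))   # hidden -> output
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)      # two linear layers, no nonlinearity between them
shallow = (W2 @ W1) @ x   # one equivalent linear layer
print(np.allclose(deep, shallow))  # True
```

It is the nonlinearity between layers that lets extra layers compute new functions.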

# Week 4: Learning feature vectors for words

One-hot encoding

Why one-hot? Because it imposes no similarity structure between words before learning: every pair of distinct words is equally dissimilar.
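A minimal sketch of one-hot encoding for a small vocabulary (the vocabulary here is my own toy example):

```python
import numpy as np

vocab = ["cat", "dog", "fish"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("dog"))  # [0. 1. 0.]
```

The network's job in this week is then to learn dense feature vectors from these sparse inputs.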

## A brief diversion into cognitive science

There has been a long debate in cognitive science between two rival theories of what it means to have a concept:

• The feature theory: A concept is a set of semantic features.
  • This is good for explaining similarities between concepts.
  • It’s convenient: a concept is a vector of feature activities.
• The structuralist theory: The meaning of a concept lies in its relationships to other concepts.
  • So conceptual knowledge is best expressed as a relational graph.
  • Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.
• These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.
  • In the neural network that learns family trees, no explicit inference is required to arrive at the intuitively obvious consequences of the facts that have been explicitly learned.
  • The net can “intuit” the answer in a forward pass.
• We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just “seeing” the answer with no conscious intervening steps.
  • Even when we are using explicit rules, we need to just see which rules to apply.

## Another diversion: The softmax output function

Using squared error as the logistic unit’s cost function may not be a good idea, because the derivative can be very close to zero, resulting in very slow learning.

Instead, we can use the cross-entropy cost function. For a logistic output it is $$E=-t\log(y)-(1-t)\log(1-y)$$; for a softmax output it generalizes to $$E=-\sum_j t_j\log(y_j)$$.
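A minimal NumPy sketch of my own of softmax plus cross-entropy. The nice property is that the gradient of the cost with respect to the logits is simply y − t, so learning does not stall when the output is confidently wrong:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    # E = -sum_j t_j * log(y_j)
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])      # logits
t = np.array([1.0, 0.0, 0.0])      # one-hot target
y = softmax(z)
grad_z = y - t                     # gradient of E w.r.t. the logits
print(y.sum())                     # 1.0: softmax outputs are a distribution
```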

# Week 5: Object recognition with neural nets.

## Achieving viewpoint invariance

Several different approaches to achieve viewpoint invariance:

• Use redundant invariant features
• Put a box around the object and use normalized pixels
• Use replicated features with pooling. This is called a “convolutional neural network”
• Use a hierarchy of parts that have explicit poses relative to the camera

# Week 6: Optimization: How to make the learning go faster

I found nothing worth taking notes on; much of the content overlaps with the Stanford course.

# Week 7: Recurrent neural networks

## Modeling sequences: A brief overview

Linear dynamical systems and hidden Markov models are stochastic models; recurrent neural networks are deterministic.

## A toy example of training an RNN

A recurrent network can emulate a finite state automaton, but it is exponentially more powerful. With N hidden neurons it has 2^N possible binary activity vectors (but only N^2 weights).

## Why it is difficult to train an RNN

Four effective ways to learn an RNN:

• Long Short Term Memory
• Hessian Free Optimization
• Echo State Network
• Good initialization with momentum

# Week 9

Preventing overfitting

• Approach 1: Get more data
  • Almost always the best bet if you have enough compute power to train on more data
• Approach 2: Use a model that has the right capacity
  • enough to fit the true regularities
  • not enough to fit spurious regularities (if they are weaker)
• Approach 3: Average many different models
  • Use models with different forms
  • Or train the models on different subsets of the training data (this is called “bagging”)
• Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.

Capacity can be controlled in many ways:

• Architecture: Limit the number of hidden layers and the number of units per layer
• Early stopping: Start with small weights and stop the learning before it overfits
• Weight decay: Penalize large weights using penalties or constraints on their squared values (L2 penalty) or absolute values (L1 penalty)
• Noise: Add noise to the weights or the activities

Typically a combination of these methods is used.
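To make the weight-decay item concrete: an L2 penalty (λ/2)·Σw² adds λw to each weight’s gradient, pulling the weights toward zero on every step. A small sketch of my own (names and values are illustrative):

```python
import numpy as np

def sgd_step(w, grad_loss, lr=0.1, weight_decay=0.01):
    # The L2 penalty (weight_decay / 2) * sum(w**2) contributes
    # weight_decay * w to the gradient, shrinking the weights each step.
    return w - lr * (grad_loss + weight_decay * w)

w = np.array([1.0, -2.0])
for _ in range(100):
    # With a zero data gradient, decay alone multiplies each weight
    # by (1 - lr * weight_decay) = 0.999 per step.
    w = sgd_step(w, grad_loss=np.zeros_like(w))
print(w)  # each weight shrunk by a factor of 0.999**100, about 0.90
```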

Cross-validation: a better way to choose meta parameters

Divide the total dataset into three subsets:

• Training data: is used for learning the parameters of the model.
• Validation data: is not used for learning but is used to decide what settings of the meta parameters work best.
• Test data: is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.

N-fold cross-validation (Easton’s note: This definition of N-fold cross-validation is different from elsewhere)

We could divide the total dataset into one final test set and N other subsets, then train on all but one of the subsets to get N different estimates of the validation error rate.
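Under that definition (one held-out final test set, the rest split into N subsets, each serving once as validation data), the splitting could be sketched like this; the sizes and fractions are my own illustration:

```python
import numpy as np

def n_fold_splits(n_examples, n_folds, test_fraction=0.2, seed=0):
    """Hold out a final test set, then split the rest into n_folds subsets;
    each subset serves once as validation data, the others as training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_test = int(n_examples * test_fraction)
    test_idx, rest = idx[:n_test], idx[n_test:]
    folds = np.array_split(rest, n_folds)
    splits = []
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        splits.append((train, val))
    return splits, test_idx

splits, test_idx = n_fold_splits(100, n_folds=4)
print(len(splits))  # 4 (train, validation) index pairs
# Averaging the 4 validation errors gives a lower-variance estimate.
```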

Noise can be used as a regularizer against overfitting in the inputs, the outputs, and the activation functions.

# Week 10

Making models differ by changing their training data

• Bagging: Train different models on different subsets of the data
• Boosting: Train a sequence of low-capacity models. Weight the training cases differently for each model in the sequence.
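A bagging sketch of my own: train each model on a bootstrap sample (drawn with replacement) and average the models’ predictions. The “model” here is just a fitted mean, purely to keep the illustration short:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(200) + 3.0   # toy 1-D dataset, true mean 3.0

def bagged_estimate(data, n_models=10):
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points with replacement.
        sample = rng.choice(data, size=len(data), replace=True)
        preds.append(sample.mean())     # each "model" just fits a mean
    return np.mean(preds)               # average the models' predictions

print(bagged_estimate(data))  # close to the true mean of 3.0
```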

# Week 11

Hopfield nets and Boltzmann machines

5 videos, 1 reading (Lecture Slides and resources). Videos:

• Hopfield Nets
• Dealing with spurious minima
• Hopfield nets with hidden units
• Using stochastic units to improve search
• How a Boltzmann machine models data

Graded: Lecture 11 Quiz

# Week 12

Restricted Boltzmann machines (RBMs). This module deals with Boltzmann machine learning.

5 videos, 1 reading. Videos:

• Boltzmann machine learning
• OPTIONAL VIDEO: More efficient ways to get the statistics
• Restricted Boltzmann Machines
• An example of RBM learning
• RBMs for collaborative filtering

# Week 13

Stacking RBMs to make Deep Belief Nets

3 videos, 1 reading. Videos:

• The ups and downs of back propagation
• Belief Nets
• The wake-sleep algorithm

Graded: Programming Assignment 4: Restricted Boltzmann Machines; Lecture 13 Quiz

# Week 14

Deep neural nets with generative pre-training

5 videos, 1 reading. Videos:

• Learning layers of features by stacking RBMs
• Discriminative learning for DBNs
• What happens during discriminative fine-tuning?
• Modeling real-valued data with an RBM
• OPTIONAL VIDEO: RBMs are infinite sigmoid belief nets

# Week 15

Modeling hierarchical structure with neural nets

6 videos, 1 reading. Videos:

• From PCA to autoencoders
• Deep auto encoders
• Deep auto encoders for document retrieval
• Semantic Hashing
• Learning binary codes for image retrieval
• Shallow autoencoders for pre-training

# Week 16

Recent applications of deep neural nets

3 videos. Videos:

• OPTIONAL: Learning a joint model of images and captions
• OPTIONAL: Hierarchical Coordinate Frames
• OPTIONAL: Bayesian optimization of hyper-parameters