This course is one of the most famous courses on Coursera. I am now two weeks ahead of the deadline and have reached Week 5; I plan to finish it in the following few days.

Update 03-19: I finished the course with full marks today, but this post is still incomplete. I will keep updating it as I review this great course.

This course is the perfect choice if you are not satisfied with merely getting some machine learning framework to work and are eager to know what is under the hood. It teaches the concrete mathematical principles and equations underlying most AI applications. Overall, Prof. Ng delivers profound knowledge in a comprehensive way. But the course isn’t flawless: for example, Week 5 uses intuition to explain backpropagation and example applications, which I would call verbose and useless.

Below are my notes on the important concepts. They may be incomplete and biased; feel free to leave a comment and let me know, and I will keep them updated.

The syllabus skeleton is kept to remind readers in which section each concept is taught.

# Week 1: Introduction

Supervised Learning and Unsupervised Learning

# Week 3: Logistic Regression

Questions:

1. Is the gradient too small?

2. Why does logistic regression have an advantage over linear regression when it comes to classification?

http://www.theanalysisfactor.com/why-logistic-regression-for-binary-response/

## Classification and Representation

### Hypothesis Representation

$h_\theta(x)=g(\theta^Tx)$, where $z=\theta^Tx$ and $g(z)=\frac{1}{1+e^{-z}}$
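As a minimal sketch of the hypothesis, here is the sigmoid applied to $\theta^Tx$ in plain Python. The function names and the example values of theta and x are made up for illustration; they are not from the course.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); theta and x are equal-length lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# x[0] = 1 is the bias feature; these numbers are illustrative only.
p = hypothesis([-3.0, 1.0, 1.0], [1.0, 2.0, 2.0])  # here z = 1
```

The output is read as the estimated probability that $y=1$, so predicting $y=1$ whenever it is at least 0.5 is the same as checking $\theta^Tx \ge 0$.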

### Decision Boundary

TODO convex function

## Logistic Regression Model

### Advanced Optimization

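fminunc in Octave is very useful: you give it a function that returns the cost and the gradient, and it minimizes the cost without you having to pick a learning rate manually.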

# Week 4

## Neural Networks

### Model Representation

#### How to determine the dimension of one layer?

If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j+1)$. The $+1$ comes from the bias unit.
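A small sketch of that rule, assuming a hypothetical network with 400 inputs, 25 hidden units and 10 outputs (the helper name `theta_shape` is mine):

```python
def theta_shape(s_j, s_j_plus_1):
    """Shape (rows, cols) of the weight matrix mapping layer j to layer j+1."""
    return (s_j_plus_1, s_j + 1)  # one extra column for the bias unit

layer_sizes = [400, 25, 10]  # hypothetical: input, hidden, output
shapes = [theta_shape(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
# shapes is [(25, 401), (10, 26)]
```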

# Week 5: Neural Networks: Learning

## Cost Function and Backpropagation

### Backpropagation Algorithm

$\delta_j^{(l)}$ is the “error” of node $j$ in layer $l$; formally it is the derivative of the cost with respect to $z_j^{(l)}$.

## Backpropagation in Practice

### Gradient Checking

gradApprox ≈ deltaVector

The code to compute gradApprox can be very slow
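A minimal sketch of gradient checking with a two-sided difference. The toy cost function and its hand-derived gradient stand in for the network cost and the backpropagation result; they are assumptions for illustration only.

```python
def numerical_gradient(cost, theta, eps=1e-4):
    """Approximate each partial derivative with (J(t+eps) - J(t-eps)) / (2*eps)."""
    grad_approx = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad_approx.append((cost(plus) - cost(minus)) / (2 * eps))
    return grad_approx

def cost(theta):
    # Toy cost standing in for the neural network cost J(theta).
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def analytic_gradient(theta):
    # Stands in for the deltaVector produced by backpropagation.
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.5, -2.0]
grad_approx = numerical_gradient(cost, theta)
delta_vector = analytic_gradient(theta)
max_diff = max(abs(a - b) for a, b in zip(grad_approx, delta_vector))
```

Each parameter needs two full cost evaluations, which is why this check is slow on a real network and is only run to verify the implementation.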

### Random Initialization

The initial theta can’t be set to all zeros; otherwise all hidden units in a layer compute the same value and receive the same update in backpropagation, so they stay identical. The theta matrices should therefore be initialized randomly. This is called symmetry breaking.

One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice of $\epsilon_{init}$ is $\epsilon_{init} = \frac{\sqrt6}{\sqrt{L_{in}+L_{out}}}$, where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to $\Theta^{(l)}$. Each weight is then drawn uniformly from $[-\epsilon_{init}, \epsilon_{init}]$.
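A sketch of this initialization, assuming each weight is drawn uniformly from $[-\epsilon_{init}, \epsilon_{init}]$. The function name, the layer sizes and the fixed seed are my own choices for illustration.

```python
import math
import random

def rand_initialize_weights(l_in, l_out, rng=random):
    """Return an l_out x (l_in + 1) matrix of small random weights."""
    eps_init = math.sqrt(6) / math.sqrt(l_in + l_out)
    return [[rng.uniform(-eps_init, eps_init) for _ in range(l_in + 1)]
            for _ in range(l_out)]

# Hypothetical layer sizes; the seed only makes the run repeatable.
theta1 = rand_initialize_weights(400, 25, random.Random(0))
```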

### Putting It Together

Question: Can we just skip gradient checking? Answer: No, we need to check that the backpropagation implementation is bug free.

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

Number of input units = dimension of features $x^{(i)}$

Number of output units = number of classes

Number of hidden units per layer = usually the more the better (must balance with the cost of computation, which increases with more hidden units)

Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

1. Randomly initialize the weights
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

## Application of Neural Networks

### Programming Assignment

Visualizing the hidden layer

One way to understand what your neural network is learning is to visualize the representations captured by the hidden units.

# Week 6:

## Bias vs. Variance

### Regularization and Bias/Variance

In order to choose the model and the regularization term λ, we need to:

1. Create a list of lambdas (i.e. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$).
2. Create a set of models with different degrees or any other variants.
3. Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.
4. Compute the cross validation error using the learned $\Theta$ (computed with $\lambda$) on $J_{CV}(\Theta)$ without regularization, i.e. with $\lambda = 0$.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo $\Theta$ and $\lambda$, apply it on $J_{test}(\Theta)$ to see if it generalizes well.
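The selection procedure can be sketched as below. To keep it short I assume a single model, a one-parameter linear hypothesis $h(x)=\theta x$ with a closed-form regularized fit, and tiny made-up data sets; a real run would also loop over the candidate models.

```python
def fit_theta(xs, ys, lam):
    """Minimize (1/2m) * [sum((theta*x - y)^2) + lam * theta^2] in closed form."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def error(xs, ys, theta):
    """Unregularized squared error, i.e. the cost with lambda = 0."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Made-up data, for illustration only.
train_x, train_y = [1.0, 2.0, 3.0], [1.2, 1.9, 3.2]
cv_x, cv_y = [1.5, 2.5], [1.4, 2.4]
test_x, test_y = [0.5, 3.5], [0.6, 3.4]
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

# Train with each lambda, but score on the CV set without regularization.
results = [(error(cv_x, cv_y, fit_theta(train_x, train_y, lam)), lam)
           for lam in lambdas]
best_cv_error, best_lambda = min(results)

# Report generalization on the held-out test set with the chosen lambda.
best_theta = fit_theta(train_x, train_y, best_lambda)
test_error = error(test_x, test_y, best_theta)
```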

## Building a Spam Classifier

### Error Analysis

Accuracy = (true positives + true negatives) / (total examples)

Precision = (true positives) / (true positives + false positives)

Recall = (true positives) / (true positives + false negatives)

F1 score = (2 * precision * recall) / (precision + recall)
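A small sketch of the four formulas above, computed from confusion-matrix counts. The counts are made up, and returning 0 when a denominator is 0 is a convention I assume here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Made-up counts for a skewed data set with few positives.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=12, tn=978)
```

With these counts accuracy is 0.986 while recall is only 0.4, which is why accuracy alone is misleading on skewed classes.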

TODO

# Week 7: Support Vector Machines

Question: What is SVM for?

## Kernels

A kernel is a similarity function.

### Using An SVM

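Do perform feature scaling before using the Gaussian kernel; otherwise features with large ranges dominate the similarity.

Common kernel choices: the Gaussian kernel and the linear kernel (i.e. no kernel).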
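A minimal sketch of the Gaussian kernel as a similarity function between a point and a landmark. The function name and the sample points are assumptions for illustration.

```python
import math

def gaussian_kernel(x, landmark, sigma):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)); 1 when x equals the landmark."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))

same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)   # identical points
near = gaussian_kernel([1.0, 2.0], [2.0, 2.0], sigma=1.0)   # distance 1
far = gaussian_kernel([1.0, 2.0], [6.0, 2.0], sigma=1.0)    # distance 5
```

Because the kernel depends on the squared distance, a feature with a much larger range than the others contributes almost all of that distance.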

# Week 8

## Dimensionality Reduction

### Principal Component Analysis

#### Principal Component Analysis Problem Formulation

Preprocessing is needed: feature scaling and mean normalization.
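A sketch of that preprocessing step in plain Python. I assume scaling by the population standard deviation (dividing by m), and the sample data are made up.

```python
import math

def normalize_features(rows):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    m = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / m for c in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [math.sqrt(sum((v - mu) ** 2 for v in c) / m) or 1.0
            for c, mu in zip(cols, means)]
    normalized = [[(v - mu) / s for v, mu, s in zip(row, means, stds)]
                  for row in rows]
    return normalized, means, stds

# Made-up data: house size in square feet and number of bedrooms.
data = [[2000.0, 3.0], [1600.0, 2.0], [2400.0, 4.0]]
normalized, means, stds = normalize_features(data)
```

The same means and standard deviations should be reused when projecting new examples.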

### Review

#### Programming Assignment: K-Means Clustering and PCA

I’m excited about this exercise and about how image pixels or other high-dimensional data can be reduced to a low dimension. I was shocked when the “eigenfaces” were drawn: look how well they capture the outlines of faces.

# Week 9: Anomaly Detection

## Density Estimation

If $p(x_i) < \epsilon$, we say $x_i$ is anomalous. We use the Gaussian distribution to calculate $p(x_i)$.
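A minimal sketch of density estimation, assuming independent features so that $p(x)$ is the product of one Gaussian per feature. The training data and the threshold value are made up.

```python
import math

def fit_gaussian(rows):
    """Estimate per-feature mean and variance from (mostly normal) examples."""
    m = len(rows)
    cols = list(zip(*rows))
    mu = [sum(c) / m for c in cols]
    sigma2 = [sum((v - u) ** 2 for v in c) / m for c, u in zip(cols, mu)]
    return mu, sigma2

def density(x, mu, sigma2):
    """p(x) as the product of univariate Gaussian densities."""
    p = 1.0
    for xj, uj, s2 in zip(x, mu, sigma2):
        p *= math.exp(-(xj - uj) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

train = [[10.0, 5.0], [10.2, 5.1], [9.8, 4.9], [10.1, 5.0], [9.9, 5.0]]
mu, sigma2 = fit_gaussian(train)
epsilon = 1e-3  # hypothetical threshold, normally tuned on labeled validation data
is_anomaly = density([12.0, 5.0], mu, sigma2) < epsilon
```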

## Building an Anomaly Detection System

### Anomaly Detection vs. Supervised Learning

What’s the difference between anomaly detection and supervised learning?

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples (y=1) (0-20 is common). Large number of negative (y=0) examples. | Large number of positive and negative examples. |
| Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set. |
| Fraud detection | Email spam classification |
| Manufacturing (e.g. aircraft engines) | Weather prediction (sunny/rainy/etc.) |
| Monitoring machines in a data center | Cancer classification |

### Choosing What Features to Use

Choose features that might take on unusually large or small values in the event of an anomaly.

If features are not normally distributed, apply a transformation such as the 1/2 power (square root) or the log function to make them look more Gaussian.

## Multivariate Gaussian Distribution (Optional)

### Anomaly Detection using the Multivariate Gaussian Distribution

Flag an anomaly if $p(x) < \epsilon$

where $\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$
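For reference, the standard textbook formulas that go with the mean above are the covariance estimate and the multivariate Gaussian density, written out here as an assumption about what this section intends ($n$ is the number of features):

$$\Sigma=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)(x^{(i)}-\mu)^T$$

$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$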

The original model compares with the multivariate Gaussian model like this:

| Original model | Multivariate Gaussian |
| --- | --- |
| Manually create features to capture anomalies where x1, x2 take unusual combinations of values | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large n) | Computationally more expensive |
| OK even if m (training set size) is small | Must have m>n, or else $\Sigma$ is non-invertible |

## Recommender Systems

### Predicting Movie Ratings

This section assumes we already have the movies’ features; we then train parameters for every user.

## Collaborative Filtering

### Collaborative Filtering

This section talks about feature learning, assuming we don’t have the movies’ features yet. Once some users have rated a movie that has no features, we can calculate the movie’s features as the ones that minimize the cost function.

### Collaborative Filtering Algorithm

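Both the x and theta matrices should be initialized to small random values to break symmetry.

### Implementational Detail: Mean Normalization

New users who haven’t rated anything end up with theta equal to 0 after regularization, so without mean normalization every prediction for them would be 0. With mean normalization, you predict each movie’s average rating for them instead.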
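A small sketch of mean normalization on a ratings matrix, assuming `None` marks a missing rating and that a brand-new user ends up with theta = 0. The matrix and helper name are made up.

```python
def mean_normalize(ratings):
    """ratings[i][j] is movie i's rating by user j, or None if unrated."""
    means, normalized = [], []
    for row in ratings:
        rated = [r for r in row if r is not None]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        normalized.append([None if r is None else r - mu for r in row])
    return normalized, means

# Three movies, three users; the last user has rated nothing yet.
ratings = [[5, 4, None], [1, None, None], [None, 5, None]]
normalized, means = mean_normalize(ratings)

# With theta = 0 the learned part of the prediction is 0, so adding the
# movie mean back gives the average rating as the prediction.
new_user_predictions = [0.0 + mu for mu in means]
```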

## Review

### Programming Assignment: Anomaly Detection and Recommender Systems

This assignment is not so interesting. I recommend another Machine Learning specialization, by the University of Washington on Coursera; I think they teach recommender systems better.

# Week 10: Large Scale Machine Learning

## Gradient Descent with Large Datasets

Observe the learning curves over the training set and the cross validation set: if they converge to the same level, your training set is already large enough and more data is unlikely to help.

### Learning With Large Datasets

Batch gradient descent is computationally expensive on a large dataset, because every single update has to sum over all training examples.

### Stochastic Gradient Descent

Stochastic gradient descent:

1. Randomly shuffle the training examples.
2. Use a single training example to update theta.
3. Repeat step 2 for each training example in turn.

Stochastic gradient descent is much faster than batch gradient descent, but it is not guaranteed to reach the optimum; it usually wanders around close to it. I think it’s hard to pick a proper learning rate for stochastic gradient descent.
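The steps above can be sketched for one-variable linear regression as follows. The data, learning rate, epoch count and seed are assumptions chosen so that the toy example converges.

```python
import random

def sgd_linear_regression(xs, ys, alpha=0.05, epochs=500, seed=0):
    """Fit h(x) = theta0 + theta1 * x, updating on one example at a time."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)              # step 1: shuffle the examples
        for i in indices:                 # steps 2-3: one update per example
            err = theta0 + theta1 * xs[i] - ys[i]
            theta0 -= alpha * err
            theta1 -= alpha * err * xs[i]
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, no noise
theta0, theta1 = sgd_linear_regression(xs, ys)
```

This toy data has no noise, so the parameters settle at the exact solution; with noisy data they would keep oscillating near the minimum.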

### Mini-Batch Gradient Descent

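A compromise between batch and stochastic gradient descent: each update uses a small batch of $b$ examples.

### Stochastic Gradient Descent Convergence

You can slowly decrease the learning rate over time, e.g. $\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}$, to help the cost function converge.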
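A minimal sketch of one mini-batch update for one-variable linear regression. The batch size, learning rate, iteration count and data are made up for illustration.

```python
def minibatch_step(theta0, theta1, batch, alpha):
    """One gradient step using the average gradient over a small batch."""
    b = len(batch)
    g0 = sum(theta0 + theta1 * x - y for x, y in batch) / b
    g1 = sum((theta0 + theta1 * x - y) * x for x, y in batch) / b
    return theta0 - alpha * g0, theta1 - alpha * g1

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (5.0, 11.0)]
batch_size = 2  # 1 would be stochastic, len(data) would be batch gradient descent
theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        theta0, theta1 = minibatch_step(theta0, theta1, batch, alpha=0.02)
```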

## Advanced Topics

### Online Learning

Similar to stochastic gradient descent: each new training example is used to adjust the model once and is then discarded. This lets the model adapt when user preferences change.

### Map Reduce and Data Parallelism

Compute the gradient over separate batches in parallel, then sum the partial results on a central node and update your theta. This speeds up learning and enables you to deal with large-scale datasets.
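A sketch of the idea for a one-parameter model $h(x)=\theta x$. The shards are processed in an ordinary loop here; on a real cluster each call to `partial_gradient` would run on a different machine. All names and data are assumptions.

```python
def partial_gradient(theta, shard):
    """'Map' step: un-normalized gradient sum over one shard of (x, y) pairs."""
    return sum((theta * x - y) * x for x, y in shard)

def mapreduce_update(theta, shards, alpha):
    """'Reduce' step: sum the partial gradients and take one descent step."""
    m = sum(len(s) for s in shards)
    partials = [partial_gradient(theta, s) for s in shards]
    return theta - alpha * sum(partials) / m

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # pretend each shard lives on its own machine
theta_parallel = mapreduce_update(0.0, shards, alpha=0.05)
theta_single = mapreduce_update(0.0, [data], alpha=0.05)
```

Both calls give the same theta, because the batch gradient is just a sum over examples and the sum can be split across machines.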

# Week 11: Application Example: Photo OCR

## Photo OCR

### Ceiling Analysis: What Part of the Pipeline to Work on Next

Analyze each component in turn: assume that this component and all components before it are perfect, then measure the accuracy of the whole pipeline. This shows how much room for improvement each component offers; if perfecting a component improves the overall accuracy only a little, it is not worth spending much effort on it.