A Road of Equations to Recurrent Neural Network
This is a cheat sheet for Recurrent Neural Network learner, covers from math essentials of neural network to several different kinds of RNN. If you go through the whole content and understand thoroughly, you will at least be able to make your own neural network architecture, derive forward/backward propagation equations, which are fundamental to building your own working neural network.
 Math Essentials
 Types of probability spaces
 Expected value
 Expected value in continuous spaces
 Variance
 Covariance
 Correlation
 Complement rule
 Product rule
 Rule of total probability
 Independence
 Bayes rule
 Some Linear algebra knowledge
 Vector arithmetic
 Matrix arithmetic
 Vector projection
 Linear Regression
 Nonlinear Regression
 RNN (Recurrent Neural Network)
 Thanks to
Math Essentials
Notation  Meaning 

a ∈ A  set membership 
 B   A cardinality: number of items in set B 
 v   norm: length of vector v 
\(\Sigma\)  summation 
\(\int\)  integral 
\(\Re\)  the set of real numbers 
\(\Re^{n}\)  real number space of dimension n n = 2 : plane or 2space n = 3 : 3 (dimensional) space n > 3 : nspace or hyperspace 
x, y, z, u, v  vector (bold, lower case) 
A, B, X  matrix (bold, upper case) 
\(y=f(x)\)  function(map):assigns unique value in range of y to each value in domain of x 
\(dy / dx\)  derivative of y with respect to single 
\(y=f(x)\)  function on multiple variables,i.e. a vector of variables; function in nspace 
\(∂y / ∂x_i\)  partial derivative of y with respect to element i of vector x 
\(Ω\)  the set of possible outcomes O 
F  the set of possible events E 
P  the probability distribution 
Types of probability spaces
Define \(\Ω\\); = number of possible outcomes
Discrete space \(Ω\) is finite – Analysis involves summations \((∑)\)
Continuous space \(Ω\) is infinite – Analysis involves integrals \((∫)\)
Expected value
Given:
 A discrete random variable X, with possible values x = x1, x2, … xn
 Probabilities \(p( X = xi )\) that X takes on the various values of xi
 A function \(y_i = f( x_i )\) defined on X The expected value of f is the
probabilityweighted “average” value of \(f( x_i )\): \(E( f ) = ∑_i p( x_i ) ⋅ f( x_i )\)
Common form
\[μ =\frac{1}{N}∑_{i=1}^{n}x_i\]Expected value in continuous spaces
\[E(f)=∫_{x=a→b} p(x)⋅f(x)\]Variance
\[σ^2 = ∑_i p(x^i)⋅(x^iμ)^2\]Common form
\[σ^2=\frac{1}{N1}∑_{i=1}^{n}(x_i−μ)^2\]Covariance
\[cov(x,y)=∑_ip(x_i,y_i )⋅(x_iμ_x )⋅(y_i μ_y )\]Common form
\[cov(x,y)=\frac{1}{N−1}∑_{i=1}^{n}(x_i −μ_x)(y_i −μ_y)\]Correlation
Pearson’s correlation coefficient is covariance normalized by the standard deviations of the two variables
\[corr(x, y)=\frac{cov(x, y)}{σ_xσ_y}\] Always lies in range 1 to 1
 Only reflects linear dependence between variables
Complement rule
Given: event A, which can occur or not
\[p( not A ) = 1  p( A )\]Product rule
Given: events A and B, which can cooccur (or not)
\[p( A, B ) = p( A \ B ) ⋅ p( B )\]Rule of total probability
Given: events A and B, which can cooccur (or not)
\[p( A ) = p( A, B ) + p( A, not B )\]Independence
Given: events A and B, which can cooccur (or not)
\(p(A \ B)=p(A)\) or \(p(A,B)=p(A)⋅p(B)\)
Bayes rule
A way to find conditional probabilities for one variable when conditional probabilities for another variable are known.
posterior probability ∝ likelihood × prior probability
\[p( B \ A ) = p( A \ B ) ⋅ p( B ) / p( A )\]where \(p( A ) = p( A, B ) + p( A, not B )\)
Some Linear algebra knowledge
Vector arithmetic
\[z=x+y=(x_1+y_1 ... x_n+y_n)^T\] \[y=ax=(ax_1 ... ax_n)^T\] \[a=x⋅y=∑_{i=1}^{n}x_iy_i\] \[a=x⋅y=xycos(θ)\]Matrix arithmetic
 Multiplication is associative
\(A ⋅( B ⋅ C ) = ( A ⋅B ) ⋅C\)
 Multiplication is not commutative
\(A ⋅ B ≠ B ⋅ A\) (generally)
 Transposition rule:
Vector projection
Orthogonal projection of y onto x is the vector
\[proj_x(y) = x⋅\\y\\⋅cos(θ)/\\x\\ =[(x⋅y)/\\x\\^2 ]x\](using dot product alternate form)
Stanford Machine Learning Tutorial:
Linear Regression
TODO
Nonlinear Regression
Sigmoid Function
\[P(y=1\x)=h_θ(x)=\frac{1}{1+exp(−θ^⊤x)}≡σ(θ^⊤x)\] \[P(y=0\x)=1−P(y=1\x)=1−h_θ(x)\] \[\sigma(x)=\frac{1}{1+e^{x}},x\in\mathbb{R}\] \[h_{\theta}(x)=\frac{1}{1+exp(−\theta^{⊤}x)}\]Cost function
\[J(θ)=−∑_i(y^{(i)}log(h_θ(x^{(i)}))+(1−y^{(i)})log(1−h_θ(x^{(i)})))\]Gradient
\[∇_θJ(θ)=\frac{∂J(θ)}{∂θ_j}=∑_ix^{(i)}_j(h_θ(x^{(i)})−y^{(i)})\]Softmax Regression
\[h_θ(x)=\left[\begin{matrix}P(y=1x;θ)\\P(y=2x;θ)\\⋮\\P(y=Kx;θ)\end{matrix}\right]=\frac{1}{∑_{j=1}^Kexp(θ^{(j)⊤}x)}\left[\begin{matrix}exp(θ^{(1)⊤}x)\\exp(θ^{(2)⊤}x)\\⋮\\exp(θ^{(K)⊤}x)\end{matrix}\right]\] \[P(y^{(i)}=kx^{(i)};θ)=\frac{exp(θ^{(k)⊤}x^{(i)})}{∑^{K}_{j=1}exp(θ^{(j)⊤}x^{(i)})}\]Cost function (crossentropy)
\[J(\theta)=log(\frac { exp(w^T_jx) }{\sum^n_{i=1}exp(w^T_ix) } )=−w^T_jx+log(\sum^n_{i=1}exp(w^T_ix))\]Gradient
\[∇_{θ^{(k)}}J(θ)=−∑_{i=1}^{m}\left[x^{(i)}(1\{y^{(i)}=k\}−P(y^{(i)}=k;x^{(i)};θ))\right]\]tanh
TODO
ReLU
TODO
TODO: more
RNN (Recurrent Neural Network)
LSTM
Overview of LSTM cell
Details of LSTM cell
Intuition

New memory generation:This stage is analogous to the new memory generation stage we saw in GRUs. We essentially use the input word \(x_t\) and the past hidden state \(h_{t−1}\) to generate a new memory \(\tilde{c_t}\) which includes aspects of the new word \(x_{(t)}\).

Input Gate: We see that the new memory generation stage doesn’t check if the new word is even important before generating the new memory – this is exactly the input gate’s function. The input gate uses the input word and the past hidden state to determine whether or not the input is worth preserving and thus is used to gate the new memory. It thus produces it as an indicator of this information.

Forget Gate: This gate is similar to the input gate except that it does not make a determination of usefulness of the input word – instead it makes an assessment on whether the past memory cell is useful for the computation of the current memory cell. Thus, the forget gate looks at the input word and the past hidden state and produces \(f_t\).

Final memory generation: This stage first takes the advice of the forget gate \(f_t\) and accordingly forgets the past memory \(c_{t−1}\). Similarly, it takes the advice of the input gate it and accordingly gates the new memory \(\tilde{c_t}\). It then sums these two results to produce the final memory \(c_t$$.

Output/Exposure Gate: This is a gate that does not explicitly exist in GRUs. It’s purpose is to separate the final memory from the hidden state. The final memory \(c_t\) contains a lot of information that is not necessarily required to be saved in the hidden state. Hidden states are used in every single gate of an LSTM and thus, this gate makes the assessment regarding what parts of the memory ct needs to be exposed/present in the hidden state ht. The signal it produces to indicate this is ot and this is used to gate the pointwise tanh of the memory.
Equations
\(i_t = σ(W^{(i)}x_t + U^{(i)}h_{t−1})\) (Input gate)
\(f_t = σ(W^{(f)}x_t + U^{(f)}h_{t−1})\) (Forget gate)
\(o_t = σ(W^{(o)}x_t + U^{(o)}h_{t−1})\) (Output/Exposure gate)
\(\tilde{c_t} = tanh(W^{(c)}x_t + U^{(c)}h_{t−1})\) (New memory cell)
\(c_t = f_t ◦ c_{t−1} + i_t ◦ \tilde{c_t}\) (Final memory cell)
\[h_t =o_t◦tanh(c_t)\]How LSTM backward propagation works?
Here’s a slide explaining this: LSTM Forward and Backward Pass
GRU
Overview of GRU cell
Details of GRU cell
Intuition

New memory generation: A new memory \(\tilde{h_t}\) is the consolidation of a new input word \(x_t\) with the past hidden state \(h_{t−1}\). Anthropomorphically, this stage is the one who knows the recipe of combining a newly observed word with the past hidden state \(h_{t−1}\) to summarize this new word in light of the contextual past as the vector \(\tilde{h_t}\).

Reset Gate: The reset signal \(r_t\) is responsible for determining how important \(h_{t−1}\) is to the summarization \(\tilde{h_t}\). The reset gate has the ability to completely diminish past hidden state if it finds that \(h_{t−1}\) is irrelevant to the computation of the new memory.

Update Gate: The update signal \(z_t\) is responsible for determining how much of \(h_{t−1}\) should be carried forward to the next state. For instance, if \(z_t ≈ 1\), then \(h_{t−1}\) is almost entirely copied out to ht. Conversely, if \(z_t ≈ 0\), then mostly the new memory \(\tilde{h_t}\) is forwarded to the next hidden state.

Hidden state: The hidden state ht is finally generated using the past hidden input \(h_{t−1}\) and the new memory generated \(\tilde{h_{t1}}\) with the advice of the update gate.
Equations
\(r_t = sigm (W_{xr}x_t + W_{hr}h_t−1 + b_r)\) (Reset gate)
\(z_t = sigm(W_{xz}x_t + W_{hz}h_t−1 + b_z)\) (Update gate)
\(\tilde{h_t} = tanh(W_{xh}x_t + W_{hh}(r_t ⊙ h_{t−1}) + b_h)\) (New memory)
\(h_t = z_t ⊙h_{t−1} +(1−z_t)⊙\tilde{h_t}\) (Hidden state)
TODO: more RNN network and math
Thanks to

Jeff Howbert for his lecture note summarizing Machine Learning Math Essentials: Washington University CSS490 Winter 2012 lecture slide

Computer Science Department, Stanford University for their Deep Learning Tutorial: Stanford Machine Learning Tutorial

Nal Kalchbrenner, Ivo Danihelka, Alex Graves for their paper: Grid Long ShortTerm Memory

Rafal Jozefowicz, Wojciech Zaremba and Ilya Sutskever for their paper: An Empirical Exploration of Recurrent Network Architectures

Christopher Olah for his clear explanation of LSTM: Understanding LSTM Networks

Andrej Karpathy for his post: The Unreasonable Effectiveness of Recurrent Neural Networks and excellent charrnn code

Arun Mallya for his explanation how LSTM backward propagation works: LSTM Forward and Backward Pass