Classical ML Equations in LaTeX
A collection of classical ML equations in LaTeX. Some are provided with brief notes and paper links. Hopefully this helps with writing papers and blogs.
Better viewed at https://blmoistawinde.github.io/ml_equations_latex/
Model
RNNs(LSTM, GRU)
The encoder hidden state h_t at time step t (with input token embedding x_t) and the decoder hidden state s_t at time step t (with input token embedding y_t) are computed as:
h_t = RNN_{enc}(x_t, h_{t-1})
s_t = RNN_{dec}(y_t, s_{t-1})
- LSTM (paper: Long Short-Term Memory)
- GRU (paper: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation)
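As a concrete illustration, below is a minimal NumPy sketch of one recurrence step using a plain (Elman) RNN cell; the weight names W_xh, W_hh, b_h and the toy sizes are assumptions for illustration, and an LSTM or GRU would replace this update with its gated equations.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# toy dimensions: embedding size 4, hidden size 3
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                              # h_0
for x_t in rng.normal(size=(5, 4)):          # 5 input token embeddings x_1..x_5
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # encoder hidden states h_1..h_5
```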
Attentional Seq2seq
The attention weight \alpha_{ij} of the i-th decoder step over the j-th encoder step, resulting in the context vector c_i:
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
e_{ij} = a(s_{i-1}, h_j)
Here a is a specific attention function, which can be one of the following:
Bahdanau Attention
Paper: Neural Machine Translation by Jointly Learning to Align and Translate
e_{ij} = v^T tanh(W[s_{i-1}; h_j])
Luong(Dot-Product) Attention
Paper: Effective Approaches to Attention-based Neural Machine Translation
If s_{i-1} and h_j have the same number of dimensions:
e_{ij} = s_{i-1}^T h_j
otherwise:
e_{ij} = s_{i-1}^T W h_j
Finally, the output is produced by:
s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
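A minimal NumPy sketch of the additive (Bahdanau) scoring followed by the softmax and context vector; the matrices W, v and the toy shapes are hypothetical placeholders for learned parameters.

```python
import numpy as np

def bahdanau_context(s_prev, H, W, v):
    """e_ij = v^T tanh(W [s_{i-1}; h_j]); alpha = softmax(e); c_i = sum_j alpha_ij h_j."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([s_prev, h_j])) for h_j in H])
    alpha = np.exp(scores - scores.max())    # stable softmax over encoder steps
    alpha /= alpha.sum()
    return alpha @ H, alpha                  # context vector c_i and attention weights

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                  # T_x = 6 encoder states h_j of dim 8
s_prev = rng.normal(size=8)                  # previous decoder state s_{i-1}
W, v = rng.normal(size=(8, 16)), rng.normal(size=8)
c_i, alpha = bahdanau_context(s_prev, H, W, v)
```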
Transformer
Paper: Attention Is All You Need
Scaled Dot-Product attention
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
where d_k is the dimension of the key vector k and the query vector q.
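A minimal NumPy sketch of scaled dot-product attention over a batch of queries; the toy shapes are assumptions, and masking/batching over heads is omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over the keys."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
out = scaled_dot_product_attention(Q, K, V)                # shape (2, 3)
```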
Multi-head attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where
head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
Generative Adversarial Networks(GAN)
Paper: Generative Adversarial Networks
Minimax game objective
\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{z}}(z)}[\log{(1 - D(G(z)))}]
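As a sanity check of the objective, here is a NumPy sketch that estimates the minimax value from samples of the discriminator outputs; the arrays d_real and d_fake are hypothetical stand-ins for D(x) on real data and D(G(z)) on generated data.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# hypothetical discriminator outputs on a real batch and a generated batch
d_real = np.array([0.9, 0.8, 0.95])   # D(x) for real samples
d_fake = np.array([0.1, 0.2, 0.05])   # D(G(z)) for generated samples
print(gan_value(d_real, d_fake))      # D tries to maximize this, G to minimize it
```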
Variational Auto-Encoder(VAE)
Paper: Auto-Encoding Variational Bayes
Reparameterization trick
To produce a latent variable z such that z \sim q_{\mu, \sigma}(z), we first sample \epsilon from a standard normal distribution, then compute z deterministically:
z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)
\epsilon \sim \mathcal{N}(0,1)
z = \mu + \epsilon \cdot \sigma
The above is the 1-D case. For the multi-dimensional (vector) case we use:
\epsilon \sim \mathcal{N}(0, \textbf{I})
\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
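A minimal NumPy sketch of the reparameterization trick for the diagonal-covariance case; the values of mu and sigma are toy placeholders for the encoder's outputs.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, diag(sigma^2)) via z = mu + eps * sigma, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)   # all randomness is isolated in eps
    return mu + eps * sigma               # deterministic in mu and sigma, so gradients can flow

rng = np.random.default_rng(0)
mu, sigma = np.array([0.0, 1.0]), np.array([1.0, 0.5])
z = reparameterize(mu, sigma, rng)
```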
Activations
Sigmoid
Related to Logistic Regression. For single-label/multi-label binary classification.
\sigma(z) = \frac{1} {1 + e^{-z}}
Tanh
tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}
Softmax
For multi-class single label classification.
\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K
Relu
Relu(z) = max(0, z)
Gelu
where \Phi(x) is the cumulative distribution function of the standard Gaussian distribution.
Gelu(x) = x\Phi(x)
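The activations above map directly to a few lines of NumPy; this sketch assumes SciPy is available for the Gaussian CDF used by Gelu.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def gelu(x):
    return x * norm.cdf(x)                    # x * Phi(x)

def softmax(z):
    e = np.exp(z - np.max(z))                 # shift for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), np.tanh(z), relu(z), gelu(z), softmax(z))
```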
Loss
Regression
Below, x and y are D-dimensional vectors, and x_i denotes the value on the i-th dimension of x.
Mean Absolute Error(MAE)
MAE(x, y) = \frac{1}{D}\sum_{i=1}^{D}|x_i-y_i|
Mean Squared Error(MSE)
MSE(x, y) = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2
Huber loss
It is less sensitive to outliers than the MSE, as it only squares the error inside an interval.
L_{\delta}=
\left\{\begin{matrix}
\frac{1}{2}(y - \hat{y})^{2} & if \left | y - \hat{y} \right | < \delta\\
\delta (\left | y - \hat{y} \right | - \frac{1}{2}\delta) & otherwise
\end{matrix}\right.
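The regression losses above in NumPy; the toy vectors are illustrative, and the element-wise Huber values are averaged at the end as an assumption about how the loss is aggregated.

```python
import numpy as np

def mae(x, y):
    """Mean absolute error over the dimensions of x and y."""
    return np.mean(np.abs(x - y))

def mse(x, y):
    """Mean squared error over the dimensions of x and y."""
    return np.mean((x - y) ** 2)

def huber(y, y_hat, delta=1.0):
    """0.5 * err^2 where |err| < delta, else delta * (|err| - 0.5 * delta), element-wise."""
    err = np.abs(y - y_hat)
    return np.where(err < delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

y, y_hat = np.array([1.0, 2.0, 10.0]), np.array([1.5, 2.0, 2.0])
print(mae(y, y_hat), mse(y, y_hat), huber(y, y_hat).mean())
```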
Classification
Cross Entropy
- In binary classification, where the number of classes M equals 2, Binary Cross-Entropy(BCE) can be calculated as:
-{(y\log(p) + (1 - y)\log(1 - p))}
- If M > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result:
-\sum_{c=1}^My_{o,c}\log(p_{o,c})
M - number of classes
log - the natural log
y - binary indicator (0 or 1) if class label c is the correct classification for observation o
p - predicted probability observation o is of class c
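A NumPy sketch of both forms; the probabilities and one-hot labels are toy placeholders, and the per-observation losses are averaged over the batch as an assumption.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-(y log p + (1 - y) log(1 - p)), averaged over observations."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def categorical_cross_entropy(Y, P, eps=1e-12):
    """-sum_c y_{o,c} log p_{o,c}, summed over classes and averaged over observations."""
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

y, p = np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, p))

Y = np.array([[1, 0, 0], [0, 0, 1]])                 # one-hot labels
P = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])     # predicted class probabilities
print(categorical_cross_entropy(Y, P))
```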
Negative Loglikelihood
NLL(y) = -{\log(p(y))}
Minimizing the negative loglikelihood
\min_{\theta} \sum_y {-\log(p(y;\theta))}
is equivalent to Maximum Likelihood Estimation(MLE):
\max_{\theta} \prod_y p(y;\theta)
Here p(y) is a scalar instead of a vector: it is the predicted probability on the single dimension where the ground truth y lies. It is thus equivalent to cross entropy (see wiki).
Hinge loss
Used in Support Vector Machine(SVM).
max(0, 1 - y \cdot \hat{y})
KL/JS divergence
KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}
JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
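A NumPy sketch of both divergences for discrete distributions; the small eps and the example distributions are assumptions for illustration.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) = sum_c p_c log(p_c / q_c) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

def js(p, q):
    """JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]
print(kl(p, q), js(p, q))
```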
Regularization
The Error term below can be any of the above losses.
L1 regularization
A regression model that uses the L1 regularization technique is called Lasso Regression.
Loss = Error(Y - \widehat{Y}) + \lambda \sum_{i=1}^{n} |w_i|
L2 regularization
A regression model that uses the L2 regularization technique is called Ridge Regression.
Loss = Error(Y - \widehat{Y}) + \lambda \sum_{i=1}^{n} w_i^{2}
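A NumPy sketch of adding either penalty to a base loss value; the weight vector, base loss and lambda are toy placeholders.

```python
import numpy as np

def l1_regularized(loss, w, lam):
    """Lasso-style penalty: loss + lambda * sum_i |w_i|."""
    return loss + lam * np.sum(np.abs(w))

def l2_regularized(loss, w, lam):
    """Ridge-style penalty: loss + lambda * sum_i w_i^2."""
    return loss + lam * np.sum(w ** 2)

w, base_loss, lam = np.array([0.5, -1.2, 0.0]), 0.8, 0.1
print(l1_regularized(base_loss, w, lam), l2_regularized(base_loss, w, lam))
```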
Metrics
Some of these overlap with the losses above, e.g. MAE and KL-divergence.
Classification
Accuracy, Precision, Recall, F1
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}
Sensitivity, Specificity and AUC
Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}
AUC is calculated as the Area Under the Sensitivity (TPR) vs. 1 - Specificity (FPR) curve, i.e. the ROC curve.
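All of these follow directly from the confusion-matrix counts; a small Python sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall/sensitivity, specificity and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # = sensitivity = TPR
    specificity = tn / (fp + tn)                 # = 1 - FPR
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```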
Regression
MAE and MSE: see the equations above.
Clustering
(Normalized) Mutual Information (NMI)
The Mutual Information is a measure of the similarity between two labelings of the same data. Where |U_i| is the number of samples in cluster U_i and |V_j| is the number of samples in cluster V_j, the Mutual Information between clusterings U and V is given as:
MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N} \log\frac{N|U_i \cap V_j|}{|U_i||V_j|}
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation). Here the mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred); see wiki.
RI and ARI are skipped due to their complexity.
Metrics for related tasks are also skipped (e.g. modularity for community detection [graph clustering], coherence score for topic modeling [soft clustering]).
Ranking
nDCG (Normalized Discounted Cumulative Gain) is skipped due to its complexity.
(Mean) Average Precision(MAP)
Average Precision is calculated as:
\text{AP} = \sum_n (R_n - R_{n-1}) P_n
where P_n and R_n are the precision and recall at the n-th threshold.
AP can also be regarded as the area under the precision-recall curve.
MAP is the mean of AP over all the queries.
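A NumPy sketch of the AP sum above, assuming R_0 = 0 and that the thresholds are ordered so recall increases; the precision/recall values are hypothetical.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0 and recall increasing over thresholds."""
    p, r = np.asarray(precisions, float), np.asarray(recalls, float)
    return r[0] * p[0] + np.sum((r[1:] - r[:-1]) * p[1:])

# hypothetical precision/recall values at successive thresholds
precisions = [1.0, 0.67, 0.75]
recalls = [0.33, 0.67, 1.0]
print(average_precision(precisions, recalls))
```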
Similarity/Relevance
Cosine
Cosine(x,y) = \frac{x \cdot y}{|x||y|}
Jaccard
Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}
Pointwise Mutual Information(PMI)
PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}
For example, p(x) and p(y) are the frequencies of word x and word y appearing in a corpus, and p(x,y) is the frequency of their co-occurrence.
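A NumPy/Python sketch of the three measures; the example vectors, sets and probabilities are toy placeholders.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: x . y / (|x| |y|)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(U, V):
    """Jaccard similarity between two sets: |U ∩ V| / |U ∪ V|."""
    U, V = set(U), set(V)
    return len(U & V) / len(U | V)

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log(p(x, y) / (p(x) p(y)))."""
    return np.log(p_xy / (p_x * p_y))

print(cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0])))   # 1.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))            # 0.5
print(pmi(p_xy=0.01, p_x=0.05, p_y=0.04))                   # > 0: co-occur more than by chance
```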
Notes
This repository currently contains only simple equations for ML. They are mainly about deep learning and NLP, due to personal research interests.
Due to time constraints, elegant equations from traditional ML approaches like SVM, SVD, PCA and LDA are not included yet.
Moreover, there is a trend towards more complex metrics that have to be computed with complicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover's Distance), or even learned models (e.g. BERTScore). They thus cannot be described by simple equations.
Reference
https://blog.floydhub.com/gans-story-so-far/
https://ermongroup.github.io/cs228-notes/extras/vae/
Thanks to a-rodin's solution for displaying LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.