
Classical ML Equations in LaTeX

A collection of classical ML equations in LaTeX. Some are provided with brief notes and paper links. It aims to help with writing papers and blogs.

Better viewed at https://blmoistawinde.github.io/ml_equations_latex/

Model

RNNs(LSTM, GRU)

Encoder hidden state h_t at time step t, with input token embedding x_t:

h_t = RNN_{enc}(x_t, h_{t-1})

Decoder hidden state s_t at time step t, with input token embedding y_t:

s_t = RNN_{dec}(y_t, s_{t-1})

The RNN_{enc} and RNN_{dec} are usually either an LSTM or a GRU.
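For concreteness, here is a minimal NumPy sketch of the two recurrences, with a plain (Elman) RNN cell standing in for the LSTM/GRU; all dimensions and weights are hypothetical:

import numpy as np

def rnn_cell(x, h_prev, W_x, W_h, b):
    # Plain (Elman) RNN cell standing in for an LSTM/GRU cell:
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x + W_h @ h_prev + b)

rng = np.random.default_rng(0)
d_emb, d_hid = 4, 8                      # hypothetical embedding/hidden sizes
W_x = rng.normal(size=(d_hid, d_emb))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

# Encoder: h_t = RNN_enc(x_t, h_{t-1}) over the input embeddings x_1..x_T
h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_emb)):  # 5 input token embeddings
    h = rnn_cell(x_t, h, W_x, W_h, b)

# Decoder: s_t = RNN_dec(y_t, s_{t-1}), initialized from the encoder state
s = h
for y_t in rng.normal(size=(3, d_emb)):  # 3 output token embeddings
    s = rnn_cell(y_t, s, W_x, W_h, b)    # the decoder has separate weights in practice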

Attentional Seq2seq

The attention weight \alpha_{ij}, computed at the i-th decoder step over the j-th encoder step, results in the context vector c_i:

c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)

Here a is a specific attention function, which can be one of the following:

Bahdanau Attention

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

e_{ij} = v^T tanh(W[s_{i-1}; h_j])

Luong(Dot-Product) Attention

Paper: Effective Approaches to Attention-based Neural Machine Translation

If s_{i-1} and h_j have the same dimensionality:

e_{ij} = s_{i-1}^T h_j

otherwise:

e_{ij} = s_{i-1}^T W h_j

Finally, the output o_t is produced by:

s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
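Putting the attention equations together, a minimal NumPy sketch of one decoder step with the Luong dot-product score (all shapes here are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
T_x, d = 5, 8
H = rng.normal(size=(T_x, d))          # encoder hidden states h_1..h_{T_x}
s_prev = rng.normal(size=d)            # previous decoder state s_{i-1}

e = H @ s_prev                         # e_ij = s_{i-1}^T h_j (dot-product score)
alpha = np.exp(e) / np.exp(e).sum()    # alpha_ij: softmax over encoder steps
c = alpha @ H                          # context vector c_i = sum_j alpha_ij h_j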

Transformer

Paper: Attention Is All You Need

Scaled Dot-Product attention

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

where d_k is the dimension of the key vector k and the query vector q.

Multi-head attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
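A minimal single-head NumPy sketch of scaled dot-product attention; multi-head attention applies the same operation per head after the learned projections W^Q_i, W^K_i, W^V_i and concatenates the results (shapes here are hypothetical):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out = scaled_dot_product_attention(Q, K, V)   # shape (3, 8)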

Generative Adversarial Networks(GAN)

Paper: Generative Adversarial Networks

Minmax game objective

\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] +  \mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))]
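For intuition, a Monte-Carlo sketch of the objective's value given hypothetical discriminator outputs (an evaluation of the two expectation terms, not a training loop):

import numpy as np

rng = np.random.default_rng(0)
D_real = rng.uniform(0.5, 1.0, size=1000)   # hypothetical D(x) on real samples
D_fake = rng.uniform(0.0, 0.5, size=1000)   # hypothetical D(G(z)) on fakes

# Estimate E[log D(x)] + E[log(1 - D(G(z)))]: D maximizes it, G minimizes it
value = np.log(D_real).mean() + np.log(1.0 - D_fake).mean()
print(value)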

Variational Auto-Encoder(VAE)

Paper: Auto-Encoding Variational Bayes

Reparameterization trick

To produce a latent variable z such that z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2), we sample \epsilon \sim \mathcal{N}(0,1); then z is produced by

z = \mu + \epsilon \cdot \sigma

The above is the 1-D case. For the multi-dimensional (vector) case we use:

\epsilon \sim \mathcal{N}(0, \textbf{I})

\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
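A minimal NumPy sketch of the trick (the encoder outputs mu and sigma are hypothetical values here):

import numpy as np

def reparameterize(mu, sigma, rng):
    # z = mu + eps * sigma with eps ~ N(0, I): the randomness is isolated
    # in eps, so gradients can flow through mu and sigma.
    eps = rng.standard_normal(np.shape(mu))
    return mu + eps * sigma

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])       # hypothetical encoder output
sigma = np.array([1.0, 0.5])    # hypothetical encoder output (std, not variance)
z = reparameterize(mu, sigma, rng)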

Activations

Sigmoid

Related to Logistic Regression. For single-label/multi-label binary classification.

\sigma(z) = \frac{1} {1 + e^{-z}}

Tanh

tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}

Softmax

For multi-class single label classification.

\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ \text{for}\ i=1,2,\dots,K

ReLU

ReLU(z) = max(0, z)

GELU

GELU(x) = x\Phi(x)

where \Phi(x) is the cumulative distribution function of the standard Gaussian distribution.
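A minimal NumPy sketch of these activations (tanh ships with NumPy; the GELU here uses the exact Gaussian CDF via the error function rather than the common tanh approximation):

import numpy as np
from scipy.special import erf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())             # subtract max for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def gelu(x):
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))  # standard Gaussian CDF
    return x * phi

z = np.array([-1.0, 0.0, 2.0])
for f in (sigmoid, np.tanh, softmax, relu, gelu):
    print(f.__name__, f(z))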

Loss

Regression

Below, x and y are D-dimensional vectors, and x_i denotes the value on the i-th dimension of x.

Mean Absolute Error(MAE)

\sum_{i=1}^{D}|x_i-y_i|

Mean Squared Error(MSE)

\sum_{i=1}^{D}(x_i-y_i)^2

Huber loss

It is less sensitive to outliers than the MSE, as it treats the error quadratically only inside an interval.

L_{\delta}=
    \begin{cases}
        \frac{1}{2}(y - \hat{y})^{2} & \text{if } \left | y - \hat{y} \right | < \delta\\
        \delta (\left | y - \hat{y} \right | - \frac1 2 \delta) & \text{otherwise}
    \end{cases}
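A NumPy sketch of the three regression losses as written above (the equations keep the raw sums; divide by D for the mean forms):

import numpy as np

def mae(x, y):
    return np.abs(x - y).sum()

def mse(x, y):
    return ((x - y) ** 2).sum()

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2                 # used where |error| < delta
    linear = delta * (err - 0.5 * delta)       # used elsewhere
    return np.where(err < delta, quadratic, linear).sum()

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 0.0])
print(mae(x, y), mse(x, y), huber(x, y))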

Classification

Cross Entropy

  • In binary classification, where the number of classes M equals 2, Binary Cross-Entropy (BCE) can be calculated as:

-{(y\log(p) + (1 - y)\log(1 - p))}

  • If M > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result:

-\sum_{c=1}^My_{o,c}\log(p_{o,c})

where

M - number of classes

log - the natural log

y - binary indicator (0 or 1) of whether class label c is the correct classification for observation o

p - predicted probability that observation o is of class c
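A NumPy sketch of both forms for a single observation (the eps clipping is a guard against log(0), not part of the definition):

import numpy as np

def bce(y, p, eps=1e-12):
    # Binary cross-entropy: -(y log p + (1 - y) log(1 - p))
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, p, eps=1e-12):
    # Multiclass: -sum_c y_{o,c} log(p_{o,c}) for one observation o
    return -(y_onehot * np.log(np.clip(p, eps, 1.0))).sum()

print(bce(1, 0.9))                                                    # ~0.105
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # ~0.357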

Negative Loglikelihood

NLL(y) = -{\log(p(y))}

Minimizing the negative loglikelihood

\min_{\theta} \sum_y {-\log(p(y;\theta))}

is equivalent to Maximum Likelihood Estimation (MLE):

\max_{\theta} \prod_y p(y;\theta)

Here p(y) is a scalar rather than a vector: it is the predicted probability on the single dimension where the ground truth y lies. It is thus equivalent to cross entropy (see wiki).

Hinge loss

Used in the Support Vector Machine (SVM).

max(0, 1 - y \cdot \hat{y})

KL/JS divergence

KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}

JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
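A NumPy sketch of both divergences over discrete distributions (clipping guards against log(0)):

import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) = sum_c p_c log(p_c / q_c)
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return (p * np.log(p / q)).sum()

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

y_hat = np.array([0.9, 0.1])
y = np.array([0.6, 0.4])
print(kl(y_hat, y), js(y_hat, y))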

Regularization

The Error term below can be any of the above losses.

L1 regularization

A regression model that uses L1 regularization technique is called Lasso Regression.

Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|

L2 regularization

A regression model that uses the L2 regularization technique is called Ridge Regression.

Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n w_i^{2}
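A small sketch of adding either penalty to a base loss (the base error value here is hypothetical; it would come from MSE or another loss above):

import numpy as np

def l1_penalty(w, lam):
    return lam * np.abs(w).sum()       # Lasso term: lambda * sum |w_i|

def l2_penalty(w, lam):
    return lam * (w ** 2).sum()        # Ridge term: lambda * sum w_i^2

w = np.array([0.5, -1.2, 0.0, 2.0])
base_error = 0.42                      # hypothetical Error(Y - Y_hat) value
lasso_loss = base_error + l1_penalty(w, lam=0.01)
ridge_loss = base_error + l2_penalty(w, lam=0.01)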

Metrics

Some of these overlap with the losses above, e.g. MAE and KL-divergence.

Classification

Accuracy, Precision, Recall, F1

Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}

Sensitivity, Specificity and AUC

Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}

AUC is calculated as the Area Under the Curve of Sensitivity (TPR) against 1 - Specificity (FPR).
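All of these except AUC are direct functions of the confusion-matrix counts, as the sketch below shows; AUC needs the full score distribution (e.g. sklearn.metrics.roc_auc_score):

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                   # = sensitivity = TPR
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (fp + tn)              # 1 - specificity = FPR
    return accuracy, precision, recall, f1, specificity

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))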

Regression

MAE, MSE: see the equations above.

Clustering

(Normalized) Mutual Information (NMI)

The Mutual Information is a measure of the similarity between two labelings of the same data. Where |U_i| is the number of samples in cluster U_i and |V_j| is the number of samples in cluster V_j, the Mutual Information between clusterings U and V is given as:

MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}
\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation). Here, mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred); see wiki.
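In practice these are one-liners with scikit-learn (which the References already point to); note that the permuted cluster ids below still give a perfect NMI of 1.0:

from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]    # same grouping, cluster ids permuted

print(mutual_info_score(labels_true, labels_pred))             # MI in nats
print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0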

We skip RI and ARI due to their complexity.

We also skip metrics for related tasks (e.g. modularity for community detection [graph clustering] and coherence score for topic modeling [soft clustering]).

Ranking

We skip nDCG (Normalized Discounted Cumulative Gain) due to its complexity.

(Mean) Average Precision(MAP)

Average Precision is calculated as:

\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where P_n and R_n are the precision and recall at the n-th threshold.

AP can also be regarded as the area under the precision-recall curve.

MAP is the mean of AP over all the queries.
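A NumPy sketch of AP over a single ranked list of binary relevance judgments (MAP would average this over queries; names and data are hypothetical):

import numpy as np

def average_precision(relevance, scores):
    # AP = sum_n (R_n - R_{n-1}) * P_n over the list ranked by score
    order = np.argsort(scores)[::-1]
    rel = np.asarray(relevance, dtype=float)[order]
    hits = np.cumsum(rel)
    precision = hits / np.arange(1, len(rel) + 1)   # P_n
    recall = hits / rel.sum()                       # R_n
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return ((recall - prev_recall) * precision).sum()

print(average_precision([1, 0, 1, 0], scores=[0.9, 0.8, 0.7, 0.1]))  # ~0.833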

Similarity/Relevance

Cosine

Cosine(x,y) = \frac{x \cdot y}{|x||y|}

Jaccard

Similarity of two sets U and V.

Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}

Pointwise Mutual Information(PMI)

Relevance of two events x and y.

PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}

For example, p(x) and p(y) are the frequencies of words x and y appearing in a corpus, and p(x,y) is the frequency of the co-occurrence of the two.
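A small sketch of all three measures (the PMI example plugs in hypothetical corpus frequencies):

import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(U, V):
    U, V = set(U), set(V)
    return len(U & V) / len(U | V)

def pmi(p_xy, p_x, p_y):
    # PMI(x; y) = log( p(x,y) / (p(x) p(y)) )
    return np.log(p_xy / (p_x * p_y))

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
print(jaccard({"a", "b"}, {"b", "c"}))                     # 1/3
print(pmi(p_xy=0.01, p_x=0.1, p_y=0.05))                   # log 2 ~ 0.69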

Notes

This repository currently contains only simple equations for ML, mainly about deep learning and NLP due to personal research interests.

Due to time constraints, elegant equations from traditional ML approaches like SVM, SVD, PCA, and LDA are not included yet.

Moreover, there is a trend towards more complex metrics, which have to be calculated with complicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover's Distance), or even learned models (e.g. BERTScore). They thus cannot be described by simple equations.

Reference

Pytorch Documentation

Scikit-learn Documentation

Machine Learning Glossary

Wikipedia

https://blog.floydhub.com/gans-story-so-far/

https://ermongroup.github.io/cs228-notes/extras/vae/

Thanks to a-rodin's solution for showing LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.
