Suggested Notation for Machine Learning
Authors
- Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)
- Peking University (北京大学)
- Shanghai Jiao Tong University (上海交通大学)
- Zhi-qin John Xu (许志钦), Tao Luo (罗涛), Zheng Ma (马征), Yaoyu Zhang (张耀宇) - Initial work
Introduction
This document suggests a protocol of mathematical notation for machine learning.
The field of machine learning has been evolving rapidly in recent years, and communication between different researchers and research groups has become increasingly important. A key challenge for communication arises from inconsistent notation across papers. This proposal suggests a standard for commonly used mathematical notation in machine learning. In this first version, only some notation is covered; more will be added later. The proposal will be updated regularly as the field progresses, and we look forward to suggestions for improving it in future versions.
Table of Contents
- Dataset
- Function
- Loss function
- Activation function
- Two-layer neural network
- General deep neural network
- Complexity
- Training
- Fourier Frequency
- Convolution
- Notation table
- L-layer neural network
- Acknowledgements
Dataset
A dataset $S=\{\mathbf{z}_i\}_{i=1}^{n}=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{n}$ is sampled from a distribution $\mathcal{D}$ over a domain $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$.
- $\mathcal{X}$ is the instances domain (a set)
- $\mathcal{Y}$ is the label domain (a set)
- $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ is the example domain (a set)
Usually, $\mathcal{X}$ is a subset of $\mathbb{R}^d$ and $\mathcal{Y}$ is a subset of $\mathbb{R}^{d_{\rm o}}$, where $d$ is the input dimension and $d_{\rm o}$ is the output dimension. $n$ denotes the number of samples.
Function
A hypothesis space is denoted by $\mathcal{H}$. A hypothesis function is denoted by $f_{\mathbf{\theta}}\in\mathcal{H}$ or $f(\mathbf{x};\mathbf{\theta})$ with $f_{\mathbf{\theta}}:\mathcal{X}\to\mathcal{Y}$, where $\mathbf{\theta}$ denotes the set of parameters of $f_{\mathbf{\theta}}$.
If there exists a target function, it is denoted by $f$ or $f^*:\mathcal{X}\to\mathcal{Y}$ satisfying $\mathbf{y}_i=f^*(\mathbf{x}_i)$ for $i=1,\dots,n$.
Loss function
A loss function, denoted by $\ell:\mathcal{H}\times\mathcal{Z}\to\mathbb{R}_+:=[0,+\infty)$, measures the difference between a predicted label and a true label, e.g.,

- $L^2$ loss: $\ell(f_{\mathbf{\theta}},\mathbf{z})=(f_{\mathbf{\theta}}(\mathbf{x})-\mathbf{y})^2$, where $\mathbf{z}=(\mathbf{x},\mathbf{y})$. $\ell(f_{\mathbf{\theta}},\mathbf{z})$ can also be written as $\ell(f_{\mathbf{\theta}},\mathbf{y})$ for convenience.
Empirical risk or training loss for a set $S=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{n}$ is denoted by $L_S(\mathbf{\theta})$ or $L_n(\mathbf{\theta})$,

$$ L_S(\mathbf{\theta})=\frac{1}{n}\sum_{i=1}^{n}\ell(f_{\mathbf{\theta}}(\mathbf{x}_i),\mathbf{y}_i). $$

The population risk or expected loss is denoted by $L_{\mathcal{D}}(\mathbf{\theta})$,

$$ L_{\mathcal{D}}(\mathbf{\theta})=\mathbb{E}_{\mathcal{D}}\ell(f_{\mathbf{\theta}}(\mathbf{x}),\mathbf{y}), $$

where $\mathbf{z}=(\mathbf{x},\mathbf{y})$ follows the distribution $\mathcal{D}$.
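As a small illustration of this notation, the sketch below (a minimal NumPy example; the function and variable names are ours, not part of the suggested notation) computes the $L^2$ loss and the empirical risk $L_S(\mathbf{\theta})$ on a toy sample set $S$.

```python
import numpy as np

def l2_loss(f_theta, x, y):
    """L^2 loss: ell(f_theta, z) = (f_theta(x) - y)^2 for one example z = (x, y)."""
    return (f_theta(x) - y) ** 2

def empirical_risk(f_theta, X, Y):
    """Empirical risk L_S(theta): average loss over the sample set S = {(x_i, y_i)}_{i=1}^n."""
    n = len(X)
    return sum(l2_loss(f_theta, X[i], Y[i]) for i in range(n)) / n

# Toy usage: a fixed hypothesis f_theta(x) = 2x on n = 3 one-dimensional samples.
X = np.array([0.0, 1.0, 2.0])
Y = np.array([0.0, 2.1, 3.9])
print(empirical_risk(lambda x: 2.0 * x, X, Y))
```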
Activation function
An activation function is denoted by $\sigma(x)$.
Example 1. Some commonly used activation functions are
- $\sigma(x)=\text{ReLU}(x)=\text{max}(0,x)$
- $\sigma(x)=\text{sigmoid}(x)=\dfrac{1}{1+e^{-x}}$
- $\sigma(x)=\tanh(x)$
- $\sigma(x)=\cos x, \sin x$
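These activations can be written directly in NumPy; the snippet below is only an illustrative sketch, not part of the notation proposal.

```python
import numpy as np

def relu(x):
    """sigma(x) = ReLU(x) = max(0, x), applied entry-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-1.0, 0.0, 1.0])
print(relu(x), sigmoid(x), np.tanh(x), np.cos(x), np.sin(x))
```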
Two-layer neural network
The neuron number of the hidden layer is denoted by $m$. The two-layer neural network is

$$ f_{\mathbf{\theta}}(\mathbf{x})=\sum_{j=1}^{m}a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j), $$

where $\sigma$ is the activation function, $\mathbf{w}_j$ is the input weight, $a_j$ is the output weight, and $b_j$ is the bias term. We denote the set of parameters by $\mathbf{\theta}=(a_1,\dots,a_m,\mathbf{w}_1,\dots,\mathbf{w}_m,b_1,\dots,b_m)$.
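A minimal NumPy sketch of this two-layer network follows; the array shapes and names are our own illustrative choices.

```python
import numpy as np

def two_layer_nn(x, a, W, b, sigma=np.tanh):
    """f_theta(x) = sum_j a_j * sigma(w_j . x + b_j).

    x: input of shape (d,); W: input weights of shape (m, d) whose rows are w_j;
    a: output weights of shape (m,); b: bias terms of shape (m,).
    """
    return a @ sigma(W @ x + b)

# Toy usage with d = 3 inputs and m = 5 hidden neurons.
rng = np.random.default_rng(0)
d, m = 3, 5
x = rng.standard_normal(d)
print(two_layer_nn(x, rng.standard_normal(m), rng.standard_normal((m, d)), rng.standard_normal(m)))
```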
General deep neural network
The counting of the layer number excludes the input layer. An $L$-layer neural network is denoted by

$$ f_{\mathbf{\theta}}(\mathbf{x})=\mathbf{W}^{[L-1]}\sigma\circ(\mathbf{W}^{[L-2]}\sigma\circ(\cdots(\mathbf{W}^{[1]}\sigma\circ(\mathbf{W}^{[0]}\mathbf{x}+\mathbf{b}^{[0]})+\mathbf{b}^{[1]})\cdots)+\mathbf{b}^{[L-2]})+\mathbf{b}^{[L-1]}, $$

where $\mathbf{W}^{[l]}\in\mathbb{R}^{m_{l+1}\times m_l}$, $\mathbf{b}^{[l]}\in\mathbb{R}^{m_{l+1}}$, $m_0=d$, $m_L=d_{\rm o}$, $\sigma$ is a scalar activation function, and "$\circ$" means entry-wise operation. We denote the set of parameters by

$$ \mathbf{\theta}=(\mathbf{W}^{[0]},\mathbf{W}^{[1]},\dots,\mathbf{W}^{[L-1]},\mathbf{b}^{[0]},\mathbf{b}^{[1]},\dots,\mathbf{b}^{[L-1]}). $$
This can also be defined recursively,
$$ f^{[0]}_{\mathbf{\theta}}(\mathbf{x})=\mathbf{x}, $$

$$ f^{[l]}_{\mathbf{\theta}}(\mathbf{x})=\sigma\circ(\mathbf{W}^{[l-1]}f^{[l-1]}_{\mathbf{\theta}}(\mathbf{x})+\mathbf{b}^{[l-1]}), \quad 1\le l\le L-1, $$

$$ f_{\mathbf{\theta}}(\mathbf{x})=f^{[L]}_{\mathbf{\theta}}(\mathbf{x})=\mathbf{W}^{[L-1]}f^{[L-1]}_{\mathbf{\theta}}(\mathbf{x})+\mathbf{b}^{[L-1]}. $$
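The recursive definition translates directly into code. The sketch below (illustrative names; it assumes the same activation $\sigma$ in every hidden layer) evaluates $f_{\mathbf{\theta}}(\mathbf{x})$ from the lists of weights $\mathbf{W}^{[l]}$ and biases $\mathbf{b}^{[l]}$.

```python
import numpy as np

def deep_nn(x, Ws, bs, sigma=np.tanh):
    """Evaluate an L-layer network by the recursion
    f^[0](x) = x,
    f^[l](x) = sigma o (W^[l-1] f^[l-1](x) + b^[l-1]),  1 <= l <= L-1,
    f(x)     = W^[L-1] f^[L-1](x) + b^[L-1]             (no activation on the output layer).

    Ws: list [W^[0], ..., W^[L-1]] with W^[l] of shape (m_{l+1}, m_l);
    bs: list [b^[0], ..., b^[L-1]] with b^[l] of shape (m_{l+1},).
    """
    f = x                                   # f^[0](x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers l = 1, ..., L-1
        f = sigma(W @ f + b)
    return Ws[-1] @ f + bs[-1]              # output layer, affine only

# Toy usage: L = 3 layers with widths m_0 = 4 (= d), m_1 = 8, m_2 = 8, m_3 = 2 (= d_o).
rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]
Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(3)]
bs = [rng.standard_normal(widths[l + 1]) for l in range(3)]
print(deep_nn(rng.standard_normal(4), Ws, bs))
```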
Complexity
The VC-dimension of a hypothesis class $\mathcal{H}$ is denoted by $\text{VCdim}(\mathcal{H})$.
The Rademacher complexity of a hypothesis space $\mathcal{H}$ on a sample set $S$ is denoted by $\text{Rad}_S(\mathcal{H})$ or $\text{Rad}(\mathcal{H}\circ S)$. The Rademacher complexity over samples of size $n$ is denoted by $\text{Rad}_n(\mathcal{H})$.
Training
Gradient Descent is often denoted by GD. Stochastic Gradient Descent is often denoted by SGD.
A batch set is denoted by $B$ and the batch size is denoted by $b$.
The learning rate is denoted by $\eta$.
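The sketch below shows how this notation fits together in a plain SGD loop for a linear model with the $L^2$ loss; the model, step count, and numerical values are our own illustrative choices.

```python
import numpy as np

def sgd(theta, X, Y, eta=0.1, b=4, steps=100, seed=0):
    """Plain SGD on the empirical L^2 risk of a linear model f_theta(x) = theta . x.

    At each step, a batch set B of size b is sampled, and theta is updated by
    theta <- theta - eta * grad_theta L_B(theta).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(steps):
        B = rng.choice(n, size=b, replace=False)        # indices of the batch set
        residual = X[B] @ theta - Y[B]                   # f_theta(x_i) - y_i over the batch
        grad = 2.0 * X[B].T @ residual / b               # gradient of the batch loss
        theta = theta - eta * grad                       # gradient step with learning rate eta
    return theta

# Toy usage: recover theta* = (1, -2) from noisy samples.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 2))
Y = X @ np.array([1.0, -2.0]) + 0.01 * rng.standard_normal(64)
print(sgd(np.zeros(2), X, Y))
```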
Fourier Frequency
The discretized frequency is denoted by $\mathbf{k}$, and the continuous frequency is denoted by $\boldsymbol{\xi}$.
Convolution
The convolution operation is denoted by $*$.
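As a purely illustrative sketch (not part of the notation proposal), the discrete convolution denoted by $*$ can be computed with NumPy:

```python
import numpy as np

# (f * g)[n] = sum_m f[m] g[n - m]  -- discrete convolution of two 1-D sequences.
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(np.convolve(f, g))  # full convolution, length len(f) + len(g) - 1
```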
Notation table
| symbol | meaning | Latex | simplified |
| --- | --- | --- | --- |
| $\mathbf{x}$ | input | `\bm{x}` | `\mathbf{x}` |
| $\mathbf{y}$ | output, label | `\bm{y}` | `\vy` |
| $d$ | input dimension | `d` | |
| $d_{\rm o}$ | output dimension | `d_{\rm o}` | |
| $n$ | number of samples | `n` | |
| $\mathcal{X}$ | instances domain (a set) | `\mathcal{X}` | `\fX` |
| $\mathcal{Y}$ | labels domain (a set) | `\mathcal{Y}` | `\fY` |
| $\mathcal{Z}$ | $=\mathcal{X}\times\mathcal{Y}$ example domain (a set) | `\mathcal{Z}` | `\fZ` |
| $\mathcal{H}$ | hypothesis space (a set) | `\mathcal{H}` | `\mathcal{H}` |
| $\mathbf{\theta}$ | a set of parameters | `\bm{\theta}` | `\mathbf{\theta}` |
| $f_{\mathbf{\theta}}$ | hypothesis function | `f_{\bm{\theta}}` | `f_{\mathbf{\theta}}` |
| $f$, $f^*$ | target function | `f, f^*` | |
| $\ell$ | loss function | `\ell` | |
| $\mathcal{D}$ | distribution of $\mathcal{Z}$ | `\mathcal{D}` | `\fD` |
| $S=\{\mathbf{z}_i\}_{i=1}^n$ | $=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n$ sample set | | |
| $L_S(\mathbf{\theta})$, $L_n(\mathbf{\theta})$ | empirical risk or training loss | | |
| $L_{\mathcal{D}}(\mathbf{\theta})$ | population risk or expected loss | | |
| $\sigma$ | activation function | `\sigma` | |
| $\mathbf{w}_j$ | input weight | `\bm{w}_j` | `\mathbf{w}_j` |
| $a_j$ | output weight | `a_j` | |
| $b_j$ | bias term | `b_j` | |
| $f_{\mathbf{\theta}}(\mathbf{x})$ | neural network | `f_{\bm{\theta}}` | `f_{\mathbf{\theta}}` |
| $\sum_{j=1}^{m}a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j)$ | two-layer neural network | | |
| $\text{VCdim}(\mathcal{H})$ | VC-dimension of $\mathcal{H}$ | | |
| $\text{Rad}_S(\mathcal{H})$, $\text{Rad}(\mathcal{H}\circ S)$ | Rademacher complexity of $\mathcal{H}$ on $S$ | | |
| $\text{Rad}_n(\mathcal{H})$ | Rademacher complexity over samples of size $n$ | | |
| GD | gradient descent | | |
| SGD | stochastic gradient descent | | |
| $B$ | a batch set | `B` | |
| $b$ | batch size | `b` | |
| $\eta$ | learning rate | `\eta` | |
| $\mathbf{k}$ | discretized frequency | `\bm{k}` | `\mathbf{k}` |
| $\boldsymbol{\xi}$ | continuous frequency | `\bm{\xi}` | `\mathbf{\xi}` |
| $*$ | convolution operation | `*` | |
L-layer neural network
| symbol | meaning | Latex | simplified |
| --- | --- | --- | --- |
| $d$ | input dimension | `d` | |
| $d_{\rm o}$ | output dimension | `d_{\rm o}` | |
| $m_l$ | the number of neurons in the $l$-th layer | `m_l` | |
| $\mathbf{W}^{[l]}$ | the $l$-th layer weight matrix | `\bm{W}^{[l]}` | `\mathbf{W}^{[l]}` |
| $\mathbf{b}^{[l]}$ | the $l$-th layer bias term | `\bm{b}^{[l]}` | `\mathbf{b}^{[l]}` |
| $\circ$ | entry-wise operation | `\circ` | |
| $\sigma$ | activation function | `\sigma` | |
| $\mathbf{\theta}$ | the set of parameters | `\bm{\theta}` | `\mathbf{\theta}` |
Acknowledgements
Chenglong Bao (Tsinghua), Zhengdao Chen (NYU), Bin Dong (Peking), Weinan E (Princeton), Quanquan Gu (UCLA), Kaizhu Huang (XJTLU), Shi Jin (SJTU), Jian Li (Tsinghua), Lei Li (SJTU), Tiejun Li (Peking), Zhenguo Li (Huawei), Zhemin Li (NUDT), Shaobo Lin (XJTU), Ziqi Liu (CSRC), Zichao Long (Peking), Chao Ma (Princeton), Chao Ma (SJTU), Yuheng Ma (WHU), Dengyu Meng (XJTU), Wang Miao (Peking), Pingbing Ming (CAS), Zuoqiang Shi (Tsinghua), Jihong Wang (CSRC), Liwei Wang (Peking), Bican Xia (Peking), Zhouwang Yang (USTC), Haijun Yu (CAS), Yang Yuan (Tsinghua), Cheng Zhang (Peking), Lulu Zhang (SJTU), Jiwei Zhang (WHU), Pingwen Zhang (Peking), Xiaoqun Zhang (SJTU), Chengchao Zhao (CSRC), Zhanxing Zhu (Peking), Chuan Zhou (CAS), Xiang Zhou (cityU).