ML Papers Explained
Explanations of key concepts in ML
Language Models
Paper | Date | Description |
---|---|---|
Transformer | June 2017 | An encoder-decoder model that introduced the multi-head attention mechanism for language translation. |
ELMo | February 2018 | Deep contextualized word representations that capture both intricate aspects of word usage and variations across linguistic contexts. |
GPT | June 2018 | A Decoder only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. |
BERT | October 2018 | Introduced pre-training for Encoder Transformers. Uses unified architecture across different tasks. |
Transformer XL | January 2019 | Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism. |
XLNet | June 2019 | Extension of the Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives. |
RoBERTa | July 2019 | Built upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks. |
Sentence BERT | August 2019 | A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine-similarity. |
Tiny BERT | September 2019 | Uses attention transfer, and task specific distillation for distilling BERT. |
ALBERT | September 2019 | Presents certain parameter reduction techniques to lower memory consumption and increase the training speed of BERT. |
Distil BERT | October 2019 | Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective. |
T5 | October 2019 | A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format. |
BART | October 2019 | An encoder-decoder model pretrained to reconstruct the original text from corrupted versions of it. |
FastBERT | April 2020 | A speed-tunable encoder with adaptive inference time having branches at each transformer output to enable early outputs. |
MobileBERT | April 2020 | Compressed and faster version of the BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer. |
Longformer | April 2020 | Introduces a linearly scalable attention mechanism, allowing it to handle texts of extended length. |
DeBERTa | June 2020 | Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training. |
Codex | July 2021 | A GPT language model finetuned on publicly available code from GitHub. |
FLAN | September 2021 | An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions. |
Gopher | December 2021 | Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B on 152 tasks. |
Instruct GPT | March 2022 | Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent. |
Chinchilla | March 2022 | Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws). |
PaLM | April 2022 | A 540B-parameter, densely activated Transformer trained using Pathways (an ML system that enables highly efficient training across multiple TPU Pods). |
OPT | May 2022 | A suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. OPT-175B is comparable to GPT-3. |
BLOOM | November 2022 | A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology. |
Galactica | November 2022 | An LLM trained on scientific data thus specializing in scientific knowledge. |
ChatGPT | November 2022 | An interactive model designed to engage in conversations, built on top of GPT 3.5. |
LLaMA | February 2023 | A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained using publicly available datasets exclusively. |
Alpaca | March 2023 | A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. |
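
Several entries above rely on attention, and the decoder-only models are described as autoregressive. The following is a minimal single-head sketch of scaled dot-product attention with an optional causal mask, written in plain Python. The function names `softmax` and `attention` are illustrative assumptions, not code from any of the papers, and learned projections and multiple heads are left out.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=True, position i only attends to positions j <= i,
    which is the masking used for autoregressive decoding.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)])
    return outputs

# Hypothetical toy inputs: two positions, two dimensions.
x = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
masked = attention(x, x, v, causal=True)  # first row can only see itself
```

With the causal mask, the first output row is exactly the first value vector, since position 0 has nothing earlier to attend to.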
Vision Models
Paper | Date | Description |
---|---|---|
Vision Transformer | October 2020 | Images are split into patches, which are treated as tokens; a sequence of linear embeddings of these patches is input to a Transformer. |
DeiT | December 2020 | A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens. |
Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
BEiT | June 2021 | Utilizes a masked image modeling task, inspired by BERT, involving image patches and visual tokens to pretrain vision Transformers. |
MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
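
The patch-based tokenization described for the Vision Transformer can be sketched as follows. This is a simplified illustration in plain Python that assumes a single-channel image stored as a list of rows. The name `patchify` is hypothetical, and the learned linear embedding applied to each flattened patch is omitted.

```python
def patchify(image, patch):
    """Split a 2-D image (list of rows) into flattened square patches.

    Patches are returned in row-major order, which is the token order
    a Transformer would receive after linear embedding.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches

# A hypothetical 4x4 image with pixel values 0..15, split into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # 4 patches, each with 4 values
```

An image of height H and width W yields (H / P) * (W / P) tokens for patch size P, so smaller patches mean longer sequences.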
Convolutional Neural Networks
Paper | Date | Description |
---|---|---|
Lenet | December 1998 | Introduced Convolutions. |
Alex Net | September 2012 | Introduced ReLU activation and Dropout to CNNs. Winner ILSVRC 2012. |
VGG | September 2014 | Used a large number of small filters in each layer to learn complex features. Achieved SOTA in ILSVRC 2014. |
Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
Inception Net v2 / Inception Net v3 | December 2015 | Design Optimizations of the Inception Modules which improved performance and accuracy. |
Res Net | December 2015 | Introduced residual connections, which are shortcuts that bypass one or more layers in the network. Winner ILSVRC 2015. |
Inception Net v4 / Inception ResNet | February 2016 | Hybrid approach combining Inception Net and ResNet. |
Dense Net | August 2016 | Each layer receives input from all the previous layers, creating a dense network of connections between the layers and allowing the network to learn more diverse features. |
Xception | October 2016 | Based on InceptionV3 but uses depthwise separable convolutions instead of inception modules. |
Res Next | November 2016 | Built over ResNet, introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups. |
Mobile Net V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and computation required. |
Mobile Net V2 | January 2018 | Built upon the MobileNetv1 architecture, uses inverted residuals and linear bottlenecks. |
Mobile Net V3 | May 2019 | Uses AutoML to find the best possible neural network architecture for a given problem. |
Efficient Net | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve a high accuracy with a relatively low computational cost. |
Conv Mixer | January 2022 | Processes image patches using standard convolutions for mixing spatial and channel dimensions. |
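
The parameter savings from depthwise separable convolutions, mentioned for Xception and Mobile Net V1, can be checked with a short calculation. This sketch counts weights only (bias terms are ignored as a simplifying assumption), and the function names are illustrative.

```python
def standard_conv_params(kernel, c_in, c_out):
    # Every output channel has its own kernel x kernel filter per input channel.
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    # Depthwise step: one kernel x kernel filter per input channel.
    depthwise = kernel * kernel * c_in
    # Pointwise step: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)         # 18432 weights
separable = depthwise_separable_params(3, 32, 64)  # 2336 weights
ratio = separable / standard                       # equals 1/c_out + 1/kernel**2
```

For this layer the separable version needs roughly an eighth of the weights, which matches the ratio 1/64 + 1/9.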
Single Stage Object Detectors
Paper | Date | Description |
---|---|---|
SSD | December 2015 | Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map. |
Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples. |
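
The down-weighting of well-classified examples described for Focal Loss can be illustrated with a binary, single-example sketch. The function name is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are commonly used values taken here as an assumption rather than a recommendation.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p is the predicted probability of the positive class and y is the
    label (1 or 0). The factor (1 - p_t) ** gamma shrinks the loss for
    examples the model already classifies confidently.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident and correct: small loss
hard = focal_loss(0.6, 1)  # less confident: noticeably larger loss
```

Setting `gamma=0.0` and `alpha=1.0` for a positive example recovers the ordinary cross-entropy term `-log(p)`.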
Region-based Convolutional Neural Networks
Paper | Date | Description |
---|---|---|
RCNN | November 2013 | Uses selective search for region proposals, CNNs for feature extraction, SVM for classification followed by box offset regression. |
Fast RCNN | April 2015 | Processes entire image through CNN, employs RoI Pooling to extract feature vectors from ROIs, followed by classification and BBox regression. |
Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector, collaboratively predict object regions by sharing convolutional features. |
Mask RCNN | March 2017 | Extends Faster R-CNN to solve instance segmentation tasks, by adding a branch for predicting an object mask in parallel with the existing branch. |
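
The RoI Pooling step mentioned for Fast RCNN turns a region of any size into a fixed-size grid by max-pooling over bins. The sketch below is a simplified plain-Python version for a single-channel feature map. The name `roi_max_pool`, the `(top, left, bottom, right)` box convention with exclusive bottom and right edges, and the floor/ceil binning are assumptions made for illustration.

```python
import math

def roi_max_pool(feature_map, roi, out_size):
    """Max-pool a rectangular region into an out_size x out_size grid."""
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        # Bin edges use floor/ceil so every bin covers at least one cell.
        r0 = top + math.floor(i * height / out_size)
        r1 = top + math.ceil((i + 1) * height / out_size)
        row = []
        for j in range(out_size):
            c0 = left + math.floor(j * width / out_size)
            c1 = left + math.ceil((j + 1) * width / out_size)
            row.append(max(
                feature_map[r][c]
                for r in range(r0, r1)
                for c in range(c0, c1)
            ))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map with values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 4), 2)  # [[5, 7], [13, 15]]
crop = roi_max_pool(fmap, (1, 1, 4, 4), 2)   # [[10, 11], [14, 15]]
```

Both regions produce a 2x2 output even though one is 4x4 and the other 3x3, which is what lets the downstream classification and box regression layers use a fixed input size.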
Document AI
Paper | Date | Description |
---|---|---|
Table Net | January 2020 | An end-to-end deep learning model designed for both table detection and structure recognition. |
Donut | November 2021 | An OCR-free Encoder-Decoder Transformer model. The encoder takes in images, decoder takes in prompts & encoded images to generate the required text. |
DiT | March 2022 | An Image Transformer pre-trained (self-supervised) on document images. |
UDoP | December 2022 | Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation. |
Layout Transformers
Paper | Date | Description |
---|---|---|
Layout LM | December 2019 | Utilises BERT as the backbone, adds two new input embeddings: 2-D position embedding and image embedding (Only for downstream tasks). |
LamBERT | February 2020 | Utilises RoBERTa as the backbone and adds Layout embeddings along with relative bias. |
Layout LM v2 | December 2020 | Uses a multi-modal Transformer model, to integrate text, layout, and image in the pre-training stage, to learn end-to-end cross-modal interaction. |
Structural LM | May 2021 | Utilises BERT as the backbone and feeds text, 1D position, and cell-level 2D position embeddings to the transformer model. |
Doc Former | June 2021 | Encoder-only transformer with a CNN backbone for visual feature extraction, combines text, vision, and spatial features through a multi-modal self-attention layer. |
LiLT | February 2022 | Introduced Bi-directional attention complementation mechanism (BiACM) to accomplish the cross-modal interaction of text and layout. |
Layout LM V3 | April 2022 | A unified text-image multimodal Transformer that learns cross-modal representations, taking as input the concatenation of text embeddings and image embeddings. |
ERNIE Layout | October 2022 | Reorganizes tokens using layout information, combines text and visual embeddings, utilizes multi-modal transformers with spatial aware disentangled attention. |
Tabular Deep Learning
Paper | Date | Description |
---|---|---|
Entity Embeddings | April 2016 | Maps categorical variables into continuous vector spaces through neural network learning, revealing intrinsic properties. |
Wide and Deep Learning | June 2016 | Combines memorization of specific patterns with generalization of similarities. |
Deep and Cross Network | August 2017 | Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. |
Tab Transformer | December 2020 | Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings. |
Tabular ResNet | June 2021 | An MLP with skip connections. |
Feature Tokenizer Transformer | June 2021 | Transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. |
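
The idea of turning every tabular feature into an embedding, as described for the Feature Tokenizer Transformer, can be sketched as below. This is a simplified illustration under stated assumptions: numerical features are scaled by a per-feature weight vector plus a bias, categorical features are looked up in a per-feature table, and all parameters are fixed hypothetical values rather than learned ones. The name `tokenize_row` is not from the paper.

```python
def tokenize_row(numeric, categorical, num_weights, num_biases, cat_tables):
    """Map one table row to a list of embedding vectors (one per feature)."""
    tokens = []
    # Numerical feature: scalar value times a weight vector, plus a bias vector.
    for x, w, b in zip(numeric, num_weights, num_biases):
        tokens.append([x * wi + bi for wi, bi in zip(w, b)])
    # Categorical feature: look up the embedding for the category index.
    for index, table in zip(categorical, cat_tables):
        tokens.append(list(table[index]))
    return tokens

# Hypothetical row with one numerical and one categorical feature, embedding size 2.
tokens = tokenize_row(
    numeric=[2.0],
    categorical=[1],
    num_weights=[[1.0, 0.5]],
    num_biases=[[0.0, 1.0]],
    cat_tables=[[[0.1, 0.2], [0.3, 0.4]]],
)
# tokens == [[2.0, 2.0], [0.3, 0.4]]
```

The resulting list of same-sized vectors is the kind of sequence that a stack of Transformer layers can then process.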
Miscellaneous
Paper | Date | Description |
---|---|---|
ColD Fusion | December 2022 | A method enabling the benefits of multitask learning through distributed computation without data sharing and improving model performance. |
Literature Reviewed
- Convolutional Neural Networks
- Layout Transformers
- Region-based Convolutional Neural Networks
- Tabular Deep Learning
Reading Lists
- Language Models
- Layout Transformers
- Object Detection
- RCNNs
- Vision Models
- Document Information Processing
Reach out to Ritvik or Elvis if you have any questions.
If you are interested in contributing, feel free to open a PR.