2021: A Year Full of Amazing AI papers- A Review
A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.
While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, like ethics, important biases, governance, transparency and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful with which technology we choose to apply.
"Science cannot tell us what we ought to do, only what we can do."
- Jean-Paul Sartre, Being and Nothingness
Here are the most interesting research papers of the year, in case you missed any of them. In short, it is a curated list of the latest breakthroughs in AI and Data Science by release date with a clear video explanation, link to a more in-depth article, and code (if applicable). Enjoy the read!
The complete reference to each paper is listed at the end of this repository. Star this repository to stay up to date!
Maintainer: louisfb01
Subscribe to my newsletter - The latest updates in AI explained every week.
Feel free to message me any interesting paper I may have missed to add to this repository.
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
Watch a complete 2021 rewind in 15 minutes
If you are interested in Computer Vision research, here is another great repository for you:
A curated list of the top 10 CV publications in 2021 with a clear video explanation, link to a more in-depth article, and code.
The Top 10 Computer Vision Papers of 2021
The Full List
- DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
- VOGUE: Try-On by StyleGAN Interpolation Optimization [2]
- Taming Transformers for High-Resolution Image Synthesis [3]
- Thinking Fast And Slow in AI [4]
- Automatic detection and quantification of floating marine macro-litter in aerial images [5]
- ShaRF: Shape-conditioned Radiance Fields from a Single View [6]
- Generative Adversarial Transformers [7]
- We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]
- IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING [10]
- Deep nets: What have they ever done for vision? [11]
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]
- Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]
- Total Relighting: Learning to Relight Portraits for Background Replacement [14]
- LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]
- Enhancing Photorealism Enhancement [16]
- DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]
- High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]
- Barbershop: GAN-based Image Compositing using Segmentation Masks [19]
- TextStyleBrush: Transfer of text aesthetics from a single example [20]
- Animating Pictures with Eulerian Motion Fields [21]
- CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [22]
- GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]
- Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]
- Image Synthesis and Editing with Stochastic Differential Equations [25]
- Sketch Your Own GAN [26]
- Tesla's Autopilot Explained [27]
- Styleclip: Text-driven manipulation of StyleGAN imagery [28]
- TimeLens: Event-based Video Frame Interpolation [29]
- Diverse Generation from a Single Video Made Possible [30]
- Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]
- The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]
- ADOP: Approximate Differentiable One-Pixel Point Rendering [33]
- (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]
- SwinIR: Image restoration using swin transformer [35]
- EditGAN: High-Precision Semantic Image Editing [36]
- CityNeRF: Building NeRF at City Scale [37]
- ClipCap: CLIP Prefix for Image Captioning [38]
- Paper references
DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
- Short Video Explanation:
- Short read: OpenAI's DALL·E: Text-to-Image Generation Explained
- Paper: Zero-Shot Text-to-Image Generation
- Code: Code & more information for the discrete VAE used for DALL·E
VOGUE: Try-On by StyleGAN Interpolation Optimization [2]
Google used a modified StyleGAN2 architecture to create an online fitting room where you can automatically try on any pants or shirts you want using only an image of yourself.
- Short Video Explanation:
- Short read: The AI-Powered Online Fitting Room: VOGUE
- Paper: VOGUE: Try-On by StyleGAN Interpolation Optimization
Taming Transformers for High-Resolution Image Synthesis [3]
TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.
- Short Video Explanation:
- Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Code: Taming Transformers
Thinking Fast And Slow in AI [4]
Drawing inspiration from Human Capabilities Towards a more general and trustworthy AI & 10 Questions for the AI Research Community.
- Short Video Explanation:
- Short read: Third Wave of AI | Thinking Fast and Slow
- Paper: Thinking Fast And Slow in AI
Automatic detection and quantification of floating marine macro-litter in aerial images [5]
Odei Garcia-Garin et al. from the University of Barcelona have developed a deep learning-based algorithm able to detect and quantify floating garbage from aerial images. They also made a web-oriented application allowing users to identify this garbage, called floating marine macro-litter, or FMML, within images of the sea surface.
- Short Video Explanation:
- Short read: An AI Software Able To Detect and Count Plastic Waste in the Ocean
- Paper: Automatic detection and quantification of floating marine macro-litter in aerial images: Introducing a novel deep learning approach connected to a web application in R, Environmental Pollution
- Click here for the code
ShaRF: Shape-conditioned Radiance Fields from a Single View [6]
Just imagine how cool it would be to just take a picture of an object and have it in 3D to insert in the movie or video game you are creating or in a 3D scene for an illustration.
- Short Video Explanation:
- Short read: ShaRF: Take a Picture From a Real-Life Object, and Create a 3D Model of It
- Paper: ShaRF: Shape-conditioned Radiance Fields from a Single View
- Click here for the code
Generative Adversarial Transformers [7]
They basically leverage transformers' attention mechanism in the powerful StyleGAN2 architecture to make it even more powerful!
- Short Video Explanation:
- Short read: GANsformers: Scene Generation with Generative Adversarial Transformers
- Paper: Generative Adversarial Transformers
- Click here for the code
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]
Would you swipe right on an AI profile? Can you distinguish an actual human from a machine? This is what this study reveals using AI-made-up people on dating apps.
- Short Video Explanation:
- Short read: Would You Swipe Right on an AI Profile?
- Paper: We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right?
- Click here for the code
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]
Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
- Short Video Explanation:
- Short read: Will Transformers Replace CNNs in Computer Vision?
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Click here for the code
IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING [10]
This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!
- Short Video Explanation:
- Short read: Create 3D Models from Images! GANverse3D & NVIDIA Omniverse
- Paper: IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING
Deep nets: What have they ever done for vision? [11]
"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."
- Short Video Explanation:
- Short read: What is the state of AI in computer vision?
- Paper: Deep nets: What have they ever done for vision?
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]
The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!
- Short Video Explanation:
- Short read: Infinite Nature: Fly into an image and explore the landscape
- Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
- Click here for the code
- Colab demo
Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]
With this AI-powered nerve interface, the amputee can control a neuroprosthetic hand with life-like dexterity and intuitiveness.
- Short Video Explanation:
- Short read: An amputee with an AI-Powered Hand!
𦾠- Paper: Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control
Total Relighting: Learning to Relight Portraits for Background Replacement [14]
Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you've already tried that, you know that it isn't simple. You can't just take a picture of yourself in your home and change the background for a beach. It just looks bad and unrealistic. Anyone will say "that's photoshopped" in a second. For movies and professional videos, you need the perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?
- Short Video Explanation:
- Short read: Realistic Lighting on Different Backgrounds
- Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]
Generate 3D models of moving humans or animals from only a short video as input. Indeed, this new method actually understands that the subject is an odd shape, that it can move, but that it still needs to stay attached, as this is still one "object" and not just many objects together...
- Short Video Explanation:
- Short read: Articulated 3D Reconstruction from Videos
- Paper: LASR: Learning Articulated Shape Reconstruction from a Monocular Video
- Click here for the code
Enhancing Photorealism Enhancement [16]
This AI can be applied live to the video game and transform every frame to look much more natural. The researchers from Intel Labs just published this paper called Enhancing Photorealism Enhancement. And if you think that this may be "just another GAN," taking a picture of the video game as an input and changing it following the style of the natural world, let me change your mind. They worked on this model for two years to make it extremely robust. Just imagine the possibilities where you can put a lot less effort into the game graphics, make the game super stable and complete, then improve the style using this model...
- Short Video Explanation:
- Short read: Is AI The Future Of Video Game Design? Enhancing Photorealism Enhancement
- Paper: Enhancing Photorealism Enhancement
- Click here for the code
DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]
How to Spot a Deep Fake in 2021. Breakthrough US Army technology using artificial intelligence to find deepfakes.
While they seem like they've always been there, the very first realistic deepfake didn't appear until 2017. We went from the first automatically generated fake images that merely resembled someone to today's identical copies of people in videos, with sound.
The reality is that we cannot see the difference between a real video or picture and a deepfake anymore. How can we tell what's real from what isn't? How can audio files or video files be used in court as proof if an AI can entirely generate them? Well, this new paper may provide answers to these questions. And the answer here may again be the use of artificial intelligence. The saying "I'll believe it when I see it" may soon change to "I'll believe it when the AI tells me to believe it..."
- Short Video Explanation:
- Short read: How to Spot a Deep Fake. Breakthrough US Army technology (2021)
- Paper: DefakeHop: A Light-Weight High-Performance Deepfake Detector
High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]
Apply any style to your 4K image in real-time using this new machine learning-based approach!
- Short Video Explanation:
- Short read: High-Resolution Photorealistic Image Translation in Real-Time
- Paper: High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network
- Click here for the code
Barbershop: GAN-based Image Compositing using Segmentation Masks [19]
This article is not about a new technology in itself. Instead, it is about a new and exciting application of GANs. Indeed, you saw the title, and it wasn't clickbait. This AI can transfer a hairstyle onto your picture so you can see what it would look like before committing to the change...
- Short Video Explanation:
- Short read: Barbershop: Try Different Hairstyles and Hair Colors from Pictures (GANs)
- Paper: Barbershop: GAN-based Image Compositing using Segmentation Masks
- Click here for the code
TextStyleBrush: Transfer of text aesthetics from a single example [20]
This new Facebook AI model can translate or edit text directly in the image in your own language, following the same style!
Imagine you are on vacation in another country where you do not speak the language. You want to try out a local restaurant, but the menu is in a language you don't speak. I think this won't be too hard to imagine, as most of us have already faced this situation, whether it was menu items or directions you couldn't understand. Well, in 2020, you would take out your phone and use Google Translate on what you see. In 2021, you don't even need to open Google Translate anymore and type what you see piece by piece to translate it. Instead, you can simply use this new model by Facebook AI to translate every text in the image into your own language...
- Short Video Explanation:
- Short read: Translate or Edit Text from Images Emulating the Style: TextStyleBrush
- Paper: TextStyleBrush: Transfer of text aesthetics from a single example
- Click here for the code
If you'd like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.
Animating Pictures with Eulerian Motion Fields [21]
This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...
- Short Video Explanation:
- Short read: Create Realistic Animated Looping Videos from Pictures
- Paper: Animating Pictures with Eulerian Motion Fields
- Click here for the code
CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [22]
Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!
- Short Video Explanation:
- Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
- Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
- Click here for the code
GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]
Find out how this new model from OpenAI generates code from words!
- Short Video Explanation:
- Short read: OpenAI's New Code Generator: GitHub Copilot (and Codex)
- Paper: Evaluating Large Language Models Trained on Code
- Click here for the code
Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]
Using multiple machine learning-based algorithms running privately on your device, Apple allows you to accurately curate and organize your images and videos on iOS 15.
- Short Video Explanation:
- Short read: How Apple Photos Recognizes People in Private Photos Using Machine Learning
- Paper: Recognizing People in Photos Through Private On-Device Machine Learning
Image Synthesis and Editing with Stochastic Differential Equations [25]
Say goodbye to complex GAN and transformer architectures for image generation! This new method by Chenlin Meng et al. from Stanford University and Carnegie Mellon University can generate new images from any user-based inputs. Even people like me with zero artistic skills can now generate beautiful images or modifications out of quick sketches...
- Short Video Explanation:
- Short read: Image Synthesis and Editing from Sketches: SDEdit. No more tedious training needed!
- Paper: Image Synthesis and Editing with Stochastic Differential Equations
- Click here for the code
- Colab demo
Sketch Your Own GAN [26]
Make GAN training easier for everyone by generating images following a sketch! Indeed, with this new method, you can control your GAN's outputs based on the simplest type of knowledge you could provide it: hand-drawn sketches.
- Short Video Explanation:
- Short read: Make GANs Training Easier for Everyone : Generate Images Following a Sketch
- Paper: Sketch Your Own GAN
- Click here for the code
Tesla's Autopilot Explained [27]
If you wonder how a Tesla car can not only see but navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's autopilot works from the image acquisition through their eight cameras to the navigation process on the roads.
- Short Video Explanation:
- Short read: Tesla's Autopilot Explained
Styleclip: Text-driven manipulation of StyleGAN imagery [28]
AI could generate images, then, using a lot of brainpower and trial and error, researchers could control the results following specific styles. Now, with this new model, you can do that using only text!
- Short Video Explanation:
- Short read: Manipulate Real Images With Text - An AI For Creative Artists! StyleCLIP Explained
- Paper: Styleclip: Text-driven manipulation of StyleGAN imagery.
- Click here for the code
- Colab demo
TimeLens: Event-based Video Frame Interpolation [29]
TimeLens can understand the movement of the particles in-between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any other model could reach before!
- Short Video Explanation:
- Short read: How to Make Slow Motion Videos With AI!
- Paper: TimeLens: Event-based Video Frame Interpolation
- Click here for the code
Diverse Generation from a Single Video Made Possible [30]
Have you ever wanted to edit a video?
Remove or add someone, change the background, make it last a bit longer, or change the resolution to fit a specific aspect ratio without compressing or stretching it. For those of you who already ran advertisement campaigns, you certainly wanted to have variations of your videos for A/B testing to see what works best. Well, this new research by Niv Haim et al. can help you do all of these out of a single video and in HD!
Indeed, using a simple video, you can perform any of the tasks I just mentioned in seconds or a few minutes for high-quality videos. You can basically use it for any video manipulation or video generation application you have in mind. It even outperforms GANs in all ways and doesn't use any fancy deep learning research, nor does it require a huge and impractical dataset! And the best thing is that this technique is scalable to high-resolution videos.
- Short Video Explanation:
- Short read: Generate Video Variations - No dataset or deep learning required!
- Paper: Diverse Generation from a Single Video Made Possible
- Click here for the code
Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]
DeepMind just released a generative model able to outperform widely-used nowcasting methods in 89% of situations for its accuracy and usefulness, as assessed by more than 50 expert meteorologists! Their model focuses on predicting precipitation in the next 2 hours and achieves that surprisingly well. It is a generative model, which means that it will generate the forecasts instead of simply predicting them. It basically takes radar data from the past to create future radar data. So using both time and spatial components from the past, they can generate what it will look like in the near future.
You can see this as the same as Snapchat filters, taking your face and generating a new face with modifications on it. To train such a generative model, you need a bunch of data from both the human faces and the kind of face you want to generate. Then, using a very similar model trained for many hours, you will have a powerful generative model. This kind of model often uses GAN architectures for training purposes and then uses the generator model independently.
- Short Video Explanation:
- Short read: DeepMind uses AI to Predict More Accurate Weather Forecasts
- Paper: Skillful Precipitation Nowcasting using Deep Generative Models of Radar
- Click here for the code
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]
Have you ever tuned in to a video or a TV show and the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling this task of identifying the right soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn the audio track you want up or down to give more importance to the speech than the music.
The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly. This is exactly what the researchers achieved.
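Once a soundtrack has been split into separate categories, rebalancing it comes down to a weighted sum of those categories. The sketch below assumes the separation has already been done and uses hand-written toy samples; the `remix` helper, the gain values, and the sample data are hypothetical illustrations and are not taken from the paper or its code.

```python
from typing import Dict, List


def remix(stems: Dict[str, List[float]], gains: Dict[str, float]) -> List[float]:
    """Mix equal-length stems into one track, scaling each stem by its gain.

    Stems without an entry in `gains` are mixed at unit gain.
    """
    lengths = {len(samples) for samples in stems.values()}
    if len(lengths) != 1:
        raise ValueError("all stems must have the same number of samples")
    mix = [0.0] * lengths.pop()
    for name, samples in stems.items():
        gain = gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    return mix


# Toy samples standing in for already-separated stems (assumed data).
stems = {
    "speech": [0.5, -0.5, 0.25, 0.0],
    "music": [1.0, 1.0, -1.0, 0.5],
    "effects": [0.0, 0.25, 0.0, -0.25],
}

# Turn the music down to a quarter of its level and keep the rest unchanged.
quieter_music = remix(stems, {"music": 0.25})
print(quieter_music)
```

Passing an empty `gains` dictionary reproduces the original, unbalanced mix, which makes it easy to compare before and after.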
- Short Video Explanation:
- Short read: Isolate Voice, Music, and Sound Effects With AI
- Paper: The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
- Click here for the code
ADOP: Approximate Differentiable One-Pixel Point Rendering [33]
Imagine you want to generate a 3D model or simply a fluid video out of a bunch of pictures you took. Well, it is now possible! I don't want to give out too much, but the results are simply amazing and you need to check it out by yourself!
- Short Video Explanation:
- Short read: AI Synthesizes Smooth Videos from a Couple of Images!
- Paper: ADOP: Approximate Differentiable One-Pixel Point Rendering
- Click here for the code
(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]
Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above, such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!
- Short Video Explanation:
- Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
- Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
- Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
- CLIPDraw Colab demo
- StyleCLIPDraw Colab demo
SwinIR: Image restoration using swin transformer [35]
Have you ever had an image you really liked and could only manage to find a small version of it that looked like this image below on the left? How cool would it be if you could take this image and make it look twice as good? It's great, but what if you could make it even four or eight times more high definition? Now we're talking, just look at that.
Here we enhanced the resolution of the image by a factor of four, meaning that we have four times as many pixels along both the height and the width for more details, making it look a lot smoother. The best thing is that this is done within a few seconds, completely automatically, and works with pretty much any image. Oh, and you can even use it yourself with a demo they made available...
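To make the arithmetic concrete: scaling both sides of an image by a factor multiplies the total number of pixels by the square of that factor. The helper below is only illustrative arithmetic and not part of SwinIR; the name `upscaled_size` and the 256x256 input size are assumptions chosen for the example.

```python
from typing import Tuple


def upscaled_size(width: int, height: int, factor: int) -> Tuple[int, int, int]:
    """Return (new_width, new_height, pixel_ratio) for a uniform upscale."""
    new_width, new_height = width * factor, height * factor
    # Ratio of total pixels after vs. before; equals factor squared.
    pixel_ratio = (new_width * new_height) // (width * height)
    return new_width, new_height, pixel_ratio


# Compare the x2, x4 and x8 cases mentioned in the text on an assumed 256x256 input.
for factor in (2, 4, 8):
    print(factor, upscaled_size(256, 256, factor))
```

So a factor of four on each side means sixteen times as many pixels overall for the model to fill in.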
- Short Video Explanation:
- Short read: SwinIR: Image restoration using swin transformer
- Paper: SwinIR: Image restoration using swin transformer
- Click here for the code
- Demo
EditGAN: High-Precision Semantic Image Editing [36]
Control any feature from quick drafts, and it will only edit what you want, keeping the rest of the image the same! A SOTA GAN-based model for image editing from sketches by NVIDIA, MIT and UofT.
- Short Video Explanation:
- Short read: NVIDIA EditGAN: Image Editing with Full Control From Sketches
- Paper: EditGAN: High-Precision Semantic Image Editing
- Click here for the code (will be released soon)
CityNeRF: Building NeRF at City Scale [37]
The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?
- Short Video Explanation:
- Short read: CityNeRF: 3D Modelling at City Scale!
- Paper: CityNeRF: Building NeRF at City Scale
- Click here for the code (will be released soon)
ClipCap: CLIP Prefix for Image Captioning [38]
We've seen AI generate images from other images using GANs. Then, there were models able to generate questionable images using text. In early 2021, DALL-E was published, beating all previous attempts to generate images from text input using CLIP, a model that links images with text as a guide. A very similar task called image captioning may sound really simple but is, in fact, just as complex. It is the ability of a machine to generate a natural description of an image. It's easy to simply tag the objects you see in the image but it is quite another challenge to understand what's happening in a single 2-dimensional picture, and this new model does it extremely well...
- Short Video Explanation:
- Short read: New SOTA Image Captioning: ClipCap
- Paper: ClipCap: CLIP Prefix for Image Captioning
- Click here for the code
- Click here for the Colab Demo
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI papers- A Review. Feel free to subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
Paper references
[1] A. Ramesh et al., Zero-shot text-to-image generation, 2021. arXiv:2102.12092
[2] Lewis, Kathleen M et al., (2021), VOGUE: Try-On by StyleGAN Interpolation Optimization.
[3] Taming Transformers for High-Resolution Image Synthesis, Esser et al., 2020.
[4] Thinking Fast And Slow in AI, Booch et al., (2020), https://arxiv.org/abs/2010.06002.
[5] Odei Garcia-Garin et al., Automatic detection and quantification of floating marine macro-litter in aerial images: Introducing a novel deep learning approach connected to a web application in R, Environmental Pollution, https://doi.org/10.1016/j.envpol.2021.116490.
[6] Rematas, K., Martin-Brualla, R., and Ferrari, V., "ShaRF: Shape-conditioned Radiance Fields from a Single View", (2021), https://arxiv.org/abs/2102.08860
[7] Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers, (2021)
[8] Sandra Bryant et al., "We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right?", (2021), UNSW Sydney blog.
[9] Liu, Z. et al., 2021, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", arXiv preprint https://arxiv.org/abs/2103.14030v1
[10] Zhang, Y., Chen, W., Ling, H., Gao, J., Zhang, Y., Torralba, A. and Fidler, S., 2020. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. arXiv preprint arXiv:2010.09125.
[11] Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision?. International Journal of Computer Vision, 129(3), pp.781-802, https://arxiv.org/abs/1805.04025.
[12] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf
[13] Nguyen & Drealan et al. (2021) A Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control: https://arxiv.org/abs/2103.13452
[14] Pandey et al., 2021, Total Relighting: Learning to Relight Portraits for Background Replacement, doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.
[15] Gengshan Yang et al., (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/.
[16] Richter, Abu AlHaija, Koltun, (2021), "Enhancing Photorealism Enhancement", https://intel-isl.github.io/PhotorealismEnhancement/.
[17] Chen, Hong-Shuo, et al., (2021), "DefakeHop: A Light-Weight High-Performance Deepfake Detector." ArXiv abs/2103.06929.
[18] Liang, Jie and Zeng, Hui and Zhang, Lei, (2021), "High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network", https://export.arxiv.org/pdf/2105.09188.pdf.
[19] Peihao Zhu et al., (2021), Barbershop, https://arxiv.org/pdf/2106.01505.pdf.
[20] Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal Hassner, Facebook AI, (2021), "TextStyleBrush: Transfer of text aesthetics from a single example".
[21] Holynski, Aleksander, et al. "Animating Pictures with Eulerian Motion Fields." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[22] Michael Niemeyer and Andreas Geiger, (2021), "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields", Published in CVPR 2021.
[23] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. and Ray, A., 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[24] Apple, "Recognizing People in Photos Through Private On-Device Machine Learning", (2021), https://machinelearning.apple.com/research/recognizing-people-photos
[25] Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.Y. and Ermon, S., 2021. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
[26] Wang, S.Y., Bau, D. and Zhu, J.Y., 2021. Sketch Your Own GAN. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14050-14060).
[27] "Tesla AI Day", Tesla, August 19th 2021, https://youtu.be/j0z4FweCy4M
[28] Patashnik, Or, et al., (2021), "Styleclip: Text-driven manipulation of StyleGAN imagery.", https://arxiv.org/abs/2103.17249
[29] Stepan Tulyakov*, Daniel Gehrig*, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, Davide Scaramuzza, TimeLens: Event-based Video Frame Interpolation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf
[30] Haim, N., Feinstein, B., Granot, N., Shocher, A., Bagon, S., Dekel, T., & Irani, M. (2021). Diverse Generation from a Single Video Made Possible, https://arxiv.org/abs/2109.08591.
[31] Ravuri, S., Lenc, K., Willson, M., Kangin, D., Lam, R., Mirowski, P., Fitzsimons, M., Athanassiadou, M., Kashem, S., Madge, S. and Prudden, R., 2021. Skillful Precipitation Nowcasting using Deep Generative Models of Radar, https://www.nature.com/articles/s41586-021-03854-z
[32] Petermann, D., Wichern, G., Wang, Z., & Roux, J.L. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf.
[33] Rückert, D., Franke, L. and Stamminger, M., 2021. ADOP: Approximate Differentiable One-Pixel Point Rendering, https://arxiv.org/pdf/2110.06635.pdf.
[34] a) CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
b) StyleCLIPDraw: Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.
[35] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L. and Timofte, R., 2021. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1833-1844).
[36] Ling, H., Kreis, K., Li, D., Kim, S.W., Torralba, A. and Fidler, S., 2021, May. EditGAN: High-Precision Semantic Image Editing. In Thirty-Fifth Conference on Neural Information Processing Systems.
[37] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.
[38] Mokady, R., Hertz, A. and Bermano, A.H., 2021. ClipCap: CLIP Prefix for Image Captioning. https://arxiv.org/abs/2111.09734