A collection of the latest CVPR (Conference on Computer Vision and Pattern Recognition) results, including papers, code, and demo videos. Recommendations are welcome!


🌟 CVPR 2023: continuously updated with the latest papers and their open-source code!


Note: You are welcome to submit an issue to share CVPR 2022 papers and open-source projects and help improve this project.






🔨 Table of Contents (click to jump directly)



Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks



Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes


Diffusion Model

Unifying Layout Generation with a Decoupled Diffusion Model

DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration

LayoutDM: Discrete Diffusion Model for Controllable Layout Generation

Controllable Mesh Generation Through Sparse Latent Point Diffusion Models

Decomposed Diffusion Models for High-Quality Video Generation

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Leapfrog Diffusion Model for Stochastic Trajectory Prediction

Conditional Image-to-Video Generation with Latent Flow Diffusion Models





Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision

NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer

PartNeRF: Generating Part-Aware Editable 3D Shapes without 3D Supervision

StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields

SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field


Knowledge Distillation

Generic-to-Specific Distillation of Masked Autoencoders

X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection


Multimodal

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

Multimodal Industrial Anomaly Detection via Hybrid Fusion

Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images

Multimodal Prompting with Missing Modalities for Visual Recognition

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Virtual Sparse Convolution for Multimodal 3D Object Detection

LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval


Contrastive Learning

Twin Contrastive Learning with Noisy Labels

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition

Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset


Capsule Network


Image Classification

Fine-Grained Classification with Noisy Labels

Task-specific Fine-tuning via Variational Information Bottleneck for Weakly-supervised Pathology Whole Slide Image Classification

Boosting Verified Training for Robust Image Classifications via Abstraction

Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification


Object Detection

Towards Domain Generalization for Multi-view 3D Object Detection in Bird-Eye-View

Virtual Sparse Convolution for Multimodal 3D Object Detection

LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

Bi3D: Bi-domain Active Learning for Cross-domain 3D Object Detection

Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection

Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency

Active Teacher for Semi-Supervised Object Detection

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

MixTeacher: Mining Promising Labels with Mixed Scale Teacher for Semi-Supervised Object Detection

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

STDLens: Model Hijacking-resilient Federated Learning for Object Detection

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Dense Distinct Query for End-to-End Object Detection

OcTr: Octree-based Transformer for 3D Object Detection

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-targets in Semi-supervised Object Detection


Object Tracking

Referring Multi-Object Tracking

Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Visual Prompt Multi-Modal Tracking

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking


3D Object Tracking


Trajectory Prediction

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting

Leapfrog Diffusion Model for Stochastic Trajectory Prediction



Interactive Segmentation as Gaussian Process Classification

Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Delivering Arbitrary-Modal Semantic Segmentation

Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation

Token Contrast for Weakly-Supervised Semantic Segmentation

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

MP-Former: Mask-Piloted Transformer for Image Segmentation

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

InstMove: Instance Motion for Object-centric Video Segmentation

DynaMask: Dynamic Mask Selection for Instance Segmentation

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation

MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving

FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation

SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

Generative Semantic Segmentation

Reliability in Semantic Segmentation: Are We on the Right Track?

Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

Explicit Visual Prompting for Low-Level Structure Segmentations

Two-shot Video Object Segmentation

Focused and Collaborative Feedback Integration for Interactive Image Segmentation

Orthogonal Annotation Benefits Barely-supervised Medical Image Segmentation


Weakly Supervised Semantic Segmentation


Medical Image Segmentation


Video Object Segmentation


Interactive Video Object Segmentation


Visual Transformer

Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer

Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves

MP-Former: Mask-Piloted Transformer for Image Segmentation

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification

BiFormer: Vision Transformer with Bi-Level Routing Attention

Making Vision Transformers Efficient from A Token Sparsification View

Rotation-Invariant Transformer for Point Cloud Matching

Graph Transformer GANs for Graph-Constrained House Generation

PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Dual-path Adaptation from Image to Video Transformers

Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Spherical Transformer for LiDAR-based 3D Recognition

OcTr: Octree-based Transformer for 3D Object Detection

Text with Knowledge Graph Augmented Transformer for Video Captioning

MAGVLT: Masked Generative Vision-and-Language Transformer


Depth Estimation

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Fully Self-Supervised Depth Estimation from Defocus Clue


Image Retrieval / Video Retrieval

Data-Free Sketch-Based Image Retrieval

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not


Super Resolution

OPE-SR: Orthogonal Position Encoding for Designing a Parameter-free Upsampling Module in Arbitrary-scale Image Super-Resolution

Super-Resolution Neural Operator

Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting


Image Denoising (1 paper)

Masked Image Training for Generalizable Deep Image Denoising


Image Editing

CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing


Image Compression

Context-Based Trit-Plane Coding for Progressive Image Compression


Face Recognition

Attribute-preserving Face Dataset Anonymization via Latent Code Optimization

Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images

Sibling-Attack: Rethinking Transferable Adversarial Attacks against Face Recognition


Face Detection


Face Anti-Spoofing


Face Reconstruction

DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images


Video Action Detection

TriDet: Temporal Action Detection with Relative Boundary Modeling


Sign Language Translation

Continuous Sign Language Recognition with Correlation Network

Natural Language-Assisted Sign Language Recognition


Person Re-identification

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification


Talking Face

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation


Human Pose Estimation

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments

Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation

3D Human Mesh Estimation from Virtual Markers

Rigidity-Aware Detection for 6D Object Pose Estimation

Object Pose Estimation with Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation



Improving GAN Training via Feature Space Shrinkage

Scaling up GANs for Text-to-Image Synthesis

Graph Transformer GANs for Graph-Constrained House Generation

Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models


Age Estimation


Facial Expression Recognition


Hand Pose Estimation (Hand Mesh Recovery)

Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes

ACR: Attention Collaboration-based Regressor for Arbitrary Two-Hand Reconstruction


3D Reconstruction

Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly

MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction


Frame Interpolation

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation


3D Point Cloud

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Neural Intrinsic Embedding for Non-rigid Point Cloud Matching

ACL-SPC: Adaptive Closed-Loop system for Self-Supervised Point Cloud Completion

PointCert: Point Cloud Classification with Deterministic Certified Robustness Guarantees

SCPNet: Semantic Scene Completion on Point Cloud

Parameter is Not All You Need: Starting from Non-Parametric Networks for 3D Point Cloud Analysis

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Frequency-Modulated Point Cloud Rendering with Easy Editing

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

Rotation-Invariant Transformer for Point Cloud Matching

Deep Graph-based Spatial Consistency for Robust Non-rigid Point Cloud Registration

Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

Novel Class Discovery for 3D Point Cloud Semantic Segmentation

Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration


Anomaly Detection

Diversity-Measurable Anomaly Detection



PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Generic-to-Specific Distillation of Masked Autoencoders

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger

Turning a CLIP Model into a Scene Text Detector

Adversarial Attack with Raindrops

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks

Neural Video Compression with Diverse Contexts

Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

Efficient and Explicit Modelling of Image Hierarchies for Image Restoration

Quality-aware Pre-trained Models for Blind Image Quality Assessment

Renderable Neural Radiance Map for Visual Navigation

Single Image Backdoor Inversion via Robust Smoothed Classifiers

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision

Geometric Visual Similarity Learning in 3D Medical Image Self-supervised Pre-training

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Zero-shot Object Counting

EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization

Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering

Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement

Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems

Hierarchical discriminative learning improves visual representations of biomedical microscopy

A Meta-Learning Approach to Predicting Performance and Data Requirements

DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction

Detecting Human-Object Contact in Images

MACARONS: Mapping And Coverage Anticipation with RGB Online Self-Supervision

Masked Images Are Counterfactual Samples for Robust Fine-tuning

UniHCP: A Unified Model for Human-Centric Perceptions

PyramidFlow: High-Resolution Defect Contrastive Localization using Pyramid Normalizing Flow

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

DistilPose: Tokenized Pose Regression with Heatmap Distillation

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

Guiding Pseudo-labels with Uncertainty Estimation for Test-Time Adaptation

Learning Discriminative Representations for Skeleton Based Action Recognition

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes

A Light Weight Model for Active Speaker Detection

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

CUDA: Convolution-based Unlearnable Datasets

Masked Image Modeling with Local Multi-Scale Reconstruction

Revisiting Rotation Averaging: Uncertainties and Robust Losses

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

MVImgNet: A Large-scale Dataset of Multi-view Images

Neuron Structure Modeling for Generalizable Remote Physiological Measurement

3D Cinemagraphy from a Single Image

HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets

Modality-Agnostic Debiasing for Single Domain Generalization

Upcycling Models under Domain and Category Shift

Prototype-based Embedding Network for Scene Graph Generation

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling

Progressive Open Space Expansion for Open-Set Model Attribution

Interventional Bag Multi-Instance Learning On Whole-Slide Pathological Images

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning

Adaptive Data-Free Quantization

Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective

Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies

Universal Instance Perception as Object Discovery and Retrieval

Iterative Geometry Encoding Volume for Stereo Matching

Regularized Vector Quantization for Tokenized Image Synthesis

Semi-supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination

CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation

Diversity-Aware Meta Visual Prompting

Blind Video Deflickering by Neural Filtering with a Flawed Atlas

Non-Contrastive Unsupervised Learning of Physiological Signals from Video

DAA: A Delta Age AdaIN operation for age estimation via binary code transformer

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction from Multi-view Images

I$^2$-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs

V2V4Real: A Real-world Large-scale Dataset for Vehicle-to-Vehicle Cooperative Perception

Bi-directional Distribution Alignment for Transductive Zero-Shot Learning

Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

Lana: A Language-Capable Navigator for Instruction Following and Generation

Rethinking Optical Flow from Geometric Matching Consistent Perspective

Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-shot Learning with Hyperspherical Embeddings

A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation

Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning

TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Adversarial Counterfactual Visual Explanations

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation

Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior

LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

On the Effects of Self-supervision and Contrastive Alignment in Deep Multi-view Clustering

3D Concept Learning and Reasoning from Multi-View Images

Picture that Sketch: Photorealistic Image Generation from Abstract Sketches

Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning

Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data

Feature Alignment and Uniformity for Test Time Adaptation

EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Trainable Projected Gradient Method for Robust Fine-tuning

Partial Network Cloning

Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning

Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection

DeAR: Debiasing Vision-Language Models with Additive Residuals

3DQD: Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process

Sharpness-Aware Gradient Matching for Domain Generalization

Extracting Class Activation Maps from Non-Discriminative Features as well

Make Landscape Flatter in Differentially Private Federated Learning

Computationally Budgeted Continual Learning: What Does Matter?

TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization

Efficient Map Sparsification Based on 2D and 3D Discretized Grids

ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-Informed Proposals

Joint Visual Grounding and Tracking with Natural Language Specification

Automatic evaluation of herding behavior in towed fishing gear using end-to-end training of CNN and attention-based networks

Learning A Sparse Transformer Network for Effective Image Deraining

Context De-confounded Emotion Recognition

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

The Treasure Beneath Multiple Annotations: An Uncertainty-aware Edge Detector

Propagate And Calibrate: Real-time Passive Non-line-of-sight Tracking

Detecting Everything in the Open World: Towards Universal Object Detection

Data-efficient Large Scale Place Recognition with Graded Similarity Supervision

Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices

Learning a 3D Morphable Face Reflectance Model from Low-cost Data

Full or Weak annotations? An adaptive strategy for budget-constrained annotation campaigns

ALOFT: A Lightweight MLP-like Architecture with Dynamic Low-frequency Transform for Domain Generalization

Visibility Constrained Wide-band Illumination Spectrum Design for Seeing-in-the-Dark

Human Pose as Compositional Tokens

Equiangular Basis Vectors

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

Boundary Unlearning

One-to-Few Label Assignment for End-to-End Dense Detection

Fix the Noise: Disentangling Source Feature for Controllable Domain Translation

PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment

Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings

Polynomial Implicit Neural Representations For Large Diverse Datasets

Persistent Nature: A Generative Model of Unbounded 3D Worlds

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

Egocentric Audio-Visual Object Localization

Improving Generalization with Domain Convex Game

Visual-Language Prompt Tuning with Knowledge-guided Context Optimization

TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision

A Bag-of-Prototypes Representation for Dataset-Level Applications

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning

Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization

Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels

Marching-Primitives: Shape Abstraction from Signed Distance Function

CP$^3$: Channel Pruning Plug-in for Point-based Networks

Box-Level Active Detection

Robust Generalization against Photon-Limited Corruptions via Worst-Case Sharpness Minimization

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360$^{\circ}$

Human Guided Ground-truth Generation for Realistic Image Super-resolution

SIEDOB: Semantic Image Editing by Disentangling Object and Background

Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection

Top-Down Visual Attention from Analysis by Synthesis

Semantic Ray: Learning a Generalizable Semantic Field with Cross-Reprojection Attention

Backdoor Defense via Adaptively Splitting Poisoned Dataset

LightPainter: Interactive Portrait Relighting with Freehand Scribble

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Don't FREAK Out: A Frequency-Inspired Approach to Detecting Backdoor Poisoned Samples in DNNs

Learning a Practical SDR-to-HDRTV Up-conversion using New Dataset and Degradation Models

Open Set Action Recognition via Multi-Label Evidential Learning

Dense Network Expansion for Class Incremental Learning

VecFontSDF: Learning to Reconstruct and Synthesize High-quality Vector Fonts via Signed Distance Functions

Correlational Image Modeling for Self-Supervised Visual Pre-Training

An Extended Study of Human-like Behavior under Adversarial Training

RaBit: Parametric Modeling of 3D Biped Cartoon Characters with a Topological-consistent Dataset

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency

Balanced Spherical Grid for Egocentric View Synthesis

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Re-thinking Federated Active Learning based on Inter-class Diversity

Learning a Depth Covariance Function

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Music-Driven Group Choreography

Beyond Appearance: a Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
