• Stars
    star
    277
  • Rank 148,875 (Top 3 %)
  • Language
    Python
  • Created almost 3 years ago
  • Updated 6 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Making Structure-from-Motion (COLMAP) more robust to symmetries and duplicated structures

SfM disambiguation with COLMAP

About

Structure-from-Motion generally fails when the scene exhibits symmetries and duplicated structures. In this repository, we implement several state-of-the-art algorithms that aim at addressing this problem, we integrate them into COLMAP, and we extensively analyze their performance. We hope that this effort can ease further research on this problem.

We focus on filtering out incorrect matches between images prior to SfM. The filtering process is done by reimplementing the ideas from Distinguishing the indistinguishable: Exploring structural ambiguities via geodesic context by Yan et al. (CVPR 2017) and Global Structure-from-Motion by Similarity Averaging by Cui et al. (ICCV 2015). We also include the experiment results from Improving Structure from Motion with Reliable Resectioning by Kataria et al. (3DV 2020) based on their implementation. We refer to these three papers as Yan's method, Cui's method, and Kataria's method respectively. This repository uses COLMAP and hloc for feature extraction and matching, and COLMAP for geometric verification and sparse reconstruction.

TLDR: No method consistently works well over all datasets with a single set of hyperparameters. Tuning the parameters for large scenes is difficult and time-consuming for all three methods.

Drop an email to Lixin Xue if you are interested in this problem and want to chat!

teaser
Results on the Alexander Nevsky Cathedral Dataset

Conclusions

Based on our experiments, we have the following observations:

  • Duplicate structures in images often lead to an excessive number of image matches and non-converging bundle adjustment. These will lengthen the reconstruction time significantly. Removing the wrong matches or initializing the poses correctly can significantly speed up the reconstruction process.
  • With a correct initial pair of images and a perfect next view selection, colmap might still output a reconstruction with many wrongly registered images. Even Kataria's method to initialize poses based on reliable matches was still insufficient to disambiguate some datasets. Therefore, a less noisy pose graph is necessary for a correct reconstruction.
  • Both Yan's method and Cui's method need some scene-specific parameter tuning. Kataria's method has better generalization ability, though still fails on several datasets even though we tune the parameters for it. No method consistently works well over all datasets with a single set of parameters. Tuning the parameters for certain scenes (especially large scale ones) is difficult and time-consuming for all three methods.

Installation

[Click to expand]
# python 3.7 is required for the 'capture_output' keyword in `subprocess.run`
# which is only used in the notebooks
conda create -n sfm python=3.7 -y

# install [colmap](https://colmap.github.io/install.html).
# install [hloc](https://github.com/cvg/Hierarchical-Localization#installation)

# library for the plot of match graphs
sudo apt-get install graphviz graphviz-dev
pip install pygraphviz
conda install -c anaconda networkx -y
# for interactive control of parameters in the notebooks
conda install -c conda-forge jupyterlab ipywidgets -y
# can skip this if using colmap for visualization of 3D models
conda install -c open3d-admin -c conda-forge open3d -y

# install this library in development mode for further modifications
python -m pip install -e .

We also provide the datasets we use via google drive. Please download it and then unzip it under the folder datasets to have the following layout:

|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...

Pipelines

[Click to expand]

We provide two jupyter notebooks as examples for the complete pipelines of Yan's method and Cui's method. These two methods share a similar pipeline by first computing a score for each image pair and then removing wrong matches based on the score. After that, the filtered matches are passed to the incremental reconstruction stage. In Yan's method, they use raw matches to compute tracks and use the percentage of shared unique tracks between two images as the score for an image pair. While in Cui's method, they do a local reconstruction for every image and use the missing correspondence idea to create a score for every image pair.

pipeline yan pipeline cui
Similar Pipelines for Yan's method and Cui's method

1. Correspondence

First, we can extract features from images using colmap or hloc. We provide the following features with their corresponding keywords in the parentheses:

  1. colmap SIFT with default parameters (sift_default)
  2. colmap sparser SIFT features with the first octave set to be 0 (sift_sparse)
  3. SuperPoint (superpoint)
  4. D2-Net (d2net)
  5. R2D2 (r2d2)
  6. DISK (disk, still a pull request in hloc)

For SIFT features extracted by colmap, we use the exhaustive nearest neighbor matching provided by colmap.

For learned features extracted by hloc, we use the exhaustive nearest neighbor matching provided by hloc. Specifically for the SuperPoint feature, we can also use the trained SuperGlue model for matching.

Then, we use colmap matches_importer to perform geometric verification (compute two-view geometries from the matches) with different RANSAC parameters (check colmap_matching_options in options/matching_options.py).

2. Disambiguation

Next, we can use Yan's method or Cui's method to compute the scores for all the matches. After that, we can choose to use a threshold filter, a top k filter, or a percentile filter to remove suspicious matches. We create a new database with the filtered matches and recompute two-view geometries.

We choose to pre-filter matches rather than post-process the reconstructed model as done in the paper Correcting for Duplicate Scene Structure in Sparse 3D Reconstruction by Heinly et al based on the observations stated in the section Conclusions.

3. Reconstruction

Lastly, we use colmap mapper to reconstruct the whole scene incrementally. Depending on the dataset, you can choose to fix the intrinsics from EXIF or not (check options/mapper_options.py).

Yan's Method Overview

[Click to expand]

Distinguishing the Indistinguishable: Exploring Structural Ambiguities via Geodesic Context. CVPR 2017.

By Qingan Yan, Long Yang, Ling Zhang, Chunxia Xiao.

Basic Steps

The key idea is that geodesic neighbors capturing the same instance of duplicate structures usually share more matches than images of different instances of duplicate structures. With this in mind, they:

  1. generate tracks from raw matches;
  2. select the iconic set of images to summarize the whole scene by maximizing an objective function that favors completeness (#observed tracks) and penalizes repetitiveness (#tracks occurring in more than one image);
  3. split the tracks covered by the images in the iconic set into two parts: those appearing only in one image of the iconic set are defined as unique tracks, while the other tracks appearing more than once in the images of the iconic set are defined as confusing tracks.
  4. define a score for match_ij: len(unique_ij) / max(len(unique_i), len(unique_j)), i.e. the percentage of common unique tracks shared by the two images

Differences between the Original Implementation and the Paper

Our implementation follows the original implementation shared by the author. However, there are some differences between the author's code and the paper:

  1. In the actual implementation, the non-iconic images are also used as the backbone of the path network. Therefore, the match graph is not "a bipartite graph with nodes respectively are iconic images and non-iconic images". The construction of the iconic set is only for the construction of the unique tracks and confusing tracks.
  2. In the original paper, the match will be kept as long as two images share enough unique tracks. However, the percentage of unique tracks is also used to filter out matches in the implementation.
  3. The $\alpha$ in formula (3) is set to 0 in the code while it is required to be larger than 0 in the paper. As a consequence, the stopping criterion for the construction of the iconic set is different: in the paper the iconic set will be fixed once adding any one of the images will not increase the objective function (3). Instead, the author provides a coverage threshold to stop the expansion of the iconic set once the objective is larger than this threshold. This change is necessary since the objective function will be monotonically increasing with $\alpha = 0$.

Parameters Tuning

The original implementation has two parameters (coverage_thres and score_thres) to tune. Here is the comment from the author:

The coverage controls how many iconic images will be selected. As for small-scale indoor scenes, a large value between 0.7 and 0.9 is recommended; otherwise, for large-scale unstructured datasets, the value around 0.6 would be enough.

Parameter score_thres defines whether an image pair is acceptable. As for small-scale scenes, similarly, a large threshold (around 0.3) is recommended; otherwise, for large-scale outdoor scenes, score_thres between 0.04 and 0.1 would be a good choice.

For instance, for the Alexander Nevsky Cathedral dataset, we use coverage = 0.6 and score_thres = 0.1 to achieve well-registered 3D point clouds.

In our implementation, we expose another 4 parameters to tune:

  • track_degree: the minimal length of a track to be taken into account. Increasing it will discard more short tracks.
  • alpha: the $\alpha$ in formula (3) of the paper. Increasing it will require the images in the iconic set to be more distinctive.
  • minimal_views: the minimal number of shared tracks for a match to be valid. Increasing it means fewer matches will be valid.
  • ds: the data structure used to store the list of matches. You can leave it unchanged (default largearray) as the default value is a good tradeoff between speed and memory. For large datasets with thousands of images like berliner_dom (1618 images), it is necessary to use the smallarray data structure or limit the maximum number of keypoints in an image. In this case, it would be extremely slow (more than 8 hours for berliner_dom) due to a large number of images and the inefficient data structure.

Cui's Method Overview

[Click to expand]

Global Structure-from-Motion by Similarity Averaging. ICCV 2015.

By Zhaopeng Cui and Ping Tan.

Basic Steps

In this paper, the authors utilize the missing correspondences cue to remove wrong matches.

  1. For each image and its direct neighbors, compute a two-view reconstruction by triangulation with the baseline set to 1.
  2. Merge these two-view reconstructions into one local reconstruction by solving a linear system based on the scale consistency in the stellar graph.
  3. Project the merged 3D points to each neighbor and calculate its missing correspondence score.

For more details, please check Section 4 of the paper.

Missing Correspondence Score

In the paper, there are two types of projected 3D points in the field of view of the image: the observed keypoints (matched features) in this image and the ones observed in other images but not in this image (missing features).

The author takes the bounding boxes of matched features and calculates the percentage of missing points in the bounding boxes as the missing correspondence score. For example, these two images below display the projected 3D points from image 0 into image 1 and image 16, respectively.

projected points from image 0 to image 1 projected points from image 0 to image 17

However, this formulation is not stable since the bounding box is extremely sensitive to outliers: one outlier might enlarge the box to the entire image frame. In addition, the bounding box doesn't take into account the perspective transformation between images, where a square in one image would not be a square in another image.

Therefore, we propose a more fine-grained score by using a small square box around every keypoints instead of a big bounding box around all the keypoints. Below is the visualization of the new masks of the images shown above. With this modification, the score is more reliable and distinctive for the disambiguation.

projected points from image 0 to image 1 projected points from image 0 to image 17

We further try to take the distance between matched features and missing features into account by using a Gaussian mask around the keypoints:

projected points from image 0 to image 1 projected points from image 0 to image 17

This scoring method requires a smaller threshold since the score is typical around ~0.1. We did not experiment much with this scoring method, so we are not sure if it is better than the previous version.

Parameters

  • score_version: we provide three score formulations as stated above. Version 1 is the method using a bounding box around keypoints presented in the paper. Version 2 is using a uniform square around every keypoint. Version 3 is using a Gaussian square around every keypoint.
  • min_num_valid_depths: minimal number of correct reconstructed depths for a 3D point to be valid (The depth consistency check in the paper).
  • max_num_neighbors: maximal number of neighbors used for the reconstruction. It is used to avoid the inefficiency caused by dense clusters of images.
  • square_radius: the size of the square around each keypoint for the score versions 2 and 3.
  • parallel: whether to use multiple threads for the computation of the scores. The statistics of the runtime is not accurate in parallel mode.
  • plot: whether to plot the masks and the image with projected 3D points. For large datasets, it is advisable not to plot as plotting will have a significant overhead.

Visualizations

To avoid excessive loading time for the repo, please check this for visualizations.

Codebase Structure

[Click to expand]
|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...
|---disambiguation
    |---geodesic_consistency        # code for Yan's method
    |---mmissing_correspondences    # code for Cui's method
    |---options     # parameters for features/matching/mapper
    |---utils       # some helper functions
    |---calculate_geodesic_consistency_scores.py        # interface for calculating match scores based on Yan's method
    |---calculate_missing_correspondences_scores.py     # interface for calculating match scores based on Cui's method
    |---extract_match_features.py                       # interface for extract and match features
|---reliable_resectioning
    |---src         # modified colmap source files for Kataria's method
|---results
    |---${dataset_name}
        |---${feature_type}_${matching_type}_${geometric_verification_type}
            |---plots_${parameters}             # plots for missing correspondences in Cui's method
            |---sparse                          # reconstruction using colmap without disambiguation
            |---sparse_yan_${parameters}        # reconstruction using Yan's method
            |---sparse_cui_${parameters}        # reconstruction using Cui's method
            |---db_yan_${parameters}.db         # storing matches filtered with Yan's method
            |---db_cui_${parameters}.db         # storing matches filtered with Cui's method
            |---${dataset_name}.db              # storing unfiltered matches
            |---scores_yan_${parameters}.npy    # socres for matches using Yan's method
            |---scores_cui_${parameters}.npy    # scores for matches using Cui's method
            |---...
|---notebooks
    |---$steet_${method_name}.ipynb   # example for running the codebase and tuning the parameters on the street dataset.
|---scripts
    |---disambiguate_yan.py     # example of using Yan's method for scores
    |---disambiguate_cui.py     # example of using Cui's method for scores
    |---filter_matches.py       # example of filtering matches based on scores
    |---match_features.py       # example of extracting and matching features

Datasets

[Click to expand]

We mainly use the datasets from Yan's repo and Heinly's website, where they include some datasets from Roberts et al. and Jiang et al.. We packed the cleaned-up version of these datasets (with images renamed and features removed) into a zip file for downloads.

To experiment with other datasets, you can place new datasets under yan2017 or heinly2014 with the following structure:

|---datasets
    |---heinly2014
        |---${your_dataset_name}
            |---images
                |--- *.[jpg/png/...]

then you can use your ${your_dataset_name} as argument dataset_name to run the code on new datasets.

Literature Review

[Click to expand]

Here are some relevant papers and their summaries:

  • Pose Graph: nodes are images, edges are epipolar geometries

  • Visibility Graph: nodes are images and tracks(3D points), edges are visibility

  • Some other papers

    • [Roberts CVPR 2011]: missing correspondences + timestamp cues
    • [Cohen CVPR 2012]: detect symmetries and enforce them as constraints in optimization
    • [Ceylan TOG 2013]: user-specified pattern to detect repeated patterns for facade
    • [Heinly 3DV 2014]: similar idea but faster compared to [Heinly ECCV 2014]
    • [Kataria 3DV 2020]: use track length to adjust match score for next view selection; only use reliable 3D points for image registration
  • Common techniques: Minimum Spanning Tree

Acknowledgement

This work was done by Lixin Xue, a Master's student at ETH Zurich, under the supervision of Paul‑Edouard Sarlin and Mihai Dusmanu. We would like to thank Qingan Yan for sharing the original implementation and datasets, Jared Heinly for sharing the datasets, and Rajbir Kataria for helping out with the setup of his work.

More Repositories

1

LightGlue

LightGlue: Local Feature Matching at Light Speed (ICCV 2023)
Python
3,261
star
2

Hierarchical-Localization

Visual localization made easy with hloc
Python
3,054
star
3

nice-slam

[CVPR'22] NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
Python
1,415
star
4

pixel-perfect-sfm

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement (ICCV 2021, Best Student Paper Award)
C++
1,317
star
5

pixloc

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)
JavaScript
745
star
6

glue-factory

Training library for local feature detection and matching
Python
702
star
7

limap

A toolbox for mapping and localization with line features.
C++
676
star
8

GlueStick

Joint Deep Matcher for Points and Lines 🖼️💥🖼️ (ICCV 2023)
Jupyter Notebook
543
star
9

SOLD2

Joint deep network for feature line detection and description
Jupyter Notebook
541
star
10

DeepLSD

Implementation of the paper "DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients"
Jupyter Notebook
453
star
11

pyceres

Factor graphs with Ceres in Python
C++
233
star
12

visloc-iccv2021

ETH-Microsoft dataset for the ICCV 2021 visual localization challenge
203
star
13

EMAP

[CVPR'24] 3D Neural Edge Reconstruction
Python
147
star
14

nicer-slam

[3DV'24 Best Paper Honorable Mention] NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM
Python
146
star
15

nerf-on-the-go

[CVPR'24] NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild
Python
124
star
16

VP-Estimation-with-Prior-Gravity

Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction (ICCV 2023)
C++
90
star
17

glace

[CVPR 2024] GLACE: Global Local Accelerated Coordinate Encoding
Python
68
star
18

px-ros-pkg

A repository for PIXHAWK open source code running on ROS
C
50
star
19

LabelMaker

Jupyter Notebook
50
star
20

raybender

Fast CPU rendering in Python using the Intel® Embree backend
C++
38
star
21

pcdmeshing

Point cloud meshing with CGAL
C++
38
star
22

implicit_dist

C++
23
star
23

hololens_ros

C#
12
star
24

depthsplat

DepthSplat: Connecting Gaussian Splatting and Depth
5
star
25

spot_pose_estimation

Python
2
star