New paper: Large Scale Photometric Bundle Adjustment

Myself and Olly Woodford published a new paper: Large Scale Photometric Bundle Adjustment, PDF here, at BMCV 2020.

This work presents a fully photometric formulation for bundle adjustment. Starting from a classical system (such as COLMAP), the system performs a structure and pose refinement, where the cost function is essentially the normalised correlation cost of patches reprojected into the source images.

Abstract

Direct methods have shown promise on visual odometry and SLAM, leading to greater accuracy and robustness over feature-based methods. However, offline 3-d reconstruction from internet images has not yet benefited from a joint, photometric optimization over dense geometry and camera parameters. Issues such as the lack of brightness constancy, and the sheer volume of data, make this a more challenging task. Thiswork presents a framework for jointly optimizing millions of scene points and hundreds of camera poses and intrinsics, using a photometric cost that is invariant to local lighting changes. The improvement in metric reconstruction accuracy that it confers over feature-based bundle adjustment is demonstrated on the large-scale Tanks & Temples benchmark. We further demonstrate qualitative reconstruction improvements on an in ternet photo collection, with challenging diversity in lighting and camera intrinsics.

Realtime AR world transformations with occlusions

This is me, my team and collaborators have been working on recently. World transforming AR specifically the floor.

You can see occlusions, such as the pillars occluding the floor effect, but we have more sophisticated occlusion handling too:

You can’t tell from this video that the occlusion handling is dynamic, so if the postbox managed to move, the occlusions would stay up to date. And here’s a gallery of nice shots

If you have Snapchat and want to try it for yourself, here are the snapcodes:

BMVC 2019

I went to BMVC this year and had a great time, and saw lots of interesting papers and talked to a lot of interesting people. BMVC was my first conference in 2003, and it has changed fair bit since then. I remember some of the hushed, awed tones about how was that it was getting really international because there were two speakers from American and one all the way from China. Now it really is a big international conference, that just happens to be located in the UK each year. I think the best bits of the fundamental character haven’t changed.

On the minus side, I made a bunch of notes and then lost them so I’m having to go on memory and have almost certainly forgotten some that stood out. So here’s a somewhat random selection of papers that caught my eye as interesting for various reasons.

But first, here’s a video of Cardiff Science Library vomiting rainbows:

A random selection of interesting papers

Keynote

Dissecting Neural Nets
Prof. Antonio Torralba (MIT)

That keynote was very interesting and Prof. Torralba is a fantastic presenter and the results were very intersting. Unfortunately I can’t find the video to link to.

Geometric vision

Whenever there’s a paper not about deep learning there’s always a cluster of people people who’s student days are long past hovering around commenting about how it’s nice to see something that isn’t deep learning. I also like to refer to this type of vision as “geometric” since it involves geometry rather than “traditional” or (even worse) “old-fashioned”.

26. A Simple Direct Solution to the Perspective-Three-Point Problem
Gaku Nakano (NEC Corporation)

The paper is a new solution to the P3P problem. Given the age of the field and number of existing solutions, it’s surprising that there are actually new ones. It’s a surprisingly tricky problem as anyone who’d tried to derive a solution will know and it’s interesting to see there are still new insights to be had.

If you hand a vision system to a computer vision researcher, the first thing they will do is try and break it. These days that’s even publishable!

Non deep image features are still widely used for solving geometric problems especially if efficiency is key. While it’s not surprising, it had never occurred to me that they could be attacked just like neural nets can be attacked.

27. Adversarial Examples for Handcrafted Features
[Supplementary]
Muhammad Latif Anjum (NUST); Zohaib Ali (NUST); Wajahat Hussain (NUST – SEECS)

Much like the attacks on DNNs, the differences aren’t visually apparent. Speaking of adversarial attacks, I found this paper and poster enjoyable and easy to follow, with good results:

210. Robust Synthesis of Adversarial Visual Examples Using a Deep Image Prior
Thomas Gittings (University of Surrey); Steve Schneider (University of Surrey); John Collomosse (University of Surrey)

I didn’t know that was a thing

I like papers that have “towards” in the title. It’s an admission in the title that the results aren’t spectacular or and they aren’t aceing the current benchmarks, but they’re tackling a hard problem in a new way. That’s a good goal for research, not engineering polished solutions, but tackling new problems or bringing new insight to bear.

In this case, they are dealing with point clouds of the sort that might be the result of structure from motion but where the original images aren’t available. Turns out it’s possible to do semantic segmentation of those clouds.

252. Towards Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes
Haiyan Wang (City University of New York); Xuejian Rong (City University of New York); Liang Yang (City University of New York); YingLi Tian (City University of New York)

Realtime semantic segmentation

There’s a lot of interest in realtime techniques  which I like. A lot of it comes from the self driving car industry and all of these are tested on Cityscapes. I’m more interested it from the perspective of running on a phone, but there’s a lot of common ground and so these are well worth a closer look.

253. Fast-SCNN: Fast Semantic Segmentation Network
Rudra Poudel (Tosihiba Research Europe, Ltd.); Stephan Liwicki (Toshiba Research Europe, Ltd.); Roberto Cipolla (University of Cambridge)

259. DABNet: Depth-wise Asymmetric Bottleneck for Real-time Semantic Segmentation
Gen Li (Sungkyunkwan University); Joongkyu Kim (Sungkyunkwan University)

260. Feature Pyramid Encoding Network for Real-time Semantic Segmentation
Mengyu Liu (University of Manchester); Hujun Yin (University of Manchester)

Benchmarks are useful, but I feel that over reliance on them can essentially lead to reverse engineering the datasets. I’ve certainly noticed in my own work that networks that give stellar results on ImageNet don’t do nearly so well when images that aren’t of the sort one posts to the internet (i.e. worse, less well composed, more cluttered, worse lighting and focus etc).

I think all good benchmarks are doomed to eventually become more of a hindrance than a help because of all the focus that they draw. This isn’t to disparage the benchmarks, at all, I think it’s simply part of the cycle of research. I wonder when we’ll reach that point with Cityscapes.

Domain transformation

The key idea here is that (for object detection from a car), a data volume aligned with the ground plane and front of the car is more semantically meaningful than a 2D image. So they transform RESNet features into that cube using a simple technique and do the deep learning there. Sound idea with good results.

285. Orthographic Feature Transform for Monocular 3D Object Detection
Thomas Roddick (University of Cambridge); Alex Kendall (University of Cambridge); Roberto Cipolla (University of Cambridge)

Binary networks

Binarised networks have an appealing minimalism, especially from a hardware and wire-format compression point of view. Unfortunately they’re not differentiable. This paper makes judicious use of carefully inserted weighting factors and derivatives of effectively a blurred binary activation function to introduce differentiability.

19. Accurate and Compact Convolutional Neural Networks with Trained Binarization
Zhe Xu (City University of Hong Kong); Ray Cheung (City University of Hong Kong)

A different approach to deep features

I couldn’t decide how much I like this paper because I kept vacillating about the core idea. Then I realised that in itself makes it a good paper because it’s made me think a lot about the problem. It was very well presented and the core idea is simple and intriguing.

32. Matching Features without Descriptors: Implicitly Matched Interest Points
[Supplementary]
Titus Cieslewski (University of Zurich & ETH Zurich); Michael Bloesch (Deepmind); Davide Scaramuzza (University of Zurich & ETH Zurich)

I like that the features are defined purely by matchability and localisation. I also like that they do not have to do things like have precisely (or at most one) feature per 8×8 (etc) window of the image, and they have a simple structure without auxiliary losses, and an overall simple training procedure.

This is also one of the things I like about BMVC: the results presented in the paper don’t present it as the new leading feature detector, in fact it’s not even near the top of the pack of the ones they compare to. However they’re tackling it in a new and interesting way and I there is a great deal of value in such ideas being shared and discussed even if they’re not (yet?) as good as the competitors.

14 Years

I’ve been working on model based 3D tracking on and off for quite a while now.

Year 1 (2005)

This was my main contribution to the field of 3D tracking. To my knowledge, it was the joint first (there was another paper from my lab mate using a different technique) real time tacking system that processed the entire image frame. Both techniques were much more robust than the ones that went before. My one also debuted an early version of the FAST corner detector (I didn’t put that page there).

You can see the tracking works because the model (rendered as purple lines) stays stuck to the image.  The tracker operated in real time, well field rate, which was 50Hz fields of 756×288 pixels of analogue video from some sort of Pulnix camera, captured on a BT878 card of some sort on a dual PIII at 850 MHz (running Redhat of some description). It wasn’t mobile (I had two 21″ CRT monitors), so I wasn’t watching the screen as I was capturing video; I found a long spool of thin 75 ohm co-ax which is why it had any kind of mobility. It, somewhat unexpectedly, tracked almost until I put the camera down on the table at the end. It was a bit of an anticlimactic finish, but I didn’t expect it to work quite so well.

Year 14 (2019)

This is the project I’ve been working on recently (landmarkers). It’s nice to see technology move from a proof of concept, academic curiosity to a robust production system usable in the wild by people who aren’t computer vision researchers. Also, I didn’t do the graphics in this one which is why it looks rather cooler than a bunch of purple lines.

Linear 3D triangulation

I came across this 3D linear triangular method in TheiaSFM:

bool TriangulateNView(const std::vector<Matrix3x4d>& poses,
const std::vector<Vector2d>& points,
Vector4d* triangulated_point)
CHECK_EQ(poses.size(), points.size());

Matrix4d design_matrix = Matrix4d::Zero();
for (int i = 0; i < points.size(); i++) {
const Vector3d norm_point = points[i].homogeneous().normalized();
const Eigen::Matrix cost_term =
poses[i].matrix() -
norm_point * norm_point.transpose() * poses[i].matrix();
design_matrix = design_matrix + cost_term.transpose() * cost_term;
}
*triangulated_point = eigen_solver.eigenvectors().col(0);
return eigen_solver.info() == Eigen::Success;
}

I was aware of the DLT (direct linear transform), but it didn't look like any formulation I've seen before. It's actually pretty neat. Let's say you're trying to find an unknown homogeneous point in 3D, $\mathbf{X} = [X, Y, Z, 1]$. What we have is $N$ poses, $P$, represented as $3\times 4$ matrices and the corresponding 2D coordinates represented as homogeneous points in $\mathbb R^3$. The 2D points are written as $\mathbf{x} = [ x, y, 1]$.

Since we're triangulating the 3D point, and we have homogeneous coordinate (i.e. $\alpha \mathbf{x} \equiv \mathbf{x}$) then for all $i$ we should have:
$\alpha_i \mathbf{x}_i \approx P_i \mathbf X$
given an scale factor $\alpha$.

Now let's pick apart the code above. Let's call design_matrix $D$ and cost_term $C$. On line 12, we have:
$\displaystyle D = \sum_{i=1}^{N} C_i^\top C_i$
And line 15 we’re finding the eigenvector corresponding to the smallest eigenvalue of D (SelfAdjointSolver produces them in a sorted order), i.e.
$\mathbf{X} \approx \displaystyle \underset{\mathbf{v}, |\mathbf{v}|=1}{\text{argmin}}\ \mathbf{v}^\top D \mathbf{v}$

We can rewrite $D = \mathbf{C}^\top\mathbf{C}$ where:
$\mathbf{C} = \left[ \begin{matrix} C_1\\ C_2\\ \vdots\\ C_N\\ \end{matrix}\right]$, which substituting in above gives:
$\mathbf{X} \approx \displaystyle \underset{\mathbf{v}, |\mathbf{v}|=1}{\text{argmin}}\ \|\mathbf{C v}\|_2^2$,
which is of course the right singular vector corresponding to the smallest singular value of $C$. Using eigen decomposition is much more efficient the size is $O(1)$, not $O(N)$, but probably at the penalty of less numerical precision.

Either way we’re trying to find the approximate nullspace of $\mathbf{C}$, which means finding something that’s roughly in the null space of all the $C_i$s. But why?

On lines 8–11, we have:
$C_i = P_i - \mathbf{\hat{x}\hat{x}^\top}P_i$,
and we’re claiming $\mathbf{X}$ is about in the null space. Let’s see what happens when we multiply by it:
$(P_i - \mathbf{\hat{x}\hat{x}^\top}P_i) \mathbf{X} = P_i \mathbf{X} -\mathbf{\hat{x}\hat{x}^\top}P_i \mathbf{X}\\$
Now, substituring in the first equation we have all the way back at the top gives:
$\approx \alpha \mathbf{x} - \alpha\mathbf{\hat{x}\hat{x}^\top x} = \alpha \mathbf{x} - \alpha\mathbf{\hat{x} |x|} = \alpha \mathbf{x} - \alpha\mathbf{x} = 0$
taa daa!

So there you go! If there is no noise, $\mathbf{X}$ is in the right null space of $C_i$ and so the null space of $\mathbf C$ and of course $D$. If there is noise, then it’s closest to being in the null space of all of $C_i$ measured in a least squares sense.

Note though that it’s just an algebraic error, not a reprojection error so it will give slightly oddly scaled results which will give more weight to some points than others. It is however a rather elegant solution.