New paper: Large Scale Photometric Bundle Adjustment

Myself and Olly Woodford published a new paper: Large Scale Photometric Bundle Adjustment, PDF here, at BMCV 2020.

This work presents a fully photometric formulation for bundle adjustment. Starting from a classical system (such as COLMAP), the system performs a structure and pose refinement, where the cost function is essentially the normalised correlation cost of patches reprojected into the source images.


Direct methods have shown promise on visual odometry and SLAM, leading to greater accuracy and robustness over feature-based methods. However, offline 3-d reconstruction from internet images has not yet benefited from a joint, photometric optimization over dense geometry and camera parameters. Issues such as the lack of brightness constancy, and the sheer volume of data, make this a more challenging task. Thiswork presents a framework for jointly optimizing millions of scene points and hundreds of camera poses and intrinsics, using a photometric cost that is invariant to local lighting changes. The improvement in metric reconstruction accuracy that it confers over feature-based bundle adjustment is demonstrated on the large-scale Tanks & Temples benchmark. We further demonstrate qualitative reconstruction improvements on an in ternet photo collection, with challenging diversity in lighting and camera intrinsics.

BMVC 2019

I went to BMVC this year and had a great time, and saw lots of interesting papers and talked to a lot of interesting people. BMVC was my first conference in 2003, and it has changed fair bit since then. I remember some of the hushed, awed tones about how was that it was getting really international because there were two speakers from American and one all the way from China. Now it really is a big international conference, that just happens to be located in the UK each year. I think the best bits of the fundamental character haven’t changed.

On the minus side, I made a bunch of notes and then lost them so I’m having to go on memory and have almost certainly forgotten some that stood out. So here’s a somewhat random selection of papers that caught my eye as interesting for various reasons.

But first, here’s a video of Cardiff Science Library vomiting rainbows:


A random selection of interesting papers


Dissecting Neural Nets
Prof. Antonio Torralba (MIT)

That keynote was very interesting and Prof. Torralba is a fantastic presenter and the results were very intersting. Unfortunately I can’t find the video to link to.

Geometric vision

Whenever there’s a paper not about deep learning there’s always a cluster of people people who’s student days are long past hovering around commenting about how it’s nice to see something that isn’t deep learning. I also like to refer to this type of vision as “geometric” since it involves geometry rather than “traditional” or (even worse) “old-fashioned”.

26. A Simple Direct Solution to the Perspective-Three-Point Problem
Gaku Nakano (NEC Corporation)

The paper is a new solution to the P3P problem. Given the age of the field and number of existing solutions, it’s surprising that there are actually new ones. It’s a surprisingly tricky problem as anyone who’d tried to derive a solution will know and it’s interesting to see there are still new insights to be had.

Adversarial attacks

If you hand a vision system to a computer vision researcher, the first thing they will do is try and break it. These days that’s even publishable!

Non deep image features are still widely used for solving geometric problems especially if efficiency is key. While it’s not surprising, it had never occurred to me that they could be attacked just like neural nets can be attacked.

27. Adversarial Examples for Handcrafted Features
Muhammad Latif Anjum (NUST); Zohaib Ali (NUST); Wajahat Hussain (NUST – SEECS)

Much like the attacks on DNNs, the differences aren’t visually apparent. Speaking of adversarial attacks, I found this paper and poster enjoyable and easy to follow, with good results:

210. Robust Synthesis of Adversarial Visual Examples Using a Deep Image Prior
Thomas Gittings (University of Surrey); Steve Schneider (University of Surrey); John Collomosse (University of Surrey)

I didn’t know that was a thing

I like papers that have “towards” in the title. It’s an admission in the title that the results aren’t spectacular or and they aren’t aceing the current benchmarks, but they’re tackling a hard problem in a new way. That’s a good goal for research, not engineering polished solutions, but tackling new problems or bringing new insight to bear.

In this case, they are dealing with point clouds of the sort that might be the result of structure from motion but where the original images aren’t available. Turns out it’s possible to do semantic segmentation of those clouds.

252. Towards Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes
Haiyan Wang (City University of New York); Xuejian Rong (City University of New York); Liang Yang (City University of New York); YingLi Tian (City University of New York)

Realtime semantic segmentation

There’s a lot of interest in realtime techniques  which I like. A lot of it comes from the self driving car industry and all of these are tested on Cityscapes. I’m more interested it from the perspective of running on a phone, but there’s a lot of common ground and so these are well worth a closer look.

253. Fast-SCNN: Fast Semantic Segmentation Network
Rudra Poudel (Tosihiba Research Europe, Ltd.); Stephan Liwicki (Toshiba Research Europe, Ltd.); Roberto Cipolla (University of Cambridge)

259. DABNet: Depth-wise Asymmetric Bottleneck for Real-time Semantic Segmentation
Gen Li (Sungkyunkwan University); Joongkyu Kim (Sungkyunkwan University)

260. Feature Pyramid Encoding Network for Real-time Semantic Segmentation
Mengyu Liu (University of Manchester); Hujun Yin (University of Manchester)

Benchmarks are useful, but I feel that over reliance on them can essentially lead to reverse engineering the datasets. I’ve certainly noticed in my own work that networks that give stellar results on ImageNet don’t do nearly so well when images that aren’t of the sort one posts to the internet (i.e. worse, less well composed, more cluttered, worse lighting and focus etc).

I think all good benchmarks are doomed to eventually become more of a hindrance than a help because of all the focus that they draw. This isn’t to disparage the benchmarks, at all, I think it’s simply part of the cycle of research. I wonder when we’ll reach that point with Cityscapes.

Domain transformation

The key idea here is that (for object detection from a car), a data volume aligned with the ground plane and front of the car is more semantically meaningful than a 2D image. So they transform RESNet features into that cube using a simple technique and do the deep learning there. Sound idea with good results.

285. Orthographic Feature Transform for Monocular 3D Object Detection
Thomas Roddick (University of Cambridge); Alex Kendall (University of Cambridge); Roberto Cipolla (University of Cambridge)

Binary networks

Binarised networks have an appealing minimalism, especially from a hardware and wire-format compression point of view. Unfortunately they’re not differentiable. This paper makes judicious use of carefully inserted weighting factors and derivatives of effectively a blurred binary activation function to introduce differentiability.

19. Accurate and Compact Convolutional Neural Networks with Trained Binarization
Zhe Xu (City University of Hong Kong); Ray Cheung (City University of Hong Kong)

A different approach to deep features

I couldn’t decide how much I like this paper because I kept vacillating about the core idea. Then I realised that in itself makes it a good paper because it’s made me think a lot about the problem. It was very well presented and the core idea is simple and intriguing.

32. Matching Features without Descriptors: Implicitly Matched Interest Points
Titus Cieslewski (University of Zurich & ETH Zurich); Michael Bloesch (Deepmind); Davide Scaramuzza (University of Zurich & ETH Zurich)

I like that the features are defined purely by matchability and localisation. I also like that they do not have to do things like have precisely (or at most one) feature per 8×8 (etc) window of the image, and they have a simple structure without auxiliary losses, and an overall simple training procedure.

This is also one of the things I like about BMVC: the results presented in the paper don’t present it as the new leading feature detector, in fact it’s not even near the top of the pack of the ones they compare to. However they’re tackling it in a new and interesting way and I there is a great deal of value in such ideas being shared and discussed even if they’re not (yet?) as good as the competitors.