FishFusion: Native surround view automotive fusion, mapping and training
Project Motivation
Fisheye cameras are commonly deployed on vehicles in a surround-view configuration, with one camera at the front of the vehicle, one at the rear, and one on each wing mirror. Such camera networks have traditionally been deployed for viewing applications. However, surround-view cameras are increasingly used for near-field sensing, which enables low-speed applications such as parking or traffic-jam assistance functions.
With such a surround-view camera network, we can achieve dense 360-degree near-field perception. The wide field of view of the fisheye image comes with the side effect of strong radial distortion: objects at different angles from the optical axis look quite different, making object detection a challenge. A common practice is to rectify the distortion using a fourth-order polynomial model or the unified camera model. However, undistortion comes with resampling artifacts, especially at the periphery; in particular, the spurious frequency components it introduces are known to degrade computer vision performance. Other, more minor, impacts include a reduced field of view and a non-rectangular image due to invalid pixels. Although semantic segmentation adapts more readily to fisheye images, its annotation costs are much higher than those of object detection. In general, there is limited work on fisheye perception. The figure below shows an attempt to address the output representation of objects in fisheye space.
From: Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciaran Eising, Ahmad El-Sallab and Senthil Yogamani. Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline, WACV 2021
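To make the fourth-order polynomial model mentioned above concrete, the following is a minimal Python sketch of fisheye projection under that model. The coefficient values, principal point and test point are hypothetical placeholders, not calibration data from any real camera.

    # Minimal sketch of a fourth-order polynomial fisheye projection model.
    # r(theta) = k1*theta + k2*theta^2 + k3*theta^3 + k4*theta^4, where theta is
    # the angle of incidence from the optical axis.
    import numpy as np

    def project_fisheye(point_cam, k, cx, cy):
        """Project a 3D point in camera coordinates onto the fisheye image."""
        x, y, z = point_cam
        theta = np.arctan2(np.hypot(x, y), z)            # angle from the optical axis
        r = sum(ki * theta ** (i + 1) for i, ki in enumerate(k))
        phi = np.arctan2(y, x)                            # azimuth in the image plane
        return cx + r * np.cos(phi), cy + r * np.sin(phi)

    # Hypothetical calibration coefficients and principal point.
    u, v = project_fisheye((1.0, 0.5, 2.0), k=(330.0, -5.0, 1.0, -0.05), cx=640.0, cy=480.0)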
Of course, surround-view cameras are not the only sensors that provide a 360-degree perception of the vehicle's surroundings. As demonstrated in the first figure below, several sensor types can provide such perception. Sensor fusion is therefore a topic of deep interest in the context of ADAS and autonomous driving. The second figure below shows an example of early/mid-level fusion, in this case the fusion of cameras and laser scanners, though the same approach can be applied to other sensor combinations; other fusion modalities will also be examined. In general, there are three high-level fusion strategies:
• early fusion, where the raw modalities are combined ahead of feature extraction.
• intermediate fusion, where the features extracted from each modality are concatenated before classification.
• late fusion, where the modality-wise classification results are combined.
Late fusion is the classic strategy: each sensor completes its full algorithmic processing independently before the results are combined.
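As an illustration of the three strategies, here is a minimal Python/PyTorch sketch for a camera branch and a laser-scanner branch. The encoders, feature sizes and ten-class classifiers are hypothetical placeholders, not the architectures proposed in this project.

    # Contrasting early, intermediate and late fusion for two modalities.
    import torch
    import torch.nn as nn

    cam_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                nn.AdaptiveAvgPool2d(1), nn.Flatten())
    lidar_encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())

    cam = torch.randn(1, 3, 64, 64)      # camera image
    lidar = torch.randn(1, 1, 64, 64)    # laser scan rasterised to a 2D grid

    # Early fusion: concatenate the raw modalities, then extract features jointly.
    early_encoder = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
    early_logits = nn.Linear(16, 10)(early_encoder(torch.cat([cam, lidar], dim=1)))

    # Intermediate fusion: per-modality features concatenated before classification.
    mid_feat = torch.cat([cam_encoder(cam), lidar_encoder(lidar)], dim=1)
    mid_logits = nn.Linear(32, 10)(mid_feat)

    # Late fusion: each modality is classified independently; results are combined.
    late_logits = nn.Linear(16, 10)(cam_encoder(cam)) + nn.Linear(16, 10)(lidar_encoder(lidar))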
Project Description
Given the surround-view sensing, maps are classically accumulated and tracked using Kalman Filters (or their Extended variants) or Particle Filters. The vehicle trajectory (Figure 9) is then often estimated using classical algorithms, such as the A* algorithm. However, these tend to fail in cases of uncertainty, either causing a recalculation of the vehicle trajectory or a cancellation of the manoeuvre (e.g., a pothole or undetected speed bump can cause a misalignment of the actual and estimated host vehicle location).
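As context for what is being replaced, the following is a minimal sketch of a constant-velocity Kalman filter tracking a 2D position, representative of such classical temporal mapping; the state layout, noise levels and measurement are illustrative assumptions only.

    # Constant-velocity Kalman filter: predict, then update with a position measurement.
    import numpy as np

    dt = 0.1
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)   # state transition for (x, y, vx, vy)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)    # only position is observed
    Q = 0.01 * np.eye(4)                          # process noise (assumed)
    R = 0.1 * np.eye(2)                           # measurement noise (assumed)

    def kf_step(x, P, z):
        # Predict
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Update with measurement z
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (np.eye(4) - K @ H) @ P_pred
        return x_new, P_new

    x, P = kf_step(np.zeros(4), np.eye(4), z=np.array([0.5, 0.2]))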
To overcome this uncertainty, we propose to take the following approaches:
• Replace the Kalman/Particle filter-based temporal mapping of the vehicle’s surroundings with LSTM neural networks (a minimal sketch is given after this list)
• Replace the trajectory estimation with Graph Neural Networks
• Combine long and short-term trajectory goals to improve the adaptability of the trajectory estimation to undetected sources of error
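A minimal sketch of the first proposal, referenced in the list above: an LSTM that accumulates a temporal map state from per-frame surround-view detection features. The module name, feature dimensions and sequence length are hypothetical placeholders, not a specification of the project's network.

    # LSTM-based temporal mapping sketch: per-frame detection features in, map state out.
    import torch
    import torch.nn as nn

    class TemporalMapper(nn.Module):
        def __init__(self, det_dim=64, hidden_dim=128, map_dim=64):
            super().__init__()
            self.lstm = nn.LSTM(det_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, map_dim)   # per-timestep map state

        def forward(self, detections):
            # detections: (batch, time, det_dim) features from the surround-view cameras
            hidden, _ = self.lstm(detections)
            return self.head(hidden)

    mapper = TemporalMapper()
    map_states = mapper(torch.randn(2, 20, 64))   # 2 sequences of 20 frames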
Finally, all the proposed methods require significant data and training effort. It is desirable to build an algorithm that works globally. However, there are limitations on what data can be moved between different jurisdictions, for example, USA – Europe – China. This raises an interesting question of federated learning: can we train one neural network on data held on multiple machines, located potentially in different continents, based upon data gathered in those different locations? For example, can we partially train a network in China, based on data gathered in China, without moving the data out of China? And then complete the training in Europe based on European data? The key point is that, at all times, only a single network is being trained; thus, we have a globally distributed training of the network.
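A minimal federated-averaging style sketch of this idea in Python/PyTorch: each site trains a copy of the shared network on its local data, and only the weights are exchanged and averaged, so the raw data never leaves its jurisdiction. The model, optimiser settings and synthetic per-site datasets are hypothetical placeholders, not the project's actual training setup.

    # Federated averaging sketch: local training per site, then weight averaging.
    import copy
    import torch
    import torch.nn as nn

    def local_update(global_model, data_loader, epochs=1, lr=1e-3):
        model = copy.deepcopy(global_model)           # each site trains a local copy
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in data_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model.state_dict()                     # only weights leave the site

    def federated_round(global_model, site_loaders):
        # Collect locally trained weights from each site (e.g. China, Europe, USA).
        site_states = [local_update(global_model, loader) for loader in site_loaders]
        # Average the weights; the data itself never crosses jurisdictions.
        avg_state = {k: torch.stack([s[k].float() for s in site_states]).mean(0)
                     for k in site_states[0]}
        global_model.load_state_dict(avg_state)
        return global_model

    # Hypothetical usage with synthetic per-site datasets.
    model = nn.Linear(16, 4)
    sites = [[(torch.randn(8, 16), torch.randint(0, 4, (8,)))] for _ in range(3)]
    model = federated_round(model, sites)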
A related question is whether we can reuse old datasets, with limited annotations, to train new, multi-task networks. For example, can we use a dataset that is only annotated for pedestrians to (partially) train a network that is designed to output multiple class types?
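One way this could work, sketched below under assumed class names and sizes: compute a per-class loss and mask out the classes that the old dataset does not annotate, so a pedestrian-only dataset still contributes a useful training signal to a multi-class head.

    # Masked multi-label loss sketch for partially annotated datasets.
    import torch
    import torch.nn as nn

    NUM_CLASSES = 4                                          # e.g. pedestrian, vehicle, cyclist, rider
    PEDESTRIAN_ONLY_MASK = torch.tensor([1., 0., 0., 0.])    # which classes this dataset annotates

    def masked_multilabel_loss(logits, targets, class_mask):
        # Per-class binary cross-entropy, zeroed for unannotated classes so the old
        # dataset neither rewards nor penalises predictions it cannot verify.
        per_class = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")
        return (per_class * class_mask).sum() / class_mask.sum().clamp(min=1)

    logits = torch.randn(8, NUM_CLASSES)                     # multi-class network outputs
    targets = torch.zeros(8, NUM_CLASSES)
    targets[:, 0] = torch.randint(0, 2, (8,)).float()        # only pedestrian labels exist
    loss = masked_multilabel_loss(logits, targets, PEDESTRIAN_ONLY_MASK)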
Thus, we have four sub-projects/work packages:
• SP1 - The adaptation of CNNs and Transformers to work natively on fisheye cameras
• SP2 - Early and intermediate fusion of surround view computer vision outputs with other sensor modalities
• SP3 - Surround-view temporal mapping and trajectory estimation using LSTMs and GNNs
• SP4 - Federated/Distributed neural network training