Vision-based language tasks in Autonomous Driving
Many studies suggest that deep neural control networks are likely to be a key component of self-driving vehicles. These networks are trained on large datasets to imitate human actions, but they lack a semantic understanding of agents' movements in a given scenario. Adding this semantic information can both improve performance and make the system explainable to end-users, which in turn helps increase public confidence in self-driving technology.
With this knowledge of visual common sense, the vehicle control module could be improved to the point where an end-user can instruct an autonomous car to complete specific actions.
To do so, we are looking at an approach built on a joint model that learns task-agnostic visual grounding from paired visiolinguistic data, extending the BERT language model to jointly reason about text and images. As shown in the diagram, we intend to process the image and the question in two separate streams and then combine them in a single deep-learning model that can answer the question.
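The two-stream idea above can be sketched in miniature. This is only an illustration, not the actual model: the feature dimensions, the randomly initialized weights, and the simple concatenation fusion (standing in for the co-attentional layers a BERT-style joint model would use) are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (assumptions, not from the proposal).
IMG_DIM, TXT_DIM, JOINT_DIM, NUM_ANSWERS = 2048, 768, 512, 10

# Randomly initialized projections stand in for trained parameters.
W_img = rng.normal(0, 0.02, (IMG_DIM, JOINT_DIM))
W_txt = rng.normal(0, 0.02, (TXT_DIM, JOINT_DIM))
W_out = rng.normal(0, 0.02, (2 * JOINT_DIM, NUM_ANSWERS))

def answer_logits(image_feat, question_feat):
    """Project each stream into a joint space, fuse, and score answers."""
    v = np.tanh(image_feat @ W_img)      # visual stream
    t = np.tanh(question_feat @ W_txt)   # language stream
    joint = np.concatenate([v, t], axis=-1)  # late fusion of the two streams
    return joint @ W_out                 # one logit per candidate answer

# One dummy image/question feature pair.
logits = answer_logits(rng.normal(size=IMG_DIM), rng.normal(size=TXT_DIM))
print(logits.shape)  # (10,)
```

In the real model the two streams would interact through attention rather than a single concatenation, but the overall shape of the computation (separate encoders, then fusion, then answer scoring) is the same.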
For this, we are creating a dataset entirely in the context of driving, starting from a large-scale naturalistic driving dataset collected in the San Francisco Bay Area using a single front-view camera (released by Honda), and we are in the process of annotating the collected images. Once the dataset is ready, we will use it to train and fine-tune our model.
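A paired visiolinguistic annotation for one frame might look like the record below. The field names and values here are purely hypothetical, chosen for illustration; they are not the actual schema of the Honda dataset or of our annotations.

```python
import json

# A hypothetical annotation record for a single camera frame; the schema
# is an assumption, not the dataset's real format.
record = {
    "frame_id": "session_0001_frame_000123",
    "question": "Why is the car in front slowing down?",
    "answer": "A pedestrian is crossing at the intersection ahead.",
    "objects": [
        {"label": "pedestrian", "bbox": [412, 230, 468, 355]},
        {"label": "car", "bbox": [520, 280, 700, 420]},
    ],
}

# A JSON-lines round trip, as one might use when storing annotations.
line = json.dumps(record)
assert json.loads(line) == record
```

Keeping each frame's question, answer, and grounded object boxes in one record makes it straightforward to build the paired image-text batches the joint model is trained on.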