Estimating Position & Velocity in 3D Space From Monocular Video Sequences Using a Deep Neural Network

Arturo Marban, Vignesh Srinivasan, Wojciech Samek, Josep Fernandez, Alicia Casals; Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2017, pp. 1460-1469

Abstract


This work describes a regression model based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks for tracking objects in monocular video sequences. The target application is Vision-Based Sensor Substitution (VBSS). In particular, the tool-tip position and velocity in 3D space of a pair of surgical robotic instruments (SRI) are estimated for three surgical tasks, namely suturing, needle-passing and knot-tying. The CNN extracts features from individual video frames, and the LSTM network processes these features over time and continuously outputs a 12-dimensional vector with the estimated position and velocity values. A series of analyses and experiments are carried out on the regression model to reveal the benefits and drawbacks of different design choices...
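The sketch below illustrates the general CNN + LSTM regression pattern described in the abstract: a CNN encodes each frame, an LSTM aggregates the per-frame features over time, and a linear head regresses a 12-dimensional output (3D position and 3D velocity for each of the two instruments). It is a minimal illustration under assumed layer sizes and a toy backbone, not the authors' architecture.

```python
# Illustrative sketch only: backbone, feature/hidden sizes, and input
# resolution are assumptions, not taken from the paper.
import torch
import torch.nn as nn


class CnnLstmRegressor(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, out_dim=12):
        super().__init__()
        # Small CNN feature extractor applied to every frame independently.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM processes the sequence of per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Linear head outputs position + velocity at every time step.
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.head(hidden)  # (batch, time, 12)


if __name__ == "__main__":
    video = torch.randn(2, 16, 3, 96, 96)   # dummy clip: 2 sequences of 16 frames
    print(CnnLstmRegressor()(video).shape)  # torch.Size([2, 16, 12])
```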

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Marban_2017_ICCV,
author = {Marban, Arturo and Srinivasan, Vignesh and Samek, Wojciech and Fernandez, Josep and Casals, Alicia},
title = {Estimating Position \& Velocity in 3D Space From Monocular Video Sequences Using a Deep Neural Network},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2017}
}