Cookies on this website

We use cookies to ensure that we give you the best experience on our website. If you click 'Accept all cookies' we'll assume that you are happy to receive all cookies and you won't see this message again. If you click 'Reject all non-essential cookies' only necessary cookies providing core functionality such as security, network management, and accessibility will be enabled. Click 'Find out more' for information on how to change your cookie settings.

In this paper we develop a multi-modal video analysis algorithm to predict where a sonographer should look next. Our approach uses video and expert knowledge, defined by gaze tracking data, which is acquired during routine first-trimester fetal ultrasound scanning. Specifically, we propose a spatio-temporal convolutional LSTMU-Net neural network (cLSTMU-Net) for video saliency prediction with stochastic augmentation. The architecture design consists of a U-Net based encoder-decoder network and a cLSTM to take into account temporal information. We compare the performance of the cLSTMU-Net alongside spatial-only architectures for the task of predicting gaze in first trimester ultrasound videos. Our study dataset consists of 115 clinically acquired first trimester US videos and a total of 45, 666 video frames. We adopt a Random Augmentation strategy (RA) from a stochastic augmentation policy search to improve model performance and reduce over-fitting. The proposed cLSTMU-Net using a video clip of 6 frames outperforms the baseline approach on all saliency metrics: KLD, SIM, NSS and CC (2.08, 0.28, 4.53 and 0.42 versus 2.16, 0.27, 4.34 and 0.39).

Original publication




Conference paper

Publication Date





Fetal ultrasound, U-Net, convolutional LSTM, first trimester, gaze tracking, stochastic augmentation, video saliency prediction