Dual Representation Learning From Fetal Ultrasound Video and Sonographer Audio
Gridach M., Alsharid M., Jiao J., Drukker L., Papageorghiou AT., Alison Noble J.
This paper tackles the challenging problem of self-supervised representation learning from real-world data in two modalities: fetal ultrasound (US) video and the corresponding speech acquired while a sonographer performs a pregnancy scan. We propose to transfer knowledge between the two modalities, even though the sonographer's speech and the US video may not be semantically correlated. We design a network architecture that learns useful representations, such as those of anatomical features and structures, while recognising the correlation between an US video scan and the sonographer's speech. We introduce dual representation learning from US video and audio, which combines two concepts, Multi-Modal Contrastive Learning and Multi-Modal Similarity Learning, in a latent feature space. Experiments show that the proposed architecture learns powerful representations that transfer well to two downstream tasks. Furthermore, we pretrain on two different datasets that differ in size and in the length of the video clips (and of the accompanying sonographer speech), showing that the quality of the sonographer's speech plays an important role in the final performance.
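As a rough illustration of the two objectives named in the abstract, the sketch below pairs a generic multi-modal contrastive (InfoNCE-style) loss with a cosine-similarity alignment loss between video and audio embeddings in a shared latent space. This is a minimal sketch of the general technique, not the authors' implementation; the encoder outputs, batch and embedding sizes, and the temperature value are assumptions introduced purely for illustration.

```python
# Minimal sketch (not the paper's code): a generic multi-modal contrastive
# (InfoNCE-style) loss plus a similarity (alignment) loss between
# ultrasound-video and sonographer-audio embeddings.
# Embedding sizes and the temperature are assumed for illustration.
import torch
import torch.nn.functional as F


def dual_losses(video_emb: torch.Tensor,
                audio_emb: torch.Tensor,
                temperature: float = 0.07):
    """video_emb, audio_emb: (batch, dim) embeddings from separate encoders."""
    # L2-normalise so dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # Multi-modal contrastive loss: matched video/audio pairs are positives,
    # all other pairs in the batch are negatives (symmetric InfoNCE).
    logits = v @ a.t() / temperature                  # (batch, batch)
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Multi-modal similarity loss: pull each matched pair together in the
    # latent space by maximising their cosine similarity.
    similarity = (1.0 - (v * a).sum(dim=-1)).mean()

    return contrastive, similarity


# Usage with random stand-in embeddings (batch of 8, 256-dim):
if __name__ == "__main__":
    video_emb = torch.randn(8, 256)
    audio_emb = torch.randn(8, 256)
    c_loss, s_loss = dual_losses(video_emb, audio_emb)
    print(float(c_loss), float(s_loss))
```

In practice the two losses would be weighted and summed into a single pretraining objective, with the video and audio encoders trained jointly; the weighting is a design choice not specified here.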