Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

Cheng Zeng, Xinyu Yang, Majid Mirmehdi, Alberto M. Gambaruto, Tilo Burghardt
We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional… 
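The abstract describes a pipeline of per-frame CNN features, multi-frame blending via a Temporal Context Module (TCM), and a Transformer stage. The paper's actual TCM design is not given in this summary; the sketch below is a hypothetical illustration of the blending idea only, using a similarity-weighted (softmax) average of neighbouring frames' feature maps around a centre frame.

```python
import numpy as np

def temporal_blend(feats, center, window=1):
    """Blend feature maps of neighbouring frames into the centre frame.

    feats: (T, C, H, W) per-frame CNN feature maps.
    Hypothetical similarity-weighted average; the paper's TCM is more
    elaborate, this only illustrates multi-frame feature blending
    ahead of the Transformer stage.
    """
    T = feats.shape[0]
    lo, hi = max(0, center - window), min(T, center + window + 1)
    ref = feats[center].ravel()
    weights = []
    for t in range(lo, hi):
        f = feats[t].ravel()
        # cosine similarity between centre frame and neighbour t
        weights.append(ref @ f / (np.linalg.norm(ref) * np.linalg.norm(f) + 1e-8))
    weights = np.exp(weights) / np.sum(np.exp(weights))  # softmax over neighbours
    return sum(w * feats[t] for w, t in zip(weights, range(lo, hi)))

feats = np.random.rand(5, 8, 4, 4)   # 5 frames, 8 channels, 4x4 feature maps
out = temporal_blend(feats, center=2)
print(out.shape)                     # (8, 4, 4)
```

The blended map keeps the per-frame shape, so it can be fed to the downstream Transformer and decoder unchanged.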

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

It is argued that Transformers can serve as strong encoders for medical image segmentation tasks, combined with a U-Net to enhance finer details by recovering localized spatial information; empirical results suggest that this Transformer-based architecture leverages self-attention better than previous CNN-based self-attention methods.

T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos

T-CNN, a deep learning framework that incorporates temporal and contextual information from tubelets obtained in videos, is proposed; it dramatically improves the baseline performance of existing still-image detection frameworks when they are applied to videos.

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

This paper presents UNet++, a new, more powerful architecture for medical image segmentation in which the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways; it argues that the optimizer deals with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar.
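The "nested, dense skip pathways" of UNet++ can be pictured as a grid of nodes, where each node concatenates every previous node at its own resolution level with the upsampled node from the level below. The sketch below is a minimal shape-level illustration of that wiring only; `conv` and `upsample` are hypothetical stand-ins for the real convolution blocks and up-convolutions.

```python
import numpy as np

def conv(x):
    # stand-in for a conv block (here just a ReLU keeping the channel count)
    return np.maximum(x, 0)

def upsample(x):
    # nearest-neighbour 2x upsampling, stand-in for an up-convolution
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# Encoder column X[(i, 0)] at three resolutions, arrays of shape (C, H, W)
X = {(0, 0): np.random.rand(4, 8, 8),
     (1, 0): np.random.rand(4, 4, 4),
     (2, 0): np.random.rand(4, 2, 2)}

# Nested dense skip pathway: node X[(i, j)] concatenates all earlier nodes
# at level i with the upsampled node from level i+1, column j-1.
for j in (1, 2):
    for i in range(0, 3 - j):
        prev = [X[(i, k)] for k in range(j)]
        X[(i, j)] = conv(np.concatenate(prev + [upsample(X[(i + 1, j - 1)])], axis=0))

print(X[(0, 2)].shape)   # (20, 8, 8): dense concatenation grows the channel count
```

Because intermediate nodes sit between encoder and decoder features, the concatenated maps at each node are semantically closer, which is the easier-optimization argument the summary mentions.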

U-Net: Convolutional Networks for Biomedical Image Segmentation

It is shown that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.

Automated Bolus Detection in Videofluoroscopic Images of Swallowing Using Mask-RCNN

A computer-aided method is developed to automate bolus detection and tracking in videofluoroscopic images during X-ray-based diagnostic swallowing examinations; results indicate robust detection that can help improve the speed and accuracy of the clinical decision-making process.

Automatic Detection of the Pharyngeal Phase in Raw Videos for the Videofluoroscopic Swallowing Study Using Efficient Data Collection and 3D Convolutional Networks

A novel approach that uses 3D convolutional networks to detect the pharyngeal phase in raw VFSS videos without manual annotations is presented; the proposed method greatly reduces the examination time of VFSS images with a low miss rate.

Attention U-Net: Learning Where to Look for the Pancreas

A novel attention gate (AG) model for medical imaging is proposed that automatically learns to focus on target structures of varying shapes and sizes, eliminating the need for the explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs).
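An attention gate combines the skip-connection features with a coarser "gating" signal from the decoder and produces per-location coefficients that suppress irrelevant regions. The following is a simplified additive-attention sketch of that mechanism; the weight matrices `Wx`, `Wg`, `psi` are random placeholders for learned parameters, and spatial resampling between levels is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_gate(x, g, Wx, Wg, psi):
    """Simplified additive attention gate.

    x:  skip-connection features, shape (C, N) with N = H*W locations
    g:  gating signal from the coarser decoder level, shape (Cg, N)
    Returns x scaled per-location by attention coefficients in (0, 1).
    """
    q = np.maximum(Wx @ x + Wg @ g, 0)      # ReLU(W_x x + W_g g)
    alpha = 1 / (1 + np.exp(-(psi @ q)))    # sigmoid -> coefficients, shape (1, N)
    return x * alpha                        # down-weight irrelevant locations

C, Cg, Ci, N = 8, 8, 4, 16
x = rng.standard_normal((C, N))
g = rng.standard_normal((Cg, N))
Wx = rng.standard_normal((Ci, C))
Wg = rng.standard_normal((Ci, Cg))
psi = rng.standard_normal((1, Ci))
out = attention_gate(x, g, Wx, Wg, psi)
print(out.shape)   # (8, 16)
```

Since the coefficients lie in (0, 1), the gate can only attenuate features, which is how it learns "where to look" without a separate localisation module.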

PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume

PWC-Net is designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume; it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks.
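Of the three principles, the cost volume is the most concrete: it correlates each location in one frame's feature map with a local neighbourhood of the next frame's. The sketch below is an illustrative, unoptimized version of such a local correlation volume (PWC-Net builds this at each pyramid level on warped features, which is omitted here).

```python
import numpy as np

def cost_volume(f1, f2, d=2):
    """Local correlation cost volume between two feature maps.

    f1, f2: (C, H, W) feature maps of consecutive frames.
    For each pixel of f1, correlate with f2 over offsets in [-d, d]^2,
    giving a ((2d+1)^2, H, W) volume; out-of-bounds offsets score 0.
    """
    C, H, W = f1.shape
    vol = np.zeros(((2 * d + 1) ** 2, H, W))
    k = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = np.zeros_like(f2)
            ys, xs = slice(max(0, dy), min(H, H + dy)), slice(max(0, dx), min(W, W + dx))
            yt, xt = slice(max(0, -dy), min(H, H - dy)), slice(max(0, -dx), min(W, W - dx))
            shifted[:, ys, xs] = f2[:, yt, xt]
            vol[k] = (f1 * shifted).sum(axis=0) / C   # channel-normalised dot product
            k += 1
    return vol

f1 = np.random.rand(8, 6, 6)
f2 = np.random.rand(8, 6, 6)
print(cost_volume(f1, f2).shape)   # (25, 6, 6)
```

Restricting the search range to a small `d` keeps the volume compact, which is part of why the pyramid-plus-warping design stays efficient.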

Automatic hyoid bone detection in fluoroscopic images using deep learning

This study proposes a single-shot multibox detector, a deep convolutional neural network, to detect and classify the location of the hyoid bone in a frame, and shows that it outperforms other auto-detection algorithms.