Densely connected multidilated convolutional networks for dense prediction tasks

  • Naoya Takahashi, Yuki Mitsufuji
  • Published 21 November 2020
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Tasks that involve high-resolution dense prediction require a modeling of both local and global patterns in a large input field. Although the local and global structures often depend on each other and their simultaneous modeling is important, many convolutional neural network (CNN)-based approaches interchange representations in different resolutions only a few times. In this paper, we claim the importance of a dense simultaneous modeling of multiresolution representation and propose a novel… 

Multi-scale Spatial Representation Learning via Recursive Hermite Polynomial Networks

Recursive Hermite Polynomial Networks are proposed, which recursively construct sub-scale representations to avoid the artifacts caused by naively applying dilated convolution, and which show superiority over state-of-the-art alternatives on a variety of image recognition tasks.

KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing

A two-stream neural network for music demixing, called KUIELab-MDX-Net, which shows a good balance of performance and required resources and blends results from two streams to generate the final estimation.

Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

An impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR) and an ablation study to discuss the contribution and need for each component within the IRS.

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

This ACCDOA-based system, which combines an efficient network architecture called RD3Net with data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection; the work also proposes impulse response simulation (IRS), which generates simulated multi-channel signals.

End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression

This paper exploits the offset-compensating property of complex time-frequency masks and proposes an end-to-end complex-valued neural network architecture that uses the multi-resolution nature of the D3Net to eliminate the need for pooling, allowing feature extraction with large receptive fields without any loss of output resolution.

What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis

Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest, and the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

Amicable Examples for Informed Source Separation

  • Naoya Takahashi, Yuki Mitsufuji
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
This work improves the performance of a pre-trained separation model that does not use any side-information and proposes multi-model multi-purpose learning that controls the effect of the perturbation on different models individually.

Fusing CNNs and Transformers for Deformable Medical Image Registration

  • Dashuai Hu
  • Computer Science
  • 2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI)
  • 2022
A novel parallel architecture combining CNNs and Transformers, FTNet, to tackle deformable image registration, which achieves superior registration accuracy against other state-of-the-art methods while maintaining desired diffeomorphic properties of deformation fields.

Robust One-Shot Singing Voice Conversion

A robust one-shot SVC (ROSVC) that performs any-to-any SVC robustly even on such distorted singing voices using less than 10s of a reference voice is proposed.

Conditioned Source Separation by Attentively Aggregating Frequency Transformations With Self-Conditioning

It is shown that the conditioned U-Net employing the enhanced LaSAFT blocks outperforms the previous model and performs the audio-query–based separation with a slight modification.



Densely Connected Convolutional Networks

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

Deep High-Resolution Representation Learning for Visual Recognition

The superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, is shown, suggesting that the HRNet is a stronger backbone for computer vision problems.

PSANet: Point-wise Spatial Attention Network for Scene Parsing

The point-wise spatial attention network (PSANet) is proposed to relax the local neighborhood constraint and achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.

Pyramid Scene Parsing Network

This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Multi-Scale Context Aggregation by Dilated Convolutions

This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.

U-Net: Convolutional Networks for Biomedical Image Segmentation

It is shown that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.

Fully convolutional networks for semantic segmentation

The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.

MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation

A novel architecture that integrates long short-term memory (LSTM) in multiple scales with skip connections to efficiently model long-term structures within an audio context is proposed and yields better results than those obtained using ideal binary masks for a singing voice separation task.

PSConv: Squeezing Feature Pyramid into One Compact Poly-Scale Convolutional Layer

The proposed convolution operation, named Poly-Scale Convolution (PSConv), mixes a spectrum of dilation rates and tactfully allocates them across the individual convolutional kernels of each filter within a single convolutional layer.