• Corpus ID: 221088971

Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds

Shota Ikawa and Kunio Kashino
As a means of searching for desired audio signals stored in a database, we consider using a string of an onomatopoeic word, namely a word that imitates a sound, as a query, which allows the user to specify the desired sound by verbally mimicking the sound or typing the sound word, or the word containing sounds similar to the desired sound. However, it is generally difficult to realize such a system based on text similarities between the onomatopoeic query and the onomatopoeic tags associated… 

Environmental Sound Extraction Using Onomatopoeic Words
Experimental results indicate that the proposed environmental-sound-extraction method can extract only the target sound corresponding to the onomatopoeic word and performs better than conventional methods that use sound-event classes to specify the target sound.
Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model
An audio captioning system is proposed that describes non-speech audio signals in natural language, generating a sentence that describes the sounds rather than an object label or onomatopoeia.
Onomatopoeia, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an
Audio Retrieval with Natural Language Queries: A Benchmark Study
This work employs three challenging new benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, and introduces the SOUNDDESCS benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AUDIOCAPS and CLOTHO.
On Metric Learning for Audio-Text Cross-Modal Retrieval
An extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets shows that NT-Xent loss adapted from self-supervised learning shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses.
Environmental Sound Extraction Using Onomatopoeia
Experimental results indicate that the proposed environmental-sound-extraction method can extract only the target sound corresponding to onomatopoeia and performs better than conventional methods that use sound-event classes to specify the target sound to be extracted.


Generating Sound Words from Audio Signals of Acoustic Events with Sequence-to-Sequence Model
  • Shota Ikawa, K. Kashino
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
The method is based on an end-to-end, sequence-to-sequence framework that addresses two problems: the audio segmentation problem, namely finding an appropriate segment of the audio signal along time that corresponds to a sequence of phonemes, and the ambiguity problem, where multiple words may correspond to the same sound depending on the situation or listener.
The Acoustic Sound Field Dictation with Hidden Markov Model Based on an Onomatopeia
In this study, we realized acoustic sound field dictation which is effective for the security systems because it can quickly find an abnormal sound on the basis of text information from a captured
Classification of sound clips by two schemes: Using onomatopoeia and semantic labels
Using the recently proposed framework for latent perceptual indexing of audio clips, we present classification of whole clips categorized by two schemes: high-level semantic labels and the mid-level
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Automatic transformation of environmental sounds into sound-imitation words based on Japanese syllable structure
A three-stage architecture for the automatic transformation of environmental sounds into sound-imitation words is presented: segmenting sound signals into syllables, identifying syllable structure as mora, and recognizing mora as phonemes.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Content-Based Classification, Search, and Retrieval of Audio
The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features, which lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features.
Visualizing Video Sounds With Sound Word Animation to Enrich User Experience
The results of the user study show that the animated sound words can effectively and naturally visualize the dynamics of sound while clarifying the position of the sound source as well as contribute to making video-watching more enjoyable and increasing the visual impact of videos.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.