Audio-Based Activities of Daily Living (ADL) Recognition with Large-Scale Acoustic Embeddings from Online Videos

@article{Liang2019AudioBasedAO,
  title={Audio-Based Activities of Daily Living (ADL) Recognition with Large-Scale Acoustic Embeddings from Online Videos},
  author={Dawei Liang and Edison Thomaz},
  journal={Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
  year={2019},
  volume={3},
  pages={1--18}
}
  • Dawei Liang, Edison Thomaz · Published 19 October 2018 · Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Over the years, activity sensing and recognition has been shown to play a key enabling role in a wide range of applications, from sustainability and human-computer interaction to health care. While many recognition tasks have traditionally employed inertial sensors, acoustic-based methods offer the benefit of capturing rich contextual information, which can be useful when discriminating complex activities. Given the emergence of deep learning techniques and leveraging new, large-scale… 
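As a rough illustration of the pipeline the abstract describes, the sketch below trains a lightweight classifier on pre-computed, clip-level acoustic embeddings. It is a minimal sketch, not the paper's exact method: the 128-dimensional (VGGish-style) embeddings, file names, and choice of classifier are all assumptions for illustration.

```python
# Minimal sketch: ADL recognition on top of pre-computed acoustic embeddings.
# Assumes clip-level 128-d embeddings (e.g., VGGish-style) were already
# extracted from web-video audio; paths and labels below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = np.load("web_video_embeddings.npy")   # shape (n_clips, 128), hypothetical file
y = np.load("web_video_adl_labels.npy")   # e.g. "cooking", "vacuuming", "shaving"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A simple linear model is a common baseline over pretrained embeddings;
# the paper's own classifier choice may differ.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```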

Citations

Automated Class Discovery and One-Shot Interactions for Acoustic Activity Recognition
This work built an end-to-end system for self-supervised learning of acoustic events labelled through one-shot interaction, and shows that the system can accurately and automatically learn acoustic events across environments while adhering to users' preferences for non-intrusive interactive behavior (a simplified one-shot matching sketch appears at the end of this citation list).
LASO: Exploiting Locomotive and Acoustic Signatures over the Edge to Annotate IMU Data for Human Activity Recognition
This paper proposes LASO, a multimodal approach for automated data annotation from acoustic and locomotive information, and uses pre-trained audio-based activity recognition models to label the IMU data while handling acoustic noise.
Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach
A transfer learning-based approach to detect foreground speech of users wearing a smartwatch, based on knowledge transfer from general-purpose speaker representations derived from public datasets; it performs comparably to a fully supervised model.
Audiovisual Classification of Group Emotion Valence Using Activity Recognition Networks
The results show that using activity recognition pretraining offers performance advantages for group-emotion recognition and that audio is essential to improve the accuracy and robustness of video-based recognition.
Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study
A dual-branch neural network architecture is developed for the joint learning of voice and acoustic features during an AED process, and thorough empirical studies examine performance on the public AudioSet with different types of inputs.
Cross-Dataset Activity Recognition via Adaptive Spatial-Temporal Transfer Learning
An Adaptive Spatial-Temporal Transfer Learning (ASTTL) approach is proposed to tackle the challenges of cross-dataset HAR; it can be used for both source domain selection and accurate activity transfer.
Vid2Doppler: Synthesizing Doppler Radar Data from Videos for Training Privacy-Preserving Activity Recognition
This work sets out to create a software pipeline that converts videos of human activities into realistic, synthetic Doppler radar data, and shows how this cross-domain translation can be successful through a series of experimental results.
IMU2Doppler: Cross-Modal Domain Adaptation for Doppler-based Activity Recognition Using IMU Data
This paper uses off-the-shelf smartwatch IMU datasets to train an activity recognition system for an mmWave radar sensor with minimally labeled data, and demonstrates that the approach outperforms the baseline in every single scenario.
Ok Google, What Am I Doing?
This work explores how off-the-shelf conversational assistants can be enhanced with acoustic-based human activity recognition by leveraging the short interval after a voice command is given to the device.
Streamlining Action Recognition in Autonomous Shared Vehicles with an Audiovisual Cascade Strategy
This work proposes processing audio and visual data in a cascade pipeline for in-vehicle action recognition, showing an interesting accuracy-acceleration trade-off compared with a parallel pipeline with late fusion and presenting potential for industrial applications on embedded devices.
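The one-shot interaction idea in the first citation above can be pictured, in much simplified form, as matching new clips against a single labelled embedding per acoustic event class. This is only a hedged sketch, not the cited system's actual self-supervised pipeline; the class names and 128-d embedding size are assumptions.

```python
# Simplified one-shot acoustic event matching over embeddings (illustrative only).
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    return v / (np.linalg.norm(v) + 1e-9)

# One labelled example (embedding) per event class -- the "one shot".
# Class names and the 128-d size are hypothetical placeholders.
prototypes = {
    "door_knock": l2_normalize(np.random.rand(128)),
    "faucet_running": l2_normalize(np.random.rand(128)),
    "microwave_beep": l2_normalize(np.random.rand(128)),
}

def classify(clip_embedding, prototypes):
    """Return the class whose single labelled prototype is most similar."""
    clip = l2_normalize(clip_embedding)
    return max(prototypes, key=lambda name: float(np.dot(clip, prototypes[name])))

print(classify(np.random.rand(128), prototypes))
```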

References

Showing 1-10 of 47 references
DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning
This paper presents DeepEar, the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks, and shows that DeepEar is feasible on smartphones by building a cloud-free, DSP-based prototype that runs continuously using only 6% of the smartphone's battery daily.
Towards scalable activity recognition: adapting zero-effort crowdsourced acoustic models
This work investigates two adaptation approaches: semi-supervised learning to combine crowd-sourced data with unlabeled user data, and active learning to query the user for labels on samples that the crowd-sourced model fails to recognize.
Ubicoustics: Plug-and-Play Acoustic Activity Recognition
This work describes a novel, real-time, sound-based activity recognition system that starts by taking an existing, state-of-the-art sound labeling model, which is then tuned to classes of interest by drawing data from professional sound effect libraries traditionally used in the entertainment industry.
Environmental audio scene and activity recognition through mobile-based crowdsourcing
A crowdsourcing framework that models the combination of scene, event, and phone context to overcome environmental audio recognition issues is proposed; audio scenes, events, and phone context are classified with 85.2%, 77.6%, and 88.9% accuracy, respectively.
Combining crowd-generated media and personal data: semi-supervised learning for context recognition
This work uses a semi-supervised Gaussian mixture model to combine labeled data from a crowd-generated database with unlabeled personal recordings to train a personalized model for context recognition on users' mobile phones.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help up to a point.
Recognizing Daily Life Context Using Web-Collected Audio Data
Crowd-sourced textual descriptions of individual sound samples were used in a configurable recognition system to model 23 sound context categories, enabling recognition of daily life contexts from web-collected audio data.
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates Audio Set classification. Audio Set is a large-scale, weakly labelled dataset (WLD) of audio clips; in a WLD, only the presence of a label is known, without knowing the…
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation (a rough augmentation sketch appears after this reference list).
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
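As a rough illustration of the waveform-level augmentations discussed in the environmental sound classification reference above, the sketch below applies time stretching, pitch shifting, and additive noise. It is a sketch under assumed file names and parameters, not that paper's exact augmentation recipe.

```python
# Common waveform-level augmentations for environmental sound classification.
# "clip.wav" and all parameter values are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)                     # placeholder input

stretched = librosa.effects.time_stretch(y, rate=0.9)          # slow down by 10%
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)     # up two semitones

noise = np.random.randn(len(y)).astype(y.dtype)
noisy = y + 0.05 * np.max(np.abs(y)) * noise                   # weak background noise

augmented_examples = [stretched, shifted, noisy]               # extra training data
```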