Detecting anomalous sound events in audio surveillance is challenging in realistic settings. Part of the difficulty stems from properly defining the `normal' behavior of a crowd or an environment (e.g., airport, train station, sports field). A model that successfully captures the heterogeneous nature of sound events in an acoustic environment can serve as a reference against which anomalous behavior is detected in continuous audio recordings. The current study proposes a methodology for representing sound classes using a hierarchical network of convolutional features and a mixture of temporal trajectories (MTT). The framework couples unsupervised and supervised learning and provides a robust scheme for detecting abnormal sound events in a subway station. The results reveal the strength of the proposed representation in capturing non-trivial commonalities within a single sound class and variabilities across different sound classes, as well as a high degree of robustness to noise.