Machine speech recognition scores decrease significantly compared with those of humans in difficult environments, e.g., when the noise exhibits nonstationary characteristics. In particular, standard speech features such as Mel Frequency Cepstral Coefficients (MFCCs) or RelAtive SpectrAl (RASTA) features show good performance in clean conditions but deteriorate strongly in the presence of noise. In contrast, spectro-temporal features have achieved promising results in such situations [3, 4]. Unlike standard features, they can detect, for instance, steady formant transitions in the spectro-temporal representation. Most of them use Gabor filters, whereas we developed features inspired by a hierarchical system for visual object recognition. We refer to them as Hierarchical Spectro-Temporal (HIST) features; their extraction scheme is depicted in Fig. 1.