Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots directly interact with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities to control our mobile robot named Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the upper human body extremities in order to track and recognize pointing/symbolic gestures, both mono- and bi-manual. This framework constitutes our first contribution: it is shown to properly handle natural artifacts (self-occlusion, hands leaving the camera field of view, hand deformation) when 3D gestures are performed with either hand or with both. A speech recognition and understanding system based on the Julius engine is also developed and embedded in order to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses the results of the speech and gesture components. This interpreter is shown to improve the classification rates of multimodal commands compared to using either modality alone.

Electronic supplementary material: The online version of this article (doi:10.1007/s10514-011-9263-y) contains supplementary material, which is available to authorized users.

B. Burger (✉) · F. Lerasle
CNRS, LAAS, 7 avenue du Colonel Roche, 31077 Toulouse Cedex, France
e-mail: email@example.com

F. Lerasle
e-mail: firstname.lastname@example.org

B. Burger · I. Ferrané
IRIT, Université de Toulouse, 118 route de Narbonne, 31062 Toulouse Cedex, France

I. Ferrané
e-mail: email@example.com

I. Ferrané · F. Lerasle
Université de Toulouse, UPS, INSA, INP, ISAE; UT1, UTM, LAAS, 31077 Toulouse Cedex, France

G. Infantes
Onera, 2 avenue Edouard Belin, 31055 Toulouse Cedex 4, France
e-mail: firstname.lastname@example.org
Finally, we report on successful live experiments in human-centered settings. Results are presented in the context of an interactive manipulation task, in which users specify local motion commands to Jido and perform safe object exchanges.