This paper presents a real-time framework that combines depth data and infrared laser speckle pattern (ILSP) images, both captured by a Kinect device, to recognize static hand gestures for interaction with CAVE applications. At system startup, background removal and hand position detection are performed using only the depth map. Tracking then begins, using the hand positions from previous frames to locate the hand centroid in the current frame. This centroid serves as the seed for a region growing algorithm that segments the hand in the depth map. The resulting mask is then used to segment the hand in the ILSP frame sequence. Next, we apply motion restrictions for gesture spotting, marking each frame as ‘Gesture’ or ‘Non-Gesture’. The ILSP counterparts of the frames labeled ‘Gesture’ are enhanced through mask subtraction, contrast stretching, median filtering, and histogram equalization. The enhanced images are the input to feature extraction with the scale-invariant feature transform (SIFT) algorithm, followed by bag-of-visual-words construction and classification with a multi-class support vector machine (SVM). Finally, we build a grammar over the hand gesture classes to convert the classification results into control commands for the CAVE application. Our tests and comparisons show that the implemented plugin is an efficient solution: it achieves state-of-the-art recognition accuracy and supports efficient object manipulation in a virtual scene visualized in the CAVE.
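The depth-based segmentation step can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `region_grow` and `apply_mask`, the 4-connected neighborhood, the depth tolerance `tol`, and the toy 6x6 frames are all assumptions chosen for illustration. The sketch grows a region outward from a seed pixel (standing in for the tracked hand centroid) over neighboring pixels of similar depth, then applies the resulting mask to a second image standing in for the ILSP frame.

```python
from collections import deque


def region_grow(depth, seed, tol=15):
    """Grow a 4-connected region from `seed` over pixels whose depth differs
    from the neighbor they were reached from by at most `tol`.
    `tol` and the connectivity are illustrative assumptions."""
    rows, cols = len(depth), len(depth[0])
    mask = [[False] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(depth[nr][nc] - depth[r][c]) <= tol:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


def apply_mask(image, mask):
    """Keep the pixels of `image` where `mask` is True; zero out the rest."""
    return [
        [px if keep else 0 for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy data (hypothetical): a 3x3 "hand" at depth 800 on a background at 2000.
depth = [[2000] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(1, 4):
        depth[r][c] = 800
ilsp = [[100] * 6 for _ in range(6)]  # stand-in for an ILSP frame

hand_mask = region_grow(depth, seed=(2, 2))  # seed plays the role of the centroid
hand_ilsp = apply_mask(ilsp, hand_mask)
```

Comparing each candidate pixel with the neighbor it was reached from, rather than with the seed, lets the region follow gradual depth changes across a surface while still stopping at the sharp depth discontinuity between hand and background.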