Efficient and effective handling of video documents depends on the availability of indexes, and manual indexing is infeasible for large video collections. Video combines different types of data from different modalities, and exploiting information from multiple modalities can yield more robust and accurate video retrieval. Effective indexing for video retrieval therefore requires a multimodal approach, in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. This paper presents a new metric access method -- the Slim2-tree -- which combines information from multiple modalities within a single index structure for video retrieval. Experimental studies on a large real-world dataset demonstrate the video similarity search performance of the proposed technique. Additionally, we report experiments comparing our method against state-of-the-art multimodal solutions; the comparative results show that our technique improves the performance of video similarity queries.