Automatic image tagging is important yet challenging due to the semantic gap and the lack of training examples to model a tag's visual diversity. Meanwhile, social user tagging is creating rich multimedia content on the web. In this paper, we propose to combine the two tagging approaches in a search-based framework. For an unlabeled image, we first retrieve its visual neighbors from a large user-tagged image database. We then select relevant tags from the retrieved images to annotate the unlabeled image. To tackle the unreliability and sparsity of user tagging, we introduce a joint-modality tag relevance estimation method which efficiently exploits both textual and visual clues. Experiments on 1.5 million Flickr photos and 10 000 Corel images verify the effectiveness of the proposed method.
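
The two-step pipeline described above (retrieve visual neighbors, then select tags from them) can be illustrated with a minimal sketch. It is not the joint-modality relevance estimation proposed in the paper: it assumes a plain majority vote over the tags of the k nearest neighbors, Euclidean distance on toy two-dimensional feature vectors, and a tiny in-memory database. The names `nearest_neighbors`, `suggest_tags`, and `DATABASE` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy user-tagged database: (feature vector, user tags).
# Real systems would use high-dimensional visual features; these
# 2-D vectors are an assumption made purely for illustration.
DATABASE = [
    ((0.0, 0.0), ["beach", "sea"]),
    ((0.1, 0.0), ["beach", "sunset"]),
    ((0.0, 0.2), ["sea", "beach"]),
    ((5.0, 5.0), ["city", "night"]),
    ((5.1, 4.9), ["city", "street"]),
]


def nearest_neighbors(query, database, k):
    """Return the k entries whose features are closest to the query."""
    return sorted(database, key=lambda item: math.dist(query, item[0]))[:k]


def suggest_tags(query, database, k=3, top_n=2):
    """Rank tags by how many of the k visual neighbors carry them.

    This is a simple neighbor vote (an assumption of this sketch),
    not the paper's joint-modality tag relevance estimate.
    """
    votes = Counter()
    for _, tags in nearest_neighbors(query, database, k):
        votes.update(set(tags))  # each neighbor votes once per tag
    # Sort by vote count (descending), then alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(suggest_tags((0.05, 0.05), DATABASE))
```

A plain vote of this kind is sensitive to noisy and sparse user tags, which is the limitation that motivates a more careful tag relevance estimate combining textual and visual clues.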