Immersive virtual reality systems typically require both vivid 3D models and a sound semantic understanding of the scenes they depict. The limitations of optimizing image segmentation and image-based 3D modeling separately have become increasingly apparent, and many methods for combining the two tasks have recently been proposed. In this paper, we propose a new hybrid framework that generates semantic dense 3D models from monocular images. Building on the existing hierarchical CRF model, we make full use of the correlation between voxels and their corresponding pixels across different images, so that valuable information from 3D space can be incorporated as an additional energy term in the model. Pixels, segments, and voxels are all treated as nodes in the single large graph we build. Our ultimate goal is a joint optimization of dense 3D reconstruction and image segmentation. Experiments on four challenging real-world datasets demonstrate the effectiveness of the proposed hybrid framework.
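The joint objective outlined above can be sketched as a hierarchical CRF energy over pixel, segment, and voxel nodes. The notation below is illustrative only (the abstract gives no formulas): $\mathcal{P}$, $\mathcal{S}$, and $\mathcal{V}$ denote the hypothetical sets of pixel, segment, and voxel nodes, and the final cross-domain term couples each pixel to its corresponding voxel:

```latex
E(\mathbf{x}) =
  \sum_{i \in \mathcal{P}} \psi_i(x_i)
+ \sum_{(i,j) \in \mathcal{N}_{\mathcal{P}}} \psi_{ij}(x_i, x_j)
+ \sum_{s \in \mathcal{S}} \psi_s(\mathbf{x}_s)
+ \sum_{v \in \mathcal{V}} \psi_v(x_v)
+ \sum_{(i,v) \in \mathcal{C}} \psi_{iv}(x_i, x_v)
```

Here $\psi_i$ and $\psi_{ij}$ are the usual unary and pairwise pixel potentials, $\psi_s$ is a higher-order segment consistency term, $\psi_v$ scores voxel occupancy and label, and $\psi_{iv}$ is the assumed pixel–voxel coupling that lets 3D evidence influence 2D segmentation (and vice versa) during joint inference.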