The writing-in-the-air (WIA) system provides a novel input experience that uses the fingertip as a virtual pen, based on the color and depth information from a single Kinect camera. We present a new fingertip detection and tracking framework for robust, real-time fingertip position estimation, which further improves air-writing character recognition accuracy. First, we propose a new physical constraint and an adaptive threshold with mode temporal consistency to classify various hand poses into two modes, the side mode and the frontal mode. In the side mode, a new choose-to-trust algorithm (CTTA) is proposed for hand segmentation: the final segmentation is generated by selecting the more trustworthy of the color-based and depth-based segmentation results according to the fingertip-palm relationship. In the frontal mode, we estimate the fingertip position with a joint detection-tracking algorithm that incorporates both temporal and physical constraints. Using three new features defined by this joint detection-tracking algorithm, the fingertip position is determined by a multi-objective optimization strategy. We have collected two large fingertip-writing data sets of different difficulty levels. In experiments on both data sets, the proposed framework achieves the best fingertip position estimation accuracy compared with four popular methods. More importantly, the final character recognition rate increases significantly and reaches 100% within the top five candidates for all types of characters.
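The choose-to-trust idea in the side mode can be sketched as selecting between two candidate segmentations by scoring their fingertip-palm geometry. The sketch below is purely illustrative: the plausibility scoring, the distance band, and the result format are assumptions, not the paper's actual CTTA implementation.

```python
# Hypothetical sketch of a choose-to-trust selection between a color-based
# and a depth-based hand segmentation result. The fingertip-palm plausibility
# score is an assumption: here it simply checks that the fingertip lies within
# an expected distance band from the palm center (band limits in pixels are
# illustrative, not taken from the paper).

import math

def plausibility(fingertip, palm_center, expected_range=(40.0, 120.0)):
    """Score one segmentation by its fingertip-palm distance.

    Returns 1.0 when the distance falls inside the expected finger-length
    band, decaying linearly toward 0.0 outside it.
    """
    d = math.dist(fingertip, palm_center)
    lo, hi = expected_range
    if lo <= d <= hi:
        return 1.0
    gap = (lo - d) if d < lo else (d - hi)  # how far outside the band
    return max(0.0, 1.0 - gap / hi)

def choose_to_trust(color_result, depth_result):
    """Pick the segmentation whose fingertip-palm relationship is more
    plausible; each result is a dict with 'fingertip' and 'palm' points."""
    c = plausibility(color_result["fingertip"], color_result["palm"])
    d = plausibility(depth_result["fingertip"], depth_result["palm"])
    return ("color", color_result) if c >= d else ("depth", depth_result)
```

In this toy form, a depth segmentation whose estimated fingertip collapses onto the palm (a common depth-noise failure) scores poorly and the color-based result is trusted instead; the real algorithm's criterion is the fingertip-palm relationship described in the paper.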