Most research in multi-modal affect detection has been conducted in laboratory environments; little work has addressed in situ affect detection. In this paper, we investigate affect detection in natural environments using sensors available in smartphones. We classify a person's affective state from facial expression and energy expenditure, continuously capturing fine-grained accelerometer data to estimate energy expenditure and camera images to capture facial expression, and we measure the performance of the resulting system. We have deployed the system in natural environments and have paid special attention to annotating the training data in order to validate the 'ground truth'. We have found an important correlation between facial images and energy expenditure, which validates Russell's two-dimensional theory of emotion based on the arousal and valence space. We present initial findings in multi-modal affect detection.