This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued for the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables, which effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference algorithm for HiRF that, in each iteration, uses linear programming to estimate the latent variables. HiRF parameters are learned within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization compared to the state of the art.