This paper tackles the challenge of interactively retrieving visual scenes within surveillance sequences acquired with a fixed camera. In contrast to existing solutions, we assume that no a priori knowledge is available, so the system must progressively learn the target scenes through interactive labelling of a few frames by the user. The proposed method is based on very low-cost feature extraction and integrates relevance feedback, multiple-instance SVM classification and active learning. Each of these three steps runs iteratively over the session and takes advantage of the progressively growing training set. Repeatable experiments on both simulated and real data demonstrate the efficiency of the approach and show that it achieves high retrieval performance.