With the widespread use of monitoring systems, the volume of surveillance video is growing rapidly. Browsing such videos sequentially from a database is time-consuming and tedious for users, and it fails to take full advantage of the rich information contained in video data. In this paper, a general framework for semantic video mining and retrieval is proposed. The framework detects and retrieves semantic events from surveillance videos. It starts by tracking semantic objects in the videos and modeling their trajectories. Next, general semantic events of interest to users are modeled. The goal is to retrieve these semantic events by analyzing the spatiotemporal trajectory sequences. However, since individual users may have their own subjective query targets, these event models may be too general to capture each user's subjectivity. Therefore, the mining and retrieval phase is designed to learn the user's interest dynamically by interacting with the user. This technique, called Relevance Feedback (RF), is commonly used in content-based image retrieval but is seldom applied to semantic video mining. Because of the spatiotemporal nature of video events, substantial extensions to RF, especially to its associated learning mechanisms, are needed to apply it to semantic video mining. The learning framework proposed in this paper is structured on a neural network for time-series data, a model usually adopted for prediction, which we tailor to the specific needs of spatiotemporal video event mining. Transportation surveillance videos are used to demonstrate the design details.