We are interested in descriptors for dynamic textures that support both further analysis and resynthesis applications. These tasks demand that the description encode appearance and motion separably. This paper shows that a tree hierarchy built from nested image regions can be acquired by analysing the first few frames of a video sequence. The tree is stable over space and time and yields measurably improved performance on the generic application of tracking the dynamic texture itself. This claim is supported by experimental data from a range of dynamic textures, including trees and flowers. We conclude that both appearance and motion are better described by the stable structure than by a sequence of equivalent hierarchies, each optimised for a single frame, because motion data helps to reduce the clutter artefacts found in trees built from static images.