The Shiraz project included an evaluation component: two ‘glass-box’ evaluations were performed during the project, as well as a black-box evaluation at its end. The evaluations were based on a bilingual tagged test corpus of 3,000 sentences, and evaluation tools were developed to automate the evaluation process. The glass-box evaluations covered individual components of the MT system, in particular the Persian morphological analyzer, the dictionary, and the parser. The evaluation of the translations themselves (the black-box evaluation) was performed manually on a subset of the test corpus. This paper outlines the problems encountered in trying to use these evaluations both for development and testing purposes and as traditional ‘off-line’ evaluations.