Link prediction is a popular research area: papers proposing new methods appear in virtually every conference on data mining or network science. We argue that the practical performance potential of these methods is generally unknown because of challenges endemic to evaluation in many link prediction contexts. We demonstrate that current evaluation methods are inadequate and can lead to woefully errant conclusions about practical performance potential. Because the link prediction classification problem is extremely imbalanced, we argue for the use of precision-recall threshold curves and their associated areas in lieu of receiver operating characteristic curves. We provide empirical examples of how current methods lead to questionable conclusions, show how the methods we propose illuminate the fallacy of these conclusions, and suggest a fair and consistent framework for link prediction evaluation on longitudinal and non-longitudinal network data sets.
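
To make the imbalance argument concrete, the following minimal sketch compares the two summaries on a synthetic ranking. Everything in it is an assumption chosen for illustration and not taken from our experiments: the 10,000 candidate node pairs, the 10 true links placed at ranks 100, 200, ..., 1000, and the helper names `roc_auc` and `average_precision`. Average precision is used here as a simple step-wise stand-in for the area under a precision-recall curve.

```python
# Hypothetical ranking: 10,000 candidate pairs sorted by descending score,
# with 10 true links (label 1) at ranks 100, 200, ..., 1000 (assumed data).
labels = [0] * 10_000
for k in range(1, 11):
    labels[100 * k - 1] = 1


def roc_auc(ranked_labels):
    """Area under the ROC curve for a tie-free ranking.

    Equals the fraction of (positive, negative) pairs in which the
    positive is ranked above the negative.
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    neg_seen = 0
    correct_pairs = 0
    for y in ranked_labels:
        if y == 1:
            # Every negative not yet seen is ranked below this positive.
            correct_pairs += n_neg - neg_seen
        else:
            neg_seen += 1
    return correct_pairs / (n_pos * n_neg)


def average_precision(ranked_labels):
    """Mean of the precision values measured at the rank of each positive."""
    true_pos = 0
    total = 0.0
    for rank, y in enumerate(ranked_labels, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank
    return total / true_pos


roc = roc_auc(labels)
ap = average_precision(labels)
print(f"ROC AUC:           {roc:.3f}")  # 0.945
print(f"Average precision: {ap:.3f}")   # 0.010
```

In this contrived case the ROC area looks strong, yet at every recall level only about one in a hundred predicted links is real. The precision-based summary exposes that gap, although 0.010 is still ten times the 0.001 expected from a random ranking at this class ratio.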