Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although there exist several checkpoint/restart implementations for CPUs, only a handful have been proposed for GPUs even though more than 60 supercomputers in the TOP 500 list are heterogeneous CPU-GPU systems.In this paper, we propose a scalable application-level checkpoint/restart scheme, called cudaCR for long-running kernels on NVIDIA GPUs. Our proposed scheme is able to capture GPU state inside the kernel and roll back to the previous state within the same kernel, unlike state-of-the-art approaches. We evaluate cudaCR on application benchmarks with different characteristics such as dense matrix multiply, stencil computation, and k-means clustering on a Tesla K40 GPU. We observe that cudaCR can fully restore state with low overheads in both time (less than 10% in best case) and memory requirements after applying a number of different optimizations (storage gain 54% for dense matrix multiply, 31% for k-means, and 4% for stencil computation). Looking forward, we identify new optimizations to further reduce the overhead to make cudaCR highly scalable.