High-speed data transfer is vital to data-intensive computing, which often requires moving large data volumes efficiently both within a local data center and among geographically dispersed facilities. Effectively utilizing the abundant resources of modern multicore environments for data transfer remains a persistent challenge, particularly for Non-Uniform Memory Access (NUMA) systems, in which the locality of data access is an important factor. Meeting this challenge requires rethinking how to exploit parallel access to data and how to optimize storage and network I/O. We address this challenge and present a novel design of asynchronous processing and resource-aware task scheduling in the context of high-throughput data replication. Our software allocates multiple sets of threads to the different stages of the processing pipeline, including storage I/O and network communication, based on their capacities. The threads in each stage follow an asynchronous model and attain high performance via multiple locality-aware and peer-aware mechanisms, such as task grouping, buffer sharing, affinity control, and communication protocols. Our design also integrates high-performance features that enhance the scalability of data transfer in several scenarios, e.g., file-level sorting, block-level asynchrony, and thread-level pipelining. Our experiments confirm the advantages of our software under different types of workloads and in dynamic environments with contention for shared resources, including a 28-160 percent increase in bandwidth for transferring large files, a 1.7-66 times speed-up for small files, and up to 108 percent higher throughput for mixed workloads compared with three state-of-the-art alternatives, <italic>GridFTP</italic>, <italic>BBCP</italic>, and <italic>Aspera</italic>.
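
To make the idea of capacity-based thread allocation concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the implementation evaluated in this work: the stage names (`storage_read`, `network_send`), the per-thread rates, the greedy balancing rule, and the functions `allocate_threads` and `run_pipeline` are all hypothetical. The sketch gives each stage one thread, hands every remaining thread to the stage with the lowest aggregate rate, and then runs a two-stage pipeline connected by a bounded queue. It does not model NUMA affinity, buffer sharing, or real I/O.

```python
import queue
import threading


def allocate_threads(per_thread_rate, total_threads):
    """Split a thread budget across pipeline stages (hypothetical policy).

    per_thread_rate maps a stage name to the assumed throughput of one
    thread in that stage. Each stage starts with one thread; every extra
    thread goes to the stage whose aggregate rate is currently lowest.
    """
    if total_threads < len(per_thread_rate):
        raise ValueError("need at least one thread per stage")
    alloc = {stage: 1 for stage in per_thread_rate}
    for _ in range(total_threads - len(alloc)):
        bottleneck = min(alloc, key=lambda s: alloc[s] * per_thread_rate[s])
        alloc[bottleneck] += 1
    return alloc


def run_pipeline(blocks, alloc):
    """Move blocks through a read stage and a send stage.

    Both stages are stand-ins: 'reading' takes a block from a work queue
    and 'sending' appends it to a result list. Blocks must not be None,
    because None is used as the shutdown signal for sender threads.
    """
    work = queue.Queue()
    for block in blocks:
        work.put(block)
    handoff = queue.Queue(maxsize=8)  # bounded buffer between the stages
    sent = []
    sent_lock = threading.Lock()

    def reader():
        while True:
            try:
                block = work.get_nowait()
            except queue.Empty:
                return  # no work left for this reader
            handoff.put(block)  # blocks while the buffer is full

    def sender():
        while True:
            block = handoff.get()
            if block is None:
                return  # shutdown signal
            with sent_lock:
                sent.append(block)

    readers = [threading.Thread(target=reader)
               for _ in range(alloc["storage_read"])]
    senders = [threading.Thread(target=sender)
               for _ in range(alloc["network_send"])]
    for t in readers + senders:
        t.start()
    for t in readers:
        t.join()
    for _ in senders:
        handoff.put(None)  # one shutdown signal per sender
    for t in senders:
        t.join()
    return sent


if __name__ == "__main__":
    # Assumed rates: one reader thread is twice as fast as one sender thread.
    rates = {"storage_read": 2.0, "network_send": 1.0}
    plan = allocate_threads(rates, total_threads=6)
    delivered = run_pipeline(list(range(100)), plan)
    print(plan, len(delivered))
```

With the assumed rates above and a budget of six threads, the greedy rule assigns two threads to `storage_read` and four to `network_send`, so both stages reach the same aggregate rate. The bounded queue keeps the faster stage from running arbitrarily far ahead of the slower one.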