Generation and Validation of Empirically-Derived TCP Application Workloads

Abstract

FÉLIX HERNÁNDEZ-CAMPOS: Generation and Validation of Empirically-Derived TCP Application Workloads. (Under the direction of Kevin Jeffay)

This dissertation proposes and evaluates a new approach for generating realistic traffic in networking experiments. The main problem solved by our approach is generating closed-loop traffic consistent with the behavior of the entire set of applications in modern traffic mixes. Unlike earlier approaches, which described individual applications in terms of the specific semantics of each application, we describe the source behavior driving each connection in a generic manner using the a-b-t model. This model provides an intuitive but detailed way of describing source behavior in terms of connection vectors that capture the sizes and ordering of application data units, the quiet times between them, and whether data exchange is sequential or concurrent. This is consistent with the view of traffic from TCP, which does not concern itself with application semantics. The a-b-t model also satisfies a crucial property: given a packet header trace collected from an arbitrary Internet link, we can algorithmically infer the source-level behavior driving each connection, and cast it into the notation of the model. The result of packet header processing is a collection of a-b-t connection vectors, which can then be replayed in software simulators and testbed experiments to drive network stacks. Such a replay generates synthetic traffic that fully preserves the feedback loop between the TCP endpoints and the state of the network, which is essential in experiments where network congestion can occur. By construction, this type of traffic generation is fully reproducible, providing a solid foundation for comparative empirical studies. Our experimental work demonstrates the high quality of the generated traffic, by directly comparing traces from real Internet links and their source-level trace replays for a rich set of metrics.
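As an illustration of the connection-vector idea described above, the following Python sketch shows one possible in-memory representation of a connection. The class and field names (`Epoch`, `ConnectionVector`, `a`, `ta`, `b`, `tb`, `concurrent`) and all numeric values are assumptions made for this example; they are not notation or measurements taken from the dissertation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Epoch:
    """One exchange of application data units (hypothetical layout)."""
    a: int      # bytes in the ADU sent by the connection initiator
    ta: float   # quiet time in seconds after the initiator's ADU
    b: int      # bytes in the ADU sent by the connection acceptor
    tb: float   # quiet time in seconds after the acceptor's ADU


@dataclass
class ConnectionVector:
    """Source-level description of one TCP connection (illustrative only)."""
    epochs: List[Epoch] = field(default_factory=list)
    # Assumed flag: True if the endpoints exchange data concurrently
    # rather than taking turns.
    concurrent: bool = False

    def total_bytes(self) -> Tuple[int, int]:
        """Return (bytes sent by initiator, bytes sent by acceptor)."""
        return (sum(e.a for e in self.epochs),
                sum(e.b for e in self.epochs))


# Invented numbers, for illustration only: two request/response exchanges
# separated by a 3-second quiet time, loosely resembling a persistent
# HTTP connection.
http_like = ConnectionVector(epochs=[
    Epoch(a=329, ta=0.12, b=403, tb=3.0),
    Epoch(a=403, ta=0.12, b=25821, tb=0.0),
])

print(http_like.total_bytes())
```

A description of this kind carries no application-specific semantics: only sizes, ordering, and quiet times appear, which is what allows the same structure to describe any TCP application.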
Comparing traces and their replays in this way requires the careful measurement of network parameters for each connection, and their reproduction together with the corresponding source behavior. Our final contribution consists of two resampling methods for introducing controlled variability in network experiments and for generating closed-loop traffic that accurately matches a target offered load.

ACKNOWLEDGMENTS

First of all, I must thank Kevin Jeffay and Don Smith for their guidance and encouragement throughout my doctoral program. Their patience and friendship have been invaluable all these years. I also thank them, together with other faculty and student members of the Distributed and Real-Time Systems group (DiRT), for building a phenomenal infrastructure for Internet measurement and experimental networking research. DiRT students have greatly contributed to my doctoral experience, most especially Jay Aikat and David Ott. My committee members and other collaborators have contributed tremendously to my efforts. I am especially indebted to Steve Marron and Andrew Nobel, who have greatly enriched the statistical side of my work. In this regard, being part of SAMSI's "Network Modeling for the Internet" program and of the interdisciplinary Internet study group at UNC gave me superb opportunities to widen my understanding of Internet research. I must also thank UNC's Department of Computer Science as a whole, including faculty, students and staff, for creating an outstanding research and teaching environment. Overall, my years at UNC were an incredibly positive experience. I thank the National Science Foundation, IBM, Cisco, Intel, Sun Microsystems and others for supporting this work. I am especially grateful to the Computer Measurement Group (CMG) for their doctoral fellowship. Finally, I thank my family for their support. Their constant example of hard work, and their respect for intellectual endeavors, have motivated me during my entire life.
My wife's help with the editing of this manuscript was invaluable, as was her constant encouragement during my graduate studies. More than anybody else, my parents gave me my passion for knowledge, and it is to them that I dedicate this doctoral dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 Introduction
    1.1 Abstract Source-Level Modeling
    1.2 Source-Level Trace Replay
    1.3 Trace Resampling and Load Scaling
    1.4 Thesis Statement
    1.5 Contributions
    1.6 Overview
2 Related Work
    2.1 Packet-Level Traffic Generation
    2.2 Source-Level Traffic Generation
        2.2.1 Web Traffic Modeling
        2.2.2 Non-Web Traffic Source-level Modeling
        2.2.3 Beyond Single Application Modeling
    2.3 Scaling Offered Load
    2.4 Implementing Traffic Generation
    2.5 Summary
3 Abstract Source-level Modeling
    3.1 The Sequential a-b-t Model
        3.1.1 Client/Server Applications
        3.1.2 Beyond Client/Server Applications
    3.2 The Concurrent a-b-t Model
    3.3 Abstract Source-Level Measurement
        3.3.1 From TCP Sequence Numbers to Application Data Units
        3.3.2 Logical Order of Data Segments
        3.3.3 Data Analysis Algorithm
    3.4 Validation using Synthetic Applications
    3.5 Analysis Results
        3.5.1 Variability Across Sites
        3.5.2 Time-of-Day Variability and Workload Directionality
    3.6 Summary
4 Network-Level Parameters and Metrics
    4.1 Network-level Parameters
        4.1.1 Round-Trip Time
        4.1.2 Receiver Window Size
        4.1.3 Loss Rate
    4.2 Network-level Metrics
        4.2.1 Aggregate Throughput Time Series
        4.2.2 Throughput Marginals
        4.2.3 Throughput Self-Similarity and Long-Range Dependence
        4.2.4 Time Series of Active Connections
    4.3 Summary
5 Generating Traffic
    5.1 Replaying Traces at the Source-Level
        5.1.1 Trace Partitioning
        5.1.2 Conducting Experiments
        5.1.3 Data Collection
    5.2 Validation of Source-level Trace Replay
        5.2.1 Leipzig-II
        5.2.2 UNC 1 PM
        5.2.3 Abilene-I
    5.3 Summary
6 Reproducing Traffic
    6.1 Beyond Comparing Connection Vectors
    6.2 Source-level Replay of Leipzig-II
        6.2.1 Time Series of Byte Throughput
        6.2.2 Time Series of Packet Throughput
        6.2.3 Marginal Distributions
        6.2.4 Long-Range Dependence
        6.2.5 Time Series of Active Connections
    6.3 Source-level Replay of UNC 1 PM
        6.3.1 Time Series of Byte Throughput
        6.3.2 Time Series of Packet Throughput
        6.3.3 Marginal Distributions
        6.3.4 Long-Range Dependence
        6.3.5 Time Series of Active Connections
    6.4 Mid-Chapter Review
        6.4.1 Observations on Byte Throughput
        6.4.2 Observations on Packet Throughput
        6.4.3 Observations on Active Connections
    6.5 Source-level Replay of UNC 1 AM
        6.5.1 Time Series of Byte Throughput
        6.5.2 Time Series of Packet Throughput
        6.5.3 Marginal Distributions
        6.5.4 Long-Range Dependence
        6.5.5 Time Series of Active Connections
    6.6 Source-level Replay of UNC 7:30 PM
        6.6.1 Time Series of Byte Throughput
        6.6.2 Time Series of Packet Throughput
        6.6.3 Marginal Distributions
        6.6.4 Long-Range Dependence
        6.6.5 Time Series of Active Connections
    6.7 Source-level Replay of Abilene-I
        6.7.1 Time Series of Byte Throughput
        6.7.2 Time Series of Packet Throughput
        6.7.3 Marginal Distributions
        6.7.4 Long-Range Dependence
        6.7.5 Time Series of Active Connections
    6.8 Summary
7 Trace Resampling and Load Scaling
    7.1 Poisson Resampling
        7.1.1 Basic Poisson Resampling
        7.1.2 Byte-Driven Poisson Resampling
    7.2 Block Resampling
    7.3 Summary
8 Conclusions and Future Work
    8.1 Empirical Modeling of Traffic Mixes
    8.2 Refining and Extending our Modeling
    8.3 Assessing Realism in Synthetic Traffic
    8.4 Incorporating Additional Network-Level Parameters
    8.5 Flexible Traffic Generation
BIBLIOGRAPHY

LIST OF TABLES

3.1 Breakdown of the TCP connections found in five traces.
4.1 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of five traces.
4.2 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of five traces.
6.1 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of Leipzig-II and its four types of source-level trace replay.
6.2 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of Leipzig-II and its four types of source-level trace replay.
6.3 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of UNC 1 PM and its four types of source-level trace replay.
6.4 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of UNC 1 PM and its four types of source-level trace replay.
6.5 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of UNC 1 AM and its four types of source-level trace replay.
6.6 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of UNC 1 AM and its four types of source-level trace replay.
6.7 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of UNC 7:30 PM and its four types of source-level trace replay.
6.8 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of UNC 7:30 PM and its four types of source-level trace replay.
6.9 Estimated Hurst parameters and their confidence intervals for the byte throughput time series of Abilene-I and its four types of source-level trace replay.
6.10 Estimated Hurst parameters and their confidence intervals for the packet throughput time series of Abilene-I and its four types of source-level trace replay.
7.1 Estimated Hurst parameters and their confidence intervals for the connection arrival time series of UNC 1 PM and UNC 1 AM, and their Poisson arrival fits.
7.2 Estimated Hurst parameters and their confidence intervals for five subsamplings obtained from the connection arrival time series of UNC 1 PM and UNC 1 AM.

LIST OF FIGURES

1.1 Network traffic seen from different levels.
1.2 An a-b-t diagram illustrating a persistent HTTP connection.
1.3 A diagram illustrating the interaction between two BitTorrent peers.
1.4 Overview of Source-level Trace Replay.
3.1 An a-b-t diagram representing a typical ADU exchange in HTTP version 1.0.
3.2 An a-b-t diagram illustrating a persistent HTTP connection.
3.3 An a-b-t diagram illustrating an SMTP connection.
3.4 Three a-b-t diagrams representing three different types of NNTP interactions.
3.5 An a-b-t diagram illustrating a server push from a webcam using a persistent HTTP connection.
3.6 An a-b-t diagram illustrating Icecast audio streaming in a TCP connection.
3.7 Three a-b-t diagrams of connections taking part in the interaction between an FTP client and an FTP server.
3.8 An a-b-t diagram illustrating an NNTP connection in "stream-mode", which exhibits data exchange concurrency.
3.9 An a-b-t diagram illustrating the interaction between two BitTorrent peers.
3.10 A first set of TCP segments for the connection vector in Figure 3.1: lossless example.
3.11 A second set of TCP segments for the connection vector in Figure 3.1: lossy example.
3.12 Distributions of ADU sizes for the testbed experiments with synthetic applications.
3.13 Distributions of quiet time durations for the testbed experiments with synthetic applications.
3.14 Distributions of ADU sizes for the testbed experiments with synthetic applications.
3.15 Distributions of quiet time durations for the testbed experiments with synthetic applications.
3.16 Bodies of the A and B distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.17 Tails of the A and B distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.18 Bodies of the A and B distributions with per-byte probabilities for Abilene-I, Leipzig-II and UNC 1 PM.
3.19 Bodies of the E distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.20 Bodies of the E distributions with per-byte probabilities for Abilene-I, Leipzig-II and UNC 1 PM.
3.21 Tails of the E distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.22 Average size of the epochs in each connection vector as a function of the number of epochs for Abilene-I, Leipzig-II and UNC 1 PM.
3.23 Average of the median size of the ADUs in each connection vector as a function of the number of epochs for Abilene-I, Leipzig-II and UNC 1 PM.
3.24 Average of the median size of the ADUs in each connection vector as a function of the number of epochs, for Leipzig-II.
3.25 Average of the median size of the ADUs in each connection vector as a function of the number of epochs for Abilene-I.
3.26 Bodies of the TA and TB distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.27 Tails of the TA and TB distributions for Abilene-I, Leipzig-II and UNC 1 PM.
3.28 Distribution of the durations of the quiet times between the final ADU and connection termination.
3.29 Bodies of the A and B distributions for the concurrent connections in Abilene-I, Leipzig-II and UNC 1 PM.
3.30 Tails of the A and B distributions for the concurrent connections in Abilene-I, Leipzig-II and UNC 1 PM.
3.31 Bodies of the TA and TB distributions for the concurrent connections in Abilene-I, Leipzig-II and UNC 1 PM.
3.32 Tails of the TA and TB distributions for the concurrent connections in Abilene-I, Leipzig-II and UNC 1 PM.
3.33 Bodies of the A distributions for UNC 1 AM, UNC 1 PM and UNC 7:30 PM.
3.34 Bodies of the B distributions for UNC 1 AM, UNC 1 PM and UNC 7:30 PM.
3.35 Bodies of the TB distributions for UNC 1 AM, UNC 1 PM and UNC 7:30 PM.
3.36 Tails of the TB distributions for UNC 1 AM, UNC 1 PM and UNC 7:30 PM.
3.37 Bodies of the TA distributions for three UNC traces.
3.38 Tails of the TA distributions for three UNC traces.
4.1 A set of TCP segments illustrating RTT estimation from connection establishment.
4.2 Two sets of TCP segments illustrating RTT estimation ambiguities in the presence of loss and early retransmission in connection establishment.
4.3 A set of TCP segments illustrating RTT estimation using the sum of two OSTTs.
4.4 A set of TCP segments illustrating the impact of delayed acknowledgments on OSTTs.
4.5 Comparison of RTT estimators for a synthetic trace: no loss and enabled delayed acknowledgments.
4.6 Comparison of RTT estimators for a synthetic trace: no loss and disabled delayed acknowledgments.
4.7 Comparison of RTT estimators for a synthetic trace: fixed loss rate of 1% for all connections.
4.8 Comparison of RTT estimators for a synthetic trace: loss rates uniformly distributed between 0% and 10%.
4.9 A set of TCP segments illustrating an invalid OSTT sample due to the interaction between loss and cumulative acknowledgments.
4.10 Comparison of RTT estimators for a synthetic trace: loss rates uniformly distributed between 0% and 10%.
4.11 Comparison of RTT estimators for synthetic traces: fixed loss rate of 1%; real RTTs up to 4 seconds.
4.12 Bodies of the RTT distributions for the five traces.
4.13 Bodies of the RTT distributions with per-byte probabilities for the five traces.
4.14 Comparison of the sum-of-minima and sum-of-medians RTT estimators for UNC 1 PM.
4.15 Comparison of the sum-of-minima and sum-of-medians RTT estimators for Leipzig-II.
4.16 Bodies of the distributions of maximum receiver window sizes for the five traces.
4.17 Bodies of the distributions of maximum receiver window sizes with per-byte probabilities for the five traces.
4.18 Measured loss rates from experiments with 1% loss rates applied only on one direction or on both directions of the TCP connections.
4.19 Bodies of the distributions of loss rates for the five traces.
4.20 Bodies of the distributions of loss rates with per-byte probabilities for the five traces.
4.21 Breakdown of the byte throughput time series for Leipzig-II inbound.
4.22 Breakdown of the packet throughput time series for Leipzig-II inbound.
4.23 Breakdown of the byte throughput time series for Leipzig-II outbound.
4.24 Breakdown of the packet throughput time series for Leipzig-II outbound.
4.25 Breakdown of the byte throughput time series for Leipzig-II outbound.
4.26 Breakdown of the packet throughput time series for Leipzig-II outbound.
4.27 Breakdown of the byte throughput time series for Abilene-I Ipls/Clev.
4.28 Breakdown of the packet throughput time series for Abilene-I Ipls/Clev.
4.29 Breakdown of the byte throughput time series for Abilene-I Clev/Ipls.
4.30 Breakdown of the packet throughput time series for Abilene-I Clev/Ipls.
4.31 Breakdown of the byte throughput time series for UNC 1 PM inbound.
4.32 Breakdown of the packet throughput time series for UNC 1 PM inbound.
4.33 Breakdown of the byte throughput time series for UNC 1 PM outbound.
4.34 Breakdown of the packet throughput time series for UNC 1 PM outbound.
4.35 Breakdown of the byte throughput time series for the three UNC traces.
4.36 Breakdown of the packet throughput time series for the three UNC traces.
4.37 Byte throughput marginals of Leipzig-II inbound, its normal distribution fit, the marginal distribution of its Poisson arrival fit, and the normal distribution fit of this Poisson arrival fit.
4.38 Packet throughput marginals of Leipzig-II inbound, its normal distribution fit, the marginal distribution of its Poisson arrival fit, and the normal distribution fit of this Poisson arrival fit.
4.39 Byte throughput marginals of UNC 1 PM outbound, its normal distribution fit, the marginal distribution of its Poisson arrival fit, and the normal distribution fit of this Poisson arrival fit.
4.40 Packet throughput marginals of UNC 1 PM outbound, its normal distribution fit, the marginal distribution of its Poisson arrival fit, and the normal distribution fit of this Poisson arrival fit.
4.41 Quantile-quantile plots with simulation envelopes for the marginal distribution of Leipzig-II inbound. The top four plots show byte throughput, while the four bottom plots show packet throughput.
4.42 Quantile-quantile plots with simulation envelopes for the marginal distribution of UNC 1 PM outbound. The top four plots show byte throughput, while the four bottom plots show packet throughput.
4.43 Wavelet spectra of the packet throughput time series for Leipzig-II inbound and its Poisson arrival fit.
4.44 Wavelet spectra of the byte throughput time series for Leipzig-II inbound and its Poisson arrival fit.
4.45 Wavelet spectra of the packet throughput time series for Abilene-I.
4.46 Wavelet spectra of the byte throughput time series for Abilene-I.
4.47 Wavelet spectra of the packet throughput time series for UNC 1 PM.
4.48 Wavelet spectra of the byte throughput time series for UNC 1 PM.
4.49 Breakdown of the active connections time series for Leipzig-II.
4.50 Impact of the definition of active connection on Leipzig-II.
4.51 Breakdown of the active connections time series for Abilene-I.
4.52 Impact of the definition of active connection on Abilene-I.
4.53 Breakdown of active connections time series for UNC 1 PM using both definitions of active connection.
4.54 Impact of the time-of-day on the active connections time series for the three UNC traces.
5.1 Overview of Source-level Trace Replay.
5.2 Diagram of the network testbed where the experiments of this dissertation were conducted.
5.3 End-host architecture of the traffic generation system.
5.4 Bodies and tails of the A distributions for Leipzig-II and its source-level trace replays.
5.5 Bodies and tails of the B distributions for Leipzig-II and its source-level trace replays.
5.6 Bodies and tails of the E distributions for Leipzig-II and its source-level trace replays.
5.7 Bodies and tails of the TA distributions for Leipzig-II and its source-level trace replays.
5.8 Bodies and tails of the TB distributions for Leipzig-II and its source-level trace replays.
5.9 Bodies of the round-trip time and receiver window size distributions for Leipzig-II and its source-level trace replays.
5.10 Bodies of the loss rate distributions for Leipzig-II and its source-level trace replays, with probabilities computed per connection (left) and per byte (right).
5.11 Bodies and tails of the A distributions for UNC 1 PM and its source-level trace replays.
5.12 Bodies and tails of the B distributions for UNC 1 PM and its source-level trace replays.
5.13 Bodies and tails of the E distributions for UNC 1 PM and its source-level trace replays.
5.14 Bodies and tails of the TA distributions for UNC 1 PM and its source-level trace replays.
5.15 Bodies and tails of the TB distributions for UNC 1 PM and its source-level trace replays.
5.16 Bodies of the round-trip time and receiver window size distributions for UNC 1 PM and its source-level trace replays.
5.17 Bodies of the loss rate distributions for UNC 1 PM and its source-level trace replays, with probabilities computed per connection (left) and per byte (right).
5.18 Bodies and tails of the A distributions for Abilene-I and its source-level trace replays.
5.19 Bodies and tails of the B distributions for Abilene-I and its source-level trace replays.
5.20 Bodies and tails of the E distributions for Abilene-I and its source-level trace replays.
5.21 Bodies and tails of the TA distributions for Abilene-I and its source-level trace replays.
5.22 Bodies and tails of the TB distributions for Abilene-I and its source-level trace replays.
5.23 Bodies of the round-trip time and receiver window size distributions for Abilene-I and its source-level trace replays.
5.24 Bodies of the loss rate distributions for Abilene-I and its source-level trace replays, with probabilities computed per connection (left) and per byte (right).
6.1 Byte throughput time series for Leipzig-II inbound and its four types of source-level trace replay.
6.2 Byte throughput time series for Leipzig-II outbound and its four types of source-level trace replay.
6.3 Packet throughput time series for Leipzig-II inbound and its four types of source-level trace replay.
6.4 Packet throughput time series for Leipzig-II outbound and its four types of source-level trace replay.
6.5 Byte throughput marginals for Leipzig-II inbound and its four types of source-level trace replay.
6.6 Byte throughput marginals for Leipzig-II outbound and its four types of source-level trace replay.
6.7 Packet throughput marginals for Leipzig-II inbound and its four types of source-level trace replay.
6.8 Packet throughput marginals for Leipzig-II outbound and its four types of source-level trace replay.
6.9 Wavelet spectra of the byte throughput time series for Leipzig-II inbound and its four types of source-level trace replay.
6.10 Wavelet spectra of the byte throughput time series for Leipzig-II outbound and its four types of source-level trace replay.
6.11 Wavelet spectra of the packet throughput time series for Leipzig-II inbound and its four types of source-level trace replay.
6.12 Wavelet spectra of the packet throughput time series for Leipzig-II outbound and its four types of source-level trace replay.
6.13 Active connection time series for Leipzig-II and its four types of source-level trace replay.
6.14 Byte throughput time series for UNC 1 PM inbound and its four types of source-level trace replay.
6.15 Byte throughput time series for UNC 1 PM outbound and its four types of source-level trace replay.
6.16 Packet throughput time series for UNC 1 PM inbound and its four types of source-level trace replay.
6.17 Packet throughput time series for UNC 1 PM outbound and its four types of source-level trace replay.
6.18 Byte throughput marginals for UNC 1 PM inbound and its four types of source-level trace replay.
6.19 Byte throughput marginals for UNC 1 PM outbound and its four types of source-level trace replay.
6.20 Packet throughput marginals for UNC 1 PM inbound and its four types of source-level trace replay.
6.21 Packet throughput marginals for UNC 1 PM outbound and its four types of source-level trace replay.
6.22 Wavelet spectra of the byte throughput time series for UNC 1 PM inbound and its four types of source-level trace replay.
6.23 Wavelet spectra of the byte throughput time series for UNC 1 PM outbound and its four types of source-level trace replay.
6.24 Wavelet spectra of the packet throughput time series for UNC 1 PM inbound and its four types of source-level trace replay.
6.25 Wavelet spectra of the packet throughput time series for UNC 1 PM outbound and its four types of source-level trace replay.
6.26 Active connection time series for UNC 1 PM and its four types of source-level trace replay.
6.27 Byte throughput time series for UNC 1 AM inbound and its four types of source-level trace replay.
6.28 Byte throughput time series for UNC 1 AM outbound and its four types of source-level trace replay.
6.29 Packet throughput time series for UNC 1 AM inbound and its four types of source-level trace replay.
6.30 Packet throughput time series for UNC 1 AM outbound and its four types of source-level trace replay.
6.31 Byte throughput marginals for UNC 1 AM inbound and its four types of source-level trace replay.
6.32 Byte throughput marginals for UNC 1 AM outbound and its four types of source-level trace replay.
6.33 Packet throughput marginals for UNC 1 AM inbound and its four types of source-level trace replay.
6.34 Packet throughput marginals for UNC 1 AM outbound and its four types of source-level trace replay.
6.35 Wavelet spectra of the byte throughput time series for UNC 1 AM inbound and its four types of source-level trace replay.
6.36 Wavelet spectra of the byte throughput time series for UNC 1 AM outbound and its four types of source-level trace replay.
6.37 Wavelet spectra of the packet throughput time series for UNC 1 AM inbound and its four types of source-level trace replay.
6.38 Wavelet spectra of the packet throughput time series for UNC 1 AM outbound and its four types of source-level trace replay.
6.39 Active connection time series for UNC 1 AM and its four types of source-level trace replay.
6.40 Byte throughput time series for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.41 Byte throughput time series for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.42 Packet throughput time series for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.43 Packet throughput time series for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.44 Byte throughput marginals for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.45 Byte throughput marginals for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.46 Packet throughput marginals for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.47 Packet throughput marginals for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.48 Wavelet spectra of the byte throughput time series for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.49 Wavelet spectra of the byte throughput time series for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.50 Wavelet spectra of the packet throughput time series for UNC 7:30 PM inbound and its four types of source-level trace replay.
6.51 Wavelet spectra of the packet throughput time series for UNC 7:30 PM outbound and its four types of source-level trace replay.
6.52 Active connection time series for UNC 7:30 PM and its four types of source-level trace replay.
6.53 Byte throughput time series for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.54 Byte throughput time series for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.55 Packet throughput time series for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.56 Packet throughput time series for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.57 Byte throughput marginals for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.58 Byte throughput marginals for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.59 Packet throughput marginals for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.60 Packet throughput marginals for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.61 Wavelet spectra of the byte throughput time series for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.62 Wavelet spectra of the byte throughput time series for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.63 Wavelet spectra of the packet throughput time series for Abilene-I Clev/Ipls and its four types of source-level trace replay.
6.64 Wavelet spectra of the packet throughput time series for Abilene-I Ipls/Clev and its four types of source-level trace replay.
6.65 Active connection time series for Abilene-I and its four types of source-level trace replay.
7.1 Bodies of the distributions of connection inter-arrivals for UNC 1 PM and 1 AM, and their exponential fits.
7.2 Tails of the distributions of connection inter-arrivals for UNC 1 PM and 1 AM, and their exponential fits.
7.3 Bodies of the distributions of connection inter-arrivals for Abilene-I and Leipzig-II, and their exponential fits.
7.4 Tails of the distributions of connection inter-arrivals for Abilene-I and Leipzig-II, and their exponential fits.
7.5 Average offered load vs. number of connections for 1,000 Poisson resamplings of UNC 1 PM.
7.6 Histogram of the average offered loads in 1,000 Poisson resamplings of UNC 1 PM.
7.7 Tails of the distributions of connection sizes for UNC 1 PM.
7.8 Analysis of the accuracy of connection-driven Poisson Resampling from 6,000 resamplings of UNC 1 PM (1,000 for each target offered load).
7.9 Comparison of average offered load vs. number of connections for 1,000 connection-driven Poisson resamplings and 1,000 byte-driven Poisson resamplings of UNC 1 PM.
7.10 Histogram of the average offered loads in 1,000 byte-driven Poisson resamplings of UNC 1 PM.
7.11 Analysis of the accuracy of byte-driven Poisson Resampling from 4,000 resamplings of UNC 1 PM (1,000 for each target offered load).
7.12 Analysis of the accuracy of byte-driven Poisson Resampling using source-level trace replay: replays of three separate resamplings of UNC 1 PM for each target offered load, illustrating the scaling down of load from the original 177.36 Mbps.
7.13 Analysis of the accuracy of byte-driven Poisson Resampling using testbed experiments: replay of one resampling of UNC 1 AM for each target offered load, illustrating the scaling up of load from the original 91.65 Mbps.
7.14 Connection arrival time series for UNC 1 PM (dashed line) and a Poisson arrival process with the same mean (solid line).
7.15 Connection arrival time series for UNC 1 AM and a Poisson arrival process with the same mean.
7.16 Wavelet spectra of the connection arrival time series for UNC 1 PM and a Poisson arrival process with the same mean.
7.17 Wavelet spectra of the connection arrival time series for UNC 1 AM and a Poisson arrival process with the same mean.
7.18 Block resamplings of UNC 1 PM: impact of different block lengths on the wavelet spectrum of the connection arrival time series.
7.19 Block resamplings of UNC 1 AM: impact of different block lengths on the wavelet spectrum of the connection arrival time series.
7.20 Block resamplings of UNC 1 PM: average offered load vs. number of connection vectors (left) and corresponding histograms of average offered loads (right) in 3,000 resamplings.
7.21 Wavelet spectra of several random subsamplings of the connection vectors in UNC 1 PM (left) and 1 AM (right).
7.22 Analysis of the accuracy of byte-driven Block Resampling using source-level trace replay: replays of two separate resamplings of UNC 1 PM for each target offered load, illustrating the scaling down of load from the original 177.36 Mbps.
7.23 Analysis of the accuracy of byte-driven Block Resampling using source-level trace replay: replay of one resampling of UNC 1 AM for each target offered load, illustrating the scaling up of load from the original 91.65 Mbps.
7.24 Wavelet spectra of the packet arrival time series for UNC 1 PM and the source-level trace replays of two block resamplings of this trace.
7.25 Wavelet spectra of the packet arrival time series for UNC 1 PM and the source-level trace replays of three Poisson resamplings of this trace.

LIST OF ABBREVIATIONS

ACK  Positive acknowledgment TCP segment
ADU  Application Data Unit
API  Application Programming Interface
AQM  Active Queue Management
BGP  Border Gateway Protocol
BPF  Berkeley Packet Filter
C.I.  Confidence Interval
CCDF  Complementary Cumulative Distribution Function
CDF  Cumulative Distribution Function
DAG  Data Acquisition and Generation
FIFO  First-In First-Out
FIN  TCP control flag indicating "no more data from sender"
FTP  File Transfer Protocol
GB  Gigabyte
GPS  Global Positioning System
HTML  HyperText Markup Language
HTTP  HyperText Transfer Protocol
I/O  Input/Output
ICMP  Internet Control Message Protocol
IP  Internet Protocol
IRC  Internet Relay Chat
ISP  Internet Service Provider
K-S  Kolmogorov-Smirnov test
KB  Kilobyte
Kpps  Kilopackets per second
LRD  Long-Range Dependence
MB  Megabyte
MIME  Multipurpose Internet Mail Extensions
MSS  Maximum Segment Size
MTU  Maximum Transmission Unit
Mbps  Megabits per second
NNTP  Network News Transfer Protocol
OSTT  One-Side Transit Time
PMA  Passive Measurement and Analysis
Q-Q  Quantile-Quantile
RED  Random Early Detection
RFC  Request For Comments
RST  TCP control flag indicating "connection reset"
RTT  Round-Trip Time
SMTP  Simple Mail Transfer Protocol
SSH  Secure Shell
SYN  Synchronize TCP control segment
SYN-ACK  Positive acknowledgement of SYN segment
TCP  Transmission Control Protocol
UDP  User Datagram Protocol
UNC  University of North Carolina at Chapel Hill
URL  Uniform Resource Locator

CHAPTER 1
Introduction

As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality.
-- Albert Einstein (1879–1955)

Humankind cannot bear very much reality.
-- T. S. Eliot (1888–1965)

Research in networking has to deal with the extreme complexity of many layers of technology interacting with each other in frequently unexpected ways. As a consequence, there is a broad consensus among researchers that purely theoretical analysis is not enough to demonstrate the effectiveness of network technologies. More often than not, careful experimentation in simulators and network testbeds under controlled conditions is needed to validate new ideas. Every researcher therefore faces, at some point or another, the need to design realistic networking experiments, and synthetic network traffic is a foremost element of these experiments.
Synthetic network traffic represents not only the workload of a computer network, but also the direct or indirect target of any optimization. For instance, congestion control research focuses on preserving as much as possible the ability of a network to transfer data in the face of overload. Therefore, evaluating a new congestion control mechanism in a transport protocol such as the Transmission Control Protocol (TCP) [Pos81] usually requires constructing experiments in which a number of network hosts exchange data using this protocol in an environment with one or more saturated links. The value of the new mechanism is then expressed as a function of the performance of these data exchanges. For example, the new mechanism may be optimized for achieving a higher overall throughput or a fairer allocation of bandwidth.

A fundamental insight, which provides the main motivation for this dissertation, is that the characteristics of synthetic traffic have a dramatic impact on the outcome of networking experiments. For example, a new mechanism that improves the throughput of bulk, long-lasting file transfers in a congested environment may not improve, and may even degrade, the response time of the small data exchanges in web traffic. This was precisely the case of Random Early Detection (RED), an Active Queue Management (AQM) mechanism. The original analysis by Floyd and Jacobson [FJ93a] clearly demonstrated the benefits of RED over the basic First-In First-Out (FIFO) queuing mechanism for bulk transfers. In this study, RED queues were exposed to a small number (2–4) of large file transfers. However, a later experimental study by Christiansen et al. [CJOS00] showed that this first AQM mechanism degraded the performance of web traffic in highly congested environments. In contrast to the traffic used in the original evaluation, web traffic mostly consists of a very large number of small data transfers, which create a very different workload.
The emergence of the web clearly changed the nature of Internet traffic, and made it necessary to revisit existing results obtained under different workloads. The systematic evaluation of network mechanisms must therefore include experiments covering the wide range of traffic characteristics observed on Internet links. It is critical to provide the research community with methods and tools for generating synthetic traffic as representative as possible of this range of characteristics.

The concept of source-level modeling introduced by Paxson and Floyd [PF95] constitutes a major influence on this dissertation. These authors advocated building models of the behavior of Internet applications (i.e., the sources of Internet traffic), and generating traffic in networking experiments by driving network stacks with these application models. The main benefit of this approach is that traffic is generated in a closed-loop manner, which fully preserves the fundamental feedback loop between network endpoints and network characteristics. For example, a model of web traffic can be used to generate traffic using TCP/IP network stacks, and the generated traffic will properly react to different levels of congestion in networking experiments. In contrast, open-loop traffic generation is associated with models of the packet arrivals on network links. These models are insensitive to changes in network conditions, and tied to the original conditions under which they were developed. This makes them inappropriate for experimental studies that change these conditions.

The main motivation of our work is to address one important difficulty with source-level modeling. In the past, source-level modeling has been associated with characterizing the behavior of individual applications. While this approach can result in high-quality models, it is a difficult process that requires a large amount of effort. As a consequence, only a small number of models are available, and they are often outdated.
This is in sharp contrast to the traffic observed in most Internet links, which is driven by rich traffic mixes composed of a large number of applications. Source-level modeling of individual applications does not scale to modern traffic mixes, making it very problematic for networking researchers to conduct representative experiments with closed-loop traffic.

This dissertation presents a new methodology for generating network traffic in testbed experiments and software simulations. We make three main contributions. First, we develop a new source-level model of network traffic, the a-b-t model, for describing in a generic and intuitive manner the behavior of the applications driving TCP connections. Given a packet header trace collected at an arbitrary Internet link, we use this model to describe each TCP connection in the trace in terms of data exchanges and quiet times, without any knowledge of the actual semantics of the application. Our algorithms make it possible to efficiently derive empirical characterizations of network traffic, reducing modeling times from months to hours. The same analysis can be used to incorporate network-level parameters, such as round-trip times, into the description of each connection, providing a solid foundation for traffic generation.

Second, we propose a traffic generation method, source-level trace replay, where traffic is generated by replaying the observed behavior of the applications as sources of traffic. This is therefore a method for generating entire traffic mixes in a closed-loop manner. One crucial benefit of our method is that it can be evaluated by directly comparing an original trace and its source-level replay. This makes it possible to systematically study the realism of synthetic traffic, in terms of how well our description of the connections in the original traffic mix reflects the nature of the original traffic. In addition, this kind of comparison provides a means to understand the impact that the different characteristics of a traffic mix have on specific traces and on Internet traffic in general.

Third, we propose and study two approaches for introducing variability in the generation process and scaling (up or down) the level of traffic load in the experiments. These operations greatly increase the flexibility of our approach, enabling a wide range of experimental investigations conducted using our traffic generation method.

Figure 1.1: Network traffic seen from different levels.

1.1 Abstract Source-Level Modeling

This dissertation presents a methodology for generating synthetic network traffic that addresses some of the main shortcomings of existing techniques. Figure 1.1 illustrates the levels of detail at which Internet traffic can be studied, providing a good starting point for framing our discussion. We focus on the traffic on a single Internet link, such as the one between the University of North Carolina at Chapel Hill (UNC) and the Internet. We can study the traffic in this link at different levels of detail. The top-most time-line represents traffic observed in the link between UNC and the Internet as a sequence of packet arrivals. This level of detail is known as the aggregate packet arrival level. Here packets from many different connections were interleaved, creating a complex arrival process in the network link. In general, TCP traffic accounts for the vast majority of the packets on Internet links (usually between 90% and 95%), which justifies our focus on TCP in this work.

The second time-line depicts the packet arrivals that belonged to a single TCP connection. These packets were used to send data back and forth between two network endpoints, one located at UNC, and the other one somewhere on the Internet. The sources of these data are applications running on the endpoints, which rely on the packet switching service provided by the Internet to communicate.
Prominent examples of these applications are the World Wide Web, email, file sharing, etc. Hundreds of different applications are commonly found on Internet links. The traffic observed at an Internet link is therefore the result of multiplexing the communication of a large number of endpoints driven by a wide range of applications.

This dissertation considers the problem of generating traffic in networking experiments that preserves both the aggregate-level and the connection-level properties of traffic observed in a real network link. Note that we restrict ourselves to this most basic form of the problem, where only a single link is considered both for observing traffic and for reproducing it in networking experiments. Our findings can certainly be applied to a broader context, e.g., multiple links along a path following the "parking lot topology" [PF95], links in an ISP, etc., but we choose to keep to this problem in its most essential form throughout this dissertation.

As mentioned before, every connection on the Internet is driven by an application exchanging data between two endpoints. It is therefore possible to examine traffic at a higher level, where the communication is described in terms of application data units (ADUs) rather than network packets. This application level is illustrated in the bottom time-line of Figure 1.1, which reveals that the source of the packets in the second time-line was the exchange of data between a web browser and a web server using a TCP connection. The time-line shows a first ADU of 2,500 bytes, which carried a request for an HTML page. The way the data is organized within this ADU and its meaning is given by the specification of the HyperText Transfer Protocol (HTTP) [FGM+97], which standardizes the exchange of data between web browsers and web servers. The time-line shows a second ADU, sent by the web server to the web browser in response to the first ADU. It carried the actual HTML source code of the page requested by the browser.
Its size was 4,800 bytes, which included not only the HTML source code but also an appropriate HTTP header. The time-line shows another pair of ADUs that also corresponded to an HTTP request and an HTTP response, which this time carried an image file. Each ADU is associated with one or more packets in the second time-line. The amount of data in these ADUs and its meaning were decided by the application, while the actual number of packets, their sizes, the need for retransmissions, etc., were decided by lower layers (transport and below).

The application level provides the starting point for the traffic modeling and generation methodology developed in this dissertation. Our approach to traffic generation relies on the notion of source-level modeling, advocated by Paxson and Floyd [FP01]. Rather than directly generating packets according to some trace or some packet arrival model, source-level modeling involves simulating the behavior of the applications running on the endpoints and allowing lower layers to control the actual exchange of packets. For example, generating traffic with a source-level model of web traffic means simulating web browsers and web servers according to statistical models of web page sizes, the durations of user think times and other source-level parameters [Mah97, BC98, SHCJO01].

Modeling traffic at the source level produces descriptions of traffic that are mostly independent of the underlying protocols and network conditions, so they can be used to drive traffic generation in experiments that modify these same protocols and conditions. For this reason, source-level models are also known as network-independent models. For example, the size of an HTML page carried in a TCP connection does not change with the degree of congestion (it always has the same number of characters). Therefore, its size is a network-independent property. Lower-level descriptions of traffic, such as characterizations of packet arrivals, are network dependent.
For example, the rate at which the packets of a TCP connection arrive decreases as the degree of congestion increases, since TCP uses a congestion control algorithm that decreases the sending rate as the loss rate increases. Also, packet losses force TCP endpoints to perform retransmissions. This means that the transmission of the same amount of data at the source level (e.g., an HTML page) at different times may require different numbers of packets to be transferred, depending on the number of lost packets. A source-level model describes the sizes of ADUs, but not the times at which a connection should lower its sending rate or retransmit a packet. For this reason, the same model can be used to generate traffic under different network conditions, such as low and high levels of congestion. Endpoints generating traffic using these models are able to adapt to each specific set of network conditions in the experiments. This preserves the fundamental feedback loop that exists between endpoints and network conditions. For this reason, this type of traffic generation is said to be closed-loop. In contrast, traffic generated according to lower-level models is necessarily open-loop. For example, tcpreplay [tcpb] can be used to reenact the sending of every packet recorded in a trace, which results in open-loop traffic that is insensitive to the underlying network conditions. This traffic is inappropriate for experiments where network conditions are important, such as the evaluation of congestion control mechanisms.

In the past, source-level modeling has been considered a synonym of application modeling, so researchers have developed a number of application-specific models, including models for web traffic, file transfer and other individual applications. This approach is good if one is interested in the traffic generated by a single application (or by a handful of applications). However, if one is interested in realistic traffic mixes, application-specific traffic modeling has some important shortcomings.
The first problem is that application-specific modeling does not scale well to the large number of applications that form contemporary traffic mixes. For example, the weekly traffic report from Internet2 [Con04] collects separate statistics for more than 80 different applications that make up Internet2 traffic. Using existing technology, it is simply too time-consuming to develop and populate individual models for each application. Moreover, even if we had the resources to examine the behavior of all applications, many applications use proprietary protocols, so painstaking reverse engineering is needed to understand and model their behavior. In addition, Internet traffic evolves quickly, since new applications and improved versions of the existing ones appear very frequently.

Figure 1.2: An a-b-t diagram illustrating a persistent HTTP connection.

This dissertation proposes a more general solution to the source-level modeling and the traffic generation problems. We develop an abstract model of network data exchange wherein each connection is described independently of the semantics of the application initiating the connection. This idea is illustrated in the third time-line of Figure 1.1. Here the communication is described in generic terms, simply as a sequence of ADU exchanges between the two endpoints of the TCP connection, without attaching any meaning to the ADUs. Other generic characteristics of traffic include the direction in which the ADUs are sent, from the connection initiator or from the connection acceptor, and the duration of quiet times between ADUs, which are due to user behavior and processing times. These characteristics can generally be used to describe the behavior of any specific application. For example, the ADUs of web traffic are HTTP requests and responses, while the inter-ADU times are user think times and server processing times.
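The generic description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `GenericAdu` and `strip_semantics` and the layout of the labelled tuples are hypothetical and are not part of the tools described in this dissertation. The two ADU sizes are those of the web exchange in Figure 1.1, while the quiet times are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericAdu:
    """An ADU described without any application semantics (hypothetical type)."""
    sender: str         # "initiator" or "acceptor" of the TCP connection
    size_bytes: int     # amount of application data in the ADU
    quiet_after: float  # seconds of quiet time before the next ADU


def strip_semantics(labelled_messages):
    """Discard application-specific labels, keeping only generic properties."""
    return [
        GenericAdu(sender, size_bytes, quiet_after)
        for (_label, sender, size_bytes, quiet_after) in labelled_messages
    ]


# First two ADUs of the web exchange in Figure 1.1. The sizes come from the
# text; the quiet times (0.0 s and 2.0 s) are assumptions for illustration.
web_exchange = [
    ("HTTP request for an HTML page", "initiator", 2500, 0.0),
    ("HTTP response carrying the page", "acceptor", 4800, 2.0),
]

generic_exchange = strip_semantics(web_exchange)
for adu in generic_exchange:
    print(adu)
```

The same function would apply unchanged to messages from any other application, which is the point of the abstraction: nothing in `GenericAdu` depends on HTTP.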
The crucial observation is that the sizes of ADUs and the times between them can be measured from the packet traces of TCP connections without knowledge of the behavior of the application driving each connection. This makes it possible to construct a source-level description of the entire set of connections observed in a measured link, instead of only the connections driven by one or a few well-known applications. Any trace of packets traversing a network link can be transformed into an abstract source-level trace, without examining the payload of the packets and without instrumenting the endpoints.

Our approach to source-level modeling results in an abstract representation of a TCP connection using a notation that we call an a-b-t connection vector. We also refer to this idea as the a-b-t model, in the sense that it provides a mental model for understanding network traffic at the source level, rather than in the sense of a mathematical or statistical model.¹ The term a-b-t is descriptive of the basic building blocks of this model: a-type ADUs (a's), which are sent from the connection initiator to the connection acceptor, b-type ADUs (b's), which flow in the opposite direction, and quiet times (t's), during which no data segments are exchanged. We will make use of these terms to describe the source-level behavior of TCP connections throughout this dissertation.

Our a-b-t model has a sequential version and a concurrent version. The sequential version applies to connections where the endpoints follow a strict order in their exchange of ADUs. In this version, a TCP connection is described by a vector of epochs $(e_1, e_2, \ldots, e_n)$. Each epoch has the form $e_j = (a_j, ta_j, b_j, tb_j)$, where $a_j$ is the size of an ADU sent from the connection initiator to the connection acceptor, $b_j$ is the size of an ADU sent in the opposite direction, and $ta_j$ and $tb_j$ are inter-ADU quiet times (during which the endpoints are idle).
We call this representation of source-level behavior a sequential connection vector. For example, the connection illustrated in Figure 1.2 is represented as $((329, 0, 403, 0.12), (403, 0, 25821, 3.12), (356, 0, 1198, 15.3))$ using the sequential a-b-t model. This connection has three epochs, each carrying one HTTP request/response pair. The first epoch has an ADU $a_1$ of size 329 bytes, which was sent from the connection initiator (a web browser) to the connection acceptor (a web server), and an ADU $b_1$ of size 403 bytes, which was sent in the opposite direction. We also observe some quiet times between the ADUs, such as $tb_2$, which had a duration of 3.12 seconds. While Figure 1.2 includes labels for HTTP requests, responses and documents, our a-b-t notation is completely generic. We consider this TCP connection sequential because only one endpoint sent data to the other one at any point in the lifetime of the connection.

¹ Our a-b-t model provides, however, a good foundation for developing mathematical and statistical models of traffic at the source level. This dissertation consistently follows a non-parametric approach to traffic modeling. The only exception is the Poisson Resampling method presented in Chapter 7, for which we also offer a more powerful non-parametric alternative, Block Resampling.

Figure 1.3: A diagram illustrating the interaction between two BitTorrent peers.

It is important to reiterate that an ADU is not a TCP segment (i.e., TCP packet), but an application message that is independent of its actual network representation as a link-level packet. As such, an ADU can be of arbitrary size, like the smaller $a_1 = 329$ bytes and the larger $b_2 = 25{,}821$ bytes in the previous example. The transfer of $a_1$ would usually involve a single TCP segment, but it is also possible that this segment gets duplicated, or lost and then retransmitted. In this case, the TCP endpoint sending $a_1$ would generate two or more segments carrying this ADU.
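As an illustration, the sequential connection vector of Figure 1.2 can be written down directly as a list of epochs. The following is a minimal Python sketch and not part of the original tooling: the names `Epoch`, `total_bytes` and `min_segments` are hypothetical, and the segment count assumes a loss-free path with the 1,460-byte MSS that results from Ethernet's 1,500-byte MTU.

```python
from math import ceil
from typing import NamedTuple


class Epoch(NamedTuple):
    """One epoch (a_j, ta_j, b_j, tb_j) of a sequential connection vector."""
    a: int      # bytes sent by the connection initiator
    ta: float   # quiet time (seconds) that follows a
    b: int      # bytes sent by the connection acceptor
    tb: float   # quiet time (seconds) that follows b


# The persistent HTTP connection of Figure 1.2, copied from the text.
http_connection = [
    Epoch(329, 0.0, 403, 0.12),
    Epoch(403, 0.0, 25821, 3.12),
    Epoch(356, 0.0, 1198, 15.3),
]


def total_bytes(vector):
    """Return (initiator bytes, acceptor bytes) summed over all epochs."""
    return sum(e.a for e in vector), sum(e.b for e in vector)


def min_segments(adu_size, mss=1460):
    """Fewest TCP segments that can carry an ADU when nothing is lost.

    The default MSS is an assumption of this sketch (Ethernet MTU of
    1,500 bytes minus 20 bytes of IP header and 20 bytes of TCP header).
    """
    return ceil(adu_size / mss)


print(total_bytes(http_connection))
print(min_segments(http_connection[0].a))
print(min_segments(http_connection[1].b))
```

Under these assumptions the three request ADUs add up to 1,088 bytes and the three responses to 27,422 bytes, while $a_1$ fits in a single segment. The vector itself says nothing about segments, which is exactly what keeps the description independent of the network.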
Our notation would still describe this part of the TCP connection as a single 329-byte ADU, and not as the sequence of TCP segments used to transfer the data. Similarly, transferring $b_2 = 25{,}821$ bytes requires a minimum of 18 TCP segments on a path without loss and with a regular Maximum Segment Size (MSS) of 1,460 bytes (the one derived from Ethernet's Maximum Transmission Unit (MTU) of 1,500 bytes, after subtracting 20 bytes for the IP header and 20 bytes for the TCP header). It may require many more segments in a lossy environment, or on a path with a lower MTU. However, these details are irrelevant at the abstract source level, where $b_2$ captures the need of one of the endpoints to send 25,821 bytes of data, and this need is independent of the way in which the data is transferred by the network. Our modeling is therefore network-independent, which makes it suitable for generating closed-loop traffic.

While most TCP connections are driven by applications that follow a sequential pattern of ADU exchanges, we can also find cases in which the two endpoints send data to each other at the same time. This is illustrated in Figure 1.3 using a BitTorrent [Coh03] connection, where we can see ADUs whose transmission overlaps in time (i.e., the ADUs are exchanged concurrently). This pattern is certainly less common than the sequential one, but it is supported in important protocols like HTTP/1.1 (pipelining), NNTP (streaming mode) and BitTorrent. Our analysis shows that while the fraction of connections with concurrent data exchanges is usually small (less than 4% in our traces), such concurrent connections often carry a significant fraction (15%-35%) of the total bytes seen in a trace, and hence modeling these connections is critical if one wants to generate realistic traffic mixes. To represent concurrent ADU exchanges, the actions of each endpoint are considered to occur independently of each other.
Thus each endpoint is a separate source generating ADUs that appear as a sequence of epochs following a unidirectional flow pattern. Formally, this means that we represent each connection as a pair $(\alpha, \beta)$ of connection vectors of the form

$\alpha = ((a_1, ta_1), (a_2, ta_2), \ldots, (a_{n_a}, ta_{n_a}))$

and

$\beta = ((b_1, tb_1), (b_2, tb_2), \ldots, (b_{n_b}, tb_{n_b}))$,

where $a_i$ and $b_i$ are sizes of ADUs sent from the initiator and from the acceptor of the TCP connection respectively, and $ta_i$ and $tb_i$ are quiet times between the ADUs. We call this representation of source-level behavior a concurrent connection vector. Unlike the sequential version of the a-b-t model, this representation does not capture any causality between the two directions of a TCP connection. As a consequence, traffic generated according to this version of the model usually exhibits a substantial number of concurrent data exchanges.

The a-b-t model provides a simple yet expressive way of describing source-level behavior in a generic manner that is not tied to the details of any application. In addition, this non-parametric model was designed to incorporate quantities (ADU sizes, ADU directionality, and inter-ADU quiet time duration) that can be extracted from packet header traces in an efficient, accurate manner. We can easily imagine more complex and expressive models of TCP connections for which no efficient data acquisition algorithm exists, or models that deal with characteristics of source-level behavior that cannot be extracted purely from packet headers. In the case of the a-b-t model, we have developed a data acquisition algorithm that relies on TCP sequence numbers for measuring ADU sizes, and on the packet arrival timestamps obtained during trace collection to determine inter-ADU quiet times.
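The idea of measuring ADU sizes from sequence numbers and quiet times from arrival timestamps can be illustrated with a deliberately simplified Python sketch. It is not the dissertation's algorithm: it assumes a sequential connection, relative sequence numbers that start at 0 and never wrap, and a hypothetical input format of `(timestamp, direction, seq, length)` tuples; the function names are also invented for illustration. Tracking the highest sequence number seen in each direction keeps the byte count insensitive to duplicates, retransmissions and reordering within an ADU.

```python
def extract_adus(segments):
    """Group the data segments of one connection into ADUs.

    segments: iterable of (timestamp, direction, seq, length) in arrival
    order. direction is "a" (initiator to acceptor) or "b" (the reverse);
    seq is the relative sequence number of the first payload byte.
    Returns a list of [direction, size, first_ts, last_ts] entries.
    """
    high = {"a": 0, "b": 0}  # highest sequence number seen per direction
    adus = []
    for ts, direction, seq, length in segments:
        new_bytes = seq + length - high[direction]
        if new_bytes <= 0:
            continue  # duplicate, retransmission or late arrival: no new data
        high[direction] = seq + length
        if adus and adus[-1][0] == direction:
            adus[-1][1] += new_bytes  # the current ADU keeps growing
            adus[-1][3] = ts
        else:
            adus.append([direction, new_bytes, ts, ts])  # direction turned
    return adus


def quiet_times(adus):
    """Gap between the end of each ADU and the start of the next one."""
    return [round(nxt[2] - prev[3], 6) for prev, nxt in zip(adus, adus[1:])]


# A made-up trace: two request/response exchanges, with one reordered
# segment and one retransmission in the second response.
segments = [
    (0.000, "a", 0, 329),
    (0.050, "b", 0, 403),
    (0.170, "a", 329, 403),
    (0.220, "b", 403, 1460),
    (0.221, "b", 3323, 1460),  # arrives before the segment that precedes it
    (0.222, "b", 1863, 1460),  # the reordered segment
    (0.300, "b", 403, 1460),   # retransmission of already counted data
]
adus = extract_adus(segments)
print([(d, size) for d, size, _, _ in adus])
print(quiet_times(adus))
```

The gaps computed here are those observed at the monitoring point, so they still include network delay; separating the two is outside the scope of this sketch.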
Our algorithm constructs a data structure in which TCP segments are ordered according to their logical data order, i.e., the order in which data must be delivered to the application layer of the receiving endpoint. In reconstructing this logical order for each connection, we have developed methods for dealing with network pathologies such as arbitrary segment reordering, duplication and retransmission. Furthermore, when the data segments in a TCP connection cannot be ordered according to the logical data order, we can classify the connection as concurrent with certainty. Our data structure supports both sequential (i.e., bidirectional) and concurrent (i.e., unidirectional) ordering, making it possible to extract ADU sizes and quiet times with a single pass over the segments of a TCP connection found in a trace. The analysis can be performed in $O(sW)$ time, where $s$ is the number of data segments in the connection and $W$ is the maximum size of the TCP window (which bounds the maximum amount of reordering).

Figure 1.4: Overview of Source-level Trace Replay. (Diagram labels: Original Packet Header Trace, Trace Analysis, Original Connection Vectors, Trace Partitioning, Tmix Traffic Generators, Testbed, Generated Packet Header Trace, Trace Analysis, Replayed Connection Vectors.)

1.2 Source-Level Trace Replay

Our abstract source-level modeling of TCP connections provides a solid foundation for generating traffic mixes in simulators and network testbeds. We propose to generate traffic using source-level trace replay, as illustrated in Figure 1.4. Given a packet header trace $T_h$ collected from some Internet link, we first use our data acquisition algorithm to analyze the trace and describe its content as a collection of connection vectors $T = \{(T_i, C_i)\}$, where $T_i$ is the relative start time of the i-th TCP connection, and $C_i$ is the sequential or concurrent connection vector corresponding to this connection.
The basic approach for generating traffic according to $T$ is to replay every connection vector $C_i$. Each connection vector $C_i$ is replayed by starting a TCP connection precisely at $C_i$'s relative start time $T_i$, and transmitting the measured sequence of ADUs ($a_j$ and $b_j$) separated in time by the measured inter-ADU quiet times ($ta_j$ and $tb_j$). In this dissertation, we evaluate a specific implementation of this approach for FreeBSD network testbeds, where traffic is generated using a tool we developed called tmix.

The goal of the direct source-level trace replay of $T$ is to reproduce the source-level characteristics of the traffic in the original link, generating the traffic in a closed-loop fashion. Closed-loop traffic generation implies the need to simulate the behavior of applications, using regular network stacks to actually translate source-level behavior into network traffic. In particular, our experiments use an implementation which relies on the standard socket interface to reproduce the data exchanges in each connection vector. Generating traffic in this manner is closed-loop in the sense that it preserves the feedback mechanism in TCP, which adapts its behavior to changes in network conditions, such as loss and receiver saturation. In contrast, packet-level trace replay, the direct reproduction of $T_h$, is an open-loop traffic generation method in the sense that TCP control algorithms are not used during the generation, and hence the traffic does not adapt to network conditions.

The evaluation of our methodology consists of comparing the original trace $T_h$ and the synthetic trace $T'_h$ obtained from the source-level trace replay. Validating our traffic generation method consists of transforming $T'_h$ into a set of connection vectors $T'$, using the same method used to transform $T_h$ into $T$. We then compare the resulting set of connection vectors $T'$ with the original $T$. In principle, they should be identical, since $T$ represents the invariant source-level characteristics of $T_h$.
There are however some differences that are explained by the nature of the model and our measurement methods.

The direct comparison of $T_h$ and $T'_h$ also provides a way to study the accuracy of our approach in terms of how well traffic is described by the a-b-t model. This is however a subtle exercise. The actual replay of $T$, which creates $T'_h$, necessarily requires the selection of a set of network-level parameters, such as round-trip times and TCP receiver window sizes, for each TCP connection in the source-level trace replay. The exact set of generated TCP segments and their arrival times is a direct function of these parameters. As a consequence, if we conduct a source-level trace replay using arbitrary network-level parameters, we obtain a $T'_h$ with little resemblance to the original $T_h$. The replayed a-b-t connection vectors may be a perfect description of the source behavior driving the original connections, but the generated packet-level trace $T'_h$ would still be very different from the original $T_h$. To address this difficulty, our replay incorporates network-level parameters individually derived from each connection in $T_h$. We have also incorporated methods for measuring three important network-level parameters (round-trip time, TCP receiver window size and loss rate) into our analysis and generation procedure. While this set of parameters is by no means complete, it does include the main parameters that affect the average throughput of a TCP connection found in a trace. This enables us to generate traffic in a closed-loop manner that approximates measured traces very closely.

Incorporating network-level properties is important, but it is critical to understand the main shortcoming of this approach. The goal of our work is not to make the generated traffic $T'_h$ identical to the original traffic $T_h$, which could be accomplished with a simple packet-level replay.
As mentioned before, packet-level replays generate traffic that does not adapt to changes in network conditions, resulting in open-loop traffic. Our goal is to develop a closed-loop traffic generation method based on a detailed characterization of source behavior. Traffic generated in a closed-loop manner can adapt to different network conditions, which are intrinsic when evaluating different network mechanisms. Our comparison of $T_h$ and $T'_h$ is only a means to understand the quality of the traffic generation method, where quality is considered to be higher as the original trace is more closely approximated. If enough parameters of the original traffic are accurately measured and incorporated into the traffic generation experiment, we expect to observe a great similarity between $T_h$ and $T'_h$. On the contrary, if we are missing some important parameters, we expect to observe substantial differences between the traces.

By construction, traffic generated using source-level trace replay can never be identical to the original traffic. The statistical properties of original packet header traces are the result of multiplexing a large number of connections onto a single link, and these connections traverse a large number of different paths with a variety of network conditions. It is simply not possible to fully characterize this environment and reproduce it in a laboratory testbed or in a simulation. This is both because of the limitations of passive inference from packet headers, and because of the stochastic nature of network traffic. Source-level trace replay can never incorporate every factor that shaped $T_h$, and therefore differences between $T_h$ and $T'_h$ are unavoidable. Still, finding a close match between an original trace and its replay, even if they are not identical, constitutes strong evidence of the accuracy of the a-b-t model and the data acquisition and generation methods we have developed. It also demonstrates the feasibility of generating realistic network traffic in a closed-loop manner that resembles a rich traffic mix.
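To make the replay step of this section concrete, the sketch below expands one sequential connection vector into the ordered actions that each endpoint would perform over a socket. This is an illustrative Python sketch rather than a description of tmix internals: the function name and the `(action, value)` tuples are hypothetical, and it assumes that the acceptor idles for $ta_j$ before sending $b_j$ while the initiator idles for $tb_j$ before starting the next epoch.

```python
def endpoint_scripts(epochs):
    """Expand a sequential connection vector into per-endpoint actions.

    epochs: list of (a, ta, b, tb) tuples, sizes in bytes and quiet
    times in seconds.
    Returns (initiator_actions, acceptor_actions). Each action is a
    ("send", nbytes), ("recv", nbytes) or ("sleep", seconds) tuple.
    """
    initiator, acceptor = [], []
    for a, ta, b, tb in epochs:
        if a > 0:
            initiator.append(("send", a))
            acceptor.append(("recv", a))
        if ta > 0:
            acceptor.append(("sleep", ta))   # assumed: acceptor-side quiet time
        if b > 0:
            acceptor.append(("send", b))
            initiator.append(("recv", b))
        if tb > 0:
            initiator.append(("sleep", tb))  # assumed: initiator-side quiet time
    return initiator, acceptor


# The connection vector of Figure 1.2.
vector = [(329, 0, 403, 0.12), (403, 0, 25821, 3.12), (356, 0, 1198, 15.3)]
initiator, acceptor = endpoint_scripts(vector)
for action in initiator:
    print(action)
```

Because the scripts only say how many bytes to write and read and how long to stay idle, the number of segments, their timing and any retransmissions are left to the TCP stack that executes them, which is what makes the generated traffic closed-loop.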
1.3 Trace Resampling and Load Scaling

As long as the network setup of a simulation or testbed experiment remains unchanged, the source-level trace replay of a connection vector trace $T = \{(T_i, C_i)\}$ always results in traffic that is similar to the original trace. Every replay contains the same number of TCP connections behaving according to the same connection vector specification and starting at the same times. Only tiny variations are introduced on the end-systems by changes in clock synchronization, operating system scheduling and interrupt handling, and at switches and routers by the stochastic nature of packet multiplexing. Source-level trace replay has therefore two desirable properties:

• The quality of the synthetic traffic can be evaluated by directly comparing synthetic and original traffic. This makes it possible to study the accuracy of the analysis methods and the generation system with complete freedom, using any metric that can be derived from real traffic. In contrast, more abstract methods based on parametric models of traffic are inherently stochastic and therefore more difficult to evaluate. For such methods, it is less obvious whether the observed difference between the traffic generated using the parametric model and the original traffic from which the model derives should be considered acceptable.

• The generation of the synthetic traffic is fully reproducible. A researcher can expose a collection of network protocols and mechanisms to exactly the same closed-loop traffic, which provides the right foundation for fair comparative studies. In contrast, stochastic variation in the traffic generated using parametric models is often difficult to control. For example, experiments with models that rely on heavy-tailed distributions converge very slowly to comparable conditions, as discussed by Crovella and Lipsky [CL97].

While these properties are important, the practice of experimental networking often requires introducing controlled variability in the generated traffic in order to explore a wider range of scenarios.
This motivates the development of methods that manipulate $T$ in order to generate different traffic that still resembles the original one. Furthermore, developing a statistically sound way of manipulating $T$ is essential for generating traffic with different levels of offered load. This manipulation to match a target offered load is a very common need in experimental networking research. This is because the performance of a network mechanism or protocol is often affected by the amount of traffic to which it is exposed. Therefore, rigorous experimental studies frequently require generating a complete range of target loads.

In this dissertation, we propose two flexible methods for introducing variability in traffic generation experiments. In both cases, the set of connection vectors in $T$ is randomly resampled, resulting in a new set $T'$ that preserves the aggregate source-level characteristics of the original traffic. In our first method, Poisson Resampling, we construct a new connection vector trace $T'$ by randomly resampling connections from $T$, and assigning them exponentially distributed inter-arrival times. As a result, connections in $T'$ arrive according to a Poisson process. In the second method, Block Resampling, we resample blocks (groups) of connections rather than individual connections. This method results in a more realistic connection arrival process, which matches the substantial burstiness observed in real traces. In more technical terms, Block Resampling preserves the moderate long-range dependence found in real connection arrival processes, while Poisson Resampling results in a short-range dependent connection arrival process. This difference is demonstrated in our experimental evaluation of the two methods. In addition, the evaluation shows that the duration of the resampling block creates a trade-off between shorter blocks (which increase the number of distinct resamplings) and long-range dependence (which disappears for short blocks).
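The two resampling ideas can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the choice of sampling with replacement, the fixed number of output connections and the way blocks are cut by start time are all assumptions of the sketch, since the text above does not fix those details.

```python
import random


def poisson_resample(trace, n, mean_interarrival, rng):
    """Draw n connection vectors at random and give them Poisson arrivals.

    trace: list of (start_time, connection_vector) pairs.
    Start times are rebuilt from exponentially distributed inter-arrival
    times; sampling is with replacement (an assumption of this sketch).
    """
    vectors = [vector for _, vector in trace]
    now, resampled = 0.0, []
    for _ in range(n):
        now += rng.expovariate(1.0 / mean_interarrival)
        resampled.append((now, rng.choice(vectors)))
    return resampled


def block_resample(trace, block_len, duration, rng):
    """Rebuild a trace from randomly chosen blocks of block_len seconds.

    Connections keep their start offset inside the block they came from,
    so the arrival pattern within each block is preserved.
    """
    n_blocks = int(duration // block_len)
    blocks = [[] for _ in range(n_blocks)]
    for start, vector in trace:
        index = int(start // block_len)
        if index < n_blocks:
            blocks[index].append((start - index * block_len, vector))
    resampled = []
    for k in range(n_blocks):
        for offset, vector in rng.choice(blocks):
            resampled.append((k * block_len + offset, vector))
    return resampled


# Toy trace: ten placeholder connection vectors, one starting each second.
trace = [(float(i), ("conn", i)) for i in range(10)]
rng = random.Random(7)  # fixed seed so that a resampling can be reproduced
poisson_trace = poisson_resample(trace, 20, 0.5, rng)
block_trace = block_resample(trace, 2.0, 10.0, rng)
print(len(poisson_trace), len(block_trace))
```

Seeding the random number generator keeps any particular resampling repeatable, which matters if the variability introduced is to remain controlled.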
Our analysis demonstrates that block durations between 1 and 5 minutes offer the best compromise.

Researchers often need to conduct a set of experiments with a range of different traffic loads. When using a traditional source-level model, e.g., a model of web traffic, researchers have to first conduct a preliminary experimental study to determine how the parameters of the model, e.g., the number of user equivalents, affect the generated load [CJOS00, LAJS03, K LH+02]. This is usually known as the calibration of the traffic generator. Our resampling methods eliminate this common need for calibrating traffic generators, since the resampling process can be controlled to match a specific target load (i.e., the generated load is known a priori). In the case of Poisson Resampling, this is accomplished by changing the mean arrival rate of connections. In the case of Block Resampling, offered load is manipulated using block thinning (i.e., subsampling) and block thickening (i.e., combining blocks).

Our work reveals that load scaling cannot be based simply on controlling the number of connections. Such an approach frequently results in offered loads that are far from the target, because the number of connections in a resample is not strongly correlated with the offered load represented by these connections. We address this difficulty by developing byte-driven versions of Poisson Resampling and Block Resampling, which scale load using a running count of the total data in the resampled trace $T'$. Unlike the number of connections, the total amount of data in $T'$ is strongly correlated with the traffic load offered by $T'$. Our experiments confirm that byte-driven resampling is highly accurate, eliminating the common need for calibrating traffic generators.

1.4 Thesis Statement

This dissertation considers the following thesis:

1. An abstract source-level model can describe in detail the entire set of TCP application behaviors observed in real networks.

2. Descriptions of abstract source-level behavior can be empirically derived from packet header traces in an efficient, accurate manner.

3. Traffic generation based on this abstract source-level modeling results in synthetic traffic that is realistic and suitable for experimental networking research.

4. The abstract source-level model of a trace can be manipulated to introduce statistically valid variability in the generated traffic and also to accurately match a target offered load while preserving application characteristics.

1.5 Contributions

We highlight the following contributions from this dissertation:

• We develop the concept of abstract source-level modeling and the a-b-t notation for describing the source-level behavior of entire traffic mixes.

• We identify a fundamental dichotomy in source-level behavior between connections that exchange data sequentially and connections that exchange data concurrently. Our a-b-t notation includes a sequential version and a concurrent version that make it possible to appropriately describe these two types of behaviors.

• We formulate a formal test of concurrency that can be applied to the packet headers of any TCP connection, and that does not suffer from false positives. This enables us to accurately classify connections as sequential or concurrent. We show that only a small fraction of TCP connections (less than 4% in our traces) exchange data concurrently, but that these TCP connections account for a substantial fraction (up to 32%) of the total traffic.

• We present an efficient algorithm for transforming a packet header trace into a collection of sequential and concurrent a-b-t connection vectors. Given a TCP connection for which we observe $s$ segments and that has a maximum receiver window size of $W$, the asymptotic cost of our algorithm is $O(sW)$. We demonstrate that this algorithm is accurate using traffic generated from synthetic applications (i.e., with known characteristics).
• We develop source-level trace replay, a closed-loop traffic generation method that uses a-b-t connection vectors as a non-parametric model of network traffic. One key benefit of this approach is the possibility of directly comparing original and generated traffic, which we use to evaluate the "realism" of our traffic generation approach. This comparison requires us to incorporate some network-level parameters (round-trip times, maximum receiver window sizes, and possibly loss rates) into the traffic generation. These parameters can be measured from packet header traces. We pay special attention to passive round-trip time estimation in our data acquisition, developing the concept of One-Side Transit Time and studying the impact of delayed acknowledgments on passive round-trip time estimation.

• We implement our traffic generation method in a network testbed, developing a new distributed traffic generation tool, tmix. We use this implementation to study the results of a large collection of trace replay experiments, evaluating the need for detailed source-level modeling and the impact of losses on measured network traffic. Our results demonstrate that detailed source-level modeling is often required for accurately approximating real traffic, which shows that source-level behavior is a major factor shaping Internet traffic. The most substantial differences are observed for the number of active connections and the number of packet arrivals per unit of time. Byte arrivals per unit of time and long-range dependence do not improve so consistently with the use of detailed source-level modeling. We also show that losses had only a secondary effect in our traces, but they are not negligible when comparing original and generated traffic.

• We present two trace resampling algorithms which can be used to derive new traces from an existing one, preserving its statistical characteristics at the source level.
Our comparison of the two methods reveals that the observed long-range dependence in connection arrivals has no apparent impact on the long-range dependence of packet and byte arrivals.

• We demonstrate the need for byte-driven rather than connection-driven resampling in order to accurately scale offered loads, and develop byte-driven versions of our two resampling methods. This approach eliminates the need for the experimental calibration of traffic generators (which studies the relationship between the parameters of the generator and the offered traffic load).

• Our entire methodology makes it possible to conduct networking experiments with closed-loop synthetic traffic derived from real traces in an automated manner. This eliminates the need for painstaking parametric modeling.

1.6 Overview

Chapter 2 presents a review of the state of the art in synthetic traffic generation. We first expand our discussion of packet-level traffic generation and data acquisition, and then examine source-level traffic generation in more depth. We review the literature on application-specific modeling, discussing models of web traffic and other applications, and also consider several approaches for generating traffic driven by more than one application. We also discuss existing methods for controlling the traffic load created in networking experiments. The chapter finally considers some research efforts addressing implementation issues.

Chapter 3 discusses abstract source-level modeling, presenting several examples of real applications and how their behavior can be described using our a-b-t notation. We also present our measurement algorithm for transforming a packet header trace into a collection of sequential and concurrent a-b-t connection vectors. The chapter also includes a validation of the measurement method using synthetic applications, and a measurement study that examines the statistical properties of the a-b-t connection vectors extracted from five real traces.

Chapter 4 focuses on network-level measurement.
We first describe our methods for measuring round-trip times, window sizes and loss rates, and present an evaluation of their accuracy. While this set of parameters is by no means complete, it does include the main parameters that affect the average throughput of a TCP connection found in a trace. The second part of Chapter 4 describes the network-level metrics that we consider in the evaluation of our traffic generation method: packet and byte throughput time series, their marginal distributions, wavelet spectra, Hurst parameter estimates and time series of active connections.

Chapter 5 describes source-level trace replay and our implementation in a network testbed. We present a validation of this implementation using the source-level trace replays of five traces. For each trace, we study the a-b-t connection vectors extracted from the original traces and those found in replays with and without packet losses at the network links. The results demonstrate the accuracy of our approach, and also uncover some difficulties, which are in some cases inherent to the a-b-t model and its passive method of data acquisition.

Chapter 6 examines the results of several source-level trace replay experiments. Our analysis compares original traces and their source-level trace replays using the rich set of metrics introduced in Chapter 4, revealing a remarkably close approximation. This study also includes a comparison of traffic generated with the a-b-t model and with a simplified version that "disables" source-level modeling, which is shown to perform well for some metrics and poorly for others. As in the previous chapter, we also consider experiments with and without artificial losses, showing that loss did not have a dominant impact on the characteristics of the original traffic. In general, our results provide a strong justification of our source-level modeling approach, demonstrating that the closed-loop replay of a-b-t connection vectors closely resembles real traffic.
Chapter 7 presents our two resampling methods, Poisson Resampling and Block Resampling. These methods enable the researcher to introduce controlled variability in source-level trace replay experiments, without sacrificing reproducibility. In addition, we consider the problem of load scaling, i.e., how to control the resampling process to obtain a new trace with a target offered load. Our work demonstrates that this task can be accomplished by keeping track of the total number of data bytes in the resampled trace, but not by keeping track of the number of connections. Our scaling methods eliminate the common need for running a preliminary study to calibrate the traffic generator.

Chapter 8 presents our conclusions and discusses future work.

CHAPTER 2

Related Work

A scientific theory should be as simple as possible, but no simpler.
-- Albert Einstein (1879–1955)

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
-- Bertrand Russell (1872–1970)

This chapter presents an overview of the research literature relevant for realistic traffic generation. We consider two types of works. First, we discuss the body of literature that developed the concepts and techniques currently in use for generating synthetic traffic in simulations and testbed experiments. Second, we examine the Internet measurement literature that informs the discussion of what is meant by "realistic" traffic generation. Intuitively, synthetic traffic resembling Internet traffic can only be realistic if derived from measurements conducted on real network links. We could argue that any Internet measurement paper helps to gain a better understanding of the nature of the Internet and its traffic, being therefore relevant for realistic traffic generation. However, the sheer size of the Internet measurement literature makes a complete overview impractical, so we will restrict ourselves to the main works that had a direct impact on Internet traffic generation.
It is also interesting to note that the most recent trend in the field of traffic generation is precisely to combine traffic measurement and generation into a single, coherent approach [HCJS+01, LH02, SB04, HCSJ04].

Traffic generation for experimental networking research was identified as one of the key challenges in Internet modeling and simulation by Paxson and Floyd [PF95] in 1995. Interestingly, Floyd and Kohler [FK03] made a similar point in 2003, and argued that it was still difficult to conduct experiments with representative, validated synthetic traffic. While traffic measurement and Internet measurement in general have become increasingly popular in recent years, most studies are exploratory and provide little foundation to build traffic generators.

This chapter provides an overview of the major works in the field of Internet traffic generation, considering first packet-level traffic generation and then source-level traffic generation. Other aspects of traffic generation, such as load scaling, incorporating network dependencies and implementation issues, are discussed at the end of the chapter.

2.1 Packet-Level Traffic Generation

In this dissertation we restrict the question of generating realistic traffic to a single link. This is the most essential form of the traffic generation problem. It does not seem possible to tackle the problem of generating traffic for multiple links, say the backbone of an ISP, if single-link traffic generation is not fully understood. The simplest way of generating realistic traffic on a single link is to inject packets into the network according to the characteristics of the packets observed traversing a real link. We will use the term packet-level traffic generation to refer to this approach. Packet-level traffic generation can mean either performing a packet-level replay, i.e., reproducing the exact arrivals and sizes of every observed packet, or injecting packets in such a manner as to preserve some set of statistical properties considered fundamental, or relevant for a specific experiment.
Packet-level replay, which has been implemented in tools like tcpreplay [tcpb], is a straightforward technique that is useful for certain types of experiments where the configuration of the network is not expected to affect the generated traffic. In other words, whenever it is reasonable to generate traffic that is invariant to (i.e., unresponsive to) the experimental conditions, packet-level replay is an effective means for generating synthetic traffic. For example, packet-level replays of traces collected from the Internet have been used to evaluate cache replacement policies in routing tables [Jai90, Fel88, GcC02]. In this type of experiment, different cache replacement policies are compared by feeding the lookup cache of a routing engine with a packet trace and computing the achieved hit ratio. Also, studies that require malicious traffic generation can often make use of packet-level replay [SYB04, RDFS04]. Malicious traffic (e.g., a SYN flood) is frequently not responsive to network conditions (and their degradation).

Before conducting an experiment in which traffic is generated using packet-level replay, researchers must obtain one or more traces of the arrivals of packets to a network link. These traces are collected using a packet "sniffer" to monitor the traffic traversing some given link. This packet capturing can be performed with or without hardware support. The most prominent example of software-only capture is the Berkeley Packet Filter (BPF) system [MJ93, tcpa]. BPF includes a packet capturing library, libpcap, and a command-line interface and trace analysis tool, tcpdump. BPF relies on the promiscuous mode of network interfaces to observe packets traversing a network link and to create a trace of them in the "pcap" format. Due to privacy and size considerations, most traces only include the protocol headers (IP and TCP/UDP) of each packet and a timestamp of the packet's arrival.
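For illustration, the following sketch reads the records of a trace stored in the classic pcap file layout: a 24-byte global header followed, for each packet, by a 16-byte record header and the captured bytes. It is a simplified reader under stated assumptions (microsecond-resolution timestamps, no support for the newer pcapng format), and the function names are our own; it is not the libpcap API. The demo trace is synthetic and built in memory.

```python
import struct
from typing import Iterator, Tuple

GLOBAL_HEADER_LEN = 24  # magic, version, time zone, sigfigs, snaplen, link type
RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, captured length, original length


def read_pcap_records(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (timestamp in microseconds, original length, captured bytes)."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written by a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written by a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = GLOBAL_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += RECORD_HEADER_LEN
        # incl_len can be smaller than orig_len when only headers were kept.
        yield ts_sec * 1_000_000 + ts_usec, orig_len, data[offset:offset + incl_len]
        offset += incl_len


def build_demo_pcap() -> bytes:
    """Build a one-packet synthetic trace (made-up values) for the demo."""
    header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 96, 1)
    record = struct.pack("<IIII", 1, 500_000, 4, 60) + b"\x45\x00\x00\x3c"
    return header + record


records = list(read_pcap_records(build_demo_pcap()))
for timestamp_us, orig_len, captured in records:
    print(timestamp_us, orig_len, len(captured))
```

The gap between the captured length and the original length in each record is how a header-only trace preserves the true packet size while discarding the payload.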
Monitoring high-speed links with a software-only system is problematic, given that traffic has to be forwarded from the network interface to the monitoring software using the system bus. The system bus may not be fast enough for this task depending on the load on the monitored link. High loads can result in "dropped" packets that are absent from the collected trace. Furthermore, the extra forwarding from the wire to the monitoring program, which usually involves buffering in the network interface and in operating system layers, makes timestamps rather inaccurate. In the case of BPF, timestamping inaccuracies of a few hundred microseconds are quite common. In order to overcome these difficulties, researchers often make use of specialized hardware that can extract headers and provide timestamps without the intervention of the operating system. This is of course far more expensive, but it dramatically improves timestamp accuracy and increases the volume of traffic that can be collected without drops. The DAG platform [Pro, GMP97, MDG01] is a good example of this approach, and it is widely used in network measurement projects. The timestamping accuracy of DAG traces is on the order of nanoseconds. Multiple DAG cards, possibly at different locations, can also be synchronized using an external clock signal, such as the one from the Global Positioning System (GPS). Besides collecting their own traces, researchers can also make use of public repositories of pcap and DAG traces, such as the Internet Traffic Archive [Int] and the PMA project at NLANR [nlab].

While packet-level replay is conceptually simple, it involves a number of engineering challenges. First, traffic generators usually rely on operating system layers and abstractions, such as raw sockets, to perform the packet-level replay. Most operating systems provide no guarantee on the exact delay between the time of packet injection by the traffic generator and the time at which the packet leaves the network interface.
Servicing interrupts, scheduling processes, etc., can introduce arbitrary delays, which make the arrival process of the packet replay differ from the original and intended arrival process. This inaccuracy may or may not be significant for a given experiment. Another challenge is the replay of traces collected on high-speed links. The rate of packet arrivals in a trace can be far higher than the rate at which a single host can generate packets. For example, the speed at which a commodity PC can inject packets into the network is primarily limited by the speed of its bus and the bandwidth of its network interface. As a consequence, replaying a high-rate trace often requires an experimenter to partition the trace into subtraces that have to be replayed using a collection of hosts. In this case, it is important to carefully synchronize the replay of these hosts. This is generally a difficult task, since the synchronization has to be done using the network itself, which introduces variable I/O delays. Clock drift is also a concern with common PC clocks.

Ye et al. [YVIB05] discussed packet-level replay of high-rate traces, focusing on OC-48, and how to evaluate the accuracy of the replay. They proposed flow-based splitting to construct a partition of the original trace that can be accurately replayed by an ensemble of traffic generators. This addresses the challenge of replaying a trace using multiple traffic generators without reordering the packets within a flow. In contrast, round-robin assignment of packets to traffic generators, called choice of N in their work, results in packets belonging to the same flow being generated by different traffic generators. As a consequence, the generated traffic exhibits substantial packet reordering. This reordering is due to the difficulty of keeping the generators perfectly synchronized with commodity hardware, so one generator can easily get ahead of another and modify the order of packets within a flow. Ye et al.
also discussed the difficulties created by buffering on the network cards, which modifies the properties of the packet arrival process at fine scales. An alternative to the approach in Ye et al. is to rely on specialized hardware. Most DAG cards support packet-level replay, bypassing the network stack. However, no information is available on how accurately the generated traffic preserves the properties of the original packet arrival process.

Packet-level replay has two important shortcomings: it is inflexible and it is open-loop. Given that a packet-level replay is the exact reproduction of a collected trace, both in terms of packet arrival times and packet content, there is no way to introduce variability in the experiments other than acquiring a collection of traces and using a different trace in different runs of the experiments. This makes packet replay inflexible, since the researcher has to limit his experiments to the available traces and their characteristics. The "right" traces may not be available or may be difficult to collect. Even conducting experiments that study simple questions can be cumbersome. For example, a researcher who intends to test a cache replacement policy under heavy loads must find traces with high packet arrival rates, which may or may not be available. Similarly, evaluating a queuing mechanism under a range of (open-loop) loads requires one to find traces covering this range of loads, and may involve mixing traces from different locations, which could cast doubt on the realism of the resulting traffic and thus on the conclusions of the evaluation.

More flexible traffic generation can be achieved by generating packets according to a set of statistical properties derived from real measurements. The challenge then is to determine which properties of traffic are most important to reproduce so that the synthetically generated traffic makes the experiments "realistic enough."
For example, Internet traffic has been found to be very bursty, showing very frequent changes in throughput (both for packets and bytes per unit of time). Therefore, most experiments should make use of synthetic traffic that preserves this observed burstiness. Leland et al. [LTWW93] observed that this burstiness can be studied using the framework provided by statistical self-similarity. At a high level, self-similarity means that traffic is equally bursty, i.e., equal variance in arrival times, across a wide range of time scales. This is similar to the geometric self-similarity that fractals exhibit. Mathematically, statistical self-similarity manifests itself as long-range dependence, a sub-exponential decay of the autocorrelation of a time-series with scale. This is in sharp contrast to Poisson modeling and its short-range dependence, which implies an exponential decay of the autocorrelation with scale. Therefore, it is generally difficult to accept experimental results where synthetic traffic does not exhibit some degree of self-similarity. Accordingly, some experiments may simply rely on some method for generating a self-similar process [Pax97] and inject packets into the experiments according to this process. Studies on queuing dynamics, e.g., [ENW96], made use of this traffic generation approach.

Other experiments with a more stringent need for realism may also attempt to reproduce other known properties of traffic. For example, a realistic distribution of IP addresses is essential for experiments in which route caching performance is evaluated. To accomplish this, packet-level traffic generation can be combined with a statistical model of packet arrival and a model of address structure. As one example, Aida and Abe [AA01] proposed a generative model based on the finding that the popularity of addresses follows a power law (a heavy-tailed distribution with a hyperbolic shape). In contrast, Kohler et al.
[KLPS02] focused on the hierarchical structure of addresses and prefixes, which is shown to be well-described by a multi-fractal model. Both studies could be used to enrich packet-level traffic generation.

2.2 Source-Level Traffic Generation

While packet-level traffic generation based on a set of statistical properties is convenient for the experimenter, and attractive from a mathematical point of view, it fails to preserve an essential property of Internet traffic. As Paxson and Floyd [PF95] point out, packet-level traffic generation is open-loop, in the sense that it does not preserve the feedback loop that exists between the sources of the traffic (the endpoints) and the network. This feedback loop comes from the fact that endpoints react to network conditions, and this reaction itself can change these conditions, and therefore trigger further changes in the behavior of the endpoints. For example, TCP traffic reacts to congestion by lowering its sending rate, which in turn decreases congestion. A trace of packet arrivals collected at some given link is therefore specific to the characteristics of this link, the time of the tracing, the paths of the connections that traversed it, etc. Therefore, any changes that the experimenter makes to the experimental conditions make the packet-level traffic invalid, since the traffic generation process is insensitive to these changes (unlike real Internet traffic). For example, packet-level replay of TCP traffic does not react to congestion in any manner.

The solution is to model the sources of traffic, i.e., to model the network behavior of the applications running on the endpoints that communicate using network flows. Source-level models are then used to drive network stacks which do implement flow and congestion control mechanisms, and therefore react to changes in network conditions as real Internet endpoints do. As a result, the generated traffic is closed-loop, which is far more realistic for a wide range of experiments.

The simplest source-level model is the infinite source model.
The starting point of the infinite source model is the availability of an infinite amount of data to be communicated from one endpoint to another. Generating traffic according to this model means that a traffic generator opens one or more transport connections, and constantly provides them with data to be transferred. This means that, for each connection, one of the endpoints is constantly writing (sending data packets) while the other endpoint is constantly reading (receiving data packets). The sources are never the bottleneck in this model. The only process that limits the rate at which the endpoints transmit data is the network, broadly defined to include any mechanism below the sources, such as TCP's maximum receiver window.

The infinite source model is very attractive for several reasons, which make it rather popular in both theoretical and experimental studies [FJ93b, KHR02, AKM04, SBDR05]. First, the infinite source model has no parameters and hence it is easy to understand and amenable to formal analysis. It was, for example, the foundation for the work on the mathematical analysis of steady-state TCP throughput [PFTK98, BHCKS04]. Second, its underlying assumption is that the largest flows on the network, which account for the majority of the packets and the bytes, "look like" infinite sources. For example, an infinite source provides a convenient approximation to a multi-gigabyte file download using FTP. Third, infinite sources are well-behaved, in the sense that, if driving TCP connections, they try to consume as much bandwidth as possible. They also result in the ideal case for bandwidth sharing. This makes them useful for experiments in the area of congestion control, since infinite sources can easily congest network links.

Despite their convenience, infinite sources are unrealistic and do not provide a solid foundation for networking experiments, or even for understanding the behavior and performance of the Internet. The pioneering work by Cáceres et al.
[CDJM91], published as early as 1991, provided a first insight into the substantial difference between infinite sources and real application traffic. These authors examined packet header traces from three sites (the University of California at Berkeley, the University of Southern California, and Bellcore in New Jersey) using the concept of application-level conversations. An application-level conversation was defined as the set of packets exchanged between two network endpoints. These conversations could include one or more "associations" (TCP connections and UDP streams). A general problem when studying traffic for extended periods is the need to separate traffic into independent units of activity, which in this case correspond to conversations. Endpoints may exchange traffic regularly, say every day, but that does not mean that they are engaged in the same conversation for days. Danzig et al. separated conversations between the same endpoints by identifying long periods without any traffic exchange, which are generally referred to as idle times or quiet times in the literature. In their study, they used a threshold of 20 minutes to differentiate between two conversations. The authors examined conversations from 13 different applications, characterizing them with the help of empirical cumulative distribution functions (empirical CDFs). The results include empirical CDFs for the number of bytes in each conversation, the directionality of the flow of data (i.e., whether the two endpoints sent a similar amount of data), the distribution of packet sizes, the popularity of different networks, etc. Danzig and Jamin [DJ91] used these distributions in their traffic generation tool, tcplib. The results from this work are further discussed in Section 2.2.2.

Cáceres et al. pointed out a number of substantial differences between their results and the assumptions of earlier works. First, the majority of connections carried very small amounts of data, less than 10 KB in 75-90% of the cases.
This is true for both interactive applications (e.g., telnet and rlogin) and bulk transfer applications (e.g., FTP, SMTP). This is in sharp contrast to the infinite availability of data to be transferred assumed in the infinite source model. The dynamics of such short data transfers are completely different from those of infinite sources, which for example have time to fully employ congestion control mechanisms. The second difference was that traffic from most applications was shown to be strongly bidirectional, and it included at least one request/response phase, i.e., an alternation in the role of the endpoints as senders of data. The infinite source model is inherently unidirectional, with one of the endpoints always acting as the sender, and the other endpoint always acting as the receiver. Third, the authors observed a wide range of packet sizes, and a large fraction of the data packets were small, even for bulk applications. Data packets from an infinite source are necessarily full size, since there is by definition enough data to completely fill new packets.

These measurement results highlighted a substantial difference between infinite sources and real traffic, and later experimental studies demonstrated the perils of using traffic from infinite sources in the evaluation of network mechanisms. Joo et al. [JRF+99, JRF+01] demonstrated that infinite TCP sources tend to become synchronized, so they increase or decrease their transmission rate at the same time. This pattern is completely absent from more realistic experiments in which the majority of the sources have small and diverse amounts of data to send. As a result, loss patterns, queue lengths and other characteristics are strikingly different when more realistic synthetic traffic is used. Joo et al. also studied the difference between open-loop and closed-loop traffic generation.

The area of active queue management (AQM) has provided several illustrations of the misleading results obtained with the unrealistic infinite sources.
The first AQM scheme, RED, was presented by Floyd and Jacobson in [FJ93b], and evaluated using infinite sources. Their results showed that RED significantly outperformed FIFO, the usual router queuing mechanism. Later work by Christiansen et al. [CJOS00] demonstrated that RED offers very little benefit, if any, when exposed to more realistic traffic where sources are not infinite. In particular, they used a model of web-like traffic, which is discussed later in this chapter.

Paxson's analysis [Pax94] of packet header traces from seven different network links provided further support for the conclusions of Cáceres et al. In addition, Paxson considered the parsimonious modeling of traffic from different applications. He characterized four prominent applications, telnet, NNTP, SMTP and FTP, using analytic models to fit the empirical distributions. Analytic models are more commonly known as parametric models in the statistical literature, and correspond to classical distributions, such as the Pareto distribution, that can be fully characterized with a mathematical expression and only one or a few parameters. As Paxson pointed out, the use of analytic models results in a concise description of network applications that can be easily communicated and compared, and is often mathematically tractable. His methodology has had a lasting influence in application-level modeling. He clearly demonstrated that analytic fits (i.e., parametric models) of the observed distributions can closely approximate the characteristics of real applications. However, it is important to remember that traffic is not necessarily more realistic when generated by analytic models as opposed to empirical models. Empirical CDFs, derived from network measurement of sufficient size, provide a perfectly valid foundation for traffic generators. Furthermore, finding analytic fits of complex random variables that do not match well-known statistical distributions is a daunting task.
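The distinction between empirical and analytic models can be made concrete with a short sketch. An empirical model draws values directly from the measured sample, while an analytic model draws from a closed-form distribution, here a Pareto distribution sampled by inverting its CDF. The measured sizes, the Pareto parameters, and all names below are made up for illustration; they are not taken from any of the cited studies.

```python
import bisect
import random
from typing import List, Sequence


class EmpiricalCDF:
    """Step-function CDF built directly from a set of measurements."""

    def __init__(self, samples: Sequence[float]) -> None:
        if not samples:
            raise ValueError("need at least one measurement")
        self.values: List[float] = sorted(samples)

    def cdf(self, x: float) -> float:
        """Fraction of the measurements that are less than or equal to x."""
        return bisect.bisect_right(self.values, x) / len(self.values)

    def sample(self, rng: random.Random) -> float:
        """Inverse-transform sampling: map a uniform draw to a measured value."""
        u = rng.random()  # uniform in [0, 1)
        return self.values[int(u * len(self.values))]


def pareto_sample(rng: random.Random, scale: float, shape: float) -> float:
    """Inverse-transform sample from F(x) = 1 - (scale / x) ** shape, x >= scale."""
    u = rng.random()
    return scale * (1.0 - u) ** (-1.0 / shape)


# Hypothetical object sizes in bytes, standing in for a real measurement.
ecdf = EmpiricalCDF([200, 1000, 1000, 5000])
rng = random.Random(1)
print("P(size <= 1000) =", ecdf.cdf(1000))
print("empirical draw:", ecdf.sample(rng))
print("analytic draw:", pareto_sample(rng, scale=1000.0, shape=1.2))
```

The empirical generator can only return values that were actually observed, so its quality rests on the size of the measurement. The analytic generator needs just two parameters but presumes that the Pareto form fits the data in the first place.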
2.2.1 Web Traffic Modeling

Modeling web traffic has received substantial attention since the sudden emergence of the World Wide Web in the mid-nineties. Arlitt and Williamson [AW95] proposed an early model for generating web traffic¹, based on packet header traces collected at the University of Saskatchewan. The model was centered around the concept of a conversation, as proposed by Cáceres et al. [CDJM91]. In this case, a conversation was the set of connections observed between a web browser and a web server. These authors were the first to consider questions such as the distribution of the number of bytes in requests and responses, the arrival rates of connections, etc. In general, the proposed model has parameters that are quite different from those of later works. For example, an Erlang model of response sizes was used, which is in sharp contrast to the heavy-tailedness observed by other authors. While Arlitt and Williamson did not provide any details on the statistical methods they employed, it is likely that the small sample size (fewer than 10,000 TCP connections) made it difficult to develop a more statistically representative model.

¹ To be more specific, Arlitt and Williamson proposed a model of "Mosaic" traffic. Mosaic was the first widely used web browser.

One of the major efforts in the area of web traffic modeling oriented toward traffic generation took place at Boston University. Cunha et al. [CBC95] examined client traces collected by instrumenting browsers at the Department of Computer Science. Unlike the packet header traces used in Arlitt and Williamson, client traces include application information such as the exact URL of each web object requested and downloaded in each TCP connection. The authors made use of this information to study page and server popularity, which are relevant for web caching studies. In addition, the authors proposed the use of power laws for constructing a parametric model of web t