REPT: Reverse Debugging of Failures in Deployed Software
@inproceedings{Cui2018REPTRD,
  title={REPT: Reverse Debugging of Failures in Deployed Software},
  author={Weidong Cui and Xinyang Ge and Baris Kasikci and Ben Niu and Upamanyu Sharma and Ruoyu Wang and Insu Yun},
  booktitle={OSDI},
  year={2018}
}
Debugging software failures in deployed systems is important because they impact real users and customers. [...] REPT tackles these challenges by constructing a partial execution order based on timestamps logged by hardware and iteratively performing forward and backward execution with error correction.
We design and implement REPT, deploy it on Microsoft Windows, and integrate it into WinDbg. We evaluate REPT on 16 real-world bugs and show that it can recover data values accurately (92% on…
46 Citations
Reverse Debugging of Kernel Failures in Deployed Systems
- Computer Science
- USENIX Annual Technical Conference
- 2020
Kernel REPT is the first practical reverse debugging solution for kernel failures that is highly efficient, imposes a small memory footprint, and requires no extra software layer; it can also proactively identify kernel bugs by checking the reconstructed execution history against a set of predetermined invariants.
Postmortem accurate IR-level state recovery for deployed concurrent programs
- Computer Science
- ACM SIGAPP Applied Computing Review
- 2021
STRAB (State Recovery at Abstract-level), a collection of proposed methods to solve debugging failures of deployed concurrent software, has significantly higher accuracy compared to REPT at IR-level with only minor slowdowns, while also achieving architecture-independence.
STRAB: state recovery using reverse execution at IR level for concurrent programs
- Computer Science
- SAC
- 2021
Experimental results on a variety of real-world concurrent programs show that STRAB has significantly higher accuracy compared to REPT at IR-level (+40% on average) with only minor slowdowns (x2.7 on average), while also achieving architecture-independence.
WATCHER: in-situ failure diagnosis
- Computer Science
- Proc. ACM Program. Lang.
- 2020
A novel diagnosis system that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern, is presented, and two optimizations to reduce the diagnosis time and to diagnose failures with control flow hijacks are proposed.
Automated Bug Hunting With Data-Driven Symbolic Root Cause Analysis
- Computer Science
- CCS
- 2021
This work proposes bug hunting using symbolically reconstructed states based on execution traces to achieve better detection and root cause analysis of overflow, use-after-free, double free, and format string bugs across user programs and their imported libraries.
Execution reconstruction: harnessing failure reoccurrences for failure reproduction
- Computer Science
- PLDI
- 2021
Execution Reconstruction is proposed, a technique that strikes a better balance between efficiency, effectiveness, and accuracy for reproducing production failures and that reproduces fully replayable executions that can power a variety of debugging and reliability use cases.
Ad hoc Test Generation Through Binary Rewriting
- Computer Science
- 2020 IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM)
- 2020
This work builds on record-replay and binary rewriting to automatically generate and run targeted tests for candidate patches significantly faster and more efficiently than traditional test suite generation techniques like symbolic execution.
POMP++: Facilitating Postmortem Program Diagnosis with Value-Set Analysis
- Computer Science
- IEEE Transactions on Software Engineering
- 2021
POMP++ can accurately and efficiently pinpoint program statements that truly contribute to the crashes, making failure diagnosis significantly more convenient and reducing the execution time by 60% compared with existing reverse execution.
RoBin: Facilitating the Reproduction of Configuration-Related Vulnerability
- Computer Science
- 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)
- 2021
RoBin, a binary similarity-based building configuration inference tool, is implemented to infer the specific building configurations from the binary in a crash report, which can help developers reproduce and diagnose the vulnerability and, finally, patch the programs.
Testing Configuration Changes in Context to Prevent Production Failures
- Computer Science
- OSDI
- 2020
The idea behind ctests is simple: connecting production system configurations to software tests so that configuration changes can be tested in the context of code affected by the changes. ctests effectively detect real-world failure-inducing configuration changes, diverse injected misconfigurations, and misconfigurations in the deployed files.
References
Cooperative Bug Isolation
- Computer Science
- 2007
A suite of new algorithms for statistical debugging: finding and fixing software errors based on statistical analysis of sparse feedback data is presented, from simple process of elimination strategies to regression techniques that build models of suspect program behaviors as failure predictors.
BugNet: continuously recording program execution for deterministic replay debugging
- Computer Science
- 32nd International Symposium on Computer Architecture (ISCA'05)
- 2005
The proposed BugNet architecture provides the ability to replay an application's execution across context switches and interrupts, which obviates the need for tracking program I/O, interrupts and DMA transfers, which would have otherwise required more complex hardware support.
Leveraging the short-term memory of hardware to diagnose production-run software failures
- Computer Science
- ASPLOS 2014
- 2014
This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations: first, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances; and second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution.
RETracer: Triaging Crashes by Reverse Execution from Partial Memory Dumps
- Computer Science
- 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)
- 2016
RETracer is presented, the first system to triage software crashes based on program semantics reconstructed from memory dumps, and it is found that RETracer eliminates two thirds of triage errors based on a manual analysis of 140 bugs fixed in Microsoft Windows and Office.
Production-run software failure diagnosis via hardware performance counters
- Computer Science
- ASPLOS '13
- 2013
PBI can effectively diagnose failures caused by sequential and concurrency bugs with a small overhead that is never higher than 10%.
Execution synthesis: a technique for automated software debugging
- Computer Science
- EuroSys '10
- 2010
ESD--a debugger based on execution synthesis--is evaluated on popular software and reproduces on its own several real concurrency and memory safety bugs in less than three minutes, thus incurring no runtime overhead and being practical for use in production systems.
Postmortem Program Analysis with Hardware-Enhanced Post-Crash Artifacts
- Computer Science
- USENIX Security Symposium
- 2017
It is shown that POMP can accurately and efficiently pinpoint program statements that truly pertain to the crashes, making failure diagnosis significantly more convenient.
PSE: explaining program failures via postmortem static analysis
- Computer Science
- SIGSOFT '04/FSE-12
- 2004
PSE (Postmortem Symbolic Evaluation), a static analysis algorithm that can be used by programmers to diagnose software failures, is described, which combines a novel dataflow analysis and memory alias analysis in a manner that allows for precise exploration of the program's behavior in polynomial time.
Instrumentation and sampling strategies for cooperative concurrency bug isolation
- Computer Science
- SPLASH 2010
- 2010
This work presents Cooperative Concurrency Bug Isolation (CCI), a low-overhead instrumentation framework to diagnose production-run failures caused by concurrency bugs, and offers a varied suite of predicates that represent different trade-offs between complexity and fault isolation capability.
Failure sketching: a technique for automated root cause diagnosis of in-production failures
- Computer Science
- SOSP
- 2015
Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace), is built and it is shown that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.