Posted by Admin: System Admin
Cloud systems suffer from distributed concurrency bugs, which often lead to data loss and service outages. This paper presents CLOUDRAID, a new automatic tool for finding distributed concurrency bugs efficiently and effectively. Distributed concurrency bugs are notoriously difficult to find because they are triggered by untimely interactions among nodes, i.e., unexpected message orderings. To detect concurrency bugs in cloud systems efficiently and effectively, CLOUDRAID automatically analyzes and tests only the message orderings that are likely to expose errors. Specifically, CLOUDRAID mines the logs from previous executions to uncover the message orderings that are feasible but inadequately tested. In addition, we propose a log enhancing technique that automatically introduces new logs into the system being tested. These extra logs further improve the effectiveness of CLOUDRAID without introducing any noticeable performance overhead. Because our approach is log-based, it is well suited to live systems. We have applied CLOUDRAID to six representative distributed systems: Hadoop2/Yarn, HBase, HDFS, Cassandra, Zookeeper, and Flink. CLOUDRAID has succeeded in testing 60 different versions of these six systems (10 versions per system) in 35 hours, uncovering 31 concurrency bugs, including nine new bugs that have never been reported before. Of these nine new bugs, all of which have been confirmed by their original developers, three are critical and have already been fixed.
Machine learning is an important component of the growing field of data science. Through the use of statistical methods, different types of algorithms are trained to make classifications or predictions and to uncover key insights in this project. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics.
Machine learning algorithms build a model based on this project's data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used on a wide variety of datasets, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Liu et al. [33] have recently extended race detection techniques for multi-threaded programs [34], [35], [36], [37], [38], [39] to detect race conditions in distributed systems. Their approach instruments memory accesses and communication events in a system to collect runtime traces. An offline analysis then examines the happens-before relation among the memory accesses, using a happens-before model customized to distributed systems. Concurrent memory accesses that may trigger exceptions are regarded as harmful data races. A trigger is employed to further verify the detected race conditions. The approach in [40] mines logs to recover runtime traces without instrumentation, but restricts itself to message orderings involving only two messages. In this paper, we have improved the effectiveness of this earlier approach with two significant extensions. First, we introduce a new log enhancement technique, which allows us to detect bugs that would otherwise be missed. Second, we are now capable of detecting bugs that manifest themselves in message orderings involving an arbitrary number of messages. With these two extensions, we have provided experimental evidence that our framework can find more bugs in new applications.
Fault injection techniques [41], [42], [43], [44], [45], [46], [47], [48], [49] are commonly used to test the resilience of distributed systems. However, they focus on how to inject faults at different system states to expose bugs in the fault handlers. CLOUDRAID can be applied together with these techniques to detect fault-related concurrency bugs more effectively.
Xu et al. [17] mine console logs from a system and apply machine learning techniques to detect anomalous executions. Mined information such as logged values and logging frequencies is visualized to help users diagnose anomalous behaviors. DISTALYZER [59] compares logs from abnormal and normal executions to infer the strongest association between system components and performance.
Iprof [18] extracts request IDs and timing information from logs to profile request latency. Stitch [60] organizes log instances into tasks and sub-tasks by analyzing relations among the logged ID variables, in order to profile different components in the entire distributed software stack. In contrast, CLOUDRAID mines logs to uncover insufficiently exercised message orderings and thereby detect concurrency bugs effectively. CRASHTUNER [61] applies a similar log analysis to infer system meta-info, e.g., the running nodes and the tasks/resources associated with each node. This tool uses the meta-info to detect crash-recovery bugs, which are triggered by crashing a node while its associated meta-info is being accessed. In contrast, CLOUDRAID applies log analysis to uncover the orderings between communication events for the purpose of detecting distributed concurrency bugs.
Disadvantages
- The existing methodology does not implement a novel strategy for detecting distributed concurrency bugs.
- The existing system does not leverage the run-time logs of live systems to avoid unnecessary repetitive tests.
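To make the idea of mining logs for insufficiently exercised message orderings more concrete, the following is a minimal sketch, not CLOUDRAID's actual implementation. It assumes a hypothetical, already-parsed log format in which each run is a list of message-type names in arrival order, and the function names `observed_orderings` and `untested_orderings` are invented for illustration. It only finds pairs seen in one order but never in the reverse; deciding whether the reverse order is actually feasible is not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def observed_orderings(runs):
    """Count how often message type `a` arrived before `b` across runs.

    `runs` is a list of runs; each run is a list of message-type names in
    arrival order (a hypothetical, already-parsed log format).
    """
    counts = defaultdict(int)
    for run in runs:
        # Record the first position at which each message type appears.
        first_seen = {}
        for position, msg in enumerate(run):
            first_seen.setdefault(msg, position)
        for a, b in combinations(sorted(first_seen), 2):
            if first_seen[a] < first_seen[b]:
                counts[(a, b)] += 1
            else:
                counts[(b, a)] += 1
    return dict(counts)


def untested_orderings(runs):
    """Return orderings (b, a) never observed although (a, b) was."""
    counts = observed_orderings(runs)
    return sorted((b, a) for (a, b) in counts if (b, a) not in counts)


# Two hypothetical runs: "allocate" and "launch" were seen in both orders,
# so that pair is not reported as a candidate.
runs = [
    ["register", "allocate", "launch", "finish"],
    ["register", "launch", "allocate", "finish"],
]
candidates = untested_orderings(runs)
```

Each pair in `candidates` is an ordering that the recorded runs never exercised, which is the kind of ordering a tester would prioritize once its feasibility has been established by other means.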
- We propose a new approach, CLOUDRAID, for detecting concurrency bugs in distributed systems efficiently and effectively. CLOUDRAID leverages the run-time logs of live systems and avoids unnecessary repetitive tests, thereby drastically improving the efficiency and effectiveness of our approach.
- We describe a new log enhancing technique for improving log quality automatically. This enables us to log key communication events in a system automatically without introducing any noticeable performance penalty. The enhanced logs further improve the overall effectiveness of our approach.
- We have extensively evaluated CLOUDRAID using six representative distributed systems: Hadoop2/Yarn, HBase, HDFS, Cassandra, Zookeeper, and Flink. CLOUDRAID can test 60 different versions of these six systems (with six workloads in total) in 35 hours and successfully detect 31 concurrency bugs. Among them are nine new bugs, including three critical ones, which have been fixed by their original developers.
Advantages
- The proposed approach focuses on detecting bugs caused by order violations, i.e., bugs that manifest themselves whenever a message arrives in the wrong order with respect to another event. The majority of these bugs can be exposed by reordering a pair of messages, as suggested previously.
- However, relatively few but critical bugs still occur when more than two messages are involved. These bugs can only be exposed under special timing conditions involving, for example, some specific messages or events (e.g., node crashes or reboots). To detect such errors, we have empowered our approach with the capability of reordering an arbitrary number of messages for an application.
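The idea of reordering an arbitrary number of messages, rather than only a pair, can be illustrated with a small sketch. This is an assumption-laden illustration and not the algorithm used by CLOUDRAID: the function name `feasible_orderings`, the message names, and the `must_precede` constraint set are all hypothetical, and the brute-force enumeration grows factorially, so it is only suitable for a handful of messages.

```python
from itertools import permutations


def feasible_orderings(messages, must_precede):
    """Yield every ordering of `messages` consistent with `must_precede`.

    `must_precede` is a set of (earlier, later) pairs standing in for
    causal constraints; how such constraints are derived is not modeled.
    """
    for order in permutations(messages):
        position = {msg: i for i, msg in enumerate(order)}
        # Keep the ordering only if no causal constraint is violated.
        if all(position[a] < position[b] for a, b in must_precede):
            yield order


# Three hypothetical messages; m1 must always arrive before m3.
messages = ["m1", "m2", "m3"]
must_precede = {("m1", "m3")}
orderings = list(feasible_orderings(messages, must_precede))
```

Each tuple in `orderings` is one candidate arrival order that a test harness could try to enforce. A practical tool would prune this space rather than enumerate it exhaustively.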