Leveraging “Big” Data Analytics for Network Performance Monitoring & Trouble-Shooting


Zhi-Li Zhang, the University of Minnesota


2019-01-07 14:00:00 ~ 2019-01-07 15:30:00


Room 1-402, SEIEE Building


Bo Jiang

As we become increasingly reliant on a variety of large-scale Internet services for our daily activities, providing as good a quality of experience (QoE) as possible to users becomes imperative. For example, even a small increase in response time hurts user experience and impacts the monetization ability of service providers. It is thus extremely important for service providers to understand the key factors that impact performance and to quickly detect and diagnose any performance degradation. However, this is an extremely challenging task, as cloud computing and large-scale Internet services such as search engines and online video streaming services have necessitated a complex architecture of centralized data centers and distributed edge servers dispersed across a web of interconnected access and backbone networks to provide speedy response times to users. A gamut of diverse, interacting factors can affect users' QoE, spanning servers in data centers, CDN edge servers, the various networks on the paths, client machines and user agents (e.g., web browsers), and user behavior.

Clearly, it is important to collect data from various sources, including system configurations, performance metrics and dynamic network usage, and to leverage "big" data analytics to tackle this challenge. In this talk, we argue that it is important to develop a systematic framework to guide this process in order to cope with the vast complexity of network performance monitoring and trouble-shooting, where domain knowledge plays a crucial role. In particular, we will describe a framework based on statistical inference and machine learning techniques to first identify and quantify the major categories of factors that most strongly influence system performance. We will use two real-world case studies to illustrate the utility of this framework: i) understanding the complexity of 3G UMTS cellular network performance; and ii) dissecting the search response time variations of a large web search engine, together with a comparative study of the impact of architectural design choices on the performance of two large web search engine services.

In the first case study, using a large number of datasets collected from various sources over time across a large US 3G UMTS network carrier, we briefly discuss how to identify the key factors that influence network performance in terms of round-trip times and loss rates (averaged over an hourly time scale). We apply RuleFit, a powerful supervised machine learning tool that combines linear regression and decision trees, to develop models and analyze the relative importance of various factors in estimating and predicting network performance. Our analysis culminates with the detection and diagnosis of both “transient” and “persistent” performance anomalies, with discussion of the complex interactions and differing effects of the various factors that may influence 3G UMTS network performance.

In the second case study, we provide an analysis of web search response times from a service provider's perspective. Using measurement and instrumentation data from Microsoft Bing, we show that search response time (SRT) varies widely over time and also exhibits counter-intuitive behavior: it is actually higher during off-peak hours, when the query load is lower, than during peak hours. To resolve this paradox and explain SRT variations in general, we develop an analytics framework that separates systemic variations due to periodic changes in service usage from anomalous variations due to unanticipated events such as failures and denial-of-service attacks. We find that systemic SRT variations are primarily caused by systemic changes in aggregate network characteristics, the nature of user queries, and browser types. For instance, one reason for higher SRTs during off-peak hours is that during those hours a greater fraction of queries comes from slower, mainly residential networks. We also develop a technique that, by factoring out the impact of such variations, robustly detects and diagnoses performance anomalies in SRT. Deployment experience shows that our technique detects three times more true (operator-verified) anomalies than existing techniques.
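
To make the off-peak paradox concrete, here is a minimal sketch of how a shift in traffic mix alone can raise the aggregate SRT, and how factoring that shift out changes what counts as an anomaly. This is not the technique presented in the talk: the network categories, baseline values, query counts, tolerance, and function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical per-network-type baseline SRTs in milliseconds (illustrative only).
BASELINE_MS = {"enterprise": 300.0, "residential": 600.0}


def expected_srt(query_counts):
    """Return the mix-weighted average SRT for one time window.

    `query_counts` maps a network type to the number of queries seen
    from that type of network during the window.
    """
    total = sum(query_counts.values())
    weighted = sum(BASELINE_MS[net] * n for net, n in query_counts.items())
    return weighted / total


def is_anomalous(observed_ms, query_counts, tolerance_ms=50.0):
    """Flag a window only if its SRT deviates from what the mix explains."""
    return abs(observed_ms - expected_srt(query_counts)) > tolerance_ms


# Peak hours: mostly fast networks. Off-peak: mostly slower residential ones.
peak_counts = {"enterprise": 70, "residential": 30}
off_peak_counts = {"enterprise": 20, "residential": 80}

print(expected_srt(peak_counts))      # systemic SRT expected at peak
print(expected_srt(off_peak_counts))  # higher, although per-type SRTs are unchanged

# The same observed SRT is explained by the mix off-peak but not at peak.
print(is_anomalous(545.0, off_peak_counts))
print(is_anomalous(545.0, peak_counts))
```

In this toy setting the per-network baselines never change, yet the aggregate SRT is higher off-peak purely because of the mix. A fixed global threshold would either miss the peak-hour deviation or raise false alarms off-peak, which is the motivation for factoring out systemic variation before flagging anomalies.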

If time permits, we will also examine the architectural factors in the performance of cloud-based online services by investigating the roles of front-end (edge) servers in improving user-perceived performance. Using Bing and Google search services as two examples, we perform extensive network measurement and analysis to understand several key factors that affect the overall user-perceived performance. In particular, we develop a simple model-based inference framework to indirectly measure and quantify the (directly unobservable) “frontend-to-backend fetching time”, which comprises the query processing time at back-end data centers and the delivery time between the back-end data centers and front-end servers. We show that this fetching time plays a critical role in the end-to-end performance of dynamic content delivery.
Zhi-Li Zhang received his Ph.D. degree in computer science from the University of Massachusetts. He joined the faculty of the Department of Computer Science and Engineering at the University of Minnesota in 1997, where he is currently the McKnight Distinguished University Professor and Qwest Chair Professor in Telecommunications. He also serves as the Associate Director for Research at the Digital Technology Center, University of Minnesota.

Prof. Zhang's research interests lie broadly in computer and communication networks, Internet technology, multimedia systems and content distribution networks, cyber-physical systems and Internet-of-Things, and (applied) machine learning and data mining. Prof. Zhang has published more than 100 journal and conference/workshop papers, many of them in top venues in networking and related fields. He is a co-recipient of several Best Paper awards, including at IEEE INFOCOM, ICNP and ACM SIGMETRICS. Prof. Zhang has chaired the program committees of several major conferences in networking, including IEEE INFOCOM, ACM SIGMETRICS, IEEE ICNP and the ACM Internet Measurement Conference (IMC), and has served on the editorial boards of several journals such as IEEE/ACM Transactions on Networking, ACM TOMPECS, and PACM MACS. He is a Fellow of IEEE.
© John Hopcroft Center for Computer Science, Shanghai Jiao Tong University