Fan Zhang

Title: Implementation of Real Time Processing Pipelines for Big Data Analytic Applications
Abstract

"Big data" and the "data deluge" have emerged as major challenges for scientific computing. Healthcare applications, such as body area networks, require deploying hundreds of interconnected sensors to monitor the health status of a host. As another example, the Laser Interferometer Gravitational-wave Observatory (LIGO) sites collect more than a terabyte of data daily from thousands of distributed sensors for real-time processing. Follow-up data analysis normally involves moving the collected data to a cloud data center. An efficient cloud platform with highly elastic scaling capacity is therefore needed to support this kind of real-time streaming data application. In this talk, I will present a series of ongoing big-data projects I have been involved in. To start, I will talk about an analysis pipeline for near-real-time identification of transient, non-Gaussian noise artifacts (glitches) in our gravitational-wave detection project. In particular, I will show how multivariate classifiers, e.g. artificial neural networks, random forests, and support vector machines, are used to identify the glitches. After that, I will describe my experience of leveraging high-throughput computing tools, such as Condor and Hadoop MapReduce, to harness the computing capability of up to hundreds of cloud instances. Specifically, I will introduce a task-level adaptive MapReduce simulator for processing streaming big data. Finally, I will report how the workflow pipelines and output data are interactively presented and visualized in the projects.

Bio

Dr. Fan Zhang is currently a Senior Software Engineer with the IBM Massachusetts lab. He was a postdoctoral associate with the Kavli Institute for Astrophysics and Space Research at the Massachusetts Institute of Technology. He received his Ph.D. from the Department of Control Science and Engineering, Tsinghua University, in Jan. 2012. From 2011 to 2013 he was a research scientist at the Cloud Computing Laboratory, Carnegie Mellon University. An IEEE Senior Member, he received an Honorarium Research Funding Award from the University of Chicago and Argonne National Laboratory (2013), a Meritorious Service Award from IEEE Transactions on Services Computing (2013), and two IBM Ph.D. Fellowship Awards (2010 and 2011). His research interests include big-data scientific computing applications, simulation-based optimization approaches, cloud computing, and novel programming models for streaming data applications on elastic cloud platforms.