PDSW-DISCS 2019:

4th International Parallel Data Systems Workshop

Held in conjunction with SC19

Monday, November 18, 2019
Denver, CO

Program Co-Chairs:

Lawrence Berkeley National Laboratory

Argonne National Laboratory

General Chair:

New York University,
Courant Institute of Mathematical Sciences Center for Data Science

Haoyuan Li, Alluxio

Alluxio - Data Orchestration for Analytics and AI in the Cloud


Abstract: The data ecosystem has evolved considerably over the past two decades. There has been an explosion of data-driven frameworks, including Presto, Hive, Spark, and MapReduce for running data analytics and ETL queries, as well as TensorFlow and PyTorch for training and serving models. On the data side, the approach to managing and storing data has evolved from HDFS to cheaper, more scalable, and separated services typified by cloud object stores like AWS S3. Data engineering has become increasingly complex, inefficient, and hard, particularly in hybrid and cloud environments.

Alluxio open source software aims to address these challenges. Alluxio, born in the UC Berkeley AMPLab, is a data orchestration system that provides a unified data access and caching layer for single-cloud, hybrid, and multi-cloud deployments. Alluxio enables distributed compute engines like Presto, Hive, or TensorFlow to transparently access data from various storage systems (including S3, HDFS, Azure, etc.) while actively leveraging an in-memory cache to accelerate data access. The Alluxio community has 1000+ open source contributors, and the software is used by 100+ companies worldwide, with the largest production deployment exceeding 1000 nodes.

In this talk, we will present:

  • New trends and challenges in the data ecosystem in the cloud era
  • Key innovations of the Alluxio project
  • Production use cases built on popular stacks like {Presto, Spark, Flink, TensorFlow}/Alluxio/{S3, HDFS}


Bio: Haoyuan (H.Y.) Li is the Founder, Chairman, and CTO of Alluxio. He holds a PhD in computer science from UC Berkeley’s AMPLab, where he co-created the Alluxio (formerly Tachyon) open source data orchestration system, co-created Apache Spark Streaming, and became an Apache Spark founding committer. He also holds an MS from Cornell University and a BS from Peking University, both in computer science.