Introduction to distributed file systems, MapReduce and basic data processing using Spark

Instructors:

Christopher Paciorek

Chris Paciorek is an adjunct professor in the Department of Statistics, as well as the Statistical Computing Consultant in the Department's Statistical Computing Facility and a user support consultant for Berkeley Research Computing. He teaches and presents workshops on statistical computing topics, with a focus on R.


When & Where

Date:

Fri, November 7, 2014 - 4:00 PM to 5:15 PM

Location:

Evans 1011

Description

Type:

Workshop

The Statistics Department is offering a two-session workshop on distributed computing using Spark. Spark, developed at the Berkeley AMPLab, is an alternative to Hadoop that allows MapReduce-style calculations to be done in memory when possible, which speeds up computation.

This first session provides an introduction to distributed file systems, MapReduce, and basic data processing using Spark.

The instructor will be setting up an Amazon account with free credits that participants can use to start up their own virtual Linux cluster on which to try Spark.

If you want to get an account, please fill out this form: https://docs.google.com/a/berkeley.edu/forms/d/1HP8LUXtLHqedkrgmqMeQ7RSt...

Materials will be available at https://github.com/berkeley-scf/spark-workshop-2014 (under construction). No prior knowledge is assumed. Some familiarity with Python will be helpful as we'll run Spark via Python, but I think you'll get something out of it even if you're not familiar with Python syntax.

Keyword:

Python, Quantitative Analysis

Details

Training Host:

D-Lab

D-Lab Facilitator:

Jon Stiles

Participant Technology Requirement:

laptops
