The focus of this workshop is machine learning using the H2O R and Python packages. H2O is an open-source, distributed machine learning platform designed for big data that is also easy to use on a laptop, not only on a multi-node Hadoop or Spark cluster.
The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully featured APIs are available in R, Python, Scala, and REST/JSON, as well as through a web interface. Because H2O's algorithm implementations are distributed, the software can scale to very large datasets that may not fit into RAM on a single machine.
H2O currently features distributed implementations of generalized linear models, gradient boosting machines, random forest, deep neural nets, dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others. The ability to create stacked ensembles, or "super learners," from a collection of supervised base learners is provided via the h2oEnsemble R package.
R and Python Jupyter notebooks with H2O machine learning code examples will be demoed live and made available on GitHub for attendees to follow along on their laptops.
Prior knowledge:
Familiarity with R or Python is recommended, as is some basic familiarity with machine learning topics such as classification, regression, training and test sets, and cross-validation. Some of these basic concepts are explained in this tutorial.
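For attendees new to these terms, the idea behind cross-validation can be sketched in a few lines of plain Python. This is only an illustration of the concept, not H2O code: the function name k_fold_indices and its parameters are made up for this example, and it uses only the standard library. The rows are shuffled and divided into k folds; each fold serves once as the test set while the remaining rows form the training set.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Return k (train, test) pairs of row indices for k-fold cross-validation.

    Hypothetical helper for illustration only; not part of H2O.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")

    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the folds are random but reproducible.
    random.Random(seed).shuffle(indices)

    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one row.
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for i, test in enumerate(folds):
        # The training set is every row that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_rows=10, k=5)
for train, test in splits:
    print(f"train on {len(train)} rows, test on {len(test)} rows")
```

With 10 rows and 5 folds, each line of output reports 8 training rows and 2 test rows. In practice a model would be fit on each training set and scored on the matching test set, and the k scores would then be averaged.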
Technology requirements:
- Any operating system: Linux, OS X or Windows
- Java 7 or 8
- Either R or Python is recommended. H2O can also be used from the browser through its web GUI, so neither is strictly required.
- To install R and/or Python:
Interested in learning more? Check out our related upcoming H2O workshop: How to Use the H2O Web GUI