6 hours of instruction
Introduces Dask for scaling data analysis in Python. The workshop begins with an overview of the fundamentals of parallel computing in Python, including the technical limitations of NumPy and Pandas. After exploring core Dask data structures, participants apply Dask arrays and dataframes in practice, using dashboard tools to monitor Dask workflows and measure performance.
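As a flavor of the material (a minimal sketch, not taken from the course itself), a Dask array mirrors the NumPy API while splitting work into chunks that can run in parallel:

```python
import dask.array as da

# Build a 1000x1000 array of ones, split into 16 chunks of 250x250;
# each chunk can be processed by a separate worker in parallel.
x = da.ones((1000, 1000), chunks=(250, 250))

# Operations build a lazy task graph; .compute() triggers execution.
total = x.sum().compute()
print(total)  # 1000000.0
```

The same chunked, lazy-evaluation idea underlies Dask dataframes, which partition a Pandas dataframe row-wise.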
PREREQUISITES
Participants should have prior experience with the Python language and, in particular, with standard Python tools for data analysis (notably NumPy, Pandas, scikit-learn, and Jupyter). No prior exposure to Dask or to parallel computing is required.
LEARNING OBJECTIVES
- Explain parallel computing concepts relevant to data analysis pipelines.
- Identify where parallelism is attainable (or difficult) in existing Python data workflows.
- Select Dask data structures appropriate to a given compute-intensive scenario.
- Construct scalable data analysis pipelines with Dask, extending existing Pandas/NumPy code.
- Use the Dask dashboard and diagnostic tools to monitor, assess, and tune performance.
- Apply the scheduler appropriate to the available hardware.
- Plan and execute embarrassingly parallel Dask workflows on remote data.
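The last two objectives can be illustrated with a short sketch (the function here is illustrative, not course material): independent tasks built with `dask.delayed` form an embarrassingly parallel graph, and the scheduler is chosen at compute time.

```python
from dask import delayed

@delayed
def square(i):
    # Each call becomes an independent task -- embarrassingly parallel.
    return i * i

# Build a lazy graph: ten independent tasks plus one aggregation step.
total = delayed(sum)([square(i) for i in range(10)])

# Pick a scheduler suited to the workload/hardware:
# "threads", "processes", or "synchronous" (single-threaded, for debugging).
result = total.compute(scheduler="threads")
print(result)  # 285
```

Because the tasks share no state, swapping `scheduler="threads"` for `"processes"` or a distributed cluster changes only where the work runs, not the code that defines it.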