ml4xcube API

1. plotting

The `plotting` module provides functionality to visualize slices of data from `xarray.DataArray` objects and thus facilitates the analysis of multidimensional data. The primary function in this module is `plot_slice`, which allows users to create visualizations with optional masks to highlight specific regions of interest.

Functions:
- plot_slice - Renders a 2D slice of an `xarray.DataArray` with optional emphasis on specific features via masking.
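A minimal usage sketch: since only the function name `plot_slice` is documented above, the import path, the keyword names (`xdim`, `ydim`, `title`), and the example variable are illustrative assumptions, not the actual signature.

```python
import xarray as xr
from ml4xcube.plotting import plot_slice  # import path assumed from the module name

ds = xr.open_zarr("path/to/cube.zarr")  # hypothetical data cube

# Render a 2D (lat/lon) slice of one variable; every keyword below is an
# assumption, not the documented signature.
plot_slice(
    ds["land_surface_temperature"],
    xdim="lon",
    ydim="lat",
    title="LST slice",
)
```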
2. insights

The `insights` module offers tools for extracting and analyzing characteristics of multidimensional data cubes from `xarray.DataArray` objects. This module includes functions to assess the completeness and distribution of data within the cube, helping users understand the dataset's quality and spatio-temporal coverage. The detailed workflow for analyzing the specifics of a data cube is demonstrated in the following Jupyter Notebook.

Functions:
- get_insights - Extracts and prints detailed characteristics of a data cube, including dimensions, value ranges, and gaps in the data.
- get_gap_heat_map - Generates a heat map to visualize the distribution of non-`NaN` values across selected dimensions, revealing patterns of data availability or missingness.
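For illustration, a hedged sketch of both functions; the import path and the variable name are assumptions, and the exact signatures may differ.

```python
import xarray as xr
from ml4xcube.insights import get_insights, get_gap_heat_map  # assumed import path

da = xr.open_zarr("path/to/cube.zarr")["land_surface_temperature"]  # hypothetical

# Print dimensions, value ranges, and gap characteristics of the cube.
get_insights(da)

# Heat map of non-NaN counts, e.g. per (lat, lon) cell across time.
gap_map = get_gap_heat_map(da)
```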
3. gapfilling

The `ml4xcube.gapfilling` module is designed to address and rectify data gaps in multidimensional geospatial datasets, particularly those represented in `xarray.Dataset` format. The gap filling process is divided into three main submodules, each playing a crucial role in the preparation, processing, and application of sophisticated machine learning algorithms to ensure accurate and efficient data imputation. The entire gapfilling process is showcased in the following Jupyter Notebook.
3.1. helper.predictors

The `HelpingPredictor` class within the `helper.predictors` submodule facilitates the preparation of predictor data for gap filling applications. This submodule focuses on extracting and processing global predictor data, such as land cover classifications, and matching it to the corresponding dimensions (e.g., latitude and longitude) of the target data cube. The prepared predictor is then stored in a `.zarr` dataset, ready to be used across various gap filling applications. If no predictor is leveraged during the gap filling process, a Support Vector Machine is trained on artificial gaps within the `Gapfiller` class.

Classes:
- HelpingPredictor - Facilitates the preparation of predictor data for gap filling.
3.2. gap_dataset

The `GapDataset` class in the `gap_dataset` submodule prepares data before the actual gap filling is performed. This submodule focuses on slicing specific dimensions from a data cube, optionally applying artificial gaps, and managing datasets for subsequent gap filling operations.

Classes:
- GapDataset - Optionally prepares data with artificial gaps for training a regressor, and prepares datasets with real gaps before gap filling can be performed.
3.3. gap_filling

The `Gapfiller` class within the `gap_filling` submodule is an integral part of the `ml4xcube.gapfilling` module, designed to implement and manage the gap filling process using machine learning techniques, currently focusing on Support Vector Regression (SVR). It allows for the integration of different hyperparameters and predictors to optimize the gap filling process. A prerequisite before gap filling can be applied is a specific data preparation step, handled by the functionalities of the `GapDataset` class.

Classes:
- Gapfiller - Optionally trains a predictor to estimate actual values in gaps. Performs gap filling with SVR or a user-provided regressor.
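The two classes are typically used together. The sketch below is a rough, assumed workflow: the import paths, constructor arguments (`ds_name`, `dimensions`, `learning_function`), and method names (`get_data`, `gapfill`) are all illustrative, not the documented API.

```python
import xarray as xr
from ml4xcube.gapfilling.gap_dataset import GapDataset  # assumed import paths
from ml4xcube.gapfilling.gap_filling import Gapfiller

da = xr.open_zarr("path/to/cube.zarr")["land_surface_temperature"]

# Step 1: slice the cube and optionally insert artificial gaps so a
# regressor can be trained (all arguments are assumptions).
gap_ds = GapDataset(da, ds_name="lst_example",
                    dimensions={"lat": (40, 60), "lon": (0, 20)})
gap_ds.get_data()  # method name assumed

# Step 2: fill the gaps, here with the built-in SVR regressor.
gapfiller = Gapfiller(ds_name="lst_example", learning_function="SVR")
gapfiller.gapfill()  # method name assumed
```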
4. splits

The `splits` module includes functions designed to divide an `xarray.Dataset` into a train and a test set. These functions use sampling strategies that respect the integrity of spatial and temporal data blocks, facilitating the development of machine learning models. Functions to assign split variables provide structured and random approaches to segmenting the dataset, which are then utilized by the `create_split` function to generate the actual train-test splits.

Functions:
- assign_block_split - Determines the assignment of data blocks to train or test sets using a deterministic approach based on the Cantor pairing function. This structured random sampling respects data locality and sets up the `splits` variable used by `create_split`.
- assign_rand_split - Randomly assigns a split indicator to each element in the dataset based on a specified proportion. This method provides a randomized approach to setting up the `splits` variable, which is also used by `create_split`.
- create_split - Generates train-test splits for machine learning models by utilizing a predefined `splits` variable within the dataset. Supports `xarray.Dataset` or a dictionary of `numpy` arrays and provides flexibility in specifying feature and target variables, effectively leveraging the split defined by the previous functions.
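A sketch of the block-based workflow, assuming the import path `ml4xcube.splits`; the `block_size` values are illustrative, and the exact keywords and return value of `create_split` may differ.

```python
import xarray as xr
from ml4xcube.splits import assign_block_split, create_split  # assumed import path

ds = xr.open_zarr("path/to/cube.zarr")

# Deterministically assign contiguous (time, lat, lon) blocks to train or
# test via the Cantor pairing function; ~80 % of blocks go to training.
ds = assign_block_split(
    ds=ds,
    block_size=[("time", 10), ("lat", 100), ("lon", 100)],
    split=0.8,
)

# Materialize the actual train/test sets from the assigned splits variable.
train_set, test_set = create_split(ds)  # further keyword options omitted / assumed
```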
5. preprocessing

The `preprocessing` module provides a collection of functions for preparing and processing data from `xarray.Dataset` objects, particularly focusing on operations commonly required in data science and machine learning workflows. These functions include filtering, filling missing data, calculating statistics, and normalizing or standardizing data.

Functions:
- apply_filter - Applies a specified filter to the data by setting all values that do not belong to the mask to NaN, or by dropping the entire sample.
- assign_mask - Assigns a mask to the dataset for later data division or filtering.
- drop_nan_values - Filters out samples from a dataset if they contain any `NaN` values, with an optional mask to determine sample validity. It handles both 1D and multi-dimensional samples.
- fill_nan_values - Fills `NaN` values in the dataset using a specified method.
- get_range - Computes the range (min and max) of the data.
- get_statistics - Computes the mean and standard deviation of a specified variable.
- normalize - Normalizes the data to the range [0, 1].
- standardize - Standardizes the data to have a mean of 0 and a variance of 1.
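For example, standardizing one variable with statistics computed from the cube; the import path, the returned tuple, and the argument order are assumptions.

```python
import xarray as xr
from ml4xcube.preprocessing import get_statistics, standardize  # assumed path

ds = xr.open_zarr("path/to/cube.zarr")

# Mean and standard deviation of a single variable ...
mean, std = get_statistics(ds, "land_surface_temperature")  # return value assumed

# ... used to rescale the data to zero mean and unit variance.
ds_std = standardize(ds, mean, std)  # argument order assumed
```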
6. datasets

The datasets module is a comprehensive suite designed to handle, process, and prepare data cubes for machine learning applications. This module supports various data scales and integrates seamlessly with major deep learning frameworks like PyTorch and TensorFlow, ensuring that data stored in `xarray` datasets is optimally formatted and ready for training deep learning models.
6.1. multiproc_sampler

The `MultiProcSampler` class is designed to process and sample large multidimensional training and testing datasets efficiently using parallel processing, specifically tailored for machine learning model training in the `ml4xcube` framework.

Classes:
- MultiProcSampler - Samples train and test data as `.zarr` datasets.
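A rough sketch of the intended use; the constructor keywords (`train_cube`, `test_cube`) and the method name `get_datasets` are assumptions based on the description above.

```python
import xarray as xr
from ml4xcube.datasets.multiproc_sampler import MultiProcSampler  # assumed path

ds = xr.open_zarr("path/to/cube.zarr")

# Sample train/test data in parallel worker processes and persist the
# results as .zarr stores (all names below are assumptions).
sampler = MultiProcSampler(ds=ds, train_cube="train.zarr", test_cube="test.zarr")
train_ds, test_ds = sampler.get_datasets()
```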
6.2. pytorch

The `datasets.pytorch` module integrates with PyTorch to manage and process large datasets efficiently. This module utilizes PyTorch's `Dataset` and `DataLoader` functionalities to prepare and iterate over chunks of data cubes for deep learning applications, ensuring that data management is scalable and performance-optimized.

Classes:
- PTXrDataset - A subclass of PyTorch's `Dataset`, designed specifically to handle large datasets based on a provided `xarray.Dataset`.

Functions:
- prep_dataloader - Sets up one or two `DataLoader`s from a PyTorch `Dataset` which was sampled from an `xarray.Dataset`. If a test set is provided, two `DataLoader`s are returned; otherwise, one.
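A minimal sketch, assuming `PTXrDataset` wraps the cube directly and `prep_dataloader` accepts common `DataLoader` keywords; both assumptions may differ from the actual signatures.

```python
import xarray as xr
from ml4xcube.datasets.pytorch import PTXrDataset, prep_dataloader  # assumed path

ds = xr.open_zarr("path/to/cube.zarr")

# Wrap the cube in a torch-style Dataset and build a DataLoader over it.
train_ds = PTXrDataset(ds)  # constructor arguments assumed
train_loader = prep_dataloader(train_ds, batch_size=64, num_workers=4)

for batch in train_loader:
    ...  # training step
```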
6.3. tensorflow#
The datasets.tensorflow module
is specifically designed to handle and iterate over large xarray
datasets and
efficiently prepare them for use with TensorFlow models. This module provides a seamless interface to transform
data stored in xarray
datasets into structured TensorFlow datasets that can be directly utilized in training
and inference pipelines. The core functionality is encapsulated in the TFXrDataset
class, which leverages
TensorFlow's capabilities to manage data flow dynamically, supporting scalable machine learning operations on large datasets.
Classes:
- TFXrDataset - TensorFlow specific implementation to handle and iterate over large
xarray
datasets.
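A sketch under the assumption that `TFXrDataset` is constructed from an `xarray.Dataset` and exposes a `tf.data.Dataset`; the keyword and method names are illustrative.

```python
import xarray as xr
from ml4xcube.datasets.tensorflow import TFXrDataset  # assumed import path

ds = xr.open_zarr("path/to/cube.zarr")

# Stream the cube as a tf.data-style dataset, e.g. for model.fit().
tf_ds = TFXrDataset(ds, batch_size=64).get_dataset()  # names assumed
```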
6.4. xr_dataset

The `XrDataset` class within the `datasets.xr_dataset` module is tailored to efficiently manage and process smaller datasets directly within memory, leveraging in-memory operations to enhance both speed and performance.

Classes:
- XrDataset - Creates small datasets manageable in memory.
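In contrast to the chunk-wise classes above, a small cube can be materialized in memory at once; the constructor and method names below are assumptions.

```python
import xarray as xr
from ml4xcube.datasets.xr_dataset import XrDataset  # assumed import path

small_ds = xr.open_zarr("path/to/small_cube.zarr")

# Load the entire (small) cube into memory as training samples.
data = XrDataset(small_ds).get_dataset()  # method name assumed
```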
7. training

The training module serves as a comprehensive suite for training machine learning models across various frameworks, designed to accommodate the unique demands of large-scale and high-dimensional datasets typically encountered in geospatial analysis and beyond. This module streamlines the training process, offering specialized support for PyTorch, TensorFlow, and scikit-learn, enabling users to leverage the strengths of these popular frameworks efficiently.
7.1. pytorch

The `training.pytorch` module provides tools for training PyTorch models. It includes functionalities such as early stopping, model checkpointing, and performance logging, ensuring efficient training and optimization of models.

Classes:
- Trainer - Tailored for the training of PyTorch models.
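A self-contained toy sketch: early stopping and checkpointing are documented above, but the keyword names (`train_data`, `best_model_path`, `patience`, ...) and the `train()` method are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from ml4xcube.training.pytorch import Trainer  # assumed import path

# Toy data and model so the sketch is self-contained.
x = torch.randn(64, 1)
y = 2 * x + 0.1 * torch.randn(64, 1)
train_loader = DataLoader(TensorDataset(x, y), batch_size=8)
test_loader  = DataLoader(TensorDataset(x, y), batch_size=8)

model     = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

trainer = Trainer(
    model=model,
    train_data=train_loader,          # keyword names assumed
    test_data=test_loader,
    optimizer=optimizer,
    best_model_path="best_model.pt",  # checkpoint target (assumed)
    early_stopping=True,
    patience=3,
    epochs=10,
)
model = trainer.train()  # method name assumed
```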
7.2. pytorch_distributed

The `training.pytorch_distributed` module is designed to facilitate efficient distributed training of PyTorch models across multiple GPUs or nodes. This module leverages PyTorch's `DistributedDataParallel` (DDP) functionality, providing tools to handle complex distributed training tasks with ease, including setup, execution, and synchronization across multiple processes.

Classes:
- Trainer - Crafted to perform distributed training for PyTorch models.

Functions:
- ddp_init - Initializes the distributed process group for GPU-based distributed training.
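Only `ddp_init` and the distributed `Trainer` are documented above; the sketch below assumes a launch with one process per GPU (e.g. via `torchrun`) and uses illustrative `Trainer` keywords.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from ml4xcube.training.pytorch_distributed import ddp_init, Trainer  # assumed path

# Initialize the distributed process group (per the description above);
# run with e.g.: torchrun --nproc_per_node=4 train.py
ddp_init()

x = torch.randn(64, 1)
train_loader = DataLoader(TensorDataset(x, 2 * x), batch_size=8)

model = torch.nn.Linear(1, 1)
trainer = Trainer(model=model, train_data=train_loader)  # keywords assumed
trainer.train()  # method name assumed; DDP wrapping handled internally (assumption)
```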
7.3. sklearn

The `training.sklearn` module is tailored to train scikit-learn models efficiently. It supports batch training for handling large datasets and provides tools for evaluating model performance using various metrics, catering to both supervised and unsupervised learning tasks.

Classes:
- Trainer - Designed for training scikit-learn models.
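A self-contained sketch with a toy scikit-learn estimator; the `Trainer` keywords and the `train()` method are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from ml4xcube.training.sklearn import Trainer  # assumed import path

X = np.random.rand(100, 2)
y = X @ np.array([1.5, -0.7])

# Batch-wise fitting of a scikit-learn estimator (keywords assumed).
trainer = Trainer(model=LinearRegression(), train_data=(X, y))
model = trainer.train()  # method name assumed
```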
7.4. tensorflow

The `training.tensorflow` module is specifically designed for training TensorFlow models. This module provides a comprehensive suite of tools for training, evaluating, and monitoring TensorFlow models, particularly those used in processing large datasets typically encountered in fields such as geospatial analysis.

Classes:
- Trainer - Created to facilitate the training of TensorFlow models.
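Analogous to the PyTorch case, a toy sketch with assumed `Trainer` keywords and `train()` method.

```python
import numpy as np
import tensorflow as tf
from ml4xcube.training.tensorflow import Trainer  # assumed import path

# Toy regression model and data.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(100, 2).astype("float32")
y = X @ np.array([[1.5], [-0.7]], dtype="float32")

trainer = Trainer(model=model, train_data=(X, y), epochs=10)  # keywords assumed
trained_model = trainer.train()  # method name assumed
```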
8. postprocessing

The `postprocessing` module provides functionality that is commonly required after machine learning operations to obtain the final predictions.

Functions:
- undo_normalizing - Reverts the normalization process to obtain the original data range.
- undo_standardizing - Reverts the standardization process to obtain the original data scale.
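For example, mapping standardized predictions back to the original scale, using the statistics computed during preprocessing (section 5); the import path and argument order are assumptions.

```python
import numpy as np
from ml4xcube.postprocessing import undo_standardizing  # assumed import path

mean, std = 280.0, 15.0             # statistics used when standardizing (sec. 5)
preds = np.array([-0.5, 0.0, 1.2])  # model output on the standardized scale

# Back to the original scale, i.e. roughly mean + preds * std.
preds_orig = undo_standardizing(preds, mean, std)  # argument order assumed
```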
9. evaluation

The evaluation module in the `ml4xcube` API is designed to support comprehensive metric evaluation for machine learning models across various frameworks, including PyTorch, TensorFlow, and scikit-learn. This module supports the assessment of model performance during validation or testing phases, providing a range of metrics to evaluate accuracy, error rates, and other critical performance indicators, and offering unified access to the metrics of the different frameworks.

9.1. evaluator

The `Evaluator` class is tailored to handle metric evaluations, allowing users to measure and analyze model performance using metrics suited to their specific framework.

Classes:
- Evaluator - Facilitates metric evaluation across different machine learning frameworks, enabling the assessment of various performance metrics during model validation or testing.
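A sketch of the intended use; apart from the class name, everything below (import path, keywords, metric identifiers, `evaluate()` method) is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from ml4xcube.evaluation import Evaluator  # assumed import path

X = np.random.rand(100, 2)
y = X @ np.array([1.5, -0.7])
model = LinearRegression().fit(X, y)  # toy trained model

evaluator = Evaluator(model=model, test_data=(X, y), metrics=["rmse", "r2"])
results = evaluator.evaluate()  # method name assumed
```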
10. utils

The `utils` module provides a set of utility functions for handling and processing `xarray.Dataset` objects. These functions facilitate tasks such as rechunking datasets, retrieving specific data chunks, and iterating over data blocks. They are particularly helpful for optimizing the performance of data operations and preparing datasets for machine learning tasks.

Functions:
- assign_dims - Assigns dimensions to each `dask.array` or `xarray.DataArray` within a dictionary.
- calculate_total_chunks - Computes the number of chunks of an `xarray.Dataset`.
- get_chunk_by_index - Retrieves a specific data chunk from an `xarray.Dataset`.
- get_chunk_sizes - Determines the maximum chunk sizes of all data variables of an `xarray.Dataset`.
- get_dim_range - Calculates the value range of an `xarray.DataArray` dimension.
- iter_data_var_blocks - Creates an iterator over the chunks of an `xarray.Dataset`.
- rechunk_cube - Rechunks an `xarray.DataArray` to a new chunking scheme and stores the result at a specified path.
- split_chunk - Splits a chunk into data samples for subsequent machine learning training.
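A chunk-iteration sketch, assuming the import path `ml4xcube.utils` (matching the module name above); the exact return type of `get_chunk_by_index` is an assumption.

```python
import xarray as xr
from ml4xcube.utils import calculate_total_chunks, get_chunk_by_index  # assumed path

ds = xr.open_zarr("path/to/cube.zarr")

# Walk the cube chunk by chunk, e.g. to feed a sampler or preprocessor.
n_chunks = calculate_total_chunks(ds)
for i in range(n_chunks):
    chunk = get_chunk_by_index(ds, i)  # per-variable arrays (return type assumed)
    ...  # process the chunk
```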