Data Utilities¶
Dataset Manipulation¶
fusion_bench.utils.data
¶
InfiniteDataLoader
¶
A wrapper class for DataLoader to create an infinite data loader. This is useful when only the number of training steps matters, not the number of epochs.
This class wraps a DataLoader and provides an iterator that resets when the end of the dataset is reached, creating an infinite loop.
Attributes:

- `data_loader` (DataLoader) – The DataLoader to wrap.
- `_data_iter` (iterator) – An iterator over the DataLoader.
- `_iteration_count` (int) – Number of complete iterations through the dataset.
Example:

```python
>>> train_loader = DataLoader(dataset, batch_size=32)
>>> infinite_loader = InfiniteDataLoader(train_loader)
>>> for i, batch in enumerate(infinite_loader):
...     if i >= 1000:  # Train for 1000 steps
...         break
...     train_step(batch)
```
Source code in fusion_bench/utils/data.py
iteration_count
property
¶
Get the number of complete iterations through the dataset.
__init__(data_loader, max_retries=1)
¶
Initialize the InfiniteDataLoader.
Parameters:

- `data_loader` (DataLoader) – The DataLoader to wrap.
- `max_retries` (int, default: 1) – Maximum number of retry attempts when resetting the data loader.

Raises:

- `ValidationError` – If data_loader is None or not a DataLoader instance.
Source code in fusion_bench/utils/data.py
__iter__()
¶
__len__()
¶
Return the length of the underlying data loader.
Returns:

- `int` – The number of batches in one complete iteration.
__next__()
¶
Get the next batch, resetting to the beginning when the dataset is exhausted.
Returns:

- The next batch from the data loader.

Raises:

- `RuntimeError` – If the data loader consistently fails to produce data.
Source code in fusion_bench/utils/data.py
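The reset-on-exhaustion behavior can be sketched with a minimal wrapper over any iterable. This is an illustration of the pattern only, not the actual fusion_bench implementation, which also validates its input and retries on failure:

```python
class InfiniteIterable:
    """Minimal sketch of the InfiniteDataLoader pattern: wrap an iterable
    and restart it whenever it is exhausted (illustrative only)."""

    def __init__(self, data_loader):
        self.data_loader = data_loader
        self._data_iter = iter(data_loader)
        self._iteration_count = 0

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._data_iter)
        except StopIteration:
            # Dataset exhausted: count the completed pass and start over.
            self._iteration_count += 1
            self._data_iter = iter(self.data_loader)
            return next(self._data_iter)

    def __len__(self):
        return len(self.data_loader)


# Draw 7 batches from a 3-batch "loader": it wraps around transparently.
batches = [[0, 1], [2, 3], [4, 5]]
loader = InfiniteIterable(batches)
first_seven = [batch for _, batch in zip(range(7), loader)]
```

After seven draws the wrapper has completed two full passes over the three batches, so `_iteration_count` is 2.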
load_tensor_from_file(file_path, device=None)
¶
Loads a tensor from a file in .pt, .pth, or .np format. If the file is in none of these formats, it is loaded as a pickle file.

Parameters:

- `file_path` (str) – The path to the file to load.
- `device` (Optional[Union[str, device]], default: None) – The device to move the tensor to. By default the tensor is loaded on the CPU.

Returns:

- `Tensor` – The tensor loaded from the file.

Raises:

- `ValidationError` – If the file doesn't exist.
- `ValueError` – If the file format is unsupported.
Source code in fusion_bench/utils/data.py
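The extension-based dispatch described above can be sketched as follows. This is a simplified stand-in using only the standard library; the real function delegates to torch.load, numpy, and pickle as appropriate, and the function name here is hypothetical:

```python
import pickle
from pathlib import Path


def load_object_from_file(file_path):
    """Sketch of extension-based dispatch: pick a loader from the file
    suffix and fall back to pickle for unknown formats."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    suffix = path.suffix.lower()
    if suffix in (".pt", ".pth"):
        # Real implementation: torch.load(path, map_location=device)
        raise NotImplementedError("torch is not used in this sketch")
    if suffix == ".np":
        # Real implementation: np.load(path)
        raise NotImplementedError("numpy is not used in this sketch")
    # Fallback for any other extension: try to unpickle,
    # mirroring the documented behavior.
    with open(path, "rb") as f:
        return pickle.load(f)
```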
train_validation_split(dataset, validation_fraction=0.1, validation_size=None, random_seed=None, return_split='both')
¶
Split a dataset into a training and validation set.
Parameters:

- `dataset` (Dataset) – The dataset to split.
- `validation_fraction` (Optional[float], default: 0.1) – The fraction of the dataset to use for validation.
- `validation_size` (Optional[int], default: None) – The number of samples to use for validation. `validation_fraction` must be set to `None` if this is provided.
- `random_seed` (Optional[int], default: None) – The random seed to use for reproducibility.
- `return_split` (Literal['all', 'train', 'val'], default: 'both') – The split to return.

Returns:

- `Union[Tuple[Dataset, Dataset], Dataset]` – The training and validation datasets.
Source code in fusion_bench/utils/data.py
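A seeded two-way split can be sketched over any sequence-like dataset. This is illustrative only (the real helper operates on torch Datasets), and the function name is hypothetical:

```python
import random


def split_train_val(dataset, validation_fraction=0.1, validation_size=None,
                    random_seed=None):
    """Sketch of a seeded train/validation split: shuffle indices with a
    fixed seed, then carve off the validation portion (illustrative only)."""
    if validation_size is None:
        # Derive the validation size from the fraction when not given
        # explicitly, mirroring the documented parameters.
        validation_size = int(len(dataset) * validation_fraction)
    indices = list(range(len(dataset)))
    random.Random(random_seed).shuffle(indices)
    val = [dataset[i] for i in indices[:validation_size]]
    train = [dataset[i] for i in indices[validation_size:]]
    return train, val


train, val = split_train_val(list(range(100)), validation_fraction=0.2,
                             random_seed=0)
```

With a fixed `random_seed`, repeated calls produce the same split, which is the point of the parameter.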
train_validation_test_split(dataset, validation_fraction, test_fraction, random_seed=None, return_spilt='all')
¶
Split a dataset into a training, validation and test set.
Parameters:

- `dataset` (Dataset) – The dataset to split.
- `validation_fraction` (float) – The fraction of the dataset to use for validation.
- `test_fraction` (float) – The fraction of the dataset to use for the test set.
- `random_seed` (Optional[int], default: None) – The random seed to use for reproducibility.
- `return_spilt` (Literal['all', 'train', 'val', 'test'], default: 'all') – The split to return.

Returns:

- `Union[Tuple[Dataset, Dataset, Dataset], Dataset]` – The training, validation, and test datasets.
Source code in fusion_bench/utils/data.py
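The three-way variant follows the same idea, carving test and validation portions from a seeded shuffle. Again a sketch with a hypothetical name, not the fusion_bench implementation:

```python
import random


def split_train_val_test(dataset, validation_fraction, test_fraction,
                         random_seed=None):
    """Sketch of a seeded train/validation/test split (illustrative only)."""
    n = len(dataset)
    n_val = int(n * validation_fraction)
    n_test = int(n * test_fraction)
    indices = list(range(n))
    random.Random(random_seed).shuffle(indices)
    test = [dataset[i] for i in indices[:n_test]]
    val = [dataset[i] for i in indices[n_test:n_test + n_val]]
    train = [dataset[i] for i in indices[n_test + n_val:]]
    return train, val, test


train, val, test = split_train_val_test(list(range(100)), 0.2, 0.1,
                                        random_seed=42)
```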
JSON Import/Export¶
fusion_bench.utils.json
¶
load_from_json(path, filesystem=None)
¶
Load an object from a JSON file.

Parameters:

- `path` (Union[str, Path]) – The path to load the object from.
- `filesystem` (FileSystem, default: None) – PyArrow FileSystem to use for reading. If None, uses the local filesystem via standard Python open(). Can also be an s3fs.S3FileSystem or fsspec filesystem.

Returns:

- `Union[dict, list]` – The loaded object.

Raises:

- `ValidationError` – If the file doesn't exist (when using the local filesystem).
Source code in fusion_bench/utils/json.py
print_json(j, indent=' ', verbose=False, print_type=True)
¶
Print an overview of a JSON object.

Parameters:

- `j` (dict) – The loaded JSON object.
- `indent` (str, default: ' ') – The indentation string. Defaults to ' '.
Source code in fusion_bench/utils/json.py
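The kind of overview this produces can be sketched with a small recursive printer that reports each key and the type of its value. This is a simplified illustration with a hypothetical name, not the fusion_bench function, which also takes `verbose` and `print_type` options:

```python
def print_json_overview(j, indent="  ", _depth=0):
    """Sketch of a JSON overview printer: show each key with the type of
    its value, recursing into nested dicts (illustrative only)."""
    for key, value in j.items():
        print(f"{indent * _depth}{key}: {type(value).__name__}")
        if isinstance(value, dict):
            print_json_overview(value, indent=indent, _depth=_depth + 1)


print_json_overview({"a": 1, "b": {"c": [1, 2]}})
```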
save_to_json(obj, path, filesystem=None)
¶
Save an object to a JSON file.

Parameters:

- `obj` (Any) – The object to save.
- `path` (Union[str, Path]) – The path to save the object to.
- `filesystem` (FileSystem, default: None) – PyArrow FileSystem to use for writing. If None, uses the local filesystem via standard Python open(). Can also be an s3fs.S3FileSystem or fsspec filesystem.
Source code in fusion_bench/utils/json.py
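For the default local-filesystem path (filesystem=None), both helpers reduce to plain stdlib json over open(). A minimal sketch of that round trip, with hypothetical names:

```python
import json
from pathlib import Path


def save_json(obj, path):
    """Sketch of save_to_json's local-filesystem path: json.dump via open()."""
    with open(path, "w") as f:
        json.dump(obj, f)


def load_json(path):
    """Sketch of load_from_json's local-filesystem path, including the
    documented existence check."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    with open(path) as f:
        return json.load(f)
```

Passing a PyArrow or fsspec filesystem instead routes the same reads and writes through that filesystem's open methods.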
TensorBoard Data Import¶
fusion_bench.utils.tensorboard
¶
Functions for working with TensorBoard logs.
parse_tensorboard_as_dict(path, scalars)
¶
Returns a dictionary of pandas DataFrames, one for each requested scalar.

Parameters:

- `path` (str) – A file path to a directory containing tf events files, or a single tf events file. The accumulator will load events from this path.
- `scalars` (Iterable[str]) – The scalar tags to extract.

Returns:

- `Dict[str, pandas.DataFrame]` – A dictionary of pandas DataFrames, one for each requested scalar.
Source code in fusion_bench/utils/tensorboard.py
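The dictionary-of-DataFrames shape this returns can be illustrated without real event files. The real function loads scalars via TensorBoard's event machinery; here hypothetical (wall_time, step, value) tuples stand in for the loaded events, and the function and column names are assumptions for illustration:

```python
import pandas as pd


def scalars_to_frames(events_by_tag, scalars):
    """Build one DataFrame per requested scalar tag from (wall_time, step,
    value) event tuples, mirroring the dict-of-DataFrames return shape."""
    frames = {}
    for tag in scalars:
        frames[tag] = pd.DataFrame(
            events_by_tag[tag], columns=["wall_time", "step", "value"]
        )
    return frames


# Hypothetical scalar events as they might be read from an events file.
events = {"loss": [(0.0, 0, 1.5), (1.0, 1, 1.2)], "acc": [(0.0, 0, 0.4)]}
frames = scalars_to_frames(events, ["loss", "acc"])
```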
parse_tensorboard_as_list(path, scalars)
¶
Returns a list of pandas DataFrames, one for each requested scalar.

See also: parse_tensorboard_as_dict.

Parameters:

- `path` (str) – A file path to a directory containing tf events files, or a single tf events file. The accumulator will load events from this path.
- `scalars` (Iterable[str]) – The scalar tags to extract.

Returns:

- `List[pandas.DataFrame]` – A list of pandas DataFrames, one for each requested scalar.