ebes.data package

Submodules

ebes.data.accessors module

The module contains classes that expose a pd.DataFrame interface to datasets.

class ebes.data.accessors.InMemoryPandasDataAccessor(*, parquet_path, split_sizes, data_queries=None, split_by_col=None, random_split=False, split_seed=None)

Bases: PandasDataAccessor

Data accessor that keeps all data in memory.

get_split(split_idx)

Get split by its index.

Parameters:

split_idx – positive index of split.

Return type:

DataFrame

Returns:

A dataframe or a sequence of dataframes with the data in the given split.
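
A minimal usage sketch. The parquet path, the split_sizes format and the index-to-split mapping below are assumptions for illustration, not part of the documented API:

    from ebes.data.accessors import InMemoryPandasDataAccessor

    # Build an accessor over a parquet file and request the first split.
    # split_sizes is assumed to take split fractions; the path is hypothetical.
    accessor = InMemoryPandasDataAccessor(
        parquet_path="data/transactions.parquet",
        split_sizes=[0.7, 0.15, 0.15],
        random_split=True,
        split_seed=42,
    )
    train_df = accessor.get_split(0)  # pd.DataFrame; index-to-split order is assumed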

class ebes.data.accessors.PandasDataAccessor

Bases: ABC

Abstract class for all data accessors.

A data accessor is responsible for splitting the data into train/test/etc., filtering it and exposing a pd.DataFrame interface to it. The splits are configured in subclass __init__ methods and accessed by their index. Each subclass splits the data (using any specified strategy) and returns a split by its positive index as the result of the get_split method.

abstract get_split(split_idx)

Get split by its index.

Parameters:

split_idx (int) – positive index of split.

Return type:

DataFrame | Sequence[DataFrame]

Returns:

A dataframe or a sequence of dataframes with the data in the given split.
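
As an illustration of this contract, a minimal subclass might look as follows. This is a sketch, not a class shipped with the library:

    import pandas as pd
    from ebes.data.accessors import PandasDataAccessor

    class HalfSplitAccessor(PandasDataAccessor):
        """Toy accessor that splits a dataframe in half by row order."""

        def __init__(self, df: pd.DataFrame):
            self._df = df

        def get_split(self, split_idx: int) -> pd.DataFrame:
            # Splits are configured in __init__ and returned here by index.
            half = len(self._df) // 2
            if split_idx == 0:
                return self._df.iloc[:half]
            if split_idx == 1:
                return self._df.iloc[half:]
            raise IndexError(f"no split with index {split_idx}")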

ebes.data.batch_tfs module

Batch transforms for data loading pipelines.

class ebes.data.batch_tfs.BatchTransform

Bases: ABC

Base class for all batch transforms.

A BatchTransform is a callable object that modifies a Batch in place.
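
A sketch of a custom transform, assuming the Batch object exposes a num_features tensor of shape (max_seq_len, batch_size, n_features), as the transforms documented below suggest:

    from ebes.data.batch_tfs import BatchTransform

    class ClampNumFeatures(BatchTransform):
        """Illustrative transform: clamp numerical features to a fixed range."""

        def __init__(self, low: float = -10.0, high: float = 10.0):
            self.low = low
            self.high = high

        def __call__(self, batch) -> None:
            # Modify the batch in place, as the BatchTransform contract requires.
            if batch.num_features is not None:
                batch.num_features.clamp_(self.low, self.high)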

class ebes.data.batch_tfs.CatToNum

Bases: BatchTransform

Process categorical features as numerical.

Treat categorical features as numerical (a plain type cast). Category 0 is converted to NaN.

class ebes.data.batch_tfs.ContrastiveTarget

Bases: BatchTransform

Set target for contrastive losses.

The new target is a LongTensor such that items with different indices have different target labels.

class ebes.data.batch_tfs.DatetimeToFloat(loc, scale)

Bases: BatchTransform

Cast time from np.datetime64 to float by rescaling.

loc: str | datetime64

Location to subtract. If a string is passed, it is converted to np.datetime64 beforehand.

scale: tuple[int, str] | timedelta64

Scale to divide time by. If a tuple is passed, it is forwarded to np.timedelta64: the first item is the value and the second is the unit.
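
For example, to express time as days since an (illustrative) reference date:

    import numpy as np
    from ebes.data.batch_tfs import DatetimeToFloat

    # "2020-01-01" is converted to np.datetime64, and (1, "D") is passed to
    # np.timedelta64, so time becomes "days since 2020-01-01".
    to_days = DatetimeToFloat(loc="2020-01-01", scale=(1, "D"))
    # Equivalent explicit form:
    to_days = DatetimeToFloat(loc=np.datetime64("2020-01-01"), scale=np.timedelta64(1, "D"))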

class ebes.data.batch_tfs.FillNans(fill_value)

Bases: BatchTransform

Fill NaNs with specified values.

fill_value: Mapping[str, float] | float

If a float is given, all NaNs in all numerical features are replaced with fill_value. A mapping sets feature-specific replacement values.
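
For instance (the feature names are hypothetical):

    from ebes.data.batch_tfs import FillNans

    fill_all = FillNans(0.0)                             # every NaN in every numerical feature -> 0.0
    fill_some = FillNans({"amount": 0.0, "rate": -1.0})  # per-feature replacement values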

class ebes.data.batch_tfs.ForwardFillNans(backward=False)

Bases: BatchTransform

Fill NaN values by propagating the last non-NaN values forward.

The algorithm starts from the second time step. If some values are NaNs, the values from the previous step are used to fill them. If the first time step contains NaNs, some NaNs will remain unfilled after the forward pass. To handle this, backward=True may be specified to fill the remaining NaN values from last to first after the forward pass. Even after a backward pass the batch may still contain NaNs if some feature consists entirely of NaN values; use the FillNans transform to fill those.

backward: bool = False

Whether to do a backward fill after the forward fill (see the class description).
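
The forward pass can be pictured as follows. This is an illustrative re-implementation over a (max_seq_len, batch_size, n_features) tensor, not the library's internal code:

    import torch

    def forward_fill(x: torch.Tensor) -> torch.Tensor:
        x = x.clone()
        for t in range(1, x.shape[0]):
            nan_mask = torch.isnan(x[t])
            x[t][nan_mask] = x[t - 1][nan_mask]
        # NaNs in the first time step (t = 0) are still present here; a backward
        # pass (backward=True) or a FillNans transform is needed to remove them.
        return x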

class ebes.data.batch_tfs.Logarithm(names)

Bases: BatchTransform

Apply the natural logarithm to the specified features.

names: list[str]

Feature names to transform by taking the logarithm.

class ebes.data.batch_tfs.MaskValid

Bases: BatchTransform

Add mask indicating valid values to batch.

The mask has shape (max_seq_len, batch_size, n_features) and is True where the values are non-NaN (nonzero category for categorical features) and where the data is not padded.

class ebes.data.batch_tfs.PrimeNetSampler(mask_ratio_per_seg=0.05, segment_num=3, pretrain_tasks='full2')

Bases: BatchTransform

Contrastive sampling according to PrimeNet.

Input:

batch: Batch. Masks are required.

Feature shapes are doubled along the batch dimension:

  • batch.num_features: (T, B, D) -> (T, 2B, D)

  • batch.cat_features: (T, B, D) -> (T, 2B, D)

Masks gain an additional dimension for the contrastive and interpolation tasks:

  • batch.num_mask: (T, B, D) -> (T, 2B, D, 2)

  • batch.cat_mask: (T, B, D) -> (T, 2B, D, 2)

dense_sampling_bound = [0.4, 0.6]

len_sampling_bound = [0.3, 0.7]

mask_ratio_per_seg: float = 0.05

pretrain_tasks: str = 'full2'

segment_num: int = 3

class ebes.data.batch_tfs.RandomEventsPermutation(keep_last=False)

Bases: BatchTransform

Permute events in sequence randomly.

Time, target and masks are left unchanged.

keep_last: bool = False

If True, the last event stays in place and the others are permuted.

class ebes.data.batch_tfs.RandomSlices(split_count, cnt_min, cnt_max, short_seq_crop_rate=1.0, seed=None)

Bases: BatchTransform

Sample random slices from input sequences.

The transform is taken from https://github.com/dllllb/coles-paper. It samples random slices from initial sequences. The batch size after this transform will be split_count times larger.

cnt_max: int

Maximal sample sequence length.

cnt_min: int

Minimal sample sequence length.

seed: int | None = None

Value to seed the random generator.

short_seq_crop_rate: float = 1.0

Must be in (0, 1]. If short_seq_crop_rate < 1 and a sequence shorter than cnt_min is encountered, the minimum sample length for that sequence is set to short_seq_crop_rate times its actual length.

split_count: int

How many sample slices to draw for each input sequence.
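
A configuration sketch (the numbers are illustrative):

    from ebes.data.batch_tfs import RandomSlices

    # Draw two random sub-slices of 20 to 150 events from every input sequence;
    # the batch becomes split_count (= 2) times larger.
    slicer = RandomSlices(split_count=2, cnt_min=20, cnt_max=150, seed=0)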

class ebes.data.batch_tfs.RandomTime

Bases: BatchTransform

Replace time with uniformly distributed values.

class ebes.data.batch_tfs.Rescale(name, loc, scale)

Bases: BatchTransform

Rescale feature: subtract location and divide by scale.

loc: Any

Value to subtract from the feature values.

name: str

Feature name.

scale: Any

Value to divide the feature values by.

class ebes.data.batch_tfs.RescaleTime(loc, scale)

Bases: BatchTransform

Rescale time: subtract location and divide by scale.

loc: float

Location to subtract from time.

scale: float

Scale to divide time by.

class ebes.data.batch_tfs.TargetToLong

Bases: BatchTransform

Cast the target to a LongTensor.

class ebes.data.batch_tfs.TimeToFeatures(process_type='none', time_name='time')

Bases: BatchTransform

Add time to numerical features.

To apply this transform, time must first be cast to a Tensor. It has to be applied BEFORE mask creation and AFTER the datetime-to-float conversion (e.g. DatetimeToFloat).

process_type: Literal['cat', 'diff', 'none'] = 'none'

How to add time to features. The options are:

  • "cat" — add absolute time to other numerical features,

  • "diff" — add time intervals between sequential events. In this case the first interval in a sequence equals zero.

  • "none" — do not add time to features. This option is added for the ease of optuna usage.

time_name: str = 'time'

Name of the new feature with time. Default: "time".
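
For example, to feed inter-event intervals to the model (assuming time has already been cast to a tensor earlier in the pipeline):

    from ebes.data.batch_tfs import TimeToFeatures

    # Appends time differences between consecutive events as a numerical feature;
    # must run before mask creation and after the datetime-to-float conversion.
    add_dt = TimeToFeatures(process_type="diff")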

class ebes.data.batch_tfs.UnsqueezeTarget

Bases: BatchTransform

Unsqueeze last dimension in target array.

The last linear layer for a regression task produces tensors of shape (bs, 1). When MSE loss is called with a target of shape (bs,), PyTorch broadcasts the operands to shape (bs, bs) and the loss is computed incorrectly. This batch transform reshapes the target to (bs, 1), so the MSE loss is computed correctly.
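
The broadcasting issue can be reproduced directly in PyTorch:

    import torch
    import torch.nn.functional as F

    pred = torch.randn(8, 1)   # regression head output of shape (bs, 1)
    target = torch.randn(8)    # raw target of shape (bs,)

    # Shapes (8, 1) and (8,) broadcast to (8, 8): PyTorch emits a warning and
    # the loss value is wrong.
    wrong = F.mse_loss(pred, target)
    # With the target reshaped to (bs, 1), as UnsqueezeTarget does, the loss is correct.
    right = F.mse_loss(pred, target.unsqueeze(-1))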

ebes.data.datasets module

class ebes.data.datasets.SeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)

Bases: IterableDataset

An iterable dataset over the DataFrame rows.

class ebes.data.datasets.SizedSeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)

Bases: SeriesDataset

The same as SeriesDataset, but with the __len__ method implemented.

ebes.data.datasets.series(df)

Return the DataFrame rows as a list of Series.

Return type:

list[Series]

ebes.data.loading module

class ebes.data.loading.SequenceCollator(*, time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')

Bases: object

batch_transforms: list[Callable[[Batch], None]] | None = None

cat_cardinalities: Mapping[str, int] | None = None

index_name: str | None = None

max_seq_len: int = 0

num_names: list[str] | None = None

padding_type: str = 'zeros'

target_name: str | list[str] | None = None

time_name: str
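
A usage sketch. Feature names, cardinalities and the transform order are illustrative, and passing the collator to a DataLoader as collate_fn is an assumption based on its role in the loading pipeline:

    from ebes.data.loading import SequenceCollator
    from ebes.data.batch_tfs import DatetimeToFloat, FillNans, TimeToFeatures

    collator = SequenceCollator(
        time_name="event_time",
        cat_cardinalities={"mcc": 350},
        num_names=["amount"],
        target_name="label",
        max_seq_len=512,
        batch_transforms=[
            DatetimeToFloat(loc="2020-01-01", scale=(1, "D")),
            TimeToFeatures(process_type="diff"),
            FillNans(0.0),
        ],
    )
    # Typically passed to a torch DataLoader as collate_fn (assumed wiring).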

ebes.data.utils module

ebes.data.utils.build_loaders(dataset, loaders, preprocessing)

Return type:

Mapping[str, DataLoader]

ebes.data.utils.get_accessor(parquet_path, split_sizes, split_by_col=None, random_split=False, split_seed=None)

ebes.data.utils.get_collator(time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')

Return type:

SequenceCollator
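
A sketch of the factory helpers; the path, feature names and split fractions are hypothetical:

    from ebes.data.utils import get_accessor, get_collator

    accessor = get_accessor(
        parquet_path="data/transactions.parquet",
        split_sizes=[0.8, 0.2],      # assumed to be split fractions
        random_split=True,
        split_seed=0,
    )
    collator = get_collator(
        time_name="event_time",
        cat_cardinalities={"mcc": 350},
        num_names=["amount"],
        target_name="label",
        max_seq_len=512,
    )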

ebes.data.utils.get_loader(accessor, collators, split_idx, preprocessing, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None, num_workers=0, labeled=True)

Return type:

DataLoader

Module contents