ebes.data package

Submodules

ebes.data.accessors module

The module contains classes that expose a pd.DataFrame interface to datasets.

class ebes.data.accessors.InMemoryPandasDataAccessor(*, parquet_path, split_sizes, data_queries=None, split_by_col=None, random_split=False, split_seed=None)

Bases: PandasDataAccessor

Data accessor that keeps all data in memory.

get_split(split_idx)

Get split by its index.

Parameters:

split_idx – positive index of split.

Return type:

DataFrame

Returns:

A dataframe or a sequence of dataframes with the data in the given split.
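
A minimal usage sketch. The parquet path, the split_sizes format and the index-to-split mapping below are assumptions for illustration, not part of the documented API:

    from ebes.data.accessors import InMemoryPandasDataAccessor

    # Build an accessor over a parquet file and request the first split.
    # split_sizes is assumed to take split fractions; the path is hypothetical.
    accessor = InMemoryPandasDataAccessor(
        parquet_path="data/transactions.parquet",
        split_sizes=[0.7, 0.15, 0.15],
        random_split=True,
        split_seed=42,
    )
    train_df = accessor.get_split(0)  # pd.DataFrame; index-to-split order is assumed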

class ebes.data.accessors.PandasDataAccessor

Bases: ABC

Abstract class for all data accessors.

A data accessor is responsible for splitting the data into train/test/etc., filtering it and exposing a pd.DataFrame interface to it. The splits are configured in subclass __init__ methods and accessed by their index. Each subclass splits the data (using any specified strategy) and returns a split by its positive index as the result of the get_split method.

abstract get_split(split_idx)

Get split by its index.

Parameters:

split_idx (int) – positive index of split.

Return type:

DataFrame | Sequence[DataFrame]

Returns:

A dataframe or a sequence of dataframes with the data in the given split.
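
As an illustration of this contract, a minimal subclass might look as follows. This is a sketch, not a class shipped with the library:

    import pandas as pd
    from ebes.data.accessors import PandasDataAccessor

    class HalfSplitAccessor(PandasDataAccessor):
        """Toy accessor that splits a dataframe in half by row order."""

        def __init__(self, df: pd.DataFrame):
            self._df = df

        def get_split(self, split_idx: int) -> pd.DataFrame:
            # Splits are configured in __init__ and returned here by index.
            half = len(self._df) // 2
            if split_idx == 0:
                return self._df.iloc[:half]
            if split_idx == 1:
                return self._df.iloc[half:]
            raise IndexError(f"no split with index {split_idx}")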

ebes.data.batch_tfs module

Batch transforms for data loading pipelines.

class ebes.data.batch_tfs.BatchTransform

Bases: ABC

Base class for all batch transforms.

A BatchTransform is a callable object that modifies a Batch in place.
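
A sketch of a custom transform, assuming the Batch object exposes a num_features tensor of shape (max_seq_len, batch_size, n_features), as the transforms documented below suggest:

    from ebes.data.batch_tfs import BatchTransform

    class ClampNumFeatures(BatchTransform):
        """Illustrative transform: clamp numerical features to a fixed range."""

        def __init__(self, low: float = -10.0, high: float = 10.0):
            self.low = low
            self.high = high

        def __call__(self, batch) -> None:
            # Modify the batch in place, as the BatchTransform contract requires.
            if batch.num_features is not None:
                batch.num_features.clamp_(self.low, self.high)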

class ebes.data.batch_tfs.CatToNum

Bases: BatchTransform

Process categorical features as numerical.

Treat categorical features as numerical (a plain type cast). Category 0 is converted to NaN.

class ebes.data.batch_tfs.ContrastiveTarget

Bases: BatchTransform

Set target for contrastive losses.

The new target is a LongTensor such that items with different indices have different target labels.

class ebes.data.batch_tfs.DatetimeToFloat(loc, scale)

Bases: BatchTransform

Cast time from np.datetime64 to float by rescaling.

loc: str | datetime64

Location to subtract. If a string is passed, it is converted to np.datetime64 beforehand.

scale: tuple[int, str] | timedelta64

Scale to divide time by. If a tuple is passed, it is forwarded to np.timedelta64: the first item is the value and the second is the unit.
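
For example, to express time as days since an (illustrative) reference date:

    import numpy as np
    from ebes.data.batch_tfs import DatetimeToFloat

    # "2020-01-01" is converted to np.datetime64, and (1, "D") is passed to
    # np.timedelta64, so time becomes "days since 2020-01-01".
    to_days = DatetimeToFloat(loc="2020-01-01", scale=(1, "D"))
    # Equivalent explicit form:
    to_days = DatetimeToFloat(loc=np.datetime64("2020-01-01"), scale=np.timedelta64(1, "D"))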

class ebes.data.batch_tfs.FillNans(fill_value)

Bases: BatchTransform

Fill NaNs with specified values.

fill_value: Mapping[str, float] | float

If a float is given, all NaNs in all numerical features are replaced with fill_value. A mapping sets feature-specific replacement values.
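
For instance (the feature names are hypothetical):

    from ebes.data.batch_tfs import FillNans

    fill_all = FillNans(0.0)                             # every NaN in every numerical feature -> 0.0
    fill_some = FillNans({"amount": 0.0, "rate": -1.0})  # per-feature replacement values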

class ebes.data.batch_tfs.ForwardFillNans(backward=False)

Bases: BatchTransform

Fill NaN values by propagating the last non-NaN values forward.

The algorithm starts from the second time step. If some values are NaNs, the values from the previous step are used to fill them. If the first time step contains NaNs, some NaNs will remain unfilled after the forward pass. To handle this, backward=True may be specified to fill the remaining NaN values from last to first after the forward pass. Even after a backward pass the batch may still contain NaNs if some feature consists entirely of NaN values; use the FillNans transform to fill those.

backward: bool = False

Whether to do a backward fill after the forward fill (see the class description).
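
The forward pass can be pictured as follows. This is an illustrative re-implementation over a (max_seq_len, batch_size, n_features) tensor, not the library's internal code:

    import torch

    def forward_fill(x: torch.Tensor) -> torch.Tensor:
        x = x.clone()
        for t in range(1, x.shape[0]):
            nan_mask = torch.isnan(x[t])
            x[t][nan_mask] = x[t - 1][nan_mask]
        # NaNs in the first time step (t = 0) are still present here; a backward
        # pass (backward=True) or a FillNans transform is needed to remove them.
        return x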

class ebes.data.batch_tfs.Logarithm(names)

Bases: BatchTransform

Apply the natural logarithm to the specified features.

names: list[str]

Feature names to transform by taking the logarithm.

class ebes.data.batch_tfs.MaskValid

Bases: BatchTransform

Add mask indicating valid values to batch.

The mask has shape (max_seq_len, batch_size, n_features) and is True where the values are non-NaN (nonzero category for categorical features) and where the data is not padded.

class ebes.data.batch_tfs.PrimeNetSampler(mask_ratio_per_seg=0.05, segment_num=3, pretrain_tasks='full2')

Bases: BatchTransform

Contrastive sampling according to PrimeNet.

Input:

batch: Batch. Masks are required.

Feature shapes are doubled along the batch dimension:

  • batch.num_features: (T, B, D) -> (T, 2B, D)

  • batch.cat_features: (T, B, D) -> (T, 2B, D)

Masks gain an additional dimension for the contrastive and interpolation tasks:

  • batch.num_mask: (T, B, D) -> (T, 2B, D, 2)

  • batch.cat_mask: (T, B, D) -> (T, 2B, D, 2)

dense_sampling_bound = [0.4, 0.6]

len_sampling_bound = [0.3, 0.7]

mask_ratio_per_seg: float = 0.05

pretrain_tasks: str = 'full2'

segment_num: int = 3

class ebes.data.batch_tfs.RandomEventsPermutation(keep_last=False)

Bases: BatchTransform

Permute events in sequence randomly.

Time, target and masks are left unchanged.

keep_last: bool = False

If True, the last event stays in place and the others are permuted.

class ebes.data.batch_tfs.RandomSlices(split_count, cnt_min, cnt_max, short_seq_crop_rate=1.0, seed=None)

Bases: BatchTransform

Sample random slices from input sequences.

The transform is taken from https://github.com/dllllb/coles-paper. It samples random slices from initial sequences. The batch size after this transform will be split_count times larger.

cnt_max: int

Maximal sample sequence length.

cnt_min: int

Minimal sample sequence length.

seed: int | None = None

Value to seed the random generator.

short_seq_crop_rate: float = 1.0

Must be in (0, 1]. If short_seq_crop_rate < 1 and a sequence shorter than cnt_min is encountered, the minimum sample length for that sequence is set to short_seq_crop_rate times its actual length.

split_count: int

How many sample slices to draw for each input sequence.
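
A configuration sketch (the numbers are illustrative):

    from ebes.data.batch_tfs import RandomSlices

    # Draw two random sub-slices of 20 to 150 events from every input sequence;
    # the batch becomes split_count (= 2) times larger.
    slicer = RandomSlices(split_count=2, cnt_min=20, cnt_max=150, seed=0)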

class ebes.data.batch_tfs.RandomTime

Bases: BatchTransform

Replace time with uniformly distributed values.

class ebes.data.batch_tfs.Rescale(name, loc, scale)

Bases: BatchTransform

Rescale feature: subtract location and divide by scale.

loc: Any

Value to subtract from the feature values.

name: str

Feature name.

scale: Any

Value to divide the feature values by.

class ebes.data.batch_tfs.RescaleTime(loc, scale)

Bases: BatchTransform

Rescale time: subtract location and divide by scale.

loc: float

Location to subtract from time.

scale: float

Scale to divide time by.

class ebes.data.batch_tfs.TargetToLong

Bases: BatchTransform

Cast the target to a LongTensor.

class ebes.data.batch_tfs.TimeToFeatures(process_type='none', time_name='time')

Bases: BatchTransform

Add time to numerical features.

To apply this transform, time must first be cast to a Tensor. It has to be applied BEFORE mask creation and AFTER the datetime-to-float conversion (e.g. DatetimeToFloat).

process_type: Literal['cat', 'diff', 'none'] = 'none'

How to add time to features. The options are:

  • "cat" — add absolute time to other numerical features,

  • "diff" — add time intervals between sequential events. In this case the first interval in a sequence equals zero.

  • "none" — do not add time to features. This option is added for the ease of optuna usage.

time_name: str = 'time'

Name of the new feature with time. Default: "time".
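
For example, to feed inter-event intervals to the model (assuming time has already been cast to a tensor earlier in the pipeline):

    from ebes.data.batch_tfs import TimeToFeatures

    # Appends time differences between consecutive events as a numerical feature;
    # must run before mask creation and after the datetime-to-float conversion.
    add_dt = TimeToFeatures(process_type="diff")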

class ebes.data.batch_tfs.UnsqueezeTarget

Bases: BatchTransform

Unsqueeze last dimension in target array.

The last linear layer for a regression task produces tensors of shape (bs, 1). When MSE loss is called with a target of shape (bs,), PyTorch broadcasts the operands to shape (bs, bs) and the loss is computed incorrectly. This batch transform reshapes the target to (bs, 1), so the MSE loss is computed correctly.
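
The broadcasting issue can be reproduced directly in PyTorch:

    import torch
    import torch.nn.functional as F

    pred = torch.randn(8, 1)   # regression head output of shape (bs, 1)
    target = torch.randn(8)    # raw target of shape (bs,)

    # Shapes (8, 1) and (8,) broadcast to (8, 8): PyTorch emits a warning and
    # the loss value is wrong.
    wrong = F.mse_loss(pred, target)
    # With the target reshaped to (bs, 1), as UnsqueezeTarget does, the loss is correct.
    right = F.mse_loss(pred, target.unsqueeze(-1))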

ebes.data.datasets module

class ebes.data.datasets.SeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)

Bases: IterableDataset

An iterable dataset over the DataFrame rows.

class ebes.data.datasets.SizedSeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)

Bases: SeriesDataset

The same as SeriesDataset, but with the __len__ method implemented.

ebes.data.datasets.series(df)

Return the DataFrame rows as a list of Series.

Return type:

list[Series]

ebes.data.loading module

class ebes.data.loading.SequenceCollator(*, time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')

Bases: object

batch_transforms: list[Callable[[Batch], None]] | None = None

cat_cardinalities: Mapping[str, int] | None = None

index_name: str | None = None

max_seq_len: int = 0

num_names: list[str] | None = None

padding_type: str = 'zeros'

target_name: str | list[str] | None = None

time_name: str
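
A usage sketch. Feature names, cardinalities and the transform order are illustrative, and passing the collator to a DataLoader as collate_fn is an assumption based on its role in the loading pipeline:

    from ebes.data.loading import SequenceCollator
    from ebes.data.batch_tfs import DatetimeToFloat, FillNans, TimeToFeatures

    collator = SequenceCollator(
        time_name="event_time",
        cat_cardinalities={"mcc": 350},
        num_names=["amount"],
        target_name="label",
        max_seq_len=512,
        batch_transforms=[
            DatetimeToFloat(loc="2020-01-01", scale=(1, "D")),
            TimeToFeatures(process_type="diff"),
            FillNans(0.0),
        ],
    )
    # Typically passed to a torch DataLoader as collate_fn (assumed wiring).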

ebes.data.utils module

ebes.data.utils.build_loaders(dataset, loaders, preprocessing)

Return type:

Mapping[str, DataLoader]

ebes.data.utils.get_accessor(parquet_path, split_sizes, split_by_col=None, random_split=False, split_seed=None)

ebes.data.utils.get_collator(time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')

Return type:

SequenceCollator
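
A sketch of the factory helpers; the path, feature names and split fractions are hypothetical:

    from ebes.data.utils import get_accessor, get_collator

    accessor = get_accessor(
        parquet_path="data/transactions.parquet",
        split_sizes=[0.8, 0.2],      # assumed to be split fractions
        random_split=True,
        split_seed=0,
    )
    collator = get_collator(
        time_name="event_time",
        cat_cardinalities={"mcc": 350},
        num_names=["amount"],
        target_name="label",
        max_seq_len=512,
    )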

ebes.data.utils.get_loader(accessor, collators, split_idx, preprocessing, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None, num_workers=0, labeled=True)

Return type:

DataLoader

Module contents