ebes.data package
Submodules
ebes.data.accessors module
The module contains classes that expose a pd.DataFrame interface to datasets.
- class ebes.data.accessors.InMemoryPandasDataAccessor(*, parquet_path, split_sizes, data_queries=None, split_by_col=None, random_split=False, split_seed=None)
Bases:
PandasDataAccessor
Data accessor that keeps all data in memory.
- get_split(split_idx)
Get split by its index.
- Parameters:
split_idx – positive index of split.
- Return type:
DataFrame
- Returns:
A dataframe or a sequence of dataframes with data in given split.
- class ebes.data.accessors.PandasDataAccessor
Bases:
ABC
Abstract class for all data accessors.
A data accessor is responsible for splitting the data into train, test, and any other subsets, filtering the data, and exposing a pd.DataFrame interface to it. The splits are configured in subclass __init__ methods and accessed by index. Each subclass splits the data (using any specified strategy), and its get_split method returns a split by its positive index.
- abstract get_split(split_idx)
Get split by its index.
- Parameters:
split_idx (int) – positive index of split.
- Return type:
DataFrame | Sequence[DataFrame]
- Returns:
A dataframe or a sequence of dataframes with data in given split.
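The accessor contract described above can be illustrated with a minimal sketch. Only the get_split(split_idx) method name comes from this reference; the ToyAccessorBase and ToyAccessor classes, the fractional meaning of split_sizes, and the rejection of negative indices are assumptions made for illustration and are not taken from the ebes code.

```python
from abc import ABC, abstractmethod

import pandas as pd


class ToyAccessorBase(ABC):
    """Hypothetical stand-in for the abstract accessor interface."""

    @abstractmethod
    def get_split(self, split_idx: int) -> pd.DataFrame:
        """Return the split with the given index."""


class ToyAccessor(ToyAccessorBase):
    """Splits a DataFrame sequentially by fractional sizes (assumed semantics)."""

    def __init__(self, df: pd.DataFrame, split_sizes: list[float]):
        if abs(sum(split_sizes) - 1.0) > 1e-9:
            raise ValueError("split_sizes must sum to 1")
        self._splits = []
        start, cumulative = 0, 0.0
        for size in split_sizes:
            # Use cumulative fractions so that rounding never loses rows.
            cumulative += size
            stop = round(cumulative * len(df))
            self._splits.append(df.iloc[start:stop])
            start = stop

    def get_split(self, split_idx: int) -> pd.DataFrame:
        # Assumption: negative indices are rejected instead of wrapping around.
        if split_idx < 0:
            raise IndexError("split_idx must not be negative")
        return self._splits[split_idx]


frame = pd.DataFrame({"x": range(10)})
accessor = ToyAccessor(frame, [0.8, 0.2])
train, test = accessor.get_split(0), accessor.get_split(1)
```

The splits are computed once in __init__ and only looked up in get_split, which mirrors the division of responsibilities described for the abstract class.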
ebes.data.batch_tfs module
Batch transforms for data loading pipelines.
- class ebes.data.batch_tfs.BatchTransform
Bases:
ABC
Base class for all batch transforms.
The BatchTransform is a Callable object that modifies Batch in-place.
- class ebes.data.batch_tfs.CatToNum
Bases:
BatchTransform
Process categorical features as numerical.
Treat categorical features as numerical (just type cast). Category 0 is converted to NaN value.
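The cast described above can be sketched with NumPy. This is an illustration on a plain array, not the ebes implementation, which operates on a Batch object; the array contents are made up.

```python
import numpy as np

# Integer-encoded categorical features; category 0 is treated as "missing".
cat = np.array([[0, 2], [1, 0], [3, 1]])

# Plain type cast to float, then map category 0 to NaN.
num = cat.astype(float)
num[cat == 0] = np.nan
```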
- class ebes.data.batch_tfs.ContrastiveTarget
Bases:
BatchTransform
Set target for contrastive losses.
New target is LongTensor such that items with different indices have different target labels.
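One way to build such a target is to map every distinct index value to a consecutive integer label. The sketch below uses NumPy rather than torch, and it assumes that items sharing the same index should share a label, which the description implies but does not state.

```python
import numpy as np

# Hypothetical sequence indices of the items in a batch; "u1" appears twice.
index = np.array(["u1", "u2", "u1", "u3"])

# return_inverse gives, for every item, the position of its index value
# among the sorted unique values, i.e. an integer label per distinct index.
_, target = np.unique(index, return_inverse=True)
```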
- class ebes.data.batch_tfs.DatetimeToFloat(loc, scale)
Bases:
BatchTransform
Cast time from np.datetime64 to float by rescaling.
- loc: str | datetime64
Location to subtract. If a string is passed, it is converted to np.datetime64 beforehand.
- scale: tuple[int, str] | timedelta64
Scale to divide time by. If a tuple is passed, it is passed to the np.timedelta64 function. The first item is a value and the second is a unit.
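The rescaling amounts to (time - loc) / scale on datetime values. The helper below is a hypothetical sketch of that arithmetic on a plain NumPy array; the function name datetime_to_float and the sample timestamps are assumptions, and the real transform works on a Batch.

```python
import numpy as np


def datetime_to_float(times, loc, scale):
    """Compute (times - loc) / scale as floats (hypothetical helper)."""
    if isinstance(loc, str):
        loc = np.datetime64(loc)
    if isinstance(scale, tuple):
        # The tuple is (value, unit), e.g. (1, "D") for one day.
        scale = np.timedelta64(*scale)
    # Dividing a timedelta64 array by a timedelta64 scalar yields float64.
    return (times - loc) / scale


times = np.array(
    ["2020-01-01T00:00", "2020-01-01T12:00", "2020-01-02T00:00"],
    dtype="datetime64[s]",
)
days = datetime_to_float(times, "2020-01-01", (1, "D"))
```

Here days holds the elapsed time in days since loc, so noon on the first day becomes 0.5.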
- class ebes.data.batch_tfs.FillNans(fill_value)
Bases:
BatchTransform
Fill NaNs with specified values.
- fill_value: Mapping[str, float] | float
If float, all NaNs in all numerical features will be replaced with the fill_value. Mapping sets feature-specific replacement values.
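The two forms of fill_value can be sketched as follows. The fill_nans helper is hypothetical and works on a dict of NumPy arrays. Unlike a BatchTransform it returns a copy instead of modifying its input, and leaving features that are absent from the mapping untouched is an assumption.

```python
from collections.abc import Mapping

import numpy as np


def fill_nans(features, fill_value):
    """Return a copy of the feature dict with NaNs replaced (hypothetical helper)."""
    filled = {}
    for name, values in features.items():
        if isinstance(fill_value, Mapping):
            if name not in fill_value:
                # Assumption: features missing from the mapping are left as-is.
                filled[name] = values.copy()
                continue
            value = fill_value[name]
        else:
            value = fill_value
        filled[name] = np.where(np.isnan(values), value, values)
    return filled


features = {"amount": np.array([1.0, np.nan]), "age": np.array([np.nan, 2.0])}
all_zero = fill_nans(features, 0.0)
per_feature = fill_nans(features, {"amount": -1.0})
```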
- class ebes.data.batch_tfs.ForwardFillNans(backward=False)
Bases:
BatchTransform
Fill NaN values by propagating forward the last non-NaN values.
The algorithm starts from the second step. If some values are NaNs, the values from the previous step are used to fill them. If the first time step contains NaNs, some NaNs will not be filled after the forward pass. To handle this, backward=True may be specified to fill the remaining NaN values from last to first after the forward pass. Even after a backward pass, the batch may still contain NaNs if some feature has all NaN values. To fill those, use the FillNans transform.
- backward: bool = False
Whether to do a backward fill after the forward fill (see the class description).
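The forward and backward passes described above can be sketched on a NumPy array whose first axis is time. The forward_fill function is a hypothetical illustration that returns a copy; it is not the ebes implementation.

```python
import numpy as np


def forward_fill(x, backward=False):
    """Fill NaNs along axis 0 (time) in a copy of x (hypothetical helper)."""
    x = x.copy()
    # Forward pass: start from the second step and reuse the previous step.
    for t in range(1, len(x)):
        x[t] = np.where(np.isnan(x[t]), x[t - 1], x[t])
    if backward:
        # Backward pass: fill what is left, going from last to first.
        for t in range(len(x) - 2, -1, -1):
            x[t] = np.where(np.isnan(x[t]), x[t + 1], x[t])
    return x


nan = np.nan
# Shape (T=4, n_features=3); the last feature is NaN at every step.
x = np.array(
    [
        [nan, 5.0, nan],
        [1.0, nan, nan],
        [nan, nan, nan],
        [3.0, 8.0, nan],
    ]
)
forward_only = forward_fill(x)
both_passes = forward_fill(x, backward=True)
```

In this example the first feature still starts with a NaN after the forward pass, the backward pass removes it, and the all-NaN third feature stays unfilled in both cases.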
- class ebes.data.batch_tfs.Logarithm(names)
Bases:
BatchTransform
Apply natural logarithm to the specified features.
- names: list[str]
Feature names to transform by taking the logarithm.
- class ebes.data.batch_tfs.MaskValid
Bases:
BatchTransform
Add mask indicating valid values to batch.
Mask has shape (max_seq_len, batch_size, n_features) and has True values where there are non-NaN values (nonzero category) and where the data is not padded.
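A mask of this shape can be sketched for numerical features as follows. The valid_mask helper and its seq_lens argument (true sequence lengths used to detect padding) are assumptions; for categorical features the NaN test would be replaced by a comparison with category 0.

```python
import numpy as np


def valid_mask(values, seq_lens):
    """Build a boolean mask of shape (T, B, D) (hypothetical helper).

    values: float array (T, B, D) with NaN for missing entries.
    seq_lens: int array (B,) with the unpadded length of each sequence.
    """
    max_seq_len = values.shape[0]
    # (T, B): True where the time step lies inside the real sequence.
    not_padded = np.arange(max_seq_len)[:, None] < seq_lens[None, :]
    # Broadcast the padding mask over the feature dimension.
    return ~np.isnan(values) & not_padded[:, :, None]


values = np.zeros((3, 2, 2))
values[0, 0, 1] = np.nan          # one missing value
seq_lens = np.array([3, 2])       # second sequence has one padded step
mask = valid_mask(values, seq_lens)
```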
- class ebes.data.batch_tfs.PrimeNetSampler(mask_ratio_per_seg=0.05, segment_num=3, pretrain_tasks='full2')
Bases:
BatchTransform
Contrastive sampling according to PrimeNet.
- Input:
batch: Batch. Masks required.
batch.num_features (T, B, D) -> (T, 2B, D)
batch.cat_features (T, B, D) -> (T, 2B, D)
Masks have an additional dim for contrastive and interpolation:
batch.num_mask (T, B, D) -> (T, 2B, D, 2)
batch.cat_mask (T, B, D) -> (T, 2B, D, 2)
- dense_sampling_bound = [0.4, 0.6]
- len_sampling_bound = [0.3, 0.7]
- mask_ratio_per_seg: float = 0.05
- pretrain_tasks: str = 'full2'
- segment_num: int = 3
- class ebes.data.batch_tfs.RandomEventsPermutation(keep_last=False)
Bases:
BatchTransform
Permute events in sequence randomly.
Time, target and masks are left unchanged.
- keep_last: bool = False
If True, the last event remains in its place and the others are permuted.
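The effect of keep_last can be sketched for a single padded sequence. The permute_events helper, its seq_len argument, and the choice to leave padded rows in place are assumptions for illustration; only the feature rows are reordered, in line with time, target and masks being left unchanged.

```python
import numpy as np


def permute_events(features, seq_len, keep_last=False, rng=None):
    """Shuffle the first seq_len rows of one sequence (hypothetical helper).

    features: array (T, D); rows from seq_len onward are padding.
    """
    rng = np.random.default_rng() if rng is None else rng
    # With keep_last, the final real event is excluded from the shuffle.
    n = seq_len - 1 if keep_last else seq_len
    out = features.copy()
    if n > 1:
        out[:n] = features[rng.permutation(n)]
    return out


features = np.arange(10).reshape(5, 2)   # 4 real events plus 1 padded row
shuffled = permute_events(
    features, seq_len=4, keep_last=True, rng=np.random.default_rng(0)
)
```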
- class ebes.data.batch_tfs.RandomSlices(split_count, cnt_min, cnt_max, short_seq_crop_rate=1.0, seed=None)
Bases:
BatchTransform
Sample random slices from input sequences.
The transform is taken from https://github.com/dllllb/coles-paper. It samples random slices from initial sequences. The batch size after this transform will be split_count times larger.
- cnt_max: int
Maximal sample sequence length.
- cnt_min: int
Minimal sample sequence length.
- seed: int | None = None
Value to seed the random generator.
- short_seq_crop_rate: float = 1.0
Must be from (0, 1]. If short_seq_crop_rate < 1 and a sequence of length less than cnt_min is encountered, the minimum sample length for this sequence is set to short_seq_crop_rate times the actual sequence length.
- split_count: int
How many sample slices to draw for each input sequence.
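How the parameters interact can be sketched for a single sequence. The sample_slices function below is a hypothetical reading of the parameter descriptions: the uniform choice of slice length and start position, and the clipping of both bounds to the sequence length, are assumptions and may differ from the referenced implementation.

```python
import random


def sample_slices(seq_len, split_count, cnt_min, cnt_max,
                  short_seq_crop_rate=1.0, seed=None):
    """Return split_count (start, stop) pairs for one sequence (hypothetical helper).

    Assumes seq_len >= 1 and cnt_min <= cnt_max.
    """
    rng = random.Random(seed)
    # A slice can never be longer than the sequence itself.
    longest = min(cnt_max, seq_len)
    shortest = min(cnt_min, seq_len)
    if short_seq_crop_rate < 1.0 and seq_len < cnt_min:
        # Short sequence: lower the minimum to a fraction of its length.
        shortest = max(1, int(seq_len * short_seq_crop_rate))
    slices = []
    for _ in range(split_count):
        length = rng.randint(shortest, longest)
        start = rng.randint(0, seq_len - length)
        slices.append((start, start + length))
    return slices


regular = sample_slices(100, split_count=5, cnt_min=10, cnt_max=20, seed=0)
short_full = sample_slices(6, split_count=4, cnt_min=10, cnt_max=20, seed=0)
short_cropped = sample_slices(6, split_count=4, cnt_min=10, cnt_max=20,
                              short_seq_crop_rate=0.5, seed=0)
```

With the default short_seq_crop_rate of 1.0, a sequence shorter than cnt_min is always returned whole in this sketch; a smaller rate allows shorter sub-slices of it.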
- class ebes.data.batch_tfs.RandomTime
Bases:
BatchTransform
Replace time with uniformly distributed values.
- class ebes.data.batch_tfs.Rescale(name, loc, scale)
Bases:
BatchTransform
Rescale feature: subtract location and divide by scale.
- loc: Any
Value to subtract from the feature values.
- name: str
Feature name.
- scale: Any
Value to divide the feature values by.
- class ebes.data.batch_tfs.RescaleTime(loc, scale)
Bases:
BatchTransform
Rescale time: subtract location and divide by scale.
- loc: float
Location to subtract from time.
- scale: float
Scale to divide time by.
- class ebes.data.batch_tfs.TargetToLong
Bases:
BatchTransform
Cast target to LongTensor.
- class ebes.data.batch_tfs.TimeToFeatures(process_type='none', time_name='time')
Bases:
BatchTransform
Add time to numerical features.
To apply this transform, first cast time to Tensor. It has to be applied BEFORE mask creation and AFTER DatetoTime.
- process_type: Literal['cat', 'diff', 'none'] = 'none'
How to add time to features. The options are:
"cat" – add absolute time to other numerical features,
"diff" – add time intervals between sequential events. In this case the first interval in a sequence equals zero.
"none" – do not add time to features. This option is added for ease of use with optuna.
- time_name: str = 'time'
Name of the new feature with time, default "time".
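The three process_type options can be sketched on a NumPy array of times with shape (T, B). The time_feature helper is hypothetical and returns the new feature instead of writing it into a Batch.

```python
import numpy as np


def time_feature(times, process_type="none"):
    """Build the time feature for times of shape (T, B) (hypothetical helper)."""
    if process_type == "cat":
        # Absolute time is used as a feature unchanged.
        return times.copy()
    if process_type == "diff":
        # Prepending the first step makes the first interval equal to zero.
        return np.diff(times, axis=0, prepend=times[:1])
    if process_type == "none":
        return None
    raise ValueError(f"unknown process_type: {process_type!r}")


times = np.array([[0.0, 10.0], [1.0, 12.0], [4.0, 15.0]])
absolute = time_feature(times, "cat")
intervals = time_feature(times, "diff")
```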
- class ebes.data.batch_tfs.UnsqueezeTarget
Bases:
BatchTransform
Unsqueeze last dimension in target array.
Last linear layer for regression task produces tensors of shape (bs, 1). When calling MSE loss with target of shape (bs,), PyTorch expands it to the shape (bs, bs) and loss is computed incorrectly. This batch transform reshapes the target to (bs, 1), so MSE loss is computed correctly.
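The shape problem can be reproduced with NumPy, whose broadcasting rules match PyTorch's for this case. The arrays below are made up, and the squared error is computed by hand rather than with a PyTorch loss.

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])   # model output, shape (bs, 1)
target = np.array([1.0, 2.0, 3.0])       # target, shape (bs,)

# Mismatched shapes broadcast to (bs, bs): every prediction is compared
# with every target, so the error is nonzero even for perfect predictions.
wrong = (pred - target) ** 2

# Unsqueezing the last dimension keeps the comparison element-wise.
right = (pred - target[:, None]) ** 2
```

Although the predictions equal the targets, wrong.mean() is positive while right.mean() is zero.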
ebes.data.datasets module
- class ebes.data.datasets.SeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)
Bases:
IterableDataset
An iterable dataset over the DataFrame rows.
- class ebes.data.datasets.SizedSeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)
Bases:
SeriesDataset
The same as SeriesDataset, but has __len__ method implemented.
- ebes.data.datasets.series(df)
Return a list of the DataFrame rows, each as a Series.
- Return type:
list[Series]
ebes.data.loading module
- class ebes.data.loading.SequenceCollator(*, time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')
Bases:
object
- cat_cardinalities: Mapping[str, int] | None = None
- index_name: str | None = None
- max_seq_len: int = 0
- num_names: list[str] | None = None
- padding_type: str = 'zeros'
- target_name: str | list[str] | None = None
- time_name: str
ebes.data.utils module
- ebes.data.utils.build_loaders(dataset, loaders, preprocessing)
- Return type:
Mapping[str, DataLoader]
- ebes.data.utils.get_accessor(parquet_path, split_sizes, split_by_col=None, random_split=False, split_seed=None)
- ebes.data.utils.get_collator(time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None, padding_type='zeros')
- Return type:
- ebes.data.utils.get_loader(accessor, collators, split_idx, preprocessing, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None, num_workers=0, labeled=True)
- Return type:
DataLoader