realtabformer.rtf_analyze

Module Contents

Classes

SyntheticDataBench

This class handles all the assessments needed for testing the synthetic data.

SyntheticDataExperiment

For each data and model:

class realtabformer.rtf_analyze.SyntheticDataBench(data: pandas.DataFrame, target_col: str, categorical: bool, target_pos_val: Any = None, test_size: float = 0.2, test_df: pandas.DataFrame | None = None, random_state: int = 1029)

This class handles all the assessments needed for testing the synthetic data.

register_synthetic_data(synthetic: pandas.DataFrame)

Registers synthetic data for the assessment.

The synthetic data is split into training and test sets according to the values of n_train and n_test. The split is done by sampling the data without replacement.

Args: synthetic: A DataFrame containing synthetic data. The DataFrame must have at least as many rows as n_train + n_test.

Returns: None
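The split described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper name split_synthetic is hypothetical, and n_train and n_test are passed explicitly here, whereas the class presumably derives them from its own train and test data.

```python
import pandas as pd


def split_synthetic(synthetic: pd.DataFrame, n_train: int, n_test: int,
                    random_state: int = 1029):
    """Hypothetical helper: draw n_train + n_test rows, then split them."""
    if len(synthetic) < n_train + n_test:
        raise ValueError("synthetic must have at least n_train + n_test rows")
    # Sampling without replacement guarantees no row lands in both sets.
    sampled = synthetic.sample(n=n_train + n_test, replace=False,
                               random_state=random_state)
    return sampled.iloc[:n_train], sampled.iloc[n_train:]


synthetic = pd.DataFrame({"x": range(10)})
syn_train, syn_test = split_synthetic(synthetic, n_train=6, n_test=3)
```

Because the rows are drawn in a single sample call, the two parts are disjoint by construction.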

static compute_distance_to_closest_records(original: pandas.DataFrame, synthetic: pandas.DataFrame, n_test: int, distance: sklearn.metrics.pairwise.manhattan_distances = manhattan_distances) -> pandas.Series

Parameters:
  • original – The dataframe of the training data used to train the generative model.

  • synthetic – The dataframe generated by the generative model or any data we want to compare with the original data.

  • n_test – The number of observations from the synthetic data that we want to compare with the original data. Ideally, this should be the same size as the test data.
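As a rough illustration of a distance-to-closest-record (DCR) computation, the sketch below takes the minimum Manhattan distance from each synthetic row to any original row. It assumes purely numerical inputs and simply uses the first n_test synthetic rows; how the library selects rows and preprocesses mixed-type data is not shown in this page, and the function name is hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import manhattan_distances


def distance_to_closest_records(original, synthetic, n_test,
                                distance=manhattan_distances):
    # Assumption: take the first n_test synthetic rows (the library may sample).
    candidates = synthetic.iloc[:n_test]
    # Pairwise matrix of shape (n_test, n_original).
    pairwise = distance(candidates.to_numpy(), original.to_numpy())
    # For each synthetic row, keep the distance to its nearest original row.
    return pd.Series(pairwise.min(axis=1), index=candidates.index, name="dcr")


original = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [0.0, 1.0, 2.0]})
synthetic = pd.DataFrame({"a": [0.0, 1.5, 5.0], "b": [0.0, 1.0, 5.0]})
dcr = distance_to_closest_records(original, synthetic, n_test=3)
```

A DCR of 0 (as for the first synthetic row here) means the synthetic record exactly duplicates an original record.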

static measure_ml_efficiency(model: sklearn.base.BaseEstimator, train: pandas.DataFrame, synthetic: pandas.DataFrame, test: pandas.DataFrame, target_col: str, random_state: int = 1029) -> pandas.DataFrame

This function trains the provided model on the original and synthetic training data, and then uses the trained models to make predictions on the test data. It returns a dataframe containing the actual values and predictions from both training sets. This dataframe can be used to compare the performance of the model trained on the original data with the model trained on the synthetic data.

Parameters:
  • model (sklearn.base.BaseEstimator) – The model to be trained and used for prediction.

  • train (pd.DataFrame) – The original training data.

  • synthetic (pd.DataFrame) – The synthetic training data generated by a generative model. Must have the same size as the train data.

  • test (pd.DataFrame) – The test data to be used for prediction.

  • target_col (str) – The name of the target column in the train and test data.

Returns:

A dataframe containing the actual values and predictions from both training sets.

Return type:

pd.DataFrame
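The train-on-both, predict-on-test comparison described above might look like the sketch below. It is an illustration under assumptions, not the library's code: the function name and the output column names (actual, pred_original, pred_synthetic) are hypothetical, and the data is assumed to be numeric already.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def ml_efficiency_sketch(model, train, synthetic, test, target_col):
    features = [c for c in train.columns if c != target_col]
    # Clone the estimator so the two fits do not share any state.
    model_orig = clone(model).fit(train[features], train[target_col])
    model_synth = clone(model).fit(synthetic[features], synthetic[target_col])
    return pd.DataFrame({
        "actual": test[target_col].to_numpy(),
        "pred_original": model_orig.predict(test[features]),
        "pred_synthetic": model_synth.predict(test[features]),
    })


train = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], "y": [0, 0, 0, 1, 1, 1]})
synthetic = pd.DataFrame({"x": [0.2, 0.9, 2.1, 3.2, 3.8, 5.1], "y": [0, 0, 0, 1, 1, 1]})
test = pd.DataFrame({"x": [0.5, 4.5], "y": [0, 1]})
result = ml_efficiency_sketch(LogisticRegression(), train, synthetic, test, "y")
```

Any metric (accuracy, F1, AUC) can then be computed twice from the returned frame, once per prediction column, to compare the two training sets.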

static preprocess_data(data: pandas.DataFrame, other: pandas.DataFrame | List[pandas.DataFrame] = None, fillna: bool = True) -> dict

Preprocesses a DataFrame containing mixed data types and returns a feature matrix.

The function first extracts the categorical and numerical columns from the DataFrame, and then applies a processing pipeline that one-hot encodes the categorical features and standardizes the numerical features.

Parameters:

data (pandas.DataFrame) – A DataFrame containing mixed data types.

Returns:

  • preprocessor: The trained feature processor pipeline.

  • column_names: The new column names for the processed data.

  • data: A feature matrix containing only numerical values for the input data.

  • other (optional): A feature matrix containing only numerical values for the input other.

Return type:

dict

static compute_discriminator_predictions(original: pandas.DataFrame, synthetic: pandas.DataFrame, test: pandas.DataFrame, model: sklearn.base.BaseEstimator, random_state: int = 1029) -> dict

Builds a discriminator model that attempts to distinguish between original and synthetic data.

The function first preprocesses the data by extracting the categorical and numerical columns, then applies a processing pipeline that one-hot encodes the categorical features and standardizes the numerical features. Next, it adds labels to the original and synthetic data to indicate which is which, then combines the data into one DataFrame and splits it into training and test sets. Finally, it trains a classifier model on the training data and returns the test labels together with the model's predictions for them.

Parameters:
  • original (pandas.DataFrame) – A DataFrame containing original data.

  • synthetic (pandas.DataFrame) – A DataFrame containing synthetic data.

  • test (pandas.DataFrame) – A DataFrame containing the held-out test data.

  • model (sklearn.base.BaseEstimator) – The scikit-learn estimator to train as the discriminator.

  • random_state (int) – The random seed to use for splitting the data. Defaults to 1029.

Returns:

  • y_test: Labels for the test/synthetic test data.

  • y_preds: Predictions for the label.

Return type:

dict
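The labeling, combining, splitting, and training steps in the description above can be sketched as follows. This follows the prose rather than the actual signature (which takes a separate test DataFrame), assumes numeric-only inputs so the preprocessing step is skipped, and uses a hypothetical function name and a hypothetical test_size argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def discriminator_sketch(original, synthetic, model, test_size=0.2,
                         random_state=1029):
    combined = pd.concat([original, synthetic], ignore_index=True)
    # Label original rows 1 and synthetic rows 0.
    labels = np.r_[np.ones(len(original), dtype=int),
                   np.zeros(len(synthetic), dtype=int)]
    X_train, X_test, y_train, y_test = train_test_split(
        combined, labels, test_size=test_size, stratify=labels,
        random_state=random_state)
    model.fit(X_train, y_train)
    # Probability that each held-out row is an original record.
    return {"y_test": y_test, "y_preds": model.predict_proba(X_test)[:, 1]}


rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(0.0, 1.0, size=(50, 2)), columns=["a", "b"])
synthetic = pd.DataFrame(rng.normal(0.5, 1.0, size=(50, 2)), columns=["a", "b"])
out = discriminator_sketch(original, synthetic, LogisticRegression())
```

If the synthetic data is a faithful imitation, a discriminator like this should perform close to chance on the held-out rows.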

get_dcr(is_test: bool = False, distance: sklearn.metrics.pairwise.manhattan_distances = manhattan_distances) -> pandas.Series

Get the DCR values for this experiment.

get_ml_efficiency(model: sklearn.base.BaseEstimator, synthetic: pandas.DataFrame = None) -> pandas.DataFrame

Get the ML efficiency for this experiment.

get_discriminator_performance(model: sklearn.base.BaseEstimator)

Compute the discriminator performance for this experiment.

static compute_data_copying_predictions(original: pandas.DataFrame, synthetic: pandas.DataFrame, test: pandas.DataFrame, model: sklearn.base.BaseEstimator, random_state: int = 1029) -> dict

Builds a discriminator model that attempts to distinguish between original and synthetic data.

The function first preprocesses the data by extracting the categorical and numerical columns, then applies a processing pipeline that one-hot encodes the categorical features and standardizes the numerical features. Next, it adds labels to the original and synthetic data to indicate which is which, then combines the data into one DataFrame and splits it into training and test sets. Finally, it trains a classifier model on the training data and returns the test labels together with the model's predictions for them.

Parameters:
  • original (pandas.DataFrame) – A DataFrame containing original data.

  • synthetic (pandas.DataFrame) – A DataFrame containing synthetic data.

  • test (pandas.DataFrame) – A DataFrame containing the held-out test data.

  • model (sklearn.base.BaseEstimator) – The scikit-learn estimator to train as the discriminator.

  • random_state (int) – The random seed to use for splitting the data. Defaults to 1029.

Returns:

  • y_test: Labels for the test/synthetic test data.

  • y_preds: Predictions for the label.

Return type:

dict

static compute_sensitivity_metric(original: pandas.DataFrame, synthetic: pandas.DataFrame, test: pandas.DataFrame, qt_max: float = 0.05, qt_interval: int = 1000, distance: sklearn.metrics.pairwise.manhattan_distances = manhattan_distances, tsvd: sklearn.decomposition.TruncatedSVD = None, max_col_nums: int = 50, use_ks: bool = False, verbose: bool = False) -> float

static compute_sensitivity_threshold(train_data: pandas.DataFrame, num_bootstrap: int = 100, test_size: int = None, frac: float = None, qt_max: float = 0.05, qt_interval: int = 1000, distance: sklearn.metrics.pairwise.manhattan_distances = manhattan_distances, tsvd: sklearn.decomposition.TruncatedSVD = None, return_values: bool = False, quantile: float = 0.95, max_col_nums: int = 50, use_ks: bool = False, full_sensitivity: bool = True, sensitivity_orig_frac_multiple: int = 3) -> float | List

This method implements a bootstrapped estimation of the sensitivity values derived from the training data.

We compute the sensitivity value over num_bootstrap rounds, each using a new random split of the training data.

Parameters:
  • quantile – Returns the sensitivity value at the given quantile from the bootstrap set. Note that we use quantile > 0.5 because we want to detect whether the synthetic data tends to be closer to the training data than expected. The statistic computes synth_min < test_min, so if the synthetic data systematically copies observations from the training data, we expect the statistic to become much larger than 0.

  • return_values – Instead of returning a single value based on the quantile argument, return the full set of bootstrap values.

  • sensitivity_orig_frac_multiple – The size of the training data relative to the chosen frac that will be used in computing the sensitivity. The larger this value is, the more robust the sensitivity threshold will be. However, (sensitivity_orig_frac_multiple + 2) multiplied by frac must be less than 1.
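To make the bootstrap idea and the size constraint concrete, here is a loose sketch. It rests on several assumptions that this page does not confirm: each round splits the training data into three disjoint parts (a reference part of multiple × frac rows plus two parts of frac rows each, which is one reading of the (sensitivity_orig_frac_multiple + 2) × frac < 1 constraint), and the statistic is a simple stand-in, the share of reference rows whose nearest pseudo-synthetic row is closer than their nearest pseudo-test row. The function name and the multiple argument are hypothetical, and the library's actual statistic is likely scaled differently.

```python
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances


def bootstrap_threshold_sketch(train, num_bootstrap=100, frac=0.1, multiple=3,
                               quantile=0.95, seed=1029):
    if (multiple + 2) * frac >= 1:
        raise ValueError("(multiple + 2) * frac must be less than 1")
    rng = np.random.default_rng(seed)
    n_part = int(len(train) * frac)
    values = []
    for _ in range(num_bootstrap):
        idx = rng.permutation(len(train))
        # Three disjoint parts: a reference set and two held-out sets.
        ref = train[idx[: multiple * n_part]]
        pseudo_synth = train[idx[multiple * n_part:(multiple + 1) * n_part]]
        pseudo_test = train[idx[(multiple + 1) * n_part:(multiple + 2) * n_part]]
        synth_min = manhattan_distances(ref, pseudo_synth).min(axis=1)
        test_min = manhattan_distances(ref, pseudo_test).min(axis=1)
        # Stand-in statistic: how often the pseudo-synthetic part is closer.
        values.append(float(np.mean(synth_min < test_min)))
    # The upper quantile serves as the threshold for "closer than expected".
    return float(np.quantile(values, quantile)), values


train = np.random.default_rng(0).normal(size=(200, 4))
threshold, values = bootstrap_threshold_sketch(train, num_bootstrap=50)
```

Since every part comes from the same real data, the bootstrap values describe how the statistic behaves when no copying occurs; a value computed on actual synthetic data that exceeds the returned threshold would then be a warning sign.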

class realtabformer.rtf_analyze.SyntheticDataExperiment(data_id: str, model_type: str, categorical: bool, target_col: str, target_pos_val: Any = None)

For each data and model:

  1. Split train/test data -> save data

  2. Train model with train data -> save model

  3. Generate N x train+test synthetic data -> save samples

  4. Perform analysis on the generated data.