# Synthetic Ocean AI (SynDataGen)

An advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models, including diffusion and adversarial architectures. Designed for researchers and practitioners, it provides reproducible pipelines, fine-grained control over model configuration, and integrated evaluation metrics for realistic data synthesis.

## 🧠 Architectures Supported

| Model | Description | Use Case |
|-------|-------------|----------|
| CGAN | Conditional GAN conditioned on labels or attributes | Class balancing, controlled generation |
| WGAN-GP | Wasserstein GAN with gradient penalty for stable training | Imbalanced datasets, complex distributions |
| Autoencoder | Latent-space learning through compression and reconstruction | Feature extraction, denoising |
| VAE | Probabilistic autoencoder with latent sampling | Probabilistic generation and imputation |
| Denoising Diffusion | Progressive noise-based generative model | Robust generation with high-quality samples |
| VQ-VAE | Discrete latent space via quantization | Categorical and mixed-type data |
| Copy/Paste | Simple sample-replication baseline (sketched below) | Sanity checks, baseline comparison |
| Kernel Diffusion | Experimental kernelized diffusion process (WIP) | Future work |
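
The Copy/Paste baseline needs no training at all: it simply resamples real rows with replacement, which makes it a useful sanity check for evaluation pipelines. A minimal sketch of the idea (illustrative, not the library's implementation):

```python
import numpy

def copy_paste_baseline(x_real, number_samples, seed=42):
    # Resample real rows with replacement; any metric that scores this
    # "generator" as perfect is probably leaking the training data.
    rng = numpy.random.default_rng(seed)
    indices = rng.integers(0, len(x_real), size=number_samples)
    return x_real[indices]
```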

## Architecture Overview

The Synthetic Ocean AI library provides several generative architectures:

| Architecture | Key Characteristics | Typical Use Cases |
|--------------|---------------------|-------------------|
| Denoising Probabilistic Diffusion | Iterative denoising process, high-quality outputs | High-fidelity data generation |
| Conditional GAN (CGAN) | Label-guided generation | Conditional data augmentation |
| Wasserstein GAN-GP | Stable training with gradient penalty | Robust generation tasks |
| Conditional Autoencoder | Deterministic reconstruction | Data compression, denoising |
| Variational Autoencoder | Probabilistic latent space | Diverse sample generation |

## Example Workflows
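
All of the workflows below assume that the real data is already in memory as `x_real_samples` (a 2-D feature array whose row width matches `input_shape`) and `y_real_samples` (integer class labels). A minimal loading sketch; the use of pandas and the file and column names are assumptions, not part of the library API:

```python
import pandas

# Hypothetical CSV with a "label" column; substitute your own dataset.
dataframe = pandas.read_csv("data.csv")
y_real_samples = dataframe["label"].to_numpy()
x_real_samples = dataframe.drop(columns=["label"]).to_numpy(dtype="float32")
```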

### Denoising Probabilistic Diffusion

```python
import numpy
import tensorflow
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from SynDataGen.Engine.Models.Diffusion.DiffusionModelUnet import UNetModel
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmDiffusion import DiffusionModel
from SynDataGen.Engine.Algorithms.Diffusion.GaussianDiffusion import GaussianDiffusion
from SynDataGen.Engine.Models.Diffusion.VariationalAutoencoderModel import VariationalModelDiffusion
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmVariationalAutoencoderDiffusion import VariationalAlgorithmDiffusion

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize UNet models
first_instance_unet = UNetModel(
    embedding_dimension=128,
    embedding_channels=1,
    list_neurons_per_level=[1, 2, 4],
    list_attentions=[False, True, True],
    number_residual_blocks=2,
    normalization_groups=1,
    intermediary_activation_function='swish',
    intermediary_activation_alpha=0.05,
    last_layer_activation='linear',
    number_samples_per_class=number_samples_per_class
)

second_instance_unet = UNetModel(
    embedding_dimension=128,
    embedding_channels=1,
    list_neurons_per_level=[1, 2, 4],
    list_attentions=[False, True, True],  # one flag per level, matching list_neurons_per_level
    number_residual_blocks=2,
    normalization_groups=1,
    intermediary_activation_function='swish',
    intermediary_activation_alpha=0.05,
    last_layer_activation='linear',
    number_samples_per_class=number_samples_per_class
)

# Initialize Gaussian Diffusion
gaussian_diffusion_util = GaussianDiffusion(
    beta_start=1e-4,
    beta_end=0.02,
    time_steps=1000,
    clip_min=-1.0,
    clip_max=1.0
)

# Initialize Variational Autoencoder
variation_model_diffusion = VariationalModelDiffusion(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function='swish',
    initializer_mean=0.0,
    initializer_deviation=0.02,
    dropout_decay_encoder=0.2,
    dropout_decay_decoder=0.4,
    last_layer_activation='sigmoid',
    number_neurons_encoder=[128, 64],
    number_neurons_decoder=[64, 128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Initialize Diffusion Algorithm
diffusion_algorithm = DiffusionModel(
    first_unet_model=first_instance_unet.build_model(),
    second_unet_model=second_instance_unet.build_model(),
    encoder_model_image=variation_model_diffusion.get_encoder(),
    decoder_model_image=variation_model_diffusion.get_decoder(),
    gdf_util=gaussian_diffusion_util,
    optimizer_autoencoder=Adam(learning_rate=0.0002),
    optimizer_diffusion=Adam(learning_rate=0.0002),
    time_steps=1000,
    ema=0.9999,
    margin=0.001,
    embedding_dimension=128
)

# Train and generate samples
diffusion_algorithm.compile(loss='mse', optimizer=Adam(learning_rate=0.002))
data_embedding = variation_model_diffusion.create_embedding([x_real_samples, to_categorical(y_real_samples)])
diffusion_algorithm.fit(data_embedding, epochs=1000, batch_size=32)
samples = diffusion_algorithm.get_samples(number_samples_per_class)
```
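
Internally, `GaussianDiffusion` is parameterized like a standard DDPM: a linear beta schedule runs from `beta_start` to `beta_end` over `time_steps`, and the forward process perturbs a clean sample as `x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps`, where `a_bar_t` is the cumulative product of `(1 - beta)`. A short sketch of what the three schedule arguments imply (textbook DDPM math, assumed rather than read from the library source):

```python
import numpy

# Linear schedule implied by beta_start=1e-4, beta_end=0.02, time_steps=1000.
betas = numpy.linspace(1e-4, 0.02, 1000)
alpha_bar = numpy.cumprod(1.0 - betas)
rng = numpy.random.default_rng(0)

def noise_sample(x0, t):
    # Forward process: blend the clean sample with Gaussian noise at step t.
    eps = rng.standard_normal(x0.shape)
    return numpy.sqrt(alpha_bar[t]) * x0 + numpy.sqrt(1.0 - alpha_bar[t]) * eps
```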

#### Diffusion Model Parameters

| Parameter | Description |
|-----------|-------------|
| `--diffusion_unet_last_layer_activation` | Activation for the last U-Net layer |
| `--diffusion_latent_dimension` | Dimension of the latent space |
| `--diffusion_unet_num_embedding_channels` | Number of embedding channels |
| `--diffusion_unet_channels_per_level` | Channels per U-Net level |
| `--diffusion_unet_batch_size` | Batch size for U-Net training |
| `--diffusion_unet_attention_mode` | Attention mode for the U-Net |
| `--diffusion_unet_num_residual_blocks` | Number of residual blocks |
| `--diffusion_unet_group_normalization` | Group normalization value |
| `--diffusion_unet_intermediary_activation` | Intermediary activation function |
| `--diffusion_unet_intermediary_activation_alpha` | Alpha for the intermediary activation |
| `--diffusion_unet_epochs` | Training epochs |
| `--diffusion_gaussian_beta_start` | Starting beta value |
| `--diffusion_gaussian_beta_end` | Ending beta value |
| `--diffusion_gaussian_time_steps` | Number of time steps |
| `--diffusion_gaussian_clip_min` | Minimum clipping value |
| `--diffusion_gaussian_clip_max` | Maximum clipping value |

### Conditional GAN (CGAN)

```python
import numpy
import tensorflow
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from SynDataGen.Engine.Models.Adversarial.AdversarialModel import AdversarialModel
from SynDataGen.Engine.Algorithms.Adversarial.AdversarialAlgorithm import AdversarialAlgorithm

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize Adversarial Model
adversarial_model = AdversarialModel(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.5,
    dropout_decay_rate_g=0.2,
    dropout_decay_rate_d=0.4,
    last_layer_activation="Sigmoid",
    dense_layer_sizes_g=[128],
    dense_layer_sizes_d=[128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Initialize Adversarial Algorithm
adversarial_algorithm = AdversarialAlgorithm(
    generator_model=adversarial_model.get_generator(),
    discriminator_model=adversarial_model.get_discriminator(),
    latent_dimension=128,
    loss_generator='binary_crossentropy',
    loss_discriminator='binary_crossentropy',
    file_name_discriminator="discriminator_model",
    file_name_generator="generator_model",
    models_saved_path="models_saved/",
    latent_mean_distribution=0.0,
    latent_stander_deviation=1.0,
    smoothing_rate=0.15
)

# Train and generate samples
adversarial_algorithm.compile(
    Adam(learning_rate=0.0002, beta_1=0.5),
    Adam(learning_rate=0.0002, beta_1=0.5),
    'binary_crossentropy',
    'binary_crossentropy'
)
adversarial_algorithm.fit(
    x_real_samples,
    to_categorical(y_real_samples, num_classes=number_samples_per_class["number_classes"]),
    epochs=1000,
    batch_size=32
)
samples = adversarial_algorithm.get_samples(number_samples_per_class)
```
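
The `smoothing_rate=0.15` passed above corresponds to one-sided label smoothing, a standard GAN stabilization trick: the discriminator's targets for real samples are softened from 1.0 so it cannot become overconfident. Conceptually (the library's exact variant may differ):

```python
import numpy

smoothing_rate = 0.15
batch_size = 32
real_targets = numpy.full((batch_size, 1), 1.0 - smoothing_rate)  # 0.85 instead of 1.0
fake_targets = numpy.zeros((batch_size, 1))                       # fakes stay at 0.0
```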

#### CGAN Parameters

| Parameter | Description |
|-----------|-------------|
| `--adversarial_number_epochs` | Number of training epochs |
| `--adversarial_batch_size` | Training batch size |
| `--adversarial_initializer_mean` | Mean for weight initialization |
| `--adversarial_initializer_deviation` | Std dev for weight initialization |
| `--adversarial_latent_dimension` | Latent space dimension |
| `--adversarial_training_algorithm` | Training algorithm |
| `--adversarial_activation_function` | Activation function |
| `--adversarial_dropout_decay_rate_g` | Generator dropout rate |
| `--adversarial_dropout_decay_rate_d` | Discriminator dropout rate |
| `--adversarial_dense_layer_sizes_g` | Generator layer sizes |
| `--adversarial_dense_layer_sizes_d` | Discriminator layer sizes |
| `--adversarial_latent_mean_distribution` | Latent space mean |
| `--adversarial_latent_stander_deviation` | Latent space std dev |
| `--adversarial_loss_generator` | Generator loss function |
| `--adversarial_loss_discriminator` | Discriminator loss function |
| `--adversarial_smoothing_rate` | Label smoothing rate |

### Wasserstein GAN-GP

```python
import numpy
import tensorflow
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from SynDataGen.Engine.Models.Wasserstein.ModelWassersteinGAN import WassersteinModel
from SynDataGen.Engine.Algorithms.Wasserstein.AlgorithmWassersteinGan import WassersteinAlgorithm

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize Wasserstein Model
wasserstein_model = WassersteinModel(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.02,
    dropout_decay_rate_g=0.2,
    dropout_decay_rate_d=0.4,
    last_layer_activation="sigmoid",
    dense_layer_sizes_g=[128],
    dense_layer_sizes_d=[128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Initialize Wasserstein Algorithm
wasserstein_algorithm = WassersteinAlgorithm(
    generator_model=wasserstein_model.get_generator(),
    discriminator_model=wasserstein_model.get_discriminator(),
    latent_dimension=128,
    generator_loss_fn="binary_crossentropy",
    discriminator_loss_fn="binary_crossentropy",
    file_name_discriminator="discriminator_model",
    file_name_generator="generator_model",
    models_saved_path="models_saved/",
    latent_mean_distribution=0.0,
    latent_stander_deviation=1.0,
    smoothing_rate=0.15,
    gradient_penalty_weight=10.0,
    discriminator_steps=3
)

# Standard WGAN losses for the otherwise-undefined names below; the exact
# signatures the library expects are an assumption and may differ.
def generator_loss(fake_output):
    return -tensorflow.reduce_mean(fake_output)

def discriminator_loss(real_output, fake_output):
    return tensorflow.reduce_mean(fake_output) - tensorflow.reduce_mean(real_output)

# Train and generate samples
wasserstein_algorithm.compile(
    Adam(learning_rate=0.0002, beta_1=0.5),
    Adam(learning_rate=0.0002, beta_1=0.5),
    generator_loss,
    discriminator_loss
)
wasserstein_algorithm.fit(
    x_real_samples,
    to_categorical(y_real_samples, num_classes=number_samples_per_class["number_classes"]),
    epochs=1000,
    batch_size=32
)
samples = wasserstein_algorithm.get_samples(number_samples_per_class)
```
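
The stability of WGAN-GP comes from the gradient penalty: the critic's gradient norm is pushed toward 1 on random interpolates between real and generated batches, weighted by `gradient_penalty_weight` (the 10.0 above), and the critic is updated `discriminator_steps` times per generator step. A sketch of the standard penalty term for an unconditional critic over tabular rows (textbook WGAN-GP, not the library's internal code):

```python
import tensorflow

def gradient_penalty(critic, real_batch, fake_batch, weight=10.0):
    # Random points on the line between real and generated samples.
    batch_size = tensorflow.shape(real_batch)[0]
    alpha = tensorflow.random.uniform([batch_size, 1], 0.0, 1.0)
    interpolated = alpha * real_batch + (1.0 - alpha) * fake_batch
    with tensorflow.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated)
    gradients = tape.gradient(scores, interpolated)
    norms = tensorflow.sqrt(tensorflow.reduce_sum(tensorflow.square(gradients), axis=1))
    # Penalize deviation of the gradient norm from 1 (the 1-Lipschitz target).
    return weight * tensorflow.reduce_mean(tensorflow.square(norms - 1.0))
```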

#### WGAN-GP Parameters

| Parameter | Description |
|-----------|-------------|
| `--wasserstein_latent_dimension` | Latent space dimension |
| `--wasserstein_training_algorithm` | Training algorithm |
| `--wasserstein_activation_function` | Activation function |
| `--wasserstein_dropout_decay_rate_g` | Generator dropout rate |
| `--wasserstein_dropout_decay_rate_d` | Discriminator dropout rate |
| `--wasserstein_dense_layer_sizes_generator` | Generator layer sizes |
| `--wasserstein_dense_layer_sizes_discriminator` | Discriminator layer sizes |
| `--wasserstein_batch_size` | Training batch size |
| `--wasserstein_number_epochs` | Number of training epochs |
| `--wasserstein_number_classes` | Number of classes |
| `--wasserstein_loss_function` | Loss function |
| `--wasserstein_momentum` | Optimizer momentum |
| `--wasserstein_last_activation_layer` | Last layer activation |
| `--wasserstein_initializer_mean` | Weight initialization mean |
| `--wasserstein_initializer_deviation` | Weight initialization std dev |
| `--wasserstein_gradient_penalty` | Gradient penalty weight |

### Conditional Autoencoder (CAE)

```python
import numpy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from SynDataGen.Engine.Models.Autoencoder.ModelAutoencoder import AutoencoderModel
from SynDataGen.Engine.Algorithms.Autoencoder.AutoencoderAlgorithm import AutoencoderAlgorithm

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize Autoencoder Model
autoencoder_model = AutoencoderModel(
    latent_dimension=64,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.50,
    dropout_decay_encoder=0.2,
    dropout_decay_decoder=0.2,
    last_layer_activation="sigmoid",
    number_neurons_encoder=[256, 128],
    number_neurons_decoder=[128, 256],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Initialize Autoencoder Algorithm
autoencoder_algorithm = AutoencoderAlgorithm(
    encoder_model=autoencoder_model.get_encoder(input_shape),
    decoder_model=autoencoder_model.get_decoder(input_shape),
    loss_function="binary_crossentropy",
    file_name_encoder="encoder_model",
    file_name_decoder="decoder_model",
    models_saved_path="models_saved/",
    latent_mean_distribution=0.5,
    latent_stander_deviation=0.5,
    latent_dimension=64
)

# Train and generate samples
autoencoder_algorithm.compile(loss='mse')
autoencoder_algorithm.fit(
    (x_real_samples, to_categorical(y_real_samples, num_classes=number_samples_per_class["number_classes"])),
    x_real_samples,
    epochs=1000,
    batch_size=32
)
samples = autoencoder_algorithm.get_samples(number_samples_per_class)
```
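
A deterministic autoencoder has no learned prior over its latent space, so generation works by drawing latents from the distribution configured above (`latent_mean_distribution=0.5`, `latent_stander_deviation=0.5`) and decoding them alongside the desired class labels; this is presumably what `get_samples` automates. A hand-rolled sketch, assuming the conditional decoder accepts `[latents, one_hot_labels]`:

```python
import numpy
from tensorflow.keras.utils import to_categorical

rng = numpy.random.default_rng(0)
number_samples = 100
latents = rng.normal(0.5, 0.5, size=(number_samples, 64)).astype(numpy.float32)
labels = to_categorical(numpy.full(number_samples, 1), num_classes=3)  # condition on class 1

decoder = autoencoder_model.get_decoder(input_shape)
generated = decoder.predict([latents, labels])  # input signature is an assumption
```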

#### CAE Parameters

| Parameter | Description |
|-----------|-------------|
| `--autoencoder_latent_dimension` | Latent space dimension |
| `--autoencoder_training_algorithm` | Training algorithm |
| `--autoencoder_activation_function` | Activation function |
| `--autoencoder_dropout_decay_rate_encoder` | Encoder dropout rate |
| `--autoencoder_dropout_decay_rate_decoder` | Decoder dropout rate |
| `--autoencoder_dense_layer_sizes_encoder` | Encoder layer sizes |
| `--autoencoder_dense_layer_sizes_decoder` | Decoder layer sizes |
| `--autoencoder_batch_size` | Training batch size |
| `--autoencoder_number_classes` | Number of classes |
| `--autoencoder_number_epochs` | Number of training epochs |
| `--autoencoder_loss_function` | Loss function |
| `--autoencoder_momentum` | Optimizer momentum |
| `--autoencoder_last_activation_layer` | Last layer activation |
| `--autoencoder_initializer_mean` | Weight initialization mean |
| `--autoencoder_initializer_deviation` | Weight initialization std dev |

### Variational Autoencoder (VAE)

```python
import numpy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from SynDataGen.Engine.Models.VariationalAutoencoder.VariationalAutoencoderModel import VariationalModel
from SynDataGen.Engine.Algorithms.VariationalAutoencoder.AlgorithmVariationalAutoencoder import VariationalAlgorithm

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize Variational Model
variation_model = VariationalModel(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.02,
    dropout_decay_encoder=0.2,
    dropout_decay_decoder=0.4,
    last_layer_activation="sigmoid",
    number_neurons_encoder=[128],
    number_neurons_decoder=[128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Initialize Variational Algorithm
variational_algorithm = VariationalAlgorithm(
    encoder_model=variation_model.get_encoder(),
    decoder_model=variation_model.get_decoder(),
    loss_function="binary_crossentropy",
    latent_dimension=128,  # match the model's latent_dimension above
    decoder_latent_dimension=128,
    latent_mean_distribution=0.0,
    latent_stander_deviation=0.5,
    file_name_encoder="encoder_model",
    file_name_decoder="decoder_model",
    models_saved_path="models_saved/"
)

# Train and generate samples
variational_algorithm.compile()
variational_algorithm.fit(
    (x_real_samples, to_categorical(y_real_samples, num_classes=number_samples_per_class["number_classes"])),
    epochs=1000,
    batch_size=32
)
samples = variational_algorithm.get_samples(number_samples_per_class)
```
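
Unlike the plain autoencoder, the VAE learns a probabilistic latent space: the encoder predicts a mean and log-variance per sample, and training draws latents via the reparameterization trick `z = mu + sigma * epsilon`, which keeps the sampling step differentiable. The standard formulation (textbook VAE, not the library internals):

```python
import tensorflow

def reparameterize(mean, log_variance):
    # z = mean + sigma * epsilon; gradients flow through mean and log_variance.
    epsilon = tensorflow.random.normal(tensorflow.shape(mean))
    return mean + tensorflow.exp(0.5 * log_variance) * epsilon
```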

#### VAE Parameters

| Parameter | Description |
|-----------|-------------|
| `--variational_autoencoder_latent_dimension` | Latent space dimension |
| `--variational_autoencoder_training_algorithm` | Training algorithm |
| `--variational_autoencoder_activation_function` | Activation function |
| `--variational_autoencoder_dropout_decay_rate_encoder` | Encoder dropout rate |
| `--variational_autoencoder_dropout_decay_rate_decoder` | Decoder dropout rate |
| `--variational_autoencoder_dense_layer_sizes_encoder` | Encoder layer sizes |
| `--variational_autoencoder_dense_layer_sizes_decoder` | Decoder layer sizes |
| `--variational_autoencoder_number_epochs` | Number of training epochs |
| `--variational_autoencoder_batch_size` | Training batch size |
| `--variational_autoencoder_number_classes` | Number of classes |
| `--variational_autoencoder_loss_function` | Loss function |
| `--variational_autoencoder_momentum` | Optimizer momentum |
| `--variational_autoencoder_last_activation_layer` | Last layer activation |
| `--variational_autoencoder_initializer_mean` | Weight initialization mean |
| `--variational_autoencoder_initializer_deviation` | Weight initialization std dev |
| `--variational_autoencoder_mean_distribution` | Latent space mean |
| `--variational_autoencoder_stander_deviation` | Latent space std dev |

## Common Parameters

### Data Loading Parameters

| Parameter | Description |
|-----------|-------------|
| `-i`, `--data_load_path_file_input` | Path to input CSV file |
| `--data_load_label_column` | Index of the label column |
| `--data_load_max_samples` | Maximum samples to load |
| `--data_load_max_columns` | Maximum columns to consider |
| `--data_load_start_column` | First column index |
| `--data_load_end_column` | Last column index |
| `--data_load_path_file_output` | Output CSV path |
| `--data_load_exclude_columns` | Columns to exclude |
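
These flags suggest a command-line pipeline that reads a CSV, trains a chosen model, and writes synthetic rows back out. A hypothetical invocation (the `main.py` entry point is an assumption; the flag names come from the tables in this document):

```bash
python main.py \
    -i input_data.csv \
    --data_load_label_column 0 \
    --adversarial_number_epochs 1000 \
    --adversarial_batch_size 32 \
    --data_load_path_file_output synthetic_output.csv
```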

## Classifier Parameters

### Support Vector Machine

| Parameter | Description |
|-----------|-------------|
| `--support_vector_machine_regularization` | Regularization parameter |
| `--support_vector_machine_kernel` | Kernel type |
| `--support_vector_machine_kernel_degree` | Polynomial kernel degree |
| `--support_vector_machine_gamma` | Kernel coefficient |
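
The classifier parameters mirror scikit-learn estimators and are used to score the utility of generated data. A common protocol is train-on-synthetic, test-on-real (TSTR); a minimal sketch with `sklearn.svm.SVC`, mapping the flags above onto `C`, `kernel`, `degree`, and `gamma` (the TSTR wiring is illustrative, not the library's built-in evaluator):

```python
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# x_synthetic/y_synthetic: generated data; x_real_test/y_real_test: held-out real data.
classifier = SVC(C=1.0, kernel="rbf", degree=3, gamma="scale")
classifier.fit(x_synthetic, y_synthetic)
print("TSTR accuracy:", accuracy_score(y_real_test, classifier.predict(x_real_test)))
```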

### Stochastic Gradient Descent

| Parameter | Description |
|-----------|-------------|
| `--stochastic_gradient_descent_loss` | Loss function |
| `--stochastic_gradient_descent_penalty` | Regularization penalty |
| `--stochastic_gradient_descent_alpha` | Regularization term |
| `--stochastic_gradient_descent_max_iterations` | Maximum iterations |
| `--stochastic_gradient_descent_tolerance` | Stopping-criterion tolerance |

### Random Forest

| Parameter | Description |
|-----------|-------------|
| `--random_forest_number_estimators` | Number of trees |
| `--random_forest_max_depth` | Maximum tree depth |
| `--random_forest_max_leaf_nodes` | Maximum leaf nodes |

### Quadratic Discriminant Analysis

| Parameter | Description |
|-----------|-------------|
| `--quadratic_discriminant_analysis_priors` | Class prior probabilities |
| `--quadratic_discriminant_analysis_regularization` | Regularization parameter |
| `--quadratic_discriminant_analysis_threshold` | Threshold value |

### Multilayer Perceptron

| Parameter | Description |
|-----------|-------------|
| `--perceptron_training_algorithm` | Training algorithm |
| `--perceptron_training_loss` | Loss function |
| `--perceptron_layers_settings` | Layer configuration |
| `--perceptron_dropout_decay_rate` | Dropout rate |
| `--perceptron_training_metric` | Evaluation metrics |
| `--perceptron_layer_activation` | Hidden layer activation |
| `--perceptron_last_layer_activation` | Output activation |
| `--perceptron_number_epochs` | Training epochs |

### Spectral Clustering

| Parameter | Description |
|-----------|-------------|
| `--spectral_number_clusters` | Number of clusters |
| `--spectral_eigen_solver` | Eigenvalue decomposition method |
| `--spectral_affinity` | Affinity matrix construction |
| `--spectral_assign_labels` | Label assignment strategy |
| `--spectral_random_state` | Random seed |

### Linear Regression

| Parameter | Description |
|-----------|-------------|
| `--linear_regression_fit_intercept` | Whether to calculate the intercept |
| `--linear_regression_normalize` | Normalize features |
| `--linear_regression_copy_X` | Copy input data |
| `--linear_regression_number_jobs` | Number of parallel jobs |

### Naive Bayes

| Parameter | Description |
|-----------|-------------|
| `--naive_bayes_priors` | Class prior probabilities |
| `--naive_bayes_variation_smoothing` | Variance smoothing parameter |

### K-Nearest Neighbors

| Parameter | Description |
|-----------|-------------|
| `--knn_number_neighbors` | Number of neighbors |
| `--knn_weights` | Weight function |
| `--knn_algorithm` | Search algorithm |
| `--knn_leaf_size` | Leaf size for tree-based algorithms |
| `--knn_metric` | Distance metric |

### K-Means

| Parameter | Description |
|-----------|-------------|
| `--k_means_number_clusters` | Number of clusters |
| `--k_means_init` | Initialization method |
| `--k_means_max_iterations` | Maximum iterations |
| `--k_means_tolerance` | Convergence tolerance |
| `--k_means_random_state` | Random seed |

### Gradient Boosting

| Parameter | Description |
|-----------|-------------|
| `--gradient_boosting_loss` | Loss function |
| `--gradient_boosting_learning_rate` | Learning rate |
| `--gradient_boosting_number_estimators` | Number of estimators |
| `--gradient_boosting_subsample` | Subsample ratio |
| `--gradient_boosting_criterion` | Split quality measure |

### Gaussian Process

| Parameter | Description |
|-----------|-------------|
| `--gaussian_process_kernel` | Kernel function |
| `--gaussian_process_max_iterations` | Maximum iterations |
| `--gaussian_process_optimizer` | Optimizer method |

### Decision Tree

| Parameter | Description |
|-----------|-------------|
| `--decision_tree_criterion` | Split quality measure |
| `--decision_tree_max_depth` | Maximum tree depth |
| `--decision_tree_max_features` | Features to consider per split |
| `--decision_tree_max_leaf_nodes` | Maximum leaf nodes |