Jellyfish Edition (v1.0.0)
An advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models, including diffusion and adversarial architectures.
# Quick Start

```bash
git clone https://github.com/SBSeg25/SynDataGen.git
cd SynDataGen
pip install -r requirements.txt

# Run demo (3 minutes)
python3 run_campaign_sbseg.py -c sf

# Full experiments (7 hours)
python3 run_campaign_sbseg.py
```
- Extensible design supporting multiple generative models with consistent APIs for easy integration and experimentation.
- Full CUDA support for efficient training on large datasets; optional GPU execution for faster model convergence.
- Built-in evaluation suite covering accuracy, precision, recall, F1-score, ROC-AUC, and distance metrics.
- TS-TR and TR-TS validation strategies for thorough assessment of synthetic data quality and model generalization (illustrated in the sketch after this list).
- Publication-ready charts, confusion matrices, heatmaps, and clustering visualizations for data analysis.
- Containerized execution environment ensuring reproducibility across different systems and configurations.
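To make the two validation strategies concrete, here is a minimal sketch of TS-TR (Train on Synthetic, Test on Real) and TR-TS (Train on Real, Test on Synthetic) scoring, assuming binary labels and a scikit-learn reference classifier. The random stand-in arrays and the choice of classifier are illustrative assumptions, not part of the SynDataGen API.

```python
# Minimal TS-TR / TR-TS sketch (illustrative; not the SynDataGen API).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Stand-in random data; substitute your real and generated samples here.
rng = np.random.default_rng(42)
X_real, y_real = rng.normal(size=(200, 8)), rng.integers(0, 2, 200)
X_synth, y_synth = rng.normal(size=(200, 8)), rng.integers(0, 2, 200)

def cross_scores(train_X, train_y, test_X, test_y):
    """Train a reference classifier on one dataset and score it on the other."""
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(train_X, train_y)
    predictions = clf.predict(test_X)
    scores = clf.predict_proba(test_X)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(test_y, predictions),
        "f1": f1_score(test_y, predictions),
        "roc_auc": roc_auc_score(test_y, scores),
    }

ts_tr = cross_scores(X_synth, y_synth, X_real, y_real)  # Train on Synthetic, Test on Real
tr_ts = cross_scores(X_real, y_real, X_synth, y_synth)  # Train on Real, Test on Synthetic
```

High TS-TR scores suggest the synthetic data preserves the decision-relevant structure of the real data; comparing both directions helps expose mode collapse or distribution drift.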
| Component | Minimum | Recommended |
|---|---|---|
| CPU | Any x86_64 | Multi-core (i5/Ryzen 5+) |
| RAM | 4 GB | 8 GB+ |
| Storage | 10 GB | 20 GB SSD |
| GPU | Optional | NVIDIA with CUDA 11+ |
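If you plan to use the optional GPU path, a quick sanity check is to ask TensorFlow (the backend used by the code examples below) which CUDA devices it can see:

```python
import tensorflow as tf

# An empty list means CUDA was not detected and training will fall back to CPU.
print(tf.config.list_physical_devices("GPU"))
```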
| Model | Highlight | Description |
|---|---|---|
| Conditional GAN | Class Balancing | Controlled generation with label conditioning |
| WGAN-GP | Stable Training | Wasserstein GAN with improved stability via gradient penalty |
| VAE | Probabilistic | Variational Autoencoder with a probabilistic latent space |
| Diffusion | State-of-the-art | Progressive noise-based generation for high-quality samples |
| Latent Diffusion | High Resolution | Efficient diffusion in a compressed latent space |
| VQ-VAE | Categorical Data | Discrete latent representations via vector quantization |
| SMOTE | Imbalanced Data | Synthetic Minority Over-sampling Technique |
| TVAE | SDV | Tabular VAE optimized for structured data synthesis |
| Copula | SDV | Statistical modeling based on dependency functions |
| CTGAN | SDV | Conditional GAN with mode-specific normalization |

**Awards:** SynDataGen was recognized as the most innovative and impactful tool at the Brazilian Symposium on Cybersecurity, and was awarded for outstanding contributions in the artifacts category for its exceptional documentation and reproducibility.
The SynDataGen framework follows a modular architecture with four main layers, forming a complete pipeline for synthetic data generation and evaluation:

- **Data Layer:** handles data ingestion from CSV/XLS formats, preprocessing, and stratified k-fold splitting for robust evaluation (a splitting sketch follows this list).
- **Generation Engine:** core implementation of 8 native generative algorithms, from adversarial to diffusion-based approaches, plus seamless integration with Synthetic Data Vault (SDV) library models.
- **Evaluation Layer:** implements the TS-TR and TR-TS evaluation strategies with cross-validation for comprehensive quality assessment.
- **Analysis Layer:** comprehensive evaluation metrics and publication-ready visualization capabilities for thorough analysis.
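As a small illustration of the stratified splitting performed by the Data Layer, the sketch below uses scikit-learn's `StratifiedKFold`. The file name, label column, and fold count are placeholder assumptions for the example; the framework's loader performs this step internally.

```python
# Hedged sketch of stratified k-fold splitting (the Data Layer does this internally).
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# "dataset.csv" and the "label" column are placeholder names.
frame = pd.read_csv("dataset.csv")
features = frame.drop(columns=["label"])
labels = frame["label"]

# Stratification keeps per-class proportions stable across folds.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(splitter.split(features, labels)):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```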
Pipeline: CSV/XLS Data → Clean & Split → Synthetic Data → TS-TR & TR-TS → Metrics & Viz
Generate synthetic samples using Conditional GANs with label conditioning for balanced datasets.
```python
import numpy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import BinaryCrossentropy

from SynDataGen.Engine.Models.Adversarial.AdversarialModel import AdversarialModel
from SynDataGen.Engine.Algorithms.Adversarial.AdversarialAlgorithm import AdversarialAlgorithm

# Define the class distribution and input shape
number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200,)

# Initialize the adversarial (GAN) model
adversarial_model = AdversarialModel(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.5,
    dropout_decay_rate_g=0.2,
    dropout_decay_rate_d=0.4,
    last_layer_activation="Sigmoid",
    dense_layer_sizes_g=[128],
    dense_layer_sizes_d=[128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Set up the training algorithm
adversarial_algorithm = AdversarialAlgorithm(
    generator_model=adversarial_model.get_generator(),
    discriminator_model=adversarial_model.get_discriminator(),
    latent_dimension=128,
    loss_generator='binary_crossentropy',
    loss_discriminator='binary_crossentropy',
    file_name_discriminator="discriminator_model",
    file_name_generator="generator_model",
    models_saved_path="models_saved/",
    latent_mean_distribution=0.0,
    latent_stander_deviation=1.0,
    smoothing_rate=0.15
)

# Compile with separate optimizers for the generator and the discriminator
generator_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.9)
discriminator_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.9)
adversarial_algorithm.compile(
    generator_optimizer, discriminator_optimizer,
    BinaryCrossentropy(), BinaryCrossentropy()
)

# Train on your real data (x_real_samples / y_real_samples are assumed to be
# a feature matrix and integer labels loaded elsewhere)
adversarial_algorithm.fit(
    x_real_samples,
    to_categorical(y_real_samples, num_classes=3),
    epochs=1000, batch_size=32
)

# Generate synthetic samples following the requested class distribution
samples = adversarial_algorithm.get_samples(number_samples_per_class)
```
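A note on two of the settings above: `smoothing_rate=0.15` presumably applies label smoothing to the discriminator targets, a common trick to keep the discriminator from becoming overconfident, and `beta_1=0.5` in both Adam optimizers follows standard practice for stabilizing adversarial training. Both are sensible starting points worth tuning per dataset.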
State-of-the-art generation using denoising diffusion with latent space compression.
```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.utils import to_categorical

from SynDataGen.Engine.Models.Diffusion.DiffusionModelUnet import UNetModel
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmDiffusion import DiffusionModel
from SynDataGen.Engine.Algorithms.Diffusion.GaussianDiffusion import GaussianDiffusion
from SynDataGen.Engine.Models.Diffusion.VariationalAutoencoderModel import VariationalModelDiffusion
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmVariationalAutoencoderDiffusion import VariationalAlgorithmDiffusion

# Define the class distribution and input shape
number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200,)

# Initialize the two U-Net denoisers (the second starts as a weight copy of the first)
first_instance_unet = UNetModel(
    embedding_dimension=128,
    embedding_channels=1,
    list_neurons_per_level=[1, 2, 4],
    list_attentions=[False, True, True],
    number_residual_blocks=2,
    normalization_groups=1,
    intermediary_activation_function='swish',
    number_samples_per_class=number_samples_per_class
)
first_unet_model = first_instance_unet.build_model()
second_unet_model = first_instance_unet.build_model()
second_unet_model.set_weights(first_unet_model.get_weights())

# Initialize the Gaussian diffusion noise schedule
gaussian_diffusion_util = GaussianDiffusion(
    beta_start=1e-4,
    beta_end=0.02,
    time_steps=1000,
    clip_min=-1.0,
    clip_max=1.0
)

# Initialize the variational autoencoder that compresses data into latent space
variation_model_diffusion = VariationalModelDiffusion(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function='swish',
    number_neurons_encoder=[128, 64],
    number_neurons_decoder=[64, 128],
    number_samples_per_class=number_samples_per_class
)

# Assemble the latent diffusion algorithm
diffusion_algorithm = DiffusionModel(
    first_unet_model=first_unet_model,
    second_unet_model=second_unet_model,
    encoder_model_image=variation_model_diffusion.get_encoder(),
    decoder_model_image=variation_model_diffusion.get_decoder(),
    gdf_util=gaussian_diffusion_util,
    time_steps=1000,
    ema=0.9999
)
diffusion_algorithm.compile(
    loss=MeanSquaredError(),
    optimizer=Adam(learning_rate=0.002)
)

# Train on the latent embeddings of the real data (data_embedding and
# y_real_samples are assumed to be prepared elsewhere)
diffusion_algorithm.fit(
    data_embedding,
    to_categorical(y_real_samples, num_classes=3),
    epochs=1000, batch_size=32
)

# Sample through the VAE-diffusion wrapper; variational_algorithm_diffusion is
# an instance of the imported VariationalAlgorithmDiffusion (its construction
# is omitted in this snippet)
samples = variational_algorithm_diffusion.get_samples(number_samples_per_class)
```
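The paired U-Nets implement an exponential-moving-average scheme: the second network starts as a weight copy of the first and, given `ema=0.9999`, is presumably blended toward the training weights at each step. Sampling from the averaged weights typically yields smoother, higher-quality outputs than the raw training weights.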
- Complete API documentation with examples and best practices for all models and utilities.
- Step-by-step guides for common use cases, from basic generation to advanced customization.
- 8 comprehensive Mermaid diagrams documenting system design, data flow, and evaluation strategies.