SynDataGen

🌊 Jellyfish Edition (v1.0.0)

An advanced Python framework for generating and evaluating synthetic tabular datasets using modern generative models, including diffusion and adversarial architectures.

✅ Active 🐍 Python 3.8+ 🚀 GPU Supported 📄 MIT License 💚 Open Source
# Quick Start
git clone https://github.com/SBSeg25/SynDataGen.git
cd SynDataGen
pip install -r requirements.txt

# Run demo (3 minutes)
python3 run_campaign_sbseg.py -c sf

# Full experiments (7 hours)
python3 run_campaign_sbseg.py

Key Features

🎯 Modular Architecture

Extensible design supporting multiple generative models with consistent APIs for easy integration and experimentation.

⚡ GPU Acceleration

Full CUDA support for efficient training on large datasets. Optional GPU execution for faster model convergence.
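
Before launching a long run, a quick sanity check can confirm that TensorFlow actually sees a CUDA device; this is a generic TensorFlow snippet, not part of the SynDataGen API:

import tensorflow as tf

# List the CUDA devices TensorFlow has registered; an empty list means
# training will silently fall back to the CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible to TensorFlow: {gpus}")

# Optionally let TensorFlow allocate GPU memory on demand instead of
# reserving the whole card up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)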

📊 Comprehensive Metrics

Built-in evaluation suite including accuracy, precision, recall, F1-score, ROC-AUC, and distance metrics.
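
For reference, the listed binary classification metrics can be computed with scikit-learn as sketched below; this is a generic illustration, not the framework's internal evaluation code, and the helper name classification_metrics is a placeholder:

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def classification_metrics(y_true, y_pred, y_score):
    """Core binary metrics from true labels, hard predictions and scores."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_score),  # needs probabilities or decision scores
    }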

🔄 Dual Evaluation

TS-TR (train on synthetic, test on real) and TR-TS (train on real, test on synthetic) validation strategies for thorough assessment of synthetic data quality and model generalization.
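
The idea behind the two strategies: TS-TR trains a downstream classifier on synthetic data and tests it on real data, while TR-TS does the reverse. A minimal sketch of both, assuming a scikit-learn classifier as the downstream model (the helper name tstr_trts and the choice of RandomForest are illustrative, not part of SynDataGen):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def tstr_trts(x_real, y_real, x_synthetic, y_synthetic):
    """Return (TS-TR, TR-TS) macro F1 scores for a downstream classifier."""
    # TS-TR: fit on synthetic data, evaluate on held-out real data.
    model_synthetic = RandomForestClassifier(random_state=0).fit(x_synthetic, y_synthetic)
    tstr = f1_score(y_real, model_synthetic.predict(x_real), average="macro")

    # TR-TS: fit on real data, evaluate on the generated synthetic data.
    model_real = RandomForestClassifier(random_state=0).fit(x_real, y_real)
    trts = f1_score(y_synthetic, model_real.predict(x_synthetic), average="macro")
    return tstr, trts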

📈 Visualization Tools

Publication-ready charts, confusion matrices, heatmaps, and clustering visualizations for data analysis.
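
As an illustration of one such plot, a confusion-matrix heatmap can be produced with scikit-learn and matplotlib; this sketch is independent of the framework's own plotting utilities, and the output path is a placeholder:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def save_confusion_matrix(y_true, y_pred, path="confusion_matrix.png"):
    """Render a confusion-matrix heatmap and save it as a figure."""
    display = ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap="Blues")
    display.ax_.set_title("Confusion matrix")
    plt.tight_layout()
    plt.savefig(path, dpi=300)  # high DPI for publication-quality output
    plt.close(display.figure_)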

๐Ÿณ

Docker Support

Containerized execution environment ensuring reproducibility across different systems and configurations.

System Requirements

Component     Minimum       Recommended
💻 CPU        Any x86_64    Multi-core (i5/Ryzen 5+)
🧠 RAM        4 GB          8 GB+
💾 Storage    10 GB         20 GB SSD
🎮 GPU        Optional      NVIDIA with CUDA 11+

📦 Software Dependencies

  • ✅ Python 3.8 or higher
  • ✅ TensorFlow 2.x / PyTorch
  • ✅ NumPy, Pandas, Scikit-learn
  • ✅ Docker (optional, for containerized execution)

Supported Models

Native Implementations

  • CGAN: Conditional GANs for controlled generation with label conditioning [Class Balancing]
  • WGAN & WGAN-GP: Wasserstein GAN with improved stability via gradient penalty [Stable Training]
  • VAE: Variational Autoencoder with probabilistic latent space [Probabilistic]
  • Denoising Diffusion: Progressive noise-based generation for high-quality samples [State-of-the-art]
  • Latent Diffusion: Efficient diffusion in compressed latent space [High Resolution]
  • VQ-VAE: Discrete latent representations via vector quantization [Categorical Data]
  • SMOTE: Synthetic Minority Over-sampling Technique [Imbalanced Data] (see the sketch after this list)
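
For comparison, SMOTE-style oversampling is also available directly from the imbalanced-learn package; a minimal standalone sketch, where the toy dataset is purely illustrative:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset (roughly 90/10 class split) used only for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before resampling:", Counter(y))

# SMOTE interpolates new minority-class samples between nearest neighbours.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("After resampling: ", Counter(y_resampled))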

Third-Party Integration (SDV)

  • TVAE: Tabular VAE optimized for structured data synthesis [SDV]
  • Copula: Statistical modeling based on dependency functions [SDV]
  • CTGAN: Conditional GAN with mode-specific normalization [SDV]
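
These SDV models can also be used on their own. A minimal sketch using the SDV 1.x single-table API (the file name dataset.csv is a placeholder, and SDV's API differs between major versions):

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load the real tabular data (placeholder path).
real_data = pd.read_csv("dataset.csv")

# Describe the table so the synthesizer knows the column types.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit CTGAN and draw synthetic rows.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)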

Awards & Recognition

🏆 Best Tool SBSEG 2025

Recognized as the most innovative and impactful tool at the Brazilian Symposium on Cybersecurity.

💎 Highlighted Artifact

Awarded for outstanding contributions in the artifacts category with exceptional documentation and reproducibility.

System Architecture

The SynDataGen framework follows a modular architecture with five main layers, providing a comprehensive pipeline for synthetic data generation and evaluation.

1. Input Layer

Handles data ingestion from CSV/XLS formats, preprocessing, and stratified k-fold splitting for robust evaluation.
Components: CSV/XLS Data, Data Preprocessing, K-Fold Split

2. Generative Models

Core implementation of 8 different native generative algorithms, from adversarial to diffusion-based approaches.
Components: CGAN, WGAN, VAE, Diffusion, VQ-VAE, SMOTE

3. Third-Party Integration

Seamless integration with Synthetic Data Vault (SDV) library models for enhanced generation capabilities.
Components: CTGAN, TVAE, Copula

4. Evaluation Framework

Implements TS-TR and TR-TS evaluation strategies with cross-validation for comprehensive quality assessment.
Components: TS-TR Strategy, TR-TS Strategy, Cross-Validation

5. Metrics & Analysis

Comprehensive evaluation metrics and publication-ready visualization capabilities for thorough analysis (a distance-metric sketch follows this list).
Components: Binary Metrics, Distance Metrics, Efficiency Metrics, Visualization
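
As an illustration of the distance metrics in layer 5, per-feature distributional distances between real and synthetic data can be computed with SciPy; the helper below is a generic sketch, not the framework's implementation:

import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def feature_distances(real, synthetic):
    """Per-column 1-D Wasserstein and Kolmogorov-Smirnov distances
    between real and synthetic samples (smaller means closer)."""
    real, synthetic = np.asarray(real), np.asarray(synthetic)
    distances = []
    for column in range(real.shape[1]):
        wd = wasserstein_distance(real[:, column], synthetic[:, column])
        ks = ks_2samp(real[:, column], synthetic[:, column]).statistic
        distances.append({"feature": column, "wasserstein": wd, "ks": ks})
    return distances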

Data Flow Pipeline

📥 Ingest (CSV/XLS Data) → ⚙️ Process (Clean & Split) → 🧠 Generate (Synthetic Data) → 🔍 Evaluate (TS-TR & TR-TS) → 📊 Analyze (Metrics & Viz)
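
The Ingest and Process stages correspond to loading a tabular file and producing stratified folds. A generic sketch with pandas and scikit-learn, assuming a CSV with a label column named "label" (both the file name and the column name are placeholders):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Placeholder file and label column; adjust both to your dataset.
dataframe = pd.read_csv("dataset.csv")
features = dataframe.drop(columns=["label"]).to_numpy()
labels = dataframe["label"].to_numpy()

# Stratified splitting preserves the class proportions in every fold.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_index, test_index) in enumerate(splitter.split(features, labels)):
    x_train, x_test = features[train_index], features[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    print(f"Fold {fold}: {len(train_index)} train / {len(test_index)} test samples")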

Usage Examples

Conditional GAN (CGAN)

Generate synthetic samples using Conditional GANs with label conditioning for balanced datasets.

import numpy
import tensorflow
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import BinaryCrossentropy

from SynDataGen.Engine.Models.Adversarial.AdversarialModel import AdversarialModel
from SynDataGen.Engine.Algorithms.Adversarial.AdversarialAlgorithm import AdversarialAlgorithm

# Define class distribution and input shape
number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize Adversarial Model
adversarial_model = AdversarialModel(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function="LeakyReLU",
    initializer_mean=0.0,
    initializer_deviation=0.5,
    dropout_decay_rate_g=0.2,
    dropout_decay_rate_d=0.4,
    last_layer_activation="Sigmoid",
    dense_layer_sizes_g=[128],
    dense_layer_sizes_d=[128],
    dataset_type=numpy.float32,
    number_samples_per_class=number_samples_per_class
)

# Setup training algorithm
adversarial_algorithm = AdversarialAlgorithm(
    generator_model=adversarial_model.get_generator(),
    discriminator_model=adversarial_model.get_discriminator(),
    latent_dimension=128,
    loss_generator='binary_crossentropy',
    loss_discriminator='binary_crossentropy',
    file_name_discriminator="discriminator_model",
    file_name_generator="generator_model",
    models_saved_path="models_saved/",
    latent_mean_distribution=0.0,
    latent_stander_deviation=1.0,
    smoothing_rate=0.15
)

# Compile and train
generator_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.9)
discriminator_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.9)

adversarial_algorithm.compile(
    generator_optimizer, discriminator_optimizer,
    BinaryCrossentropy(), BinaryCrossentropy()
)

# x_real_samples / y_real_samples are the real feature matrix and integer
# class labels loaded from your dataset (not shown here).
adversarial_algorithm.fit(
    x_real_samples,
    to_categorical(y_real_samples, num_classes=3),
    epochs=1000, batch_size=32
)

# Generate synthetic samples
samples = adversarial_algorithm.get_samples(number_samples_per_class)

Diffusion Model

State-of-the-art generation using denoising diffusion with latent space compression.

import numpy
import tensorflow
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.utils import to_categorical

from SynDataGen.Engine.Models.Diffusion.DiffusionModelUnet import UNetModel
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmDiffusion import DiffusionModel
from SynDataGen.Engine.Algorithms.Diffusion.GaussianDiffusion import GaussianDiffusion
from SynDataGen.Engine.Models.Diffusion.VariationalAutoencoderModel import VariationalModelDiffusion
from SynDataGen.Engine.Algorithms.Diffusion.AlgorithmVariationalAutoencoderDiffusion import VariationalAlgorithmDiffusion

number_samples_per_class = {
    "classes": {1: 100, 2: 200, 3: 150},
    "number_classes": 3
}
input_shape = (1200, )

# Initialize UNet models
first_instance_unet = UNetModel(
    embedding_dimension=128,
    embedding_channels=1,
    list_neurons_per_level=[1, 2, 4],
    list_attentions=[False, True, True],
    number_residual_blocks=2,
    normalization_groups=1,
    intermediary_activation_function='swish',
    number_samples_per_class=number_samples_per_class
)

first_unet_model = first_instance_unet.build_model()
second_unet_model = first_instance_unet.build_model()
second_unet_model.set_weights(first_unet_model.get_weights())

# Initialize Gaussian Diffusion
gaussian_diffusion_util = GaussianDiffusion(
    beta_start=1e-4,
    beta_end=0.02,
    time_steps=1000,
    clip_min=-1.0,
    clip_max=1.0
)

# Initialize Variational Model
variation_model_diffusion = VariationalModelDiffusion(
    latent_dimension=128,
    output_shape=input_shape,
    activation_function='swish',
    number_neurons_encoder=[128, 64],
    number_neurons_decoder=[64, 128],
    number_samples_per_class=number_samples_per_class
)

# Setup diffusion algorithm
diffusion_algorithm = DiffusionModel(
    first_unet_model=first_unet_model,
    second_unet_model=second_unet_model,
    encoder_model_image=variation_model_diffusion.get_encoder(),
    decoder_model_image=variation_model_diffusion.get_decoder(),
    gdf_util=gaussian_diffusion_util,
    time_steps=1000,
    ema=0.9999
)

diffusion_algorithm.compile(
    loss=MeanSquaredError(),
    optimizer=Adam(learning_rate=0.002)
)

# Train and generate
# data_embedding / y_real_samples are the latent-encoded training features and
# their integer class labels (preparation not shown here).
diffusion_algorithm.fit(
    data_embedding,
    to_categorical(y_real_samples, num_classes=3),
    epochs=1000, batch_size=32
)

# variational_algorithm_diffusion is an instance of the imported
# VariationalAlgorithmDiffusion class wrapping the trained encoder, decoder
# and UNet models (its construction is omitted here).
samples = variational_algorithm_diffusion.get_samples(number_samples_per_class)

Documentation

📚 API Reference

Complete API documentation with examples and best practices for all models and utilities.

🎓 Tutorials

Step-by-step guides for common use cases, from basic generation to advanced customization.

📐 Architecture Diagrams

8 comprehensive Mermaid diagrams documenting system design, data flow, and evaluation strategies.

View Full Documentation