# Configuration

Model training config, privacy settings, logging, and advanced tuning.

## ModelConfig

Pass a `ModelConfig` to `synthesize()` or `Model.create()` to control training behavior. All parameters have sensible defaults — you only need to set what you want to change.

```python
import dataxid

synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        model_size="large",
        max_epochs=200,
        privacy_enabled=True,
    ),
)
```

A plain dict works too:

```python
synthetic = dataxid.synthesize(
    data=df,
    config={"model_size": "large", "max_epochs": 200},
)
```

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `embedding_dim` | `int` | `64` | Embedding size per row. Larger values capture more structure but increase training time |
| `model_size` | `str` | `"medium"` | Model capacity: `"small"`, `"medium"`, or `"large"` |
| `max_epochs` | `int` | `100` | Maximum training epochs. Training may stop earlier via early stopping |
| `batch_size` | `int` | `256` | Training batch size |
| `early_stop_patience` | `int` | `4` | Epochs without validation-loss improvement before stopping |
| `privacy_enabled` | `bool` | `False` | Add Gaussian noise to embeddings for differential privacy |
| `privacy_noise` | `float` | `0.1` | Noise scale (standard deviation) when `privacy_enabled=True` |
| `encoding_types` | `dict \| None` | `None` | Override auto-detected column encoding types |

### Model size guide

- **Small** — fast training; good for prototyping and small datasets (<1K rows).
- **Medium** — balanced; recommended for most use cases.
- **Large** — highest fidelity; best for large datasets (10K+ rows) where quality matters most.
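
The guide above maps naturally onto dataset size. As a minimal sketch (the `pick_model_size` helper is illustrative, not part of the SDK), you could choose a size by row count:

```python
# Illustrative heuristic (not an SDK function): map row count to the
# model-size guide above.
def pick_model_size(n_rows: int) -> str:
    if n_rows < 1_000:
        return "small"   # fast prototyping, small datasets
    if n_rows < 10_000:
        return "medium"  # balanced default
    return "large"       # highest fidelity on large data

print(pick_model_size(500), pick_model_size(5_000), pick_model_size(50_000))
# small medium large
```

The result can then be passed as `model_size` in a `ModelConfig`.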


## Privacy

The SDK encodes data locally. Only embeddings (64 floats per row) are sent to the API. Raw data does not reach the server.

For additional privacy guarantees, enable noise injection:

```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        privacy_enabled=True,
        privacy_noise=0.1,       # Gaussian std — higher = more privacy, less fidelity
    ),
)
```

| `privacy_noise` | Effect |
| --- | --- |
| `0.05` | Minimal noise — near-original fidelity |
| `0.1` | Default — good privacy/fidelity balance |
| `0.2+` | Strong noise — higher privacy, lower fidelity |

Even without `privacy_enabled`, the API receives embeddings, not raw values.
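
Conceptually, noise injection means adding zero-mean Gaussian noise to each of the 64 floats in a row embedding before it leaves the machine. A minimal stdlib sketch of the idea (not the SDK's actual implementation):

```python
import random

def add_privacy_noise(embedding, noise_scale=0.1, rng=None):
    # Conceptual sketch: perturb each float with zero-mean Gaussian
    # noise. With noise_scale=0.0 the embedding is unchanged.
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, noise_scale) for x in embedding]

emb = [0.0] * 64                         # one row's 64-float embedding
noisy = add_privacy_noise(emb, noise_scale=0.1, rng=random.Random(0))
print(len(noisy))  # 64
```

Higher `noise_scale` values spread the perturbation further from the original embedding, which is exactly the privacy/fidelity trade-off in the table above.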


## Logging

The SDK is silent by default. Enable logging to see training progress:

```python
dataxid.enable_logging("info")   # training progress, epoch stats
dataxid.enable_logging("debug")  # verbose — includes HTTP requests
dataxid.disable_logging()        # turn off
```

Or set the `DATAXID_LOG` environment variable (no code change needed):

```shell
DATAXID_LOG=info python my_script.py
```
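
Environment-variable log configuration typically works by reading the variable at import or first use and falling back to a quiet default when it is unset or invalid. A hypothetical sketch (the `resolve_log_level` function and the `"warning"` fallback are assumptions, not documented SDK behavior):

```python
import os

# Hypothetical sketch of honoring a log-level environment variable.
def resolve_log_level(default="warning"):
    level = os.environ.get("DATAXID_LOG", default).lower()
    return level if level in {"debug", "info", "warning"} else default

os.environ["DATAXID_LOG"] = "info"
print(resolve_log_level())  # info
```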

### Log levels

| Level | What you see |
| --- | --- |
| `info` | Model creation, training start/end, epoch count, generation stats |
| `debug` | Everything above + HTTP request/response details, encoder internals |
| `warning` | Only warnings and errors (e.g. generate failure with auto-cleanup) |

Sensitive headers (API keys, tokens) are automatically masked in log output.
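
Header masking of this kind usually replaces the values of a known set of sensitive header names before logging. An illustrative sketch (the `mask_headers` helper and the header list are assumptions, not the SDK's code):

```python
# Illustrative sketch: replace sensitive header values before they
# reach log output, leaving other headers untouched.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}

def mask_headers(headers):
    return {k: ("***" if k.lower() in SENSITIVE_HEADERS else v)
            for k, v in headers.items()}

print(mask_headers({"Authorization": "Bearer sk-123",
                    "Accept": "application/json"}))
# {'Authorization': '***', 'Accept': 'application/json'}
```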


## Advanced Parameters

These parameters are available for fine-tuning. Most users won't need to change them.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `val_split` | `float` | `0.1` | Fraction of data held out for validation (0.0–1.0) |
| `learning_rate` | `float \| None` | `None` | Initial learning rate. `None` = model-size-dependent default |
| `accumulation_steps` | `int` | `1` | Gradient accumulation steps — effectively increases batch size without more memory |
| `label_smoothing` | `float` | `0.0` | Label smoothing factor for cross-entropy loss |
| `embedding_dropout` | `float` | `0.5` | Dropout rate on embedding layer |
| `time_limit_seconds` | `float` | `0.0` | Wall-clock training time limit in seconds. `0.0` = no limit |
| `seed` | `int \| None` | `None` | Random seed for reproducibility |
| `timeout` | `float` | `14400.0` | Maximum wait time for server-side training (seconds). Raises `TrainingTimeoutError` if exceeded |
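
One way to read `accumulation_steps`: gradients from several micro-batches are summed before each optimizer step, so the effective batch size is the product of the two settings. A quick arithmetic sketch (values illustrative):

```python
# Gradient accumulation: the optimizer steps once per
# accumulation_steps micro-batches, so the effective batch size is
# the product of the two settings — without the memory cost of a
# single large batch.
batch_size = 256
accumulation_steps = 4
effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)  # 1024
```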

## Encoding Types

The SDK auto-detects column types (categorical, numeric, datetime). To override:

```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        encoding_types={
            "zip_code": "TABULAR_CATEGORICAL",      # treat as category, not number
            "event_date": "TABULAR_DATETIME",       # force datetime encoding
        },
    ),
)
```

| Encoding Type | When to use |
| --- | --- |
| `TABULAR_CATEGORICAL` | Low-cardinality columns, codes, flags |
| `TABULAR_NUMERIC_AUTO` | Continuous numeric values (auto-selects best numeric encoding) |
| `TABULAR_DATETIME` | Timestamps, dates |

### Datetime auto-detection

String columns with datetime-like names (`date`, `timestamp`, `created_at`, etc.) are automatically detected and encoded as datetime if 80%+ of values parse successfully.
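
The 80% rule can be pictured as: try to parse every value, then compare the success ratio to the threshold. A conceptual stdlib sketch (the `looks_like_datetime` helper and ISO-only parsing are illustrative simplifications, not the SDK's detector):

```python
from datetime import datetime

# Conceptual sketch of the 80% rule: a string column counts as
# datetime if at least `threshold` of its values parse.
def looks_like_datetime(values, threshold=0.8):
    parsed = 0
    for v in values:
        try:
            datetime.fromisoformat(v)
            parsed += 1
        except (TypeError, ValueError):
            pass
    return bool(values) and parsed / len(values) >= threshold

col = ["2024-01-01", "2024-02-15", "not a date", "2024-03-30", "2024-04-12"]
print(looks_like_datetime(col))  # True (4/5 = 80% parse)
```

Columns that fall below the threshold can still be forced to datetime via the `encoding_types` override shown above.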
