# Configuration

Model training config, privacy settings, logging, and advanced tuning.

## ModelConfig

Pass a `ModelConfig` to `synthesize()` or `Model.create()` to control training behavior. All parameters have sensible defaults — you only need to set what you want to change.

```python
import dataxid

synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        model_size="large",
        max_epochs=200,
        privacy_enabled=True,
    ),
)
```

A plain dict works too:

```python
synthetic = dataxid.synthesize(
    data=df,
    config={"model_size": "large", "max_epochs": 200},
)
```

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `embedding_dim` | `int` | `64` | Embedding size per row. Larger values capture more structure but increase training time |
| `model_size` | `str` | `"medium"` | Model capacity: `"small"`, `"medium"`, or `"large"` |
| `max_epochs` | `int` | `100` | Maximum training epochs. Training may stop earlier via early stopping |
| `batch_size` | `int` | `256` | Training batch size |
| `early_stop_patience` | `int` | `4` | Epochs without validation-loss improvement before stopping |
| `privacy_enabled` | `bool` | `False` | Add Gaussian noise to embeddings for differential privacy |
| `privacy_noise` | `float` | `0.1` | Noise scale (standard deviation) when `privacy_enabled=True` |
| `encoding_types` | `dict \| None` | `None` | Override auto-detected column encoding types |

### Model size guide

- **Small** — fast training; good for prototyping and small datasets (<1K rows).
- **Medium** — balanced; recommended for most use cases.
- **Large** — highest fidelity; best for large datasets (10K+ rows) where quality matters most.
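
The guide above maps naturally onto dataset size. As a minimal sketch (the `pick_model_size` helper is illustrative, not part of the SDK), you could choose a size by row count:

```python
# Illustrative heuristic (not an SDK function): map row count to the
# model-size guide above.
def pick_model_size(n_rows: int) -> str:
    if n_rows < 1_000:
        return "small"   # fast prototyping, small datasets
    if n_rows < 10_000:
        return "medium"  # balanced default
    return "large"       # highest fidelity on large data

print(pick_model_size(500), pick_model_size(5_000), pick_model_size(50_000))
# small medium large
```

The result can then be passed as `model_size` in a `ModelConfig`.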


## Privacy

The SDK encodes data locally. Only embeddings (64 floats per row) are sent to the API. Raw data does not reach the server.

For additional privacy guarantees, enable noise injection:

```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        privacy_enabled=True,
        privacy_noise=0.1,       # Gaussian std — higher = more privacy, less fidelity
    ),
)
```

| `privacy_noise` | Effect |
| --- | --- |
| `0.05` | Minimal noise — near-original fidelity |
| `0.1` | Default — good privacy/fidelity balance |
| `0.2+` | Strong noise — higher privacy, lower fidelity |

Even without `privacy_enabled`, the API receives embeddings, not raw values.
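
Conceptually, noise injection means adding zero-mean Gaussian noise to each of the 64 floats in a row embedding before it leaves the machine. A minimal stdlib sketch of the idea (not the SDK's actual implementation):

```python
import random

def add_privacy_noise(embedding, noise_scale=0.1, rng=None):
    # Conceptual sketch: perturb each float with zero-mean Gaussian
    # noise. With noise_scale=0.0 the embedding is unchanged.
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, noise_scale) for x in embedding]

emb = [0.0] * 64                         # one row's 64-float embedding
noisy = add_privacy_noise(emb, noise_scale=0.1, rng=random.Random(0))
print(len(noisy))  # 64
```

Higher `noise_scale` values spread the perturbation further from the original embedding, which is exactly the privacy/fidelity trade-off in the table above.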


## Logging

The SDK is silent by default. Enable logging to see training progress:

```python
dataxid.enable_logging("info")   # training progress, epoch stats
dataxid.enable_logging("debug")  # verbose — includes HTTP requests
dataxid.disable_logging()        # turn off
```

Or set the `DATAXID_LOG` environment variable (no code change needed):

```shell
DATAXID_LOG=info python my_script.py
```
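
Environment-variable log configuration typically works by reading the variable at import or first use and falling back to a quiet default when it is unset or invalid. A hypothetical sketch (the `resolve_log_level` function and the `"warning"` fallback are assumptions, not documented SDK behavior):

```python
import os

# Hypothetical sketch of honoring a log-level environment variable.
def resolve_log_level(default="warning"):
    level = os.environ.get("DATAXID_LOG", default).lower()
    return level if level in {"debug", "info", "warning"} else default

os.environ["DATAXID_LOG"] = "info"
print(resolve_log_level())  # info
```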

### Log levels

| Level | What you see |
| --- | --- |
| `info` | Model creation, training start/end, epoch count, generation stats |
| `debug` | Everything above + HTTP request/response details, encoder internals |
| `warning` | Only warnings and errors (e.g. generate failure with auto-cleanup) |

Sensitive headers (API keys, tokens) are automatically masked in log output.
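
Header masking of this kind usually replaces the values of a known set of sensitive header names before logging. An illustrative sketch (the `mask_headers` helper and the header list are assumptions, not the SDK's code):

```python
# Illustrative sketch: replace sensitive header values before they
# reach log output, leaving other headers untouched.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}

def mask_headers(headers):
    return {k: ("***" if k.lower() in SENSITIVE_HEADERS else v)
            for k, v in headers.items()}

print(mask_headers({"Authorization": "Bearer sk-123",
                    "Accept": "application/json"}))
# {'Authorization': '***', 'Accept': 'application/json'}
```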


## Advanced Parameters

These parameters are available for fine-tuning. Most users won't need to change them.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `val_split` | `float` | `0.1` | Fraction of data held out for validation (0.0–1.0) |
| `learning_rate` | `float \| None` | `None` | Initial learning rate. `None` = model-size-dependent default |
| `accumulation_steps` | `int` | `1` | Gradient accumulation steps — effectively increases batch size without more memory |
| `label_smoothing` | `float` | `0.0` | Label smoothing factor for cross-entropy loss |
| `embedding_dropout` | `float` | `0.5` | Dropout rate on embedding layer |
| `time_limit_seconds` | `float` | `0.0` | Wall-clock training time limit in seconds. `0.0` = no limit |
| `seed` | `int \| None` | `None` | Random seed for reproducibility |
| `timeout` | `float` | `14400.0` | Maximum wait time for server-side training (seconds). Raises `TrainingTimeoutError` if exceeded |
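
One way to read `accumulation_steps`: gradients from several micro-batches are summed before each optimizer step, so the effective batch size is the product of the two settings. A quick arithmetic sketch (values illustrative):

```python
# Gradient accumulation: the optimizer steps once per
# accumulation_steps micro-batches, so the effective batch size is
# the product of the two settings — without the memory cost of a
# single large batch.
batch_size = 256
accumulation_steps = 4
effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)  # 1024
```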

## Encoding Types

The SDK auto-detects column types (categorical, numeric, datetime). To override:

```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        encoding_types={
            "zip_code": "TABULAR_CATEGORICAL",      # treat as category, not number
            "event_date": "TABULAR_DATETIME",       # force datetime encoding
        },
    ),
)
```

| Encoding Type | When to use |
| --- | --- |
| `TABULAR_CATEGORICAL` | Low-cardinality columns, codes, flags |
| `TABULAR_NUMERIC_AUTO` | Continuous numeric values (auto-selects best numeric encoding) |
| `TABULAR_DATETIME` | Timestamps, dates |

### Datetime auto-detection

String columns with datetime-like names (`date`, `timestamp`, `created_at`, etc.) are automatically detected and encoded as datetime if 80%+ of values parse successfully.
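
The 80% rule can be pictured as: try to parse every value, then compare the success ratio to the threshold. A conceptual stdlib sketch (the `looks_like_datetime` helper and ISO-only parsing are illustrative simplifications, not the SDK's detector):

```python
from datetime import datetime

# Conceptual sketch of the 80% rule: a string column counts as
# datetime if at least `threshold` of its values parse.
def looks_like_datetime(values, threshold=0.8):
    parsed = 0
    for v in values:
        try:
            datetime.fromisoformat(v)
            parsed += 1
        except (TypeError, ValueError):
            pass
    return bool(values) and parsed / len(values) >= threshold

col = ["2024-01-01", "2024-02-15", "not a date", "2024-03-30", "2024-04-12"]
print(looks_like_datetime(col))  # True (4/5 = 80% parse)
```

Columns that fall below the threshold can still be forced to datetime via the `encoding_types` override shown above.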
