# Configuration
Model training config, privacy settings, logging, and advanced tuning.
## ModelConfig
Pass a `ModelConfig` to `synthesize()` or `Model.create()` to control training behavior.
All parameters have sensible defaults — you only need to set what you want to change.
```python
import dataxid

synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        model_size="large",
        max_epochs=200,
        privacy_enabled=True,
    ),
)
```

A plain dict works too:
```python
synthetic = dataxid.synthesize(
    data=df,
    config={"model_size": "large", "max_epochs": 200},
)
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| embedding_dim | int | 64 | Embedding size per row. Larger values capture more structure but increase training time |
| model_size | str | "medium" | Model capacity: "small", "medium", or "large" |
| max_epochs | int | 100 | Maximum training epochs. Training may stop earlier via early stopping |
| batch_size | int | 256 | Training batch size |
| early_stop_patience | int | 4 | Epochs without validation loss improvement before stopping |
| privacy_enabled | bool | False | Add Gaussian noise to embeddings for differential privacy |
| privacy_noise | float | 0.1 | Noise scale (standard deviation) when privacy_enabled=True |
| encoding_types | dict \| None | None | Override auto-detected column encoding types |
### Model size guide
- **Small** — fast training; good for prototyping and small datasets (<1K rows).
- **Medium** — balanced; recommended for most use cases.
- **Large** — highest fidelity; best for large datasets (10K+ rows) where quality matters most.
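The size guide above can be encoded as a small helper if you pick the size programmatically (a hypothetical convenience function, not part of the SDK):

```python
def suggest_model_size(n_rows: int) -> str:
    """Map dataset size to a model_size value, per the guide above."""
    if n_rows < 1_000:
        return "small"   # fast prototyping
    if n_rows < 10_000:
        return "medium"  # balanced default
    return "large"       # highest fidelity for large data

print(suggest_model_size(500))      # -> small
print(suggest_model_size(50_000))   # -> large
```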
## Privacy
The SDK encodes data locally. Only embeddings (64 floats per row) are sent to the API. Raw data does not reach the server.
For additional privacy guarantees, enable noise injection:
```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        privacy_enabled=True,
        privacy_noise=0.1,  # Gaussian std — higher = more privacy, less fidelity
    ),
)
```

| privacy_noise | Effect |
|---|---|
| 0.05 | Minimal noise — near-original fidelity |
| 0.1 | Default — good privacy/fidelity balance |
| 0.2+ | Strong noise — higher privacy, lower fidelity |
Even without privacy_enabled, the API receives embeddings, not raw values.
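To build intuition for the mechanism, here is an illustrative sketch of Gaussian noise injection on a 64-float embedding (NumPy stand-in, not the SDK's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=64)  # stand-in for one row's 64-float embedding

privacy_noise = 0.1  # same scale as the ModelConfig default
noisy = embedding + rng.normal(0.0, privacy_noise, size=64)

# The expected distance grows with the noise scale: roughly std * sqrt(64).
print(float(np.linalg.norm(noisy - embedding)))
```

Doubling `privacy_noise` roughly doubles how far each noisy embedding drifts from the original, which is the privacy/fidelity trade-off in the table above.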
## Logging
The SDK is silent by default. Enable logging to see training progress:
```python
dataxid.enable_logging("info")   # training progress, epoch stats
dataxid.enable_logging("debug")  # verbose — includes HTTP requests
dataxid.disable_logging()        # turn off
```

Or set the `DATAXID_LOG` environment variable (no code change needed):

```shell
DATAXID_LOG=info python my_script.py
```

### Log levels
| Level | What you see |
|---|---|
| info | Model creation, training start/end, epoch count, generation stats |
| debug | Everything above + HTTP request/response details, encoder internals |
| warning | Only warnings and errors (e.g. generate failure with auto-cleanup) |
Sensitive headers (API keys, tokens) are automatically masked in log output.
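Header masking of this kind can be pictured as a log filter that redacts values before a record is emitted (an illustrative sketch using the standard library, not the SDK's implementation):

```python
import logging
import re

class MaskAuthFilter(logging.Filter):
    """Redact Authorization-style header values in log messages."""
    PATTERN = re.compile(r"(Authorization:\s*)(\S.*)")

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = self.PATTERN.sub(r"\1***", str(record.msg))
        return True  # keep the record, just with the value masked

logger = logging.getLogger("masking-demo")
handler = logging.StreamHandler()
handler.addFilter(MaskAuthFilter())
logger.addHandler(handler)

logger.warning("Request headers: Authorization: Bearer abc123")
# Logged as: Request headers: Authorization: ***
```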
## Advanced Parameters
These parameters are available for fine-tuning. Most users won't need to change them.
| Parameter | Type | Default | Description |
|---|---|---|---|
| val_split | float | 0.1 | Fraction of data held out for validation (0.0–1.0) |
| learning_rate | float \| None | None | Initial learning rate. None = model-size-dependent default |
| accumulation_steps | int | 1 | Gradient accumulation steps — effectively increases batch size without more memory |
| label_smoothing | float | 0.0 | Label smoothing factor for cross-entropy loss |
| embedding_dropout | float | 0.5 | Dropout rate on embedding layer |
| time_limit_seconds | float | 0.0 | Wall-clock training time limit in seconds. 0.0 = no limit |
| seed | int \| None | None | Random seed for reproducibility |
| timeout | float | 14400.0 | Maximum wait time for server-side training (seconds). Raises TrainingTimeoutError if exceeded |
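For example, a reproducible, time-bounded run might combine seed, time_limit_seconds, and timeout (a sketch using only the parameters documented above; the exact import location of TrainingTimeoutError is an assumption):

```python
import dataxid

try:
    synthetic = dataxid.synthesize(
        data=df,
        config=dataxid.ModelConfig(
            seed=42,                   # reproducible runs
            time_limit_seconds=600.0,  # cap server-side training at 10 minutes
            timeout=1800.0,            # client gives up waiting after 30 minutes
        ),
    )
except dataxid.TrainingTimeoutError:
    print("Training exceeded the client-side timeout")
```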
## Encoding Types
The SDK auto-detects column types (categorical, numeric, datetime). To override:
```python
synthetic = dataxid.synthesize(
    data=df,
    config=dataxid.ModelConfig(
        encoding_types={
            "zip_code": "TABULAR_CATEGORICAL",  # treat as category, not number
            "event_date": "TABULAR_DATETIME",   # force datetime encoding
        },
    ),
)
```

| Encoding Type | When to use |
|---|---|
| TABULAR_CATEGORICAL | Low-cardinality columns, codes, flags |
| TABULAR_NUMERIC_AUTO | Continuous numeric values (auto-selects best numeric encoding) |
| TABULAR_DATETIME | Timestamps, dates |
### Datetime auto-detection
String columns with datetime-like names (date, timestamp, created_at, etc.)
are automatically detected and encoded as datetime if 80%+ of values parse successfully.
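The 80% threshold can be illustrated with pandas (a sketch of the heuristic, not the SDK's actual detection code):

```python
import pandas as pd

def looks_like_datetime(values: pd.Series, threshold: float = 0.8) -> bool:
    """Return True if at least `threshold` of the values parse as datetimes."""
    parsed = pd.to_datetime(values, errors="coerce")  # unparseable -> NaT
    return parsed.notna().mean() >= threshold

col = pd.Series(["2024-01-05", "2024-02-10", "not a date",
                 "2024-03-01", "2024-04-12"])
print(looks_like_datetime(col))  # 4 of 5 parse (80%) -> True
```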