DataXID

Quickstart

Generate your first synthetic data in 5 minutes.

1. Install the SDK

pip install dataxid

The SDK encodes data locally before sending to the API.

2. Set Your API Key

import dataxid

dataxid.api_key = "dx_test_..."

Or set the environment variable (useful for CI/CD):

export DATAXID_API_KEY="dx_test_..."
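In your own scripts you may want to accept a key argument but fall back to the environment variable. The helper below is a hypothetical convenience function, not part of the SDK; it only shows the fallback pattern:

```python
import os

def resolve_api_key(explicit_key=None):
    """Return the explicit key if given, else fall back to DATAXID_API_KEY."""
    key = explicit_key or os.environ.get("DATAXID_API_KEY")
    if key is None:
        raise RuntimeError("No API key: pass one or set DATAXID_API_KEY")
    return key
```

You could then write `dataxid.api_key = resolve_api_key()` so local runs and CI/CD share one code path.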

API Keys

Sign up at app.dataxid.com to get your API key.

3. Generate Synthetic Data

One-liner (small datasets)

import dataxid
import pandas as pd

dataxid.api_key = "dx_test_..."
df = pd.read_csv("customers.csv")

synthetic = dataxid.synthesize(data=df, n_samples=1000)
print(synthetic.head())

That's it. Behind the scenes:

  1. SDK processes your data locally → abstract embeddings
  2. Embeddings are sent to the API → model trains on the statistical structure
  3. Synthetic data is generated and returned as a DataFrame

Only embeddings cross the wire — not raw data.

Step-by-step (large datasets, custom config)

For more control over training and generation:

import dataxid
import pandas as pd

dataxid.api_key = "dx_test_..."
df = pd.read_csv("transactions.csv")

model = dataxid.Model.create(
    data=df,
    config=dataxid.ModelConfig(
        model_size="large",
        max_epochs=200,
    ),
)

synthetic_1k = model.generate(n_samples=1000)
synthetic_10k = model.generate(n_samples=10000)

model.delete()
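Because `generate()` can fail partway through, it is worth guaranteeing that `delete()` still runs. A minimal sketch of that create/generate/delete lifecycle, using a stand-in object rather than the real SDK so the pattern is clear in isolation:

```python
import contextlib

class _StubModel:
    """Stand-in for a remote model, used only to illustrate the lifecycle."""
    def __init__(self):
        self.deleted = False
    def generate(self, n_samples):
        return ["row"] * n_samples
    def delete(self):
        self.deleted = True

@contextlib.contextmanager
def managed_model(model):
    # Yield the model, guaranteeing delete() runs even if generation raises.
    try:
        yield model
    finally:
        model.delete()

with managed_model(_StubModel()) as m:
    rows = m.generate(n_samples=3)
```

The same `try`/`finally` shape applies directly to `dataxid.Model.create(...)` followed by `model.generate(...)` and `model.delete()`.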

Plain dict also works for quick experiments:

model = dataxid.Model.create(
    data=df,
    config={"model_size": "large", "max_epochs": 200},
)

Have multiple tables linked by foreign keys? synthesize_tables() generates them together while keeping the references valid:


from dataxid import Table

accounts = Table(accounts_df, primary_key="account_id")
transactions = Table(transactions_df, foreign_keys={"account_id": accounts})

synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": transactions,
})

synthetic["accounts"]       # synthetic accounts with auto-assigned PKs
synthetic["transactions"]   # synthetic transactions — per-account patterns preserved

Child tables are generated sequentially by default — the model learns per-entity patterns (transaction counts, temporal ordering, value distributions).
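A quick way to sanity-check the returned tables is to verify referential integrity and per-entity counts yourself with pandas. The DataFrames below are stand-ins for the `synthesize_tables()` output, not real SDK results:

```python
import pandas as pd

# Stand-in parent/child tables shaped like the synthesize_tables() output.
accounts = pd.DataFrame({"account_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "txn_id": [10, 11, 12, 13],
    "account_id": [1, 1, 2, 3],
    "amount": [25.0, 9.5, 120.0, 3.2],
})

# Every child foreign key should resolve to a parent primary key.
fk_valid = transactions["account_id"].isin(accounts["account_id"]).all()

# Per-account transaction counts -- one of the per-entity patterns preserved.
counts = transactions.groupby("account_id").size()
```

If `fk_valid` is ever False on real output, that is worth reporting, since valid references are part of the multi-table contract described above.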

See Multi-Table Synthesis for fan-out schemas, N-parent tables, and the low-level Model.create() API.

4. Enable Logging

See what the SDK is doing during training:

dataxid.enable_logging("info")   # training progress, epoch stats
dataxid.enable_logging("debug")  # verbose — includes HTTP requests
dataxid.disable_logging()        # turn off (default state)

Or via environment variable:

DATAXID_LOG=info python my_script.py
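If you want SDK output routed through your application's own handlers, the standard logging module pattern applies. This assumes the SDK emits records on a logger named "dataxid", which is a guess based on convention, not something documented here:

```python
import logging

# Assumption: the SDK logs to a logger named "dataxid".
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger = logging.getLogger("dataxid")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

With a handler attached this way, SDK records flow into whatever sinks the rest of your application uses.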

Next Steps
