DataXID

Multi-Table Synthesis

Synthesize related tables with referential integrity — foreign keys, sequential generation, and full database synthesis.

DataXID can synthesize entire databases, not just single tables. Define your schema with Table objects, and synthesize_tables() handles dependency ordering, training, primary key assignment, and foreign key remapping automatically.

import dataxid
from dataxid import Table

accounts = Table(accounts_df, primary_key="account_id")
transactions = Table(transactions_df, foreign_keys={"account_id": accounts})

synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": transactions,
})

Child tables are generated sequentially by default — preserving per-entity patterns (transaction counts, ordering, value distributions).


Table Class

Table wraps a DataFrame with schema information for multi-table synthesis.

from dataxid import Table

Table(
    data=df,                          # Training DataFrame
    primary_key="id",                 # Excluded from training, auto-assigned after generation
    foreign_keys={"fk_col": parent},  # FK column → parent Table object
    sequential=True,                  # Sequential generation (default: True)
    sequence_by="fk_col",             # Which FK to use for sequential context (N-parent only)
)

Parameters

Parameter     Type              Default   Description
data          DataFrame         required  Training data for this table
primary_key   str | None        None      PK column. Excluded from training and auto-assigned as a 1-based integer after generation
foreign_keys  dict[str, Table]  {}        Maps each FK column name to its parent Table object. Passing Table objects rather than names gives IDE autocomplete and type checking, and a typo raises an error immediately
sequential    bool              True      When True and foreign_keys is set, child rows are generated conditioned on the parent, preserving correlations. Set to False for independent generation with FK remapping only
sequence_by   str | None        None      Required when a table has multiple foreign keys and sequential=True. Specifies which FK relationship drives the sequential generation

Two Tables

The simplest multi-table case: a parent and a child with a foreign key.

import dataxid
import pandas as pd
from dataxid import Table

dataxid.api_key = "dx_..."

accounts = pd.read_csv("accounts.csv")
transactions = pd.read_csv("transactions.csv")

accounts_tbl = Table(accounts, primary_key="account_id")
transactions_tbl = Table(transactions, foreign_keys={"account_id": accounts_tbl})

synthetic = dataxid.synthesize_tables({
    "accounts": accounts_tbl,
    "transactions": transactions_tbl,
})

synthetic["accounts"]       # synthetic accounts with auto-assigned PKs
synthetic["transactions"]   # synthetic transactions with valid FK references

What happens behind the scenes:

  1. Tables are sorted in dependency order (parents first)
  2. Accounts are trained and generated as a flat table
  3. account_id is excluded from training and auto-assigned (1, 2, 3, ...)
  4. Transactions are trained with accounts as context — the model learns per-account patterns
  5. Transactions are generated conditioned on synthetic accounts
  6. FK values in transactions reference valid synthetic account PKs

Referential integrity

All generated FK values reference valid parent PKs.
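
One way to confirm this on your own output is a plain pandas membership check. The sketch below uses two small hand-made DataFrames as stand-ins for synthetic["accounts"] and synthetic["transactions"]; the helper name orphaned_fk_rows is hypothetical and not part of the SDK.

```python
import pandas as pd


def orphaned_fk_rows(child: pd.DataFrame, parent: pd.DataFrame,
                     fk: str, pk: str) -> pd.DataFrame:
    """Return the child rows whose FK value has no matching parent PK."""
    return child[~child[fk].isin(parent[pk])]


# Stand-ins for synthetic["accounts"] and synthetic["transactions"].
syn_accounts = pd.DataFrame({"account_id": [1, 2, 3], "tier": ["a", "b", "a"]})
syn_transactions = pd.DataFrame(
    {"account_id": [1, 1, 2, 3, 3, 3], "amount": [10.0, 5.5, 7.0, 1.0, 2.0, 3.0]}
)

orphans = orphaned_fk_rows(syn_transactions, syn_accounts,
                           fk="account_id", pk="account_id")
print(len(orphans))  # 0 when every FK points at an existing PK
```

An empty result means every child row points at an existing parent row. The same helper works for any parent/child pair by changing the fk and pk arguments.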


Three or More Tables (Fan-Out)

Multiple child tables can reference the same parent. Each child is trained and generated independently of the other children, and all of them reference the same synthetic parent.

accounts = Table(accounts_df, primary_key="account_id")

synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": Table(transactions_df,
                          foreign_keys={"account_id": accounts}),
    "loans": Table(loans_df, primary_key="loan_id",
                   foreign_keys={"account_id": accounts}),
})

synthetic["accounts"]       # 1 synthetic parent table
synthetic["transactions"]   # sequential child — correlated with accounts
synthetic["loans"]          # sequential child — correlated with accounts

Generation order is determined automatically via topological sort. Circular dependencies are detected and rejected at validation time.
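
To illustrate what dependency ordering means here, the following sketch uses Python's standard-library graphlib on the fan-out schema above. It is not the SDK's implementation, only a minimal illustration of topological sorting and cycle detection; the dependencies dict is written by hand for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Map each table name to the set of parent tables it references.
dependencies = {
    "accounts": set(),
    "transactions": {"accounts"},
    "loans": {"accounts"},
}

# static_order() yields every table after all of its parents.
order = list(TopologicalSorter(dependencies).static_order())
print(order[0])  # accounts

# Two tables that reference each other cannot be ordered.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)  # True
```

The relative order of transactions and loans is not significant, since neither depends on the other.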


N-Parent Tables

When a table has foreign keys to multiple parents, use sequence_by to specify which relationship drives the sequential generation. Every other FK is remapped for referential integrity but does not influence the generation pattern.

customers = Table(customers_df, primary_key="customer_id")
products = Table(products_df, primary_key="product_id")

orders = Table(
    orders_df,
    foreign_keys={"customer_id": customers, "product_id": products},
    sequence_by="customer_id",  # generate order sequences per customer
)

synthetic = dataxid.synthesize_tables({
    "customers": customers,
    "products": products,
    "orders": orders,
})

In this schema:

  • customer_id → sequential context (order patterns per customer are preserved)
  • product_id → FK remapped to valid synthetic product PKs

sequence_by is required for multiple FKs

If a table has more than one foreign key and sequential=True, you must specify sequence_by. The SDK raises an error with the available options if omitted.


Flat Remap (No Correlation)

By default, foreign keys trigger sequential generation — the child table learns per-entity patterns from the parent. If you only need referential integrity without correlation (rare), set sequential=False:

products = Table(
    products_df,
    foreign_keys={"category_id": categories},
    sequential=False,
)

The table is generated independently. FK values are remapped to valid parent PKs after generation, but the model does not learn category-level patterns.
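
The page does not describe how the remapping picks parent PKs. As a rough mental model only, the sketch below assumes each FK value is drawn uniformly at random from the synthetic parent's PKs; that sampling strategy is an assumption made for illustration, and the two DataFrames are hand-made stand-ins for generated tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-ins for an already generated parent and an independently generated child.
syn_categories = pd.DataFrame({"category_id": [1, 2, 3]})
syn_products = pd.DataFrame({"price": [9.99, 4.50, 12.00, 7.25]})

# Assumption: assign each child row a parent PK chosen uniformly at random.
syn_products["category_id"] = rng.choice(
    syn_categories["category_id"].to_numpy(), size=len(syn_products)
)

# Every FK is valid, but nothing ties price to a particular category.
all_valid = syn_products["category_id"].isin(syn_categories["category_id"]).all()
print(all_valid)  # True
```

This is why flat remapping keeps referential integrity while losing any relationship between child columns and the parent they point at.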


Low-Level API

synthesize_tables() is a convenience wrapper. For fine-grained control over individual tables (custom config per table, reusing a trained model for multiple generations), use Model.create() directly:

# High-level version, shown for comparison:
accounts = Table(accounts_df, primary_key="account_id")
synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": Table(transactions_df,
                          foreign_keys={"account_id": accounts}),
})

# The corresponding steps with Model.create():
# Step 1: Train and generate accounts (flat)
acct_model = dataxid.Model.create(data=accounts_df.drop(columns=["account_id"]))
syn_accounts = acct_model.generate(n_samples=len(accounts_df))
syn_accounts.insert(0, "account_id", range(1, len(syn_accounts) + 1))
acct_model.delete()

# Step 2: Train transactions with accounts as parent (sequential)
tx_model = dataxid.Model.create(
    data=transactions_df,
    parent=accounts_df,
    foreign_key="account_id",
)
syn_transactions = tx_model.generate(parent=syn_accounts)
tx_model.delete()

Model.create() Parameters for Sequential

Parameter              Type         Description
parent                 DataFrame    Parent table for context-aware generation
foreign_key            str          FK column in data linking rows to parent; enables sequential mode
parent_key             str | None   PK column in parent (inferred from foreign_key if column names match)
parent_encoding_types  dict | None  Encoding overrides for parent columns

Validation Rules

The SDK validates your schema before training starts. All errors are raised as dataxid.InvalidRequestError with a descriptive message and the offending parameter.

Rule                               Error
foreign_keys value is not a Table  foreign_keys values must be Table instances
FK column not in DataFrame         Column 'X' not found in DataFrame columns
Parent Table has no primary_key    Referenced table must have a primary_key defined
Circular dependency                Circular dependency detected
Multiple FKs + no sequence_by      Table has N foreign keys. Use sequence_by to specify which relationship to use
sequential=False + sequence_by     sequence_by and sequential=False are mutually exclusive
FK target not in tables dict       references a Table object that is not in the tables dict
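
To make the two sequence_by rules concrete, here is a standalone restatement in plain Python. It is a hypothetical sketch, not the SDK's code: the function names check_sequence_settings and error_for are invented for this example, and it raises ValueError where the SDK raises dataxid.InvalidRequestError.

```python
from typing import Optional


def check_sequence_settings(foreign_keys: list[str], sequential: bool = True,
                            sequence_by: Optional[str] = None) -> None:
    """Hypothetical restatement of the two sequence_by rules listed above."""
    if sequence_by is not None and not sequential:
        raise ValueError("sequence_by and sequential=False are mutually exclusive")
    if sequential and len(foreign_keys) > 1 and sequence_by is None:
        raise ValueError(
            f"Table has {len(foreign_keys)} foreign keys. "
            f"Use sequence_by to specify which relationship to use: {foreign_keys}"
        )


def error_for(**kwargs) -> Optional[str]:
    """Return the error message for a configuration, or None if it is valid."""
    try:
        check_sequence_settings(**kwargs)
        return None
    except ValueError as exc:
        return str(exc)


# Two FKs without sequence_by is rejected; naming one of them is accepted.
print(error_for(foreign_keys=["customer_id", "product_id"]))
print(error_for(foreign_keys=["customer_id", "product_id"],
                sequence_by="customer_id"))
```

Checking these combinations before any training starts is what lets a schema mistake fail fast rather than after a long run.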

How It Works

  1. Topological sort — Tables are ordered so parents are always processed before children
  2. PK exclusion — Primary key columns are dropped before training (the model doesn't learn ID patterns)
  3. Sequential training — Child tables are trained with parent data as context, learning per-entity distributions (transaction counts, value ranges, temporal patterns)
  4. Generation — Parents first, then children conditioned on synthetic parents
  5. PK auto-assignment — 1-based auto-increment integers assigned after generation
  6. FK remapping — Foreign keys in child tables are mapped to valid synthetic parent PKs
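
Since sequential training is described as learning per-entity distributions such as transaction counts, one simple sanity check is to compare how many child rows each parent has in the real and synthetic data. The sketch below is a pandas-only illustration on hand-made data; the helper rows_per_parent is hypothetical, and nothing here implies the two distributions will match exactly on real output.

```python
import pandas as pd


def rows_per_parent(child: pd.DataFrame, fk: str) -> pd.Series:
    """Count how many parents have k child rows, indexed by k.

    Parents with no child rows do not appear in the child table,
    so they are not counted here.
    """
    return child.groupby(fk).size().value_counts().sort_index()


# Hand-made stand-ins for the real and synthetic transactions tables.
real_tx = pd.DataFrame({"account_id": [10, 10, 10, 20, 30, 30]})
syn_tx = pd.DataFrame({"account_id": [1, 1, 2, 2, 2, 3]})

real_counts = rows_per_parent(real_tx, "account_id")
syn_counts = rows_per_parent(syn_tx, "account_id")

# Side-by-side view: index is rows per parent, values are number of parents.
comparison = pd.DataFrame({"real": real_counts, "synthetic": syn_counts}).fillna(0)
print(comparison)
```

The PK values themselves differ between the two tables by design; only the shape of the counts is being compared.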
