Multi-Table Synthesis

Synthesize related tables with referential integrity — foreign keys, sequential generation, and full database synthesis.

Dataxid can synthesize entire databases — not just single tables. Define your schema with Table objects, and synthesize_tables() handles dependency ordering, training, primary key assignment, and foreign key remapping automatically.

import dataxid
from dataxid import Table

accounts = Table(accounts_df, primary_key="account_id")
transactions = Table(transactions_df, foreign_keys={"account_id": accounts})

synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": transactions,
})

Child tables are generated sequentially by default — preserving per-entity patterns (transaction counts, ordering, value distributions).

Table Class

Table wraps a DataFrame with schema information for multi-table synthesis.

from dataxid import Table

Table(
    data=df,                          # Training DataFrame
    primary_key="id",                 # Excluded from training, auto-assigned after generation
    foreign_keys={"fk_col": parent},  # FK column → parent Table object
    sequential=True,                  # Sequential generation (default: True)
    sequence_by="fk_col",             # Which FK to use for sequential context (N-parent only)
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `data` | `DataFrame` | required | Training data for this table |
| `primary_key` | `str \| None` | `None` | PK column. Excluded from training and auto-assigned as a 1-based integer after generation |
| `foreign_keys` | `dict[str, Table]` | `{}` | Maps each FK column name to its parent `Table` object. Because parents are referenced as objects, you get IDE autocomplete and type checking, and a typo raises an error immediately |
| `sequential` | `bool` | `True` | When `True` and `foreign_keys` is set, child rows are generated conditioned on the parent, which preserves correlations. Set to `False` for independent generation with FK remapping only |
| `sequence_by` | `str \| None` | `None` | Required when a table has multiple foreign keys and `sequential=True`. Specifies which FK relationship drives the sequential generation |

Two Tables

The simplest multi-table case: a parent and a child with a foreign key.

import dataxid
import pandas as pd
from dataxid import Table

dataxid.api_key = "dx_..."

accounts = pd.read_csv("accounts.csv")
transactions = pd.read_csv("transactions.csv")

accounts_tbl = Table(accounts, primary_key="account_id")
transactions_tbl = Table(transactions, foreign_keys={"account_id": accounts_tbl})

synthetic = dataxid.synthesize_tables({
    "accounts": accounts_tbl,
    "transactions": transactions_tbl,
})

synthetic["accounts"]       # synthetic accounts with auto-assigned PKs
synthetic["transactions"]   # synthetic transactions with valid FK references

What happens behind the scenes:

1. Tables are sorted in dependency order (parents first)
2. Accounts are trained and generated as a flat table
3. account_id is excluded from training and auto-assigned (1, 2, 3, ...)
4. Transactions are trained with accounts as context, so the model learns per-account patterns
5. Transactions are generated conditioned on synthetic accounts
6. FK values in transactions reference valid synthetic account PKs

Referential integrity

All generated FK values reference valid parent PKs.
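
To confirm this guarantee on your own output, a small check along the following lines can help. This is a minimal sketch, not part of the Dataxid SDK: `orphaned_foreign_keys` is a hypothetical helper, and the lists are toy stand-ins for the key columns of the synthetic tables.

```python
from typing import Hashable, Iterable


def orphaned_foreign_keys(
    child_fk_values: Iterable[Hashable],
    parent_pk_values: Iterable[Hashable],
) -> set:
    """Return the FK values that have no matching parent PK."""
    valid = set(parent_pk_values)
    return {fk for fk in child_fk_values if fk not in valid}


# Toy stand-ins for synthetic["accounts"]["account_id"] and
# synthetic["transactions"]["account_id"].
account_ids = [1, 2, 3]
transaction_account_ids = [1, 1, 2, 3, 3, 3]

# An empty result means every child row points at an existing parent.
assert orphaned_foreign_keys(transaction_account_ids, account_ids) == set()
```

The helper accepts any iterable, so the key columns of the returned tables can be passed in directly.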

Three or More Tables (Fan-Out)

Multiple child tables can reference the same parent. Each child is trained and generated independently of its siblings, and all of them reference the same synthetic parent.

accounts = Table(accounts_df, primary_key="account_id")

synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": Table(transactions_df,
                          foreign_keys={"account_id": accounts}),
    "loans": Table(loans_df, primary_key="loan_id",
                   foreign_keys={"account_id": accounts}),
})

synthetic["accounts"]       # 1 synthetic parent table
synthetic["transactions"]   # sequential child — correlated with accounts
synthetic["loans"]          # sequential child — correlated with accounts

Generation order is determined automatically via topological sort. Circular dependencies are detected and rejected at validation time.
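
The ordering and cycle check can be illustrated with Python's standard `graphlib` module. This is a sketch of the general technique only; how the SDK implements it internally is not documented here, and `generation_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order tables so every parent comes before its children.

    `dependencies` maps each table name to the names of the parent tables
    it references through foreign keys.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"Circular dependency detected: {exc.args[1]}") from exc


order = generation_order({
    "accounts": set(),
    "transactions": {"accounts"},
    "loans": {"accounts"},
})
assert order[0] == "accounts"  # the shared parent is always first
```

The relative order of the two children is not significant, because neither depends on the other.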

N-Parent Tables

When a table has foreign keys to multiple parents, use sequence_by to specify which relationship drives the sequential generation. The other FK is remapped for referential integrity but does not influence the generation pattern.

customers = Table(customers_df, primary_key="customer_id")
products = Table(products_df, primary_key="product_id")

orders = Table(
    orders_df,
    foreign_keys={"customer_id": customers, "product_id": products},
    sequence_by="customer_id",  # generate order sequences per customer
)

synthetic = dataxid.synthesize_tables({
    "customers": customers,
    "products": products,
    "orders": orders,
})

- customer_id → sequential context (order patterns per customer are preserved)
- product_id → FK remapped to valid synthetic product PKs

sequence_by is required for multiple FKs

If a table has more than one foreign key and sequential=True, you must specify sequence_by. The SDK raises an error with the available options if omitted.

Flat Remap (No Correlation)

By default, foreign keys trigger sequential generation — the child table learns per-entity patterns from the parent. If you only need referential integrity without correlation (rare), set sequential=False:

products = Table(
    products_df,
    foreign_keys={"category_id": categories},
    sequential=False,
)

The table is generated independently. FK values are remapped to valid parent PKs after generation, but the model does not learn category-level patterns.
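
Conceptually, a flat remap only has to give each generated child row some valid parent key. The sketch below shows one way this could be done, by drawing parent PKs uniformly at random. Uniform sampling is an assumption made for illustration; the strategy Dataxid actually uses is not described here, and `remap_foreign_keys` is a hypothetical helper.

```python
import random


def remap_foreign_keys(n_child_rows: int, parent_pks: list[int], seed: int = 0) -> list[int]:
    """Assign each child row a parent PK drawn uniformly at random.

    Uniform sampling is an assumption of this sketch, not documented
    SDK behavior.
    """
    if not parent_pks:
        raise ValueError("parent_pks must not be empty")
    rng = random.Random(seed)
    return [rng.choice(parent_pks) for _ in range(n_child_rows)]


category_ids = [1, 2, 3]  # PKs of the synthetic parent table
product_category_ids = remap_foreign_keys(n_child_rows=10, parent_pks=category_ids)

# Every remapped value is a valid parent PK, but carries no learned pattern.
assert set(product_category_ids) <= set(category_ids)
```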

Low-Level API

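synthesize_tables() is a convenience wrapper. For fine-grained control over individual tables (custom config per table, reusing a trained model for multiple generations), use Model.create() directly. The synthesize_tables() call below is shown first for comparison; the Model.create() steps after it perform the same two-table synthesis by hand: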

accounts = Table(accounts_df, primary_key="account_id")
synthetic = dataxid.synthesize_tables({
    "accounts": accounts,
    "transactions": Table(transactions_df,
                          foreign_keys={"account_id": accounts}),
})

# Step 1: Train and generate accounts (flat)
acct_model = dataxid.Model.create(data=accounts_df.drop(columns=["account_id"]))
syn_accounts = acct_model.generate(n_samples=len(accounts_df))
syn_accounts.insert(0, "account_id", range(1, len(syn_accounts) + 1))
acct_model.delete()

# Step 2: Train transactions with accounts as parent (sequential)
tx_model = dataxid.Model.create(
    data=transactions_df,
    parent=accounts_df,
    foreign_key="account_id",
)
syn_transactions = tx_model.generate(parent=syn_accounts)
tx_model.delete()

Model.create() Parameters for Sequential

| Parameter | Type | Description |
| --- | --- | --- |
| `parent` | `DataFrame` | Parent table for context-aware generation |
| `foreign_key` | `str` | FK column in `data` linking rows to parent. Setting it enables sequential mode |
| `parent_key` | `str \| None` | PK column in parent (inferred from `foreign_key` if column names match) |
| `parent_encoding_types` | `dict \| None` | Encoding overrides for parent columns |

Validation Rules

The SDK validates your schema before training starts. All errors are raised as dataxid.InvalidRequestError with a descriptive message and the offending parameter.

| Rule | Error |
| --- | --- |
| `foreign_keys` value is not a `Table` | `foreign_keys values must be Table instances` |
| FK column not in DataFrame | `Column 'X' not found in DataFrame columns` |
| Parent `Table` has no `primary_key` | `Referenced table must have a primary_key defined` |
| Circular dependency | `Circular dependency detected` |
| Multiple FKs + no `sequence_by` | `Table has N foreign keys. Use sequence_by to specify which relationship to use` |
| `sequential=False` + `sequence_by` | `sequence_by and sequential=False are mutually exclusive` |
| FK target not in tables dict | `references a Table object that is not in the tables dict` |
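
To make the per-table rules concrete, the sketch below re-implements four of them on a plain dataclass. It is an illustration under stated assumptions, not SDK code: `TableSpec` and `validate` are hypothetical names, the messages are shortened, and violations are collected in a list rather than raised as dataxid.InvalidRequestError.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableSpec:
    """Hypothetical stand-in for Table, holding only schema settings."""
    columns: list[str]
    primary_key: Optional[str] = None
    foreign_keys: dict[str, "TableSpec"] = field(default_factory=dict)
    sequential: bool = True
    sequence_by: Optional[str] = None


def validate(spec: TableSpec) -> list[str]:
    """Collect rule violations for one table (messages are shortened)."""
    errors = []
    for fk_col, parent in spec.foreign_keys.items():
        if fk_col not in spec.columns:
            errors.append(f"Column '{fk_col}' not found in DataFrame columns")
        if parent.primary_key is None:
            errors.append("Referenced table must have a primary_key defined")
    if spec.sequential and len(spec.foreign_keys) > 1 and spec.sequence_by is None:
        errors.append(f"Table has {len(spec.foreign_keys)} foreign keys. Use sequence_by")
    if not spec.sequential and spec.sequence_by is not None:
        errors.append("sequence_by and sequential=False are mutually exclusive")
    return errors


customers = TableSpec(columns=["customer_id"], primary_key="customer_id")
products = TableSpec(columns=["product_id"], primary_key="product_id")
orders = TableSpec(
    columns=["customer_id", "product_id", "amount"],
    foreign_keys={"customer_id": customers, "product_id": products},
)

# Two foreign keys with sequential=True and no sequence_by violates one rule.
assert validate(orders) == ["Table has 2 foreign keys. Use sequence_by"]
```

Setting `sequence_by="customer_id"` on `orders` clears the violation in this sketch.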

How It Works

1. Topological sort: Tables are ordered so parents are always processed before children
2. PK exclusion: Primary key columns are dropped before training (the model doesn't learn ID patterns)
3. Sequential training: Child tables are trained with parent data as context, learning per-entity distributions (transaction counts, value ranges, temporal patterns)
4. Generation: Parents first, then children conditioned on synthetic parents
5. PK auto-assignment: 1-based auto-increment integers assigned after generation
6. FK remapping: Foreign keys in child tables are mapped to valid synthetic parent PKs
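
The PK exclusion and PK auto-assignment steps can be pictured on plain rows as follows. This is a minimal sketch with hypothetical helper names; the "generated" rows here are simply the training rows, standing in for real model output.

```python
def drop_primary_key(rows: list[dict], primary_key: str) -> list[dict]:
    """Remove the PK column so it is not part of the training data."""
    return [{k: v for k, v in row.items() if k != primary_key} for row in rows]


def assign_primary_key(rows: list[dict], primary_key: str) -> list[dict]:
    """Add a 1-based auto-increment PK as the first column of each row."""
    return [{primary_key: i, **row} for i, row in enumerate(rows, start=1)]


training_rows = drop_primary_key(
    [{"account_id": 907, "balance": 120.0}, {"account_id": 12, "balance": 35.5}],
    primary_key="account_id",
)

# Stand-in for model output: reuse the training rows as the "generated" rows.
synthetic_rows = assign_primary_key(training_rows, primary_key="account_id")

# The original IDs (907, 12) are gone; new keys start at 1.
assert [row["account_id"] for row in synthetic_rows] == [1, 2]
```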
