Data Modeling Best Practices

📘

Who is this guide for?

This guide is written for the data engineering or analytics engineering team that owns your cloud data warehouse and is preparing data to share with Simon AI. If you are a marketer or business user looking to set up Simon, start with the Connected Deployment onboarding guide.

Simon AI's composable application connects directly to your cloud data warehouse. There is no nightly ETL pipeline copying your data into Simon's systems — Simon reads your tables and views directly at query time, exactly as they exist in your warehouse.

That's a feature, not a limitation: it means your data is always current, you retain full ownership and governance of your data, and there's no waiting for a refresh to propagate changes. But it also means that Simon inherits the quality and performance characteristics of the objects you share. A well-modeled warehouse means a fast, reliable Simon account. An unoptimized one means slow segment queries and a frustrating user experience for your marketers.

The good news: if your team already follows standard data warehousing best practices, you're most of the way there. This guide explains what Simon expects, why it expects it, and how to organize and share your data cleanly and securely.


Start with Use Cases, Not Tables

Before you start pulling table names together, align with your marketing team on the use cases they want to power in Simon. For each use case, ask:

  • What data do you need to define a segment? (e.g. "customers who purchased a product on sale in the last 7 days")
  • What data do you need to personalize messages? (e.g. "the customer's first name and most recently viewed product category")
  • What data do you need to measure success? (e.g. "revenue attributed to each campaign")

Map those answers to your warehouse objects. Don't share everything you have — share what powers the use cases on your roadmap. Keeping your Simon account focused makes it faster to query, easier for marketers to navigate, and simpler to maintain.


The Three Data Model Types

Simon's composable application uses three types of customer data objects, each with a specific structure and purpose. These map directly to standard dimensional modeling concepts — if you've built a data warehouse before, you've seen all of these.

Identity Table 🪪

The identity table is the backbone of Simon. Every segment evaluation, every campaign send, and every AI agent query starts by joining to this table. It must be shared before you can do anything else in Simon.

What it is: A single row per customer profile, containing the identifiers Simon uses to recognize and reach each customer. Think of it as your dim_customers table collapsed to its current state.

Requirements:

  • One row per profile. No duplicates. Simon does not deduplicate this table for you.
  • A stable primary key (user_id). This must be a string, unique per profile, and never reused. Whatever you choose, it will be the join key for every other table you share.
  • Channel identifiers — the fields Simon uses to activate customers across your marketing channels (e.g. email, phone, braze_id, sfmc_id). These must be columns directly on the identity table.

On identifiers with multiple values:

Simon supports identifier fields that contain arrays of values (e.g. a customer who has multiple email addresses). However, be aware of the downstream implications: Simon will match a profile for segmentation if any value in the array satisfies the criteria, but will select a value at random for content personalization. For most accounts, a single canonical value per identifier type is the simpler and safer choice. If you do use multi-valued identifiers, ensure your deduplication and survivorship logic is solid upstream.

🚧

Materialize this table

The identity table is joined on virtually every query Simon executes. Do not expose a view with complex joins or subqueries here. This table should be a fully materialized object — a physical table or an incrementally refreshed materialized view — so that lookups are fast and consistent.
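As a sketch of what a materialized identity table can look like — here in Snowflake Dynamic Table syntax, with table and column names (`analytics.dim_customers`, `transform_wh`, etc.) that are illustrative assumptions, not Simon requirements:

```sql
-- Illustrative sketch: materialize the identity table as a Snowflake
-- Dynamic Table so it refreshes incrementally from the source model.
-- Object names are assumptions for this example.
CREATE OR REPLACE DYNAMIC TABLE analytics.simon_identity
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
SELECT
    customer_id::varchar AS user_id,  -- stable, never-reused string key
    email,                            -- channel identifiers as direct columns
    phone,
    braze_id
FROM analytics.dim_customers
WHERE customer_id IS NOT NULL;
```

On other CDWs, a scheduled `CREATE TABLE AS` job or the platform's materialized view feature achieves the same goal: a physical, one-row-per-profile object with no join or subquery cost at read time.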


Contact Properties 🙋

Property tables contain attributes with a one-to-one relationship to each customer profile: demographics, preferences, aggregated behavioral metrics, loyalty status, and so on. They extend the identity table without changing its grain.

What they are: 1:1 extension tables, each covering a logical grouping of profile attributes. The warehouse analog is a customer-grained dimension extension or an aggregated fact table (e.g. fct_customer_engagement).

Requirements:

  • One row per stable identifier. Simon does not aggregate or deduplicate property tables. If your source produces multiple rows per customer, fix that upstream before sharing.
  • A join key that matches the stable identifier in your identity table.

Best practices:

  • Organize by semantic domain, not by source system. A table named crm_data that mixes demographics, preferences, and engagement metrics is harder to maintain and harder for marketers to navigate in the segment builder. Prefer contact_demographics, contact_preferences, and contact_engagement_summary as separate tables.
  • Pre-compute your aggregates. Fields like "last order date," "total lifetime spend," and "email open rate over last 90 days" belong here, computed and landed as columns. Do not leave those aggregations to be computed at query time inside Simon.
  • Materialize these tables. If your source data updates continuously, use your CDW's incremental materialization feature — Snowflake Dynamic Tables, BigQuery Materialized Views, Redshift Materialized Views, or equivalent — so that the materialized object stays current without requiring a scheduled job.

Contact Events 🛍️

Event tables contain behavioral data with a one-to-many relationship to each customer: orders, page views, email interactions, app sessions, and so on. Simon uses these for behavioral segmentation — rolling up raw events into segment criteria like "has placed at least two orders in the last 30 days."

What they are: 1:many fact tables, each representing a single event type. The warehouse analog is a standard fct_orders or fct_events table.

Requirements:

  • One event type per table. Do not combine multiple event types into a single table filtered by an event_type column. Separate email_sends, email_opens, and email_clicks into three tables — Simon's segment builder works at the table level, so coarser schemas create friction for marketers and produce slower queries.
  • A join key that matches the stable identifier in your identity table.
  • A timestamp column indicating when the event occurred. This is required for time-windowed behavioral segmentation.
  • A unique event key (e.g. order_id, event_id) is strongly recommended to support deduplication and investigation.

Best practices:

  • Share events "raw" by type. Unlike property tables, you do not pre-aggregate events before sharing them — Simon's behavioral segmentation is the rollup layer. What "raw" means here is: one row per event occurrence, clean, deduplicated, and with a timestamp.
  • Cluster or sort on stable identifier and timestamp. Simon's event queries almost always filter by customer and time window. If your event tables are large, apply your CDW's physical optimization — clustering keys in Snowflake, clustered tables in BigQuery, sort keys in Redshift, Z-ordering in Databricks. The goal in all cases is to align the physical layout with (stable_id, event_timestamp) so that time-windowed lookups are fast.
  • If your source is a single large event spine, split it before mapping. If your warehouse has a single massive events table or a fct_transactions table combining many transaction types, build separate filtered and clustered objects for each event type you want to share with Simon — via ETL or an incremental materialization — rather than mapping the combined object directly.
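Splitting a spine into per-type clustered objects can be as simple as one filtered CTAS per event type. A sketch in Snowflake syntax, with assumed object names:

```sql
-- Illustrative sketch: split a combined event spine into one clustered
-- object per event type. Table and column names are assumptions.
CREATE OR REPLACE TABLE analytics.email_opens
  CLUSTER BY (user_id, event_timestamp)   -- align layout with Simon's filters
AS
SELECT
    user_id,
    event_id,         -- unique event key for deduplication
    event_timestamp   -- required for time-windowed segmentation
FROM analytics.fct_events
WHERE event_type = 'email_open';
-- Repeat per event type (email_sends, email_clicks, ...), or use an
-- incremental materialization so each object stays current automatically.
```

The key point is that the `event_type` filter is paid once, at build time, rather than on every Simon query.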

Organizing Your Schema for Simon

How you organize tables within your warehouse matters for Simon, both for performance and for the marketer experience in the segment builder.

Each property table should represent one cohesive semantic domain. Fields from the same table are grouped together in Simon's segment builder UI. A table called contact_demographics containing age, gender, location, and language preference is intuitive to navigate. A table called contact_everything with 80 columns spanning unrelated domains is not.

Separate event types into their own tables. As noted above, this is both a performance recommendation and a usability one. A marketer building a segment around "customers who opened an email" should not have to add a filter for event_type = 'email_open' — they should simply pick the email_opens table.

Keep the total number of mapped tables manageable. More tables aren't always better. A focused, well-organized data model is easier to document, easier to maintain, and easier for marketers to use. Start with the tables that directly power your initial use cases and expand from there.


Materialize Before You Map

🚧

This is the single most important recommendation in this guide.

Do not map unoptimized warehouse objects — complex views, multi-table joins, unclustered large tables — directly into Simon. Simon queries your data every time a segment is evaluated, every time a journey runs, and every time an AI agent analyzes your audience. If the objects you've shared require expensive computation at read time, that cost is paid on every one of those queries.

The fix is straightforward: build materialized objects on top of your raw data, and map those into Simon. Incremental materialization features — Snowflake Dynamic Tables, BigQuery Materialized Views, Redshift Materialized Views — are particularly well-suited for this: they refresh incrementally, stay current automatically, and can be physically optimized for the query patterns Simon generates.

This is standard data warehousing best practice. If your team hasn't already applied it to the objects you're planning to share, now is the right time. Your data team will be familiar with the concept; point them to your CDW's documentation on materialization and physical clustering if they need a starting point.

Simon may in some cases add a materialization layer internally to address performance issues with inbound data. However, this is a fallback, not a substitute for proper upstream modeling — you won't have full visibility into how it's configured, it won't be tuned to your organization's broader query patterns, and it won't be under your team's control to adjust as your data model evolves. Materializing upstream, as described here, ensures your data team owns the optimization end-to-end.

To be specific about where this matters most:

  • Identity table: Materialize it. This is queried on virtually every operation Simon performs.
  • Property tables with complex aggregations: If computing "email open rate over last 90 days" requires a multi-table join and a window function, compute it once and land it in a materialized property table. Do not leave that computation to run at query time inside Simon.
  • Large event tables from a combined spine: If you're splitting a large source table into per-event-type objects for Simon, build those as materialized, clustered objects — not as views with a WHERE event_type = '...' filter on top of a 500M-row table.

Direct Mapping vs. Simon Datasets

When connecting your data to Simon via the Schema Builder, you have two options:

Direct mapping connects a table or view from your warehouse directly to Simon, as-is. This is the default and preferred approach for objects that are already clean, well-structured, and appropriately named for marketing use.

Simon Datasets let your marketing or marketing operations team write a SQL query inside Simon's dataset editor to define a view over your connected warehouse objects, which is then made available in the Schema Builder. This is intended for lightweight, marketing-specific last-mile work — but it's important to understand what "last-mile" means in this context.

When Simon Datasets are the right choice

Simon Datasets are appropriate for simple, stateless, row-level expressions that adapt your data for marketing use — things like:

  • Display formatting: combining first_name and last_name into a full_name field, or formatting a date into a human-readable string for message copy
  • Light renaming: aliasing a field from an internal naming convention (cust_acq_src_cd) to something a marketer can read (acquisition_source)
  • Simple conditional labels: a CASE expression that translates an internal status code into a human-readable value (e.g. CASE WHEN tier_code = 'G' THEN 'Gold' WHEN tier_code = 'P' THEN 'Platinum' END)
  • Column subsetting: projecting a subset of columns from a shared table when only some of them are relevant to a particular marketing use case

These are lightweight, self-contained transformations that a marketing ops team can reasonably own and maintain directly within Simon, without needing to coordinate with the data team on every change.
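Put together, an appropriate Simon Dataset query looks something like this — every expression is stateless and row-level, with no joins, grouping, or window functions. Column names here are illustrative assumptions:

```sql
-- Illustrative sketch of a last-mile Simon Dataset: stateless, row-level
-- expressions only. Column and table names are assumptions.
SELECT
    user_id,
    first_name || ' ' || last_name AS full_name,        -- display formatting
    cust_acq_src_cd                AS acquisition_source, -- light renaming
    CASE tier_code                                       -- simple labeling
        WHEN 'G' THEN 'Gold'
        WHEN 'P' THEN 'Platinum'
        ELSE 'Standard'
    END                            AS loyalty_tier
FROM contact_demographics;
```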

When to use direct mapping instead

If the transformation involves any of the following, it does not belong in Simon Datasets — it belongs upstream in your warehouse, owned by your data team:

  • Aggregations or window functions: computing last order date, lifetime value, rolling averages, or any metric that requires grouping or ranking across rows
  • Multi-table joins: combining data from multiple source tables to produce a derived attribute or event
  • Business logic: any calculation that encodes a definition your organization needs to use consistently across systems — how "active customer" is defined, how LTV is calculated, what counts as a completed order
  • Deduplication or survivorship: resolving duplicate rows or selecting a canonical value across multiple records
🚧

Keep business logic upstream

Centralizing transformations in your warehouse — rather than fragmenting them into Simon Datasets — means your data team owns them, they can be tested and versioned, and they stay consistent with how the rest of the business uses that data. A Simon Dataset that reimplements your LTV definition is a liability: it may drift from the definition used elsewhere, and when the definition changes, it won't be updated in Simon automatically.

A useful rule of thumb: if the transformation is something your data team would want to review, test, or reuse outside of Simon, it belongs in your warehouse. If it's a cosmetic adaptation that only your marketing team cares about, Simon Datasets may be the right fit.


Last-Mile Permissioning: The Shared Schema Pattern

The cleanest way to grant Simon access to your data while maintaining governance over your warehouse is a simple three-step pattern. The specific mechanism varies by CDW, but the principle is the same across all of them.

1. Create a dedicated schema for Simon.

Create a standalone schema in your data warehouse (e.g. SIMON_PROD) that is separate from your core warehouse schemas. This schema will contain only the objects Simon accesses — nothing else.

2. Create views in that schema pointing to your materialized tables.

Rather than granting Simon direct access to your warehouse tables, create views in the dedicated schema that SELECT from your materialized tables. This last-mile layer gives you:

  • Column-level governance: expose only the columns Simon needs; exclude anything Simon shouldn't see.
  • Naming control: rename columns to match Simon's conventions (e.g. user_id, channel identifier column names) without changing your source tables.
  • Auditability: everything Simon can access is enumerable by listing views in one schema.

On Snowflake, Secure Views provide an additional layer of protection by preventing view consumers from inferring the underlying query structure. Other CDWs offer equivalent access control mechanisms through role-based permissions on the schema.

3. Share the schema with Simon.

Grant Simon's database user or role SELECT on the dedicated schema. Simon only needs read access. Future tables can be added to Simon's view of your warehouse simply by adding a new view to the schema — no need to renegotiate access.

This pattern is low-overhead to set up and gives you clean, auditable control over what Simon can and cannot access. It also makes it easy to evolve your underlying data model without breaking Simon's access: update the source table, update the view, and Simon automatically sees the new structure.
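The three steps can be sketched in Snowflake syntax as follows; on other CDWs the equivalent is a dedicated schema or dataset with read-only grants. Schema, view, and role names are assumptions for this example:

```sql
-- Illustrative sketch of the shared schema pattern; object and role
-- names are assumptions for this example.
CREATE SCHEMA IF NOT EXISTS SIMON_PROD;                      -- step 1

CREATE OR REPLACE SECURE VIEW SIMON_PROD.identity AS         -- step 2
SELECT
    customer_id::varchar AS user_id,  -- rename to match Simon's conventions
    email,
    phone
FROM analytics.simon_identity;        -- expose only the columns Simon needs

GRANT USAGE ON SCHEMA SIMON_PROD TO ROLE simon_role;         -- step 3
GRANT SELECT ON ALL VIEWS IN SCHEMA SIMON_PROD TO ROLE simon_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA SIMON_PROD TO ROLE simon_role;
```

The `FUTURE VIEWS` grant is what makes the pattern low-maintenance: new views added to the schema become visible to Simon without any additional grants.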


Validation Checklist

Before mapping any object into Simon via the Schema Builder, verify the following:

Identity table:

  • One row per stable identifier — no duplicates
  • user_id (or email) is a non-null string on every row
  • All channel identifiers you need for active integrations are present as columns
  • The table is materialized (not a complex view)

Property tables:

  • One row per stable identifier — verify with SELECT COUNT(*), COUNT(DISTINCT <join_key>) FROM <table> HAVING COUNT(*) > COUNT(DISTINCT <join_key>) (should return zero rows)
  • Join key values match the stable identifier in the identity table
  • Aggregated metrics are pre-computed as columns, not derived at read time
  • The table is materialized

Event tables:

  • One event type per table
  • Join key values match the stable identifier in the identity table
  • A non-null timestamp column is present
  • A unique event key is present
  • The table is clustered or sorted on (stable_id, event_timestamp) or equivalent
  • The table is materialized (not a view over an unclustered source)
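The event-table checks above can be verified with queries like the following, each of which should return zero rows. Table and column names are illustrative assumptions:

```sql
-- Illustrative validation queries for one event table; each should
-- return zero rows. Object names are assumptions for this example.

-- Null timestamps (breaks time-windowed segmentation)
SELECT COUNT(*) AS null_timestamps
FROM analytics.email_opens
WHERE event_timestamp IS NULL
HAVING COUNT(*) > 0;

-- Duplicate event keys (breaks deduplication and investigation)
SELECT event_id, COUNT(*) AS occurrences
FROM analytics.email_opens
GROUP BY event_id
HAVING COUNT(*) > 1;
```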

Schema access:

  • Materialized objects live in your internal warehouse schemas
  • Views in the dedicated Simon schema point to those objects
  • Only the columns Simon needs are exposed in those views
  • Simon's database user or role has SELECT on the dedicated schema

Ready to connect your data?

Once your objects are prepared and your shared schema is in place, head to the Schema Builder to map your identity table, property tables, and event tables into Simon.