LLM-ready Taxonomy for Ecommerce Analytics: A Field Guide

11 de setembro de 2025 por
LLM-ready Taxonomy for Ecommerce Analytics: A Field Guide
WarpDriven
Illustrated
Image Source: statics.mylandingpages.co

Part I — Why “LLM-ready” taxonomies matter

If you’ve ever asked an LLM to “show weekly conversion rate by category and campaign” and received five different SQLs for the same metric, you’ve felt the pain of ambiguous data. The fix isn’t a bigger model—it’s better structure. An LLM-ready taxonomy makes product catalogs, user behavior, and metrics unambiguous so natural-language analytics becomes trustworthy.

In practice, that means:

  • Your catalog has stable IDs and consistent hierarchies that map to external standards (so meaning travels across channels and tools).
  • Your events carry the right item-level attributes every time (so metrics don’t drift when contexts change).
  • Your semantic layer defines metrics once, in code, and everything—dashboards, notebooks, and LLMs—queries through it.
  • Your governance enforces contracts, versioning, and validation before data reaches humans or models.

This guide stitches together proven standards—GS1 GPC, schema.org Product, Google Product Taxonomy—with analytics event taxonomies (GA4, Snowplow), semantic layers (dbt, Cube), and RAG/knowledge-graph patterns to give LLMs a safe, reliable interface. Where relevant, we anchor guidance to canonical documentation such as the GS1 overview of Global Product Classification and browser tools for hierarchies and attributes, the schema.org Product definitions and releases, Google Merchant Center’s product category rules and downloadable taxonomy, GA4’s ecommerce events and item-scoped parameters, Snowplow’s ecommerce tracking and Iglu schemas, the dbt Semantic Layer/MetricFlow references, and Cube’s semantic modeling and AI API materials.

Part II — Catalog foundations: standards and crosswalks

An LLM can’t guess what “Handbags” means in your warehouse if you call it “Bags—Women’s” in ads and “HB” in your BI tool. You need a stable spine with crosswalks.

1) GS1 GPC as the neutral backbone

GS1’s Global Product Classification (GPC) is a four-level hierarchy—Segment > Family > Class > Brick—where each Brick has a code and attributes that describe product characteristics. The official GS1 overview explains the purpose and structure, and you can explore categories via the GS1 Navigator or the GPC Browser. Aligning at least to Brick-level where relevant makes your taxonomy interoperable with trading partners and GDSN.

Practical advice:

  • Don’t replace your internal categories with GPC; map them. Keep stable internal category IDs and maintain a crosswalk table to GPC Brick codes.
  • Use GPC Brick Attributes to inform canonical product attributes (e.g., material, size system) where applicable.

2) schema.org Product for web and knowledge consistency

Use schema.org to standardize identifiers and commercial attributes on your website. The Product type includes identifiers (sku, gtin8/12/13/14), brand, category, and Offer details (price, currency, availability). The schema.org releases page documents ongoing updates; always check the latest version when modeling variants and relationships.

When you expose a consistent JSON-LD block on PDPs, you create a machine-readable truth that both search engines and your own LLM pipelines can ingest.

Example (JSON-LD) with identifiers, variant linkage, and offers:

{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Executive Anvil",
  "image": [
    "https://example.com/photos/1x1/photo.jpg",
    "https://example.com/photos/4x3/photo.jpg",
    "https://example.com/photos/16x9/photo.jpg"
  ],
  "description": "Sleek executive anvil for all your business needs.",
  "sku": "0446310786",
  "gtin13": "0123456789012",
  "brand": {"@type": "Brand", "name": "Acme"},
  "category": "Tools > Anvils",
  "color": "Black",
  "size": "Medium",
  "offers": {
    "@type": "Offer",
    "url": "https://example.com/executive-anvil",
    "priceCurrency": "USD",
    "price": "119.99",
    "availability": "https://schema.org/InStock",
    "itemCondition": "https://schema.org/NewCondition"
  },
  "isVariantOf": {"@type": "Product", "name": "Anvil"}
}

Tips that LLMs love:

  • Always include one of GTIN fields and a stable SKU; include brand and canonical URL.
  • Use isVariantOf/hasVariant to link variants to a parent model (e.g., color/size variants).

3) Google Product Taxonomy (GPT) for feeds and ads

Google expects one most-specific google_product_category per product in Merchant Center feeds, while product_type carries your own hierarchy for campaign organization. The official Help docs explain usage, and you can download the taxonomy TXT with IDs to ensure accurate mapping.

Notes:

  • Do not confuse schema.org Product category with google_product_category—they’re separate systems. Follow Merchant Center updates via the announcements hub when changes may affect feeds.

4) Optional: IAB Content Taxonomy for media context

If your use case spans content or contextual advertising (e.g., marketplace editorial, UGC, retail media), the IAB Tech Lab Content Taxonomy standardizes content topics and brand safety flags. Use IDs and version markers consistently (e.g., OpenRTB cattax fields).

5) The crosswalk model (catalog → standards → analytics)

Create a crosswalk layer that maps your internal category IDs to:

  • GS1 GPC Brick code (where relevant)
  • Google Product Taxonomy category ID
  • Optional IAB content category ID (for media/editorial)

Keep this mapping table versioned. Your semantic layer will later join it to events for consistent rollups.

A minimal crosswalk table schema:

CREATE TABLE taxonomy_crosswalk (
  internal_category_id STRING PRIMARY KEY,
  internal_category_path STRING, -- e.g., Apparel > Women > Handbags
  gpc_brick_code STRING,         -- e.g., 10005678
  google_category_id INT64,      -- from taxonomy-with-ids.txt
  iab_content_cat_id STRING,     -- optional
  valid_from DATE,
  valid_to DATE
);

Part III — Event taxonomy for behavior: GA4 vs. Snowplow

Good catalogs without good events lead to “unknown variant,” “null price,” or “which list was this from?” dead ends. Your event taxonomy must carry item-level attributes and promotional/list context.

GA4 recommended ecommerce events

Google documents a standard set of ecommerce events and item-scoped parameters. At a minimum, implement view_item_list, select_item, add_to_cart, begin_checkout, add_payment_info, add_shipping_info, purchase, refund, and promotion events, with the items[] array fully populated. See Google’s GA4 ecommerce setup and item-scoped parameter docs for canonical definitions and validation steps.

Example purchase payload (gtag.js style), annotated:

gtag('event', 'purchase', {
  transaction_id: 'T_12345',
  affiliation: 'Online Store',
  value: 72.05,
  currency: 'USD',
  tax: 4.90,
  shipping: 5.99,
  coupon: 'SUMMER_SALE',
  items: [
    {
      item_id: 'SKU_12345',     // stable SKU or GTIN
      item_name: 'Canvas Tote',
      item_brand: 'ACME',
      item_category: 'Accessories',
      item_category2: 'Bags',
      item_category3: 'Totes',
      item_variant: 'Blue',
      price: 15.25,
      quantity: 1,
      item_list_id: 'home_rec_1',
      item_list_name: 'Homepage Recommendations'
    },
    {
      item_id: 'SKU_12346',
      item_name: 'Leather Satchel',
      item_brand: 'ACME',
      item_category: 'Accessories',
      item_category2: 'Bags',
      item_category3: 'Satchels',
      item_variant: 'Brown',
      price: 20.00,
      quantity: 2
    }
  ]
});

Implementation tips:

  • Use item_category through item_category5 to carry the full internal category path. Keep IDs in parallel custom parameters if you want stable joins.
  • Always send price and quantity at item-level; otherwise AOV, GMV, and margin calculations drift.
  • Be explicit with promotion/list context using item_list_id/item_list_name and promotion parameters.

Privacy note: GA4 prohibits sending PII such as email or phone. Google’s policy and data redaction features are described in the 2024 GA4 documentation; use pseudonymous user_id and configure retention/consent appropriately.

Snowplow ecommerce tracking

If you need more flexible schemas and warehouse-ready data with rich contexts, Snowplow’s self-describing events and context entities are a strong option. The web tracker’s ecommerce plugin represents actions (e.g., add_to_cart, purchase) with attached product/cart contexts, all validated against Iglu schemas.

Example custom self-describing event with contexts:

window.snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/viewed_product/jsonschema/2-0-0',
    data: { productId: 'ASO01043', category: 'Dresses', brand: 'ACME', price: 49.95 }
  },
  context: [
    {
      schema: 'iglu:com.acme/cart/jsonschema/1-0-0',
      data: { cartId: 'cart123', totalValue: 150.00 }
    }
  ]
});

Modeling tip: Use Snowplow’s ecommerce dbt package to transform raw events into standardized tables for analysis.

Part IV — The semantic layer: define metrics once

LLMs should never write SQL directly against raw fact tables. A semantic layer defines metrics and dimensions once, enforces joins/filters, and exposes a safe API for humans and models.

Two common approaches: the dbt Semantic Layer (MetricFlow) and the Cube universal semantic layer.

dbt Semantic Layer (MetricFlow)

dbt’s Semantic Layer lets you declare semantic models (entities, dimensions, measures) and derive metrics (ratios, cumulative windows). The official documentation covers architecture, semantic models, and metric types, including limitations and API access.

Example: defining base measures and common ecommerce metrics (YAML):

semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: user_id
        type: foreign
    defaults:
      agg_time_dimension: order_date
    measures:
      - name: revenue
        agg: sum
        expr: total_amount
      - name: orders
        agg: count_distinct
        expr: order_id
    dimensions:
      - name: channel
        type: categorical
      - name: device
        type: categorical
      - name: category_id
        type: categorical

metrics:
  - name: gmv
    type: simple
    type_params:
      measure: revenue
  - name: aov
    type: ratio
    type_params:
      numerator: revenue
      denominator: orders
  - name: conversion_rate
    type: ratio
    type_params:
      numerator: orders
      denominator: sessions
  - name: refund_rate
    type: ratio
    type_params:
      numerator: refunded_orders
      denominator: orders
  - name: wau_rolling_7
    type: cumulative
    type_params:
      measure:
        name: active_users
        fill_nulls_with: 0
        join_to_timespine: true
      cumulative_type_params:
        window: 7 days

Notes for LLM safety:

  • Require a primary time dimension for measures; define default grains.
  • Model dimensional conformance (e.g., category crosswalk joins) inside the semantic layer.
  • Include human-readable descriptions in YAML; index these into your RAG store.

Cube Universal Semantic Layer

Cube models data as cubes with measures, dimensions, and joins, and compiles to optimized SQL across warehouses. It exposes SQL, REST, and GraphQL APIs, plus an AI API for natural-language queries.

Example Cube model (JavaScript-style config):

cube('Orders', {
  sql: `SELECT * FROM analytics.fct_orders`,

  measures: {
    revenue: { sql: 'total_amount', type: 'sum' },
    orders: { sql: 'order_id', type: 'countDistinct' },
    aov: { sql: 'total_amount', type: 'number', drillMembers: ['order_id'],
      format: 'currency', title: 'Average Order Value',
      // Computed in query layer: revenue / orders
    }
  },

  dimensions: {
    order_id: { sql: 'order_id', type: 'string', primaryKey: true },
    order_date: { sql: 'order_date', type: 'time' },
    channel: { sql: 'channel', type: 'string' },
    device: { sql: 'device', type: 'string' },
    category_id: { sql: 'category_id', type: 'string' }
  },

  joins: {
    Categories: {
      relationship: 'belongsTo',
      sql: `${CUBE}.category_id = ${Categories}.internal_category_id`
    }
  }
});

Tools/Stack (neutral)

Disclosure: WarpDriven is our product.

  • ERP/Commerce Ops (upstream of analytics): WarpDriven — AI-first ERP/operations platform that unifies catalog, orders, inventory, logistics; can feed governed data into your semantic layer. Peers: Oracle NetSuite, SAP S/4HANA Public Cloud, Shopify.
  • Semantic layer: dbt Semantic Layer; Cube.
  • Event tracking: GA4; Snowplow.

(We keep stack mentions neutral and focus on conceptual fit; choose tools based on data volume, governance needs, and team skills.)

Part V — Governance: data contracts, CI tests, lineage, versioning

A taxonomy that changes silently will break LLMs. Governance keeps structure stable and transparent.

Data contracts

A data contract is a declarative spec that defines what a dataset or event guarantees: schema, constraints, SLAs, and ownership. ThoughtWorks has advocated templates for platform DX and contract tests for pipelines in recent years.

Example data contract (YAML) for an orders dataset:

version: 1
owner: analytics-platform@acme.com
name: analytics.fct_orders
sla:
  freshness_minutes: 60
  availability_percent: 99.5
schema:
  primary_key: order_id
  columns:
    - name: order_id
      type: STRING
      constraints: [not_null, unique]
    - name: user_id
      type: STRING
      constraints: [not_null]
      pii: false
    - name: order_date
      type: TIMESTAMP
      constraints: [not_null]
    - name: total_amount
      type: NUMERIC
      constraints: [not_null]
    - name: currency
      type: STRING
      constraints: [not_null]
compatibility:
  on_breaking_change: create_new_version
  deprecation_period_days: 90
lineage:
  upstream: [raw.orders_stream, dim_users]
  downstream: [semantic.orders_model]
validation:
  checks:
    - type: schema_match
    - type: freshness
    - type: completeness
    - type: range
      column: total_amount
      min: 0
privacy:
  prohibited_fields: [email, phone]
  pseudonymous_ids: [user_id]

Validation in CI/CD

Automate checks in your pipeline so violations block deploys:

  • dbt tests for not_null, unique, accepted_values on models.
  • Great Expectations or Soda for distribution checks, freshness, and custom constraints.
  • OpenLineage for lineage capture and impact analysis.

References:

Versioning and rollback

  • Version your taxonomy (catalog trees, crosswalk tables) and contracts; store history with valid_from/valid_to.
  • Adopt a compatibility policy (e.g., only additive changes are non-breaking; removals require new versions and backfills).
  • Maintain rollback scripts for the semantic layer to revert metric definitions if a change causes drift.

Part VI — LLM integration: RAG, knowledge graphs, and safe prompting

Even with a semantic layer, LLMs need context to choose the right metrics and dimensions and to respect business rules. Retrieval-Augmented Generation (RAG) and, for multi-hop logic, GraphRAG help LLMs ground answers.

Retrieval-Augmented Generation (RAG)

RAG retrieves relevant documents—metric definitions, taxonomy glossaries, data contracts—and feeds them into the model. IBM’s 2024 overview explains how grounding improves factuality and governance in enterprise setups, including deployment considerations like access control.

Practical RAG setup for analytics:

  • Index: semantic layer YAML (with descriptions), your metric dictionary, taxonomy crosswalk documentation, and event schemas.
  • Chunking: keep semantic model + metrics in small, coherent chunks (e.g., per metric, per dimension) to avoid context overload.
  • Metadata: tag chunks by domain (catalog, events, metrics), version, and owner for filtering.

Knowledge graphs and GraphRAG

Knowledge graphs encode entities (Products, Categories, Brands, Promotions, Channels) and relationships (belongs_to, promoted_in, supplied_by). GraphRAG combines graph traversal with vector search for multi-hop, explainable retrieval—useful for questions like “Which promotions drove margin lift for Women’s Handbags in EU marketplaces last month?”

Design notes:

  • Build the graph from deterministic sources (catalog DB, promotion system) to ensure traceability; reserve LLM extraction for weak signals and human-review them.
  • Return citations and subgraph paths alongside answers to increase trust.

Safe prompting and system rules

  • System message: “Only query via the semantic layer. Never write SQL against raw tables.”
  • Tools: Provide the LLM a “metrics browser” tool to discover metrics/dimensions; route natural language to metric queries, not ad hoc SQL.
  • Guardrails: Require the model to return the metric definition names it used (e.g., conversion_rate, aov) for auditing.

Part VII — 80/20 quick-start: the minimum viable, LLM-ready stack

You can ship a working, reliable v1 in two sprints by focusing on a minimal but complete path from catalog to LLM.

A. Catalog MVP

  • Internal category tree: 3–4 levels deep with stable IDs.
  • Crosswalk table to Google Product Taxonomy ID; optional GPC Brick for critical categories.
  • schema.org Product markup on PDPs: sku, gtin, brand, category path, offers (price, currency, availability), isVariantOf.

B. Event MVP (GA4 or Snowplow)

  • GA4: Implement view_item_list, select_item, add_to_cart, begin_checkout, purchase, refund. Populate items[] fully with item_id, item_brand, item_category1–5, item_variant, price, quantity, item_list_id/name.
    • Use GA4’s item-scoped parameter definitions to ensure parity across events.
  • Snowplow: Use the ecommerce plugin and attach product/cart contexts; validate against Iglu schemas.

C. Semantic layer MVP

  • Choose one (dbt SL or Cube) and define 10–15 canonical metrics:
    • GMV (sum revenue), AOV (revenue/orders), conversion_rate (orders/sessions), refund_rate, add_to_cart_rate, checkout_start_rate, repeat_purchase_rate, inventory_turns (if available), promotion_lift.
  • Conform dimensions: channel, device, category (via crosswalk), campaign, promotion, geo, time.

D. Governance MVP

  • Data contract for orders and events tables; CI checks (schema, not_null, freshness, PII guards).
  • Versioned crosswalk table with valid_from/valid_to.

E. LLM MVP

  • RAG index containing: semantic YAML, metric dictionary, taxonomy crosswalk docs, GA4/Snowplow schema notes.
  • System prompt enforcing “use semantic layer only.”
  • Evaluation set: 20–30 canonical questions; target ≤5% correction rate after iteration.

Part VIII — Advanced extensions

  • Snowplow deep contexts: attach promotions, impressions, and checkout step contexts; use Snowplow’s dbt models for richer path analysis.
  • Knowledge graph: Build a graph of Products–Categories–Brands–Promotions–Channels and expose graph queries beside vector search for GraphRAG.
  • Multi-catalog mapping: Maintain crosswalks among GS1 GPC Brick, internal categories, Google Product Taxonomy, and IAB content categories for retail media.
  • Multilingual aliasing: Store canonical category IDs with localized labels and synonyms; let LLMs map user queries in different languages to the same IDs.
  • Subscriptions/returns: Extend your event taxonomy with refund and return reasons; model cohort metrics (retention, churn) in the semantic layer.

Part IX — Anti-patterns and failure modes (and what to do instead)

  1. Taxonomy sprawl: Marketing renames categories without ID stability.
  • Fix: Use stable internal_category_id keys; treat labels as translatable attributes. Enforce changes through versioned crosswalks.
  1. Missing item attributes in events: No price/quantity or variant = wrong GMV and AOV.
  • Fix: Add validation that purchase/add_to_cart events must have price and quantity; block deployments if missing.
  1. Overloading custom parameters: 50 different meanings for one “custom_dimension_1”.
  • Fix: Maintain a governed parameter registry; prefer standard fields (GA4 items[]). Document any custom fields with owners and tests.
  1. Letting LLMs hit raw tables.
  • Fix: Force all queries through the semantic layer; expose only approved metrics/dimensions to the LLM tool.
  1. Evolving taxonomy without versioning/backfills.
  • Fix: Use valid_from/valid_to on crosswalks and backfill historical mappings; maintain deprecation periods.
  1. PII leakage in analytics payloads.
  1. “Category by label” joins.
  • Fix: Join on IDs, not names. Carry both ID and label in your event payloads or enrich in the warehouse.

Part X — Validation checklist and resources

Use this checklist as a preflight before inviting LLMs to your data.

Catalog & crosswalks

  • Internal category tree has stable IDs and 3–4 levels for 80/20 coverage.
  • Crosswalk table joins internal IDs to Google Product Taxonomy ID; optional GPC Brick for priority categories.
  • PDPs expose schema.org Product JSON-LD with sku, gtin, brand, category, offers, isVariantOf.

Events

Semantic layer

Governance

LLM enablement

  • RAG index contains semantic YAML, metric dictionary, taxonomy glossary, and event schemas; metadata includes version and owner.
  • For advanced cases, a knowledge graph stores product-category-promotion relationships; GraphRAG retrieves subgraphs with citations.

Merchant feeds & standards


Appendix — Sample schemas you can copy

  1. GA4 item-level parameter registry (warehouse side)
CREATE TABLE staging.ga4_items (
  event_timestamp TIMESTAMP,
  event_name STRING,
  transaction_id STRING,
  item_id STRING,
  item_name STRING,
  item_brand STRING,
  item_category STRING,
  item_category2 STRING,
  item_category3 STRING,
  item_category4 STRING,
  item_category5 STRING,
  item_variant STRING,
  price NUMERIC,
  quantity INT64,
  item_list_id STRING,
  item_list_name STRING
);
  1. dbt test examples (schema.yml)
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: user_id
        tests: [not_null]
      - name: total_amount
        tests:
          - not_null
          - relationships:
              to: ref('dim_currency')
              field: currency
  1. SodaCL checks (soda.yml)
checks for fct_orders:
  - row_count > 0
  - freshness(order_date) < 2h
  - missing_count(order_id) = 0
  - invalid_percent(total_amount) < 0.1 %
  - schema:
      warn:
        when schema changes: any
  1. Minimal knowledge index (for RAG)
{
  "doc_id": "metric:aov",
  "title": "Average Order Value",
  "body": "AOV = revenue / orders. revenue is sum(total_amount) on fct_orders; orders is count_distinct(order_id).",
  "metadata": {"domain": "metrics", "version": "1.3", "owner": "analytics"}
}
  1. Crosswalk enrichment SQL (semantic layer staging)
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  cx.google_category_id,
  cx.internal_category_id,
  cx.internal_category_path
FROM analytics.fct_orders o
LEFT JOIN analytics.taxonomy_crosswalk cx
  ON o.category_id = cx.internal_category_id
  AND o.order_date BETWEEN cx.valid_from AND COALESCE(cx.valid_to, DATE '9999-12-31');

Closing thoughts

In my experience, teams that treat taxonomy as a product—complete with contracts, versions, and documentation—unlock reliable LLM analytics much faster than those who chase prompts. Give your models a consistent world to reason about, and they’ll return the favor with consistent answers.

LLM-ready Taxonomy for Ecommerce Analytics: A Field Guide
WarpDriven 11 de setembro de 2025
Share this post
Etiquetas
Arquivar