Data Quality Automation for a Global Marketplace
Overview
A global marketplace aggregating products from hundreds of boutiques struggled with duplicate listings and inconsistent product content. We designed and shipped a data-quality pipeline that detects duplicates across feeds and cleans/enriches titles, descriptions, and metadata with generative AI, so new items land with consistent, SEO-friendly content and one canonical product page.
At a glance
- Industry: E-commerce
- Goal: Deliver a clean, consistent product data stream that improves UX and SEO while lowering operational cost
- Tech highlights: Python, OpenCV, GPT-4
Core Business Challenge
- Duplicate SKUs across sources mean confused shoppers, split reviews, and SEO cannibalization.
- Incoherent or incomplete content with missing attributes, which creates a manual rewrite backlog and cost.
- Continuous change: with new drops landing daily, the solution must run at ingestion time, not as quarterly cleanups.
Our Approach
- Map the mess: profiled feeds, aligned taxonomies, and defined “canonical product” rules.
- Design for precision: staged candidate generation → pair scoring → clustering to keep false matches low.
- Automate enrichment: LLM prompts plus rules to standardize titles/descriptions and fill key attributes, with guardrails to avoid fabrication.
- Ship as a service: one pipeline triggered on import; human-review loop for edge cases.
AI Solution
- Candidate generation (fast): text normalization (brand, model, color, size), fuzzy matching, blocking by brand/category, and image pHash to quickly find likely duplicates (first sketch after this list).
- Pair scoring (accurate): a feature set combining text embeddings, image embeddings (CV), price/GTIN/SKU proximity, and attribute overlap; a learned classifier outputs a duplicate probability (second sketch below).
- Clustering & canonicals: connect matching pairs into product clusters; pick a canonical record; merge variants (size/color) under one PDP (product detail page); keep the canonical URL and IDs (third sketch below).
- Content cleanup & enrichment (LLM): generative rewriting standardizes titles and descriptions, fills key attributes from the source feed, and applies guardrails so no facts are fabricated (fourth sketch below).
- MLOps: batch + streaming ingestion; vector store for embeddings; evaluation harness measuring precision/recall on a labeled pair set (final sketch below); review queue for low-confidence cases; versioned prompts and models.
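A minimal sketch of the candidate-generation stage, assuming rapidfuzz for fuzzy title matching and opencv-contrib-python's img_hash module for pHash; the record fields (id, brand, category, title, image_path) and both thresholds are illustrative, not the production schema.

```python
# Candidate generation: block by brand/category, then flag pairs whose titles
# fuzzy-match or whose images share a near-identical perceptual hash.
from collections import defaultdict
from itertools import combinations

import cv2
from rapidfuzz import fuzz

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so 'NIKE ' and 'nike' block together."""
    return " ".join(text.lower().split())

def phash(image_path: str):
    """8-byte perceptual hash via OpenCV's contrib img_hash module."""
    return cv2.img_hash.PHash_create().compute(cv2.imread(image_path))

def candidate_pairs(records, text_threshold=80, hash_threshold=10):
    """Yield likely-duplicate id pairs; cheap filters only, scoring comes later."""
    blocks = defaultdict(list)
    for rec in records:
        rec["_phash"] = phash(rec["image_path"])   # compute once per record
        blocks[(normalize(rec["brand"]), normalize(rec["category"]))].append(rec)
    for block in blocks.values():                  # compare only within a block
        for a, b in combinations(block, 2):
            text_sim = fuzz.token_sort_ratio(normalize(a["title"]), normalize(b["title"]))
            hash_dist = cv2.norm(a["_phash"], b["_phash"], cv2.NORM_HAMMING)
            if text_sim >= text_threshold or hash_dist <= hash_threshold:
                yield a["id"], b["id"]
```

Blocking keeps the comparison count tractable: pairs are only generated within a brand/category bucket instead of across the whole catalog.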
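A sketch of pair scoring under the same assumptions: embeddings are precomputed per record, and a gradient-boosted classifier (scikit-learn's here, an illustrative choice) is trained on the labeled pair set to emit a duplicate probability.

```python
# Pair scoring: turn each candidate pair into a feature vector, then let a
# learned classifier output P(duplicate). Field names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def attribute_overlap(a_attrs: dict, b_attrs: dict) -> float:
    """Fraction of shared attribute keys whose values agree."""
    keys = set(a_attrs) & set(b_attrs)
    return sum(a_attrs[k] == b_attrs[k] for k in keys) / len(keys) if keys else 0.0

def pair_features(a: dict, b: dict) -> np.ndarray:
    return np.array([
        cosine(a["text_emb"], b["text_emb"]),   # title/description embedding similarity
        cosine(a["img_emb"], b["img_emb"]),     # CV image embedding similarity
        1.0 - abs(a["price"] - b["price"]) / max(a["price"], b["price"], 1e-9),
        float(bool(a.get("gtin")) and a["gtin"] == b.get("gtin")),
        attribute_overlap(a["attributes"], b["attributes"]),
    ])

def train_scorer(labeled_pairs):
    """labeled_pairs: iterable of (record_a, record_b, is_duplicate) tuples."""
    labeled_pairs = list(labeled_pairs)
    X = np.stack([pair_features(a, b) for a, b, _ in labeled_pairs])
    y = np.array([label for _, _, label in labeled_pairs])
    return GradientBoostingClassifier().fit(X, y)
```

A trained scorer then gives `model.predict_proba(pair_features(a, b).reshape(1, -1))[0, 1]` as the duplicate probability fed to clustering.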
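A sketch of the clustering step: scored pairs above a cutoff become edges, connected components become product clusters, and the most complete record is elected canonical. The 0.9 cutoff and the completeness heuristic are assumptions for illustration.

```python
# Clustering & canonicals: union-find over high-probability pairs, then elect
# the most complete record in each cluster as the canonical product.
from collections import defaultdict

def cluster(scored_pairs, cutoff=0.9):
    """scored_pairs: iterable of (id_a, id_b, duplicate_probability)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, prob in scored_pairs:
        if prob >= cutoff:                 # low-confidence pairs go to human review instead
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return list(groups.values())

def pick_canonical(records: list[dict]) -> dict:
    """Keep the record with the most filled attributes; its URL and IDs survive."""
    return max(records, key=lambda r: sum(v is not None for v in r["attributes"].values()))
```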
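A sketch of the enrichment call with its anti-fabrication guardrail, assuming the openai Python client; the prompt text, the model name, and the substring-based traceability check are simplified stand-ins for the versioned production prompts and rules.

```python
# Content cleanup & enrichment: GPT-4 rewrites the listing into house style;
# a post-check rejects any attribute value that cannot be traced to the input.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Rewrite this product listing with a standardized, SEO-friendly title and "
    "description. Use ONLY facts present in the input; set unknown attributes "
    "to null. Return JSON with keys: title, description, attributes."
)

def enrich(raw_listing: dict) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": json.dumps(raw_listing)},
        ],
    )
    enriched = json.loads(response.choices[0].message.content)
    # Guardrail: every generated attribute value must appear in the source
    # listing; anything untraceable sends the item to the human-review queue.
    source_text = json.dumps(raw_listing).lower()
    for value in enriched.get("attributes", {}).values():
        if value is not None and str(value).lower() not in source_text:
            return None  # caller routes None to the review queue
    return enriched
```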
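Finally, a sketch of the evaluation harness: reusing pair_features and the trained scorer from the scoring sketch above, it reports precision and recall on a held-out labeled pair set at the threshold that gates the review queue (the 0.9 value is illustrative).

```python
# Evaluation harness: precision/recall of the pair scorer on labeled pairs.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def evaluate(model, labeled_pairs, threshold=0.9):
    """labeled_pairs: list of (record_a, record_b, is_duplicate) tuples."""
    X = np.stack([pair_features(a, b) for a, b, _ in labeled_pairs])
    y_true = np.array([label for _, _, label in labeled_pairs])
    y_pred = model.predict_proba(X)[:, 1] >= threshold
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)
```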
Outcome
- Duplicates down, with manual review limited to edge cases.
- Content quality up: titles/descriptions standardized; key attributes auto-filled; catalog search and filters more reliable.
- SEO posture improved: fewer duplicate pages indexed; clean canonicals; stronger PDP authority.
- Operational savings: hours of manual editing removed; new items enriched at ingestion time.