Data Quality Automation for a Global Marketplace

Overview

A global marketplace aggregating products from hundreds of boutiques struggled with duplicate listings and inconsistent product content. We designed and shipped a data-quality pipeline that detects duplicates across feeds and cleans/enriches titles, descriptions, and metadata with generative AI, so new items land with consistent, SEO-friendly content and one canonical product page.

At a glance

Industry: E-commerce

Goal: Provide a clean, consistent product-data stream that improves UX and SEO while lowering operational cost

Tech highlights: Python, OpenCV, GPT-4

Core Business Challenge
  • Duplicate SKUs across sources mean confused shoppers, split reviews, and SEO cannibalization.

  • Incoherent or incomplete content with missing attributes, creating a manual rewrite backlog and mounting editorial cost.

  • Continuous change: new drops land daily, so the solution must run at ingestion time, not as quarterly cleanups.

Our Approach
  • Map the mess: profiled feeds, aligned taxonomies, and defined “canonical product” rules.

  • Design for precision: staged candidate generation → pair scoring → clustering to keep false matches low.

  • Automate enrichment: LLM prompts plus rules to standardize titles/descriptions, fill key attributes, and add guardrails against fabrication.

  • Ship as a service: one pipeline triggered on import; human-review loop for edge cases.

AI Solution
  • Candidate generation (fast): text normalization (brand, model, color, size), fuzzy matching, blocking by brand/category, and image pHash to quickly find likely duplicates (first sketch after this list).

  • Pair scoring (accurate): a feature set combining text embeddings, image embeddings (CV), price/GTIN/SKU proximity, and attribute overlap; a learned classifier outputs a duplicate probability (second sketch below).

  • Clustering & canonicals: connect matching pairs into product clusters; pick a canonical record; merge variants (size/color) under one PDP; keep the canonical URL and IDs (also covered in the second sketch below).

  • Content cleanup & enrichment (LLM): standardize titles and descriptions, fill key attributes, and enforce guardrails so the model never fabricates product facts (third sketch below).

  • MLOps: batch + streaming ingestion; vector store for embeddings; evaluation harness measuring precision/recall on a labeled pair set (fourth sketch below); review queue for low-confidence cases; versioned prompts and models.
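
To make the staged design concrete, here is a minimal sketch of the candidate-generation stage, assuming rapidfuzz for fuzzy title matching and imagehash/Pillow for perceptual hashing; all field names (brand, title, image_path, id) are illustrative, not the production schema.

```python
# Candidate generation sketch: blocking by brand, fuzzy title match, image pHash.
# Assumes rapidfuzz, imagehash, and Pillow; field names are illustrative.
from collections import defaultdict

import imagehash
from PIL import Image
from rapidfuzz import fuzz

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so brand/model/color tokens compare cleanly."""
    return " ".join(text.lower().split())

def candidate_pairs(listings, title_threshold=85, hash_threshold=8):
    """Yield (a, b) listing pairs that look like duplicates within a brand block."""
    blocks = defaultdict(list)
    for item in listings:
        blocks[normalize(item["brand"])].append(item)  # blocking keeps comparisons tractable

    for block in blocks.values():
        # Precompute one perceptual hash per listing image.
        hashes = {item["id"]: imagehash.phash(Image.open(item["image_path"])) for item in block}
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                title_score = fuzz.token_sort_ratio(normalize(a["title"]), normalize(b["title"]))
                hash_dist = hashes[a["id"]] - hashes[b["id"]]  # Hamming distance between pHashes
                if title_score >= title_threshold or hash_dist <= hash_threshold:
                    yield a, b
```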
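
The pair-scoring and clustering stages could then be chained roughly as below: a trained classifier turns each pair's features into a duplicate probability, accepted pairs are connected with union-find into product clusters, and one canonical record is picked per cluster. The feature choices, the logistic-regression model, and the "most complete record wins" canonical rule are assumptions for illustration.

```python
# Pair scoring + clustering sketch: duplicate probability, union-find, canonical pick.
from collections import defaultdict

import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Illustrative features: embedding similarities plus relative price gap."""
    text_sim = float(np.dot(a["text_emb"], b["text_emb"]))    # cosine if embeddings are unit-norm
    image_sim = float(np.dot(a["image_emb"], b["image_emb"]))
    price_gap = abs(a["price"] - b["price"]) / max(a["price"], b["price"], 1e-9)
    return [text_sim, image_sim, price_gap]

class UnionFind:
    """Minimal disjoint-set structure for connecting matched pairs into clusters."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]      # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cluster(pairs, model: LogisticRegression, threshold=0.9):
    """Group candidate pairs into product clusters via high-confidence matches."""
    uf, items = UnionFind(), {}
    for a, b in pairs:
        items[a["id"]], items[b["id"]] = a, b
        prob = model.predict_proba([pair_features(a, b)])[0][1]
        if prob >= threshold:                                 # high bar keeps false merges rare
            uf.union(a["id"], b["id"])
    clusters = defaultdict(list)
    for item_id, item in items.items():
        clusters[uf.find(item_id)].append(item)
    return clusters

def pick_canonical(cluster_items):
    """Assumed rule: the listing with the most filled attributes becomes canonical."""
    return max(cluster_items, key=lambda item: sum(v is not None for v in item.values()))
```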
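
For the LLM enrichment step, one way to implement the anti-fabrication guardrail is to pin the model to the supplied attributes in the system prompt, as sketched below with the OpenAI Python client (openai>=1.0). The prompt wording is an assumption; the production prompts were versioned and more elaborate.

```python
# LLM enrichment sketch: standardize a title/description without fabricating attributes.
# Assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

GUARDRAIL = (
    "Rewrite the product title and description in a consistent, SEO-friendly style. "
    "Use ONLY the attributes provided. If an attribute is missing, leave it out; "
    "never guess materials, sizes, or colors."
)

def enrich_listing(raw_title: str, raw_description: str, attributes: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for repeatable cleanup
        messages=[
            {"role": "system", "content": GUARDRAIL},
            {"role": "user", "content": (
                f"Title: {raw_title}\n"
                f"Description: {raw_description}\n"
                f"Attributes: {attributes}"
            )},
        ],
    )
    return response.choices[0].message.content
```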
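
Finally, the evaluation harness can score the classifier against the labeled pair set and route mid-confidence pairs to the review queue; the thresholds below are illustrative. Precision is the headline metric here, since a false merge (two distinct products fused into one PDP) is costlier than a missed duplicate, which matches the precision-first design above.

```python
# Evaluation sketch: precision/recall on labeled pairs plus a low-confidence review queue.
from sklearn.metrics import precision_score, recall_score

def evaluate(model, labeled_pairs, threshold=0.9, review_band=(0.5, 0.9)):
    """labeled_pairs: list of (features, is_duplicate) built from the labeled pair set."""
    probs = [model.predict_proba([f])[0][1] for f, _ in labeled_pairs]
    y_true = [label for _, label in labeled_pairs]
    y_pred = [p >= threshold for p in probs]
    review_queue = [i for i, p in enumerate(probs) if review_band[0] <= p < review_band[1]]
    return {
        "precision": precision_score(y_true, y_pred),  # share of predicted merges that are real
        "recall": recall_score(y_true, y_pred),        # share of real duplicates we caught
        "needs_review": review_queue,                  # indices routed to the human-review loop
    }
```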

Outcome
  • Duplicates down and manual review limited to edge cases.

  • Content quality up: titles/descriptions standardized; key attributes auto-filled; catalog search and filters more reliable.

  • SEO posture improved: fewer duplicate pages indexed; clean canonicals; stronger PDP authority.

  • Operational savings: hours of manual editing removed; new items enriched at ingestion time.

“Sparkbit has the intellectual rigor to handle quite difficult problems that are at the intersection of, for example, distributed systems, AI and the specific vertical that your business is.”

Tim Mansfield, CTO, Italist