Data Quality Automation for a Global Marketplace
Overview
A global marketplace aggregating products from hundreds of boutiques struggled with duplicate listings and inconsistent product content. We designed and shipped a data-quality pipeline that detects duplicates across feeds and cleans/enriches titles, descriptions, and metadata with generative AI, so new items land with consistent, SEO-friendly content and one canonical product page.
At a glance
- Industry: E-commerce
- Goal: Deliver a clean, consistent product data stream that improves UX and SEO while lowering operational cost
- Tech highlights: Python, OpenCV, GPT-4
Core Business Challenge
- Duplicate SKUs across sources mean confused shoppers, split reviews, and SEO cannibalization.
- Incoherent or incomplete content with missing attributes, which creates a manual rewrite backlog and cost.
- Continuous change: with new drops landing daily, the solution must run at ingestion time, not as quarterly cleanups.
Our Approach
- Map the mess: profiled feeds, aligned taxonomies, and defined “canonical product” rules.
- Design for precision: staged candidate generation → pair scoring → clustering to keep false matches low.
- Automate enrichment: LLM prompts plus rules to standardize titles/descriptions and fill key attributes, with guardrails to avoid fabrication.
- Ship as a service: one pipeline triggered on import; human-review loop for edge cases.
AI Solution
- Candidate generation (fast): text normalization (brand, model, color, size), fuzzy matching, blocking by brand/category, and image pHash to quickly find likely duplicates (first sketch after this list).
- Pair scoring (accurate): a feature set combining text embeddings, image embeddings (CV), price/GTIN/SKU proximity, and attribute overlap; a learned classifier outputs a duplicate probability (second sketch below).
- Clustering & canonicals: connect matching pairs into product clusters; pick a canonical record; merge variants (size/color) under one PDP (product detail page); keep the canonical URL and IDs (third sketch below).
- Content cleanup & enrichment (LLM): generative rewriting standardizes titles and descriptions, fills key attributes from the source feed, and applies guardrails so no facts are fabricated (fourth sketch below).
- MLOps: batch + streaming ingestion; vector store for embeddings; evaluation harness measuring precision/recall on a labeled pair set (final sketch below); review queue for low-confidence cases; versioned prompts and models.
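A minimal sketch of the candidate-generation stage, assuming rapidfuzz for fuzzy title matching and opencv-contrib-python's img_hash module for pHash; the record fields (id, brand, category, title, image_path) and both thresholds are illustrative, not the production schema.

```python
# Candidate generation: block by brand/category, then flag pairs whose titles
# fuzzy-match or whose images share a near-identical perceptual hash.
from collections import defaultdict
from itertools import combinations

import cv2
from rapidfuzz import fuzz

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so 'NIKE ' and 'nike' block together."""
    return " ".join(text.lower().split())

def phash(image_path: str):
    """8-byte perceptual hash via OpenCV's contrib img_hash module."""
    return cv2.img_hash.PHash_create().compute(cv2.imread(image_path))

def candidate_pairs(records, text_threshold=80, hash_threshold=10):
    """Yield likely-duplicate id pairs; cheap filters only, scoring comes later."""
    blocks = defaultdict(list)
    for rec in records:
        rec["_phash"] = phash(rec["image_path"])   # compute once per record
        blocks[(normalize(rec["brand"]), normalize(rec["category"]))].append(rec)
    for block in blocks.values():                  # compare only within a block
        for a, b in combinations(block, 2):
            text_sim = fuzz.token_sort_ratio(normalize(a["title"]), normalize(b["title"]))
            hash_dist = cv2.norm(a["_phash"], b["_phash"], cv2.NORM_HAMMING)
            if text_sim >= text_threshold or hash_dist <= hash_threshold:
                yield a["id"], b["id"]
```

Blocking keeps the comparison count tractable: pairs are only generated within a brand/category bucket instead of across the whole catalog.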
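A sketch of pair scoring under the same assumptions: embeddings are precomputed per record, and a gradient-boosted classifier (scikit-learn's here, an illustrative choice) is trained on the labeled pair set to emit a duplicate probability.

```python
# Pair scoring: turn each candidate pair into a feature vector, then let a
# learned classifier output P(duplicate). Field names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def attribute_overlap(a_attrs: dict, b_attrs: dict) -> float:
    """Fraction of shared attribute keys whose values agree."""
    keys = set(a_attrs) & set(b_attrs)
    return sum(a_attrs[k] == b_attrs[k] for k in keys) / len(keys) if keys else 0.0

def pair_features(a: dict, b: dict) -> np.ndarray:
    return np.array([
        cosine(a["text_emb"], b["text_emb"]),   # title/description embedding similarity
        cosine(a["img_emb"], b["img_emb"]),     # CV image embedding similarity
        1.0 - abs(a["price"] - b["price"]) / max(a["price"], b["price"], 1e-9),
        float(bool(a.get("gtin")) and a["gtin"] == b.get("gtin")),
        attribute_overlap(a["attributes"], b["attributes"]),
    ])

def train_scorer(labeled_pairs):
    """labeled_pairs: iterable of (record_a, record_b, is_duplicate) tuples."""
    labeled_pairs = list(labeled_pairs)
    X = np.stack([pair_features(a, b) for a, b, _ in labeled_pairs])
    y = np.array([label for _, _, label in labeled_pairs])
    return GradientBoostingClassifier().fit(X, y)
```

A trained scorer then gives `model.predict_proba(pair_features(a, b).reshape(1, -1))[0, 1]` as the duplicate probability fed to clustering.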
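A sketch of the clustering step: scored pairs above a cutoff become edges, connected components become product clusters, and the most complete record is elected canonical. The 0.9 cutoff and the completeness heuristic are assumptions for illustration.

```python
# Clustering & canonicals: union-find over high-probability pairs, then elect
# the most complete record in each cluster as the canonical product.
from collections import defaultdict

def cluster(scored_pairs, cutoff=0.9):
    """scored_pairs: iterable of (id_a, id_b, duplicate_probability)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, prob in scored_pairs:
        if prob >= cutoff:                 # low-confidence pairs go to human review instead
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return list(groups.values())

def pick_canonical(records: list[dict]) -> dict:
    """Keep the record with the most filled attributes; its URL and IDs survive."""
    return max(records, key=lambda r: sum(v is not None for v in r["attributes"].values()))
```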
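A sketch of the enrichment call with its anti-fabrication guardrail, assuming the openai Python client; the prompt text, the model name, and the substring-based traceability check are simplified stand-ins for the versioned production prompts and rules.

```python
# Content cleanup & enrichment: GPT-4 rewrites the listing into house style;
# a post-check rejects any attribute value that cannot be traced to the input.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Rewrite this product listing with a standardized, SEO-friendly title and "
    "description. Use ONLY facts present in the input; set unknown attributes "
    "to null. Return JSON with keys: title, description, attributes."
)

def enrich(raw_listing: dict) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": json.dumps(raw_listing)},
        ],
    )
    enriched = json.loads(response.choices[0].message.content)
    # Guardrail: every generated attribute value must appear in the source
    # listing; anything untraceable sends the item to the human-review queue.
    source_text = json.dumps(raw_listing).lower()
    for value in enriched.get("attributes", {}).values():
        if value is not None and str(value).lower() not in source_text:
            return None  # caller routes None to the review queue
    return enriched
```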
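Finally, a sketch of the evaluation harness: reusing pair_features and the trained scorer from the scoring sketch above, it reports precision and recall on a held-out labeled pair set at the threshold that gates the review queue (the 0.9 value is illustrative).

```python
# Evaluation harness: precision/recall of the pair scorer on labeled pairs.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def evaluate(model, labeled_pairs, threshold=0.9):
    """labeled_pairs: list of (record_a, record_b, is_duplicate) tuples."""
    X = np.stack([pair_features(a, b) for a, b, _ in labeled_pairs])
    y_true = np.array([label for _, _, label in labeled_pairs])
    y_pred = model.predict_proba(X)[:, 1] >= threshold
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)
```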
Outcome
- Duplicates down, with manual review limited to edge cases.
- Content quality up: titles/descriptions standardized; key attributes auto-filled; catalog search and filters more reliable.
- SEO posture improved: fewer duplicate pages indexed; clean canonicals; stronger PDP authority.
- Operational savings: hours of manual editing removed; new items enriched at ingestion time.