AI-Powered Attribute Management in E-Commerce: How I Harmonized Attribute Data Across Millions of Products

Most e-commerce platforms talk about major technical challenges: large-scale search, real-time inventory, personalized recommendations. But there is a hidden problem that almost every retailer faces: attribute value consistency. Attribute values may seem unimportant on the surface, but they are the foundation for product discovery, filtering, comparisons, and search relevance.

In real product catalogs, the state is chaotic. Size indications appear as “XL”, “Small”, “12cm”, “Large” mixed together. Colors are recorded as “RAL 3020”, “Crimson”, “Red”, and “Dark Red”. Multiply these inconsistencies across millions of SKUs with dozens of attributes per product – and the system becomes unusable. Filters behave unpredictably, search engines lose quality, and customers get frustrated navigating.

The Problem at Scale

As a full-stack engineer at Zoro, I faced exactly this task: building a system that not only manages these attributes but intelligently structures them. The goal was simple, but execution was complex: provide 3 million+ SKUs with consistent, traceable attribute values.

The challenge: you can’t manually code rules for every category. You need something that thinks but remains controllable. That’s where AI comes into play – not as a black-box solution, but as a partner for deterministic logic.

The Hybrid Strategy: AI with Guardrails

Rather than relying on a pure LLM solution, I built a hybrid pipeline combining LLM intelligence with clear rules and business controls. The result: explainable, predictable, scalable, and human-controlled.

The system processes attributes not in real-time but in offline background jobs. This may sound like a compromise, but it’s a deliberate architectural decision with major advantages:

  • High throughput: processing huge data volumes without burdening live systems
  • Reliability: failures never impact customer traffic
  • Cost efficiency: calculations run during off-peak hours
  • Isolation: LLM latency never affects product pages
  • Consistency: updates are atomic and predictable

Real-time processing would lead to unpredictable latency, higher costs, and fragile dependencies. Offline jobs give us batching efficiency, asynchronous AI calls, and human review points.

Preparation: Cleaning Before Intelligence

Before the LLM looks at attributes, I perform a cleaning step:

  • Trim whitespace
  • Remove empty values
  • Deduplicate values
  • Convert category context into structured strings

The LLM receives clean, clear inputs. Garbage in, garbage out – at this scale, small errors become big problems. Cleaning is the foundation for everything that follows.
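A minimal sketch of this cleaning step, assuming straightforward string handling (the function and field names are mine, not the production code):

```python
def clean_attribute_values(raw_values):
    """Trim whitespace, drop empties, and deduplicate before any LLM call."""
    seen = set()
    cleaned = []
    for value in raw_values:
        v = value.strip()
        if not v:
            continue  # remove empty values
        key = v.lower()
        if key in seen:
            continue  # deduplicate (case-insensitive)
        seen.add(key)
        cleaned.append(v)
    return cleaned

def category_context(breadcrumbs):
    """Flatten category breadcrumbs into one structured string."""
    return " > ".join(breadcrumbs)
```

Simple as it looks, this step is what keeps noise out of every downstream stage.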

The AI Service: Thinking with Context

The LLM service receives more than just raw values. It gets:

  • cleaned attributes
  • category breadcrumbs
  • attribute metadata

With this context, the model understands that “Voltage” in power tools is numeric, “Size” in clothing follows a known progression, and “Color” may respect RAL standards. The model returns: ordered values, refined attribute names, and a decision on whether deterministic or contextual sorting is needed.

This allows the pipeline to handle different attribute types without hard-coding new rules for each category.
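The shape of that exchange might look roughly like the sketch below. The field names and the validation step are my assumptions; the key idea from the pipeline is that the model's answer is checked before it is trusted:

```python
import json

def build_sort_request(attribute, values, breadcrumbs, metadata):
    """Bundle the context the model needs; field names are illustrative."""
    return {
        "attribute": attribute,
        "values": values,
        "category_path": " > ".join(breadcrumbs),
        "metadata": metadata,
    }

def parse_sort_response(raw_json):
    """Guardrail: validate the model's answer before trusting it."""
    data = json.loads(raw_json)
    required = {"sorted_values", "refined_name", "strategy"}
    missing = required - set(data)
    if missing:
        raise ValueError(f"LLM response missing fields: {missing}")
    if data["strategy"] not in ("DETERMINISTIC", "CONTEXTUAL"):
        raise ValueError(f"Unknown strategy: {data['strategy']}")
    return data
```

Rejecting malformed responses here is what keeps the LLM a partner for deterministic logic rather than a black box.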

Smart Fallbacks: Not Everything Needs AI

Not every attribute requires artificial intelligence. Numeric ranges, unit-based values, and simple quantities benefit more from deterministic logic:

  • faster processing
  • predictable sorting
  • lower costs
  • no ambiguity

The pipeline automatically detects these cases and uses rules instead of AI. This keeps the system efficient and avoids unnecessary model calls.
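One way to sketch this detection, assuming a simple "number plus unit" pattern (the regex and return convention are my own):

```python
import re

# matches "number + optional unit", e.g. "12cm", "5.5 mm", "20"
_NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")

def try_deterministic_sort(values):
    """Sort numerically when every value parses as number+unit;
    return None to signal that the LLM path is needed instead."""
    parsed = []
    for v in values:
        m = _NUMERIC.match(v)
        if m is None:
            return None  # ambiguous value: fall through to the model
        parsed.append((float(m.group(1)), v))
    return [v for _, v in sorted(parsed)]
```

A single non-numeric value is enough to route the whole attribute to the model, so rules never guess.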

Retailer Control

Each category can be tagged as:

  • LLM_SORT: let the model decide
  • MANUAL_SORT: retailers define the order manually

This dual system enables true human control: AI does the work, humans make the final decisions. It also built trust, because retailers could override the model without breaking the pipeline.
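The dispatch between the two tags can be sketched like this (the tag names come from the article; everything else is illustrative):

```python
def sort_for_category(category, values, manual_orders, llm_sort):
    """Route a category through the retailer's chosen strategy."""
    if category.get("sort_tag") == "MANUAL_SORT":
        rank = {v: i for i, v in enumerate(manual_orders[category["id"]])}
        # retailer-defined order wins; unknown values sink to the end
        return sorted(values, key=lambda v: rank.get(v, len(rank)))
    return llm_sort(values)  # LLM_SORT: let the model decide
```

Because the manual path never touches the model, an override can never be silently re-sorted by AI.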

Persistence and Synchronization

All results are stored in a MongoDB product database – the central nervous system for:

  • sorted attributes
  • refined attribute names
  • category-specific sort tags
  • product-specific sortOrder fields

From there, outbound jobs synchronize data with:

  • Elasticsearch for keyword-driven search
  • Vespa for semantic and vector-based search

Filters appear in logical order, product pages show consistent attributes, search engines rank products more accurately.
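To make the persistence concrete, here is an illustrative shape for one stored document. Only the sortOrder field and the sort tags are named in the pipeline above; the other field names are assumptions:

```python
def to_sort_order(sorted_values):
    """Derive a per-value sortOrder mapping from the sorted list."""
    return {v: i for i, v in enumerate(sorted_values)}

# hypothetical MongoDB document for one category attribute
attribute_doc = {
    "categoryId": "power-tools",       # assumed field name
    "attribute": "Voltage",
    "refinedName": "Voltage (V)",      # refined attribute name
    "sortTag": "LLM_SORT",             # or MANUAL_SORT
    "sortedValues": ["12V", "18V", "20V"],
    "sortOrder": to_sort_order(["12V", "18V", "20V"]),
}
```

The outbound Elasticsearch and Vespa jobs would then read documents of this shape and push them downstream.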

From Chaos to Order: The Transformation

Here’s where the system’s power shows in practice:

| Attribute | Raw Input | Sorted Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, Red (RAL 3020) |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |

Chaotic inputs become logical, consistent sequences.

The Architecture in Motion

The entire pipeline follows this flow:

  1. Product data flows from the PIM system
  2. Extraction job collects attributes and category context
  3. The AI Sorting Service processes these intelligently
  4. MongoDB stores the results
  5. Outbound jobs sync data back to the PIM system
  6. Elasticsearch and Vespa sync jobs distribute data to search systems
  7. API services connect search with customer pages

This flow ensures no attribute value is lost – whether sorted by AI or manually set, it’s reflected everywhere.
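A heavily simplified skeleton of that flow, under the assumption that the jobs can be collapsed into one loop (in reality they would be separate batch jobs with retries and review points):

```python
def run_offline_batch(pim_products, sort_fn, db, sync_targets):
    """Sketch of steps 1-6: ingest, extract, sort, persist, sync."""
    for product in pim_products:                            # 1. data from PIM
        for name, values in product["attributes"].items():  # 2. extraction job
            sorted_values = sort_fn(name, values)           # 3. sorting service
            db[(product["sku"], name)] = sorted_values      # 4. persist
    for target in sync_targets:                             # 5-6. outbound syncs
        target.update(db)
    return db
```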

Why Not Real-Time?

A real-time pipeline might sound attractive but would lead to:

  • unpredictable latency
  • higher peak loads
  • fragile dependencies
  • operational complexity

Offline jobs provide throughput efficiency, error tolerance, and predictable costs. The small downside: a slight delay between data ingestion and display. The big advantage: consistency at scale that customers truly appreciate.

The Impact

The system delivers measurable results:

  • consistent sorting across 3M+ SKUs
  • predictable numeric attributes via rules
  • retailer control mechanisms through manual tagging
  • cleaner product pages, more intuitive filters
  • improved search relevance and higher conversions
  • strengthened customer trust

This was more than a technical victory – it improved user experience and revenue.

Key Takeaways

  • Hybrid beats pure AI: at scale, you need guardrails, not just intelligence
  • Context is king: the right environment dramatically improves LLM accuracy
  • Offline is the new online: for throughput and reliability, not real-time
  • Humans retain control: override mechanisms build real trust
  • Clean input is fundamental: Garbage In, Garbage Out – always clean first

Conclusion

Sorting attribute values sounds simple. But across millions of products, it becomes a real challenge. By combining LLM intelligence with clear rules and business controls, I transformed a hidden problem into a clean, scalable system.

This is the power of hybrid approaches: they combine the best of human and machine. And sometimes the biggest successes come from solving the dullest problems – those that are easy to overlook but appear on every product page.
