Most e-commerce platforms talk about major technical challenges: large-scale search, real-time inventory, personalized recommendations. But there is a hidden problem that almost every retailer faces: attribute value consistency. It may seem superficially unimportant, but consistent attribute values are the foundation for product discovery, filtering, comparisons, and search relevance.
In real product catalogs, the state is chaotic. Size indications appear as “XL”, “Small”, “12cm”, “Large” mixed together. Colors are recorded as “RAL 3020”, “Crimson”, “Red”, and “Dark Red”. Multiply these inconsistencies across millions of SKUs with dozens of attributes per product – and the system becomes unusable. Filters behave unpredictably, search engines lose quality, and customers get frustrated navigating.
The Problem at Scale
As a full-stack engineer at Zoro, I faced exactly this task: building a system that not only manages these attributes but intelligently structures them. The goal was simple, but execution was complex: provide 3 million+ SKUs with consistent, traceable attribute values.
The challenge: you can’t manually code rules for every category. You need something that thinks but remains controllable. That’s where AI comes into play – not as a black-box solution, but as a partner working alongside deterministic logic.
The Hybrid Strategy: AI with Guardrails
My approach differed from both pure-AI and pure-rules solutions: a hybrid pipeline combining LLM intelligence with clear rules and business controls. The result is explainable, predictable, scalable, and human-controlled.
The system processes attributes not in real-time but in offline background jobs. This may sound like a compromise, but it’s a deliberate architectural decision with major advantages:
High throughput: processing huge data volumes without burdening live systems
Reliability: failures never impact customer traffic
Cost efficiency: calculations run during off-peak hours
Isolation: LLM latency never affects product pages
Consistency: updates are atomic and predictable
Real-time processing would lead to unpredictable latency, higher costs, and fragile dependencies. Offline jobs give us batching efficiency, asynchronous AI calls, and human review points.
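To make that concrete, here is a minimal sketch of what such an offline batch loop could look like in TypeScript. Everything in it (AttributeGroup, fetchAttributeBatch, sortAttributeGroup, the batch size) is a hypothetical stand-in for the production jobs, not the actual code.

```typescript
// Hypothetical offline batch loop: page through extracted attribute groups,
// process each batch concurrently, and keep failures isolated from live traffic.

interface AttributeGroup {
  categoryId: string;
  attributeName: string;
  values: string[];
}

// Stubbed data source; in production this would page through the PIM extract.
async function fetchAttributeBatch(
  size: number,
  cursor = 0
): Promise<{ items: AttributeGroup[]; nextCursor?: number }> {
  const all: AttributeGroup[] = [
    { categoryId: "apparel", attributeName: "Size", values: ["XL", "Small", "M"] },
  ];
  const items = all.slice(cursor, cursor + size);
  return {
    items,
    nextCursor: cursor + size < all.length ? cursor + size : undefined,
  };
}

// Placeholder for the hybrid sorting logic described in the next sections.
async function sortAttributeGroup(group: AttributeGroup): Promise<string[]> {
  return [...group.values].sort();
}

async function runSortingJob(batchSize = 500): Promise<void> {
  let cursor: number | undefined = 0;
  while (cursor !== undefined) {
    const { items, nextCursor } = await fetchAttributeBatch(batchSize, cursor);
    // Whole batches run concurrently; a single failure is logged and retried
    // later instead of ever touching customer-facing latency.
    const results = await Promise.allSettled(items.map(sortAttributeGroup));
    results.forEach((r, i) =>
      r.status === "fulfilled"
        ? console.log(items[i].attributeName, "->", r.value)
        : console.error(items[i].attributeName, "failed:", r.reason)
    );
    cursor = nextCursor;
  }
}

runSortingJob().catch(console.error);
```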
Preparation: Cleaning Before Intelligence
Before the LLM looks at attributes, I perform a cleaning step:
Trim whitespace
Remove empty values
Remove duplicate values
Convert category context into structured strings
The LLM receives clean, clear inputs. Garbage in, garbage out – at this scale, small errors become big problems. Cleaning is the foundation for everything that follows.
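As a rough illustration, the cleaning step could look like the sketch below; the shapes (RawAttribute, CleanAttribute) and the case-insensitive deduplication are assumptions for the example, not the production schema.

```typescript
interface RawAttribute {
  name: string;
  values: (string | null | undefined)[];
  categoryPath: string[]; // e.g. ["Tools", "Power Tools", "Drills"]
}

interface CleanAttribute {
  name: string;
  values: string[];
  categoryContext: string; // breadcrumb flattened into one structured string
}

function cleanAttribute(raw: RawAttribute): CleanAttribute {
  const seen = new Set<string>();
  const values: string[] = [];
  for (const v of raw.values) {
    const trimmed = (v ?? "").trim(); // trim whitespace
    if (trimmed === "") continue; // drop empty values
    const key = trimmed.toLowerCase(); // illustrative: dedupe case-insensitively
    if (seen.has(key)) continue; // remove duplicate values
    seen.add(key);
    values.push(trimmed);
  }
  return {
    name: raw.name.trim(),
    values,
    // Convert the category context into a single structured string.
    categoryContext: raw.categoryPath.join(" > "),
  };
}

// Duplicates and blanks disappear; the breadcrumb becomes one string.
console.log(
  cleanAttribute({
    name: " Size ",
    values: ["XL", " xl ", "", null, "Small"],
    categoryPath: ["Apparel", "Shirts"],
  })
);
```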
The AI Service: Thinking with Context
The LLM service receives more than just raw values. It gets:
cleaned attributes
category breadcrumbs
attribute metadata
With this context, the model understands that “Voltage” in power tools is numeric, “Size” in clothing follows a known progression, and “Color” may respect RAL standards. The model returns: ordered values, refined attribute names, and a decision on whether deterministic or contextual sorting is needed.
This allows the pipeline to handle different attribute types without hard-coding new rules for each category.
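The exact service contract is internal, but a hypothetical request/response shape could look like this; the endpoint URL and all field names are illustrative assumptions.

```typescript
interface SortRequest {
  attributeName: string;
  values: string[]; // cleaned values from the preparation step
  categoryBreadcrumb: string; // e.g. "Apparel > Shirts"
  metadata?: Record<string, string>; // unit hints, data type, etc.
}

interface SortResponse {
  refinedName: string; // e.g. "Size" instead of "size_val"
  orderedValues: string[]; // the model's proposed order
  strategy: "DETERMINISTIC" | "CONTEXTUAL"; // whether rules would suffice
}

// Hypothetical internal endpoint; the URL is a placeholder.
async function requestLlmSort(req: SortRequest): Promise<SortResponse> {
  const res = await fetch("https://sorting-service.internal/api/attribute-sort", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`sorting service returned ${res.status}`);
  return (await res.json()) as SortResponse;
}
```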
Smart Fallbacks: Not Everything Needs AI
Not every attribute requires artificial intelligence. Numeric ranges, unit-based values, and simple quantities benefit more from deterministic logic:
faster processing
predictable sorting
lower costs
no ambiguity
The pipeline automatically detects these cases and uses rules instead of AI. This keeps the system efficient and avoids unnecessary model calls.
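A sketch of such a deterministic fallback, assuming one unit per attribute and an illustrative unit list:

```typescript
// Detect unit-based values like "12cm" and sort them numerically instead of
// calling the model. The regex and unit list are illustrative, and the sketch
// assumes a single unit per attribute (no mm-vs-cm conversion).

const UNIT_VALUE = /^(\d+(?:\.\d+)?)\s*(mm|cm|m|in|ft|g|kg|lb|v|w)$/i;

function parseUnitValue(value: string): number | null {
  const match = value.trim().match(UNIT_VALUE);
  return match ? parseFloat(match[1]) : null;
}

// True only if every value parses; mixed attributes like "Small, 12cm"
// still fall through to the contextual (LLM) path.
function isDeterministic(values: string[]): boolean {
  return values.every((v) => parseUnitValue(v) !== null);
}

function sortDeterministically(values: string[]): string[] {
  return [...values].sort((a, b) => parseUnitValue(a)! - parseUnitValue(b)!);
}

// "5cm, 12cm, 2cm, 20cm" -> "2cm, 5cm, 12cm, 20cm", with no model call.
const sample = ["5cm", "12cm", "2cm", "20cm"];
if (isDeterministic(sample)) console.log(sortDeterministically(sample));
```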
Retailer Control
Each category can be tagged as:
LLM_SORT: let the model decide
MANUAL_SORT: retailers define the order manually
This dual system keeps humans in genuine control: AI does the work, humans make the final decisions. It also built trust – retailers could override the model without breaking the pipeline.
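A minimal sketch of how that dispatch could look; CategoryConfig and the append-unknown-values behavior are illustrative assumptions.

```typescript
// CategorySortTag mirrors the two tags above.
type CategorySortTag = "LLM_SORT" | "MANUAL_SORT";

interface CategoryConfig {
  tag: CategorySortTag;
  manualOrder?: string[]; // only present when retailers define an order
}

function resolveOrder(config: CategoryConfig, llmOrder: string[]): string[] {
  if (config.tag === "MANUAL_SORT" && config.manualOrder?.length) {
    // The retailer override always wins; values the retailer did not list
    // keep the model's order, appended at the end, so nothing is lost.
    const manual = config.manualOrder;
    const rest = llmOrder.filter((v) => !manual.includes(v));
    return [...manual, ...rest];
  }
  return llmOrder; // LLM_SORT: let the model decide
}

// Example: the manual order wins, and "12cm" survives at the end.
console.log(
  resolveOrder(
    { tag: "MANUAL_SORT", manualOrder: ["Small", "M", "Large", "XL"] },
    ["XL", "Small", "M", "Large", "12cm"]
  )
);
```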
Persistence and Synchronization
All results are stored in a MongoDB product database – the central nervous system for:
sorted attributes
refined attribute names
category-specific sort tags
product-specific sortOrder fields
From there, outbound jobs synchronize data with:
Elasticsearch for keyword-driven search
Vespa for semantic and vector-based search
Filters appear in logical order, product pages show consistent attributes, search engines rank products more accurately.
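As an illustration, a persistence step with the official MongoDB Node.js driver could look like this; the database, collection, and field names are assumptions, not the real schema.

```typescript
import { MongoClient } from "mongodb";

interface SortedAttributeDoc {
  categoryId: string;
  attributeName: string; // refined name from the pipeline
  sortedValues: string[];
  sortTag: "LLM_SORT" | "MANUAL_SORT";
  updatedAt: Date;
}

async function persistSortedAttribute(doc: SortedAttributeDoc): Promise<void> {
  const client = new MongoClient(
    process.env.MONGO_URL ?? "mongodb://localhost:27017"
  );
  try {
    await client.connect();
    const col = client
      .db("products")
      .collection<SortedAttributeDoc>("sorted_attributes");
    // Upsert keeps the write atomic per attribute: the downstream
    // Elasticsearch and Vespa sync jobs always read complete documents,
    // never half-written ones.
    await col.updateOne(
      { categoryId: doc.categoryId, attributeName: doc.attributeName },
      { $set: doc },
      { upsert: true }
    );
  } finally {
    await client.close();
  }
}
```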
From Chaos to Order: The Transformation
Here’s where the system’s power shows in practice:
| Attribute | Raw Input | Sorted Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, Red (RAL 3020) |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
Chaotic inputs become logical, consistent sequences.
The Architecture in Motion
The entire pipeline follows this flow:
Product data flows from the PIM system
Extraction job collects attributes and category context
The AI Sorting Service processes these intelligently
MongoDB stores the results
Outbound jobs sync data back to the PIM system
Elasticsearch and Vespa sync jobs distribute data to search systems
API services connect search with customer pages
This flow ensures no attribute value is lost – whether sorted by AI or manually set, it’s reflected everywhere.
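To show the hand-offs, here is a compact, self-contained sketch where every stage is a stub standing in for a separate production job; they run sequentially only to illustrate how data moves between the systems.

```typescript
type Stage = (values: string[]) => Promise<string[]>;

// PIM extraction + cleaning, reduced to trim-and-drop-empties.
const extract: Stage = async (v) => v.map((s) => s.trim()).filter(Boolean);
// Stand-in for the hybrid AI Sorting Service.
const aiSort: Stage = async (v) => [...v].sort();
// Stand-in for the MongoDB write.
const persist: Stage = async (v) => {
  console.log("mongodb <-", v);
  return v;
};
// Stand-in for the Elasticsearch/Vespa sync jobs.
const syncSearch: Stage = async (v) => {
  console.log("es/vespa <-", v);
  return v;
};

async function runPipeline(values: string[]): Promise<string[]> {
  let out = values;
  // Each stage hands its output to the next, so a value set is never lost
  // between extraction and the search indexes.
  for (const stage of [extract, aiSort, persist, syncSearch]) {
    out = await stage(out);
  }
  return out;
}

runPipeline([" XL", "Small ", ""]).then((v) => console.log("live:", v));
```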
Why Not Real-Time?
A real-time pipeline might sound attractive but would lead to:
unpredictable latency
higher peak loads
fragile dependencies
operational complexity
Offline jobs provide throughput efficiency, error tolerance, and predictable costs. The small downside: a slight delay between data ingestion and display. The big advantage: consistency at scale that customers truly appreciate.
The Impact
The system delivers measurable results:
consistent sorting across 3M+ SKUs
predictable numeric attributes via rules
retailer control mechanisms through manual tagging
cleaner product pages, more intuitive filters
improved search relevance and higher conversions
strengthened customer trust
This was more than a technical victory – it improved user experience and revenue.
Key Takeaways
Hybrid beats pure AI: at scale, you need guardrails, not just intelligence
Context is king: the right environment dramatically improves LLM accuracy
Offline is the new online: batch jobs beat real-time on throughput and reliability
Humans retain control: override mechanisms build real trust
Clean input is fundamental: Garbage In, Garbage Out – always clean first
Conclusion
Sorting attribute values sounds simple. But across millions of products, it becomes a real challenge. By combining LLM intelligence with clear rules and business controls, I transformed a hidden problem into a clean, scalable system.
This is the power of hybrid approaches: they combine the best of human and machine. And sometimes the biggest successes come from solving the dullest problems – those that are easy to overlook but appear on every product page.