Introduction
SkuSync is a specialized data transformation tool designed to convert standard e-commerce product exports into formats optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines.
Why use SkuSync?
Raw HTML and CSV data are noisy for AI models. SkuSync strips away presentation layers and structures your product catalog into semantic data that AI agents can "read" efficiently, reducing token usage and improving context understanding.
Getting Started
Follow these steps to generate your first AI-ready dataset. No coding knowledge is required.
1. Export Data
Go to your SHOPLINE admin panel, navigate to Products, and export your catalog as "All products".
2. Upload & Convert
Drag your CSV into SkuSync. The browser-based engine parses it instantly without server uploads.
SHOPLINE CSV Structure
SkuSync expects a standard SHOPLINE export format. Ensure your CSV file contains the following headers for optimal parsing:
```csv
Handle,Title,Body (HTML),Vendor,Price,Image Src
classic-tee,"Classic Cotton Tee","<p>100% organic cotton</p>",BrandX,29.99,https://.../tee.jpg
slim-jeans,"Slim Fit Denim","<p>Indigo wash</p>",BrandX,89.00,https://.../jeans.jpg
```

Pro Tip: Image Handling
SkuSync automatically filters for the primary product image (the first image in the list) to keep your JSON payloads lightweight for vision models.
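For illustration, the core parsing step can be sketched as a minimal quoted-field CSV reader mapped onto the headers above. This is hypothetical code, not SkuSync's actual engine, which also handles escaped quotes, BOMs, and multi-line cells; the example row and URL are made up.

```javascript
// Minimal CSV line parser that handles double-quoted fields
// (a sketch only; real CSV parsing has more edge cases).
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (const ch of line) {
    if (ch === '"') {
      inQuotes = !inQuotes;           // toggle quoted state, drop the quote
    } else if (ch === ',' && !inQuotes) {
      fields.push(current);           // field boundary outside quotes
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Map a data row onto the expected SHOPLINE headers
const headers = ['Handle', 'Title', 'Body (HTML)', 'Vendor', 'Price', 'Image Src'];
const row = parseCsvLine(
  'classic-tee,"Classic Cotton Tee","<p>100% organic cotton</p>",BrandX,29.99,https://example.com/tee.jpg'
);
const product = Object.fromEntries(headers.map((h, i) => [h, row[i]]));
```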
Format Specifications
JSON Schema
The JSON Schema output provides a strict type definition for your product data. This is essential when using "Function Calling" or "Tools" with OpenAI's GPT-4 or Anthropic's Claude, ensuring the model generates valid parameters.
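As an illustration, a generated schema for one catalog product might resemble the sketch below. The field names here are hypothetical; the actual output follows the columns present in your CSV.

```javascript
// Hypothetical JSON Schema for one product record; illustrative only.
const productSchema = {
  type: 'object',
  properties: {
    handle:   { type: 'string' },
    title:    { type: 'string' },
    vendor:   { type: 'string' },
    price:    { type: 'number' },
    imageSrc: { type: 'string', format: 'uri' }
  },
  required: ['handle', 'title', 'price']
};
```

A schema like this is what you would pass as the `parameters` of a function or tool definition, so the model is constrained to emit valid product fields.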
llms.txt
Following the proposed llms.txt standard, this format uses simplified Markdown to present content. It strips HTML tags, script blocks, and CSS classes, leaving only the semantic content relevant for training or context windows.
```
# Product Catalog Context

## Classic Cotton Tee
ID: 10234
Price: $29.99
Description: 100% organic cotton, pre-shrunk, available in earth tones.

## Slim Fit Denim
ID: 10235
...
```

NDJSON (Newline Delimited JSON)
NDJSON is the preferred format for bulk data ingestion into vector databases like Pinecone or Weaviate. Each line is a standalone valid JSON object, allowing for stream processing without loading the entire dataset into memory.
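For example, two catalog rows serialize to two independent lines, and a consumer can parse them one at a time. The field names below are illustrative, not the exact SkuSync output.

```javascript
// Each NDJSON line is a complete JSON document, so a stream
// consumer can parse line-by-line without buffering the file.
const ndjson = [
  '{"id":"10234","title":"Classic Cotton Tee","price":29.99}',
  '{"id":"10235","title":"Slim Fit Denim","price":89.00}'
].join('\n');

const products = ndjson
  .split('\n')
  .filter((line) => line.trim())   // skip blank lines
  .map((line) => JSON.parse(line));
```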
Usage Guide
Learn how to integrate SkuSync outputs into your AI-powered workflows and applications.
OpenAI Integration
Use the generated JSON Schema with OpenAI's Function Calling for structured product queries.
```javascript
// Using JSON Schema with OpenAI Function Calling
const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: "Find red shoes under $50" }],
  functions: [{
    name: "search_products",
    parameters: jsonSchema // Use generated JSON Schema
  }]
});
```
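Note that newer OpenAI SDK versions deprecate the `functions` parameter in favor of `tools`; the generated schema plugs in the same way. The sketch below uses a placeholder schema in place of the SkuSync output.

```javascript
// Placeholder standing in for the schema generated by SkuSync.
const jsonSchema = { type: 'object', properties: {} };

// The same schema wrapped as a "tools" entry for newer SDKs.
const tools = [{
  type: 'function',
  function: {
    name: 'search_products',
    description: 'Search the product catalog',
    parameters: jsonSchema
  }
}];
// Then: openai.chat.completions.create({ model, messages, tools })
```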
Vector Database Import
Import NDJSON output into Pinecone or other vector databases for semantic search.
```javascript
// Importing NDJSON to Pinecone
const fs = require('fs');

const ndjsonLines = fs.readFileSync('products.ndjson', 'utf-8').split('\n');
for (const line of ndjsonLines) {
  if (line.trim()) {
    const product = JSON.parse(line);
    await pineconeIndex.upsert({
      vectors: [{
        id: product.id,
        metadata: product,
        values: await embed(product.triples.join(' '))
      }]
    });
  }
}
```
RAG Pipeline Setup
Build a Retrieval-Augmented Generation pipeline using llms.txt as context.
```javascript
// RAG Pipeline with llms.txt context
async function queryProductCatalog(userQuery) {
  // 1. Retrieve relevant context from llms.txt
  const context = retrieveContext(userQuery, llmsTxtContent);

  // 2. Augment query with context
  const prompt = `Context:
${context}
Question: ${userQuery}`;

  // 3. Generate response
  return await llm.generate(prompt);
}
```
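The `retrieveContext` helper above is left abstract. As a sketch, assuming llms.txt sections are delimited by `##` headings as in the sample earlier, a naive keyword-overlap retriever could look like this; a production pipeline would typically rank by embedding similarity instead.

```javascript
// Naive retriever: split llms.txt into "## " sections and rank
// them by keyword overlap with the query (illustrative only).
function retrieveContext(userQuery, llmsTxtContent, topK = 2) {
  const sections = llmsTxtContent
    .split(/^## /m)
    .slice(1)                                   // drop the "# ..." preamble
    .map((s) => '## ' + s.trim());
  const words = userQuery.toLowerCase().split(/\W+/).filter(Boolean);
  return sections
    .map((section) => ({
      section,
      score: words.filter((w) => section.toLowerCase().includes(w)).length
    }))
    .sort((a, b) => b.score - a.score)          // best-matching sections first
    .slice(0, topK)
    .map((s) => s.section)
    .join('\n\n');
}
```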
Best Practices
Data Quality Checklist
- Ensure your CSV is UTF-8 encoded for proper character handling
- Verify the required columns are present (Handle, Title, Vendor, Tags, Collections)
- Check for duplicate SKUs to avoid data conflicts
- Validate that image URLs are accessible and properly formatted
- Remove any unnecessary HTML from product descriptions
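The duplicate-SKU check above can be automated before upload. A minimal sketch, assuming rows have already been parsed into objects keyed by the CSV headers:

```javascript
// Flag handles that appear more than once in the parsed catalog
// (a sketch; assumes each row object has a Handle key).
function findDuplicateHandles(rows) {
  const seen = new Set();
  const duplicates = new Set();
  for (const row of rows) {
    if (seen.has(row.Handle)) duplicates.add(row.Handle);
    seen.add(row.Handle);
  }
  return [...duplicates];
}
```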
Large Dataset Strategy
- Split files larger than 10,000 products for better performance
- Use batch mode for multiple files to merge results efficiently
- Monitor browser memory usage when processing large datasets
- Consider using NDJSON for streaming large datasets to databases
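The 10,000-product guideline above can be applied with a simple chunking helper before conversion. A hypothetical sketch:

```javascript
// Split a large parsed catalog into fixed-size chunks so each
// file stays under the suggested ~10,000-product limit.
function chunkProducts(products, size = 10000) {
  const chunks = [];
  for (let i = 0; i < products.length; i += size) {
    chunks.push(products.slice(i, i + size));
  }
  return chunks;
}
```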
Integration Best Practices
- Store NDJSON in version control for data tracking and rollback
- Create automated CI/CD pipelines for regular data updates
- Set up scheduled syncs to keep AI systems updated with the latest products
- Deploy llms.txt to your website root for AI crawler discovery
Troubleshooting
What's the maximum file size supported?
How are special characters in product titles handled?
How can I improve conversion speed for large files?
- Use a modern browser (Chrome or Edge recommended)
- Close unnecessary browser tabs to free up memory
- Ensure your device has sufficient RAM (4GB+ recommended)
- Use batch mode for multiple smaller files instead of one large file
Is my data sent to any server?
No. Conversion runs entirely in your browser; your catalog is never uploaded to a server.
My file is not parsing correctly
Images are missing in the output
SkuSync reads images from the Image Src column. If your export uses a different header (e.g., from a custom app), rename the column header to Image Src before uploading.