Well-Architected Review - Catalog Upload Process
- Overview
- 1. Operational Excellence
- 1.1 Catalog Data Preparation Operations
- 1.2 Catalog Upload Operations
- 1.3 Upload Status Tracking Operations
- 1.4 SNS/SQS Notification Operations
- 1.5 Retry and Error Handling
- 1.6 Observability, Deployment, and Readiness
- 1.7 Monitoring and Change Management
- 1.8 Change Management Readiness
- 1.8.1 Change Control Process
- 1.8.2 Catalog Data and Schema Changes
- 1.8.3 Catalog Upload Pipeline Changes
- 1.8.4 SNS/SQS Notification Pipeline Changes
- 1.8.5 IAM and Security Configuration Changes
- 1.8.6 Source System and Integration Changes
- 1.8.7 CloudFormation and Infrastructure Changes
- 1.8.8 Testing and Validation
- 1.8.9 Rollback and Recovery
- 1.8.10 Communication and Coordination
- 2. Security
- 3. Reliability
- 4. Performance Efficiency
- 5. Cost Optimization
- 6. Catalog Reporting
- 7. Sustainability
Disclaimer: This document contains sample content for illustrative purposes only. Organizations should follow their own established best practices, security requirements, and compliance standards to ensure solutions are production-ready.
Overview
This questionnaire is designed for Just Walk Out store implementations that use the Catalog API to programmatically upload and maintain product catalogs. In this model, the retailer's systems manage catalog data preparation, API invocation, upload status tracking, and processing of SNS/SQS notifications for upload results. The following APIs and components are in scope:
- Upload Catalog API (POST /v1/catalog/upload)
- Get Catalog Upload Status API (POST /v1/catalog/getCatalogUploadStatus)
- SNS/SQS notification pipeline for upload results
- S3 presigned URL for downloading upload result reports
1. Operational Excellence
1.1 Catalog Data Preparation Operations
- How do you validate catalog data (SKU, barcode, item name, store ID) before submitting to the Upload Catalog API?
- What process ensures catalog field lengths comply with API limits (e.g., 255 characters for item_sku, item_name, external_product_id)?
- How do you handle catalogs exceeding the 10,000 item upload limit (batching strategy)?
- What monitoring detects stale or outdated catalog data in your source systems (ERP, merchandising)?
- How do you ensure store IDs in catalog payloads are valid and associated with the same retailer?
- How do you keep your POS catalog in sync with the Amazon JWO catalog (for example, when items are archived in your POS)?
- What is your process to notify the customer that an item has been successfully uploaded and can be merchandised in the store?
- What process do you use to notify the store when an item is removed or archived from the catalog?
- How do you communicate catalog changes or additions to your catalog source system (e.g., ERP, merchandising platform)?
- What is your migration plan for catalog data when transitioning to a new POS or source system (e.g., field mapping, data validation, parallel run)?
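The field-length and batching questions above can be illustrated with a minimal sketch. It assumes catalog items are held as plain dictionaries keyed by the field names quoted above; the helpers validate_item and batch_items are hypothetical names, not part of the Catalog API. Required-field checks are left out because the set of required fields is not stated here.

```python
from typing import Dict, Iterator, List

# Limits as quoted in this questionnaire; confirm them against the current API documentation.
MAX_FIELD_LENGTH = 255
MAX_ITEMS_PER_UPLOAD = 10_000
LENGTH_LIMITED_FIELDS = ("item_sku", "item_name", "external_product_id")


def validate_item(item: Dict[str, str]) -> List[str]:
    """Return a list of length problems; an empty list means none were found."""
    problems = []
    for field in LENGTH_LIMITED_FIELDS:
        value = item.get(field)
        if value is not None and len(value) > MAX_FIELD_LENGTH:
            problems.append(f"{field} exceeds {MAX_FIELD_LENGTH} characters")
    return problems


def batch_items(items: List[Dict[str, str]],
                size: int = MAX_ITEMS_PER_UPLOAD) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With these helpers, a 25,000-item catalog would be split into three uploads of 10,000, 10,000, and 5,000 items, and items that fail validation can be held back for correction before any API call is made.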
1.2 Catalog Upload Operations
- How do you monitor Upload Catalog API success/failure rates in real time?
- What runbooks exist for handling catalog upload failures (400, 429, 500 errors)?
- How do you track ingestion IDs returned by successful uploads for downstream correlation?
- What alerting is in place when catalog uploads consistently fail or return unexpected errors?
- How do you schedule and coordinate catalog uploads to avoid exceeding the 10 requests per second rate limit?
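One way to stay under the 10 requests per second limit is to space calls evenly on the client side. The sketch below is an illustration only: the RateLimiter class is a hypothetical helper, it works within a single process, and it would not coordinate several uploaders that share the same limit.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Spaces calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # earliest start time of the next call

    def wait(self) -> None:
        """Block until the next call may start, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
```

A caller would invoke wait() immediately before each Upload Catalog API request. The clock and sleep parameters exist so the limiter can be tested without real delays.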
1.3 Upload Status Tracking Operations
- What mechanism do you use to review catalog upload status?
- How do you poll or monitor the Get Catalog Upload Status API for upload progress (INQUEUE, INPROGRESS, DONE, CANCELLED, FATAL)?
- What alerting is in place when uploads remain in INQUEUE or INPROGRESS beyond expected processing times?
- How do you handle CANCELLED or FATAL upload statuses?
- What process downloads and analyzes the report from the reportDownloadLink?
- How do you communicate catalog upload failures and their fixes?
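The polling and stuck-upload questions above can be sketched as a loop that stops on a terminal status and raises when an upload stays pending too long. This is an assumption-laden illustration: fetch_status is a hypothetical caller-supplied function that wraps the Get Catalog Upload Status API and returns the status string, and the default interval and timeout are placeholders rather than recommended values.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"DONE", "CANCELLED", "FATAL"}
PENDING_STATUSES = {"INQUEUE", "INPROGRESS"}


def wait_for_upload(fetch_status: Callable[[str], str], ingestion_id: str,
                    interval_s: float = 30.0, timeout_s: float = 3600.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a terminal status is seen; raise TimeoutError if the upload stays pending."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(ingestion_id)
        if status in TERMINAL_STATUSES:
            return status
        if status not in PENDING_STATUSES:
            raise ValueError(f"unexpected status {status!r} for {ingestion_id}")
        if clock() >= deadline:
            raise TimeoutError(f"{ingestion_id} still {status} after {timeout_s}s")
        sleep(interval_s)
```

The TimeoutError is a natural hook for the alerting asked about above, and a returned CANCELLED or FATAL status can be routed to the relevant runbook.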
1.4 SNS/SQS Notification Operations
- How do you monitor the SQS queue (JWOCatalogIngestionStatusQueue) for incoming upload result notifications?
- What alerting is in place for SQS message processing failures or dead-letter queue accumulation?
- How do you handle expired S3 presigned URLs (60-minute validity window)?
- What process ensures SNS subscription remains confirmed and active?
- How do you detect and handle duplicate or out-of-order SQS messages?
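Because standard SQS queues deliver messages at least once, duplicate handling usually comes down to making the message processor idempotent. The sketch below shows the idea under loud assumptions: the message field names ingestionId and status are hypothetical (the notification schema is not given here), and the in-memory set stands in for a durable store.

```python
from typing import Callable, Dict, Optional, Set, Tuple


class IdempotentProcessor:
    """Runs a handler at most once per (ingestion ID, status) pair."""

    def __init__(self, handler: Callable[[Dict[str, str]], None],
                 seen: Optional[Set[Tuple[str, str]]] = None) -> None:
        self._handler = handler
        # In-memory for illustration only; it does not survive restarts.
        self._seen = seen if seen is not None else set()

    def process(self, message: Dict[str, str]) -> bool:
        """Return True if the message was handled, False if it was a duplicate."""
        key = (message["ingestionId"], message["status"])  # hypothetical field names
        if key in self._seen:
            return False
        self._handler(message)
        # Record the key only after the handler succeeds, so a failed message can be redelivered.
        self._seen.add(key)
        return True
```

In a Lambda-based pipeline the seen-set would more plausibly be a keyed table with a conditional write, so that concurrent invocations agree on which one processes a given message.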
1.5 Retry and Error Handling
- What retry strategy is implemented for failed Upload Catalog API calls (especially 500 errors with retry recommendation)?
- How do you distinguish between retryable (500 with "please retry again") and non-retryable (500 without retry recommendation) errors?
- What retry strategy is implemented for failed Get Catalog Upload Status API calls?
- What alerting is in place when maximum retry attempts are exhausted?
- How do you handle 429 (Too Many Requests) responses with appropriate backoff?
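The retry questions above can be sketched as a small classifier plus a capped exponential backoff with jitter. Everything here is illustrative: call is a hypothetical zero-argument function that returns an (HTTP status code, response body) pair, matching on the literal text "please retry again" is an assumption about how the retry recommendation appears in the body, and the attempt and delay defaults are placeholders.

```python
import random
import time
from typing import Callable, Tuple


def is_retryable(status_code: int, body: str) -> bool:
    """429 is always retried; 500 only when the body carries a retry recommendation."""
    if status_code == 429:
        return True
    if status_code == 500:
        return "please retry again" in body.lower()  # assumed wording
    return False


def call_with_retry(call: Callable[[], Tuple[int, str]], max_attempts: int = 5,
                    base_delay_s: float = 1.0, max_delay_s: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Invoke `call` until it succeeds, fails non-retryably, or attempts run out."""
    attempt = 1
    while True:
        status_code, body = call()
        if not is_retryable(status_code, body) or attempt >= max_attempts:
            return status_code, body
        # Full jitter: sleep a random time up to the capped exponential delay.
        delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        attempt += 1
```

When the function returns a response that is still retryable, the maximum attempts have been exhausted, which is the point at which the alerting asked about above would fire.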
1.6 Observability, Deployment, and Readiness
- How do you implement observability in your catalog upload workload?
- How do you mitigate deployment risks for catalog integration changes?
- How do you know that you are ready to support the catalog upload workload?
1.7 Monitoring and Change Management
- How do you monitor catalog upload workload resources (Lambda, Step Functions, DynamoDB, SQS)?
- How do you implement changes to catalog upload infrastructure and processing logic?
1.8 Change Management Readiness
1.8.1 Change Control Process
- What formal change management process governs modifications to the catalog upload pipeline and supporting infrastructure?
- Who approves changes to production catalog systems and what is the approval workflow?
- How do you classify changes by risk level (standard, normal, emergency) and what controls apply to each?
- What change advisory board (CAB) or review process evaluates changes before deployment?
- How do you maintain a change log that records all modifications, approvers, and deployment timestamps?
1.8.2 Catalog Data and Schema Changes
- What is the process for adding new required or optional fields to catalog upload payloads?
- How do you handle Amazon-initiated catalog API schema changes (new fields, deprecations, validation rule updates)?
- What testing validates that catalog schema changes do not break existing upload pipelines?
- How do you coordinate catalog field mapping changes between your source systems (ERP, merchandising) and the Catalog API?
- What is the rollback procedure if a catalog schema change causes upload failures?
1.8.3 Catalog Upload Pipeline Changes
- What is the process for modifying Lambda functions, Step Functions state machines, or DynamoDB table schemas in the upload pipeline?
- How do you deploy pipeline changes without disrupting in-flight catalog uploads?
- What blue/green or canary deployment strategies are used for Lambda function updates?
- How do you handle changes to the catalog batching strategy (e.g., adjusting batch size or upload frequency)?
- What is the rollback procedure if a pipeline change causes upload processing failures?
1.8.4 SNS/SQS Notification Pipeline Changes
- What is the process for modifying SQS queue configurations, dead-letter queue policies, or Lambda message processors?
- How do you deploy notification pipeline changes without losing in-flight upload result messages?
- What testing validates that SQS message processing changes handle all message formats correctly?
- How do you handle changes to the SNS subscription (e.g., re-subscription after policy updates)?
- What is the rollback procedure if a notification pipeline change causes message processing failures?
1.8.5 IAM and Security Configuration Changes
- What is the process for updating IAM role permissions, STS credential policies, or API Gateway authorization settings?
- How do you deploy security configuration changes without causing HTTP 403 errors on active catalog uploads?
- What testing validates that IAM changes maintain the correct Invoke API permissions?
- How do you coordinate IAM role ARN changes with the Amazon allowlisting process?
- What is the rollback procedure if a security configuration change blocks catalog API access?
1.8.6 Source System and Integration Changes
- What is the process for handling changes to your catalog source systems (ERP upgrades, merchandising platform migrations)?
- How do you ensure catalog data extraction and validation logic remains compatible after source system changes?
- What testing validates that source system changes do not introduce data quality issues in catalog uploads?
- How do you coordinate catalog field mapping updates when source system schemas change?
- What is the fallback process if a source system change disrupts the catalog upload pipeline?
1.8.7 CloudFormation and Infrastructure Changes
- What is the process for updating CloudFormation stacks (SQSSetupTemplate, ConnectivityTestTemplate)?
- How do you deploy infrastructure changes without disrupting the SQS queue or Lambda functions?
- What testing validates that CloudFormation stack updates do not alter queue policies or Lambda configurations unexpectedly?
- How do you handle AWS service updates or deprecations that affect your catalog infrastructure?
1.8.8 Testing and Validation
- What pre-deployment testing is required for all catalog pipeline change types (unit, integration, E2E)?
- How do you validate changes in a staging environment that mirrors production before deployment?
- What smoke tests confirm catalog upload health immediately after a production deployment?
- How do you test changes against the full catalog lifecycle (prepare → upload → status check → notification → result download)?
1.8.9 Rollback and Recovery
- What is the maximum acceptable rollback time for each component (Lambda, Step Functions, DynamoDB, SQS, IAM)?
- How do you ensure every deployment is reversible and what automated rollback triggers are in place?
- How do you handle catalog data inconsistencies that may result from a partial deployment or rollback?
- What is the process for re-uploading catalogs after a rollback to ensure Amazon systems have the correct catalog state?
1.8.10 Communication and Coordination
- How do you communicate planned catalog pipeline changes to stakeholders (store operations, Amazon team, merchandising team)?
- What maintenance windows are defined for catalog pipeline changes and how are they communicated?
- How do you coordinate changes that span multiple teams (e.g., source system team + catalog pipeline team + Amazon onboarding)?
- What post-deployment review process captures lessons learned from each catalog pipeline change?
2. Security
2.1 IAM and Credential Security
- How are IAM role credentials used to invoke the Catalog API managed and rotated?
- Does the IAM role follow least privilege principles with only the necessary Invoke API permissions?
- How do you detect and respond to unauthorized API invocation attempts (HTTP 403 errors)?
- What controls prevent the IAM role ARN from being exposed or misused?
2.2 Catalog Data Security
- Is catalog data (product names, SKUs, pricing) encrypted in transit and at rest?
- What access controls restrict who can modify catalog data in your source systems?
- How do you prevent injection attacks through malformed catalog item fields?
- What audit logging captures all catalog upload requests and their outcomes?
2.3 SNS/SQS Security
- How are SQS queue policies configured to only accept messages from the authorized Amazon SNS topic?
- What controls prevent unauthorized access to the SQS queue or tampering with messages?
- How is the S3 presigned URL for upload results protected from unauthorized access?
- What controls ensure the SNS subscription confirmation process is secure?
2.4 API Authentication and Authorization
- How are STS credentials acquired and managed for Catalog API invocations?
- What controls ensure only allowlisted AWS accounts can invoke the Catalog API?
- How do you detect and respond to abnormal API usage patterns?
2.5 Security Events, Data Classification, and Incident Response
- How do you detect and investigate security events related to catalog operations?
- How do you classify catalog data (product information, pricing, store IDs)?
- How do you protect catalog data at rest in S3, DynamoDB, and other storage?
- How do you anticipate, respond to, and recover from catalog-related security incidents?
3. Reliability
3.1 Catalog Upload Availability
- What is the target availability SLA for your catalog upload pipeline?
- What fallback process exists when the Upload Catalog API is unavailable (e.g., manual template upload via Merchant Portal)?
- How do you handle partial upload failures where some items succeed and others fail within a batch?
- What reconciliation process ensures all catalog items are eventually uploaded successfully?
3.2 Upload Status Reliability
- What is the target availability SLA for the Get Catalog Upload Status API?
- How do you handle scenarios where the status API returns inconsistent or stale data?
- What fallback exists when the SNS/SQS notification pipeline fails to deliver upload results?
- How do you correlate ingestion IDs across the upload and status tracking flow?
3.3 SNS/SQS Pipeline Reliability
- What is the target availability for the SQS queue processing pipeline?
- How do you handle SQS message processing failures (dead-letter queue strategy)?
- What happens when the S3 presigned URL expires before the result is downloaded?
- How do you handle SNS subscription expiration or loss of subscription?
- What process re-subscribes to the SNS topic if the subscription is lost?
3.4 Data Integrity
- How do you ensure catalog data consistency between your source systems and Amazon's systems?
- What validation ensures uploaded catalog items match expected SKU and barcode formats?
- How do you detect and resolve catalog drift (differences between intended and actual catalog state)?
- How do you handle the up to 1-hour propagation delay for catalog items in Amazon systems?
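Catalog drift detection can be illustrated as a comparison of two snapshots keyed by SKU. The sketch assumes both the intended catalog (from the source system) and the actual catalog (for example, reconstructed from reports) are available as dictionaries; how the actual snapshot is obtained is not specified here, and detect_drift is a hypothetical helper.

```python
from typing import Dict, NamedTuple, Set


class Drift(NamedTuple):
    missing: Set[str]      # intended but absent from the actual catalog
    unexpected: Set[str]   # present in the actual catalog but not intended
    changed: Set[str]      # present in both, with differing attributes


def detect_drift(intended: Dict[str, dict], actual: Dict[str, dict]) -> Drift:
    """Compare two SKU-keyed snapshots and report the differences."""
    intended_skus, actual_skus = set(intended), set(actual)
    changed = {sku for sku in intended_skus & actual_skus
               if intended[sku] != actual[sku]}
    return Drift(intended_skus - actual_skus, actual_skus - intended_skus, changed)
```

Given the propagation delay of up to 1 hour mentioned above, a comparison run shortly after an upload may report differences that resolve on their own, so it can help to ignore items uploaded within that window.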
3.5 End-to-End Resilience
- How do you handle cascading failures across the data preparation → upload → status tracking → notification pipeline?
- What circuit breaker patterns prevent repeated failed uploads from overwhelming the API?
- How do you handle Step Functions execution failures or timeouts?
- What is the recovery process after an extended catalog upload pipeline outage?
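As an illustration of the circuit breaker pattern asked about above, the sketch below stops issuing uploads after a run of consecutive failures and allows a trial request once a cool-down has passed. The CircuitBreaker class is a hypothetical helper, and the threshold and cool-down defaults are placeholders.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; retries after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let trial requests through.
        return self._clock() - self._opened_at >= self._reset_timeout_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Opening (or re-opening after a failed trial) restarts the cool-down.
            self._opened_at = self._clock()
```

The uploader would check allow_request() before each call and report the outcome with record_success() or record_failure(). This simple version lets every request through once the cool-down ends, until one of them fails; a stricter variant would admit a single trial request.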
3.6 Data Protection and Fault Tolerance
- How do you back up catalog data and upload tracking records?
- How do you design your catalog upload workload to withstand component failures?
3.7 Backup and Recovery
- What is the backup strategy and frequency for catalog source data in your ERP or merchandising systems?
- How do you back up catalog upload tracking records (ingestion IDs, upload status, timestamps)?
- What is the backup strategy for catalog files stored in S3 (input files and upload result reports)?
- What is the Recovery Point Objective (RPO) for each critical data store (catalog source data, DynamoDB tracking records, S3 catalog files)?
- What is the Recovery Time Objective (RTO) for restoring the catalog upload pipeline after a failure?
- How do you validate that backups are complete, consistent, and restorable through regular restore testing?
- What automated backup verification processes confirm backup integrity on a scheduled basis?
- Are backups stored in a separate AWS Region or account for disaster recovery?
- What is the process for restoring catalog upload tracking records from DynamoDB backups after a data loss event?
- What is the recovery procedure when the SNS subscription is lost and must be re-established?
- How do you re-process failed or lost SQS messages from the dead-letter queue after a pipeline recovery?
- How do you reconcile catalog state in Amazon systems after a prolonged upload pipeline outage?
- What is the escalation process when automated recovery fails and manual intervention is required?
- Do you conduct disaster recovery drills, and if so, how frequently?
4. Performance Efficiency
4.1 Catalog Upload Performance
- What is the p99 response time for Upload Catalog API calls?
- How does upload performance scale with catalog size (approaching the 10,000 item limit)?
- What is the optimal batch size for catalog uploads to balance throughput and reliability?
- How do you manage the 10 requests per second rate limit during bulk catalog updates?
4.2 Upload Status Tracking Performance
- What is the p99 response time for Get Catalog Upload Status API calls?
- What polling interval is used for status checks and how is it optimized?
- How do you minimize unnecessary status polling (e.g., using SNS/SQS notifications instead)?
4.3 SNS/SQS Processing Performance
- What is the latency from upload completion to SQS message receipt?
- How quickly does the Lambda function process incoming SQS messages and download results?
- What is the throughput capacity for concurrent SQS message processing?
4.4 Data Preparation Performance
- How long does catalog data extraction and validation take from source systems?
- What optimizations reduce the time to prepare and validate catalog payloads?
- How does the Step Functions orchestration perform under peak catalog update volumes?
4.5 Rate Limiting
- How does the system handle 429 (Too Many Requests) responses?
- What retry-after and exponential backoff strategies are implemented?
- How do you distribute catalog uploads across time windows to stay within rate limits?
4.6 Demand Management
- How do you design your catalog upload workload to adapt to changes in catalog update frequency?
5. Cost Optimization
5.1 Compute and Infrastructure
- How are Lambda, Step Functions, and DynamoDB resources scaled for catalog upload operations?
- What auto-scaling policies handle peak vs. off-peak catalog update volumes?
- Are there opportunities to use reserved capacity or savings plans for predictable catalog update workloads?
5.2 API and Data Transfer Costs
- What is the cost per catalog upload API call?
- How do you minimize unnecessary API calls (e.g., only uploading changed items rather than full catalog)?
- What is the cost impact of retry logic on failed upload and status check calls?
- How do you optimize S3 storage costs for catalog files and upload results?
5.3 SNS/SQS Costs
- What is the cost per SQS message for upload result notifications?
- How do you minimize unnecessary SQS polling or Lambda invocations?
- What data retention policies govern SQS dead-letter queue messages?
5.4 Catalog Update Frequency Optimization
- How do you batch catalog updates to reduce the number of API calls?
- Do you upload only changed items rather than the full catalog each time?
- How do you balance catalog freshness requirements against API and infrastructure costs?
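Uploading only changed items can be sketched by keeping a fingerprint of each item as it was last uploaded and comparing against it on the next run. The helpers fingerprint and select_changed are hypothetical, items are assumed to be JSON-serializable dictionaries with an item_sku key, and where the previous fingerprints are stored is left open.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def fingerprint(item: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_changed(items: List[dict],
                   previous: Dict[str, str]) -> Tuple[List[dict], Dict[str, str]]:
    """Return (items that are new or changed, fingerprints for the current run)."""
    changed = []
    current = {}
    for item in items:
        sku = item["item_sku"]
        digest = fingerprint(item)
        current[sku] = digest
        if previous.get(sku) != digest:
            changed.append(item)
    return changed, current
```

The returned fingerprints would be persisted only after the upload is confirmed successful, so a failed upload is retried in full. SKUs present in the previous run but absent from the current one are not reported by this sketch and would need separate handling, such as archiving.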
6. Catalog Reporting
6.1 Reporting Mode Selection
- Have you evaluated which reporting mode best fits your catalog operations (Merchant Portal daily reports, Intra-day S3 reporting, Event feed via EventBridge)?
- What is your required frequency for catalog-related reporting data (daily reconciliation, hourly sync validation, near real-time event processing)?
- Does your existing data ingestion infrastructure support CSV-based (Intra-day) or JSON/API-based (Event feed) formats?
6.2 Merchant Portal Reporting
- Are daily Catalog reports being downloaded and reviewed from the JWO Merchant Portal?
- How do you use Merchant Portal reports to validate that catalog uploads were processed correctly?
- What process compares Merchant Portal catalog data against your source systems (ERP, merchandising) to detect drift?
- Who is responsible for reviewing catalog reports and what is the escalation process for discrepancies?
6.3 Intra-Day Reporting for Catalog Validation
- How do you use Intra-day order reports to validate that catalog items (SKUs, pricing) are correctly reflected in shopping transactions?
- What process detects orders containing items that were recently added, updated, or archived in the catalog?
- How do you correlate Intra-day report data with catalog upload ingestion IDs to confirm end-to-end catalog accuracy?
- What alerting is in place when Intra-day reports contain items with unexpected prices or missing SKUs?
6.4 Event Feed for Catalog Monitoring
Note: There is currently no dedicated event feed for catalog upload monitoring. The questions below apply only if your implementation uses CART events from the Event feed to indirectly validate catalog accuracy through transaction data.
- How do you use CART events from the Event feed to validate that catalog items are being recognized and priced correctly in transactions?
- What monitoring detects CART events with merchantSku values that do not match your current catalog?
- How do you use CART event promotion data (merchantPromotionId, promotionValue) to validate that catalog-linked promotions are applied correctly?
- What alerting triggers when CART events show pricing discrepancies compared to the uploaded catalog?
6.5 Catalog Reporting Data Integrity
- How do you reconcile catalog upload results (ingestion reports) against reporting data to confirm all items are active and correctly priced?
- What process detects catalog items that were uploaded but never appear in transaction reports (potentially inactive or misconfigured items)?
- How do you validate that archived catalog items no longer appear in new transaction reports?
- What reconciliation process compares the catalog state in Amazon systems against your source-of-truth catalog data?
6.6 Reporting Security and Access
- How are IAM roles and KMS keys for reporting access managed and rotated?
- What access controls restrict who can view or download catalog and transaction reporting data?
- How do you ensure product and pricing data in reports is handled per your data classification policies?
- What audit logging captures all reporting data access and downloads?
7. Sustainability
7.1 Resource Efficiency
- How do you minimize compute usage during periods with no catalog updates?
- What strategies reduce unnecessary processing for unchanged catalog items?
- How do you optimize Lambda execution duration and memory allocation for catalog operations?
- What idle resource management reduces environmental impact of the catalog pipeline?
7.2 Data Lifecycle Management
- How do you optimize retention of catalog upload tracking records in DynamoDB?
- What archiving strategies minimize long-term storage for upload results and reports?
- How do you efficiently purge obsolete catalog upload logs and SQS dead-letter messages?
- What compression techniques reduce storage requirements for catalog files in S3?
7.3 Network and Transfer Optimization
- How do you minimize network traffic by only uploading catalog deltas (changed items)?
- What batching strategies reduce the number of API calls and transmission overhead?
- How do you optimize catalog payload sizes to reduce data transfer?

