Reliability Design

Achieving stable operation through automatic token renewal, locking mechanisms, and error handling

Reliability · Token Renewal · Locking Mechanism · Error Handling · Idempotency

Background

When building an integration app that supports multiple locations, system reliability is critical.

Error handling and duplicate prevention, which the standard integration handled internally, have to be designed explicitly in a custom implementation. We implemented several countermeasures to prevent situations where "it stopped working without us noticing."

Design Challenges

The custom integration app needed to address the following issues:

  • Authentication token expiration - NextEngine API tokens have expiration dates
  • Duplicate processing - Risk of the same order being processed multiple times because Webhooks are resent
  • Concurrent execution - Risk of scheduled processes running at the same time
  • Unnoticed errors - Problems occur without anyone noticing

Automatic Authentication Token Renewal

Integration with the NextEngine API requires authentication tokens. These tokens have expiration dates, and the system stops working when they expire.

We implemented a mechanism that checks token expiration before each API call and renews the token automatically before it expires.

Token Renewal Flow

  • Before API call - Check token expiration
  • Expiration check - Check remaining days
      • 30+ days - Use as is
      • Less than 30 days - Start renewal process
      • Expired - Emergency renewal
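
The three branches of the expiration check can be sketched in Python as follows. This is a minimal illustration, not the app's actual code: the names `TokenState` and `classify_token` are hypothetical, and treating exactly 30 days remaining as "use as is" is an assumption based on the "30+ days" wording.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

# From the flow: renewal starts when fewer than 30 days remain.
RENEWAL_MARGIN = timedelta(days=30)


class TokenState(Enum):
    """Hypothetical names for the three branches of the expiration check."""
    VALID = "use as is"
    RENEW = "start renewal process"
    EXPIRED = "emergency renewal"


def classify_token(expires_at: datetime, now: datetime) -> TokenState:
    """Map the time left before expiry to one of the three branches."""
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return TokenState.EXPIRED
    if remaining < RENEWAL_MARGIN:
        return TokenState.RENEW
    return TokenState.VALID
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test at the boundary values.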

Renewal process details:

Renewal Process

  • Cooldown check - 5-minute cooldown to prevent continuous renewal
  • Get new token - Request to NextEngine API
  • Save token - Check save result
      • Success - Continue with new token
      • Failure - Fallback to environment variables
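
A rough sketch of the renewal steps, under stated assumptions: `fetch_token` and `save_token` are hypothetical callables standing in for the NextEngine API request and the token store, and the environment variable name `NE_ACCESS_TOKEN` is invented for illustration. The source does not say what the cooldown does to the caller, so here a call inside the cooldown simply keeps the current token.

```python
import os
import time
from typing import Callable, Optional

COOLDOWN_SECONDS = 5 * 60  # 5-minute cooldown between renewal attempts


class TokenRenewer:
    """Illustrative only: cooldown check, fetch, save, then fallback."""

    def __init__(
        self,
        fetch_token: Callable[[], str],       # hypothetical: requests a new token
        save_token: Callable[[str], bool],    # hypothetical: returns True on success
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_token
        self._save = save_token
        self._clock = clock
        self._last_attempt: Optional[float] = None

    def renew(self, current_token: str) -> str:
        now = self._clock()
        # Cooldown check: skip if the last attempt was under 5 minutes ago.
        if self._last_attempt is not None and now - self._last_attempt < COOLDOWN_SECONDS:
            return current_token
        self._last_attempt = now

        new_token = self._fetch()
        if self._save(new_token):
            return new_token  # success: continue with the new token

        # Save failed: fall back to the environment variable (name is an assumption).
        return os.environ.get("NE_ACCESS_TOKEN", current_token)
```

Recording the attempt time before the request means a failing renewal is also rate-limited, which is the point of the cooldown.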

Key Points of Token Management

  • Renewal with margin - Start renewal attempts 30 days before expiration
  • Prevent continuous renewal - 5-minute cooldown prevents wasteful requests
  • Multiple fallback paths - Alternative measures even when renewal fails
  • Alert display - Display alerts in admin panel on errors

In-Progress Lock Mechanism

When the same process runs more than once at the same time, data inconsistencies and duplicate sends can occur. We adopted distributed locking to prevent this.

Lock Acquisition and Release

Lock Mechanism

  • Attempt lock key acquisition - SETNX operation
  • Acquisition result - Check lock status
      • Acquisition success - Start process, set TTL=25min
      • Acquisition failed - Another process running, skip
  • Execute main process - Main processing
  • Release lock - Ensured execution in finally block

Lock Safety Design

  • With TTL (Time To Live) - Lock doesn't persist even on abnormal termination
  • 25-minute TTL - Setting with margin for normal processing time
  • Guaranteed release - Released even on error via finally block
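
The acquire, run, release pattern can be sketched as below. `InMemoryKV` is a hypothetical stand-in that only models "set if absent, with a TTL" inside one process; a real deployment would rely on the store's own atomic operation (in Redis, for example, `SET key value NX EX seconds`). The per-run owner value, which avoids deleting a lock that has expired and been taken by another run, is an addition not described in the source.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple

LOCK_TTL_SECONDS = 25 * 60  # 25-minute TTL


class InMemoryKV:
    """Hypothetical stand-in for the KV store (single process, not atomic)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def set_if_absent(self, key: str, value: str, ttl: float) -> bool:
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # a live lock exists
        self._data[key] = (value, now + ttl)
        return True

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def run_exclusively(kv: InMemoryKV, lock_key: str, job: Callable[[], None]) -> bool:
    """Run job() only if the lock is free. Returns False when skipped."""
    owner = uuid.uuid4().hex
    if not kv.set_if_absent(lock_key, owner, LOCK_TTL_SECONDS):
        return False  # another process is running: skip
    try:
        job()
        return True
    finally:
        # Released even when job() raises; only delete a lock we still own.
        if kv.get(lock_key) == owner:
            kv.delete(lock_key)
```

A skipped run returns `False` rather than raising, which matches the "skip, process next time" behavior for lock conflicts.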

Ensuring Idempotency

Idempotency is the property that executing the same process multiple times produces the same result as executing it once. It is essential here because Webhooks may be resent.

Idempotency Key Generation

Idempotency Check

  • Generate key - SHA1 hash (Order ID + Ship datetime + Tracking number)
  • Check key existence - Determine if already processed
      • Key exists - Already processed, skip
      • Key doesn't exist - New process, save key after completion (90 days)
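
A minimal sketch of the key generation and check, using Python's standard `hashlib`. The source does not say how the three fields are joined before hashing, so the `|` separator is an assumption, and the plain dict stands in for whatever store actually holds the keys.

```python
import hashlib
from typing import Callable, Dict

KEY_RETENTION_SECONDS = 90 * 24 * 60 * 60  # keys are kept for 90 days


def idempotency_key(order_id: str, shipped_at: str, tracking_number: str) -> str:
    """SHA1 over order ID, ship datetime, and tracking number.

    The "|" separator is an assumption; it keeps ("ab", "c") and
    ("a", "bc") from hashing to the same value.
    """
    raw = "|".join([order_id, shipped_at, tracking_number])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def process_once(
    store: Dict[str, float],       # hypothetical store: key -> expiry timestamp
    key: str,
    handler: Callable[[], None],
    now: float,
) -> bool:
    """Run handler() unless the key was already recorded. Returns False when skipped."""
    expires_at = store.get(key)
    if expires_at is not None and expires_at > now:
        return False  # already processed: skip
    handler()
    # Save the key only after the handler completes, as in the flow.
    store[key] = now + KEY_RETENTION_SECONDS
    return True
```

Because the key is saved after completion, a handler that fails leaves no key behind and the resent Webhook is processed normally.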

Why Idempotency Matters

  • Network failures - Resends occur when transmission succeeded but response didn't arrive
  • Timeouts - Processing completed but treated as failure due to timeout
  • Retries - Same request arrives multiple times from automatic retries on error

In all cases, idempotency prevents duplicate processing.

Log Hierarchy Structure

Appropriate log levels are set according to purpose:

  • DEBUG - Purpose: Detailed diagnostic info. Output condition: Development only
  • INFO - Purpose: Major operation records. Output condition: Production & Development
  • WARN - Purpose: Recoverable issues. Output condition: Production & Development
  • ERROR - Purpose: Critical errors. Output condition: Production & Development

Information Included in Logs

  • Timestamp - Recorded in ISO 8601 format
  • Environment name - production/staging/development
  • Process type - order_sync, fulfillment_sync, etc.
  • Process result - Success/failure, processed count, etc.
  • Error details - Message and stack trace on errors
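
As an illustration, a log entry carrying these fields could be built as below. The JSON field names and both function names are hypothetical, and letting INFO and above through in every environment (including staging, which the level list does not cover) is an assumption.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def should_emit(level: str, environment: str) -> bool:
    """DEBUG is development-only; INFO, WARN and ERROR are always emitted."""
    if level == "DEBUG":
        return environment == "development"
    return level in ("INFO", "WARN", "ERROR")


def build_log_entry(
    level: str,
    environment: str,
    process_type: str,
    result: Dict[str, Any],
    error: Optional[BaseException] = None,
    now: Optional[datetime] = None,
) -> str:
    """Return one JSON log line with the fields listed above."""
    entry: Dict[str, Any] = {
        # isoformat() on an aware datetime yields an ISO 8601 timestamp.
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "level": level,
        "environment": environment,
        "process_type": process_type,
        "result": result,
    }
    if error is not None:
        entry["error"] = {
            "message": str(error),
            "stack_trace": "".join(
                traceback.format_exception(type(error), error, error.__traceback__)
            ),
        }
    return json.dumps(entry, ensure_ascii=False)
```

One JSON object per line keeps the records searchable by field.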

Sensitive Information Protection

  • No token output - Authentication tokens not output to logs
  • Personal info exclusion - Customer personal info excluded from logs
  • Preview display - When needed, show only first 10 characters
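
The preview rule can be sketched as a small helper. The function name is hypothetical, and fully masking values of 10 characters or fewer (so that a short secret is never shown whole) is an assumption beyond what the list states.

```python
PREVIEW_LENGTH = 10  # show only the first 10 characters


def preview(secret: str) -> str:
    """Return a log-safe preview of a sensitive value."""
    if len(secret) <= PREVIEW_LENGTH:
        # Assumption: a short value would be revealed entirely, so mask it all.
        return "*" * len(secret)
    return secret[:PREVIEW_LENGTH] + "..."
```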

Multi-Layer Data Persistence Structure

Data is persisted across multiple layers, with a fallback path between them:

Data Persistence Layer

  • Layer 1: KV Storage (Primary Data) - Tokens (encrypted) · Store settings · Sync state · Locks
  • Layer 2: Environment Variables (Fallback) - Initial tokens · Encryption keys · Single-store settings (backwards compatibility)
  • Layer 3: Database (Audit Logs) - Webhook logs · Order sync logs · Error logs (7-day retention)
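
The read path from Layer 1 to Layer 2 can be sketched as a lookup that tries the KV store first and falls back to an environment variable. This is only an illustration: `load_setting` and the key names in the usage are hypothetical, and decryption of stored tokens is left out.

```python
import os
from typing import Mapping, Optional


def load_setting(
    kv: Mapping[str, str],                   # stand-in for the KV store
    kv_key: str,
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the KV value if present, otherwise the environment variable."""
    value = kv.get(kv_key)
    if value is not None:
        return value
    return environ.get(env_name)
```

For example, `load_setting(kv, "token:store-1", "NE_ACCESS_TOKEN")` would return the initial token from the environment until a renewed token has been saved to the KV store.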

Error Recovery Patterns

Appropriate recovery processing is designed for various errors:

  • Token expired - Response: Auto-renew → Retry. Result: Processing continues
  • Temporary API failure - Response: Retry on next schedule. Result: Auto recovery
  • Tracking number not entered - Response: Save as PENDING, reprocess later. Result: No data loss
  • Duplicate Webhook - Response: Skip via idempotency check. Result: Prevent double processing
  • Lock conflict - Response: Skip, process next time. Result: Maintain data consistency
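
The "auto-renew, then retry" pattern for an expired token can be sketched as follows. `TokenExpiredError`, `api_call`, and `renew` are hypothetical names, and limiting the retry to a single attempt is an assumption; any other failure is left to propagate so the next scheduled run can pick it up.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenExpiredError(Exception):
    """Hypothetical error raised when the API rejects an expired token."""


def call_with_renewal(
    api_call: Callable[[str], T],
    renew: Callable[[], str],
    token: str,
) -> T:
    """Call the API; on token expiry, renew once and retry once."""
    try:
        return api_call(token)
    except TokenExpiredError:
        new_token = renew()
        # A second expiry propagates to the caller instead of looping.
        return api_call(new_token)
```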

Monitoring and Alerts

The following monitoring is in place for early problem detection:

Regular Check Items

  • Token expiration - Warning at less than 30 days, alert at less than 7 days
  • Failure rate - Notify if failures continue above threshold
  • Unprocessed orders - Warning if orders remain unprocessed for long periods
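
The token expiration thresholds translate directly into a small check. The function name and the string return values are hypothetical.

```python
from typing import Optional

WARNING_DAYS = 30  # warning at less than 30 days
ALERT_DAYS = 7     # alert at less than 7 days


def token_expiry_severity(days_remaining: float) -> Optional[str]:
    """Return "alert", "warning", or None for the given days until expiry."""
    if days_remaining < ALERT_DAYS:
        return "alert"
    if days_remaining < WARNING_DAYS:
        return "warning"
    return None
```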

Alert Notifications

When problems occur, alerts are displayed in the admin panel. Alerts include:

  • Occurrence datetime
  • Problem type
  • Impact scope
  • Recommended actions

Comparison with Standard Integration

  • Error handling - Standard Integration: Black box. This System: Explicitly designed & implemented
  • Duplicate prevention - Standard Integration: Internal processing. This System: Managed with idempotency keys
  • Logs - Standard Integration: Hard to check. This System: Detailed records, searchable
  • Token management - Standard Integration: Automatic (details unknown). This System: Auto-renewal with fallback

Benefits

This design provides the following benefits:

  • Auto recovery - Many problems recover automatically
  • Visibility - Immediately aware when problems occur
  • Data preservation - Prevent duplicate processing and data loss
  • Peace of mind - No "stopped without noticing" situations

Operational Tips

Regular Log Review

Even when no problems occur, we recommend checking logs about once a week to confirm that warning-level issues aren't accumulating.

Manual Token Renewal

If automatic renewal continues to fail for some reason, you can also manually reissue and configure tokens.

Error Response

When error alerts appear, first check the details in logs. In many cases, issues are temporary and auto-recovered, but if they persist, root cause investigation is needed.
