Reliability Design

Background

Reliability is critical when building an integration app that supports multiple locations.

The standard integration handled error handling and duplicate prevention internally. In a custom implementation, we have to design both ourselves.
We implemented several countermeasures to avoid the situation where the system "stopped working without anyone noticing."

This work is easy to overlook, but it is where operational stability is won.

Design Challenges

The custom integration app needed to address the following issues:

Authentication token expiration - NextEngine API tokens expire
Duplicate processing - Risk of the same order being processed multiple times when a webhook is resent
Concurrent execution - Risk of scheduled processes running at the same time
Unnoticed errors - Risk of problems occurring without anyone noticing

Automatic Authentication Token Renewal

Integration with the NextEngine API requires authentication tokens.
These tokens expire, and the system stops working once they do.

We implemented a mechanism that checks token expiration before each API call and renews the token automatically before it expires.

Token Renewal Flow

Before API call - Check token expiration
Expiration check - Check remaining days
30+ days - Use as is
Less than 30 days - Start renewal process
Expired - Emergency renewal
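
The expiration check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the names `classify_token` and `TokenAction` are hypothetical, and the sketch assumes the expiry time is available as a timezone-aware datetime. Only the 30-day threshold and the three outcomes come from the flow.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

# Threshold from the flow: renew when fewer than 30 days remain.
RENEWAL_MARGIN = timedelta(days=30)


class TokenAction(Enum):
    """Possible outcomes of the check (hypothetical names)."""
    USE_AS_IS = "use_as_is"
    RENEW = "renew"
    EMERGENCY_RENEW = "emergency_renew"


def classify_token(expires_at: datetime, now: datetime) -> TokenAction:
    """Decide what to do with the token before an API call."""
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return TokenAction.EMERGENCY_RENEW  # already expired
    if remaining < RENEWAL_MARGIN:
        return TokenAction.RENEW  # less than 30 days left
    return TokenAction.USE_AS_IS  # 30 or more days left


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(classify_token(now + timedelta(days=45), now).value)  # use_as_is
print(classify_token(now + timedelta(days=10), now).value)  # renew
print(classify_token(now - timedelta(hours=1), now).value)  # emergency_renew
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the boundary cases easy to test.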

Renewal process details:

Renewal Process

Cooldown check - Skip if a renewal was already attempted within the last 5 minutes
Get new token - Request to NextEngine API
Save token - Check save result
Success - Continue with new token
Failure - Fall back to environment variables
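
A rough Python sketch of this renewal flow is shown below. It is not the actual implementation. The class `TokenRenewer`, the callbacks `fetch_token` and `save_token`, and the environment variable name `NE_ACCESS_TOKEN` are all hypothetical, and the real NextEngine API request is replaced by a callback. The sketch also assumes that a failed request is handled the same way as a failed save, which the flow does not state.

```python
import logging
import os
import time
from typing import Callable, Optional

COOLDOWN_SECONDS = 5 * 60  # 5-minute cooldown from the flow
FALLBACK_ENV_VAR = "NE_ACCESS_TOKEN"  # hypothetical variable name

logger = logging.getLogger(__name__)


class TokenRenewer:
    """Hypothetical sketch of the renewal flow."""

    def __init__(
        self,
        fetch_token: Callable[[], str],
        save_token: Callable[[str], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token  # stands in for the NextEngine API request
        self._save_token = save_token    # returns True if the token was persisted
        self._clock = clock
        self._last_attempt: Optional[float] = None

    def renew(self) -> Optional[str]:
        """Return a usable token, or None if skipped or no fallback exists."""
        now = self._clock()
        # Cooldown check: skip if a renewal was attempted in the last 5 minutes.
        if self._last_attempt is not None and now - self._last_attempt < COOLDOWN_SECONDS:
            return None
        self._last_attempt = now
        try:
            token = self._fetch_token()
            if self._save_token(token):
                return token  # success: continue with the new token
            logger.warning("token save failed; using fallback")
        except Exception:
            # Assumption: a failed request is treated like a failed save.
            logger.exception("token renewal failed; using fallback")
        return os.environ.get(FALLBACK_ENV_VAR)
```

A caller would construct the renewer once and call `renew()` whenever the expiration check asks for a renewal. A `None` result means either that the cooldown is still active or that no fallback value is configured.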

Key Points of Token Management

Renewal with margin - Renewal attempts start 30 days before expiration
Cooldown - A 5-minute cooldown prevents repeated, wasteful renewal requests
Multiple fallback paths - Alternatives remain available even when renewal fails, such as falling back to environment variables
Alert display - Errors are shown as alerts in the admin panel

In-Progress Lock Mechanism

When multiple instances of the same process run at the same time, data inconsistencies and duplicate sends can occur. We adopted a distributed lock to prevent this.

Lock Acquisition and Release

Lock Mechanism

Attempt lock key acquisition - SETNX operation
Acquisition result - Check lock status
Acquisition success - Start process, set TTL=25min
Acquisition failed - Another process running, skip
Execute main process - Main processing
Release lock - Ensured execution in finally block

Lock Safety Design

TTL (Time To Live) - The lock expires on its own, so it does not persist after abnormal termination
25-minute TTL - Leaves a margin over normal processing time
Guaranteed release - The lock is released in a finally block, even when an error occurs
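
The lock pattern can be sketched in Python as follows. To keep the example self-contained, a small in-memory class stands in for the shared key-value store. The names `InMemoryLockStore`, `set_if_absent`, `delete_if_owner`, and `run_exclusively` are hypothetical. The sketch sets the key and its TTL in a single step and releases the lock only if the caller still owns it. Both are assumptions added for safety and are not described in the text.

```python
import time
import uuid
from typing import Callable, Dict, Tuple

LOCK_TTL_SECONDS = 25 * 60  # 25-minute TTL from the text


class InMemoryLockStore:
    """Stand-in for a shared key-value store, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expiry)

    def set_if_absent(self, key: str, owner: str, ttl: float) -> bool:
        """Set the key with a TTL only if it is absent or already expired."""
        now = self._clock()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False
        self._data[key] = (owner, now + ttl)
        return True

    def delete_if_owner(self, key: str, owner: str) -> None:
        """Delete the key only if this owner still holds it."""
        current = self._data.get(key)
        if current is not None and current[0] == owner:
            del self._data[key]


def run_exclusively(store: InMemoryLockStore, key: str, job: Callable[[], None]) -> bool:
    """Run job under the lock; return False if another run holds it."""
    owner = uuid.uuid4().hex
    if not store.set_if_absent(key, owner, LOCK_TTL_SECONDS):
        return False  # another process is running, skip
    try:
        job()
        return True
    finally:
        store.delete_if_owner(key, owner)  # runs even if job() raises
```

If the store is Redis (the text names only the SETNX operation), `SET key value NX EX seconds` sets the key and its TTL atomically, which avoids a lock without a TTL if the process dies between the two steps. The ownership check on release guards against deleting a lock that another process acquired after this one's TTL ran out.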

Ensuring Idempotency

Idempotency is the property that executing the same operation multiple times produces the same result as executing it once.
It matters here because webhooks may be resent.

Idempotency Key Generation

Idempotency Check

Generate key - SHA1 hash (Order ID + Ship datetime + Tracking number)
Check key existence - Determine if already processed
Key exists - Already processed, skip
Key doesn't exist - New process, save key after completion (90 days)
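
A minimal Python sketch of this check follows. The names `idempotency_key`, `ProcessedKeys`, and `handle_shipment` are hypothetical, and an in-memory dictionary stands in for whatever store holds the keys. The text does not say how the three fields are joined before hashing, so the `|` separator is an assumption.

```python
import hashlib
import time
from typing import Callable, Dict

KEY_RETENTION_SECONDS = 90 * 24 * 60 * 60  # keys are kept for 90 days


def idempotency_key(order_id: str, shipped_at: str, tracking_number: str) -> str:
    """SHA-1 over the three fields; the '|' separator is an assumption."""
    raw = "|".join([order_id, shipped_at, tracking_number])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


class ProcessedKeys:
    """In-memory stand-in for the store of processed keys."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}  # key -> time at which it expires

    def seen(self, key: str) -> bool:
        expires = self._expiry.get(key)
        return expires is not None and expires > self._clock()

    def mark(self, key: str) -> None:
        self._expiry[key] = self._clock() + KEY_RETENTION_SECONDS


def handle_shipment(
    store: ProcessedKeys,
    order_id: str,
    shipped_at: str,
    tracking_number: str,
    process: Callable[[], None],
) -> bool:
    """Run process once per unique shipment; return False when skipped."""
    key = idempotency_key(order_id, shipped_at, tracking_number)
    if store.seen(key):
        return False  # already processed, skip
    process()
    store.mark(key)  # saved only after completion, as in the flow
    return True
```

Because the key is saved only after processing completes, a run that fails partway leaves no key behind and can be retried. The trade-off is that this check alone does not stop two deliveries that arrive at exactly the same moment.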

Why Idempotency Matters

Network failures - The request was delivered but the response never arrived, so the sender resends it
Timeouts - Processing completed, but the sender treated it as a failure because of a timeout
Retries - Automatic retries on error deliver the same request multiple times

In all cases, idempotency prevents duplicate processing.