2019-05-21 | Voice Service Advisory
Kyle Olexa
Event Description: Partial Voice Outage
Event Date: 05/21/2019
RFO Issue Date: 05/23/2019
Ticket Number: 70000
Scope of Impact
Customers on a subset of softswitch nodes experienced call completion errors on calls utilizing audio prompts and music resource files.
On May 21st, 2019, while performing routine, non-service-affecting maintenance, a Teleflex engineer inadvertently issued a command which reset network services on an LXD storage pool. Following this event, the storage pool was active and accessible to the majority of softswitch nodes, which had failed over to alternate storage pools and maintained normal operation. The Teleflex NOC began inspection of dependent nodes. Shortly thereafter, Teleflex received reports of call issues from users on a subset of softswitches, and engineers focused on the reported nodes to determine the cause of the isolated service issues.
Logs on the impacted softswitches showed errors reading certain audio files which, in certain dialplan actions, resulted in delayed call processing and call completion errors. The impacted softswitches had immediately re-established connections to the shared storage pool following the network event; however, the file handler pointers were flagged as stale on some nodes. Because the affected softswitches had re-established the underlying storage maps to the pool linked with the network event, the process for failover to secondary storage pools was interrupted on these particular nodes.
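The failure mode described above, where the connection to the pool is restored but individual file reads still fail, suggests checking at the file level rather than at the connection level. The following is a minimal Python sketch of that idea, assuming each storage pool is visible to the softswitch as a local directory. The function name `read_with_failover` and the ordered list of pool paths are hypothetical and are not taken from Teleflex systems.

```python
import os


def read_with_failover(relative_path, pool_roots):
    """Return (pool_root, data) from the first pool that can serve the file.

    pool_roots is an ordered list of hypothetical mount points, primary first.
    """
    last_error = None
    for root in pool_roots:
        full_path = os.path.join(root, relative_path)
        try:
            with open(full_path, "rb") as handle:
                return root, handle.read()
        except OSError as exc:
            # Any read failure (stale handle, I/O error, missing file)
            # moves on to the next pool instead of failing the call.
            last_error = exc
    if last_error is None:
        # No pools were supplied at all.
        raise FileNotFoundError(relative_path)
    raise last_error
```

With this approach a node whose primary pool is reachable but returning read errors would still fall through to a secondary pool, instead of relying only on the state of the underlying connection.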
To correct the issue, Teleflex went through each of the identified softswitches to validate active file handler pointers. For the affected nodes, engineers flushed the stale pointers and forced new connections to the active storage pools. Once pointers were re-established, services for each impacted softswitch node resumed normal operation with full access to audio source files. Softswitch #25 was manually restored during troubleshooting activities. Once the cause was isolated, a script to handle the refresh of storage pointers was successfully validated against Softswitch #46. The script was then deployed to the remaining identified nodes #30 and #34. Following resolution on the impacted nodes, the script was applied across all remaining (non-impacted) nodes to protect against any delayed impacts to nodes not initially affected.
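The validation step described above can be illustrated with a short Python sketch. It assumes the stale state surfaces to the operating system as the `ESTALE` error, which this advisory does not state. The function names and the mapping of node names to probe paths are hypothetical, and the actual flush and reconnect commands used by the script are not shown because they are not described here.

```python
import errno
import os


def handle_state(path, stat_fn=os.stat):
    """Classify a path as 'ok', 'stale', or 'error' by attempting a stat call."""
    try:
        stat_fn(path)
    except OSError as exc:
        # ESTALE is the errno Linux reports for a stale file handle.
        return "stale" if exc.errno == errno.ESTALE else "error"
    return "ok"


def nodes_needing_refresh(node_paths, stat_fn=os.stat):
    """Return the node names whose probe path reports a stale handle.

    node_paths maps a hypothetical node name to a path on its storage pool.
    """
    return [name for name, path in node_paths.items()
            if handle_state(path, stat_fn) == "stale"]
```

The `stat_fn` parameter exists so the check can be exercised against a stub before it is run on production nodes.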
Please note that all times listed in the timeline below are in 24-hour clock format, and refer to Central Daylight Time.
11:53 - Inadvertent restart of network services for LXD0001.
11:55 - LXD0001 back online and fully functional. Subsystem check completed.
12:03 - First reported call issues on Softswitches #34 and #25.
12:05 - NOC Escalation to Senior Engineering
12:10 - Log errors identified on #25 indicated problems opening audio resource files.
12:16 - Issue cause isolated to stale file handler pointers on #25.
12:24 - Stale file pointers on Softswitch #25 flushed and re-activated. Resolution of node issue verified.
12:38 - Script applied to Softswitch #46 and validated. Full service restored.
12:54 - Script applied to remaining identified nodes #30, #32, #34, #35 and #37. Full service restored.
13:14 - Script applied to all Softswitches not impacted by the event as a precaution against any further impacts.
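For reference, the intervals implied by the timeline above can be computed directly from the listed timestamps. This is a small Python sketch; the helper name `minutes_between` and the variable names are illustrative only, and all times are assumed to fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes between two same-day HH:MM timestamps in 24-hour format."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Network services reset to first reported call issues.
event_to_first_report = minutes_between("11:53", "12:03")    # 10
# First reported call issues to script applied on the last identified nodes.
first_report_to_restore = minutes_between("12:03", "12:54")  # 51
# Network services reset to precautionary script run on all other nodes.
event_to_precaution = minutes_between("11:53", "13:14")      # 81
```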
Maintenance schedules and practices will be reviewed to determine what can be moved to off-hour or planned maintenance windows. Teleflex will review supervision and review processes related to technician activities. Engineering is reviewing logs on the impacted nodes to compare the operational state of each impacted node against the nodes that successfully rolled to alternate connections, and will adjust failover algorithms to protect against the identified corner cases.
1510 Primewest Parkway | Suite 800
Katy, TX 77449