Trains.com

Norfolk Southern system-wide issue cancels VRE trains

4170 views
24 replies
  • Member since
    July 2008
  • 2,325 posts
Posted by rdamon on Tuesday, September 12, 2023 4:41 PM

Active/Active systems can become split-brained when they lose communication. Each side thinks it is in charge. The fact that nscorp.com was down shows how widespread this hit. Active Directory?
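For anyone wondering how split-brain is usually avoided: a common guard is a quorum rule, where a site stays in charge only if it can reach a strict majority of voters. A minimal Python sketch follows. The function name, the three-voter layout and the "witness" tie-breaker are illustrative assumptions, not anything NS is known to run.

```python
def may_act_as_primary(reachable_voters: int, total_voters: int) -> bool:
    """Hypothetical rule: stay in charge only with a strict majority of voters.

    reachable_voters counts the node itself plus every voter it can still contact.
    """
    return reachable_voters > total_voters // 2


# Assumed layout: two sites plus one tie-breaker ("witness") gives three voters.
TOTAL_VOTERS = 3

# The link between the two sites fails.
# Site A can still reach the witness (2 of 3); site B is alone (1 of 3).
site_a_active = may_act_as_primary(reachable_voters=2, total_voters=TOTAL_VOTERS)
site_b_active = may_act_as_primary(reachable_voters=1, total_voters=TOTAL_VOTERS)

# With only two voters and no witness, neither side has a majority after a split.
# Both step down: safe, but nothing is running.
two_node_split = may_act_as_primary(reachable_voters=1, total_voters=2)
```

With the witness in place only one side keeps running, so the two halves cannot both believe they are in charge.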

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Sunday, September 10, 2023 5:51 PM

blue streak 1
I thought that best practices require that, when 2 systems are running in parallel, they be isolated from each other whenever software or hardware is installed or changed on one of them, to prevent what happened to NS.

Having parallel computer systems is the easy part. A company like NS also depends on a communications network that spans much of the country, from Kansas City to the northern, southern and Atlantic Coast borders. That network is a computer 'tool' that is difficult to duplicate, and a systemic failure of it will leave any computer system that has to reach its users through it dead in the water.

Never too old to have a happy childhood!

              

  • Member since
    December 2007
  • From: Georgia USA SW of Atlanta
  • 11,919 posts
Posted by blue streak 1 on Sunday, September 10, 2023 3:51 PM

I thought that best practices require that, when 2 systems are running in parallel, they be isolated from each other whenever software or hardware is installed or changed on one of them, to prevent what happened to NS.

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Wednesday, September 6, 2023 3:40 PM

adkrr64
 
BaltACD

Why things didn't fail over to backups is a good question.  

I'm pretty sure the "fail over backups" oltmannd refers to are redundant computer systems that exist for no other reason than to come online and allow the business to run normally even after the primary system goes down. The failover system usually sits in the background, doing very little, though continually updating its data to match the primary system, so that the transition is relatively seamless when it needs to come on line. There is no way any modern business could switch to paper and run the business, especially one as far flung as a Class I railroad.

Of course, that means the company has paid to acquire and maintain the proper failover hardware, updates the software to keep it functional, and periodically tests the failover process to make sure it will work when it's really needed.

Note the NS statement - the failure took down not only their primary everyday box but also their 'Backup' system.  Don't know too many companies that have tertiary and quaternary backup systems up, running and able to handle the load of the primary system when Systems #1 and #2 both go belly up.  There are only so many levels of total system failure you can plan for.  NS's failures exceeded the limits of their real-time recovery from total system failure.

When I was involved with Chessie Computer Services Inc. in the late 1980s, CSX had two entire mainframe computer systems that had the same level of computational power.  Supposedly, it was reported to be the 7th largest IBM installation IN THE WORLD (not my claim, just what I had heard).  Both boxes were loaded with similar if not identical software.  The boxes were partitioned to run various company-wide applications in various partitions - supposedly either box could run the entire Corporation if and when necessary.  Each was the backup for the other.  One box was in Jacksonville and the other was in Baltimore, and they were linked with the highest-speed data link available at the time.

Never too old to have a happy childhood!

              

  • Member since
    February 2018
  • 299 posts
Posted by adkrr64 on Wednesday, September 6, 2023 3:17 PM

BaltACD

Why things didn't fail over to backups is a good question.  

 

I'm pretty sure the "fail over backups" oltmannd refers to are redundant computer systems that exist for no other reason than to come online and allow the business to run normally even after the primary system goes down. The failover system usually sits in the background, doing very little, though continually updating its data to match the primary system, so that the transition is relatively seamless when it needs to come on line. There is no way any modern business could switch to paper and run the business, especially one as far flung as a Class I railroad.

Of course, that means the company has paid to acquire and maintain the proper failover hardware, updates the software to keep it functional, and periodically tests the failover process to make sure it will work when it's really needed.
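To make the idea concrete, here is a rough Python sketch of a standby that promotes itself once the primary has been silent for too long. It is only an illustration under assumptions: the `Standby` class, the 5-second timeout and the method names are invented for this example and say nothing about how any railroad's systems actually work.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    """Hypothetical standby node that takes over when the primary goes quiet."""

    timeout_s: float               # assumed silence threshold before taking over
    last_heartbeat_s: float = 0.0  # time the primary was last heard from
    is_primary: bool = False

    def on_heartbeat(self, now_s: float) -> None:
        # The primary is alive; remember when we last heard from it.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # Promote only after the primary has been silent longer than the timeout.
        if not self.is_primary and now_s - self.last_heartbeat_s > self.timeout_s:
            self.is_primary = True
        return self.is_primary


standby = Standby(timeout_s=5.0)
standby.on_heartbeat(now_s=100.0)
still_standby = standby.check(now_s=103.0)  # 3 s of silence: keep waiting
took_over = standby.check(now_s=106.0)      # 6 s of silence: promote
```

Because the clock is passed in as a plain number, a periodic failover test can be as simple as feeding the standby a simulated silence and confirming that it takes over.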

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Wednesday, September 6, 2023 12:29 PM

oltmannd
 
rdamon

During a routine maintenance procedure performed by a vendor, the vendor's software created an error that caused the company's primary and recovery data-storage systems to become unresponsive, which then affected its core operational systems. Norfolk Southern didn't name the vendor but described it as a "leading global technology provider."

https://www.wric.com/business/press-releases/cision/20230901PH98837/norfolk-southern-provides-technology-outage-update/ 

It was more than PTC, but remember PTC sits on top of the heap.  Talks to train dispatching system, signal system, Yard inventory system (consist for braking algorithm), Crew call (sign in for train crew), and more.  Any one of these goes dark, PTC goes dark.

Why things didn't fail over to backups is a good question.  

Once computerized applications are implemented in the areas you have mentioned, there are no longer enough knowledgeable people available, or paper systems in place, to back up a failed computer and its applications.  In today's business/railroad world - when the computer stops, so does everything else.

Never too old to have a happy childhood!

              

  • Member since
    January 2001
  • From: Atlanta
  • 11,971 posts
Posted by oltmannd on Wednesday, September 6, 2023 10:22 AM

rdamon

During a routine maintenance procedure performed by a vendor, the vendor's software created an error that caused the company's primary and recovery data-storage systems to become unresponsive, which then affected its core operational systems. Norfolk Southern didn't name the vendor but described it as a "leading global technology provider."

https://www.wric.com/business/press-releases/cision/20230901PH98837/norfolk-southern-provides-technology-outage-update/

 

It was more than PTC, but remember PTC sits on top of the heap.  Talks to train dispatching system, signal system, Yard inventory system (consist for braking algorithm), Crew call (sign in for train crew), and more.  Any one of these goes dark, PTC goes dark.
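That dependency chain can be sketched in a few lines of Python. This is only an illustration: the subsystem names are shorthand for the systems listed above, and `ptc_available` is a made-up helper, not part of any real PTC implementation.

```python
# Shorthand names for the systems PTC talks to (illustrative, not official).
PTC_DEPENDENCIES = ("dispatching", "signals", "yard_inventory", "crew_call")


def ptc_available(health: dict[str, bool]) -> bool:
    """PTC is usable only if every system it depends on is up.

    A dependency missing from the health report is treated as down.
    """
    return all(health.get(name, False) for name in PTC_DEPENDENCIES)


all_up = {name: True for name in PTC_DEPENDENCIES}
one_down = dict(all_up, crew_call=False)

ptc_ok = ptc_available(all_up)
ptc_dark = not ptc_available(one_down)
```

The point is that availability multiplies down the chain: four healthy systems out of four are required, so a single outage anywhere is enough to stop the top layer.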

Why things didn't fail over to backups is a good question.  

-Don (Random stuff, mostly about trains - what else? http://blerfblog.blogspot.com/ )

  • Member since
    July 2008
  • 2,325 posts
Posted by rdamon on Monday, September 4, 2023 6:39 PM

During a routine maintenance procedure performed by a vendor, the vendor's software created an error that caused the company's primary and recovery data-storage systems to become unresponsive, which then affected its core operational systems. Norfolk Southern didn't name the vendor but described it as a "leading global technology provider."

https://www.wric.com/business/press-releases/cision/20230901PH98837/norfolk-southern-provides-technology-outage-update/

  • Member since
    December 2001
  • From: Northern New York
  • 25,008 posts
Posted by tree68 on Sunday, September 3, 2023 7:26 PM

BaltACD
Finding software issues CAN be maddening.

When I worked in an Army installation data processing center, our resident IBM tech told of a time he went on vacation.

It seems that IBM would put several functional areas on the cards for their computers.  In slot one, section A would get used.  In slot two, section B got used, etc.  

It was not uncommon to put a card with a non-functional section B in slot one, as that slot did not use section B. And a given slot may use several sections.

Multiply that times numerous cards and you may see where I'm going with this.

The regular tech knew which cards had which section non-functional, so would not put a card with a bad section A in slot one, etc.

The replacement tech didn't know which cards had which sections OOS.

When the regular tech returned, it took him a while to sort everything out and get the subject computer back on line...
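The knowledge that lived in the regular tech's head boils down to a small lookup. Here is a hedged Python sketch of it. The slot requirements, card labels and bad-section lists are all invented for illustration and are not taken from any real IBM hardware.

```python
# Hypothetical data: which sections each slot uses, and which sections are
# known to be non-functional on each card.
SLOT_NEEDS = {1: {"A"}, 2: {"B"}, 3: {"A", "C"}}
CARD_BAD_SECTIONS = {"card-17": {"B"}, "card-22": {"A"}, "card-30": set()}


def card_fits(card: str, slot: int) -> bool:
    """A card is usable in a slot if none of the sections the slot uses are bad."""
    return SLOT_NEEDS[slot].isdisjoint(CARD_BAD_SECTIONS[card])


bad_b_in_slot_one = card_fits("card-17", 1)  # slot one ignores section B
bad_a_in_slot_one = card_fits("card-22", 1)  # slot one needs section A
```

Written down like this, a replacement tech could have checked each swap instead of guessing.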

Larry
Resident Microferroequinologist (at least at my house) 
Everyone goes home; Safety begins with you
My Opinion. Standard Disclaimers Apply. No Expiration Date
Come ride the rails with me!
There's one thing about humility - the moment you think you've got it, you've lost it...

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Sunday, September 3, 2023 6:10 PM

abdkl
Way back in the mid-1970s I worked in the TOPS (Computer) Control Center (OC) for SP. On one swing shift our system (computer) terminal monitor printer reported a station with an error code of "SNO".

This code was not in our documentation. SP's Communication Data Control team had no alarms or lights to indicate a problem. Calling the yard with the terminal (an IBM 1050) didn't find a problem. We sent a test message to the machine and it went through. Everything was working and there was nothing we could do. We shrugged our shoulders and logged the event in our daily log and finished our shift.

Came to work the next day to be greeted by two of SP's System Programmers. While writing the program they identified all the conditions that could happen and assigned a code to each one. Then, just in case something COULD happen for which they had not coded a code…they assigned "SNO" for Should Not Occur.

Fortunately, the detail logs kept by the TOPS system -did- list more detail and the programmers were able to add the situation, and an additional code, to cover future events.

Back 'in the olden days' B&O installed an IBM 370 computer to operate Chicago Terminal.  The computer was connected to a number of ACI scanners to read the three foot or so high reflective bar codes that were applied to cars in the 1970s as the AAR's first attempt at Automatic Car Identification.  Around 1978 or '79 that 'experiment' was stopped, as the bar codes would get too dirty to be readable or get burned off the cars by thaw sheds or by loading hot steel or similar high-temperature loads into cars.  After the AAR ended the ACI requirements, the computer was patched to remove the ACI scanners from the computer.

About 1984 or '85, Chicago personnel undertook a routine maintenance shutdown and restart.  The computer would not come up.  Local personnel dealt with the situation for a number of hours and then informed the IT people in Baltimore, who immediately flew to Chicago to dig into the situation.  Another day goes by and still no joy.  With that lack of success, IBM at the field level was brought to bear against the situation.  Failure after failure, it got bucked up to the top level of IBM's technology people - who ultimately found a 'programmatic black hole' in the patches that had been installed about six years earlier that was causing the system to crash rather than reboot as intended.  Finding software issues CAN be maddening.

Never too old to have a happy childhood!

              

  • Member since
    August 2006
  • 38 posts
Posted by abdkl on Sunday, September 3, 2023 12:43 PM

Way back in the mid-1970s I worked in the TOPS (Computer) Control Center (OC) for SP. On one swing shift our system (computer) terminal monitor printer reported a station with an error code of "SNO".

This code was not in our documentation. SP's Communication Data Control team had no alarms or lights to indicate a problem. Calling the yard with the terminal (an IBM 1050) didn't find a problem. We sent a test message to the machine and it went through. Everything was working and there was nothing we could do. We shrugged our shoulders and logged the event in our daily log and finished our shift.

Came to work the next day to be greeted by two of SP's System Programmers. While writing the program they identified all the conditions that could happen and assigned a code to each one. Then, just in case something COULD happen for which they had not coded a code…they assigned "SNO" for Should Not Occur.

Fortunately, the detail logs kept by the TOPS system -did- list more detail and the programmers were able to add the situation, and an additional code, to cover future events.
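The same pattern is still good practice today: give every known condition its own code, keep one catch-all for anything unanticipated, and log enough detail to diagnose it later. A small Python sketch follows. Only "SNO" comes from the story above; the other codes, the condition numbers and the logger name are invented for illustration.

```python
import logging
from enum import Enum


class StationStatus(Enum):
    OK = "OK"
    LINE_DOWN = "LDN"         # hypothetical code
    PAPER_OUT = "PPR"         # hypothetical code
    SHOULD_NOT_OCCUR = "SNO"  # catch-all for conditions nobody anticipated


# Hypothetical mapping from a raw device condition to a status code.
KNOWN_CONDITIONS = {
    0: StationStatus.OK,
    1: StationStatus.LINE_DOWN,
    2: StationStatus.PAPER_OUT,
}

log = logging.getLogger("tops_sketch")


def classify(raw_condition: int) -> StationStatus:
    """Return the code for a known condition, or SNO plus a detailed log entry."""
    status = KNOWN_CONDITIONS.get(raw_condition)
    if status is None:
        # Record the raw value so programmers can add a proper code later.
        log.warning("unrecognized station condition: %r", raw_condition)
        return StationStatus.SHOULD_NOT_OCCUR
    return status
```

Calling `classify(99)` returns the SNO status and leaves the raw value in the log, which is the detail the programmers needed in order to add a real code for the situation.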

  • Member since
    December 2007
  • From: Georgia USA SW of Atlanta
  • 11,919 posts
Posted by blue streak 1 on Saturday, September 2, 2023 11:07 PM

Now we are learning it was a software problem.  Evidently the existing software crashed when a piece of hardware was installed.  Can understand why NS thought it was the hardware that was installed.  So it appears that the software did not like the new piece of hardware?  That hardware probably passed all the tests before being used and may have had other installations that worked fine.  The full explanation may take months.

Read all about it in the computer geek pubs.

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Saturday, September 2, 2023 8:43 PM

jeffhergert

tree68

 

Perhaps their system got hacked...

Apparently that's happened to a couple of area hospitals lately.

NS put out a statement saying an IT vendor's system had a problem and it affected NS.

I've heard that UP had some issues about the same time. It must not have been too bad. I've been on vacation, but haven't heard much discussion about any problems from coworkers.

Jeff

The stuff that has been hitting hospitals has been ransomware attacks, pay us and get your data back.  However, some have been hacked and had various forms of data stolen.  Locally Johns Hopkins has been hacked - fortunately just BEFORE I got referred into their system.  My data is on too many medical information systems!

Never too old to have a happy childhood!

              

  • Member since
    March 2003
  • From: Central Iowa
  • 6,898 posts
Posted by jeffhergert on Saturday, September 2, 2023 7:33 PM

[quote user="tree68"]

Perhaps their system got hacked...

Apparently that's happened to a couple of area hospitals lately.

 

[/quote]

NS put out a statement saying an IT vendor's system had a problem and it affected NS.

I've heard that UP had some issues about the same time. It must not have been too bad. I've been on vacation, but haven't heard much discussion about any problems from coworkers. 

Jeff 

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Friday, September 1, 2023 10:40 PM

One has to understand how dependent companies are today on computers and their applications for EVERY aspect of the company's operation - at all levels from the field to Board Room decisions.

Computers have been designed to replace people and the job functions they once performed and the manual paper systems that these people worked with.  When the computer/application comes, the people and their manual systems go - never to be replaced.

Never too old to have a happy childhood!

              

  • Member since
    December 2001
  • From: Northern New York
  • 25,008 posts
Posted by tree68 on Friday, September 1, 2023 9:28 PM

Perhaps their system got hacked...

Apparently that's happened to a couple of area hospitals lately.

Larry
Resident Microferroequinologist (at least at my house) 
Everyone goes home; Safety begins with you
My Opinion. Standard Disclaimers Apply. No Expiration Date
Come ride the rails with me!
There's one thing about humility - the moment you think you've got it, you've lost it...

  • Member since
    December 2017
  • 100 posts
Posted by PennsyBoomer on Friday, September 1, 2023 7:04 PM

I find it interesting that NS says they expect full recovery in the coming weeks (!). Granted it was evidently a system-wide outage - for half a day apparently - but the repercussions, given their time estimate, hint loudly at very poor recovery ability. And bad choice of "leading tech provider". You would think a rail system would have some backup capability as opposed to holding everything (presumably). Just more of the recent shut down mentality malaise that affects a floundering republic.

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Tuesday, August 29, 2023 5:54 PM

mudchicken
Unintended consequences of putting an untested/untried system in place back at the beginning of PTC? Next??

(Insert Chad Thomas popcorn emoji here.)

I don't know how NS performed their PTC installation.

CSX installed theirs incrementally a few subdivisions at a time with a big 'Help Desk' cadre to respond to issues as that happened and those issues would be resolved before installation was done on the next group of subdivisions.  Many Dispatcher Desks operate territories that have both PTC and non-PTC subdivisions.

I find it ironic that NS's spokesperson is named Conner Spielmaker.  Making his spiel heard around the industry. 

Never too old to have a happy childhood!

              

  • Member since
    December 2001
  • From: Denver / La Junta
  • 10,820 posts
Posted by mudchicken on Tuesday, August 29, 2023 5:41 PM

Unintended consequences of putting an untested/untried system in place back at the beginning of PTC? Next??

(Insert Chad Thomas popcorn emoji here.)

Mudchicken Nothing is worth taking the risk of losing a life over. Come home tonight in the same condition that you left home this morning in. Safety begins with ME.... cinscocom-west
  • Member since
    September 2003
  • 21,669 posts
Posted by Overmod on Tuesday, August 29, 2023 4:44 PM

More specifically, Railway Age said it was NS access to the PTC network, perhaps from their side as the supposed computer outages seem to have been at the same general time.  NS said it was fixed by 7pm yesterday (the Web site linked in the previous post was fully live by 10:30pm Central at my location).

The 'several weeks' was for the delays and problems from the outage to 'work themselves out', not additional repair or programming time.

 

  • Member since
    January 2019
  • From: Henrico, VA
  • 9,728 posts
Posted by Flintlock76 on Tuesday, August 29, 2023 3:57 PM

According to this source it's a PTC system problem.

https://railfan.com/norfolk-southern-snarled-by-positive-train-control-outage/

  • Member since
    July 2008
  • 2,325 posts
Posted by rdamon on Tuesday, August 29, 2023 7:12 AM

https://nscorp.mediaroom.com/2023-08-28-System-outage-update

ATLANTA, Aug. 28, 2023 /PRNewswire/ -- Norfolk Southern Corporation (NYSE: NSC) provided an update Monday on a technology outage that impacted rail operations:

 

This morning, Norfolk Southern experienced a hardware-related technology outage that impacted rail operations. At this time, we have no indication that this was a cybersecurity incident. Our teams worked throughout the day and successfully restored all systems at 7:00 p.m. ET. We are safely bringing our rail network back online. Throughout this, we have been in contact with our customers and will work with them on updated timing for their shipments. We expect the impact to our operations to last at least a couple of weeks. 

  • Member since
    July 2008
  • 2,325 posts
Posted by rdamon on Monday, August 28, 2023 3:41 PM

I am unable to hit https://www.nscorp.com/

  • Member since
    May 2003
  • From: US
  • 25,275 posts
Posted by BaltACD on Monday, August 28, 2023 2:51 PM

Heard that NS is having 'computer issues'.  Which computers and where are unknown.

Recall about 20 years ago when CSX got attacked with a computer virus that ended up crashing the CADS computers - took about 72 hours to resolve all issues and get back to normal.

Never too old to have a happy childhood!

              

  • Member since
    September 2008
  • 1,112 posts
Norfolk Southern system-wide issue cancels VRE trains
Posted by aegrotatio on Monday, August 28, 2023 2:07 PM

Norfolk Southern system-wide issue cancels VRE trains today.  No word on what the "system-wide issue" is.

 
