Topic: MISSION CRITICAL NETWORK DESIGN


Author: luse@ll.mit.edu (Paul Luse)
Date: Sun, 24 Jan 93 02:32:23 GMT
I'm running out of ideas and hope someone can help. My problem is to design
a network consisting of two 486 data acquisition machines connected to a
file server of some kind that can publish its data through NFS so that
multiple SGI machines can be used to display the collected data. The initial
solution was to run some kind of PC-NFS package on the acquisition machines
and have them write their data to a central SGI. We tested the timing for
a PC to write to a remote NFS volume and it was too long. The real trick here
is to complete the entire open-write-close process of 2048 bytes at a time
in under 30 ms. So far I have only found two ways of doing this: (1)
write the data _locally_ to a _cached_ hard drive, or (2) write the data to
a Novell fileserver. I have _not_ yet tested running the acquisition machines
under OS/2, LAN Manager, or Windows for Workgroups. Does anyone think that
any of these will make a difference? Anyway, back to the configuration that
is _almost_ what I want: a Novell fileserver acting as storage for the
acquisition machines, running NFS for Netware so that the SGIs can see the
data (there is no time constraint on the SGIs reading the data, only on the
acquisition machines writing it). This works great except for one small thing:
if a network cable goes bad or the fileserver crashes, then I am in trouble.
Suffice it to say that it is mission critical that this entire system either
stay UP or know the reason why, exactly. The damned 30 ms timeframe is the
killer here. I've modified the acquisition code to catch int 24 errors and
clear the EOJ status of the Netware connection so that I can totally
ignore Abort,Retry,Ignore errors without hanging things up and report to
my customized control hardware that something has DEFINITELY gone wrong with
the network, BUT this takes almost a total of 4 seconds because
(I'm assuming) of the timeout values used by the Netware shell. I set the
IPX RETRY COUNT to just 1 and the 4 second mark seems to be as quick as I
can get control back to my program. I verified that the IPX RETRY COUNT
was indeed affecting the time because as I raised it I saw a corresponding
rise in the 4 second time. To further break down the 4 second interval,
about 3600ms is taken when the program tries to open the data file (with a
cable break preventing it) and about 1100ms to do the Netware EOJ int21 call
that removes the "Error receiving on fileserver Abort,Retry,Ignore" error
after an attempt is made to write the file that couldn't be opened.
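
For anyone who wants to reproduce the measurement, here is a minimal sketch
in C of timing one open-write-close cycle of 2048 bytes against the 30 ms
budget. It is not the acquisition code: the names timed_record_write and
within_budget are made up for illustration, and it assumes a POSIX system
with clock_gettime. On the DOS machines a finer timer than the default
55 ms clock tick would be needed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define RECORD_BYTES 2048   /* record size quoted in the post */
#define BUDGET_MS    30.0   /* deadline quoted in the post */

/* Hypothetical helper: time one complete open-write-close cycle.
 * Returns the elapsed time in milliseconds, or -1.0 if any step fails. */
static double timed_record_write(const char *path, const void *buf, size_t len)
{
    struct timespec t0, t1;
    ssize_t n;
    int fd, rc;

    assert(path != NULL && buf != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1.0;                 /* e.g. server or path unreachable */
    n = write(fd, buf, len);
    rc = close(fd);                  /* always close, even after a short write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (n != (ssize_t)len || rc != 0)
        return -1.0;
    return (t1.tv_sec - t0.tv_sec) * 1000.0
         + (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
}

/* Hypothetical helper: 1 if the cycle succeeded within the budget, else 0. */
static int within_budget(double elapsed_ms)
{
    return elapsed_ms >= 0.0 && elapsed_ms < BUDGET_MS;
}
```

Calling timed_record_write once per record with a path on the volume under
test and passing the result to within_budget gives a pass/fail per record.
Note that a local write may only reach the OS cache, which is the same
caveat as the _cached_ local drive case above.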

If anyone has ANY suggestions on (1) how I can more quickly determine
whether or not the network is up (other than trying to open a file and
waiting the timeout period), (2) how I can kill the Novell Abort,Retry message
faster than 1100ms (currently I'm doing int21 AH=0xD6 BX=0x0000), or (3) another
network configuration that might lend itself to my needs better, I would
appreciate hearing them.

Also, I am required to keep the PC acquisition machines as PCs because
of the custom acquisition interfaces in them.

Please reply by EMAIL...

Thanks,

--
| Paul Luse            |  Kwajalein/Roi-Namur  |  (617) 981-2471  GMT-12   |
| GE International     |  Republic of the      |  cserve:71611,1767        |
| PO Box 8323          |  Marshall Islands     |  internet:luse@ll.mit.edu |
| APO AP  96557        | "Where the men are men and so are the women...."  |