Ethernet runs so much of
the modern world, and most of the time it works so well at
the physical layer, that the various layers of
retransmissions and timeouts are rarely needed.
There is a huge range in the performance of networks though,
and small issues can amplify significantly if they are not
handled correctly.
How confident would you be that an automated vehicle running
safety-critical data over 1000BASE-T1 Ethernet is going to be
tolerant to data corruption caused by a damaged connector
and excess vibration?
It is critical that we can test this type of scenario, and
Quarch has the tools to do so. This month we’ve been using
our own tools to help track down a customer issue. |
Customer's Challenge |
Our customer was running
a Quarch PPM and developing a complex automation script to
qualify storage devices. The issue was that the occasional
command timed out and failed. This was weird, as the main
power capture application was still running, and later
commands worked fine.
We were unable to recreate the issue in our lab, but the
customer helpfully shared a Wireshark trace (a capture of
the network traffic).
Analysing this showed a major difference between our network
and his. When capturing power data, our instrument streams
data back to the PC over a TCP socket.
Wireshark has a handy tool to chart TCP streams, showing the
data rate and packet sizes going past. First, we see the
performance of our network: |
![](http://esaindia.com/emailer/images/fault-injecting-on-ethernet.jpg) |
Quarch LAN
traffic during the test |
The customer network
looks very different: the data rate is all over the place,
and there are gaps of 1-2 seconds where no data is sent at
all. |
![](http://esaindia.com/emailer/images/fault-injecting-on-ethernet2.jpg) |
Customer LAN
traffic during a similar test |
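One way to put numbers on stalls like these is to scan the packet timestamps of a capture for long gaps. The sketch below is only illustrative: `find_stalls`, its 1-second threshold, and the example timestamps are hypothetical and are not taken from the customer trace. It assumes the arrival times have already been exported from the capture as a sorted list of seconds.

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], min_gap_s: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start_time, gap_length) for every gap between
    consecutive packets that lasts at least min_gap_s seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap >= min_gap_s:
            stalls.append((prev, gap))
    return stalls


# Hypothetical arrival times in seconds (made up for illustration).
example = [0.00, 0.01, 0.02, 1.52, 1.53, 1.54, 3.60, 3.61]

for start, gap in find_stalls(example):
    print(f"no data for {gap:.2f} s after t={start:.2f} s")
```

On a healthy streaming link a scan like this would be expected to report nothing, so any hit is a good place to start looking in the trace.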
We also saw TCP
‘retransmissions’ on the customer network, where a packet
has been corrupted or otherwise dropped by the network and
then has to be sent again. This is unusual on a LAN, but it
is a known condition that should be handled easily. |
Recreating the issue |
So now we could see that
something unusual was going on and that our product was not
acting as expected, but we did not know what the specific
issue was. It’s much easier to debug something when you can
repeat the failure in your own lab, with access to all the
development tools. It also avoids hassling the customer to
keep ‘trying things’ to give you more information. |
![](http://esaindia.com/emailer/images/rj-45-cable-module.png) |
Fortunately, Quarch is good at breaking networks. I grabbed
QTL2022 – RJ-45 Cable Breaker, a physical layer fault
injection tool that can disconnect or glitch a network link. |
Of course, I didn’t end
up making a nice photogenic test setup, but it was a working
one! |
![](http://esaindia.com/emailer/images/quarch-rj-45-breaker.png) |
Quarch RJ-45
breaker in line with the LAN connection to a (dismantled)
Quarch HD PPM |
With this, I could inject
a physical layer error into the LAN link, disrupting the
passing data. I didn’t need scripted automation, just simple
faults, so I used Testmonkey to control the QTL2022.
Using the ‘Signal Glitching’ option, I started with a single
10 ms glitch on all pairs while the PPM was running. This
immediately terminated the test, as the network link went
down.
A couple more experiments and I found that a 1 µs glitch was
short enough to avoid the Ethernet link going down but would
be plenty long enough to disrupt a packet if the glitch fell
in the middle of one. |
![](http://esaindia.com/emailer/images/glitch.png) |
Next, I had to actually
hit a packet. I used the PRBS (Pseudo Random Bit Sequence)
generator to create a random series of glitches across the
link. A 1 in 512 ratio means the link is disconnected for
about 0.2% of the total time (1/512 ≈ 0.195%) on average. |
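The arithmetic behind that figure is simple enough to sketch. The helper names below are hypothetical and this is not how Testmonkey or the QTL2022 computes anything internally. It only assumes that a "1 in N" ratio means the link is interrupted for 1/N of the time on average.

```python
def glitch_duty_cycle(one_in_n: int) -> float:
    """Average fraction of time the link is interrupted for a '1 in N' ratio."""
    if one_in_n <= 0:
        raise ValueError("ratio must be a positive integer")
    return 1.0 / one_in_n


def expected_interrupted_time(one_in_n: int, window_s: float) -> float:
    """Average total interruption (seconds) over a window of window_s seconds."""
    return window_s * glitch_duty_cycle(one_in_n)


duty = glitch_duty_cycle(512)
print(f"duty cycle: {duty:.3%}")  # prints "duty cycle: 0.195%"

# Over a hypothetical 6-second injection window:
total_ms = expected_interrupted_time(512, 6.0) * 1000
print(f"expected interruption: {total_ms:.1f} ms")
```

With 1 µs glitches, that small total is spread over a large number of very short interruptions, which is why some of them are likely to land inside a packet.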
Successful recreation of the
customer issue |
With this running, I got
a successful recreation of the customer issue. I saw the
occasional command time out and also saw disruption in the
TCP throughput. |
![](http://esaindia.com/emailer/images/tcp-trace-with-the-quarch-breaker.png) |
TCP trace with
the Quarch Breaker injecting faults from about 9-15 seconds |
The trace is not exactly
the same as the customer's, but I had created the same fault
scenario. This quickly allowed us to debug and find two
issues: |
- An incorrectly set
command timeout of 1 second, when it should have matched
the much longer TCP timeout.
- The HD network stack
was using a fixed 1-second retransmission timeout, rather
than adapting to the average round trip time of the link, as
is standard (meaning we were halting for longer than
necessary).
Both of these issues were
quickly patched, and the customer was back up and running. |
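For readers unfamiliar with adaptive retransmission timeouts, the standard calculation is described in RFC 6298: the sender keeps a smoothed round trip time and a variance estimate, and derives the timeout from both. The sketch below illustrates that calculation only; it is not the patched Quarch firmware. The class name, the 1 ms clock granularity, and the 200 ms floor are assumptions made for this example (RFC 6298 itself recommends a 1-second floor, while many stacks use a lower one).

```python
class RtoEstimator:
    """Adaptive retransmission timeout in the style of RFC 6298."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto=0.2, max_rto=60.0, clock_granularity=0.001):
        # min_rto and clock_granularity are assumptions for this sketch.
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.granularity = clock_granularity
        self.rto = 1.0  # initial RTO before any RTT sample is available

    def on_rtt_sample(self, rtt):
        """Update the estimate with a new RTT measurement (seconds)."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variance is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        """Back off exponentially after a retransmission timer expires."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


est = RtoEstimator()
for sample in (0.001, 0.0012, 0.0009):  # hypothetical LAN round trips
    est.on_rtt_sample(sample)
print(f"RTO after LAN samples: {est.rto:.3f} s")
```

On a LAN with millisecond round trips, an estimator like this settles at its floor, well below a fixed 1-second value, so a single lost packet stalls the stream for a much shorter time.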
Conclusions |
It’s incredibly
useful to have a way to recreate issues locally. Having
the right test kit for a problem can vastly reduce the
time taken to solve it.
Modern networks are
‘normally’ very robust, but unusual cases must be
accounted for. Packets can be lost at any time, and
complex ripples of timeouts can occur.
In this case, a sub-0.2%
physical interruption on the bus led to around a 40% loss
in practical throughput. Increasing the interruption to 1.5%
dropped the throughput by 90%.
In cases where network integrity becomes safety-critical,
a designer must be able to prove that a physical link
interruption can be handled. |
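To make the amplification in the figures above explicit, the short sketch below divides each throughput loss by the interruption that caused it. The function name is hypothetical, and the inputs are the approximate values quoted in this article (with the sub-0.2% figure taken as the 1 in 512 ratio), so the results are rough.

```python
def amplification(throughput_loss_pct: float, interruption_pct: float) -> float:
    """How many times larger the throughput loss is than the link interruption."""
    if interruption_pct <= 0:
        raise ValueError("interruption must be positive")
    return throughput_loss_pct / interruption_pct


# 1 in 512 interruption (about 0.195%) with around 40% throughput loss
low = amplification(40.0, 100.0 / 512)
# 1.5% interruption with around 90% throughput loss
high = amplification(90.0, 1.5)

print(f"~{low:.0f}x at 1 in 512, ~{high:.0f}x at 1.5%")  # prints "~205x at 1 in 512, ~60x at 1.5%"
```

The factor is largest at the smallest interruption, where each brief glitch costs a full timeout rather than just the time the link was physically broken.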
For more information contact:
Electro Systems Associates Pvt
Protocol Signal Group (PSG) - Sales Team
Website :
+91 98860 80011 / 91086 32174
Landline No.: (91) 80 6764 8835/36 |