Have you seen the movie “Thor: Ragnarok”? In one scene, Thor and Loki try to escape from their crazy sister, Hela, who had been imprisoned for a long time. To get from Earth to Asgard (the home of Thor and Loki), they use a very high-speed travel portal, the Bifrost bridge. As they try to escape from their sister through this bridge, a fight breaks out and Thor and Loki get knocked out of the bridge and fall onto a waste planet, Sakaar.
This is a good introduction to this article, in which we will discuss what packet loss is, its causes and effects, and how to solve it, or at least reduce the likelihood of it. We will also use GNS3 to simulate a network where packet loss exists.
Packet Loss and its causes
The short story of Thor and his evil sister is exactly how packets get lost. Simply put, packet loss is when packets traveling through a network medium get “knocked off” before reaching their destination. There are several reasons why packet loss happens, and we will look at some of them in this section.
Note: Every network will encounter issues like packet loss, from time to time. This is expected. However, these issues should not have too much of a negative impact on the performance of the network.
One of the major causes of packet loss is link congestion. A simple analogy is rush hour traffic when there are more cars on the road than the road can sufficiently handle. Another analogy is a 4-lane road merging into 2 lanes. What happens is that there are more packets arriving on a link than that link is designed to handle.
In some cases, even if the link can technically handle the amount of traffic reaching it, it has been configured to drop packets beyond a certain limit. An example of this is an organization that purchases 2 Mbps from its ISP. Even if the link can technically support up to 100 Mbps (e.g. MetroEthernet), the ISP will configure its devices to ensure that the organization can only push 2 Mbps worth of traffic. Anything more will usually be dropped (depending on the burst allowance the organization has agreed with the ISP).
Another example of network congestion is when service providers intentionally oversubscribe a link. The rationale is that not all subscribers of that service will be using the link simultaneously. However, during peak periods, when demand for the service exceeds its capacity, there will likely be packet loss resulting from congestion.
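The rate limit described above is commonly modelled as a token bucket. The following is a minimal sketch of that idea, not the configuration an ISP actually uses: the class name `TokenBucketPolicer`, the helper `drop_ratio`, the 24,000-bit burst, and the 1,500-byte packet size are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucketPolicer:
    """Single-rate policer: traffic above the contracted rate is dropped."""
    rate_bps: float        # contracted rate in bits per second
    burst_bits: float      # bucket depth, i.e. the allowed burst in bits
    tokens: float = 0.0
    last_time: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.burst_bits  # start with a full bucket

    def allow(self, now: float, packet_bits: int) -> bool:
        # Refill tokens for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        self.last_time = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True    # conforming packet: forward it
        return False       # exceeding packet: drop it


def drop_ratio(policer, offered_bps, packet_bits, duration_s):
    """Offer fixed-size packets at a constant rate; return the fraction dropped."""
    interval = packet_bits / offered_bps
    total = int(duration_s / interval)
    dropped = sum(
        1 for i in range(total) if not policer.allow(i * interval, packet_bits)
    )
    return dropped / total


# A 2 Mbps contract offered 4 Mbps of 1500-byte (12,000-bit) packets.
policer = TokenBucketPolicer(rate_bps=2_000_000, burst_bits=24_000)
print(f"dropped: {drop_ratio(policer, 4_000_000, 12_000, 10):.1%}")
```

With twice the contracted rate offered, roughly half of the packets are dropped once the initial burst allowance is used up, while traffic at or below 2 Mbps passes untouched.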
Another cause of packet loss, similar to network congestion, is overutilized devices. This means that a device is operating beyond the capacity it was designed for. In a network, packets may arrive faster than they can be processed and sent out. To handle this type of situation, many devices have buffers where they hold packets temporarily until they can be processed and sent out. However, on an overutilized device, the buffer will probably fill up quickly, resulting in excess packets being dropped.
For example, a Cisco ASA 5506-X is designed to handle up to 750 Mbps of throughput. If you use such a device at the network edge of an organization pushing more than that maximum throughput, you will definitely have an issue.
What happens in many instances is that the device performs well enough during normal (off-peak) operating times, but during peak periods there is a noticeable drop in performance, usually evident in the high CPU utilization of the device.
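The buffer behaviour described above can be sketched with a small simulation. This is a simplified model, assuming a single FIFO buffer that drops new arrivals when it is full (tail drop); the function name `simulate_tail_drop` and the per-tick numbers are made up for illustration.

```python
from collections import deque


def simulate_tail_drop(arrivals_per_tick, service_per_tick, buffer_size):
    """Return (forwarded, dropped, still_queued) for a finite FIFO buffer."""
    queue = deque()
    forwarded = dropped = 0
    for arrivals in arrivals_per_tick:
        # Packets arriving in this tick are buffered if there is room.
        for _ in range(arrivals):
            if len(queue) < buffer_size:
                queue.append(1)
            else:
                dropped += 1       # buffer full: the packet is lost
        # The device then processes as many packets as it can in one tick.
        for _ in range(min(service_per_tick, len(queue))):
            queue.popleft()
            forwarded += 1
    return forwarded, dropped, len(queue)


# Off-peak: 2 packets arrive per tick, the device can process 3.
print(simulate_tail_drop([2] * 10, service_per_tick=3, buffer_size=4))
# Peak: 5 packets arrive per tick, the device can still only process 3.
print(simulate_tail_drop([5] * 10, service_per_tick=3, buffer_size=4))
```

In the off-peak run nothing is dropped. In the peak run the buffer fills within the first tick and the excess packets are dropped on every tick after that.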
Faulty hardware and/or software
Another cause of packet loss is faulty hardware. This could be a component of a device or the whole device itself. For example, I once worked on a project where the ISP was providing the organization with 100 Mbps but the organization was still struggling with poor Internet access, especially during peak periods. What we discovered was that the interface on the edge router connecting the organization to its ISP was only able to pass about 30 Mbps of the 100 Mbps! The interface had failed (for whatever reason), and once we moved the link to another interface, the performance improved immediately.
Closely related to faulty hardware is buggy software running on the network device. As with any other software, it is usually impossible for the development team to catch all the bugs in the software running on network devices, and one such bug may result in packet loss.
Here are some examples of software bugs in Cisco devices resulting in packet loss:
Wireless versus Wired networks
The type of network medium can also be a cause of packet loss. Generally speaking, wireless networks suffer more setbacks than their wired counterparts. For example, radio frequency interference can be a major issue on wireless networks resulting in packet loss. Other challenges on a wireless network that can result in packet loss include weak signal, distance limitations, and (improperly configured) roaming.
In the case of wired networks, faulty cables can result in packet loss. This could result from the fact that the cable is not properly terminated or that the cable is damaged, causing issues for the electrical signal meant to flow through the cable.
Security attacks
I was once called to troubleshoot a problem in a datacenter. The network team suddenly noticed a major degradation in network performance, so much so that even accessing the network devices for management was very slow. We identified the affected devices: two edge routers operating in active/standby mode. Thinking it was a hardware problem on the active device (due to its high CPU utilization), we did a manual failover to the standby device, only to see the same problem appear there.
This made us focus on the traffic being received by these devices. Upon further investigation, we noticed that a particular IP address was attacking the network by flooding the devices with traffic, incapacitating them. Blocking that IP address stopped the attack and brought the network back to its normal operating condition.
The attack described above is an example of a Denial of Service (DoS) attack and can result in legitimate packets being dropped because a device is overwhelmed with attack traffic.
Faulty configuration
The last cause of packet loss we will consider in this article is faulty configuration. A typical example is a speed and duplex mismatch between two devices on a link. If one device is configured for half-duplex while the other is configured for full-duplex, there will likely be collisions, resulting in packet loss on the link.
Effects of Packet Loss
The effects of packet loss vary depending on the protocol/application concerned. TCP is generally designed to handle packet loss because of the acknowledgment and retransmission of packets – if a packet gets lost (i.e. no acknowledgment is received for that packet), it will usually be retransmitted. UDP, on the other hand, does not have inbuilt retransmission capability and may not handle packet loss as well. However, irrespective of the protocol/application, too much loss of packets is definitely a problem.
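The difference between the two behaviours can be illustrated with a toy simulation. This is a deliberately simplified sketch: it assumes every transmission is lost independently with the same probability and ignores acknowledgments, timers, and congestion control, so it is not a model of real TCP. The function names are hypothetical.

```python
import random


def send_without_retransmission(num_packets, loss_rate, rng):
    """UDP-like: each packet is sent once; lost packets stay lost."""
    return sum(1 for _ in range(num_packets) if rng.random() >= loss_rate)


def send_with_retransmission(num_packets, loss_rate, rng, max_tries=50):
    """TCP-like: resend each packet until it gets through (or max_tries is hit)."""
    delivered = transmissions = 0
    for _ in range(num_packets):
        for _ in range(max_tries):
            transmissions += 1
            if rng.random() >= loss_rate:
                delivered += 1
                break
    return delivered, transmissions


rng = random.Random(1)  # fixed seed so the run is repeatable
print("sent once:", send_without_retransmission(1000, 0.15, rng), "of 1000 delivered")
delivered, transmissions = send_with_retransmission(1000, 0.15, rng)
print("with retries:", delivered, "delivered using", transmissions, "transmissions")
```

Retransmission delivers everything, but at the cost of extra transmissions and the delay they add, which is why heavy loss still hurts applications that run over TCP.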
Examples of applications that do not handle packet loss well are Voice over IP (VoIP) and some types of video. You have probably been on calls (e.g. Skype, WhatsApp) where there is a noticeable performance issue, like “robotic speech” or completely missed audio. This is usually a result of packet loss (along with other factors like bandwidth, delay, and jitter).
According to Cisco recommendations, packet loss should be kept below 1% for VoIP traffic and below a limit that ranges from 0.05% to 5% for video, depending on the type of video.
Lab: Packet Loss in GNS3
Let us investigate the effects of packet loss using a simple lab in GNS3. To make this as realistic as possible, we will introduce the NETem appliance which emulates a link and is able to introduce various factors like bandwidth, delay, and packet loss on a link. This functionality is actually built into the Linux kernel – the NETem appliance just makes it easier to configure.
To follow along with the lab setup, download and install GNS3.
Our lab setup is as shown below:
The NETem appliance is transparent on the network so PC1 and R1 are actually on the same 10.0.0.0/24 network, thinking they have a direct connection.
The easiest test we can do on the network is a ping test. Let us ping from PC1 to R1:
PC1> ping 10.0.0.1
10.0.0.1 icmp_seq=1 timeout
84 bytes from 10.0.0.1 icmp_seq=2 ttl=255 time=23.168 ms
84 bytes from 10.0.0.1 icmp_seq=3 ttl=255 time=6.965 ms
84 bytes from 10.0.0.1 icmp_seq=4 ttl=255 time=14.084 ms
84 bytes from 10.0.0.1 icmp_seq=5 ttl=255 time=13.407 ms

PC1> ping 10.0.0.1
84 bytes from 10.0.0.1 icmp_seq=1 ttl=255 time=6.999 ms
84 bytes from 10.0.0.1 icmp_seq=2 ttl=255 time=12.637 ms
84 bytes from 10.0.0.1 icmp_seq=3 ttl=255 time=12.203 ms
84 bytes from 10.0.0.1 icmp_seq=4 ttl=255 time=11.711 ms
84 bytes from 10.0.0.1 icmp_seq=5 ttl=255 time=6.818 ms
As you can see from the output above, we received a reply to almost all the ping echo packets.
Note: The first ping packet timed out due to ARP. After that initial ping, pings should not time out as long as the ARP cache still contains the MAC address of the other host.
Now, we will configure the NETem appliance to introduce loss on the network. When we open the console (telnet) connection to that appliance, the default interface is as shown below:
What I want to do is apply a 15% loss in a symmetric manner i.e. both ways.
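Since the NETem functionality lives in the Linux kernel, a comparable loss can be configured on a plain Linux host with the `tc` utility. The sketch below only builds and prints the command lines rather than running them (running them needs root privileges). The helper `netem_loss_command` and the interface names `eth0` and `eth1` are assumptions for illustration, not what the GNS3 appliance runs internally.

```python
def netem_loss_command(interface, loss_percent):
    """Return the tc command (as an argv list) that adds random loss on egress."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "loss", f"{loss_percent:g}%"]


# netem acts on outgoing traffic, so loss in both directions on a two-port
# bridge means applying the same rule to both interfaces (names are hypothetical).
commands = [netem_loss_command(name, 15) for name in ("eth0", "eth1")]
for argv in commands:
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run` on a test machine once you are sure of the interface names.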
Now when we test with ping again, we see that some ping packets are lost:
PC1> ping 10.0.0.1
10.0.0.1 icmp_seq=1 timeout
84 bytes from 10.0.0.1 icmp_seq=2 ttl=255 time=10.554 ms
10.0.0.1 icmp_seq=3 timeout
84 bytes from 10.0.0.1 icmp_seq=4 ttl=255 time=5.864 ms
84 bytes from 10.0.0.1 icmp_seq=5 ttl=255 time=5.807 ms

PC1> ping 10.0.0.1
10.0.0.1 icmp_seq=1 timeout
84 bytes from 10.0.0.1 icmp_seq=2 ttl=255 time=6.068 ms
84 bytes from 10.0.0.1 icmp_seq=3 ttl=255 time=3.839 ms
84 bytes from 10.0.0.1 icmp_seq=4 ttl=255 time=3.460 ms
84 bytes from 10.0.0.1 icmp_seq=5 ttl=255 time=6.079 ms
If you replicate this lab, try the ping over and over again and you will notice that the packets lost each time will differ slightly. Also, reduce/increase the packet loss and see what effect it has on the network.
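When loss is applied in both directions, a ping fails if either the echo request or the echo reply is dropped, so the loss seen by ping is higher than the configured figure. The sketch below assumes the 15% applies independently in each direction, which is an assumption about how the symmetric setting behaves; the function names are hypothetical.

```python
import random


def ping_loss_probability(one_way_loss):
    """Chance that a ping fails when request and reply are lost independently."""
    return 1 - (1 - one_way_loss) ** 2


def simulate_pings(count, one_way_loss, seed=0):
    """Fraction of simulated pings that fail in either direction."""
    rng = random.Random(seed)
    failed = 0
    for _ in range(count):
        request_ok = rng.random() >= one_way_loss
        reply_ok = rng.random() >= one_way_loss
        if not (request_ok and reply_ok):
            failed += 1
    return failed / count


print(f"expected ping loss: {ping_loss_probability(0.15):.2%}")
print(f"simulated ping loss: {simulate_pings(10_000, 0.15):.2%}")
```

Under that assumption, 15% loss each way works out to about 27.75% of pings failing, which is worth remembering when you compare ping results against the value configured on the appliance.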
Side note: Something very interesting to try is to replace PC1 with a router or any device that can be used to open a telnet/ssh connection (VPCS doesn’t support this). Next, configure R1 to accept remote connections and then try to manage R1 remotely (telnet/ssh) from the other device you just added. What you will notice is that at 10% packet loss, the remote connection will be relatively smooth. However, at 30%, you will notice typing delays. You can experiment with lower/higher values.
Diagnosing Packet Loss
While there is no strict approach to detecting packet loss on a network, there are a couple of steps and tools you can use. You will usually start from a place of user experience, that is, users are complaining about poor network performance or they are experiencing some of the effects of packet loss that we have discussed above. From that point, you will want to start troubleshooting to either confirm that the problem exists or exclude packet loss as the cause of the problem e.g. an application problem.
One of the most evident signs that packet loss is occurring on a network is devices with high CPU utilization. As we already discussed, this can have several causes, such as overutilized devices, faulty hardware/software, or even an attack. If you find such a device on the network (e.g. through your network management system), you will want to troubleshoot why it has high CPU utilization. Cisco has a good guide for troubleshooting high CPU utilization on its devices.
R1#show processes cpu
CPU utilization for five seconds: 5%/0%; one minute: 3%; five minutes: 1%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
   1           8        67        119  0.00%  0.00%  0.00%   0 Chunk Manager
   2           4        39        102  0.00%  0.01%  0.00%   0 Load Meter
   3           0         1          0  0.00%  0.00%  0.00%   0 chkpt message ha
   4           0         1          0  0.00%  0.00%  0.00%   0 EDDRI_MAIN
   5        1200        70      17142  0.00%  0.57%  0.29%   0 Check heaps
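If you collect this output from many devices, the summary line can be checked automatically. The following is a rough sketch that parses the first line of `show processes cpu` as it appears above; the function names and the 80% threshold are assumptions for illustration, not a Cisco recommendation.

```python
import re

CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_utilization(output):
    """Extract the CPU summary percentages, or None if the line is missing."""
    match = CPU_LINE.search(output)
    if match is None:
        return None
    five_sec, interrupt, one_min, five_min = (int(v) for v in match.groups())
    return {
        "five_sec": five_sec,
        "five_sec_interrupt": interrupt,  # share of CPU time spent at interrupt level
        "one_min": one_min,
        "five_min": five_min,
    }


def is_cpu_high(stats, threshold=80):
    """Flag sustained load; the default threshold is an arbitrary example value."""
    return stats["one_min"] >= threshold or stats["five_min"] >= threshold


line = "CPU utilization for five seconds: 5%/0%; one minute: 3%; five minutes: 1%"
stats = parse_cpu_utilization(line)
print(stats, "high" if is_cpu_high(stats) else "normal")
```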
Assuming there are no easy-to-detect causes of packet loss on the network such as high CPU utilization, then you can continue your troubleshooting using tools like ping and traceroute. By consistently sending ping packets (of various sizes), you may be able to determine that there is loss on the network. Once this has been identified, you can then use traceroute to try to determine which hop in the path from sender to receiver is causing the packet loss. MTR, a tool that combines the functionality of ping and traceroute in one, can also be used to continuously monitor the performance of a particular path, and report packet loss if any.
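To turn repeated ping runs into a number you can track, the output can be summarized with a short script. This sketch assumes output in the VPCS format used in the lab earlier (a `timeout` line for each lost packet); other ping implementations print different text and would need different patterns. The function name `summarize_ping` is hypothetical.

```python
import re

REPLY = re.compile(r"bytes from \S+ icmp_seq=\d+ ttl=\d+ time=([\d.]+) ms")
TIMEOUT = re.compile(r"icmp_seq=\d+ timeout")


def summarize_ping(output):
    """Count replies and timeouts in VPCS-style ping output."""
    rtts = [float(value) for value in REPLY.findall(output)]
    lost = len(TIMEOUT.findall(output))
    sent = len(rtts) + lost
    return {
        "sent": sent,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent if sent else 0.0,
        "avg_rtt_ms": sum(rtts) / len(rtts) if rtts else None,
    }


sample = "\n".join([
    "10.0.0.1 icmp_seq=1 timeout",
    "84 bytes from 10.0.0.1 icmp_seq=2 ttl=255 time=10.554 ms",
    "10.0.0.1 icmp_seq=3 timeout",
    "84 bytes from 10.0.0.1 icmp_seq=4 ttl=255 time=5.864 ms",
    "84 bytes from 10.0.0.1 icmp_seq=5 ttl=255 time=5.807 ms",
])
print(summarize_ping(sample))
```

For the sample run, two of the five packets timed out, so the reported loss is 40%. A handful of packets is far too small a sample to estimate the real loss rate, so send many more pings before drawing conclusions.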
When troubleshooting packet loss on a device, it is worth taking a look at the interface statistics. Many vendors provide command-line or GUI tools for viewing the statistics on network interfaces. These tools reveal information such as the number of packets that have gone in and out of an interface, the number of errors, the size of the input and output queues, and whether there have been any drops, e.g. due to a full buffer.
R1#show interfaces FastEthernet 0/0
FastEthernet0/0 is up, line protocol is up
  Hardware is Gt96k FE, address is c201.6bcd.0000 (bia c201.6bcd.0000)
  Internet address is 10.0.0.1/24
  MTU 1500 bytes, BW 10000 Kbit/sec, DLY 1000 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Half-duplex, 10Mb/s, 100BaseTX/FX
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:05, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     38 packets input, 3652 bytes
     Received 1 broadcasts, 0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog
     0 input packets with dribble condition detected
     73 packets output, 7754 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier
     0 output buffer failures, 0 output buffers swapped out
R1#
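The counters most relevant to packet loss can be pulled out of this output with a few regular expressions. This is a small sketch written against the output format shown above; the function names and the choice of counters are assumptions, and other platforms or software versions may word these lines differently.

```python
import re

# Counter name -> pattern whose first group is the value of interest.
PATTERNS = {
    "input_queue_drops": r"Input queue: \d+/\d+/(\d+)/\d+",
    "output_drops": r"Total output drops: (\d+)",
    "input_errors": r"(\d+) input errors",
    "crc_errors": r"(\d+) CRC",
    "output_errors": r"(\d+) output errors",
    "collisions": r"(\d+) collisions",
}


def interface_counters(text):
    """Return the loss-related counters found in 'show interfaces' output."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counters[name] = int(match.group(1))
    return counters


def duplex_mode(text):
    """Return 'half', 'full' or 'auto' if a duplex setting is reported."""
    match = re.search(r"(Half|Full|Auto)-duplex", text)
    return match.group(1).lower() if match else None


excerpt = """\
  Half-duplex, 10Mb/s, 100BaseTX/FX
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 1 interface resets
"""
print(duplex_mode(excerpt), interface_counters(excerpt))
```

A single reading says little on its own. Counters that keep increasing between two readings are the ones to investigate.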
Here are a few articles to help you identify/troubleshoot input and output drops:
- Troubleshooting input and output queue drops on Cisco devices
- Understanding dropped packets in Juniper
- Network statistics in Linux
Finally in this section on troubleshooting packet loss, you want to consider packet capturing using a traffic diagnostic tool like Wireshark. These tools are typically able to capture and analyse traffic based on several performance characteristics, including detecting packet loss.
Fixing Packet Loss
Solving the issue of packet loss on a network is usually as simple as identifying the cause, and finding a fix for that cause.
- If a link is congested, perhaps you should consider getting a “fatter” pipe so that you can push more traffic through that link. You can also consider applying Quality of Service (QoS) features so that certain types of traffic (e.g. VoIP) are given priority over other traffic that is not so sensitive to loss or critical to operations.
- For devices that are pushed beyond their capacity, the only solution may be to upgrade to a higher-performance device. In some cases, only a component of the device needs to be upgraded. For example, you should not use a Fast Ethernet interface for a 100 Mbps link because, even though the theoretical limit of Fast Ethernet is 100 Mbps, in practice you will probably not be able to hit that limit. Use a Gigabit Ethernet interface instead.
- Swap out faulty hardware/cables and upgrade software as soon as new releases are available (upon adequate testing).
- Depending on your environment, you may opt for a wired connection (a physical network cable) instead of a wireless one. For wireless networks, you should work on reducing interference as much as possible. One way is to move to a less crowded channel. If distance is not a limitation and your devices support it, you can move to the 5 GHz band, which suffers less interference and has more non-overlapping channels, resulting in less congestion and contention. A WiFi analyzer can further help you find issues and spotty areas in your WiFi network.
- If under attack, try to mitigate that attack as fast as you can. This can be as simple as using an ACL to block the IP address of the attacker (if static and known). In more complex cases, you can use features like Remotely Triggered Black Hole Routing or a DDoS-prevention cloud service like Cloudflare.
- Finally, check that your configuration is not causing packet loss. Ensure that duplex settings match on both ends of a link (or just leave them on Auto). If you have configured QoS, ensure that your buffer sizes are adequate.
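The QoS idea from the first item in the list above can be sketched as strict priority forwarding: when the link cannot carry everything, the least important traffic is dropped first. This is a toy model, not how any particular vendor implements QoS; the function name, the traffic classes, and the priority numbers are made up for illustration.

```python
def forward_with_priority(packets, capacity):
    """Split packets into (forwarded, dropped) for one congested interval.

    packets: list of (name, priority) tuples; a lower number is more important.
    capacity: how many packets the link can carry in this interval.
    """
    # sorted() is stable, so arrival order is preserved within each class.
    ordered = sorted(packets, key=lambda packet: packet[1])
    return ordered[:capacity], ordered[capacity:]


arrivals = [("web1", 2), ("voip1", 0), ("web2", 2), ("voip2", 0), ("backup1", 3)]
forwarded, dropped = forward_with_priority(arrivals, capacity=3)
print("forwarded:", [name for name, _ in forwarded])
print("dropped:", [name for name, _ in dropped])
```

With room for only three packets, both VoIP packets get through and the loss falls on the web and backup traffic. The total loss is unchanged, but it lands on the traffic that tolerates it best.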
The effects of packet loss, such as inaudible audio calls and grainy video, can be very annoying. As we have seen in this article, packet loss can be caused by a variety of things like congestion, security attacks, and even the network medium being used. To combat this issue, identify the cause using tools like ping, MTR, show commands, and packet captures, and then fix it.