famous-bugs 🐛 🐝 🐜
A curated list of bugs, problems and failures that developers may find useful to know.
Table Of Contents
Introduction
As software developers, we can simply define our work as fixing bugs and developing solutions to problems. This is a curated list of problems and bugs that developers may find useful to know. I hope it becomes a community-driven list that creates value.
Problems
Thundering Herd Problem
Bieber would post a photo, and so many Beliebers would "Like" it that Instagram's computers couldn't keep up.
When Justin Bieber posted a photo, so many Beliebers would "Like" it that it caused a tremendous number of notifications, queries and processes. This problem faced by the Instagram team is a very good example of the thundering herd problem. They made many improvements to prevent it from happening again and explained them in this article.
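One common mitigation for a thundering herd is to coalesce concurrent cache misses so that only one caller recomputes a hot value while the others wait for it. The following is a minimal sketch of that idea, not Instagram's actual fix; `CoalescingCache` and `slow_like_count` are hypothetical names made up for illustration.

```python
import threading
import time


class CoalescingCache:
    """Cache in which concurrent misses for the same key share a single load."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical expensive function, e.g. a DB query
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects the two dicts and the counter
        self.load_count = 0              # how many times the loader really ran

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key runs the loader; the rest queue up here.
        with key_lock:
            with self._guard:
                if key in self._values:  # filled in while we were waiting
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
                self.load_count += 1
            return value


def slow_like_count(photo_id):
    time.sleep(0.05)                     # stands in for an expensive query
    return 1_000_000


cache = CoalescingCache(slow_like_count)
threads = [threading.Thread(target=cache.get, args=("photo-1",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache.load_count)                  # 50 concurrent requests, one real load
```

Without the per-key lock, all 50 threads would miss the cache at the same moment and run the expensive query at once.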
Further readings:
N+1 Query Problem
... turns into n+1 requests since the item has n associated items.
The N+1 problem occurs when code loads the children of a parent in a relationship (e.g. one-to-many relations). Nearly all ORMs enable lazy loading by default. Assume that you want to build a list of records that includes data from their relations. One query is issued to fetch the parents, and then N more queries are issued for the N parent records (one per parent, to fetch its related data). As you can expect, doing N+1 queries instead of a single query floods your database with queries, which is something we should avoid. Fortunately, ORMs have known about the problem for a long time and already have built-in solutions. The solution is simple: tell the ORM in advance that you will need the additional data (eager loading).
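The difference can be sketched without any ORM by counting queries against an in-memory SQLite database. The `authors`/`books` schema and both function names are assumptions made up for this example. Real ORMs implement eager loading in different ways, often with a JOIN like the one here or with a second `IN (...)` query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO books VALUES (1, 1, 'Notes'), (2, 2, 'Compilers'), (3, 3, 'Kernels');
""")


def titles_lazy(conn):
    """N+1 pattern: one query for the parents, then one more per parent."""
    queries = 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_eager(conn):
    """Eager pattern: parents and children fetched together in a single JOIN."""
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:            # authors without books still appear
            result[name].append(title)
    return result, 1


lazy_result, lazy_queries = titles_lazy(conn)
eager_result, eager_queries = titles_eager(conn)
print(lazy_queries, eager_queries)       # 4 queries (N+1 with N=3) versus 1
```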
Further readings:
Single Point of Failure Problem
A huge internet outage has affected large swaths of the internet, including major sites like Amazon, Reddit, Twitter and Twitch ... The source of the problem is the Fastly content delivery network (CDN), which has confirmed a global disruption.
A single point of failure (SPOF) is a part of a system that stops the entire system from working when it fails. SPOFs are undesirable in any system with a goal of high availability or reliability. Besides the software and hardware components of the system, the cloud vendor can also be a SPOF. The big internet outage on June 8, 2021, in which the failure of Fastly, a CDN vendor, affected many major sites, is a perfect example of this case.
Further readings:
Year 2000 Problem
... making the year 2000 indistinguishable from 1900.
The Year 2000 problem (also known as the Y2K problem, the Millennium bug, the Y2K bug, the Y2K glitch, or Y2K) refers to failures caused by the formatting and storage of calendar data for dates beginning in the year 2000. Many failures were documented all over the world. Here are some examples:
- On 1 January 1999, taxi meters in Singapore stopped working.
- On 1 January 2000, in Onagawa, Japan, an alarm sounded at a nuclear power plant at two minutes after midnight.
- On 1 March 2000, in the United States, the Coast Guard's message processing system was affected.
- Norway and Finland had to change their national identification numbers to correctly indicate the century in which a person was born.
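The root of such failures can be sketched in a few lines: when only two digits of the year are stored, the century has to be guessed. The function below is a hypothetical illustration, and the pivot value of 50 is an arbitrary assumption. "Windowing" with a pivot was one common remediation, which postpones the ambiguity rather than removing it.

```python
def expand_two_digit_year(yy, pivot=None):
    """Expand a two-digit year (0-99) to four digits.

    Without a pivot, every value is assumed to be 19xx, as many legacy
    systems did. With a pivot, values below it are read as 20xx.
    """
    if pivot is None or yy >= pivot:
        return 1900 + yy
    return 2000 + yy


# Age of someone born in '65, computed in the year '00.
age_naive = expand_two_digit_year(0) - expand_two_digit_year(65)
age_windowed = expand_two_digit_year(0, pivot=50) - expand_two_digit_year(65, pivot=50)
print(age_naive, age_windowed)   # -65 versus 35
```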
Further readings:
Outages and Hacks
Gangnam Style Broke YouTube
We never thought a video would be watched in numbers greater than a 32-bit integer.
YouTube
YouTube's counter previously used a signed 32-bit integer, which means the maximum number of views it could count was 2,147,483,647. Then "Gangnam Style" surpassed the 2-billion-view mark. YouTube has since upgraded to a 64-bit integer, so the maximum number of views a video can receive is now 9,223,372,036,854,775,807.
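The two limits follow directly from the bit widths. Python integers never overflow, so this sketch uses `ctypes` fixed-width types to imitate what a signed 32-bit counter does when it is incremented past its maximum. Whether YouTube's counter would actually have wrapped like this is an assumption; the function names are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1    # 2,147,483,647
INT64_MAX = 2**63 - 1    # 9,223,372,036,854,775,807


def add_view_int32(count):
    """Increment a counter stored as a signed 32-bit integer (wraps on overflow)."""
    return ctypes.c_int32(count + 1).value


def add_view_int64(count):
    """Increment a counter stored as a signed 64-bit integer."""
    return ctypes.c_int64(count + 1).value


wrapped = add_view_int32(INT32_MAX)
widened = add_view_int64(INT32_MAX)
print(wrapped)   # -2147483648: the 32-bit counter wraps to a negative number
print(widened)   # 2147483648: the 64-bit counter keeps counting
```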
Mysterious Traffic Of A Flower Image On Wikimedia
20% of all requests to one of our data centers for media are for this image of a flower. Nobody knows why.
On February 3, 2021, the Wikimedia tech team noticed that they were getting about 90M requests per day from various ISPs in India. The requests were all for a single image file (see below). They hypothesised that some mobile app had hotlinked the image, e.g. for a splash screen. Wikimedia did not reveal the app, but people thought that it might be an alternative app for TikTok, because around that time India decided to ban TikTok entirely, which directed people to alternative apps. The butterfly effect that started with this decision turned the incident into a memorable one for the tech teams of both Wikimedia and the mysterious application.
See also:
NPM Leftpad Breakage
An 11 line npm package called left-pad with only 10 stars on github was unpublished...it broke some of the most important packages on all of npm.
Azer Koçulu had published a simple piece of code he wrote to npm, and it became very popular. Many projects used his package as a dependency. On March 11, he received an email from a patent and trademark agent working for Kik, a messaging app. Kik was also the name of another of his packages. They wanted him to rename the kik package, but he refused. The agent then pushed npm to do so. After npm's decision, Azer Koçulu took down all of his packages, including left-pad. Then many JavaScript programmers around the world started getting the error message "npm ERR! 404 'left-pad' is not in the npm registry".
Further readings:
Heathrow Terminal 5 Opening
... simple real scenarios which for some reason weren’t tested.
Heathrow Terminal 5 was officially opened on 14 March 2008. On the day of opening, it did not operate as planned, which forced British Airways to cancel 34 flights and suspend baggage check-in.
The new baggage handling software couldn’t handle some simple real scenarios which for some reason weren’t tested. For example, a bag was manually retrieved because its owner had forgotten something in it. In this case, the program failed and the item wasn’t recorded. Baggage processing was repeatedly disrupted by small details of this kind.
During the following ten days, around 42,000 items weren’t delivered to owners, and over 500 flights were canceled. Check-in to other flights became temporarily unavailable.
Further readings:
- Technical glitches hit T5 opening
- Heathrow Terminal 5 at Wikipedia
Stack Overflow Outage On July 20, 2016
The regular expression was: ^[\s\u200c]+|[\s\u200c]+$ ...
On July 20, 2016, Stack Overflow experienced a 34-minute outage because a malformed post caused one of the regular expressions to consume high CPU on the Stack Overflow web servers. The effect was that of a typical regular expression denial of service (ReDoS), and it was triggered because the malformed post was displayed on the homepage for a while, so every render of the homepage ran the expensive regular expression match against it. As the homepage is used for health checks by the load balancers, the entire site became unavailable.
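The quoted expression trims whitespace (and U+200C) from both ends of a string. On a long run of whitespace that is not at the end of the string, a backtracking engine can retry the `[\s\u200c]+$` branch from every position in the run, which gives quadratic work. A trim written as two index scans does the same job in linear time. The sketch below is an assumption about what such a replacement could look like, not Stack Overflow's actual code; `trim` and `is_trimmable` are hypothetical names.

```python
import re

# The pattern from the post-mortem, kept only to compare results on small inputs.
TRIM_PATTERN = re.compile(r"^[\s\u200c]+|[\s\u200c]+$")


def is_trimmable(ch):
    """True for whitespace and for U+200C (zero-width non-joiner)."""
    return ch.isspace() or ch == "\u200c"


def trim(text):
    """Strip trimmable characters from both ends with two linear scans."""
    start, end = 0, len(text)
    while start < end and is_trimmable(text[start]):
        start += 1
    while end > start and is_trimmable(text[end - 1]):
        end -= 1
    return text[start:end]


sample = "  \u200c hello world \u200c "
print(repr(trim(sample)))                    # 'hello world'
print(trim(sample) == TRIM_PATTERN.sub("", sample))

# A long whitespace run followed by a non-space character: the scan from the
# right stops at once, and the scan from the left walks the run a single time.
hostile = " " * 20000 + "x"
print(trim(hostile))                         # x
```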
Further readings:
GitLab Database Outage
...accidentally deleted the production base. What made things even worse is that the directory holding the copies was empty too — the backups had not been made for a long time due to a configuration error...
On January 31, 2017, GitLab faced an incident that is a good lesson about the importance of backups. They planned a major change to their database server setup. During the process, the production database was accidentally deleted. Things got worse when they realized that backups had not been taken for a while because of a configuration issue. Most probably they did their best under a great deal of panic, but the result was an 18-hour outage and the loss of 300 GB of customer data. GitLab published an honest and detailed postmortem about the outage. You can also watch this video about how the incident occurred and was fixed.
PHP Git Commit Incident
Hi everyone,
Yesterday (2021-03-28) two malicious commits were pushed to the php-src repo [1] from the names of Rasmus Lerdorf and myself. We don't yet know how exactly this happened ...
On March 28, 2021, Nikita Popov (one of the maintainers) reported that two malicious commits had been pushed to the php-src repository under his name and that of Rasmus Lerdorf (the creator of PHP). The exact cause had not been published yet, but Nikita said that everything pointed towards a compromise of the self-hosted server (git.php.net) rather than a compromise of individual git accounts. This is a good example of a supply-chain attack, in which threat actors target elements in the supply chain of a project, such as an open source project, library, or another component that is relied upon.
Further readings:
- Changes to Git commit workflow
- PHP backdoor attempt shows need for better code authenticity verification
October 4th Facebook Outage
... its root cause was a faulty configuration change on our end.
On October 4, 2021, at 15:39 UTC, Facebook, Messenger, Instagram, WhatsApp, Mapillary, and Oculus became globally unavailable for more than seven hours. The outage also broke the "Log in with Facebook" feature used for accessing third-party sites. Facebook stated that the root cause was a configuration change on the backbone routers that coordinate network traffic between its data centers.
Cloudflare Outage 2019
... we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic ...
The Cloudflare outage on July 2, 2019, was triggered by a configuration error during a routine deployment of a new rule in the company's Web Application Firewall (WAF) software. This error caused the entire Cloudflare network to crash, leading to widespread service disruptions for numerous websites and online services that relied on Cloudflare's infrastructure. The outage affected a wide range of online platforms, including popular websites, e-commerce platforms, social media networks, and even government services.
Equifax Data Breach 2017
In September of 2017, Equifax announced it experienced a data breach, which impacted the personal information of approximately 147 million people.
In 2017, Equifax, one of the largest consumer credit reporting agencies, experienced a devastating data breach that compromised the personal information of millions of individuals. The breach was a result of a vulnerability in Equifax's website software, which allowed hackers to gain unauthorized access to the company's systems and exfiltrate vast amounts of sensitive data.
Bugs and Worms
The First Bug
First actual case of bug being found.
On September 9, 1947, the Mark II (at Harvard University) broke down. Engineers investigated and diagnosed the cause. A moth had entered the machine and had shorted out relay number 70 of Panel F. They attached the bug to the log page with the note "First actual case of bug being found.". This is often told as the story of how the term "bug" was born, although engineers had already used the word for defects before then.
The sheet is kept at the National Museum of American History of the Smithsonian Institution in Washington.
Further readings:
The Explosion of the Ariane 5
... a 64 bit floating point number ... was converted to a 16 bit signed integer.
On June 4, 1996 the Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off. The rocket was on its first voyage, after a decade of development costing $7 billion. The destroyed rocket and its cargo were valued at $500 million. The accident was a significant setback for Europe’s space program.
The horizontal velocity of the rocket with respect to the platform was larger than 32,767, the largest integer storable in a 16 bit signed integer, and thus the conversion failed.
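The range problem can be sketched as follows. The flight software was not written in Python and the sample values are made up; both helper functions are hypothetical and only show two ways a narrowing conversion can be guarded. One raises an error the caller must handle, the other clamps the value to the representable range.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value):
    """Convert a float to a 16-bit signed integer, refusing out-of-range values."""
    n = int(value)                       # truncates toward zero
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return n


def to_int16_saturating(value):
    """Convert a float to a 16-bit signed integer, clamping to the valid range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1234.9))          # 1234
print(to_int16_saturating(40000.0))      # 32767
try:
    to_int16_checked(40000.0)            # hypothetical out-of-range reading
except OverflowError as exc:
    print("conversion rejected:", exc)
```

Either guard only helps if the surrounding system has a sensible reaction to the out-of-range case; an error that nobody handles is no better than the failed conversion.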
Further readings:
Metric System Mess Of NASA’s Mars Climate Orbiter
The Mars Climate Orbiter was a robotic space probe launched by NASA on December 11, 1998 to study the Martian climate. The navigation team used the metric system in its calculations, while the team that designed and built the spacecraft provided crucial acceleration data in English (imperial) units. The readings were given in pound-force seconds, while the navigation software expected newton-seconds. In a sense, the spacecraft was lost in translation.
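The size of the mismatch is easy to work out: one pound-force is 4.4482216152605 newtons, so an impulse reported in pound-force seconds but read as newton-seconds is too small by a factor of about 4.45. The sketch below uses a made-up impulse value and hypothetical function names.

```python
LBF_TO_N = 4.4482216152605   # newtons per pound-force (0.45359237 kg * 9.80665 m/s^2)


def lbf_s_to_n_s(impulse_lbf_s):
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_N


reported = 10.0                          # hypothetical value, in lbf*s
misread = reported                       # wrongly treated as if it were N*s
correct = lbf_s_to_n_s(reported)         # what the navigation side needed
print(round(correct, 3))                 # 44.482
print(round(correct / misread, 2))       # 4.45: the effect is underestimated by this factor
```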
Further readings:
The Morris Worm
This was a design flaw ...
According to its creator, Robert Tappan Morris, the worm was not written to cause damage but to highlight security flaws. It was programmed to exploit known vulnerabilities in sendmail, finger, and rsh/rexec, as well as weak passwords. Before infecting a machine, the worm checked whether a copy was already running there. To keep system administrators from defeating it by having the computer falsely report that it was already infected, Robert programmed the worm to copy itself anyway 14% of the time. This was a design flaw: machines were infected many times over, which had the effect of fork bombs and crashed the affected computers.
The Morris Worm is accepted as one of the first computer worms distributed via the Internet, and it was the first to gain significant mainstream media attention. A floppy disk containing the source code for the Morris Worm is held at the Computer History Museum.
Further readings:
Death by IT
One digit was dropped from a computer code to indicate the patients were "deceased" rather than "discharged to home".
In 2002, St. Mary's Mercy Medical Center in Grand Rapids erroneously reported 8,500 patients as dead because of a glitch in its patient management software. False death reports were sent not only to patients but also to insurance companies and social security offices. There is no clear announcement about how the problem was fixed, but the management software was changed.
Further readings:
The 1990 AT&T Network Collapse
An error in just one line of code brought down AT&T's network for several hours.
Before this collapse, AT&T's long-distance network was considered reliable and strong. It carried over 70% of the nation's long-distance traffic and routed over 115 million telephone calls a day. The collapse cost about $60 million: AT&T customers missed 75 million phone calls, and 200,000 airline and hotel reservations were lost, along with the business of others that relied on the telephone network.
The bug was caused by a break statement in an if clause nested in a switch clause in the upgraded recovery software that ran on all switches. All the switches became unreliable at the same time, while each switch was trying to determine whether its neighboring switches were reliable.
    while (ring receive buffer not empty and side buffer not empty) DO
        Initialize pointer to first message in side buffer or ring receive buffer
        get copy of buffer
        switch (message)
            case (incoming_message):
                if (sending switch is out of service) DO
                    if (ring write buffer is empty) DO
                        send "in service" to status map
                    else
                        break    // leaves the whole switch, not just the if
                END IF
                process incoming message, set up pointers to optional parameters
                break
        END SWITCH
        do optional parameter work
Further readings:
- The 1990 AT&T Long Distance Network Collapse
- All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse
ILOVEYOU Worm
The events inspired the song "E-mail" on the Pet Shop Boys' UK top-ten album of 2002, Release, the lyrics of which play thematically on the human desires which enabled the mass destruction of this computer infection.
The ILOVEYOU worm was created by Onel de Guzman, a college student in Manila, Philippines, and infected over ten million Windows personal computers from 4 May 2000 onward. His purpose was to steal passwords so that he could use other users' internet accounts without paying for the service. He stated that the worm was very easy to create because there was a bug in Windows 95 that would run code in email attachments.
Originally, he designed the worm to only work in Manila. Out of curiosity, he removed this geographic restriction and allowed the worm to spread worldwide. Of course, he did not expect this worldwide spread. The worm moved first to Hong Kong, then to Europe, and finally to the United States. Within ten days, over fifty million infections (10% of internet-connected computers in the world) had been reported. To protect themselves, the Pentagon, the CIA, the British Parliament and most large corporations decided to shut down their mail systems completely.
The worm gave users a way to modify it, which allowed more than twenty-five variations of ILOVEYOU to spread across the internet, each doing a different kind of damage.
The worm raised public awareness of the real threat of malware, and antivirus software providers entered a golden era of distribution. Additionally, it made many people more skeptical of emails, which were the classic virus delivery system.
Further readings:
The Zune Bug
Judgment day has arrived for owners of 30GB Zunes...
On December 31, 2008, many Zune owners started reporting that their players were freezing. Microsoft's response was to wait until the next day, when the freeze would resolve by itself. The cause was a simple loop that never terminates on the last day of a leap year: its control logic was written without properly considering leap years.
Here is the problematic loop:
    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }
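When `days` is exactly 366 in a leap year, the loop above stays above 365 but never enters the inner `if`, so nothing changes and it spins forever. One way to restructure it so that every iteration either returns or makes progress is sketched below in Python. This is an illustration, not Microsoft's fix; the origin year of 1980 and the 1-based day count are assumptions.

```python
from datetime import date

ORIGIN_YEAR = 1980   # assumed epoch year for the device clock


def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def year_from_days(days):
    """Map a 1-based day count since 1 January ORIGIN_YEAR to (year, day of year)."""
    year = ORIGIN_YEAR
    while True:
        days_in_year = 366 if is_leap_year(year) else 365
        if days <= days_in_year:
            return year, days            # day 366 of a leap year ends up here
        days -= days_in_year             # otherwise the loop always makes progress
        year += 1


# 31 December 2008, the day the players froze, as a 1-based day count.
frozen_day = date(2008, 12, 31).toordinal() - date(ORIGIN_YEAR, 1, 1).toordinal() + 1
print(year_from_days(frozen_day))        # (2008, 366)
```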
Further readings:
- The Zune Bug At Bit-Player
The Forgotten Space Character
GIANT BUG... causing /usr to be deleted... so sorry....
Bumblebee is a project to make Nvidia Optimus enabled laptops work on GNU/Linux systems. On May 24, 2011, an issue was created because the installation script deleted the /usr/ folder. The cause was a single mistyped space in the script. The effect of the bug was harsh because victims had to reinstall the OS afterwards. The issue still attracts attention, and you can see recent comments (as of Mar 7, 2021).
....
rm -rf /usr /lib/nvidia-current/xorg/xorg
....
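Why one space matters can be seen by splitting the command the way a shell does. The sketch below uses Python's `shlex` module. The "intended" command is an assumption, obtained by simply removing the stray space, and `targets` is a hypothetical helper.

```python
import shlex

buggy = "rm -rf /usr /lib/nvidia-current/xorg/xorg"
intended = "rm -rf /usr/lib/nvidia-current/xorg/xorg"   # assumed intent


def targets(command):
    """Return the path arguments of a command, skipping the program name and flags."""
    return [arg for arg in shlex.split(command)[1:] if not arg.startswith("-")]


print(targets(buggy))      # ['/usr', '/lib/nvidia-current/xorg/xorg']
print(targets(intended))   # ['/usr/lib/nvidia-current/xorg/xorg']
```

With the space, `rm` receives two separate paths, and the first one is the whole of `/usr`.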
Rachel True's Problem On iCloud
Anyone else getting this error from Apple iCloud ? In past or now? ...
Rachel True shared on Twitter that she was getting an error when attempting to log into iCloud. It may appear because iCloud interprets her last name "True" as a boolean value. In her tweet, she noted that she had been locked out for over 6 months, and the problem was still present as of 27 February 2021. As the tweet is very popular now, the iCloud team will most probably notice it and come up with a solution soon.
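How a surname could turn into a boolean is easy to sketch. The code below is purely hypothetical and says nothing about how iCloud is actually implemented; it only shows the kind of "helpful" type coercion that causes this class of bug, and why a later check that expects a string would reject the value.

```python
def naive_coerce(value):
    """Hypothetical loader that 'helpfully' converts boolean-looking strings."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return value


def is_valid_last_name(value):
    """A later validation step that expects the name to still be a string."""
    return isinstance(value, str) and len(value) > 0


print(is_valid_last_name(naive_coerce("Smith")))   # True
print(is_valid_last_name(naive_coerce("True")))    # False: the name became a boolean
```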
It is a very common and simple bug because developers tend to forget the importance of value types. Here are some other similar cases:
The MySpace Worm (Samy Worm)
Myspace blocks a lot of tags. In fact, they only seem to allow ... maybe a few others ...
The MySpace Worm is an XSS worm that Samy Kamkar designed to propagate across the social networking site MySpace. Even though the worm itself was relatively harmless, it resulted in Kamkar being sentenced to three years' probation, during which he was allowed only one computer with no access to the Internet, along with other penalties. This worm is a good case study of the possible consequences of an XSS exploit.
Further readings:
Translations
This is available in a number of languages.
Language | Maintainer |
---|---|
Umut Işık |
If you would like to update a translation or add a new language, just open a pull request.
Contributing
Please do contribute!
Raise an issue if you'd like to suggest an addition or change, or open a pull request to add your own.
Please read the Contributing Guidelines and the Code of Conduct documents.