How Cloudflare stopped the bleeding

Tavis Ormandy, a respected security researcher with Google’s Project Zero, discovered the leak while analyzing Google search results. Ormandy wrote:

“On February 17, 2017, I was working on a corpus distillation project, when I encountered some data that didn’t match what I had been expecting. It’s not unusual to find garbage, corrupt data, mislabeled data or just crazy non-conforming data…but the format of the data this time was confusing enough that I spent some time trying to debug what had gone wrong, wondering if it was a bug in my code. In fact, the data was bizarre enough that some colleagues around the Project Zero office even got intrigued.

It became clear after a while we were looking at chunks of uninitialized memory interspersed with valid data. The program that this uninitialized data was coming from just happened to have the data I wanted in memory at the time. That solved the mystery, but some of the nearby memory had strings and objects that really seemed like they could be from a reverse proxy operated by Cloudflare, a major CDN service.

A while later, we figured out how to reproduce the problem. It looked like if an HTML page hosted behind Cloudflare had a specific combination of unbalanced tags, the proxy would intersperse pages of uninitialized memory into the output (kinda like heartbleed, but Cloudflare-specific and worse for reasons I’ll explain later). My working theory was that this was related to their “ScrapeShield” feature which parses and obfuscates html – but because reverse proxies are shared between customers, it would affect all Cloudflare customers.”

Ormandy promptly contacted Cloudflare. According to its blog, the company had an initial mitigation in place within 47 minutes and a global fix in less than 7 hours. John Graham-Cumming subsequently posted details of the vulnerability: “It turned out that in some unusual circumstances … our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines…We quickly identified the problem and turned off three minor Cloudflare features (email obfuscation, Server-side Excludes and Automatic HTTPS Rewrites) that were all using the same HTML parser chain that was causing the leakage. At that point it was no longer possible for memory to be returned in an HTTP response.”

When all was said and done, it appears that Cloudflare had been leaking customer data since September 2016. According to the company, only 1 in 3.3 million HTTP requests exposed data, but at Internet scale that is still a vast amount.

Cloudflare customers need to immediately address potential credential, session token, and confidential information disclosures. Some of this data was indexed by Google and other search engines, and despite remediation efforts, will linger for some time.

Cloudflare customers should:

  1. Invalidate all authentication tokens and cookies. WordPress site owners, if they have not already done so, should read Mark Maunder’s excellent post on the Wordfence blog. In summary, change all wp-config.php salts. This will log everyone out and invalidate cookies and sessions. Similar mechanisms are available on other publishing platforms and applications.
  2. Since it is possible that credentials were exposed, site owners should change any password that has flowed through Cloudflare. While two-factor authentication would provide a strong mitigating factor, passwords should still be updated. Those not using two-factor authentication should seriously consider it today.
  3. Cloudflare customers need to carefully consider whether they have potentially suffered a security breach, and if that gives rise to any reporting requirements. If in doubt, a qualified legal opinion should be obtained.

Beyond the need for immediate damage control, the Cloudflare leak presents valuable lessons for developers, operations teams, and security professionals.

Developers commonly use an assortment of libraries and languages. In this case, Cloudflare used Ragel, a state machine compiler whose definitions are converted into generated C code and then compiled. As the company explained, “The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control of the movement of those pointers. The underlying bug occurs because of a pointer error.

/* generated code */
if ( ++p == pe )
    goto _test_eof;

The root cause of the bug was that reaching the end of a buffer was checked using the equality operator and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using >= instead of == jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that we wrote. This indicated that we were not using Ragel correctly.”

While it is not always feasible to review generated code, manually or with analysis tools, doing so in this case might have prevented a significant data leak. Developers should remain cognizant of the fact that libraries and generated code can introduce significant vulnerabilities.

While there are no guarantees, thorough testing significantly reduces risk. The vast majority of developers and quality assurance groups rely primarily on positive test cases; they verify that the software behaves as expected with valid input data. From a security perspective, negative test cases are vital. Fuzz testing, developed by Barton Miller at the University of Wisconsin in 1989, is particularly helpful. The technique, often automated or semi-automated, involves providing invalid, unexpected, or random data as inputs and monitoring the program for exceptions. Fuzzing, as it is often called, should be considered mandatory when input data is provided by a user or third party; Cloudflare proxies being a prime example.

On a more positive note, Cloudflare’s ability to rapidly disable the affected features demonstrates a valuable best practice: “Every feature Cloudflare ships has a corresponding feature flag, which we call a ‘global kill’. We activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, thus disabling it quickly stopped almost all memory leaks. Within a few seconds, those features were disabled worldwide.”

Finally, the fact that Cloudflare had a cross-functional team assembling within eight minutes of receiving details of the bug from Google, and was able to quickly respond, demonstrates the value of incident response planning. While it is regrettable that the security issue arose in the first place, much can be learned from how Cloudflare stopped the bleeding.

Have a security question you’d like answered in a future column? Please send me an email.
