Cognitive Bias · Tony MartinVegue

Selection Bias and Information Security Surveys

Everyone in infosec has seen a sketchy stat—“60% of orgs were hit by ransomware!” But who actually took that survey? This post breaks down how selection bias warps vendor reports and how bad data becomes cybersecurity “truth.”


The auditor stared at me blankly. The gaze turned into a gape and lasted long enough to make me shift uncomfortably in my chair, click my pen and look away before I looked back at him.

The blank look flashed to anger.

“Of course, malicious insiders are the biggest threat to this company. They’re the biggest threat to ALL companies.”

He waved the latest copy of a vendor report, which would lead anyone to believe malicious insiders are the single biggest threat to American business since Emma Goldman.

The report he waved in the room was not research at all. It was vendor marketing, thinly disguised as a “survey of Information Security leaders.” It was solely based on an unscientific survey of a small group of people. It reeked of error and bias.

Selection bias is what makes these surveys virtually worthless. I previously wrote about the problems with surveys in information security vendor reports, and I want to dig deeper into a topic from that post: properly selecting a representative sample from the population being surveyed. This is perhaps the most important step in conducting a statistically sound survey.

Why this matters

Risk analysts are one of many professions that rely on both internal and external incident data to assess risk. If a risk analyst is performing an assessment of current or former employees stealing customer data, there are two primary places one would look for incident data to determine frequency: internal incident reports and external data on frequency of occurrence.

One of the first places a risk analyst would look would be one of the many published reports on insider threat. The analyst would then find one or several statistics about the frequency of current or former employees stealing data, and use the figure to help provide a likelihood of a loss event.

If the survey is statistically sound, the results can be extrapolated to the general population. In other words, if the survey states that 12% of insiders use USB devices to steal data, within a margin of error, you can use that same range to help inform your assessment.
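To make "within a margin of error" concrete, here's a minimal sketch using the standard normal-approximation formula for a sample proportion. The 12% figure and the sample size of 818 are illustrative numbers, not from any real report:

```python
import math

def proportion_margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion
    at roughly 95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical example: 12% of insiders use USB devices, sample of 818.
p, n = 0.12, 818
moe = proportion_margin_of_error(p, n)
print(f"{p:.0%} +/- {moe:.1%}")  # prints "12% +/- 2.2%"
```

So a statistically sound survey of 818 people reporting 12% would let you carry roughly 10–14% into your assessment. Note that this math only holds if the sample was randomly selected in the first place.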

If the survey is not statistically sound, the results only apply to respondents of the survey. This is called selection bias.

What is selection bias?

There are many forms of bias in statistics, and by extension in surveys, but the most common is selection bias. It's the easiest to get wrong and the quickest to throw results off.

Selection bias occurs when the sample is systematically different from the population being studied. Here are a few ways this happens.

  • Undercoverage: Underrepresentation of certain groups in the sample. For example, if you are surveying information security professionals, you want pen testers, risk analysts, department heads, CISOs — essentially a cross-section. If you have a hard time getting CISOs to answer the survey, the results will be biased by undercoverage of CISOs.

  • Voluntary Response: This occurs when survey takers are self-selected. The most common example is online surveys or polls. Slashdot polls are fun — but completely unscientific because of voluntary response. Ideally, participants are randomly selected to ensure a good cross-section of groups.

  • Participation bias: This occurs when a certain group of participants is more or less likely to participate than others. This can happen when one group values surveys more than another (risk analysts versus pen testers), or when survey takers are incentivized, such as with reward points or cash. Compensating survey takers is a very contentious practice and will usually draw respondents who are not in the intended sample population.
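A quick simulation shows how badly voluntary response can skew a headline number. All the figures here are made up for illustration: a population where 10% of organizations were actually hit, and where the organizations that were hit are ten times more likely to answer the survey:

```python
import random

random.seed(42)

# Hypothetical population: 10% of 10,000 orgs were actually hit.
population = [1] * 1_000 + [0] * 9_000

def responds(was_hit):
    # Orgs that were hit are far more motivated to fill out the survey.
    return random.random() < (0.5 if was_hit else 0.05)

respondents = [hit for hit in population if responds(hit)]

true_rate = sum(population) / len(population)
survey_rate = sum(respondents) / len(respondents)
print(f"true rate: {true_rate:.0%}, survey says: {survey_rate:.0%}")
```

The self-selected sample reports a rate several times the true one — not because anyone lied, but because of who chose to answer.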

Real-world example

There are many to choose from, but I found the “2015 Vormetric Insider Threat Report” from a random Google search. The report is aesthetically polished and, on the surface, very informative. It has the intended effect of making any reader nervous about data theft from employees and contractors.

The report is based on an online survey completed by 818 IT professionals. The report's authors are careful: they frame the results as the opinions of the respondents. Furthermore, a short disclosure at the end of the report states the survey methodology, identifies the company that performed the survey (Harris Poll) and notes that the "…online survey is not based on a probability sample and therefore no estimate of theoretical sampling error can be calculated."

Let me translate that: This report is for marketing and entertainment purposes only.

Why not?

Here’s another problem: Harris Poll compensates its survey takers. Survey takers earn points (called “HIpoints”) for every survey they fill out, redeemable for gift cards and other items. We already know from the disclosure that the survey isn’t statistically sound, but one must ask — can the survey be trusted to include only IT professionals, if the respondents are self-selected and rewarded for saying whatever it takes to qualify?

The most obvious problems here are voluntary response and participation bias; both lead to a situation in which you should not base any serious decision on the survey results.

I don’t mean to pick on Vormetric exclusively. There are hundreds of similar surveys out there.

Here’s another one. The Cyberthreat Defense Group conducted an online survey that asked many enterprise security questions. One of the results was that 60% of the respondents’ organizations were hit by ransomware in 2016. I skipped ahead to the section describing the methodology. It vaguely discloses that survey takers were PAID and that the results represent the opinions of the respondents. It’s back-page news, but at least it’s there. This is the problem:

Now it’s not the opinion of a small, self-selected, compensated group of people that may or may not be in security. Now it’s fact.

Then it gets tweeted, re-tweeted, liked, whatever. Now it’s InfoSec Folklore.

[Screenshot: the statistic being shared on social media as fact]

See the problem?

Statistics · Tony MartinVegue

The Problem with Security Vendor Reports

Most vendor security reports are just glossy marketing in disguise, riddled with bad stats and survey bias. This post breaks down how to spot junk research before it ends up in your board slides — and how to demand better.


The information security vendor space is flooded with research: annual reports, white papers, marketing publications — the list goes on and on. This research is subsequently handed to marketing folks (and engineers who are really marketers), who fan out to security conferences across the world, standing in booths quoting statistics and filling pay-to-play speaking slots, convincing executives to buy their security products.

There’s a truth, however, that security vendors know but most security practitioners and decision makers aren’t quite wise to yet. Much of the research vendors present in reports and marketing brochures isn’t rooted in any defensible, scientific method. It’s an intentional appeal to fear, designed to create enough self-doubt to make you buy their solution.

This is how it’s being done:

  • Most vendor reports are based on surveys, also known as polls

  • Most of the surveys presented by security vendors ignore the science behind surveys, which is based on statistics and mathematics

  • Instead of using statistically significant survey methods, many reports use dubious approaches designed to lead the reader down a predetermined path

This isn’t exactly new. Advertisers have had consumer manipulation down to an art form for decades. Security vendors, however, should be held to a higher standard, because the whole field is built on trust and credibility. Many vendor reports are presented as security research, not advertisements.

What’s a survey?


A survey is a poll. Pollsters ask a small group of people a question, such as “In the last year, how many of your security incidents were caused by insiders?” The results are then extrapolated to a general population. For example, IBM conducted a survey that found that 59% of CISOs experienced cyber incidents in which the attackers could defeat their defenses. The company that conducted the survey didn’t poll all CISOs — they polled a sample of CISOs and extrapolated a generality about the entire population.

This type of sampling and extrapolation is perfectly acceptable, provided the survey adheres to established methodologies in survey science. Doing so makes the results statistically sound; not doing so puts their validity in question.

All surveys have some sort of error and bias. However, a good survey will attempt to control for this by doing the following:

  • Use established survey science methods to reduce the errors and bias

  • Disclose the errors and bias to the readers

  • Disclose the methodology used to conduct the survey

  • Publish the raw data for peer review

Why you should care about statistically sound surveys

Surveys are everywhere in security. They are found in cute infographics, annual reports, journal articles and academic papers. Security professionals take these reports and read them, learn from them, quote them in steering committee meetings or to senior executives when they ask questions. Managers often ask security analysts to quantify risk with data — the easiest way is to find a related survey. We rely on the data to enable our firms to make risk-aware business decisions.

When you tell your Board of Directors that 43% of all data breaches are caused by internal actors, you’d better be right. The data you use must be statistically sound and rooted in fact. If you are quoting vendor FUD or some marketing brochure that’s disconnected from reality, your credibility is at stake. We are trusted advisors, and everything we say must be defensible.

What makes a good survey

Everyone has seen a survey. Election and public opinion polls seem simple on the surface, but they are very hard to do correctly. The science behind surveys is rooted in math and statistics; when a survey follows that science, its results are statistically sound.

There are four main components of a statistically significant survey:

Population

This is a critical first step. What is the group being studied? How big is it? Examples would be “CISOs” or “information security decision makers.”

Sample size

The size of the group you are surveying. It’s usually not possible to study an entire population, so a sample is chosen. A good surveyor will do all they can to ensure the sample is as representative of the general population as possible. Just as importantly, the sample needs to be randomly selected.

Confidence interval

Also known as the margin of error (e.g., ±3%). The larger the sample size, the smaller the margin of error.
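The sample-size relationship is easy to see with the standard worst-case formula (p = 0.5, 95% confidence). This is a quick illustrative sketch, not from any particular report:

```python
import math

# Worst-case margin of error at 95% confidence: z * sqrt(p*(1-p)/n) with p = 0.5.
for n in (100, 400, 1000, 2000):
    moe = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n:4d}: +/- {moe:.1%}")
# n =  100: +/- 9.8%
# n =  400: +/- 4.9%
# n = 1000: +/- 3.1%
# n = 2000: +/- 2.2%
```

Quadrupling the sample only halves the margin of error, which is why reputable polls publish both numbers together.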

Unbiased Questions

The questions themselves are crafted by a neutral professional trained in survey science. Otherwise, it is very easy to craft biased questions that lead the responder to answer in a certain way.

What makes a bad survey?

A survey loses credibility as it incorporates fewer of the components above. There are many ways a survey can go wrong, but here are the biggest red flags:

  • No disclosure of polling methodology

  • No disclosure of the company that conducted the poll

  • The polling methodology is disclosed, but no effort was made to make it random or representative of the population (online polls have this problem)

  • Survey takers are compensated (people will say anything for money)

  • Margin of error not stated

Be Skeptical

Be skeptical of vendor claims. Check for yourself and read the fine print. When you stroll the vendor halls at RSA or Blackhat and a vendor makes some outrageous claim about an imminent threat, dig deeper. Ask hard questions. We can slowly turn the ship away from FUD and closer to fact and evidence-based research.

And if you’re a vendor — think about using reputable research firms to perform your surveys.
