Selection Bias and Information Security Surveys
Everyone in infosec has seen a sketchy stat—“60% of orgs were hit by ransomware!” But who actually took that survey? This post breaks down how selection bias warps vendor reports and how bad data becomes cybersecurity “truth.”
The auditor stared at me blankly. The gaze turned into a gape and lasted long enough to make me shift uncomfortably in my chair, click my pen and look away before I looked back at him.
The blank look flashed to anger.
“Of course, malicious insiders are the biggest threat to this company. They’re the biggest threat to ALL companies.”
He waved the latest copy of a vendor report, which would lead anyone to believe malicious insiders are the single biggest threat to American business since Emma Goldman.
The report he waved in the room was not research at all. It was vendor marketing, thinly disguised as a “survey of Information Security leaders.” It was solely based on an unscientific survey of a small group of people. It reeked of error and bias.
Selection bias is what makes these surveys virtually worthless. I previously wrote about the problems with surveys in information security vendor reports, and I want to dig deeper into a topic from that post: properly selecting a representative sample of the population being surveyed. This is perhaps the most important step in conducting a statistically sound survey.
Why this matters
Risk analysts are among the many professionals who rely on both internal and external incident data to assess risk. If a risk analyst is assessing the risk of current or former employees stealing customer data, there are two primary places to look for incident data to determine frequency: internal incident reports and external data on how often it happens elsewhere.
One of the first places a risk analyst would look is one of the many published reports on insider threat. The analyst would find one or several statistics about the frequency of current or former employees stealing data, and use that figure to help estimate the likelihood of a loss event.
If the survey is statistically sound, the results can be extrapolated to the general population. In other words, if the survey states that 12% of insiders use USB devices to steal data, you can use that figure, within its margin of error, to help inform your assessment.
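To make “within a margin of error” concrete, here is a minimal sketch of the textbook margin-of-error calculation for a survey proportion. The 12% figure is the hypothetical from above; the sample size of 500 respondents and the 95% confidence level are my own assumptions for illustration, and the math only holds if the sample was randomly drawn.

```python
import math

# Hypothetical inputs: 12% from the example above, with an assumed
# sample size of 500 and a 95% confidence level (z = 1.96).
p = 0.12   # reported proportion of insiders who used USB devices
n = 500    # assumed number of respondents
z = 1.96   # z-score for 95% confidence

# Standard margin of error for a proportion from a simple random sample
margin = z * math.sqrt(p * (1 - p) / n)

print(f"12% +/- {margin:.1%}, or roughly {p - margin:.1%} to {p + margin:.1%}")
# With a self-selected sample, this range is meaningless no matter how
# carefully you compute it.
```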
If the survey is not statistically sound, the results only apply to the people who actually answered it. The most common reason a survey ends up that way is selection bias.
What is selection bias?
There are many forms of bias that are found in statistics, and by extension, in surveys, but the most common is selection bias. It’s the easiest to get wrong and throws the results off the quickest.
Selection bias occurs when the sample is systematically different from the population being studied, so the results cannot be generalized. Here are a few ways this happens.
Undercoverage: Underrepresentation of certain groups in the sample. For example, if you are surveying information security professionals, you will want pen testers, risk analysts, department heads and CISOs: essentially a cross-section of the field. If you have a hard time getting CISOs to answer the survey, CISOs will be undercovered and their views underrepresented in the results.
Voluntary Response: This occurs when your survey takers are self-selected. The most common example of this is online surveys or polls. Slashdot polls are fun — but completely non-scientific because of voluntary response. Optimally, one would like to have participants randomly selected to ensure a good cross-section of groups of people.
Participation bias: This occurs when certain groups of participants are more or less likely to participate than others. This can happen when one group appreciates the value of surveys more than another (risk analysts versus pen testers) or when survey takers are incentivized, such as with reward points or cash. Compensating survey takers is a contentious practice and usually attracts respondents who are not in the intended sample population.
Real-world example
There are many to choose from, but I found the “2015 Vormetric Insider Threat Report” from a random Google search. The report is aesthetically polished and, on the surface, very informative. It has the intended effect of making any reader nervous about data theft from employees and contractors.
The report is based on a survey of 818 IT professionals who completed an online survey. The report authors are very careful; they frame the report as the opinion of the respondents. Furthermore, there is a short disclosure at the end of the report that states the survey methodology, identifies the company that performed the survey (Harris Poll) and states that the “…online survey is not based on a probability sample and therefore no estimate of theoretical sampling error can be calculated.”
Let me translate that: This report is for marketing and entertainment purposes only.
Why not?
Here’s another problem: Harris Poll compensates its survey takers. Survey takers earn points (called “HIpoints”) for every survey they fill out. These points can be redeemed for gift cards and other items. We already know from the disclosure that the survey isn’t statistically sound, but one must ask: can the survey be trusted to include only IT professionals when the respondents are self-selected and rewarded for saying whatever it takes to qualify?
The most obvious problems here are voluntary response and participation bias; both mean you should not base any serious decision on the survey results.
I don’t mean to pick on Vormetric exclusively. There are hundreds of similar surveys out there.
Here’s another one. The Cyberthreat Defense Group conducted an online survey that asked many enterprise security questions. One of the results was that 60% of the respondents’ organizations were hit by ransomware in 2016. I fast-forwarded to the section that described the methodology. They vaguely disclosed that survey takers were PAID and that the results represented the opinions of the respondents. It’s back-page news, but at least it’s there. This is the problem:
Once that stat is quoted in a headline or a slide deck, it’s no longer the opinion of a small, self-selected, compensated group of people who may or may not work in security. Now it’s fact.
Then it gets tweeted, re-tweeted, liked, whatever. Now it’s InfoSec Folklore.
See the problem?
The Problem with Security Vendor Reports
Most vendor security reports are just glossy marketing in disguise, riddled with bad stats and survey bias. This post breaks down how to spot junk research before it ends up in your board slides — and how to demand better.
The information security vendor space is flooded with research: annual reports, white papers, marketing publications; the list goes on and on. This research is handed to marketing folks (and engineers who are really marketers), who then fan out to security conferences across the world, standing in booths quoting statistics, filling pay-to-play speaking slots and convincing executives to buy their security products.
There’s a truth, however, that security vendors know but most security practitioners and decision makers aren’t quite wise to yet. Much of the research vendors present in reports and marketing brochures isn’t rooted in any defensible, scientific method. It’s an intentional appeal to fear, designed to create enough self-doubt to make you buy their solution.
This is how it’s being done:
Most vendor reports are based on surveys, also known as polls
Most of the surveys presented by security vendors ignore the science behind surveys, which is based on statistics and mathematics
Instead of using statistically significant survey methods, many reports use dubious approaches designed to lead the reader down a predetermined path
This isn’t exactly new. Advertisers have consumer manipulation down to an art form and have been doing it for decades. Security vendors, however, should be held to a higher standard because the whole field is based on trust and credibility. Many vendor reports are presented as security research, not advertisements.
What’s a survey?
A survey is a poll. Pollsters ask a small group of people a question, such as “In the last year, how many of your security incidents have been caused by insiders?” The results are then extrapolated to the general population. For example, IBM conducted a survey that found that 59% of CISOs experienced cyber incidents in which the attackers could defeat their defenses. The company that conducted the survey didn’t poll all CISOs; it polled a sample of CISOs and extrapolated a general conclusion about the entire population of CISOs.
This type of sampling and extrapolation is completely acceptable, provided the survey adheres to established methodologies in survey science. Doing so makes the survey statistically sound; skipping it puts the validity of the results in question.
All surveys have some sort of error and bias. However, a good survey will attempt to control for this by doing the following:
Use established survey science methods to reduce the errors and bias
Disclose the errors and bias to the readers
Disclose the methodology used to conduct the survey
A good survey will also publish the raw data for peer review
Why you should care about statistically sound surveys
Surveys are everywhere in security. They are found in cute infographics, annual reports, journal articles and academic papers. Security professionals read these reports, learn from them, and quote them in steering committee meetings or to senior executives who ask questions. Managers often ask security analysts to quantify risk with data, and the easiest way to do that is to find a related survey. We rely on this data to enable our firms to make risk-aware business decisions.
When you tell your Board of Directors that 43% of all data breaches are caused by internal actors, you’d better be right. The data you are using must be statistically sound and rooted in fact. If you are quoting vendor FUD or some marketing brochure that’s disconnected from reality, your credibility is at stake. We are trusted advisors and everything we say must be defensible.
What makes a good survey
Everyone has seen a survey. Election and public opinion polls seem simple on the surface, but they are very hard to do correctly. The science behind surveys is rooted in math and statistics; when a survey follows that science, its results are statistically sound.
There are four main components of a statistically sound survey:
Population
This is a critical first step. What is the group that is being studied? How big is the group? An example would be “CISOs” or “Information Security decision makers.”
Sample size
The size of the group you are surveying. It’s usually not possible to study an entire population, so a sample is chosen. A good survey designer will do all they can to ensure the sample is as representative of the general population as possible. Just as importantly, the sample needs to be randomly selected.
Confidence interval
Usually expressed as a margin of error (e.g. +/- 3%). The larger the sample size, the lower the margin of error (a quick illustration follows below).
Unbiased Questions
The questions themselves should be crafted by a neutral professional trained in survey science. Otherwise, it is very easy to craft biased questions that lead the respondent to answer in a certain way.
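As an illustration of how sample size drives the margin of error, here is a quick worked example using the textbook formula for a proportion. It assumes a simple random sample, a 95% confidence level and the worst-case proportion of 50%; the sample sizes are arbitrary.

```python
import math

# Worst-case margin of error (p = 0.5) at a 95% confidence level,
# assuming a simple random sample. Sample sizes chosen arbitrarily.
p, z = 0.5, 1.96

for n in (100, 400, 1000, 2000):
    margin = z * math.sqrt(p * (1 - p) / n)
    print(f"n = {n:>4}: +/- {margin:.1%}")

# n =  100: +/- 9.8%
# n =  400: +/- 4.9%
# n = 1000: +/- 3.1%
# n = 2000: +/- 2.2%
```

None of this rescues a survey whose respondents were self-selected; the formula assumes random sampling from the population being studied.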
What makes a bad survey?
A survey loses credibility as it uses fewer and fewer of the components above. There are many ways a survey can be bad, but here are the biggest red flags:
No disclosure of polling methodology
No disclosure of the company that conducted the poll
The polling methodology is disclosed, but no effort was made to make it random or representative of the population (online polls have this problem)
Survey takers are compensated (people will say anything for money)
Margin of error not stated
Be Skeptical
Be skeptical of vendor claims. Check for yourself and read the fine print. When you stroll the vendor halls at RSA or Blackhat and a vendor makes some outrageous claim about an imminent threat, dig deeper. Ask hard questions. We can slowly turn the ship away from FUD and closer to fact and evidence-based research.
And if you’re a vendor — think about using reputable research firms to perform your surveys.
What’s the difference between a vulnerability scan, penetration test and a risk analysis?
Think vulnerability scan, pen test, and risk analysis are the same thing? They're not — and mixing them up could waste your money and leave you exposed. This post breaks down the real differences so you can make smarter, more secure decisions.

You’ve just deployed an ecommerce site for your small business or developed the next hot iPhone MMORPG. Now what?
Don’t get hacked!
An often overlooked but very important process in the development of any Internet-facing service is testing it for vulnerabilities, knowing whether those vulnerabilities are actually exploitable in your particular environment and, lastly, knowing what risks those vulnerabilities pose to your firm or product launch. These three processes are known as a vulnerability assessment, a penetration test and a risk analysis. Knowing the difference is critical when hiring an outside firm to test the security of your infrastructure or a particular component of your network.
Let’s examine the differences in depth and see how they complement each other.

Vulnerability assessment
Vulnerability assessments are most often confused with penetration tests, and the terms are often used interchangeably, but they are worlds apart.
Vulnerability assessments are performed using an off-the-shelf software package, such as Nessus or OpenVAS, to scan an IP address or range of IP addresses for known vulnerabilities. For example, the software has signatures for the Heartbleed bug or missing Apache web server patches and will alert if they are found. The software then produces a report that lists the vulnerabilities found and (depending on the software and options selected) gives an indication of each vulnerability’s severity and basic remediation steps.
It’s important to keep in mind that these scanners use a list of known vulnerabilities, meaning they are already known to the security community, hackers and the software vendors. There are vulnerabilities that are unknown to the public at large and these scanners will not find them.
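Much of the day-to-day work with these reports is triage. Here is a minimal sketch of summarizing a scanner’s CSV export by severity; the column name (“Risk”) and the file name are assumptions modeled on a typical Nessus-style export and will differ between scanners.

```python
import csv
from collections import Counter

# Count findings by severity in a scanner CSV export.
# "Risk" is an assumed column name; adjust it to match your scanner.
def summarize(path: str) -> Counter:
    severities = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            severities[row.get("Risk") or "Unknown"] += 1
    return severities

if __name__ == "__main__":
    for severity, count in summarize("scan_results.csv").most_common():
        print(f"{severity:>10}: {count}")
```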
Penetration test
Many “professional penetration testers” will actually just run a vulnerability scan, package up the report in a nice, pretty bow and call it a day. Nope — this is only a first step in a penetration test. A good penetration tester takes the output of a network scan or a vulnerability assessment and takes it to 11 — they probe an open port and see what can be exploited.
For example, let’s say a website is vulnerable to Heartbleed. Many websites still are. It’s one thing to run a scan and say “you are vulnerable to Heartbleed” and a completely different thing to exploit the bug and discover the depth of the problem and find out exactly what type of information could be revealed if it was exploited. This is the main difference — the website or service is actually being penetrated, just like a hacker would do.
Similar to a vulnerability scan, the results are usually ranked by severity and exploitability with remediation steps provided.
Penetration tests can be performed using automated tools, such as Metasploit, but veteran testers will write their own exploits from scratch.
Risk analysis
A risk analysis is often confused with the previous two terms, but it is also a very different animal. A risk analysis doesn’t require any scanning tools or applications — it’s a discipline that analyzes a specific vulnerability (such as a line item from a penetration test) and attempts to ascertain the risk — including financial, reputational, business continuity, regulatory and others — to the company if the vulnerability were to be exploited.
Many factors are considered when performing a risk analysis: asset, vulnerability, threat and impact to the company. An example of this would be an analyst trying to find the risk to the company of a server that is vulnerable to Heartbleed.
The analyst would first look at the vulnerable server, where it is on the network infrastructure and the type of data it stores. A server sitting on an internal network without outside connectivity, storing no data but vulnerable to Heartbleed has a much different risk posture than a customer-facing web server that stores credit card data and is also vulnerable to Heartbleed. A vulnerability scan does not make these distinctions. Next, the analyst examines threats that are likely to exploit the vulnerability, such as organized crime or insiders, and builds a profile of capabilities, motivations and objectives. Last, the impact to the company is ascertained — specifically, what bad thing would happen to the firm if an organized crime ring exploited Heartbleed and acquired cardholder data?
A risk analysis, when completed, will have a final risk rating with mitigating controls that can further reduce the risk. Business managers can then take the risk statement and mitigating controls and decide whether or not to implement them.
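The post doesn’t prescribe a particular quantification method, but as a rough illustration of how a rating like this can be put in financial terms, here is a minimal annualized-loss sketch. Every number in it is invented for the Heartbleed example above.

```python
# Rough annualized-loss sketch for the Heartbleed example -- every
# number below is invented for illustration; a real analysis would
# draw on internal incident data and external frequency data.
loss_event_frequency = 0.2    # assumed: one successful attack every 5 years
records_exposed = 50_000      # assumed: cardholder records on the server
cost_per_record = 150         # assumed: response, notification, fines ($)

single_loss_expectancy = records_exposed * cost_per_record
annualized_loss_expectancy = loss_event_frequency * single_loss_expectancy

print(f"Single loss expectancy:     ${single_loss_expectancy:,.0f}")
print(f"Annualized loss expectancy: ${annualized_loss_expectancy:,.0f}")
# Business managers can weigh the annualized figure against the cost of
# the proposed mitigating controls.
```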
The three different concepts explained here are not exclusive of each other; rather, they complement each other. In many information security programs, vulnerability assessments are the first step; they are used to perform wide sweeps of a network to find missing patches or misconfigured software. From there, one can either perform a penetration test to see how exploitable the vulnerability is or a risk analysis to ascertain the cost/benefit of fixing it. Of course, you don’t need either to perform a risk analysis. Risk can be determined anywhere a threat and an asset are present, whether that’s a data center in a hurricane zone or confidential papers sitting in a wastebasket.
It’s important to know the difference: each is significant in its own way, and each has a vastly different purpose and outcome. Make sure any company you hire to perform these services also knows the difference.
Originally published at www.csoonline.com on May 13, 2015.