I've been in security a long time, and the topic of PII comes up more frequently than you can imagine. The typical business user of information wants a 'list' of data points that are PII. While such a list is easy to provide, it changes over time. What's more troublesome is trying to explain how some combinations of data points turn a data set into PII.
The following data points are often used for the express purpose of distinguishing individual identity. Therefore they are clearly classified as PII under the definition used by the National Institute of Standards and Technology (NIST):
Full name (if not common)
Home address
Email address (if private from an association/club membership, etc.)
National identification number
Passport number
IP address (when linked, but not PII by itself in US)
Vehicle registration plate number
Driver's license number
Face, fingerprints, or handwriting
Credit card numbers
Digital identity
Date of birth
Birthplace
Genetic information
Telephone number
Login name, screen name, nickname, or handle
The following data points are traits shared by many people, and cannot be used on their own to distinguish individual identity. However, they are potentially PII, because they may be combined with other personal information to identify an individual.
First or last name, if common
Country, state, postcode or city of residence
Age, especially if non-specific
Gender or race
Name of the school they attend or workplace
Grades, salary, or job position
Criminal record
Web cookie
When a person wishes to remain anonymous, descriptions of them will often employ several of the above, such as "a 21-year-old white female who works at Starbucks". Note that information can still be private, in the sense that a person may not wish for it to become publicly known, without being personally identifiable. Moreover, sometimes multiple pieces of information, none sufficient by itself to uniquely identify an individual, may uniquely identify a person when combined; this is one reason that multiple pieces of evidence are usually presented at criminal trials. It has been shown that, in 1990, 87% of the population of the United States could be uniquely identified by gender, ZIP code, and full date of birth.
So the question is: how do we determine whether the data points being requested constitute PII? I'd like to propose a simple way to calculate what I call Bits of Identity [1]. The formula can be adjusted for a city, state, country, the world, or anything in between.
After working in the device print area (see this site) for years, I realized the same issue exists there in reverse: namely, which data points can I use to uniquely identify a device? In the identity game, the question becomes: how many bits of identity does this data point add?
If you have data pertaining to the United States, whose population in 2010 was 309.3 million, then you require -Log2(1/309300000) bits of data to identify an individual. This equates to 28.2 bits. Now that we know our target, we can measure each data point against it.
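The target calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name `bits_to_identify` and the 100,000-person city are hypothetical and used purely to show how the target moves with the scope.

```python
import math


def bits_to_identify(population: int) -> float:
    """Bits needed to single out one person among `population` people."""
    if population < 1:
        raise ValueError("population must be at least 1")
    # -log2(1 / population) is the same value as log2(population)
    return math.log2(population)


US_POPULATION_2010 = 309_300_000  # figure quoted in the text

us_target_bits = bits_to_identify(US_POPULATION_2010)
print(f"United States (2010): {us_target_bits:.1f} bits")

# Hypothetical smaller scope (illustrative only): a city of 100,000 people
city_target_bits = bits_to_identify(100_000)
print(f"City of 100,000: {city_target_bits:.1f} bits")
```

A smaller scope means a lower target, so fewer data points are needed before a record becomes identifying.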
For example, if a user needs to work with just gender, there are 2 choices (male and female). That single data point carries -Log2(1/2) bits, or 1 bit of identity. So far so good. Now, what if your user wants ZIP code as well? There are ~43,000 ZIP codes in the US, so any given ZIP code chosen at random adds an additional -Log2(1/43000) bits, or 15.39 bits of identity. The two data points together get you to 16.39 bits, still far short of the 28.2-bit target. However, some data points, such as ZIP code, are tricky. There are ZIP codes with, say, 200 people in them; if you had one of those, you would be looking at -Log2(200/309300000) or 20.56 bits of identity. Add in gender and you are at 21.56 bits, which is getting close.
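The gender and ZIP code arithmetic above can be reproduced with a short Python sketch. The helper names `bits_from_choices` and `bits_from_group` are hypothetical. Adding the bits together treats the data points as independent of one another, which is a simplifying assumption rather than something the figures guarantee.

```python
import math

US_POPULATION_2010 = 309_300_000  # figure quoted in the text
TARGET_BITS = math.log2(US_POPULATION_2010)  # about 28.2 bits


def bits_from_choices(num_choices: int) -> float:
    """Bits from a data point with `num_choices` equally likely values."""
    return math.log2(num_choices)


def bits_from_group(group_size: int, population: int) -> float:
    """Bits from knowing a person belongs to a group of `group_size` people."""
    return -math.log2(group_size / population)


gender_bits = bits_from_choices(2)            # male / female
zip_bits = bits_from_choices(43_000)          # ~43,000 ZIP codes, evenly spread
small_zip_bits = bits_from_group(200, US_POPULATION_2010)  # ZIP with 200 people

# Summing assumes the data points are independent (a simplification).
typical_total = gender_bits + zip_bits
small_zip_total = gender_bits + small_zip_bits

for label, total in [("gender + typical ZIP", typical_total),
                     ("gender + 200-person ZIP", small_zip_total)]:
    print(f"{label}: {total:.2f} bits, {TARGET_BITS - total:.2f} short of target")
```

Using the group size directly, as in `bits_from_group`, is what exposes the small-ZIP problem: the even-distribution estimate hides how identifying a sparsely populated value can be.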
Let us calculate some common data points below:
First or last name - This varies, but using the site HowManyOfMe we can estimate the following:
Last name: -Log2 (1/151671) assuming even distribution results in 17.21 bits of identity
First name: -Log2 (1/5163) assuming even distribution results in 12.33 bits of identity
Age, especially if non-specific
Assuming an even distribution of ages from 0 to 100, -Log2(1/100) results in 6.64 bits of identity.
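If you look at a specific age instead (see Demography of the US) and take, for example, 74-year-olds, of whom there are roughly 2,000,000, the same approach used for the 200-person ZIP code gives -Log2(2000000/309300000) or about 7.27 bits of identity.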
Gender or race
Gender: -Log2 (1/2) results in 1 bit of identity
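The per-data-point estimates in this list can be tabulated with a small Python sketch. It uses only the counts quoted above and the same even-distribution assumption; the dictionary name `distinct_values` and its labels are hypothetical.

```python
import math

US_POPULATION_2010 = 309_300_000  # figure quoted in the text
TARGET_BITS = math.log2(US_POPULATION_2010)

# Number of distinct values per data point, as quoted in the list above.
# Every value is assumed to be equally likely (even distribution).
distinct_values = {
    "last name": 151_671,
    "first name": 5_163,
    "age (0 to 100)": 100,
    "gender": 2,
}

# Bits of identity for each data point: -log2(1 / n) == log2(n)
bits_of_identity = {name: math.log2(n) for name, n in distinct_values.items()}

for name, bits in bits_of_identity.items():
    share = bits / TARGET_BITS
    print(f"{name:<15} {bits:6.2f} bits  ({share:.0%} of the target)")
```

Real distributions of names and ages are far from even, so these figures are rough averages: a rare value contributes more bits and a common one fewer.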
[1] As I don't believe this idea has been published before, if you are using this idea or using it to create a derivative work, please give attribution to John Kula and this site.
Reference
L. Sweeney, Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Paper 3. Pittsburgh 2000. Retrieved September 01, 2017, from https://dataprivacylab.org/projects/identifiablity/paper1.pdf