De-identification means removing identifying data from a dataset, and it should be done as early as possible. Once a dataset has been de-identified, it can be shared without disclosing identifying information.
Removing identifiers is important to protect the confidentiality of research participants. However, there is always a risk of re-identifying data, and changing technology introduces new ways to re-identify data. Managing that risk is an important part of sharing research data.
There are several ways of approaching de-identification, each of which has benefits and drawbacks.
Anonymizing data removes all links to the individual, as well as links across datasets. This means all identifiers are removed. However, as with all de-identification methods, it may still be possible to re-identify individuals through indirect identifiers and/or links to related datasets.
For example, the following shows a small section of a dataset containing identifiers:
| Name | Address | Postal Code | Year of birth | Gender | Occupation | Salary |
|---|---|---|---|---|---|---|
| Sally Xi | 123 City Roadway, Vancouver, BC | V5V 1P2 | 1970 | Female | Manager | 90,000 |
| Sam Cooper | 4567 Town Way, Smalltown, BC | V8A 1A5 | 1982 | Male | Machinist | 65,000 |
An anonymized version of that dataset might look like this:
| Postal Code | Year of birth | Gender | Occupation | Salary |
|---|---|---|---|---|
| V5V 1P2 | 1970 | Female | Manager | 90,000 |
| V8A 1A5 | 1982 | Male | Machinist | 65,000 |
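As a minimal sketch, the step from the first table to the second amounts to dropping the columns that directly identify a person. The Python below assumes that `Name` and `Address` are the only direct identifiers and that salaries are stored as integers; the `DIRECT_IDENTIFIERS` set and the `drop_direct_identifiers` helper are hypothetical names chosen for this illustration.

```python
# Columns treated as direct identifiers (an assumption for this example).
DIRECT_IDENTIFIERS = {"Name", "Address"}

records = [
    {"Name": "Sally Xi", "Address": "123 City Roadway, Vancouver, BC",
     "Postal Code": "V5V 1P2", "Year of birth": 1970, "Gender": "Female",
     "Occupation": "Manager", "Salary": 90000},
    {"Name": "Sam Cooper", "Address": "4567 Town Way, Smalltown, BC",
     "Postal Code": "V8A 1A5", "Year of birth": 1982, "Gender": "Male",
     "Occupation": "Machinist", "Salary": 65000},
]


def drop_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows with the direct identifier columns removed."""
    return [
        {column: value for column, value in row.items() if column not in identifiers}
        for row in rows
    ]


anonymized = drop_direct_identifiers(records)
for row in anonymized:
    print(row)
```

The original `records` list is left unchanged, so this step only anonymizes the copy that is shared. The remaining columns are still indirect identifiers.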
In some cases, this might be enough to prevent re-identification. Often, however, anonymized data can be easily re-identified through the indirect identifiers that remain. For example, if there are few machinists in the V8A 1A5 postal code, then the record belonging to Sam Cooper carries a strong risk of re-identification.
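One rough way to gauge this risk is to count how many records share each combination of indirect identifiers; a combination held by very few records is easier to re-identify. This is the idea behind k-anonymity, which is offered here as one possible check rather than a required method. In the sketch below, the extra manager rows, the `small_groups` helper, and the threshold of 2 are all invented for illustration.

```python
from collections import Counter

# Indirect identifiers to check together (an assumption for this example).
QUASI_IDENTIFIERS = ("Postal Code", "Occupation")
# Hypothetical threshold: groups smaller than this are flagged.
MIN_GROUP_SIZE = 2

rows = [
    {"Postal Code": "V5V 1P2", "Occupation": "Manager"},
    {"Postal Code": "V5V 1P2", "Occupation": "Manager"},    # made-up row
    {"Postal Code": "V5V 1P2", "Occupation": "Manager"},    # made-up row
    {"Postal Code": "V8A 1A5", "Occupation": "Machinist"},
]


def small_groups(rows, keys, min_size):
    """Return each combination of key values shared by fewer than min_size rows."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_size}


risky = small_groups(rows, QUASI_IDENTIFIERS, MIN_GROUP_SIZE)
print(risky)  # {('V8A 1A5', 'Machinist'): 1}
```

A flagged combination does not prove that a record can be re-identified, but it marks where generalizing or suppressing values may be worth considering before sharing.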
Researchers are increasingly using algorithm-based tools to help anonymize their data and manage the risk that the anonymized data will be re-identified. Examples of anonymization tools include:
Using coded data (also called pseudonymization) is a method of de-identification that replaces identifiers with pseudonyms or codes generated by the researcher. Coded data allows researchers to link de-identified records to the same individual across multiple datasets while preserving that individual's confidentiality.
This means that, unlike anonymized data, coded data can be linked across datasets. Linking across datasets can make data more useful, but it can also increase the risk of re-identification. Researchers can also choose to use different pseudonyms for different datasets, which may remove some analytical value but also decrease the risk of re-identification.
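The trade-off between linkable and dataset-specific pseudonyms can be sketched with a keyed hash. The example below uses HMAC-SHA256 from the Python standard library, which is one possible way to generate pseudonyms and not a method prescribed here; the `pseudonym` helper, the key names, and the 12-character code length are hypothetical choices.

```python
import hashlib
import hmac
import secrets


def pseudonym(identifier: str, key: bytes) -> str:
    """Derive a repeatable pseudonym from an identifier and a secret key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P" + digest[:12].upper()


# One key reused across datasets: the same person gets the same code,
# so records can be linked.
shared_key = secrets.token_bytes(32)
linked_a = pseudonym("Sam Cooper", shared_key)
linked_b = pseudonym("Sam Cooper", shared_key)

# A separate key per dataset: the same person gets unrelated codes,
# so records cannot be linked without the keys.
survey_key = secrets.token_bytes(32)
payroll_key = secrets.token_bytes(32)
unlinked_a = pseudonym("Sam Cooper", survey_key)
unlinked_b = pseudonym("Sam Cooper", payroll_key)

print(linked_a == linked_b)      # True
print(unlinked_a == unlinked_b)  # False, barring a vanishingly unlikely collision
```

In this approach the keys must be protected as carefully as the identifiers themselves, because anyone who holds a key and a list of names can recompute the pseudonyms.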
It is important to distinguish between anonymized and coded data. The following table shows how the same five responses appear under each approach:
| Name | Anonymized | Coded |
|---|---|---|
| Sally Xi | ANON | P12L25 |
| Sam Cooper | ANON | P38Q27 |
| Sunil Gupta | ANON | P59M16 |
| Sam Cooper | ANON | P38Q27 |
| Sally Xi | ANON | P12L25 |
When the data is anonymized, the link between the individual and the data is removed altogether. Users of the dataset can no longer tell whether multiple records come from the same person. When the data is coded, it is clear whether the same person or different people responded. However, it is important to remember that both the anonymized and coded datasets still contain re-identification risks.
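The difference can be reproduced with a short sketch that processes the five names from the table above. The random code format and the `code_names` helper are assumptions made for this illustration, so the generated codes will not match the ones shown in the table.

```python
import secrets

names = ["Sally Xi", "Sam Cooper", "Sunil Gupta", "Sam Cooper", "Sally Xi"]


def code_names(names):
    """Replace each name with a random code, reusing the code for repeat names."""
    linking = {}   # name -> code
    used = set()   # codes already assigned, to guarantee uniqueness
    coded = []
    for name in names:
        if name not in linking:
            code = "P" + secrets.token_hex(3).upper()
            while code in used:
                code = "P" + secrets.token_hex(3).upper()
            used.add(code)
            linking[name] = code
        coded.append(linking[name])
    return coded, linking


anonymized = ["ANON"] * len(names)
coded, linking = code_names(names)

print(len(set(anonymized)))  # 1: respondents can no longer be told apart
print(len(set(coded)))       # 3: three distinct respondents are still visible
```

The `linking` dictionary is what makes the coded column reversible, so it needs to be stored separately from the shared data or destroyed.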
The following resource provides additional detail on de-identification of data:
You may need or want to keep a file linking the participant names to their IDs or pseudonyms. Keep in mind that your data is coded, not anonymized, for as long as a linking file exists.