Step-by-step de-identification guide

Data submitted to ARCHIMEDES must comply with applicable privacy regulations and ethical approvals. In many cases this involves de-identifying or coding data with consent prior to submission.

The tools and resources below are provided for educational purposes only, and researchers are responsible for ensuring their data is prepared appropriately.

Ready to begin de-identifying your data?

Select your data type below to view step-by-step guidance, key risks to consider, and commonly used methods and tools.

Structured Data

Organized data in fixed formats.

Examples: electronic medical/health records (EMRs/EHRs), Excel or CSV spreadsheets, databases, registries, etc.

Unstructured Data

Raw, unorganized information.

Examples: clinical notes, discharge summaries, radiology reports, surgical summaries, etc.

Imaging Data

Image files from clinical exams.

Examples: medical images (MRI, CT scan, X-ray, ultrasound, etc.) in various formats, e.g., DICOM, NIfTI, JPEG/PNG.

Other data

Additional data types beyond structured, text, or imaging data.

Examples: genomics, wearable/sensor data, waveforms, audio/video, or combined datasets.

How to de-identify your data type

Select a data type above to see instructions for de-identifying it.

The steps below outline a typical process for identifying sensitive information, applying appropriate de-identification methods, and checking re-identification risk.

Structured Data Workflow

Identify direct identifiers

Locate and remove explicit identifiers such as names, addresses, health card numbers, or email addresses.

What this means

Direct identifiers uniquely identify an individual.

What to do

Remove or replace variables such as names, health card numbers, email addresses, or phone numbers.

Example

Name → removed

Health card number → replaced with Study_ID
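
The replacement shown above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed ARCHIMEDES method: the column names, the `S-` prefix, and the sample rows are all hypothetical, and a real project would decide separately whether the linkage key is stored securely or destroyed.

```python
import secrets

# Hypothetical column names treated as direct identifiers.
DIRECT_IDENTIFIERS = ("name", "health_card_number", "email", "phone")


def new_study_id(used):
    """Generate a random study code that is not already in use."""
    while True:
        candidate = f"S-{secrets.token_hex(4)}"
        if candidate not in used:
            return candidate


def pseudonymize(records, id_field="health_card_number"):
    """Drop direct identifiers and add one random Study_ID per individual."""
    key = {}      # original ID -> Study_ID (the linkage key)
    cleaned = []
    for row in records:
        original_id = row[id_field]
        if original_id not in key:
            key[original_id] = new_study_id(set(key.values()))
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        cleaned.append({"Study_ID": key[original_id], **kept})
    return cleaned, key


# Made-up sample rows: two visits by one person, one visit by another.
records = [
    {"name": "John Smith", "health_card_number": "1234-567-890", "age": 43},
    {"name": "John Smith", "health_card_number": "1234-567-890", "age": 44},
    {"name": "Ann Lee", "health_card_number": "9876-543-210", "age": 61},
]
cleaned, linkage_key = pseudonymize(records)
```

Because the study code is random rather than derived from the health card number, it cannot be reversed without the linkage key.
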
Identify and assess quasi-identifiers

Evaluate indirect identifiers (e.g., age, postal code, gender, dates) that could reveal identity when combined.

What this means

Quasi-identifiers do not identify someone on their own but may reveal identity when combined.

What to do

Review variables such as age, postal code, gender, and dates to determine if combinations could identify someone.

Example

Age + postal code + gender may uniquely identify an individual in a small population.
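
One simple way to spot such combinations is to count how often each one occurs and flag those that appear only once. The sketch below assumes hypothetical column names and made-up rows; it illustrates the idea and is not a complete risk assessment.

```python
from collections import Counter

# Hypothetical column names treated as quasi-identifiers.
QUASI_IDENTIFIERS = ("age", "postal_code", "gender")


def unique_combinations(records, columns=QUASI_IDENTIFIERS):
    """Return quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in records)
    return [combo for combo, n in counts.items() if n == 1]


records = [
    {"age": 43, "postal_code": "K1A0B1", "gender": "F"},
    {"age": 43, "postal_code": "K1A0B1", "gender": "F"},
    {"age": 92, "postal_code": "P0T2E0", "gender": "M"},
]
risky = unique_combinations(records)
print(risky)
```

Any combination returned here belongs to a single record and is a candidate for generalization or suppression.
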
Select transformation methods

Choose appropriate techniques such as generalization, suppression, pseudonymization, or noise addition.
What this means

Choose how quasi-identifiers will be modified to reduce re-identification risk.

What to do

Common approaches include:
- Generalization (Age 43 → 40–45)
- Suppression (remove variable)
- Pseudonymization (replace IDs with study codes)
Example

Postal code: K1A0B1 → K1A***
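
Generalization and truncation are easy to express as small helper functions. The sketch below is illustrative only: the function names are hypothetical, and it uses non-overlapping 5-year bands (so 43 falls in 40-44), which is one possible banding choice among several.

```python
def generalize_age(age, width=5):
    """Map an exact age to a non-overlapping band of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postal_code(code, keep=3):
    """Keep the first `keep` characters and mask the remainder."""
    compact = code.replace(" ", "").upper()
    return compact[:keep] + "*" * (len(compact) - keep)


print(generalize_age(43))
print(truncate_postal_code("K1A 0B1"))
```

Wider bands or shorter postal prefixes lower re-identification risk further, at the cost of analytic detail.
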
Apply de-identification tools

Use software or structured workflows to implement the selected transformations.

What this means

Use software or scripts to apply the transformations to your dataset.

What to do

Load the dataset into a tool and apply the selected transformations to relevant variables.

Example tools

ARX, Amnesia, R packages, Python scripts
Assess re-identification risk

Evaluate whether individuals could reasonably be re-identified from the transformed data.
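
A common way to quantify this is k-anonymity: group records by their quasi-identifier values and look at the size of the smallest group. The sketch below assumes hypothetical column names and an illustrative threshold; the acceptable threshold depends on the project's ethics approval and data-sharing context, and dedicated tools such as ARX or sdcMicro offer much fuller risk models.

```python
from collections import Counter


def k_anonymity_summary(records, columns, k_threshold=5):
    """Summarize group sizes over the chosen quasi-identifier columns."""
    sizes = Counter(tuple(row[c] for c in columns) for row in records)
    smallest = min(sizes.values())
    below = sum(n for n in sizes.values() if n < k_threshold)
    return {
        "k": smallest,                      # size of the smallest group
        "max_risk": 1 / smallest,           # worst-case chance of a correct match
        "share_below_threshold": below / len(records),
    }


# Made-up transformed records.
records = [
    {"age_band": "40-44", "postal_prefix": "K1A"},
    {"age_band": "40-44", "postal_prefix": "K1A"},
    {"age_band": "40-44", "postal_prefix": "K1A"},
    {"age_band": "90-94", "postal_prefix": "P0T"},
]
summary = k_anonymity_summary(records, ("age_band", "postal_prefix"), k_threshold=2)
print(summary)
```

Here the single record in the 90-94 band makes k equal to 1, signalling that further generalization or suppression is needed.
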
Verify and document the process

Perform quality checks and record the transformations applied to ensure transparency and reproducibility.

What this means

Confirm that the dataset has been properly de-identified and record what changes were made.

What to do

Check that identifiers were removed or transformed and document the methods used.

Example

User documents which transformations were applied (e.g., age generalization, postal code truncation) and records the assessed re-identification risk. This documentation allows the risk to be reassessed if additional data are added later.
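
Such documentation can be as simple as a small machine-readable log kept alongside the dataset. The sketch below shows one possible shape; the field names, file name, date, and risk value are all hypothetical placeholders rather than a required format.

```python
import json


def build_deid_log(dataset, run_date, transformations, risk):
    """Return a JSON string describing what was done to the dataset."""
    entry = {
        "dataset": dataset,
        "date": run_date,
        "transformations": transformations,
        "risk_assessment": risk,
    }
    return json.dumps(entry, indent=2)


log_text = build_deid_log(
    dataset="example_registry.csv",   # hypothetical file name
    run_date="2024-01-15",            # hypothetical date
    transformations=[
        {"variable": "age", "method": "generalization", "detail": "5-year bands"},
        {"variable": "postal_code", "method": "truncation", "detail": "first 3 characters kept"},
        {"variable": "name", "method": "suppression", "detail": "column removed"},
    ],
    risk={"metric": "smallest group size (k)", "value": 5},  # illustrative value
)
print(log_text)
```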

Tools & Tips

Commonly Used Tools

ARX (Learn more – link coming soon)
sdcMicro (R) (Learn more - link to Edward’s workshop)
Custom Python or R scripts (Learn more - link coming soon)

Unstructured Data Workflow

Identify direct identifiers

Locate and remove explicit identifiers appearing in free text such as names, addresses, phone numbers, or health card numbers.

What this means

Direct identifiers may appear anywhere in narrative text and directly reveal an individual’s identity.

What to do

Review documents for identifiers such as names, phone numbers, email addresses, addresses, or medical record numbers and remove or replace them.

Example

“Patient John Smith presented with chest pain.”

→ “Patient [NAME REDACTED] presented with chest pain.”
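
A first pass over free text can be sketched with regular expressions plus a list of names already known from the structured record. The patterns below are deliberately simple assumptions (North American phone format, basic email shape), and the names and contact details are made up. Pattern matching alone misses many identifiers, so it complements dedicated tools and manual review rather than replacing them.

```python
import re

# Simplified, illustrative patterns; real notes need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text, known_names=()):
    """Replace known names and pattern matches with bracketed placeholders."""
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME REDACTED]", text, flags=re.IGNORECASE)
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


note = "Patient John Smith presented with chest pain. Call 613-555-0199 or j.smith@example.org."
print(redact(note, known_names=["John Smith"]))
```
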
Identify contextual identifiers

Evaluate contextual details (e.g., age, occupation, rare conditions, locations, event dates) that could reveal identity when combined.

What this means

Narrative text often contains contextual information that could indirectly identify someone.

What to do

Review documents for demographic or contextual details such as age, location, occupation, rare diagnoses, or unique events.

Example

“A 92-year-old retired pilot from a small town in Ontario…”

This combination of details may identify an individual.
Select transformation methods

Choose appropriate techniques such as redaction, placeholder replacement, or generalization.
What this means

Choose how sensitive elements in the text will be modified.

What to do

Common approaches include:
- Redaction (remove sensitive text)
- Placeholder replacement (e.g., [NAME], [DATE])
- Generalization (exact details replaced with broader categories)
Example

“March 3, 2022” → “[DATE]”
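
Placeholder replacement and generalization for dates can be sketched together. The function below is a hypothetical helper that recognizes only two date formats (for example "March 3, 2022" and "2022-03-03"); clinical text contains many more, so treat it as an illustration of the technique rather than a complete date scrubber.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(
    rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b|\b\d{{4}}-\d{{2}}-\d{{2}}\b"
)


def replace_dates(text, keep_year=False):
    """Replace dates with [DATE], or with [DATE yyyy] when keep_year is set."""
    def _placeholder(match):
        if keep_year:
            year = re.search(r"\d{4}", match.group()).group()
            return f"[DATE {year}]"
        return "[DATE]"
    return DATE_PATTERN.sub(_placeholder, text)


sentence = "Seen on March 3, 2022 and again on 2022-04-01."
print(replace_dates(sentence))
print(replace_dates(sentence, keep_year=True))
```

Keeping only the year is a form of generalization: it preserves some temporal context while removing the exact day.
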
Apply de-identification tools

Use automated tools or manual review workflows to remove identifiers from documents.

What this means

Use software or manual review processes to identify and remove sensitive information from documents.

What to do

Apply automated text-scrubbing tools or perform manual review to redact identifiers.

Example tools

Philter, MITRE Tool, or manual review workflows.
Review remaining contextual information

Confirm that remaining text does not contain details that could reasonably reveal identity.
Verify and document the process

Perform quality checks and record the transformations applied to ensure transparency and reproducibility.

What this means

Confirm that identifiers were removed and document how the text was processed.

What to do

Review the final documents to ensure identifiers were removed or generalized and record the methods used.

Example

User documents which identifiers were redacted or generalized and records the assessed re-identification risk so it can be reassessed if additional data are added later.

Tools & Tips

Commonly Used Tools

Philter (Learn more – link coming soon)
MITRE tool (Learn more – link coming soon)
Manual review workflows (Learn more – link coming soon)

Imaging Data Workflow

Identify identifiers in metadata

Locate explicit identifiers stored in image metadata (e.g., DICOM headers).

What this means

Medical imaging files often contain identifying information in metadata fields.

What to do

Review metadata fields for identifiers such as patient name, patient ID, date of birth, or institution information.

Example

PatientName field SMITH, JOHN → FAKE, NAME or REDACTED NAME
Identify identifiers in image pixels

Check whether identifying information is embedded directly within the image.

What this means

Some images contain burned-in text or overlays that display patient identifiers.

What to do

Inspect images for identifiers such as names, IDs, or dates embedded in the pixel data.

Example

Patient name visible in ultrasound image overlay.
Select de-identification approach

Choose appropriate methods to remove or modify identifiers.
What this means

Different techniques may be required for metadata and pixel-based identifiers.

What to do

Common approaches include:
- Removing metadata fields
- Replacing identifiers with study IDs
- Cropping or masking burned-in identifiers
- Shifting dates in metadata
Example

StudyDate → shifted by several days
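
The metadata approaches above can be sketched on a plain dictionary standing in for a DICOM header. This is an assumption made to keep the example self-contained: a real workflow would read and write files with a DICOM library or anonymizer. The removal list, the secret, and the 30-day window are illustrative choices. Deriving the offset from the patient ID keeps the interval between one patient's studies intact.

```python
import hashlib
from datetime import datetime, timedelta

# Illustrative removal list, using standard DICOM attribute keywords.
REMOVE_FIELDS = ("PatientBirthDate", "InstitutionName")


def shift_days_for(patient_id, secret, max_days=30):
    """Derive a stable per-patient offset between 1 and max_days."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def deidentify_header(header, study_id, secret):
    """Remove fields, replace identifiers, and shift StudyDate (YYYYMMDD)."""
    offset = shift_days_for(header["PatientID"], secret)
    out = {k: v for k, v in header.items() if k not in REMOVE_FIELDS}
    out["PatientName"] = "REDACTED"
    out["PatientID"] = study_id
    original = datetime.strptime(header["StudyDate"], "%Y%m%d")
    out["StudyDate"] = (original - timedelta(days=offset)).strftime("%Y%m%d")
    return out


# Made-up headers for two studies of the same patient, one week apart.
base = {"PatientName": "SMITH, JOHN", "PatientID": "987654",
        "PatientBirthDate": "19300101", "InstitutionName": "Example Hospital"}
first = deidentify_header({**base, "StudyDate": "20220303"}, "S-001", secret="project-secret")
second = deidentify_header({**base, "StudyDate": "20220310"}, "S-001", secret="project-secret")
```

Burned-in pixel text is not touched by this step and still needs masking or cropping.
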
Apply imaging de-identification tools

Use imaging software to remove or modify metadata and embedded identifiers.

What this means

Specialized tools are typically required to de-identify imaging datasets.

What to do

Run anonymization scripts or imaging tools that modify metadata and remove burned-in identifiers.

Example tools

DICOM anonymizers, PixelMed tools, Python scripts.
Review images and metadata

Confirm that identifiers were successfully removed from both metadata and pixel data.
Verify and document the process

Perform quality checks and record the transformations applied.

What this means

Confirm that all identifying information has been removed.

What to do

Review sample images and metadata and document the transformations performed.

Example

User records which metadata fields were removed and whether pixel-based identifiers were masked or cropped.

Tools & Tips

See our workshop

Commonly Used Tools

DICOM Anonymizer (Learn more – link coming soon)
PixelMed (Learn more – link coming soon)
Custom Python scripts (Learn more – link coming soon)

Other Data Workflow

Identify direct identifiers

Locate and remove explicit identifiers stored in metadata or associated participant information.

What this means

Other datasets often include a separate file or metadata fields linking data to a participant.

What to do

Review the dataset and associated files for identifiers such as names, participant IDs linked to individuals, email addresses, or device registration details.

Example (ECG Dataset)

A wearable ECG export may contain metadata fields such as:

PatientName: John Smith

PatientID: 987654

DeviceOwnerEmail: [email protected]

De-identified version:

PatientName → removed

PatientID → replaced with Study_ID

DeviceOwnerEmail → removed
Identify potentially identifying variables

Evaluate variables that could indirectly reveal identity.

What this means

Certain variables may not directly identify someone but could reveal identity when combined with other information.

What to do

Review fields such as demographics, geographic location, timestamps, or biological data that could uniquely identify individuals.

Example

Whole genome sequencing data can be inherently identifiable because an individual’s genome is unique.

Example dataset fields:

Sample_ID

Age

Sex

Sequencing_Batch

Even without names, genomic sequences themselves may allow re-identification if linked with external databases.
Select transformation methods

Choose techniques to reduce identification risk while maintaining useful data.
What this means

Certain variables may need to be modified or reduced in precision before sharing.

What to do

Common approaches include:
- Pseudonymization of participant identifiers
- Aggregation of time-based data
- Removal of precise location data
- Sharing processed or summary data instead of raw data
Example (wearable device data)

Raw step count data recorded every second:

08:01:01 → 4 steps

08:01:02 → 5 steps

Transformed dataset

Step counts aggregated to hourly totals instead of second-level timestamps.
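
The aggregation described above can be sketched as follows. The timestamps include a made-up date and the third sample is invented for illustration; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_totals(samples):
    """Sum step counts per hour from (ISO timestamp, steps) pairs."""
    totals = defaultdict(int)
    for timestamp, steps in samples:
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        totals[hour.isoformat()] += steps
    return dict(totals)


samples = [
    ("2024-01-15T08:01:01", 4),
    ("2024-01-15T08:01:02", 5),
    ("2024-01-15T09:30:00", 2),
]
print(hourly_totals(samples))
```

Coarser intervals (for example daily totals) reduce uniqueness further when hourly patterns are still distinctive.
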
Apply de-identification methods

Use analysis tools or scripts to implement the selected transformations.

What this means

De-identification of these datasets is typically done using analysis software.

What to do

Use scripts or statistical tools to remove identifiers, aggregate timestamps, or modify sensitive variables.

Example (ECG time-series data)

Remove metadata fields containing patient information and replace the patient identifier with a random study ID.

Example tools: Python scripts, R workflows, or statistical software.
Review dataset for unique patterns

Check whether remaining data could still identify individuals.

What this means

Even without direct identifiers, some data patterns may still be unique.

What to do

Review whether rare biological signals, unique movement patterns, or small participant groups could allow identification.

Example (wearable data)

Continuous GPS traces showing a device located every night at the same address could reveal a participant’s home location.

Example (genomics)

A rare genetic variant associated with a specific family could potentially identify participants.
Verify and document the process

Perform quality checks and record the transformations applied.

What this means

Confirm that identifying information has been removed and document the steps taken.

What to do

Review the dataset after processing and record the transformations applied so the process is transparent and reproducible.

Example

User documents that participant names and device emails were removed, timestamps were aggregated to hourly intervals, GPS coordinates were removed, and participant IDs were replaced with study IDs. The assessed re-identification risk is recorded so it can be reassessed if additional data are added later.

Tools & Tips

Commonly Used Tools

Python data processing libraries (Learn more – link coming soon)
R statistical packages (Learn more – link coming soon)
Custom analysis workflows (Learn more – link coming soon)
