Step-by-step de-identification guide
Data submitted to ARCHIMEDES must comply with applicable privacy regulations and ethical approvals. In many cases this involves de-identifying or coding data with consent prior to submission.
The tools and resources below are provided for educational purposes only, and researchers are responsible for ensuring their data is prepared appropriately.
Ready to begin de-identifying your data?
Select your data type below to view step-by-step guidance, key risks to consider, and commonly used methods and tools.
Structured Data
Organized data in fixed formats.
Examples: electronic medical/health records (EMRs/EHRs), Excel or CSV spreadsheets, databases, registries, etc.
Unstructured Data
Raw, unorganized information.
Examples: clinical notes, discharge summaries, radiology reports, surgical summaries, etc.
Imaging Data
Image files from clinical exams.
Examples: medical images (MRI, CT, X-ray, ultrasound, etc.) in various formats, e.g., DICOM, NIfTI, JPEG/PNG.
Other data
Additional data types beyond structured, text, or imaging data.
Examples: genomics, wearable/sensor data, waveforms, audio/video, or combined datasets.
How to de-identify your data type
The steps below outline a typical process for identifying sensitive information, applying appropriate de-identification methods, and checking re-identification risk.
Structured Data Workflow
-
Identify direct identifiers
Locate and remove explicit identifiers such as names, addresses, health card numbers, or email addresses.
What this means
Direct identifiers are variables that, on their own, uniquely identify an individual.
What to do
Remove or replace variables such as names, health card numbers, email addresses, or phone numbers.
Example
Name → removed
Health card number → replaced with Study_ID
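As a minimal sketch of this step, the snippet below drops a name column and swaps the health card number for a study code. The field names (`Name`, `HealthCard`), the `S0001`-style codes, and the helper `strip_direct_identifiers` are hypothetical and chosen only for illustration; sequential codes are an assumption and may not suit every study.

```python
def strip_direct_identifiers(rows, drop_fields, id_field, new_id_field="Study_ID"):
    """Drop direct identifiers and replace id_field with a sequential study code."""
    id_map = {}  # original identifier -> study code; keep securely or discard
    cleaned = []
    for row in rows:
        original_id = row[id_field]
        if original_id not in id_map:
            id_map[original_id] = f"S{len(id_map) + 1:04d}"
        # Keep every field except the dropped ones and the original identifier.
        kept = {k: v for k, v in row.items()
                if k not in drop_fields and k != id_field}
        cleaned.append({new_id_field: id_map[original_id], **kept})
    return cleaned, id_map


# Hypothetical records for illustration only.
records = [
    {"Name": "John Smith", "HealthCard": "1234-567-890", "Age": 43},
    {"Name": "Ann Lee", "HealthCard": "9876-543-210", "Age": 67},
]
cleaned, id_map = strip_direct_identifiers(records, {"Name"}, "HealthCard")
```

The returned `id_map` is the re-identification key. If coding (rather than full de-identification) is intended, it would need to be stored separately from the shared dataset.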
-
Identify and assess quasi-identifiers
Evaluate indirect identifiers (e.g., age, postal code, gender, dates) that could reveal identity when combined.
What this means
Quasi-identifiers do not identify someone on their own but may reveal identity when combined with each other or with outside information.
What to do
Review variables such as age, postal code, gender, and dates to determine if combinations could identify someone.
Example
Age + postal code + gender may uniquely identify an individual in a small population.
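One way to see which combinations are risky is to count how many records share each combination of values. The sketch below does this with the standard library; the column names and sample rows are hypothetical.

```python
from collections import Counter


def combination_counts(rows, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def unique_records(rows, quasi_identifiers):
    """Return the records whose quasi-identifier combination appears only once."""
    counts = combination_counts(rows, quasi_identifiers)
    return [row for row in rows
            if counts[tuple(row[q] for q in quasi_identifiers)] == 1]


# Hypothetical records for illustration only.
records = [
    {"Age": 43, "Postal": "K1A0B1", "Gender": "F"},
    {"Age": 43, "Postal": "K1A0B1", "Gender": "F"},
    {"Age": 92, "Postal": "P0T2E0", "Gender": "M"},
]
flagged = unique_records(records, ["Age", "Postal", "Gender"])
```

Here the third record is flagged because no other record shares its age, postal code, and gender.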
-
Select transformation methods
Choose appropriate techniques such as generalization, suppression, pseudonymization, or noise addition.
What this means
Choose how quasi-identifiers will be modified to reduce re-identification risk.
What to do
Common approaches include:
- Generalization (Age 43 → 40–45)
- Suppression (remove variable)
- Pseudonymization (replace IDs with study codes)
Example
Postal code: K1A0B1 → K1A***
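Generalization and truncation can be sketched in a few lines. The helper names below are hypothetical, and the choice of non-overlapping five-year bands (so that 43 falls in 40-44) and of keeping three postal code characters are assumptions to adjust to your own risk assessment.

```python
def generalize_age(age, width=5):
    """Map an exact age to a non-overlapping band, e.g. 43 -> '40-44'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postal_code(code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    code = code.replace(" ", "").upper()
    return code[:keep] + "*" * (len(code) - keep)


age_band = generalize_age(43)
masked_code = truncate_postal_code("K1A0B1")
```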
-
Apply de-identification tools
Use software or structured workflows to implement the selected transformations.
What this means
Use software or scripts to apply the transformations to your dataset.
What to do
Load the dataset into a tool and apply the selected transformations to relevant variables.
Example tools
ARX, Amnesia, R packages, Python scripts
-
Assess re-identification risk
Evaluate whether individuals could reasonably be re-identified from the transformed data.
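A simple, commonly used proxy for re-identification risk is the size of the smallest group of records that share the same quasi-identifier values (often called k). The sketch below computes it along with a worst-case risk of 1/k. The function name, the threshold of 5, and the sample rows are assumptions for illustration, and dedicated tools offer more complete risk models.

```python
from collections import Counter


def risk_summary(rows, quasi_identifiers, k_threshold=5):
    """Summarise group sizes for the chosen quasi-identifiers (rows must be non-empty)."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    sizes = Counter(keys)
    smallest = min(sizes.values())
    # Records that sit in a group smaller than the chosen threshold.
    at_risk = sum(1 for key in keys if sizes[key] < k_threshold)
    return {
        "k": smallest,
        "max_risk": 1 / smallest,
        "share_below_threshold": at_risk / len(rows),
    }


# Hypothetical transformed records for illustration only.
rows = [{"Age": "40-44", "Postal": "K1A***"}] * 3 + [{"Age": "90+", "Postal": "P0T***"}]
summary = risk_summary(rows, ["Age", "Postal"], k_threshold=2)
```

In this toy dataset one record is still unique, so further generalization or suppression would be needed before sharing.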
-
Verify and document the process
Perform quality checks and record the transformations applied to ensure transparency and reproducibility.
What this means
Confirm that the dataset has been properly de-identified and record what changes were made.
What to do
Check that identifiers were removed or transformed and document the methods used.
Example
The user documents which transformations were applied (e.g., age generalization, postal code truncation) and records the assessed re-identification risk. This documentation allows the risk to be reassessed if additional data are added later.
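Documentation can be as simple as a small machine-readable log stored alongside the dataset. The sketch below writes one as JSON; the field names, dataset name, date, and risk value are all hypothetical placeholders rather than a required format.

```python
import json


def build_deid_record(dataset_name, transformations, assessed_risk, assessed_on):
    """Serialise a de-identification log entry as JSON text."""
    record = {
        "dataset": dataset_name,
        "assessed_on": assessed_on,
        "transformations": transformations,
        "assessed_risk": assessed_risk,
    }
    return json.dumps(record, indent=2)


# Placeholder values for illustration only.
log_text = build_deid_record(
    "cohort_v1.csv",
    [
        {"variable": "Age", "method": "generalization", "detail": "five-year bands"},
        {"variable": "PostalCode", "method": "truncation", "detail": "first 3 characters kept"},
    ],
    {"smallest_group_size": 5},
    "2024-01-15",
)
```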
Tools & Tips
Commonly Used Tools
- ARX (Learn more – link coming soon)
- sdcMicro (R) (Learn more - link to Edward’s workshop)
- Custom Python or R scripts (Learn more - link coming soon)
Unstructured Data Workflow
-
Identify direct identifiers
Locate and remove explicit identifiers appearing in free text such as names, addresses, phone numbers, or health card numbers.
What this means
Direct identifiers may appear anywhere in narrative text and directly reveal an individual’s identity.
What to do
Review documents for identifiers such as names, phone numbers, email addresses, addresses, or medical record numbers and remove or replace them.
Example
“Patient John Smith presented with chest pain.”
→ “Patient [NAME REDACTED] presented with chest pain.”
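Pattern-based redaction can catch identifiers with a regular shape, such as email addresses and phone numbers, while names usually need a known list or a dedicated tool. The sketch below is a rough illustration only: the patterns are simplified assumptions (North American style phone numbers), the helper name is hypothetical, and it will miss many real-world variants.

```python
import re

# Simplified patterns; real text will contain formats these do not cover.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def redact_text(text, known_names=()):
    """Replace emails, phone numbers, and listed names with placeholders."""
    text = EMAIL.sub("[EMAIL REDACTED]", text)
    text = PHONE.sub("[PHONE REDACTED]", text)
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME REDACTED]", text, flags=re.IGNORECASE)
    return text


note = "Patient John Smith presented with chest pain. Call 613-555-0199."
redacted = redact_text(note, known_names=["John Smith"])
```

Manual review of a sample of the output remains important, since misspelled names or unusual formats will pass through untouched.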
-
Identify contextual identifiers
Evaluate contextual details (e.g., age, occupation, rare conditions, locations, event dates) that could reveal identity when combined.
What this means
Narrative text often contains contextual information that could indirectly identify someone.
What to do
Review documents for demographic or contextual details such as age, location, occupation, rare diagnoses, or unique events.
Example
“A 92-year-old retired pilot from a small town in Ontario…”
This combination of details may identify an individual.
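Contextual details are hard to detect automatically, but a script can at least flag passages for manual review. The sketch below flags very high ages and terms from a reviewer-supplied watch list. The age cut-off of 90, the watch list, and the helper name are assumptions for illustration.

```python
import re

AGE = re.compile(r"\b(\d{1,3})-year-old\b")


def flag_for_review(text, watch_terms=(), min_age=90):
    """Return the reasons a passage may need manual review."""
    reasons = []
    for match in AGE.finditer(text):
        if int(match.group(1)) >= min_age:
            reasons.append(f"age {match.group(1)}")
    lowered = text.lower()
    for term in watch_terms:
        if term.lower() in lowered:
            reasons.append(f"term '{term}'")
    return reasons


reasons = flag_for_review(
    "A 92-year-old retired pilot from a small town in Ontario",
    watch_terms=["pilot", "surgeon"],
)
```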
-
Select transformation methods
Choose appropriate techniques such as redaction, placeholder replacement, or generalization.
What this means
Choose how sensitive elements in the text will be modified.
What to do
Common approaches include:
- Redaction (remove sensitive text)
- Placeholder replacement (e.g., [NAME], [DATE])
- Generalization (exact details replaced with broader categories)
Example
“March 3, 2022” → “[DATE]”
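Placeholder replacement for dates could be sketched with regular expressions, as below. Only two formats are covered (for example "March 3, 2022" and "2022-03-03"), which is an assumption; clinical text typically contains many more date styles.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
LONG_DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def replace_dates(text, placeholder="[DATE]"):
    """Replace long-form and ISO-style dates with a placeholder."""
    text = LONG_DATE.sub(placeholder, text)
    return ISO_DATE.sub(placeholder, text)


result = replace_dates("Seen on March 3, 2022 and again on 2022-03-10.")
```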
-
Apply de-identification tools
Use automated tools or manual review workflows to remove identifiers from documents.
What this means
Use software or manual review processes to identify and remove sensitive information from documents.
What to do
Apply automated text-scrubbing tools or perform manual review to redact identifiers.
Example tools
Philter, MITRE Tool, or manual review workflows.
-
Review remaining contextual information
Confirm that remaining text does not contain details that could reasonably reveal identity.
-
Verify and document the process
Perform quality checks and record the transformations applied to ensure transparency and reproducibility.
What this means
Confirm that identifiers were removed and document how the text was processed.
What to do
Review the final documents to ensure identifiers were removed or generalized and record the methods used.
Example
The user documents which identifiers were redacted or generalized and records the assessed re-identification risk so it can be reassessed if additional data are added later.
Tools & Tips
Commonly Used Tools
- Philter (Learn more – link coming soon)
- MITRE tool (Learn more – link coming soon)
- Manual review workflows (Learn more – link coming soon)
Imaging Data Workflow
-
Identify identifiers in metadata
Locate explicit identifiers stored in image metadata (e.g., DICOM headers).
What this means
Medical imaging files often contain identifying information in metadata fields.
What to do
Review metadata fields for identifiers such as patient name, patient ID, date of birth, or institution information.
Example
PatientName field: SMITH, JOHN → FAKE, NAME or REDACTED NAME
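To illustrate the idea without depending on a particular imaging library, the sketch below treats a header as a plain dictionary keyed by standard DICOM attribute keywords. In practice the header would be read and written with a DICOM library or anonymizer; the helper name, the list of fields to remove, and the sample values are assumptions.

```python
def deidentify_header(header, study_id,
                      remove=("PatientBirthDate", "InstitutionName")):
    """Return a copy of the header with identifying fields removed or replaced."""
    cleaned = {k: v for k, v in header.items() if k not in remove}
    if "PatientName" in cleaned:
        cleaned["PatientName"] = "REDACTED"
    if "PatientID" in cleaned:
        cleaned["PatientID"] = study_id
    return cleaned


# Hypothetical header values for illustration only.
header = {
    "PatientName": "SMITH, JOHN",
    "PatientID": "987654",
    "PatientBirthDate": "19500101",
    "InstitutionName": "Example Hospital",
    "Modality": "US",
}
clean_header = deidentify_header(header, study_id="S0001")
```

Real headers contain many more potentially identifying fields, including private vendor tags, so a short remove-list like this one should not be treated as complete.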
-
Identify identifiers in image pixels
Check whether identifying information is embedded directly within the image.
What this means
Some images contain burned-in text or overlays that display patient identifiers.
What to do
Inspect images for identifiers such as names, IDs, or dates embedded in the pixel data.
Example
Patient name visible in ultrasound image overlay.
-
Select de-identification approach
Choose appropriate methods to remove or modify identifiers.
What this means
Different techniques may be required for metadata and pixel-based identifiers.
What to do
Common approaches include:
- Removing metadata fields
- Replacing identifiers with study IDs
- Cropping or masking burned-in identifiers
- Shifting dates in metadata
Example
StudyDate → shifted by several days
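Date shifting can be sketched as follows, assuming dates are stored as YYYYMMDD strings (the DICOM date format) and that one offset is chosen per participant so intervals between that participant's studies are preserved. The helper name and the offset of 11 days are illustrative assumptions.

```python
from datetime import datetime, timedelta


def shift_dicom_date(value, offset_days):
    """Shift a YYYYMMDD date string by a fixed number of days."""
    shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


# The same offset is applied to every date belonging to one participant.
offset = -11
first_study = shift_dicom_date("20220303", offset)
second_study = shift_dicom_date("20220310", offset)
```

Both studies move back by the same amount, so the seven-day gap between them is unchanged.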
-
Apply imaging de-identification tools
Use imaging software to remove or modify metadata and embedded identifiers.
What this means
Specialized tools are typically required to de-identify imaging datasets.
What to do
Run anonymization scripts or imaging tools that modify metadata and remove burned-in identifiers.
Example tools
DICOM anonymizers, PixelMed tools, Python scripts.
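Masking a burned-in identifier amounts to overwriting a rectangle of pixels. The sketch below shows the idea on a small grid stored as nested lists; real images would normally be handled as arrays through an imaging library, and the region coordinates here are hypothetical.

```python
def mask_region(pixels, top, left, height, width, fill=0):
    """Return a copy of a 2D pixel grid with a rectangle overwritten by `fill`."""
    masked = [row[:] for row in pixels]  # copy so the input is left untouched
    for r in range(top, min(top + height, len(masked))):
        for c in range(left, min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


# A 4x4 toy image; the top row stands in for a text overlay.
image = [[9] * 4 for _ in range(4)]
masked = mask_region(image, top=0, left=0, height=1, width=4)
```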
-
Review images and metadata
Confirm that identifiers were successfully removed from both metadata and pixel data.
-
Verify and document the process
Perform quality checks and record the transformations applied.
What this means
Confirm that all identifying information has been removed.
What to do
Review sample images and metadata and document the transformations performed.
Example
The user records which metadata fields were removed and whether pixel-based identifiers were masked or cropped.
Tools & Tips
See our workshop
- Practical approaches to de-identification of medical imaging data: from metadata to pixel-level protection
- Recording Link
- Slides Link
Commonly Used Tools
- DICOM Anonymizer (Learn more – link coming soon)
- PixelMed (Learn more – link coming soon)
- Custom Python scripts (Learn more – link coming soon)
Other Data Workflow
-
Identify direct identifiers
Locate and remove explicit identifiers stored in metadata or associated participant information.
What this means
Other datasets often include a separate file or metadata fields linking data to a participant.
What to do
Review the dataset and associated files for identifiers such as names, participant IDs linked to individuals, email addresses, or device registration details.
Example (ECG Dataset)
A wearable ECG export may contain metadata fields such as:
PatientName: John Smith
PatientID: 987654
DeviceOwnerEmail: [email protected]
De-identified version:
PatientName → removed
PatientID → replaced with Study_ID
DeviceOwnerEmail → removed
-
Identify potentially identifying variables
Evaluate variables that could indirectly reveal identity.
What this means
Certain variables may not directly identify someone but could reveal identity when combined with other information.
What to do
Review fields such as demographics, geographic location, timestamps, or biological data that could uniquely identify individuals.
Example
Whole genome sequencing data can be inherently identifiable because an individual’s genome is unique.
Example dataset fields:
Sample_ID
Age
Sex
Sequencing_Batch
Even without names, genomic sequences themselves may allow re-identification if linked with external databases.
-
Select transformation methods
Choose techniques to reduce identification risk while maintaining useful data.
What this means
Certain variables may need to be modified or reduced in precision before sharing.
What to do
Common approaches include:
- Pseudonymization of participant identifiers
- Aggregation of time-based data
- Removal of precise location data
- Sharing processed or summary data instead of raw data
Example (wearable device data)
Raw step count data recorded every second:
08:01:01 → 4 steps
08:01:02 → 5 steps
Transformed dataset
Step counts aggregated to hourly totals instead of second-level timestamps.
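The aggregation above could be sketched like this, assuming readings arrive as (HH:MM:SS, steps) pairs for a single day. The helper name and the third sample reading are hypothetical.

```python
from collections import defaultdict


def aggregate_hourly(readings):
    """Sum (HH:MM:SS, steps) readings into totals per hour."""
    totals = defaultdict(int)
    for timestamp, steps in readings:
        hour = timestamp.split(":")[0]
        totals[f"{hour}:00"] += steps
    return dict(totals)


readings = [("08:01:01", 4), ("08:01:02", 5), ("09:15:30", 7)]
hourly = aggregate_hourly(readings)
```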
-
Apply de-identification methods
Use analysis tools or scripts to implement the selected transformations.
What this means
De-identification of these datasets is typically done using analysis software.
What to do
Use scripts or statistical tools to remove identifiers, aggregate timestamps, or modify sensitive variables.
Example (ECG time-series data)
Remove metadata fields containing patient information and replace the patient identifier with a random study ID.
Example tools: Python scripts, R workflows, or statistical software.
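A minimal sketch of the ECG example follows. It drops the patient fields and assigns a random study ID, reusing the same ID when a participant appears again. The field names mirror the earlier ECG example; the `STUDY-` prefix, the ID length, and the helper name are assumptions.

```python
import secrets


def pseudonymize_recording(metadata, id_map, id_field="PatientID",
                           drop=("PatientName", "DeviceOwnerEmail")):
    """Remove identifying metadata and replace the patient ID with a random study ID."""
    original = metadata[id_field]
    if original not in id_map:
        id_map[original] = "STUDY-" + secrets.token_hex(4)
    cleaned = {k: v for k, v in metadata.items()
               if k not in drop and k != id_field}
    cleaned["Study_ID"] = id_map[original]
    return cleaned


# Hypothetical recordings from the same participant.
id_map = {}
rec_a = pseudonymize_recording(
    {"PatientName": "John Smith", "PatientID": "987654", "SampleRateHz": 250}, id_map)
rec_b = pseudonymize_recording(
    {"PatientName": "John Smith", "PatientID": "987654", "SampleRateHz": 500}, id_map)
```

Unlike a hash of the original identifier, a random ID cannot be recomputed from the patient ID, so `id_map` is the only link back to the participant.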
-
Review dataset for unique patterns
Check whether remaining data could still identify individuals.
What this means
Even without direct identifiers, some data patterns may still be unique.
What to do
Review whether rare biological signals, unique movement patterns, or small participant groups could allow identification.
Example (wearable data)
Continuous GPS traces showing a device located every night at the same address could reveal a participant’s home location.
Example (genomics)
A rare genetic variant associated with a specific family could potentially identify participants.
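For the GPS case, a rough check is to look for a location that recurs at night across several dates. The sketch below rounds coordinates to about 100 m and counts distinct dates. The night window, rounding precision, three-night threshold, and sample points are assumptions for illustration.

```python
from datetime import datetime


def likely_home_locations(points, night_start=22, night_end=6,
                          min_nights=3, precision=3):
    """Find rounded coordinates seen at night on at least `min_nights` dates."""
    nights = {}
    for timestamp, lat, lon in points:
        moment = datetime.fromisoformat(timestamp)
        if moment.hour >= night_start or moment.hour < night_end:
            cell = (round(lat, precision), round(lon, precision))
            nights.setdefault(cell, set()).add(moment.date())
    return [cell for cell, dates in nights.items() if len(dates) >= min_nights]


# Hypothetical trace for illustration only.
trace = [
    ("2024-05-01T23:10:00", 45.4211, -75.6972),
    ("2024-05-02T23:05:00", 45.4211, -75.6972),
    ("2024-05-03T22:45:00", 45.4211, -75.6972),
    ("2024-05-02T12:00:00", 45.5000, -75.7000),
]
homes = likely_home_locations(trace)
```

Any location returned by a check like this suggests the trace should be removed, coarsened, or otherwise transformed before sharing.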
-
Verify and document the process
Perform quality checks and record the transformations applied.
What this means
Confirm that identifying information has been removed and document the steps taken.
What to do
Review the dataset after processing and record the transformations applied so the process is transparent and reproducible.
Example
The user documents that participant names and device emails were removed, timestamps were aggregated to hourly intervals, GPS coordinates were removed, and participant IDs were replaced with study IDs. The assessed re-identification risk is recorded so it can be reassessed if additional data are added later.
Tools & Tips
Commonly Used Tools
- Python data processing libraries (Learn more – link coming soon)
- R statistical packages (Learn more – link coming soon)
- Custom analysis workflows (Learn more – link coming soon)
See also
- De-identification workflows (coming soon)
- Tutorials and workshops (coming soon)
- Link library (coming soon)