Step-by-step de-identification guide

Data submitted to ARCHIMEDES must comply with applicable privacy regulations and ethical approvals. In many cases this involves de-identifying or coding data with consent prior to submission.

The tools and resources below are provided for educational purposes only, and researchers are responsible for ensuring their data is prepared appropriately.

Ready to begin de-identifying your data?

Select your data type below to view step-by-step guidance, key risks to consider, and commonly used methods and tools.

Structured Data

Data organized in fixed formats.

Examples: electronic medical/health records (EMRs/EHRs), Excel or CSV spreadsheets, databases, registries, etc.

Unstructured Data

Raw, unorganized information.

Examples: clinical notes, discharge summaries, radiology reports, surgical summaries, etc.

Imaging Data

Image files from clinical exams.

Examples: medical images (MRI, CT scan, X-ray, ultrasound, etc.) in various formats, e.g., DICOM, NIfTI, JPEG/PNG.

Other data

Additional data types beyond structured, text, or imaging data

Examples: genomics, wearable/sensor data, waveforms, audio/video, or combined datasets.

How to de-identify your data type

Select a data type above to see instructions for de-identifying it.

The steps below outline a typical process for identifying sensitive information, applying appropriate de-identification methods, and checking re-identification risk.

Structured Data Workflow

  1. Identify direct identifiers

    Locate and remove explicit identifiers such as names, addresses, health card numbers, or email addresses.

    What this means

    Direct identifiers uniquely identify an individual.

    What to do

    Remove or replace variables such as names, health card numbers, email addresses, or phone numbers.

    Example

    Name → removed

    Health card number → replaced with Study_ID
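
    The example above can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the column names (name, email, phone, health_card_number) and the "S" + random hex format for Study_ID are assumptions made for the example, and the sample records are invented.

```python
import secrets

# Hypothetical column names, chosen for illustration only.
DIRECT_IDENTIFIERS = ["name", "email", "phone"]
ID_COLUMN = "health_card_number"


def remove_direct_identifiers(rows):
    """Drop direct identifiers and swap the health card number for a Study_ID.

    Returns the cleaned rows and the linkage key (original ID -> Study_ID).
    """
    key = {}
    used = set()
    cleaned = []
    for row in rows:
        original_id = row[ID_COLUMN]
        if original_id not in key:
            study_id = "S" + secrets.token_hex(4)
            while study_id in used:  # very unlikely, but avoid collisions
                study_id = "S" + secrets.token_hex(4)
            used.add(study_id)
            key[original_id] = study_id
        kept = {
            column: value
            for column, value in row.items()
            if column not in DIRECT_IDENTIFIERS and column != ID_COLUMN
        }
        cleaned.append({"Study_ID": key[original_id], **kept})
    return cleaned, key


# Invented sample records.
records = [
    {"name": "John Smith", "health_card_number": "1234567890",
     "email": "js@example.org", "phone": "555-0100", "age": 43},
    {"name": "Jane Doe", "health_card_number": "0987654321",
     "email": "jd@example.org", "phone": "555-0101", "age": 67},
]
cleaned, linkage_key = remove_direct_identifiers(records)
print(cleaned)
```

    The linkage key maps original identifiers to study codes, so it is itself identifying. Whether it may be kept, and where, depends on your ethics approval.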

  2. Identify and assess quasi-identifiers

    Evaluate indirect identifiers (e.g., age, postal code, gender, dates) that could reveal identity when combined.

    What this means

    Quasi-identifiers do not identify someone alone but may reveal identity when combined.

    What to do

    Review variables such as age, postal code, gender, and dates to determine if combinations could identify someone.

    Example

    Age + postal code + gender may uniquely identify an individual in a small population.
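
    One way to check this is to count how often each combination of quasi-identifiers occurs and flag the combinations that appear only once. The sketch below assumes hypothetical column names (age, postal_code, gender) and uses invented sample rows.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age", "postal_code", "gender")  # hypothetical column names


def unique_combinations(rows, columns=QUASI_IDENTIFIERS):
    """Return quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Invented sample rows: two records share a combination, one stands alone.
rows = [
    {"age": 43, "postal_code": "K1A0B1", "gender": "F"},
    {"age": 43, "postal_code": "K1A0B1", "gender": "F"},
    {"age": 92, "postal_code": "P0T2E0", "gender": "M"},
]
flagged = unique_combinations(rows)
print(flagged)  # [(92, 'P0T2E0', 'M')]
```

    A combination that appears once in the dataset is a candidate for generalization or suppression in the next step.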

  3. Select transformation methods

    Choose appropriate techniques such as generalization, suppression, pseudonymization, or noise addition.

    What this means

    Choose how quasi-identifiers will be modified to reduce re-identification risk.

    What to do

    Common approaches include:

    • Generalization (Age 43 → 40–45)
    • Suppression (remove variable)
    • Pseudonymization (replace IDs with study codes)

    Example

    Postal code: K1A0B1 → K1A***
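
    Generalization and truncation can be written as small functions. The sketch below is one possible implementation, with two assumptions: it uses non-overlapping five-year bands (so age 43 falls in 40-44 rather than the 40–45 range shown above), and it keeps the first three characters of the postal code. The function names are illustrative.

```python
def generalize_age(age, width=5):
    """Map an exact age to a non-overlapping band, e.g. 43 -> '40-44'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postal_code(code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    compact = code.replace(" ", "").upper()
    return compact[:keep] + "*" * (len(compact) - keep)


print(generalize_age(43))              # 40-44
print(truncate_postal_code("K1A0B1"))  # K1A***
```

    Wider bands or shorter prefixes lower re-identification risk further but also reduce the analytic value of the data, so the width and prefix length are choices to make per dataset.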

  4. Apply de-identification tools

    Use software or structured workflows to implement the selected transformations.

    What this means

    Use software or scripts to apply the transformations to your dataset.

    What to do

    Load the dataset into a tool and apply the selected transformations to relevant variables.

    Example tools

    ARX, Amnesia, R packages, Python scripts

  5. Assess re-identification risk

    Evaluate whether individuals could reasonably be re-identified from the transformed data.
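
    A common way to quantify this is a k-anonymity style check: group records by their quasi-identifier values and look at the smallest group. A record in a group of size k can be matched to a person with probability at most 1/k if an attacker knows only those values. The sketch below assumes hypothetical column names and leaves the threshold k as a parameter, because the acceptable value depends on your institution's policy.

```python
from collections import Counter


def group_sizes(rows, columns):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def smallest_group(rows, columns):
    """Return the size of the smallest group (the dataset's k)."""
    if not rows:
        raise ValueError("dataset is empty")
    return min(group_sizes(rows, columns).values())


def records_below_k(rows, columns, k):
    """Return the records whose group has fewer than k members."""
    sizes = group_sizes(rows, columns)
    return [row for row in rows if sizes[tuple(row[c] for c in columns)] < k]


columns = ("age_band", "postal_prefix", "gender")  # hypothetical column names
rows = [
    {"age_band": "40-44", "postal_prefix": "K1A", "gender": "F"},
    {"age_band": "40-44", "postal_prefix": "K1A", "gender": "F"},
    {"age_band": "40-44", "postal_prefix": "K1A", "gender": "F"},
    {"age_band": "90-94", "postal_prefix": "P0T", "gender": "M"},
]
k_min = smallest_group(rows, columns)
max_risk = 1 / k_min
at_risk = records_below_k(rows, columns, k=3)
print(k_min, max_risk, at_risk)
```

    In this invented sample the single 90-94 record forms a group of one, so it would need further generalization or suppression before sharing.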

  6. Verify and document the process

    Perform quality checks and record the transformations applied to ensure transparency and reproducibility.

    What this means

    Confirm that the dataset has been properly de-identified and record what changes were made.

    What to do

    Check that identifiers were removed or transformed and document the methods used.

    Example

    The user documents which transformations were applied (e.g., age generalization, postal code truncation) and records the assessed re-identification risk. This documentation allows the risk to be reassessed if additional data are added later.
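
    Such a record can be kept as a small machine-readable file next to the dataset. The sketch below shows one possible layout; the field names, the file name, and the group-size value are placeholders for illustration, not a required format.

```python
import json
from datetime import date


def build_log(dataset, transformations, quasi_identifiers, smallest_group_size):
    """Assemble a de-identification log as a plain dictionary."""
    return {
        "dataset": dataset,
        "date_processed": date.today().isoformat(),
        "transformations": transformations,
        "risk_assessment": {
            "quasi_identifiers": quasi_identifiers,
            "smallest_group_size": smallest_group_size,
        },
    }


log = build_log(
    dataset="example_registry.csv",  # hypothetical file name
    transformations=[
        {"variable": "name", "method": "removed"},
        {"variable": "health_card_number", "method": "replaced with Study_ID"},
        {"variable": "age", "method": "generalized to 5-year bands"},
        {"variable": "postal_code", "method": "truncated to first 3 characters"},
    ],
    quasi_identifiers=["age", "postal_code", "gender"],
    smallest_group_size=5,  # placeholder value, not a recommendation
)
log_text = json.dumps(log, indent=2)
print(log_text)
```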

Tools & Tips

Commonly Used Tools

  • ARX (Learn more – link coming soon)
  • sdcMicro (R) (Learn more – link to Edward’s workshop)
  • Custom Python or R scripts (Learn more – link coming soon)

Unstructured Data Workflow

  1. Identify direct identifiers

    Locate and remove explicit identifiers appearing in free text such as names, addresses, phone numbers, or health card numbers.

    What this means

    Direct identifiers may appear anywhere in narrative text and directly reveal an individual’s identity.

    What to do

    Review documents for identifiers such as names, phone numbers, email addresses, addresses, or medical record numbers and remove or replace them.

    Example

    “Patient John Smith presented with chest pain.”

    → “Patient [NAME REDACTED] presented with chest pain.”

  2. Identify contextual identifiers

    Evaluate contextual details (e.g., age, occupation, rare conditions, locations, event dates) that could reveal identity when combined.

    What this means

    Narrative text often contains contextual information that could indirectly identify someone.

    What to do

    Review documents for demographic or contextual details such as age, location, occupation, rare diagnoses, or unique events.

    Example

    “A 92-year-old retired pilot from a small town in Ontario…”

    This combination of details may identify an individual.

  3. Select transformation methods

    Choose appropriate techniques such as redaction, placeholder replacement, or generalization.

    What this means

    Choose how sensitive elements in the text will be modified.

    What to do

    Common approaches include:

    • Redaction (remove sensitive text)
    • Placeholder replacement (e.g., [NAME], [DATE])
    • Generalization (exact details replaced with broader categories)

    Example

    “March 3, 2022” → “[DATE]”
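
    Placeholder replacement for well-structured identifiers can be sketched with regular expressions. The patterns below are simplified assumptions that cover only a few date, phone, and email formats. They do not detect names, which usually need a dedicated tool or manual review. The sample note is invented.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Simplified, illustrative patterns; real notes contain many more formats.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[DATE]", re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def scrub(text):
    """Replace each matched identifier with its placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


note = "Seen on March 3, 2022. Call 613-555-0199 or write to jsmith@example.org."
print(scrub(note))
# Seen on [DATE]. Call [PHONE] or write to [EMAIL].
```

    Pattern matching of this kind misses anything it was not written for, so its output still needs the review described in the later steps.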

  4. Apply de-identification tools

    Use automated tools or manual review workflows to remove identifiers from documents.

    What this means

    Use software or manual review processes to identify and remove sensitive information from documents.

    What to do

    Apply automated text-scrubbing tools or perform manual review to redact identifiers.

    Example tools

    Philter, MITRE Tool, or manual review workflows.

  5. Review remaining contextual information

    Confirm that remaining text does not contain details that could reasonably reveal identity.

  6. Verify and document the process

    Perform quality checks and record the transformations applied to ensure transparency and reproducibility.

    What this means

    Confirm that identifiers were removed and document how the text was processed.

    What to do

    Review the final documents to ensure identifiers were removed or generalized and record the methods used.

    Example

    The user documents which identifiers were redacted or generalized and records the assessed re-identification risk so it can be reassessed if additional data are added later.

Tools & Tips

Commonly Used Tools

  • Philter (Learn more – link coming soon)
  • MITRE tool (Learn more – link coming soon)
  • Manual review workflows (Learn more – link coming soon)

Imaging Data Workflow

  1. Identify identifiers in metadata

    Locate explicit identifiers stored in image metadata (e.g., DICOM headers).

    What this means

    Medical imaging files often contain identifying information in metadata fields.

    What to do

    Review metadata fields for identifiers such as patient name, patient ID, date of birth, or institution information.

    Example

    PatientName: SMITH, JOHN → FAKE, NAME or REDACTED NAME

  2. Identify identifiers in image pixels

    Check whether identifying information is embedded directly within the image.

    What this means

    Some images contain burned-in text or overlays that display patient identifiers.

    What to do

    Inspect images for identifiers such as names, IDs, or dates embedded in the pixel data.

    Example

    Patient name visible in ultrasound image overlay.

  3. Select de-identification approach

    Choose appropriate methods to remove or modify identifiers.

    What this means

    Different techniques may be required for metadata and pixel-based identifiers.

    What to do

    Common approaches include:

    • Removing metadata fields
    • Replacing identifiers with study IDs
    • Cropping or masking burned-in identifiers
    • Shifting dates in metadata

    Example

    StudyDate → shifted by several days
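
    Date shifting is usually applied with the same offset to every date belonging to one patient, so that intervals between studies are preserved. The sketch below works on a plain dictionary standing in for a DICOM header and relies on the DICOM date format YYYYMMDD. The choice of fields to shift, the 30-day maximum, and the keyed-hash approach to deriving the offset are assumptions for illustration.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Real DICOM keywords; which fields to shift is an assumption for this sketch.
DATE_FIELDS = ("StudyDate", "SeriesDate", "AcquisitionDate")


def offset_days(patient_id, secret, max_days=30):
    """Derive a stable per-patient offset in the range 1..max_days."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dates(header, patient_id, secret):
    """Return a copy of the header with its dates moved earlier."""
    days = offset_days(patient_id, secret)
    shifted = dict(header)
    for field in DATE_FIELDS:
        value = shifted.get(field)
        if value:
            moved = datetime.strptime(value, "%Y%m%d") - timedelta(days=days)
            shifted[field] = moved.strftime("%Y%m%d")
    return shifted


secret = b"replace-with-a-project-secret"  # placeholder; keep the real one private
header = {"StudyDate": "20220303", "SeriesDate": "20220304", "Modality": "US"}
shifted = shift_dates(header, "987654", secret)
print(shifted)
```

    Because the offset is derived from the patient identifier and a secret, repeated runs give the same shift for the same patient without storing a lookup table. Anyone holding the secret can reverse the shift, so it needs the same protection as a linkage key.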

  4. Apply imaging de-identification tools

    Use imaging software to remove or modify metadata and embedded identifiers.

    What this means

    Specialized tools are typically required to de-identify imaging datasets.

    What to do

    Run anonymization scripts or imaging tools that modify metadata and remove burned-in identifiers.

    Example tools

    DICOM anonymizers, PixelMed tools, Python scripts.

  5. Review images and metadata

    Confirm that identifiers were successfully removed from both metadata and pixel data.
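
    The metadata half of this review can be partly automated by scanning headers for sensitive fields that still hold a value. The sketch below treats each header as a plain dictionary. The list of fields, the accepted placeholder names, and the "STUDY-" prefix for study IDs are assumptions for the example. Burned-in text in the pixel data is not covered and still needs visual inspection.

```python
# Real DICOM keywords; the selection is an assumption and not exhaustive.
SENSITIVE_FIELDS = ("PatientName", "PatientID", "PatientBirthDate", "InstitutionName")
ALLOWED_PLACEHOLDERS = {"REDACTED NAME", "FAKE, NAME"}
STUDY_ID_PREFIX = "STUDY-"  # hypothetical study ID format


def find_leftover_identifiers(headers):
    """Return (header index, field) pairs that still look identifying."""
    problems = []
    for index, header in enumerate(headers):
        for field in SENSITIVE_FIELDS:
            value = header.get(field, "")
            if not value or value in ALLOWED_PLACEHOLDERS:
                continue
            if field == "PatientID" and value.startswith(STUDY_ID_PREFIX):
                continue
            problems.append((index, field))
    return problems


# Invented sample headers: the second one was not fully cleaned.
headers = [
    {"PatientName": "REDACTED NAME", "PatientID": "STUDY-0001"},
    {"PatientName": "SMITH, JOHN", "PatientID": "STUDY-0002",
     "InstitutionName": "General Hospital"},
]
print(find_leftover_identifiers(headers))
# [(1, 'PatientName'), (1, 'InstitutionName')]
```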

  6. Verify and document the process

    Perform quality checks and record the transformations applied.

    What this means

    Confirm that all identifying information has been removed.

    What to do

    Review sample images and metadata and document the transformations performed.

    Example

    The user records which metadata fields were removed and whether pixel-based identifiers were masked or cropped.

Tools & Tips

See our workshop

Commonly Used Tools

  • DICOM Anonymizer (Learn more – link coming soon)
  • PixelMed (Learn more – link coming soon)
  • Custom Python scripts (Learn more – link coming soon)

Other Data Workflow

  1. Identify direct identifiers

    Locate and remove explicit identifiers stored in metadata or associated participant information.

    What this means

    Other datasets often include a separate file or metadata fields linking data to a participant.

    What to do

    Review the dataset and associated files for identifiers such as names, participant IDs linked to individuals, email addresses, or device registration details.

    Example (ECG Dataset)

    A wearable ECG export may contain metadata fields such as:

    PatientName: John Smith

    PatientID: 987654

    DeviceOwnerEmail: [email protected]

    De-identified version:

    PatientName → removed

    PatientID → replaced with Study_ID

    DeviceOwnerEmail → removed

  2. Identify potentially identifying variables

    Evaluate variables that could indirectly reveal identity.

    What this means

    Certain variables may not directly identify someone but could reveal identity when combined with other information.

    What to do

    Review fields such as demographics, geographic location, timestamps, or biological data that could uniquely identify individuals.

    Example

    Whole genome sequencing data can be inherently identifiable because an individual’s genome is unique.

    Example dataset fields:

    Sample_ID

    Age

    Sex

    Sequencing_Batch

    Even without names, genomic sequences themselves may allow re-identification if linked with external databases.

  3. Select transformation methods

    Choose techniques to reduce identification risk while maintaining useful data.

    What this means

    Certain variables may need to be modified or reduced in precision before sharing.

    What to do

    Common approaches include:

    • Pseudonymization of participant identifiers
    • Aggregation of time-based data
    • Removal of precise location data
    • Sharing processed or summary data instead of raw data

    Example (wearable device data)

    Raw step count data recorded every second:

    08:01:01 → 4 steps

    08:01:02 → 5 steps

    Transformed dataset

    Step counts aggregated to hourly totals instead of second-level timestamps.
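
    The aggregation above can be sketched as follows. The timestamp format and the calendar date in the sample readings are assumptions, since the example shows only times of day.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(readings):
    """Sum per-second step counts into hourly totals."""
    totals = defaultdict(int)
    for timestamp, steps in readings:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        totals[moment.strftime("%Y-%m-%d %H:00")] += steps
    return dict(sorted(totals.items()))


# Invented sample readings (timestamp, steps).
readings = [
    ("2024-01-15 08:01:01", 4),
    ("2024-01-15 08:01:02", 5),
    ("2024-01-15 09:30:00", 2),
]
hourly = aggregate_hourly(readings)
print(hourly)  # {'2024-01-15 08:00': 9, '2024-01-15 09:00': 2}
```

    If exact dates are themselves sensitive, the hourly buckets can be expressed relative to a study start date instead of as calendar dates.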

  4. Apply de-identification methods

    Use analysis tools or scripts to implement the selected transformations.

    What this means

    De-identification of these datasets is typically done using analysis software.

    What to do

    Use scripts or statistical tools to remove identifiers, aggregate timestamps, or modify sensitive variables.

    Example (ECG time-series data)

    Remove metadata fields containing patient information and replace the patient identifier with a random study ID.

    Example tools: Python scripts, R workflows, or statistical software.

  5. Review dataset for unique patterns

    Check whether remaining data could still identify individuals.

    What this means

    Even without direct identifiers, some data patterns may still be unique.

    What to do

    Review whether rare biological signals, unique movement patterns, or small participant groups could allow identification.

    Example (wearable data)

    Continuous GPS traces showing a device located every night at the same address could reveal a participant’s home location.

    Example (genomics)

    A rare genetic variant associated with a specific family could potentially identify participants.
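
    For the wearable example, one rough check is to see whether a single location dominates the night-time readings. The sketch below rounds coordinates to three decimal places (on the order of 100 m) and reports the most common night-time cell with the share of nights on which it appears. The night window of 00:00 to 05:00, the rounding precision, and the sample points are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def nightly_location_share(points, start_hour=0, end_hour=5, decimals=3):
    """Return the most common night-time cell and the share of nights it occurs on."""
    nights = {}
    for timestamp, lat, lon in points:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        if start_hour <= moment.hour < end_hour:
            cell = (round(lat, decimals), round(lon, decimals))
            nights.setdefault(moment.date(), set()).add(cell)
    if not nights:
        return None, 0.0
    counts = Counter(cell for cells in nights.values() for cell in cells)
    cell, n = counts.most_common(1)[0]
    return cell, n / len(nights)


# Invented sample trace (timestamp, latitude, longitude).
points = [
    ("2024-01-15 02:00:00", 45.4211, -75.6971),
    ("2024-01-16 03:10:00", 45.4212, -75.6973),
    ("2024-01-17 01:45:00", 45.4211, -75.6972),
    ("2024-01-17 14:00:00", 45.5000, -75.6000),
]
cell, share = nightly_location_share(points)
print(cell, share)
```

    A share close to 1.0 suggests the trace exposes a likely home location, in which case removing or coarsening the location data before sharing is worth considering.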

  6. Verify and document the process

    Perform quality checks and record the transformations applied.

    What this means

    Confirm that identifying information has been removed and document the steps taken.

    What to do

    Review the dataset after processing and record the transformations applied so the process is transparent and reproducible.

    Example

    The user documents that participant names and device emails were removed, timestamps were aggregated to hourly intervals, GPS coordinates were removed, and participant IDs were replaced with study IDs. The assessed re-identification risk is recorded so it can be reassessed if additional data are added later.

Tools & Tips

Commonly Used Tools

  • Python data processing libraries (Learn more – link coming soon)
  • R statistical packages (Learn more – link coming soon)
  • Custom analysis workflows (Learn more – link coming soon)

See also

  • De-identification workflows (coming soon)
  • Tutorials and workshops (coming soon)
  • Link library (coming soon)