SUMMARY - What Counts as Personal Data

A person's name and address are obviously personal data deserving protection. But what about an IP address that could theoretically identify them but usually does not? A cookie ID tracking browsing across websites? Aggregated statistics showing that people in a neighborhood tend to shop at particular stores? An inference that someone is likely pregnant based on purchase patterns, even if no pregnancy-related information was directly collected? Legal frameworks define personal data differently, creating confusion about what information requires protection and what can be freely collected and used. Whether definitions should be broad, capturing anything potentially linked to individuals, or narrow, covering only information that directly identifies them, determines what privacy protections actually protect and what compliance costs organizations face.

The Case for Broad, Protective Definitions

Advocates argue that personal data definitions must be comprehensive because narrow definitions create exploitable loopholes. From this view, GDPR's approach is correct: personal data includes any information relating to an identifiable person, whether directly or indirectly. This means IP addresses are personal data even though most organizations cannot identify the person behind them, because someone with additional information could. Device identifiers are personal data even though they do not contain names. Behavioral patterns are personal data when they relate to specific individuals, even if those individuals are pseudonymous. Moreover, technological change means data that seems anonymous today may be identifiable tomorrow. Researchers routinely re-identify supposedly anonymized datasets, using techniques that improve constantly. Location data, browsing histories, and purchase patterns can identify individuals even without traditional identifiers. From this perspective, narrow definitions that exclude metadata, aggregated data, or inferred attributes fail to protect people from surveillance and manipulation. A company that claims IP addresses are not personal data can track users across websites, build detailed behavioral profiles, and sell that information, all while arguing it does not handle personal data. The solution requires: broad statutory definitions covering any information relating to identified or identifiable persons; a presumption that data is personal unless proven otherwise; recognition that pseudonymous and aggregated data often remain identifiable; inclusion of inferred and derived data, because algorithmic conclusions about people affect them as much as observed facts; and definitions that adapt as re-identification techniques evolve, rather than static ones built on outdated assumptions about technology.

The Case for Practical, Bounded Definitions

Others argue that overly broad personal data definitions create unworkable compliance burdens while providing minimal additional protection. From this perspective, treating everything as personal data makes the category meaningless. If IP addresses that organizations cannot link to individuals are personal data requiring full GDPR compliance, the administrative burden explodes while actual privacy protection barely improves. Aggregated statistics showing demographic trends should not require the same protections as medical records. Inferred attributes are not the same as observed facts: treating an algorithm's guess that someone might be interested in a product as equivalent to their actual health information confuses categories that require different treatment. Moreover, broad definitions create legal uncertainty. Organizations cannot determine what information requires protection, leading to overcautious data minimization that prevents beneficial uses. Research using aggregate statistics faces compliance obstacles even though it poses minimal privacy risk. From this view, personal data should be defined as information that actually identifies specific individuals or that organizations can reasonably link to individuals using information available to them. IP addresses that ISPs could link to subscribers might be personal data for ISPs but not for websites that only see the address. Cookie IDs that cannot be connected to real identities without disproportionate effort should not trigger full data protection obligations. The solution involves: risk-based approaches where protection requirements scale with identification risk, as sketched below; clear categories distinguishing directly identifying information from pseudonymous and aggregated data; practical re-identification tests asking whether organizations can actually identify people rather than whether linkage is theoretically possible; and flexibility allowing different treatment of different data types rather than one-size-fits-all regulation.
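
To make the risk-based approach concrete, here is a minimal sketch of tiered handling, assuming a small hand-made data inventory. Every tier name, field, and obligation below is a hypothetical illustration, not a category from GDPR or any other statute.

```python
# Minimal sketch of risk-tiered handling: protection obligations scale
# with how easily a field can be linked to a person. All tiers, field
# names, and obligations are hypothetical, not statutory terms.
RISK_TIERS = {
    "direct":   {"full_name", "email", "national_id"},
    "linkable": {"ip_address", "cookie_id", "device_id"},
    "aggregate": {"cohort_count", "regional_rate"},
}

OBLIGATIONS = {
    "direct":   ["consent", "access_rights", "deletion", "breach_notice"],
    "linkable": ["pseudonymization", "purpose_limits", "breach_notice"],
    "aggregate": ["re_identification_ban"],
}

def obligations_for(field):
    """Look up which controls apply to a field."""
    for tier, fields in RISK_TIERS.items():
        if field in fields:
            return OBLIGATIONS[tier]
    return OBLIGATIONS["direct"]  # unknown data: assume highest risk

print(obligations_for("ip_address"))  # scaled controls, not the full set
print(obligations_for("shoe_size"))   # unclassified -> strictest tier
```

Note that defaulting unknown fields to the strictest tier mirrors the broad camp's presumption that data is personal until proven otherwise, so the two positions differ less in mechanics than in where the default lies.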

The Re-Identification Problem

Technology increasingly enables re-identification of data that seemed anonymous. A handful of location points can uniquely identify most people in a mobility dataset. Browsing histories combined with publicly available information reveal identity. Genetic data can identify relatives who never consented to participate. From one view, this proves that narrow definitions based on current anonymization techniques are inadequate and that any data potentially linkable to individuals should be protected. From another view, it means the law should target re-identification harms rather than trying to define everything as personal data: prohibit re-identification attempts while allowing use of data that is actually de-identified in practice. Whether the law should assume all data is potentially personal because future re-identification is possible, or whether practical current identification risk should determine protections, shapes what compliance requires.
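
The uniqueness claim can be checked directly. Below is a minimal sketch of the test used in mobility-privacy studies, run on an invented three-user dataset: sample a few (place, hour) points from one person's trace and count how many traces in the dataset are consistent with them. The function and field names are assumptions for illustration.

```python
import random

def share_uniquely_identified(traces, k=4, trials=50):
    """Estimate the fraction of users whose trace is the only one
    consistent with k points sampled from it.

    traces: dict of user id -> list of (place, hour) points.
    """
    eligible = {u: t for u, t in traces.items() if len(t) >= k}
    unique = 0
    for user, trace in eligible.items():
        pinned = 0
        for _ in range(trials):
            points = set(random.sample(trace, k))
            # How many traces contain all k sampled points?
            matches = [u for u, t in traces.items() if points <= set(t)]
            if matches == [user]:
                pinned += 1
        if pinned == trials:
            unique += 1
    return unique / len(eligible) if eligible else 0.0

# Toy data: two users share a commute; the third has a distinct routine.
traces = {
    "a": [("home", 8), ("office", 9), ("gym", 18), ("home", 22)],
    "b": [("home", 8), ("office", 9), ("bar", 20), ("home", 23)],
    "c": [("home", 7), ("clinic", 10), ("office", 12), ("home", 21)],
}
print(share_uniquely_identified(traces, k=3))  # 1.0: everyone is pinned down
```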

The Metadata and Derivative Data Challenge

Metadata, information about data rather than the data itself, often reveals more than content. Who someone calls, and when, can reveal more about their life than what they say. Email metadata shows relationship networks, travel patterns, and life events. From one perspective, metadata is personal data deserving equal or greater protection than content. From another perspective, metadata often has legitimate uses for service provision and security that content analysis does not, and treating it identically to primary data prevents those uses. Similarly, inferred data derived through analytics often reveals sensitive attributes: algorithms can infer health conditions, political views, or sexual orientation from seemingly innocuous information. Whether inferred attributes count as personal data, and whether they deserve protection even when the inferences are wrong, determines what obligations algorithmic processing carries.
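
A few lines suffice to show how bare connection records become a relationship graph, with no content inspected at all. A minimal sketch; the log and names below are invented.

```python
from collections import Counter

# Call/email log: (sender, recipient, timestamp). No content anywhere.
records = [
    ("alice", "bob",    "2024-01-03T22:15"),
    ("alice", "bob",    "2024-01-04T23:40"),
    ("bob",   "alice",  "2024-01-05T22:05"),
    ("alice", "clinic", "2024-01-08T09:10"),
    ("alice", "lawyer", "2024-01-09T11:30"),
]

# Undirected edge weights: metadata alone exposes the strongest ties
# and the sensitive contacts (clinic, lawyer) without one word of content.
edges = Counter(frozenset((src, dst)) for src, dst, _ in records)
for pair, n in edges.most_common():
    print(sorted(pair), n)
```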

The Aggregation and Statistics Boundary

Individual data points clearly constitute personal data, but what about aggregates? A statistic showing that 60% of website visitors from a neighborhood clicked on ads does not identify anyone. Yet aggregates become re-identifiable as dimensions are added: slice a count by enough attributes and many cells shrink to one or two people, at which point the "aggregate" describes an individual. From one view, truly aggregated data with sufficient participants and limited dimensions should not be personal data, encouraging beneficial research and analytics. From another view, organizations claiming data is aggregated often use techniques that remain linkable to individuals, and the boundary between individual and aggregate is exploited to avoid compliance. Whether clear mathematical thresholds for anonymization exist, or whether aggregation is a spectrum requiring case-by-case assessment, determines what statistical uses are permitted.
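
One candidate for such a mathematical threshold is a minimum cell size, in the spirit of k-anonymity: publish a cell only if at least k people sit behind it. The sketch below, with an illustrative k of 10 and invented columns, shows both the rule and how added dimensions defeat it.

```python
from collections import Counter
import random

def safe_cells(rows, dims, k=10):
    """Count rows per combination of dims, suppressing any cell
    with fewer than k people behind it."""
    counts = Counter(tuple(row[d] for d in dims) for row in rows)
    return {cell: n for cell, n in counts.items() if n >= k}

# Synthetic population: coarse aggregation survives the threshold,
# but each added dimension shrinks cells toward single individuals.
rows = [{"area": random.choice("AB"),
         "age": random.randint(18, 80),
         "device": random.choice(["ios", "android"])}
        for _ in range(500)]
print(len(safe_cells(rows, ("area",))))                  # 2 large cells survive
print(len(safe_cells(rows, ("area", "age", "device"))))  # nearly all suppressed
```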

The Household and Device Challenge

Is a household a person for data protection purposes? An IP address identifies a household where multiple people live. A shared device contains data from different users. Smart home devices collect information about everyone in the home even if only one person consented. From one perspective, household data should be protected because it relates to identifiable people living there. From another perspective, treating household-level information as the personal data of each resident creates unworkable consent and rights-exercise requirements. Whether the granularity of the data determines protection, or whether any connection to individuals triggers full requirements, affects what Internet of Things and household technology deployments must address.

The Question

If narrow personal data definitions create loopholes that let organizations surveil and profile people while claiming not to handle personal data, does that justify broad definitions capturing any information potentially linkable to individuals, even if its current holders cannot make that link? When treating everything as personal data creates compliance burdens that prevent beneficial uses of genuinely de-identified information, does that mean definitions should focus on practical identification risks rather than theoretical possibilities? And if technological change means that data anonymous today may be identifiable tomorrow, should laws protect based on future re-identification potential or current actual risk, and who decides what level of re-identification risk is acceptable?
