New Delhi, Jan. 30: Just four credit or debit card transactions are enough for data mining software to identify 90 per cent of card users in an anonymised dataset of transactions, a study has shown, highlighting what scientists say is a need to revise standards for privacy in the digital world.
The study by a team of scientists in the US has suggested that patterns of human behaviour hidden in only four transactions are enough to identify individuals in a dataset covering more than a million card users that hides their names, addresses and other identifiers.
The researchers at the Massachusetts Institute of Technology and other institutions have found that data mining software with access to only three components of information --- the transaction location, the transaction date and the transaction amount --- from four transactions can successfully re-identify 90 per cent of users in a dataset that contains credit card records of 1.1 million people.
The study, published today in the journal Science, also found it was easier to identify women than men, and users with high-value transactions than those with low-value transactions, possibly because women and high-value transaction users may have more predictable behaviour patterns.
Such anonymised datasets of card transactions are at times used for research in economics, human behavioural analysis, marketing, finance, urban planning, transportation and policy-making. But the scientists say their findings should not cause any concern among credit or debit card users.
"We don't want this to be a study warning about Big Brother," said Vivek Singh, an India-born information scientist, now an assistant professor at Rutgers University in the US and a member of the research team. "We believe such data should be shared, but we need to be aware of the limitations of anonymised data. This study is the first to quantify the number of external pieces of information needed to identify an individual in a large, simply anonymised dataset."
The study by Singh and his colleagues, Yves-Alexandre de Montjoye and Alex Pentland at MIT and Laura Radaelli in Denmark, shows that computer software can use large datasets containing external information --- in this study, for example, shops where the credit cards were used --- to find distinct patterns of human behaviour and identify individuals.
"If you know four shops where a person went on some days, 90 per cent of the time, he or she was the only one who visited the shops on these days," said Singh. "So a person is likely to be identified if only four points are known."
The key idea, Singh said, is that access to a simply anonymised dataset can lead to identification of individuals about whom some information is known --- in this study, the visits to four shops.
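The matching test Singh describes can be illustrated with a short Python sketch. It is not the researchers' code: the function names (build_traces, matching_users, estimate_unicity) and the toy records are invented here for illustration, and each known "point" is assumed to be a (shop, day) pair.

```python
import random
from collections import defaultdict

def build_traces(records):
    """Group (user, shop, day) records into one set of (shop, day) points per user."""
    traces = defaultdict(set)
    for user, shop, day in records:
        traces[user].add((shop, day))
    return dict(traces)

def matching_users(traces, known_points):
    """Return every user whose trace contains all of the known points."""
    known = set(known_points)
    return [user for user, points in traces.items() if known <= points]

def estimate_unicity(traces, k=4, rng=None):
    """Fraction of users with at least k points who are singled out by k of their own points."""
    rng = rng or random.Random(0)
    eligible = [u for u, pts in traces.items() if len(pts) >= k]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        # Pretend an outsider happens to know k of this user's shop visits.
        known = rng.sample(sorted(traces[user]), k)
        if matching_users(traces, known) == [user]:
            unique += 1
    return unique / len(eligible)

# Hypothetical toy data: pseudonymous user, shop, day of visit.
records = [
    ("u1", "bakery", 1), ("u1", "cafe", 2), ("u1", "gym", 3), ("u1", "bookshop", 4),
    ("u2", "bakery", 1), ("u2", "cafe", 2), ("u2", "gym", 3), ("u2", "cinema", 4),
    ("u3", "bakery", 1), ("u3", "cafe", 2), ("u3", "gym", 3), ("u3", "cinema", 4),
]
traces = build_traces(records)
```

In this toy dataset, knowing only the bakery and bookshop visits is enough to single out u1, while u2 and u3 share every visit and so cannot be told apart however many of their points are known.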
The researchers say their results show the need to move to newer privacy approaches. For example, a user's information could be decentralised or released in small increments over time, or a known amount of noise could be added to each user's data so that the user becomes indistinguishable from a group of other users.
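The idea of making a user indistinguishable from a group can be sketched in a few lines of Python. The sketch below uses coarsening (rounding days into weeks and amounts into bands) as a simple stand-in for the noise the researchers mention; it is not their method, and the function names, the window sizes and the sample records are all assumptions made for illustration.

```python
from collections import Counter

def coarsen(record, day_window=7, amount_step=50):
    """Replace an exact (shop, day, amount) record with a coarser one."""
    shop, day, amount = record
    return (shop, day // day_window, amount // amount_step)

def smallest_group(records, **kwargs):
    """Size of the smallest group of records that look identical after coarsening."""
    counts = Counter(coarsen(r, **kwargs) for r in records)
    return min(counts.values())

# Hypothetical data: one purchase each by four different users at the same cafe.
purchases = [("cafe", 1, 12), ("cafe", 3, 30), ("cafe", 5, 44), ("cafe", 6, 8)]

exact = smallest_group(purchases, day_window=1, amount_step=1)  # every record stands alone
blurred = smallest_group(purchases)                             # all four fall into one group
```

With exact days and amounts each purchase is unique, but once days are grouped into weeks and amounts into bands of 50, all four purchases look the same. The price is that the released data carries less detail for researchers.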