
The answer is always “it depends,” but I think if a datetime is a UTC timestamp, such as a record of when an event happened, then with random sampling it shouldn’t matter? It’s just a timestamp. The information it contains might include location, might include timing relative to other events, could be correlated, but on its own? It doesn’t need anonymization. Likewise, the sequence of events should be safe to use.

I get that you can look up or de-anonymize an event by its timestamp, and the same is true of ID numbers. But it’s worse for ID numbers, because they are often permanent and reused across multiple events.
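One way to blunt the ID problem described above is per-export pseudonymization. This is a hypothetical sketch (the field names and `pseudonymize` helper are mine, not from the thread): a stable ID is swapped for a random token that is consistent *within* one export, so event sequences survive, but the mapping is discarded afterwards and can’t be joined against other datasets the way a permanent, reused ID can.

```python
import secrets

def pseudonymize(records, id_field="user_id"):
    """Replace a stable ID with a random per-export pseudonym."""
    mapping = {}  # original ID -> pseudonym; discarded after the export
    out = []
    for rec in records:
        original = rec[id_field]
        if original not in mapping:
            mapping[original] = secrets.token_hex(8)
        out.append({**rec, id_field: mapping[original]})
    return out

events = [
    {"user_id": "u42", "event": "login",    "ts": "2023-05-01T12:00:00Z"},
    {"user_id": "u42", "event": "purchase", "ts": "2023-05-01T12:05:00Z"},
    {"user_id": "u99", "event": "login",    "ts": "2023-05-01T12:07:00Z"},
]
anon = pseudonymize(events)
# u42's two events still share a pseudonym (the sequence is preserved),
# but the pseudonym is meaningless outside this one export.
```

Within the export, "same user" is still visible, which is exactly the residual linkage risk the thread is debating.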

But yeah, the risk in anonymized data is that it’s never truly both anonymous and useful. Truly anonymous data might be considered junk or random data.

Anonymized data has some utility purpose to fulfil. Perhaps “realistic” analytics is required, or you want to troubleshoot a production issue without revealing who did what to engineers. So you anonymize the fields they shouldn’t see, and create a subset of data that reproduces the issue…?
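The troubleshooting case above can be sketched as field masking: redact the “who” while keeping the “what.” This is a minimal illustration, not a production scheme; the `SENSITIVE` set, the salt, and the `mask` helper are all assumptions of mine.

```python
import hashlib

SENSITIVE = {"email", "name", "ip"}  # assumed list of fields engineers shouldn't see

def mask(record, salt="per-export-salt"):
    """Replace sensitive fields with opaque but consistent tokens."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            masked[key] = digest[:12]  # same value -> same token within this export
        else:
            masked[key] = value
    return masked

row = {"email": "alice@example.com", "name": "Alice",
       "action": "checkout", "status": 500}
safe = mask(row)
# safe["action"] and safe["status"] survive for debugging; identity
# fields become opaque tokens that still let you group by user.
```

A salted hash like this is still a one-to-one mapping, which is precisely the leak risk raised later in the thread, so the salt must stay secret and per-export.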

Anonymized data is almost always a bad approach compared to generating data from algorithmic or random sources, but sometimes we need anonymized or restricted data to start that process.



Data can be anonymous and useful. You do, however, have to define what you mean by “useful,” and use that to inform how you go about making it anonymous.

A good example is: https://gretel.ai/blog/gretel-ai-illumina-using-ai-to-create...

Full disclosure, I work at Gretel, but I thought this was relevant enough to mention.


True? But I wouldn’t call creating new data from non-anonymous data “making data anonymous”. Instead, that’s new random data whose values are constrained or based on real-world data. I’d call that newly generated data, not anonymized data.

To me, anonymized data has an inherent risk of leaking the original transaction because it is a one-to-one mapping of the original data. If you generate new data, it will by definition diverge from the production dataset in some way that might be unrealistic. For example, fields with address components might not actually point to real places, or might not be written the same way as they would be in production. Perhaps a portion of production data includes international addresses or rural routes that your software might fail to generate, or worse, maybe it would generate them incorrectly.

Frankly, generating data is a better approach than anonymizing it. I also know of anonymization techniques where good data is mixed with bad data, and statistically the bad data can be filtered out later, but only in aggregate. Still, I’m drawing a line in the sand between anonymized data that closely matches real data and “generated data”: you can still potentially learn something from the anonymized data, but you can’t learn much more from generated data than you could from the model that produced the constraints used to generate it. I’m probably explaining this poorly; it’s a bit late at night in my time zone. :)
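The “good data mixed with bad data, filtered out in aggregate” idea sounds like classic randomized response. This is a sketch under that assumption (the thread doesn’t name the technique): each record tells the truth with probability p and answers randomly otherwise, so any single row is deniable, but the population rate is recoverable from the aggregate.

```python
import random

def randomize(value, p=0.75):
    """Report the true boolean with probability p, else a fair coin flip."""
    if random.random() < p:
        return value
    return random.random() < 0.5

def estimate_rate(noisy, p=0.75):
    """Invert the noise: observed = p * true + (1 - p) * 0.5."""
    observed = sum(noisy) / len(noisy)
    return (observed - (1 - p) * 0.5) / p

random.seed(0)
true_rate = 0.3
data = [random.random() < true_rate for _ in range(100_000)]
noisy = [randomize(v) for v in data]
# estimate_rate(noisy) lands near 0.3, even though any individual
# noisy row may be wrong and reveals little about its source record.
```

This is exactly the aggregate-only property described: the noisy rows are individually unreliable by design, so they can’t be used to learn about a specific person.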



