Privacy preserving medical knowledge discovery by multiple “patient characteristics” formatted data

Kenta Kitamura, Mhd Irvan, Rie Shigetomi Yamaguchi


Statistical processing and artificial intelligence (AI) development using big data have been actively researched in recent years. However, the use of private data raises growing concerns about privacy violations. In response, the EU General Data Protection Regulation (GDPR) was introduced to regulate the handling of personal information, which makes it difficult to discover medical knowledge through big data analysis in medical studies. The GDPR does not, however, cover statistical information that cannot identify individuals, and such statistical information is commonly published, collected, and analyzed. It remains unknown whether collecting and analyzing such statistical information can generate medical evidence about relationships between variables, such as the relationship between tobacco and cancer.

In this paper, we propose using statistical information that falls outside the scope of the GDPR to estimate cross-tabulation tables, which are usually generated from personal information in medical research and are widely used to analyze relationships between medical variables. Specifically, as the statistical information we use “patient characteristics” formatted data, which are commonly published in medical research. The scope of this paper is the situation in which the publisher of the statistical information and the analyst of the published statistical information are different parties. On the publisher side, we assume that the publisher repeatedly collects raw data from a target group of people by random sampling and converts each sample into patient characteristics formatted data. On the analyst side, we assume that the analyst collects many of these published, randomly sampled patient characteristics and estimates the cross-tabulation table by the Law of Large Numbers (LLN). We model this publisher-analyst situation and, within the model, evaluate the usefulness of the proposed estimation through both theoretical and experimental accuracy assessments. Furthermore, for quantitative Privacy Preserving Data Mining (PPDM), we evaluate the anonymity risk of collecting multiple patient characteristics using an existing anonymity indicator, the Patient Family Detect on Overall Category (PFDOC) entropy. We check, theoretically and experimentally, how often the analyst obtains vulnerable patient characteristics whose PFDOC entropy equals zero. In the experiment, the target group consists of 20,000 personal records, each with four binary categorical values. For the publisher model, we created 10,000 patient characteristics, each summarizing 50 records randomly sampled from the 20,000. For the analyst model, we estimated the cross-tabulation table from the 10,000 patient characteristics.
The theoretical prediction error was 1.8% (95% CI), and the experimental error was within 1.5% (95% CI, n = 100), indicating close agreement between theory and experiment. Regarding anonymity, theory predicted that patient characteristics with PFDOC entropy = 0 would be rare in categories with a population ratio between 25% and 75%, so that anonymity is ensured; the experiment confirmed this. Based on these results, we conclude that, by using the patient characteristics release-and-collection model and selecting categories with appropriate population ratios, an analyst can accurately estimate cross-tabulation tables while preserving PFDOC entropy-based anonymity without legal restriction.
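
The publisher-analyst model described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the synthetic population and its attribute probabilities, the layout of a "patient characteristics" table (per-stratum counts of each binary attribute, stratified by one grouping attribute), and the estimator (pooling the published counts and normalizing). All function names are hypothetical. The sketch does not attempt to reproduce the reported error figures or the PFDOC entropy evaluation.

```python
import random


def make_population(n_people=20_000, seed=0):
    """Synthetic target group with four binary attributes (assumed probabilities)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        exposure = int(rng.random() < 0.4)
        # Assumed association: the outcome is more common among the exposed.
        outcome = int(rng.random() < (0.3 if exposure else 0.1))
        other_a = int(rng.random() < 0.5)
        other_b = int(rng.random() < 0.6)
        people.append((exposure, outcome, other_a, other_b))
    return people


def publish_table(population, rng, sample_size=50, group_var=0):
    """Publisher: draw one random sample and release only per-stratum counts."""
    sample = rng.sample(population, sample_size)
    n_vars = len(population[0])
    table = {g: {"n": 0, "ones": [0] * n_vars} for g in (0, 1)}
    for person in sample:
        stratum = table[person[group_var]]
        stratum["n"] += 1
        for j, value in enumerate(person):
            stratum["ones"][j] += value
    return table


def estimate_cross_tab(tables, var):
    """Analyst: pool the published counts into a 2x2 table of proportions."""
    cells = [[0, 0], [0, 0]]
    for table in tables:
        for g in (0, 1):
            ones = table[g]["ones"][var]
            cells[g][1] += ones
            cells[g][0] += table[g]["n"] - ones
    total = sum(sum(row) for row in cells)
    return [[count / total for count in row] for row in cells]


def true_cross_tab(population, group_var, var):
    """Ground truth computed from the raw records, for comparison only."""
    cells = [[0, 0], [0, 0]]
    for person in population:
        cells[person[group_var]][person[var]] += 1
    return [[count / len(population) for count in row] for row in cells]


def run_experiment(n_tables=10_000, sample_size=50, seed=1):
    """Run publisher and analyst once; return estimate, truth, and max cell error."""
    population = make_population()
    rng = random.Random(seed)
    tables = [publish_table(population, rng, sample_size) for _ in range(n_tables)]
    estimate = estimate_cross_tab(tables, var=1)
    truth = true_cross_tab(population, group_var=0, var=1)
    max_err = max(
        abs(estimate[g][v] - truth[g][v]) for g in (0, 1) for v in (0, 1)
    )
    return estimate, truth, max_err


if __name__ == "__main__":
    estimate, truth, max_err = run_experiment()
    print("estimated:", estimate)
    print("true:     ", truth)
    print("max absolute cell error:", max_err)
```

In this sketch the analyst never sees individual records, only the aggregated tables, and the pooled proportions approach the population cross-tabulation as the number of collected tables grows, which is the Law of Large Numbers argument the abstract relies on.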


random sampling; health care; patient characteristics
