publications
A selection of center-related publications
2024
- Large Language Models for Synthetic Tabular Health Data: A Benchmark Study. Marko Miletic and Murat Sariyar. In Digital Health and Informatics Innovations for Sustainable Health Care Systems (Stud. Health Technol. Inform.), Aug 2024.
Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLM models of the Pythia LLM Scaling Suite with varying model sizes ranging from 14M to 1B, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks to make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLM models outperform the reference GAN model. Even the smallest 14M parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLM models for synthetic tabular data generation.
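The evaluation described here is the common "train on synthetic, test on real" (TSTR) setup. A minimal sketch follows, assuming the synthetic and real tables share the same, already numerically encoded columns; the data frames and the target column name are hypothetical, not from the paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(synthetic: pd.DataFrame, real: pd.DataFrame, target: str) -> float:
    """Fit a random forest on the synthetic table and score it on the real one."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(synthetic.drop(columns=[target]), synthetic[target])
    proba = clf.predict_proba(real.drop(columns=[target]))[:, 1]
    return roc_auc_score(real[target], proba)

# hypothetical usage: tstr_auc(df_synthetic, df_real, target="outcome")
```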
- Challenges of Using Synthetic Data Generation Methods for Tabular Microdata. Marko Miletic and Murat Sariyar. Applied Sciences, Jul 2024.
The generation of synthetic data holds significant promise for augmenting limited datasets while avoiding privacy issues, facilitating research, and enhancing machine learning models’ robustness. Generative Adversarial Networks (GANs) stand out as promising tools, employing two neural networks—generator and discriminator—to produce synthetic data that mirrors real data distributions. This study evaluates GAN variants (CTGAN, CopulaGAN), a variational autoencoder, and copulas on diverse real datasets of different complexity encompassing numerical and categorical attributes. The results highlight CTGAN’s sensitivity to training parameters and TVAE’s robustness across datasets. Scalability challenges persist, with GANs demanding substantial computational resources. TVAE stands out for its high utility across all datasets, even for high-dimensional data, though it incurs higher privacy risks, which is indicative of the curse of dimensionality. While no single model universally excels, understanding the trade-offs and leveraging model strengths can significantly enhance synthetic data generation (SDG). Future research should focus on adaptive learning mechanisms, scalability enhancements, and standardized evaluation metrics to advance SDG methods effectively. Addressing these challenges will foster broader adoption and application of synthetic data.
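For readers who want to reproduce this kind of comparison, a hedged sketch of fitting one of the evaluated generators (CTGAN) with the standalone `ctgan` Python package is shown below; exact import paths and signatures depend on the installed version, and the file and column names are placeholders rather than those used in the study.

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan; API details may differ across versions

real = pd.read_csv("microdata.csv")              # hypothetical tabular microdata
discrete_cols = ["sex", "region", "employment"]  # assumed categorical attributes

model = CTGAN(epochs=300)                        # training parameters strongly affect results
model.fit(real, discrete_columns=discrete_cols)
synthetic = model.sample(len(real))              # synthetic table mirroring the real one
```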
2023
- Implementing Informative-Based Active Learning in Biomedical Record Linkage for the Splink Package in Python. Marko Miletic and Murat Sariyar. Stud. Health Technol. Inform., Jun 2023.
In biomedical record linkage, efficient determination of a threshold to decide at which level of similarity two records should be classified as belonging to the same patient is frequently still an open issue. Here, we describe how to implement an efficient active learning strategy that puts into practice a measure of usefulness of training sets for such a task. Our results show that active learning should always be considered when training data is to be produced via manual labeling. In addition, active learning gives a quick indication of how complex a problem is by looking at the label frequencies: if the most difficult entities always stem from the same class, then the classifier will probably have fewer problems distinguishing the classes. In big data applications, these two properties are essential, as the problems of under- and overfitting are exacerbated in such contexts.
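A rough illustration of the selection strategy (not the Splink API itself): an uncertainty-based loop over record-pair comparison vectors, where the pairs the current model is least sure about are sent to a human reviewer first. The feature matrix `X` and the `oracle` labeling function are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_active_learning(X, oracle, n_init=20, n_rounds=10, batch=10, seed=0):
    """X: per-pair comparison features; oracle(i) returns the manual match label of pair i.
    Assumes both classes appear among the initial randomly labeled pairs."""
    rng = np.random.default_rng(seed)
    labels = {i: oracle(i) for i in rng.choice(len(X), size=n_init, replace=False)}
    for _ in range(n_rounds):
        idx = list(labels)
        clf = LogisticRegression(max_iter=1000).fit(X[idx], [labels[i] for i in idx])
        p_match = clf.predict_proba(X)[:, 1]
        ranked = np.argsort(np.abs(p_match - 0.5))   # most uncertain pairs first
        for i in ranked:
            if i not in labels:
                labels[int(i)] = oracle(int(i))      # ask the reviewer
            if len(labels) >= len(idx) + batch:
                break
    return clf, labels
```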
2022
- A systematic overview on methods to protect sensitive data provided for various analyses. Matthias Templ and Murat Sariyar. International Journal of Information Security, Aug 2022.
In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison between various methods stemming from different fields is often desired. More concretely, it is important to provide guidance for the practice, which lacks an overview of suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding the practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
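To make one of the scenarios above concrete, here is a toy sketch of answering an interactive count query under epsilon-differential privacy with the Laplace mechanism; the epsilon value and the predicate are illustrative choices, not recommendations from the paper.

```python
import numpy as np

def dp_count(values, predicate, epsilon=1.0, rng=None):
    """True count plus Laplace noise scaled to the sensitivity of a counting query (which is 1)."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. dp_count(ages, lambda a: a >= 65, epsilon=0.5)  # smaller epsilon => more noise
```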
2017
- Simulation of Synthetic Complex Data: The R Package simPop. Journal of Statistical Software, Aug 2017.
The production of synthetic datasets has been proposed as a statistical disclosure control solution to generate public use files out of protected data, and as a tool to create "augmented datasets" to serve as input for micro-simulation models. Synthetic data have become an important instrument for ex-ante assessments of policy impact. The performance and acceptability of such a tool relies heavily on the quality of the synthetic populations, i.e., on the statistical similarity between the synthetic and the true population of interest. Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorized into three main groups: synthetic reconstruction, combinatorial optimization, and model-based generation. We provide in this paper a brief overview of these approaches, and introduce simPop, an open source data synthesizer. simPop is a user-friendly R package based on a modular object-oriented concept. It provides a highly optimized S4 class implementation of various methods, including calibration by iterative proportional fitting and simulated annealing, and modeling or data fusion by logistic regression. We demonstrate the use of simPop by creating a synthetic population of Austria, and report on the utility of the resulting data. We conclude with suggestions for further development of the package.
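One of the calibration methods named above, iterative proportional fitting, can be sketched in a few lines. simPop itself is an R package; the toy two-dimensional version below is in Python, and the margins used in the usage note are illustrative only.

```python
import numpy as np

def ipf(seed, row_margins, col_margins, iters=100, tol=1e-8):
    """Rescale a seed contingency table until its row/column sums match the target margins."""
    t = np.asarray(seed, dtype=float).copy()
    for _ in range(iters):
        t *= (row_margins / t.sum(axis=1))[:, None]   # match row totals
        t *= (col_margins / t.sum(axis=0))[None, :]   # match column totals
        if np.allclose(t.sum(axis=1), row_margins, atol=tol):
            break
    return t

# e.g. ipf(np.ones((3, 2)), row_margins=np.array([30, 50, 20]), col_margins=np.array([60, 40]))
```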
2015
- Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, Aug 2015.
The demand for data from surveys, censuses or registers containing sensitive information on people or enterprises has increased significantly in recent years. However, before data can be provided to the public or to researchers, confidentiality has to be respected for any data set possibly containing sensitive information about individual units. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data in order to decrease the disclosure risk of the data. The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package performs automated recalculation of frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational costs to be able to work with large data sets. Reporting facilities that summarize the anonymization process can also be easily used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test data set that has been distributed by the International Household Survey Network.
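The frequency counts the package recomputes after each step are the basis of k-anonymity style risk measures. A minimal pandas sketch of the idea follows (it mirrors, but does not reproduce, the sdcMicro implementation), with placeholder key variable names.

```python
import pandas as pd

def key_frequencies(df: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Attach fk, the sample frequency of each key-variable combination, to every record."""
    out = df.copy()
    out["fk"] = out.groupby(keys)[keys[0]].transform("size")
    return out

# records with fk == 1 are sample-unique and drive individual disclosure risk, e.g.
# risky = key_frequencies(survey, ["age_group", "sex", "region"]).query("fk == 1")
```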
2011
- Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, Apr 2011.
Statistical simulation in survey statistics is usually based on repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to explain issues regarding, e.g., sampling designs. Since the availability of real population data is in general very limited, it is necessary to generate synthetic data for such applications. The simulated data need to be as realistic as possible, while at the same time ensuring data confidentiality. This paper proposes a method for generating close-to-reality population data for complex household surveys. The procedure consists of four steps for setting up the household structure, simulating categorical variables, simulating continuous variables and splitting continuous variables into different components. It is not required to perform all four steps so that the framework is applicable to a broad class of surveys. In addition, the proposed method is evaluated in an application to the European Union Statistics on Income and Living Conditions (EU-SILC).
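The second step of the procedure, simulating categorical variables conditional on what has already been generated, can be illustrated as follows: fit a multinomial model on the survey sample and draw the variable for each synthetic individual from the predicted category probabilities. Object and variable names below are placeholders, not EU-SILC specifics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_categorical(sample_X, sample_y, population_X, seed=0):
    """Draw a categorical variable for each synthetic individual from fitted class probabilities."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(sample_X, sample_y)  # multinomial fit on the sample
    probs = model.predict_proba(population_X)                          # one probability row per individual
    return np.array([rng.choice(model.classes_, p=p) for p in probs])
```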
- Controlling false match rates in record linkage using extreme value theory. Murat Sariyar, Andreas Borg, and Klaus Pommerening. Journal of Biomedical Informatics, Aug 2011.
Cleansing data from synonyms and homonyms is a relevant task in fields where high quality of data is crucial, for example in disease registries and medical research networks. Record linkage provides methods for minimizing synonym and homonym errors, thereby improving data quality. We focus our attention on the case of homonym errors (in the following denoted as ‘false matches’), in which records belonging to different entities are wrongly classified as equal. Synonym errors (‘false non-matches’) occur when a single entity maps to multiple records in the linkage result. They are not considered in this study because in our application domain they are not as crucial as false matches. False match rates are frequently computed manually through a clerical review, that is, without modelling the distribution of the false match rates a priori. An exception is the work of Belin and Rubin (1995) [4], who propose to estimate the false match rate by means of a normal mixture model that needs training data for a calibration process. In this paper we present a new approach for estimating the false match rate within the framework of Fellegi and Sunter by methods of Extreme Value Theory (EVT). This approach needs no training data for determining the threshold for matches and therefore leads to a significant cost reduction. After giving two different definitions of the false match rate, we present the tools of EVT used in this paper: the generalized Pareto distribution and the mean excess plot. Our experiments with real data show that the model works well, with only slightly lower accuracy compared to a procedure that has information about the match status and that maximizes the accuracy.
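A hedged sketch of the core idea: fit a generalized Pareto distribution to the upper tail of the linkage weights (above a tail start u, typically read off a mean excess plot) and choose the match threshold so the modelled exceedance probability stays below a target false match rate. The weight array, u, and the target rate are assumptions for this illustration, not values from the paper.

```python
import numpy as np
from scipy.stats import genpareto

def evt_threshold(weights, u, target_rate=0.001):
    """weights: Fellegi-Sunter weights of candidate pairs; u: start of the upper tail.
    Assumes target_rate is smaller than the fraction of weights above u."""
    exceedances = weights[weights > u] - u
    shape, _, scale = genpareto.fit(exceedances, floc=0.0)   # GPD fit to the tail exceedances
    tail_fraction = len(exceedances) / len(weights)
    q = 1.0 - target_rate / tail_fraction                    # solve P(weight > t) = target_rate
    return u + genpareto.ppf(q, shape, loc=0.0, scale=scale)
```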
2010
- Disclosure Risk of Synthetic Population Data with Application in the Case of EU-SILC. Matthias Templ and Andreas Alfons. Lect. Notes Comput. Sci., Aug 2010.
In survey statistics, simulation studies are usually performed by repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to support the theory by practical examples. However, real population data containing the information of interest are in general not available; therefore, synthetic data need to be generated. Ensuring data confidentiality is thereby absolutely essential, while the simulated data should be as realistic as possible. This paper briefly outlines a recently proposed method for generating close-to-reality population data for complex (household) surveys, which is applied to generate a population for Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data. Based on this synthetic population, confidentiality issues are discussed using five different worst case scenarios. In all scenarios, the intruder has the complete information on key variables from the real survey data. It is shown that even in these worst case scenarios the synthetic population data are confidential. In addition, the synthetic data are of high quality.
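A toy version of the worst-case check described above, in which the intruder holds the key variables of the real survey records and looks for unique exact matches in the synthetic population; the key variable names are placeholders and this is not the evaluation code of the paper.

```python
import pandas as pd

def unique_key_matches(real: pd.DataFrame, synthetic: pd.DataFrame, keys: list[str]) -> int:
    """Count real records whose key-variable combination occurs exactly once in the synthetic data."""
    synth_counts = synthetic.groupby(keys).size().reset_index(name="n_synth")
    merged = real[keys].merge(synth_counts, on=keys, how="left")
    return int((merged["n_synth"] == 1).sum())

# e.g. unique_key_matches(survey, synthetic_population, ["region", "household_size", "age", "sex"])
```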
- The RecordLinkage Package: Detecting Errors in Data. Murat Sariyar and Andreas Borg. The R Journal, Aug 2010.
Record linkage deals with detecting homonyms and mainly synonyms in data. The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. The determination of the necessary thresholds in this model can be achieved by tools of extreme value theory. Furthermore, machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm). The generation of record pairs and comparison patterns from single data items is provided as well. Comparison patterns can be chosen to be binary or based on some string metrics. In order to reduce computation time and memory usage, blocking can be used. Future development will concentrate on additional and refined methods, performance improvements and input/output facilities needed for real-world applications.
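For context, here is a minimal sketch of the Fellegi-Sunter weights such a stochastic framework scores pairs with, using fixed m/u probabilities rather than the EM estimation the package actually performs; the comparison fields and probability values are illustrative.

```python
import numpy as np

# illustrative per-field probabilities (three comparison fields, e.g. name, birth date, zip)
m = np.array([0.95, 0.90, 0.85])   # P(field agrees | record pair is a true match)
u = np.array([0.05, 0.10, 0.02])   # P(field agrees | record pair is a non-match)

def pair_weight(gamma: np.ndarray) -> float:
    """Sum of log2 likelihood ratios for one binary comparison pattern gamma."""
    agree = gamma * np.log2(m / u)
    disagree = (1 - gamma) * np.log2((1 - m) / (1 - u))
    return float(np.sum(agree + disagree))

# e.g. pair_weight(np.array([1, 1, 0])); pairs scoring above a chosen threshold are declared links
```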