Privacy-preserving machine learning#
Much current research in data science involves machine learning (ML) models interacting with data sourced from a large number of individuals, who vary widely in their awareness of, consent to, and understanding of the research goals. Researchers therefore have a responsibility to protect the confidentiality and privacy of the people whose data is being processed. At the same time, sharing both data and trained models drives scientific advancement and supports the important social goals of open and transparent science.
Local and international regulations, such as the General Data Protection Regulation (GDPR) and the EU’s policy on trustworthy AI, also establish legal duties and principles for privacy protection that the following tools may help researchers meet.
Learning with privacy#
Beyond sharing data with other researchers, we can also share our trained models, or make them available as a service: carrying out predictions on data provided by others without their needing to invest time and resources in training their own systems. However, this sharing can also carry risks for personal privacy. For instance, many ML solutions require users to send personal data to a central server for processing, exposing it to the risk of interception or misuse. The model itself may also memorize sequences from the training data that we do not wish it to retain, a process referred to as unintended memorization [CLE+19]. This can be particularly harmful for models trained on large amounts of user-created text [BLM+22].
Federated learning#
Federated learning is a design paradigm in which the users’ data never leaves their own devices: the training process is broken down into a set of computations that take place on the edge, and only model updates are sent back to a central coordinator [KMA+19].
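To make the idea concrete, here is a minimal sketch of federated averaging for a single-parameter linear model in plain Python/NumPy. The data, model, number of clients, and hyperparameters are all invented for illustration; real deployments would use a dedicated framework such as those listed under the useful resources below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private data: each client holds its own (x, y) pairs drawn from
# y = 3x + noise. In this sketch the raw data never leaves the client.
def make_client_data(n=50):
    x = rng.normal(size=n)
    y = 3.0 * x + rng.normal(scale=0.1, size=n)
    return x, y

clients = [make_client_data() for _ in range(3)]

def local_update(w, x, y, lr=0.1, steps=10):
    """One client's contribution: a few gradient steps on its own data."""
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of the mean squared error
        w -= lr * grad
    return w

# One round of federated averaging: the coordinator broadcasts the current
# weight, each client trains locally, and only the updated weights are
# returned and averaged.
w_global = 0.0
for round_ in range(5):
    local_weights = [local_update(w_global, x, y) for x, y in clients]
    w_global = float(np.mean(local_weights))

print(f"learned weight ≈ {w_global:.3f} (true value 3.0)")
```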
Adversarial learning#
We can also draw on research in cross-domain training to teach models to ignore private attributes by directly controlling the training process [CNC18]. This can be extended beyond private attributes to the elimination of unwanted biases [ZLM18].
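One common way to realise this is a gradient-reversal layer: an adversary tries to predict the private attribute from the model’s internal representation, and the reversed gradient pushes the encoder to remove the information the adversary relies on. The sketch below, in PyTorch, uses an invented architecture, data, and hyperparameters for illustration and is not the exact setup of the cited work.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates the gradient on the backward
    pass, so the encoder is trained to defeat the adversary."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# Hypothetical architecture: a shared encoder, a task head (what we want to
# predict) and an adversary head (the private attribute the representation
# should not encode).
encoder = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
task_head = nn.Linear(8, 2)
adversary_head = nn.Linear(8, 2)

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters())
    + list(adversary_head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch: features, task labels, and a private attribute to be ignored.
x = torch.randn(32, 16)
y_task = torch.randint(0, 2, (32,))
y_private = torch.randint(0, 2, (32,))

for _ in range(100):
    z = encoder(x)
    task_loss = loss_fn(task_head(z), y_task)
    # The adversary head learns to predict the private attribute, while the
    # reversed gradient trains the encoder to make that prediction hard.
    adv_loss = loss_fn(adversary_head(GradReverse.apply(z)), y_private)
    opt.zero_grad()
    (task_loss + adv_loss).backward()
    opt.step()
```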
Differential privacy#
Differential privacy has also seen significant use in model training: adding small amounts of statistical noise during training reduces the risk of the model learning any individual data point too well [BDC20, FBDD20].
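The core mechanism of differentially private gradient descent is to clip each example’s gradient, so no single person can dominate an update, and then add calibrated Gaussian noise before averaging. The NumPy sketch below illustrates this on an invented one-parameter model; the clipping norm and noise multiplier are illustrative assumptions, not a calibrated privacy budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset for a one-parameter linear model y ≈ w * x.
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

clip_norm = 1.0         # maximum influence any single example may have
noise_multiplier = 1.1  # Gaussian noise scale relative to the clip norm
lr = 0.1
w = 0.0

for _ in range(200):
    batch_idx = rng.choice(len(x), size=32, replace=False)
    # Per-example gradients of the squared error for the sampled batch.
    per_example_grads = 2 * (w * x[batch_idx] - y[batch_idx]) * x[batch_idx]
    # Clip each example's gradient so no individual dominates the update...
    clipped = np.clip(per_example_grads, -clip_norm, clip_norm)
    # ...then add calibrated Gaussian noise before averaging.
    noise = rng.normal(scale=noise_multiplier * clip_norm)
    w -= lr * (clipped.sum() + noise) / len(batch_idx)

print(f"DP-trained weight ≈ {w:.2f} (non-private estimate would be ≈ 3.0)")
```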
Useful resources#
Privacy in Deep Learning: A Survey. A useful, brief overview of some current threats and mitigation strategies.
Deep learning and differential privacy. Frank McSherry’s thought-provoking blog post about the privacy research landscape.
Privacy Preserving Machine Learning: Maintaining confidentiality and preserving trust. A recent overview of privacy-preserving learning from Microsoft Research.
PySyft. A federated learning and privacy-preservation library designed for compatibility with the major machine learning frameworks PyTorch and TensorFlow.