# Differential privacy – Wikipedia

Methods of safely sharing general data
Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose information is in the database.

For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts. Roughly, an algorithm is differentially private if an observer seeing its output cannot tell if a particular individual’s data was used in the computation.

Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms probably resist such attacks. [1] Differential privacy was developed by cryptographers and is therefore often associated with cryptography, and it draws much of its language from cryptography.

## History

Official statistics organizations are charged with collecting information from individuals or establishments, and publishing aggregate data to serve the public interest. For example, the 1790 United States Census collected information about individuals living in the United States and published tabulations based on sex, age, race, and condition of servitude. Statistical organizations have long collected information under a promise of confidentiality that the information provided will be used for statistical purposes, but that the publications will not produce information that can be traced back to a specific individual or establishment. To accomplish this goal, statistical organizations have long suppressed information in their publications. For example, in a table presenting the sales of each business in a town grouped by business category, a cell that has information from only one company might be suppressed, in order to maintain the confidentiality of that company’s specific sales.

The adoption of electronic information processing systems by statistical agencies in the 1950s and 1960s dramatically increased the number of tables that a statistical organization could produce and, in so doing, significantly increased the potential for an improper disclosure of confidential information. For example, if a business that had its sales numbers suppressed also had those numbers appear in the total sales of a region, then it might be possible to determine the suppressed value by subtracting the other sales from that total. But there might also be combinations of additions and subtractions that might cause the private information to be revealed. The number of combinations that needed to be checked increases exponentially with the number of publications, and it is potentially unbounded if data users are able to make queries of the statistical database using an interactive query system.

In 1977, Tore Dalenius formalized the mathematics of cell suppression. [2] In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results. [3] This and future research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called query privacy, with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.

In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries, far fewer than was implied by previous work. [4] The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case privacy cannot be protected without injecting some amount of noise, led to the development of differential privacy.

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so. [1] Their work was a co-recipient of the 2016 TCC Test-of-Time Award [5] and the 2017 Gödel Prize. [6] Since then, subsequent research has shown that there are many ways to produce very accurate statistics from the database while still ensuring high levels of privacy. [7] [8]

## ε-differential privacy

The 2006 Dwork, McSherry, Nissim and Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. (Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)

The intuition for the 2006 definition of ε-differential privacy is that a person’s privacy cannot be compromised by a statistical release if their data are not in the database. Therefore, with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on the data of any one individual.

Of course, how much any individual contributes to the result of a database query depends in part on how many people’s data are involved in the query. If the database contains data from a single person, that person’s data contributes 100%. If the database contains data from a hundred people, each person’s data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, “Calibrating noise to sensitivity in private data analysis.”

The 2006 paper presents both a mathematical definition of differential privacy and a mechanism based on the addition of Laplace noise (i.e. noise coming from the Laplace distribution) that satisfies the definition.

### Definition of ε-differential privacy

Let ε be a positive real number and $\mathcal{A}$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let $\textrm{im}\ \mathcal{A}$ denote the image of $\mathcal{A}$. The algorithm $\mathcal{A}$ is said to provide ε-differential privacy if, for all datasets $D_1$ and $D_2$ that differ on a single element (i.e., the data of one person), and all subsets $S$ of $\textrm{im}\ \mathcal{A}$:

$$\Pr[\mathcal{A}(D_1) \in S] \leq \exp(\varepsilon) \cdot \Pr[\mathcal{A}(D_2) \in S],$$

where the probability is taken over the randomness used by the algorithm. [9]

Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.
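To make the inequality in the definition concrete, here is a minimal sketch that checks it for a mechanism with a finite set of outputs. In that setting it is enough to compare the probability of each single output under the two neighboring datasets, because the bound for any subset S follows by summing. The function name `satisfies_epsilon_dp` and the two example distributions are illustrative assumptions, not taken from the 2006 paper.

```python
import math

def satisfies_epsilon_dp(p1, p2, epsilon, tol=1e-12):
    """Check the ε-DP inequality in both directions for two output distributions.

    p1 and p2 map each possible output to its probability under two
    neighboring datasets D1 and D2 (hypothetical inputs for illustration).
    """
    bound = math.exp(epsilon)
    for output in set(p1) | set(p2):
        a, b = p1.get(output, 0.0), p2.get(output, 0.0)
        # Either direction exceeding exp(ε) times the other violates the definition.
        if a > bound * b + tol or b > bound * a + tol:
            return False
    return True

# Hypothetical mechanism: report a private bit truthfully with probability 3/4.
dist_if_bit_is_1 = {"yes": 0.75, "no": 0.25}
dist_if_bit_is_0 = {"yes": 0.25, "no": 0.75}

print(satisfies_epsilon_dp(dist_if_bit_is_1, dist_if_bit_is_0, math.log(3)))  # True
print(satisfies_epsilon_dp(dist_if_bit_is_1, dist_if_bit_is_0, 0.5))          # False
```

The probability ratio in this toy example is exactly 3, so the check passes for ε = ln 3 and fails for any smaller ε.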

### Composability

(Self-)composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.

Sequential composition. If we query an ε-differential privacy mechanism $t$ times, and the randomization of the mechanism is independent for each query, then the result would be $\varepsilon t$-differentially private. In the more general case, if there are $n$ independent mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_n$, whose privacy guarantees are $\varepsilon_1, \dots, \varepsilon_n$ differential privacy, respectively, then any function $g$ of them, $g(\mathcal{M}_1, \dots, \mathcal{M}_n)$, is $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differentially private. [10]

Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database then the function $g$ would be $(\max_i \varepsilon_i)$-differentially private instead. [10]
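The two composition rules amount to simple bookkeeping on the ε values. The following sketch only tracks the resulting guarantee; the function names are illustrative assumptions, and the example values were chosen to be exactly representable in floating point.

```python
def sequential_epsilon(epsilons):
    """Guarantee when all mechanisms are run on the same data: the sum of the ε_i."""
    return sum(epsilons)

def parallel_epsilon(epsilons):
    """Guarantee when mechanisms are run on disjoint subsets: the largest ε_i."""
    return max(epsilons)

eps = [0.125, 0.25, 0.5]             # hypothetical per-mechanism guarantees
print(sequential_epsilon(eps))       # 0.875
print(parallel_epsilon(eps))         # 0.5

# Querying one ε-DP mechanism t times is the special case ε * t.
print(sequential_epsilon([0.125] * 4))  # 0.5
```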

### Robustness to post-processing

For any deterministic or randomized function $F$ defined over the image of the mechanism $\mathcal{A}$, if $\mathcal{A}$ satisfies ε-differential privacy, so does $F(\mathcal{A})$.

Together, composability and robustness to post-processing permit modular construction and analysis of differentially private mechanisms and motivate the concept of the privacy loss budget. If all elements that access sensitive data of a complex mechanism are separately differentially private, so will be their combination, followed by arbitrary post-processing.
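As a small illustration of post-processing, suppose a noisy count has already been released by a differentially private mechanism. Rounding it and clamping it to a plausible range uses only the released value, so by the property above it cannot weaken the guarantee. The function name `clean_noisy_count` and the bounds are illustrative assumptions.

```python
def clean_noisy_count(noisy_count, max_count):
    """Post-process an already released noisy count.

    The function never reads the underlying dataset, only the mechanism's
    output, so it plays the role of F in the statement above.
    """
    rounded = round(noisy_count)
    return min(max(rounded, 0), max_count)

print(clean_noisy_count(-1.3, 6))  # 0
print(clean_noisy_count(3.6, 6))   # 4
print(clean_noisy_count(9.2, 6))   # 6
```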

### Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted their information. However, this is also extendable. We may want to protect databases differing in $c$ rows, so that no adversary with arbitrary auxiliary information can know whether $c$ particular participants submitted their information. This can be achieved because if $c$ items change, the probability dilation is bounded by $\exp(\varepsilon c)$ instead of $\exp(\varepsilon)$, [11] i.e., for $D_1$ and $D_2$ differing on $c$ items:

$$\Pr[\mathcal{A}(D_1) \in S] \leq \exp(\varepsilon c) \cdot \Pr[\mathcal{A}(D_2) \in S]$$

Thus setting ε instead to $\varepsilon / c$ achieves the desired result (protection of $c$ items). In other words, instead of having each item ε-differentially private protected, now every group of $c$ items is ε-differentially private protected (and each item is $(\varepsilon / c)$-differentially private protected).
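The $\exp(\varepsilon c)$ bound follows from applying the single-element definition repeatedly. As a sketch, assume a chain of intermediate datasets $D^{(0)} = D_1, D^{(1)}, \dots, D^{(c)} = D_2$ (notation introduced here for illustration) in which consecutive datasets differ on a single element. Applying the definition once per step gives:

```latex
\Pr[\mathcal{A}(D^{(0)}) \in S]
  \leq e^{\varepsilon} \Pr[\mathcal{A}(D^{(1)}) \in S]
  \leq e^{2\varepsilon} \Pr[\mathcal{A}(D^{(2)}) \in S]
  \leq \dots
  \leq e^{c\varepsilon} \Pr[\mathcal{A}(D^{(c)}) \in S]
```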

## ε-differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism [12] and posterior sampling [13], sample from a problem-dependent family of distributions instead.

### Sensitivity

Let $d$ be a positive integer, $\mathcal{D}$ be a collection of datasets, and $f \colon \mathcal{D} \rightarrow \mathbb{R}^d$ be a function. The sensitivity [1] of a function, denoted $\Delta f$, is defined by

$$\Delta f = \max \lVert f(D_1) - f(D_2) \rVert_1,$$

where the maximum is over all pairs of datasets $D_1$ and $D_2$ in $\mathcal{D}$ differing in at most one element and $\lVert \cdot \rVert_1$ denotes the $\ell_1$ norm.

In the example of the medical database below, if we consider $f$ to be the function $Q_i$, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.

There are techniques (described below) with which we can create a differentially private algorithm for functions with low sensitivity.
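The claim that $Q_i$ has sensitivity one can be checked by brute force on small inputs. The sketch below assumes a database is simply a column of 0/1 values and that neighboring datasets differ in exactly one entry; the helper names are illustrative, and enumerating all columns is only feasible for a handful of rows.

```python
from itertools import product

def partial_sum_query(i):
    """Return Q_i: the sum of the first i entries of a 0/1 column."""
    return lambda column: sum(column[:i])

def l1_sensitivity(query, n_rows):
    """Brute-force the sensitivity of a scalar query over all 0/1 columns
    of length n_rows, comparing datasets that differ in one entry."""
    worst = 0
    for column in product([0, 1], repeat=n_rows):
        for row in range(n_rows):
            neighbor = list(column)
            neighbor[row] = 1 - neighbor[row]  # flip one person's value
            worst = max(worst, abs(query(column) - query(neighbor)))
    return worst

print(l1_sensitivity(partial_sum_query(5), n_rows=6))  # 1
```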

### The Laplace mechanism

The Laplace mechanism adds Laplace noise (i.e. noise from the Laplace distribution, which can be expressed by the probability density function $\text{noise}(y) \propto \exp(-|y|/\lambda)$, which has mean zero and standard deviation $\sqrt{2}\lambda$). In our case we define the output function of $\mathcal{A}$ as a real-valued function (called the transcript output by $\mathcal{A}$) as $\mathcal{T}_{\mathcal{A}}(x) = f(x) + Y$, where $Y \sim \text{Lap}(\lambda)$ and $f$ is the original real-valued query/function we planned to execute on the database. Now clearly $\mathcal{T}_{\mathcal{A}}(x)$ can be considered to be a continuous random variable, where

$$\frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A}, D_1}(x) = t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A}, D_2}(x) = t)} = \frac{\text{noise}(t - f(D_1))}{\text{noise}(t - f(D_2))}$$

which is at most $e^{\frac{|f(D_1) - f(D_2)|}{\lambda}} \leq e^{\frac{\Delta(f)}{\lambda}}$. We can consider $\frac{\Delta(f)}{\lambda}$ to be the privacy factor $\varepsilon$. Thus $\mathcal{T}$ follows a differentially private mechanism (as can be seen from the definition above). If we apply this concept to the diabetes example below, where the sensitivity is one, then it follows from the above derived fact that in order to have $\mathcal{A}$ as the ε-differentially private algorithm we need to have $\lambda = 1/\varepsilon$. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can be employed, but they may require a slight relaxation of the definition of differential privacy. [11]

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records $D_1$ where each record is a pair (Name, X), where $X$ is a Boolean denoting whether a person has diabetes or not. For example:

| Name | Has Diabetes (X) |
|---|---|
| Ross | 1 |
| Monica | 1 |
| Joey | 0 |
| Phoebe | 0 |
| Chandler | 1 |
| Rachel | 0 |

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query $Q_i$ that returns the partial sum of the first $i$ rows of column $X$ in the database. In order to find Chandler’s diabetes status the adversary executes $Q_5(D_1)$ and $Q_4(D_1)$, then computes their difference. In this example, $Q_5(D_1) = 3$ and $Q_4(D_1) = 2$, so their difference is 1. This indicates that the “Has Diabetes” field in Chandler’s row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

Continuing this example, if we construct $D_2$ by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish $D_2$ from $D_1$ by computing $Q_5 - Q_4$ for each dataset. If the adversary were required to receive the values $Q_i$ via an ε-differentially private algorithm, for a sufficiently small ε, then he or she would be unable to distinguish between the two datasets.
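The differencing attack and the effect of the Laplace mechanism can both be sketched in a few lines. This is a minimal illustration using only the Python standard library: Laplace noise is sampled as the difference of two exponential variables, the scale is set to 1/ε because $Q_i$ has sensitivity one, and the names `q`, `laplace_noise` and `noisy_q` as well as the choice ε = 0.5 are assumptions made for the example.

```python
import random

# The example table: 1 means "has diabetes".
records = [("Ross", 1), ("Monica", 1), ("Joey", 0),
           ("Phoebe", 0), ("Chandler", 1), ("Rachel", 0)]

def q(i, data):
    """Exact query Q_i: sum of column X over the first i rows."""
    return sum(x for _, x in data[:i])

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_q(i, data, epsilon, rng):
    """Laplace mechanism for Q_i; sensitivity is 1, so the scale is 1/epsilon."""
    return q(i, data) + laplace_noise(1 / epsilon, rng)

# Without noise, the difference reveals Chandler's value exactly.
print(q(5, records) - q(4, records))  # 1

# With noise, the same difference is only a noisy estimate.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(noisy_q(5, records, 0.5, rng) - noisy_q(4, records, 0.5, rng))
```

Note that answering both noisy queries spends privacy budget twice: by sequential composition the pair of answers is 2ε-differentially private, here with ε = 0.5 each.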

### Randomized response

A simple example, especially developed in the social sciences, [14] is to ask a person to answer the question “Do you own the attribute A?”, according to the following procedure:

1. Toss a coin.
2. If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
3. If tails, then toss the coin again and answer “Yes” if heads, “No” if tails.

(The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.

But, overall, these data with many responses are significant, since a positive response is given by a quarter of the people who do not have the attribute A and by three-quarters of the people who actually possess it. Thus, if $p$ is the true proportion of people with A, then we expect to obtain $(1/4)(1 - p) + (3/4)p = (1/4) + p/2$ positive responses. Hence it is possible to estimate $p$.

In particular, if the attribute A is synonymous with illegal behavior, then answering “Yes” is not incriminating, insofar as the person has a probability of a “Yes” response, whatever it may be.

Although this example, inspired by randomized response, might be applicable to microdata (i.e., releasing datasets with each individual response), by definition differential privacy excludes microdata releases and is only applicable to queries (i.e., aggregating individual responses into one result) as this would violate the requirements, more specifically the plausible deniability that a subject participated or not. [15] [16]
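The procedure and the estimate of p can be simulated directly. The sketch below assumes fair coins, a hypothetical population in which 30% hold attribute A, and a fixed random seed; the estimator simply inverts the expected “Yes” rate of 1/4 + p/2 derived above. The function names are illustrative.

```python
import random

def randomized_response(truth, rng):
    """Answer following the two-coin procedure; truth is True or False."""
    first_is_heads = rng.random() < 0.5
    second_is_heads = rng.random() < 0.5  # always tossed, as in the procedure
    if first_is_heads:
        return truth           # answer honestly
    return second_is_heads     # "Yes" on heads, "No" on tails

def estimate_proportion(responses):
    """Invert E[yes rate] = 1/4 + p/2 to estimate p."""
    yes_rate = sum(responses) / len(responses)
    return 2 * (yes_rate - 0.25)

rng = random.Random(42)
true_p = 0.3  # hypothetical share of people with attribute A
population = [rng.random() < true_p for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in population]
print(estimate_proportion(responses))
```

With this many respondents the printed estimate should land close to the assumed 0.3, even though no single response reveals whether that person holds the attribute.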

### Stable transformations

A transformation $T$ is $c$-stable if the Hamming distance between $T(A)$ and $T(B)$ is at most $c$ times the Hamming distance between $A$ and $B$ for any two databases $A, B$. Theorem 2 in [10] asserts that if there is a mechanism $M$ that is ε-differentially private, then the composite mechanism $M \circ T$ is $(\varepsilon \times c)$-differentially private.

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance $h$ between $A$ and $B$ (where $A$ contains the group and $B$ doesn’t). In this case $M \circ T$ is $(\varepsilon \times c \times h)$-differentially private.

## Other notions of differential privacy

Since differential privacy is considered to be too strong or too weak for some applications, many versions of it have been proposed. [17] The most widespread relaxation is (ε, δ)-differential privacy, [18] which weakens the definition by allowing an additional small δ density of probability on which the upper bound ε does not hold.
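For reference, the relaxed guarantee is commonly written as follows, using the same symbols as the ε-differential privacy definition above (datasets $D_1$ and $D_2$ differing on a single element, and any subset $S$ of the image of $\mathcal{A}$). Setting δ = 0 recovers ε-differential privacy.

```latex
\Pr[\mathcal{A}(D_1) \in S] \leq \exp(\varepsilon) \cdot \Pr[\mathcal{A}(D_2) \in S] + \delta
```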

## Adoption of differential privacy in real-world applications

Several uses of differential privacy in practice are known to date:

## Public purpose considerations

There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology: [27]

• Data Utility & Accuracy. The main concern with differential privacy is the tradeoff between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the tradeoffs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this tradeoff.
• Data Privacy & Security. Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit tradeoff between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict.