Information gain when measuring an unknown qubit

In quantum information the fundamental information-containing system is the qubit. A measurement of a single qubit can at most yield one classical bit. However, a dichotomous measurement of an unknown qubit will yield much less information about the qubit state. We use Bayesian inference to compute how much information one progressively gets by making successive, individual measurements on an ensemble of identically prepared qubits. Perhaps surprisingly, even if the measurements are arranged so that each measurement yields one classical bit, that is, the two possible measurement outcomes are a priori equiprobable, it takes almost a handful of measurements before one has gained one bit of information about the gradually concentrated qubit probability density. We also show that by following a strategy that reaps the maximum information per measurement, we are led to mutually unbiased bases as our measurement bases. This is a pleasing, although not entirely surprising, result.


Introduction
In recent years the field of quantum information has attracted considerable attention. A main driving force has been the concern about personal and national information integrity. We do not want personal or national information to be read by unwanted and perhaps even unknown parties. At the same time we are all aware of the fact that both nations and individuals spend considerable resources trying to obtain such information [1]. This concern has led to the development of quantum key distribution [2,3], a system of distributing cryptographic keys that relies for its safety on quantum mechanics rather than on the belief that factorisation of large numbers (on which public-key cryptographic systems rely) is computationally hard [4,5]. However, in 1994 it was shown by Shor [6] that a quantum mechanical computer could factorise a product in a number of steps scaling polynomially rather than exponentially with the number of digits of the product [7]. The latter is the scaling of the best publicly known classical factorisation algorithms.
Another driving force is the increased sensitivity quantum measurements offer compared to classical measurement when using the same amount of resources, such as energy. In the most advanced measuring systems constructed so far, such as the Laser Interferometer Gravitational-Wave Observatory, quantum enhanced measurement techniques are already in use [8]. However, to extract the most information from the measurement system and data, proper understanding of statistics and classical information theory and its relation to quantum measurements is needed.
A basic building block of quantum information is the qubit, just as the bit is the basic building block of the contemporary information society. However, in spite of the seemingly similar notions of a bit and a qubit, there are fundamental differences between the two. A (classical) bit is a unit of information. A qubit, on the other hand, is not a unit but simply designates a two-state quantum system. A single qubit can be prepared to transfer one bit of classical information, but this is only possible if the involved parties first agree on a (two-state) basis, and then an equiprobable superposition or mixture between the two basis states is prepared. If no designated basis is chosen a priori, then the analogy between the bit and the qubit is incomplete. The reason is that probabilities in classical physics are replaced by complex-valued probability amplitudes in quantum mechanics. This allows for a richer set of possible measurements, and hence outcomes, for a quantum system.
As quantum information is becoming increasingly ubiquitous in physics and informatics, it is important that contemporary physics students learn some (quantum) information theory and that computer science students learn to appreciate the difference between classical bits and quantum qubits early on. Unfortunately, it is not so easy to find such information in quantum mechanics or quantum information textbooks, as they tend to focus on quantum information without regard to its relation to classical information theory. The present manuscript makes no claim of offering a complete exposition of the difference between the two. However, we do show, in an intentionally pedagogic manner, at least how an unknown bit and an unknown qubit, with equiprobable a priori state probabilities or probability density, respectively, differ in how much information a measurement will give. With the qubit as an example of a quantum system which allows for many different kinds of measurements by choice of measurement basis, the concept of mutually unbiased bases naturally emerges. Hopefully, this will not only inform students of an important concept in contemporary quantum mechanics, but may also pique the students' interest in quantum information science, a rapidly growing and exciting field.

Qubits and the Bloch sphere
In the following we shall assume that the qubit is encoded in a spin-1/2 system. This system should be familiar to most students who have come into contact with quantum mechanics. We use the eigenstates |↓⟩ and |↑⟩ (spin down and spin up) of the spin z observable and associate them with the logical qubit states |0⟩ and |1⟩, respectively. A pure qubit state |ψ⟩ can then be written

|ψ⟩ = cos(θ/2)|↑⟩ + e^{iφ} sin(θ/2)|↓⟩,   (1)

where 0 ≤ θ ≤ π, 0 ≤ φ < 2π, and i is the imaginary unit. Through θ and φ we can associate every qubit with a point on the Bloch sphere [9, 10], see figure 1. Hence, the state |↑⟩ (|↓⟩) lies on the north (south) pole of the sphere.
As the spin of a light particle, a photon, is manifested in terms of the photon's polarisation, students familiar with optics can alternatively view the quantum state of the qubit as a polarisation state of a photon. The Bloch sphere is then replaced by the Poincaré sphere, and the states |↑⟩ and |↓⟩ can be interpreted as a left circularly and a right circularly polarised photon, respectively. By analogy, the states (|↑⟩ ± |↓⟩)/√2 then represent vertical and horizontal linear polarisation, and the states (|↑⟩ ± i|↓⟩)/√2 represent diagonal and antidiagonal linear polarisation of a photon.
We shall also assume that we have two protagonists that we call Alice and Bob. Alice will prepare an ensemble of identical, pure qubits specified by θ and φ. However, she will choose the two angles without telling Bob what values she has chosen. Bob's task is to try to estimate what values Alice has chosen, through von Neumann measurements.
Since the state of the qubits Alice prepared is completely unknown to Bob, it is reasonable for him to assign a uniform prior probability density q(θ, φ) on the Bloch sphere. (If the measurement outcome space is unknown, assigning a prior is somewhat more involved, see [11]. However, a spin qubit measurement results in either spin up or spin down.) Thus, q(θ, φ) = (4π)⁻¹. In this manner the probability that the qubit was prepared by Alice with the parameters θ and φ within a small surrounding solid angle Ω anywhere on the Bloch sphere is equal to Ω/(4π). To motivate this scenario, suppose Bob wants to characterise a thick, transparent, birefringent material for which neither the material thickness nor the material's fast and slow axes are known. If a photon of a known polarisation (or spin) state is sent through the material, the polarisation state of the photon will in general change. This change will let Bob characterise the material. However, since Bob has no clue how the material will affect the photon, he has good reason to assign a uniform probability density as his prior. Thus, 'Alice' should not be taken literally as a person, but rather as some procedure to prepare the unknown quantum state, in this case by propagating a known state through an unknown material. Here, it is perhaps prudent to point out that in order to characterise an unknown material in this way it is not sufficient to use only one input polarisation. If, e.g. the input state is linearly polarised in a direction parallel to either the fast or the slow axis of the material, the polarisation state will not change. This will allow Bob to draw a conclusion about the birefringence axis direction, but not about the birefringence 'strength' (for the used wavelength). To get information about the latter one could, e.g. use diagonally linearly polarised photons and measure the material's influence on such photons.
The material's influence on different input states will eventually disclose the material's birefringence properties. The generalisation of such a procedure is called quantum process tomography [12].
Going back to the spin states, it is not difficult to prove that if we have two states |ψ₁⟩ and |ψ₂⟩, represented on the Bloch sphere by the unit vectors v₁ and v₂ through (1), then the absolute squared overlap is

|⟨ψ₁|ψ₂⟩|² = (1 + v₁ · v₂)/2 = (1 + cos γ)/2,

where γ is the angle subtended between v₁ and v₂ so that cos γ = v₁ · v₂. Using a standard trigonometric identity we can now rewrite the absolute squared overlap as

|⟨ψ₁|ψ₂⟩|² = cos²(γ/2).   (2)

The last step allows us to use the Bloch sphere representation exclusively in the following when computing state projection probabilities. From the equation we can immediately see that in order to get the projection probability 1/2, the two states' Bloch vectors must be orthogonal (γ = π/2).
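As a quick numerical sanity check of relation (2), the overlap of two states written in the form (1) can be compared with the corresponding Bloch-vector expression. This is a small sketch; the function names are ours and not part of the text:

```python
import numpy as np

def qubit_state(theta, phi):
    """State (1): cos(theta/2)|up> + exp(i phi) sin(theta/2)|down>."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def bloch_vector(theta, phi):
    """Cartesian unit vector on the Bloch sphere for the same angles."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

rng = np.random.default_rng(1)
for _ in range(100):
    t1, t2 = rng.uniform(0, np.pi, 2)
    p1, p2 = rng.uniform(0, 2 * np.pi, 2)
    overlap = abs(np.vdot(qubit_state(t1, p1), qubit_state(t2, p2))) ** 2
    cos_gamma = bloch_vector(t1, p1) @ bloch_vector(t2, p2)
    # |<psi1|psi2>|^2 = (1 + cos(gamma))/2 = cos^2(gamma/2)
    assert np.isclose(overlap, (1 + cos_gamma) / 2)
print("relation (2) holds on 100 random state pairs")
```

In particular, two states whose Bloch vectors are orthogonal (γ = π/2) have the projection probability cos²(π/4) = 1/2.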

Information and Bayes' rule
Suppose Bob makes a measurement of the unknown spin qubit along the z direction. It is clear that the von Neumann entropy of the qubit state is zero since the state is pure. However, the (Shannon) entropy of Bob's measurements is not going to be zero, for only if he knows the qubit's eigenbasis (i.e. he knows θ and φ) can he make a measurement that will have a predictable outcome. From the symmetry of the prior it follows that the two possible measurement outcomes |↑⟩ and |↓⟩ are a priori equiprobable. Suppose Bob gets the result |↑⟩. He has now gained one bit of classical information because, according to Shannon [13], the information gain is the difference between the measurement outcome's a priori and a posteriori entropies. The entropy function H is defined as

H = −Σ_k p(k) log p(k),   (3)

where p(k) is the probability of outcome k, and if the logarithm is taken in base 2, the result will be in units of bits. The entropy function is a measure of the uncertainty of the outcome k. Suppose that we have n equiprobable outcomes, so that the probability of obtaining outcome k is p(k) = 1/n. Then H is a monotonically increasing function of n. The larger the uncertainty of the outcome, the larger the entropy. As is clearly explained in [13], the entropy function as defined in (3) is the unique function that fulfils this condition, that is continuous in the probabilities p(k), and that is additive. The last condition can mathematically be stated as follows: assume that a process is generating successive outcomes k and l, where the probability p(k, l) for both k and l to happen is expressed as

p(k, l) = p(k) p(l|k),   (4)

where p(l|k) is the conditional probability of outcome l given the outcome k. Then additivity means that the entropy of p(k, l) is the sum of the entropy of p(k) and the probability-weighted entropy of p(l|k).
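The definition (3) and the additivity property following (4) are easy to verify numerically. A minimal sketch (the helper function is our own, not from the text):

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum_k p(k) log2 p(k), in bits; terms with p(k) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A fair coin carries one bit of uncertainty.
assert np.isclose(shannon_entropy([0.5, 0.5]), 1.0)

# Additivity: H(p(k, l)) = H(p(k)) + sum_k p(k) H(p(l | k)), cf. (4).
p_k = np.array([0.25, 0.75])
p_l_given_k = np.array([[0.5, 0.5],    # p(l | k = 0)
                        [0.9, 0.1]])   # p(l | k = 1)
p_kl = p_k[:, None] * p_l_given_k      # joint probability p(k, l) = p(k) p(l | k)
lhs = shannon_entropy(p_kl.ravel())
rhs = shannon_entropy(p_k) + float(np.sum(p_k * [shannon_entropy(r) for r in p_l_given_k]))
assert np.isclose(lhs, rhs)
print("additivity verified,", round(lhs, 4), "bit")
```

The same helper reproduces, e.g. the 0.918 bit measurement entropy of a 2/3 versus 1/3 outcome split that appears later in the text.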
That is, the a priori entropy is

−2 · (1/2) log₂(1/2) = 1 bit,

because a priori p(|↑⟩) = p(|↓⟩) = 1/2, and the final, a posteriori entropy is

−(1) log₂(1) = 0 bit,   (5)

because a posteriori p(|↑⟩) = 1 and p(|↓⟩) = 0, so through the measurement he has gained 1 − 0 = 1 bit of classical information. If he re-measures the same qubit in the same basis Bob can now predict the outcome with certainty. However, 1 bit is by no means how much information Bob has gained about what state the qubit was prepared in by Alice. To compute the latter we need two things: the first is Bayes' rule for statistical inference [14]. It can be written

p(a|b) = p(b|a) p(a)/p(b),   (6)

where, e.g. p(a) is the probability density of observing outcome a irrespective of the outcome b. Bayes' rule lets us update the probability density of getting an outcome a, given that a measurement or observation resulted in outcome b.
The other tool we need is the differential entropy function for a continuous probability density p(a). To this end we replace the sum in (3) by an integral, viz.

h = −∫ p(a) log₂ p(a) da.   (7)
The differential entropy differs from the Shannon entropy in two important respects. First, it depends on the unit in which p(a) is expressed. In our case we will express the probability density of θ and φ over the Bloch sphere in the unit sr⁻¹. However, if we had used some other unit, so that instead of p(a) we would have used p′(ka + m), where k and m are constants and p′ is the rescaled and translated probability density, then the differential entropy would change to h(p′) = h(p) + log₂ k. However, the information gained from a measurement that transforms an initial probability density q(a), through the use of Bayes' theorem, into an updated probability density p(a) is the difference in differential entropy between the initial and updated densities, viz.
I = ∫ p(a) log₂ p(a) da − ∫ q(a) log₂ q(a) da = h(q) − h(p).   (8)

Hence, while the values of the differential entropies depend on what units we choose to express the probability density in, the information gain, i.e. the differential entropy difference between two probability densities, is independent of the unit chosen. Secondly, the differential entropy can be negative, and it approaches minus infinity as p(a) approaches a (possibly multi-dimensional) delta function. This makes intuitive sense, because to specify precisely the parameter(s) a where the delta-like function is non-zero requires a large number of digits. The narrower the probability density gets, the larger the number of digits (= the more information) we need to specify a. Since the information gained about the parameter to be estimated equals the differential entropy difference between an initial (wide) probability density and a delta-like final probability density, the information gain will be positive, just as one intuitively expects.
The entropy of the initial, flat prior probability density q(θ, φ) = (4π)⁻¹ is

h₀ = −∫∫ q(θ, φ) log₂ q(θ, φ) dΩ = log₂ 4π ≈ 3.6515 bit.   (9)

As we have discussed above, the obtained numerical value is irrelevant as such, as it depends on the chosen units. What matters is that the information gained by an observation is the difference between the a priori probability density entropy (9) and the a posteriori probability density entropy. This number is independent of our choice of units. A flowchart of the estimation procedure and the computation of the associated information gain described above can be found in figure 2.
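The value log₂ 4π in (9) can be checked by discretising the sphere; the grid resolution below is an arbitrary choice of ours:

```python
import numpy as np

def sphere_entropy(density, n=400):
    """Differential entropy (bits) of density(theta, phi) over the full sphere,
    via a midpoint rule with solid-angle element sin(theta) dtheta dphi."""
    dth, dph = np.pi / n, 2 * np.pi / n
    theta = (np.arange(n) + 0.5) * dth
    phi = (np.arange(n) + 0.5) * dph
    T, P = np.meshgrid(theta, phi, indexing="ij")
    p = density(T, P)
    plogp = p * np.log2(np.maximum(p, 1e-300))   # guard: 0 log 0 -> 0
    return float(-np.sum(plogp * np.sin(T)) * dth * dph)

# Uniform prior q = 1/(4 pi) per steradian.
h0 = sphere_entropy(lambda T, P: np.full_like(T, 1 / (4 * np.pi)))
print(round(h0, 4))   # close to log2(4 pi) ≈ 3.6515
```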

Measurements and information gain
Equipped with Bayes' rule and differential entropy we can now see how much information about θ and φ Bob gains from one measurement of one of the qubits prepared by Alice. From the spherical symmetry of the problem it does not matter what his first von Neumann measurement is. We will therefore assume that Bob makes a first measurement along the z-axis of the Bloch sphere. Such a measurement will 'project' the measured qubit onto one of the states |↑⟩ and |↓⟩. For brevity we will denote these outcomes +z and −z, respectively. The probability of measuring +z given the state (1) is

p(+z|θ, φ) = cos²(θ/2),   (10)

where we see that the result is independent of the parameter φ. The probability for Bob to obtain this result is

p(+z) = ∫∫ p(+z|θ, φ) q(θ, φ) dΩ = 1/2.

Hence, the two mutually exclusive outcomes +z and −z are equiprobable, and the measurement therefore defines exactly one bit of classical information (or reduction of randomness) as discussed above.
Using Bayes' rule (6), the prior q(θ, φ) = (4π)⁻¹, and the results above, Bob can get the updated qubit probability distribution

p(θ, φ|+z) = p(+z|θ, φ) q(θ, φ)/p(+z) = cos²(θ/2)/(2π).   (11)

From this follows that the post-measurement probability is no longer uniform over the sphere, but is displaced towards the northern hemisphere, as is expected and as is illustrated in figure 3(b). How much information did Bob gain from this measurement? Naïvely one may guess one bit, since the measurement outcome was binary and equiprobable. However, this is incorrect. The entropy of the a posteriori probability density is

h₁ = −∫∫ p(θ, φ|+z) log₂ p(θ, φ|+z) dΩ = log₂ 2π + 1/(2 ln 2) ≈ 3.3728 bit.

The information gained about the state preparation, as specified by θ and φ, is thus h₀ − h₁ ≈ 0.2787 bit. This result was found, e.g. by Terashima and by Ren and Fan [15, 16].

[Figure 3. The probability density plotted in pseudocolour on the Bloch sphere. The initial, uniform distribution in (a); then, (b)-(f), the density after 1, 2, …, 5 optimal measurements with outcomes +z, +x, +y, +w and +u, respectively. The orientation of the six Bloch spheres is identical. On the far right, the pseudocolour, linear scale going from 0 (blue, bottom) to 0.46 (red, top). Under the Bloch spheres the information gained (in bits) by the measurement is written.]

We see that the information gained is far less than a bit of information. What does this mean in practice? It means that we are far from being able to exclude with reasonable certainty even half of the possible outcomes. Given the inferred probability density (11) we have, e.g. only a 3/4 probability of correctly guessing in which hemisphere, north or south, the measured qubit state was prepared, and we can say nothing at all about φ. As can be seen from figure 3(b), the probability density is still spread over most of the Bloch sphere. Only the point θ = π has vanishing density.
A re-measurement of the just measured qubit will yield no additional information, because according to the von Neumann projection postulate, the post-measurement state of the qubit is |↑⟩ (if the qubit was not destroyed by the measurement). Hence, the just measured qubit retains no information about the parameters θ and φ that specified its preparation.
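The whole chain above (prior, likelihood (10), Bayes update (11), entropy change) can be reproduced numerically on a grid. A sketch under our own discretisation choices:

```python
import numpy as np

n = 600
dth, dph = np.pi / n, 2 * np.pi / n
theta = (np.arange(n) + 0.5) * dth                # midpoint grids
phi = (np.arange(n) + 0.5) * dph
T, P = np.meshgrid(theta, phi, indexing="ij")
dO = np.sin(T) * dth * dph                        # solid-angle element

def entropy_bits(p):
    return float(-np.sum(p * np.log2(np.maximum(p, 1e-300)) * dO))

prior = np.full_like(T, 1 / (4 * np.pi))          # uniform prior on the sphere
likelihood = np.cos(T / 2) ** 2                   # p(+z | theta, phi), eq. (10)

evidence = float(np.sum(likelihood * prior * dO)) # p(+z), should be 1/2
posterior = likelihood * prior / evidence         # Bayes' rule, eq. (11)

gain = entropy_bits(prior) - entropy_bits(posterior)
print(round(evidence, 3), round(gain, 4))         # ≈ 0.5 and ≈ 0.2787
```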

Repeated measurement of one observable
To gain more information about the qubit ensemble state, Bob has to measure another of the identically prepared qubits. If he uses the same observable, the possible outcomes must be the same: he either gets the result +z or −z. The probability of the first outcome, given that the first measurement along the z-axis gave the result +z, is equal to

p(+z|+z) = ∫∫ p(+z|θ, φ) p(θ, φ|+z) dΩ = ∫∫ cos⁴(θ/2) sin θ dθ dφ/(2π) = 2/3.

Thus, the complementary outcome −z happens with probability 1/3. The classical measurement entropy of this second measurement is hence 0.918 bit. The updated probability densities for the two outcomes are shown in figures 4(b) and (c).
The average differential entropy of the ensuing updated probability distribution, weighted by the two outcome probabilities 2/3 and 1/3, is

h̄₂ ≈ 3.1759 bit.   (17)

Thus, even less information, h₁ − h̄₂ ≈ 0.1969 bit, was gained on average by the second measurement.
At this point it is worth mentioning that if Bob gets the measurement result −z from his second measurement, then the differential entropy of the ensuing updated probability distribution is approximately 3.4710 bit. That is, the entropy has increased by 0.0982 bit from 3.3728 bit. In physical terms this means that the qubit state 'uncertainty' has increased through the second measurement. This may seem paradoxical, but is not strange at all. We know from thermodynamics that heat can be transferred from a colder system to a warmer one. If the system is large, this is very unlikely to be observed, but for small systems this is certainly a possibility. Consequently, when calculating the mean entropy of interacting systems we include all these unlikely events, weighted by their respective (small) probabilities. The result is almost universally that the average expected entropy decreases, because the entropy-increasing events have small probabilities. In a qubit system the same thing happens. Certain possible outcomes increase the estimated differential entropy (or lower the estimated information gain), but on average every subsequent measurement will decrease the entropy. If one studies the measurement outcome sequence +z, −z, +z, −z, +z, … one finds that it is only after the second outcome that the differential entropy of the estimated probability distribution increases. Already with the third measurement, the entropy is on a decreasing slope (even though it then takes a few measurements before it falls below the entropy after the first measurement). After a first outcome +z, the four possible outcome pairs of the second and third measurements lead to updated probability densities proportional to cos⁶(θ/2), cos⁴(θ/2) sin²(θ/2) (for the two mixed sequences), and cos²(θ/2) sin⁴(θ/2). Computing the weighted average entropy of these updated probability densities (where the weighting factors are 1/2, 1/6, 1/6, and 1/6, respectively) one arrives at an average differential entropy of h̄₃ ≈ 3.023 bit. That is, after three measurements Bob has gained less than one bit of information (h₀ − h̄₃ ≈ 0.63 bit) about θ and φ.
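All outcome branches of repeated z-axis measurements can be averaged numerically. The sketch below exploits that the posterior after m '+z' and k '−z' outcomes is proportional to cos^{2m}(θ/2) sin^{2k}(θ/2); the grid size is our own choice:

```python
import numpy as np
from math import comb

n = 2000
dth = np.pi / n
theta = (np.arange(n) + 0.5) * dth                # midpoint grid; phi stays uniform
w = np.sin(theta) * dth * 2 * np.pi               # solid-angle weight, phi integrated
up, dn = np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2

def branch(m, k):
    """(differential entropy in bits, probability) of one outcome sequence."""
    unnorm = up ** m * dn ** k / (4 * np.pi)
    prob = float(np.sum(unnorm * w))              # analytically m! k! / (m + k + 1)!
    p = unnorm / prob                             # normalised branch posterior
    h = float(-np.sum(p * np.log2(np.maximum(p, 1e-300)) * w))
    return h, prob

def average_entropy(N):
    """Outcome-averaged differential entropy after N z-axis measurements."""
    return sum(comb(N, m) * prob * h
               for m in range(N + 1)
               for h, prob in [branch(m, N - m)])

for N in (1, 2, 3):
    print(N, round(average_entropy(N), 4))
```

The averages come out near 3.3728, 3.1759 and 3.0231 bit for N = 1, 2, 3, i.e. successive average gains of roughly 0.28, 0.20 and 0.15 bit.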
One can continue to compute the information gained by considering all possible outcomes of N = 4, 5, … measurements. The trend is clear: each successive measurement results in a smaller and smaller information gain about θ, and no information about φ.
In the next section we shall show that, as expected, the optimal information gain about the qubit state (that is, about both θ and φ) is not obtained by measuring along the same axis, since no information is obtained about φ. However, in the case where Bob is only interested in the spin tilt angle θ, it is not hard to guess, nor to prove, that repeated z-axis measurements are indeed optimal. In figure 5, the marginal conditional probability densities for θ, obtained by integrating the joint density over the parameter φ, are plotted. The differential entropy differences between these one-dimensional densities are identical to those calculated above, although the entropy of each one is reduced by a fixed amount since they do not account for the uncertainty of the parameter φ.

Measuring θ and f optimally
Measuring the identically prepared qubits repeatedly using the same observable is naturally not the optimal way to estimate the state of the qubit ensemble Alice prepared. Instead, it is better to use either a set of mutually unbiased measurement bases (MUB) [17,18], a symmetric, informationally complete positive operator-valued measure (SIC-POVM) [19,20], a collective measurement of several qubits in the ensemble [21], or a sequence of adaptive measurements, where each new measurement is decided based on the measurement outcome history [22]. In some cases, depending on how the gain (or cost) of estimating the state correctly (or incorrectly) is quantified, a collective measurement of the whole ensemble gives no advantage over a sequence of measurements on individual states [23]. Since we want to delineate how much information is gained by each successive measurement, and for simplicity, we will in the following use sequential von Neumann measurements of single-qubit observables. In fact, as we shall see, optimising the two measurements following the first will lead us to mutually unbiased bases.
Two observables are mutually unbiased if all their respective eigenvectors |ψ_k⟩ and |ξ_l⟩ satisfy

|⟨ψ_k|ξ_l⟩|² = 1/N,   (21)

where N is the Hilbert space dimension [17,18]. In our case N = 2, and it is not difficult to see that the states

|±x⟩ = (|↑⟩ ± |↓⟩)/√2

constitute an unbiased basis relative to |↑⟩ and |↓⟩. In terms of the Pauli operators σ_x, σ_y, and σ_z, if the measurement along the z-axis corresponds to σ_z, then the vectors |+x⟩ and |−x⟩ are the eigenvectors of the spin operator σ_x/2 with eigenvalues x = ±1/2. The Bloch sphere vectors corresponding to |+x⟩ and |−x⟩ point along the positive and negative x-axis, respectively, and the unbiasedness condition (21) together with (2) imply that these Bloch sphere vectors are orthogonal to the z-axis.
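Condition (21) for this N = 2 example can be confirmed in a few lines (a small check of ours):

```python
import numpy as np

up, down = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_plus = (up + down) / np.sqrt(2)
x_minus = (up - down) / np.sqrt(2)

# Unbiasedness: every squared overlap equals 1/N with N = 2.
for a in (up, down):
    for b in (x_plus, x_minus):
        assert np.isclose(abs(np.vdot(a, b)) ** 2, 0.5)

# |x+-> are eigenvectors of the spin operator sigma_x / 2 with eigenvalues +-1/2.
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.allclose(sigma_x / 2 @ x_plus, 0.5 * x_plus)
assert np.allclose(sigma_x / 2 @ x_minus, -0.5 * x_minus)
print("bases are mutually unbiased")
```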
Suppose once more that a first measurement is made along the z-axis and that the outcome was +z, so that the prior probability density is given by (11). We note that the prior is independent of φ, so that in choosing the second measurement, without loss of generality, Bob can choose to measure in the xz-plane. Suppose Bob chooses the projection vectors defined by θ = α, φ = 0 and θ = π − α, φ = π for the measurement outcomes we shall call a₊ and a₋, respectively. With this choice, (2) gives

p(a₊|θ, φ) = [1 + sin α sin θ cos φ + cos α cos θ]/2,
p(a₋|θ, φ) = 1 − p(a₊|θ, φ).

From these relations and from (11) one can compute the average differential entropy Bob would get as a function of the projection angle α. We note that since the prior (11) is independent of φ, the differential entropy must be symmetric around the projection angle α = π/2. From figure 6 we see that for α = 0 Bob retrieves the result from (17). On the other hand, the minimum differential entropy, resulting in the highest information gain, is obtained for α = π/2 and is h₂ ≈ 3.0942 bit. We also notice that this choice of projectors coincides with a measurement of the Pauli operator σ_x. That is, the measurement that maximises the information gain is a measurement which is unbiased with respect to the first measurement. Thus, given that the second measurement yields the result a₊ ≡ +x, Bayes' rule gives the updated probability distribution

p(θ, φ|+x, +z) = cos²(θ/2)(1 + sin θ cos φ)/(2π).   (25)

The density is depicted in figure 3(c). The information gained by this second, mutually unbiased measurement is h₁ − h₂ ≈ 0.2786 bit. That is, Bob gains essentially the same amount of information by this second, unbiased measurement as by the first. Of course, had the optimal measurement instead given the result a₋ ≡ −x, then the updated probability distribution would have been different, the factor (1 + sin θ cos φ) would instead have become (1 − sin θ cos φ), but the information gained would have been identical. Now assume Bob makes a third measurement.
If he follows the same procedure we have just outlined, and computes the average differential entropy obtained as a function of the axis direction, he finds that the optimum direction to measure along, given that he has already measured along the z- and x-axes, is the y-axis. The derivation is straightforward but a bit tedious. Since the result is intuitive we will not write down the whole derivation but concentrate on the results. To minimise the average probability density entropy Bob should measure along the y-axis. The measurement results ±y are equiprobable. Suppose he obtains the result +y. The conditional probability is p(+y|θ, φ) = (1 + sin θ sin φ)/2, which together with the prior (25) gives the probability 1/2 to get the assumed result +y. The entropy of the post-measurement probability density is h₃ ≈ 2.8155 bit, so this third measurement gave Bob h₂ − h₃ ≈ 0.2787 bit of information towards estimating θ and φ, somewhat less than a third of a bit of information, but certainly more than the ≈ 0.21 bit per measurement he got on average by making three subsequent measurements using the same observable to estimate θ only.

[Figure 6. The differential entropy, probability averaged over the two possible measurement outcomes, as a function of the projection angle α.]
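The optimisation over the projection angle α of the second measurement can be sketched numerically as follows; the grid sizes and the α sampling are our own choices:

```python
import numpy as np

n = 400
dth, dph = np.pi / n, 2 * np.pi / n
theta = (np.arange(n) + 0.5) * dth
phi = (np.arange(n) + 0.5) * dph
T, P = np.meshgrid(theta, phi, indexing="ij")
dO = np.sin(T) * dth * dph

prior = np.cos(T / 2) ** 2 / (2 * np.pi)          # posterior (11) after +z

def avg_entropy(alpha):
    """Posterior differential entropy, averaged over the outcomes a+ and a-."""
    like = (1 + np.sin(alpha) * np.sin(T) * np.cos(P)
              + np.cos(alpha) * np.cos(T)) / 2    # p(a+ | theta, phi)
    h = 0.0
    for lk in (like, 1 - like):                   # the two outcomes
        prob = float(np.sum(lk * prior * dO))
        post = lk * prior / prob
        h += prob * float(-np.sum(post * np.log2(np.maximum(post, 1e-300)) * dO))
    return h

alphas = np.linspace(0, np.pi / 2, 10)
hs = [avg_entropy(a) for a in alphas]
print(round(hs[0], 4), round(hs[-1], 4))          # alpha = 0 versus alpha = pi/2
```

Here hs[0] recovers the repeated-z value of about 3.1759 bit, while α = π/2, the mutually unbiased choice, gives the minimum of about 3.0942 bit.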

Adaptive measurements
So far, the information Bob has gained about the preparation of the qubit ensemble would have been the same, irrespective of the measurement sequence along the x-, y-, and z-axes, and irrespective of the measurement results. In fact, given what we have discovered in section 6, Bob could have chosen to measure along any three mutually orthogonal directions on the Bloch sphere, and irrespective of the order or outcomes of the three measurements he would have obtained 0.8360 bits of information. However, after having used the three mutually unbiased measurement bases, a good way to proceed is to use the measurement outcomes, and based on the outcome history devise a measurement that reaps maximal information [22]. Given the assumed measurement outcome history +z, +x, +y the a posteriori probability density is symmetric with respect to the exchange of any two Bloch sphere coordinate axes. Hence, the maximum of the probability density must be found in the direction (1, 1, 1)/√3 on the Bloch sphere. This direction is characterised by θ = arccos(1/√3) ≈ 0.9553 rad and φ = π/4. This is the direction along which Bob is most confident about the probability density, so measurement directions perpendicular to this direction should be the ones where he knows the least, and therefore the ones where he can expect the highest information gain.
This intuitive reasoning turns out to be correct, because if Bob chooses the measurement projection axis for the fourth measurement defined by θ and φ and plots the mean differential entropy of the post-measurement, inferred probability density, he obtains figure 7. The figure is only plotted for 0 ≤ φ ≤ π, since the projection axes defined by (θ, φ) and (π − θ, φ + π) define the same measurement. The angular coordinate pairs corresponding to measurement projection directions perpendicular to (1, 1, 1)/√3 all result in an average differential entropy of h₄ ≈ 2.597 bit. For these measurement projection directions the measurement outcome is totally random, and thus the measurement gives the maximum classical entropy, one bit. Bob can, e.g. use the measurement projection direction θ = arccos(1/√3) + π/2 and φ = π/4 (and its antiparallel vector). If the fourth measured qubit is projected onto the state vector corresponding to the latter of these two vectors we call the measurement outcome +w. Using (2) we can compute the updated probability density. It is plotted in figure 3(e). It has its maximum in the direction θ ≈ 0.5220 and φ = π/4. Bob gained h₃ − h₄ ≈ 0.2185 bit of information by this measurement, somewhat less than through each of the first three MUB measurements.
To continue, one would guess that Bob should still measure perpendicularly to the direction where p(θ, φ|+w, +y, +x, +z) has its maximum, but at the same time Bob should also make sure that the projection direction is perpendicular to the direction defining +w. This fifth direction is given by θ = π/2 and φ = 3π/4 (or its antiparallel vector). (A minimisation of the post-measurement probability density differential entropy confirms that this measurement projection axis indeed gives the unique, global minimum.) We shall call the projection onto the latter vector the measurement result +u. Using (2) one can readily compute the conditional probability density p(θ, φ|+u, +w, +y, +x, +z). The function is plotted in figure 3(f). The entropy of this distribution is h₅ ≈ 2.3883 bit, so the measurement reaped h₄ − h₅ ≈ 0.2087 bit, which is somewhat less than the fourth measurement gave Bob. The maximum of the probability density is now found at θ ≈ 0.6419 rad and φ ≈ 0.1318 rad, that is, the distribution has moved closer to the x-axis due to the projection onto the vector closer to the positive x- rather than the positive y-axis. In total, the five measurements have given Bob h₀ − h₅ ≈ 1.263 bit of information, or an average of ≈ 0.2526 bit per measurement.
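The whole five-measurement chain can be reproduced by sequential Bayesian updating on a grid. The +w and +u Bloch vectors below are the directions quoted above (the antiparallel vectors onto which the states are projected); the grid sizes are our own choice:

```python
import numpy as np

n = 500
dth, dph = np.pi / n, 2 * np.pi / n
theta = (np.arange(n) + 0.5) * dth
phi = (np.arange(n) + 0.5) * dph
T, P = np.meshgrid(theta, phi, indexing="ij")
dO = np.sin(T) * dth * dph
nx, ny, nz = np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)

def entropy(p):
    return float(-np.sum(p * np.log2(np.maximum(p, 1e-300)) * dO))

axes = [("+z", (0, 0, 1)),
        ("+x", (1, 0, 0)),
        ("+y", (0, 1, 0)),
        ("+w", (-1 / np.sqrt(6), -1 / np.sqrt(6), np.sqrt(2 / 3))),
        ("+u", (1 / np.sqrt(2), -1 / np.sqrt(2), 0))]

p = np.full_like(T, 1 / (4 * np.pi))              # uniform prior
h = entropy(p)
gains = []
for name, (vx, vy, vz) in axes:
    like = (1 + vx * nx + vy * ny + vz * nz) / 2  # projection probability, eq. (2)
    p = like * p / float(np.sum(like * p * dO))   # Bayes update for this outcome
    h_new = entropy(p)
    gains.append(h - h_new)
    h = h_new
    print(name, "gain:", round(gains[-1], 4))
print("total:", round(sum(gains), 4))
```

The per-measurement gains come out near 0.279, 0.279, 0.279, 0.219 and 0.209 bit, with a total near 1.26 bit, in line with the numbers quoted in the text.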

Discussion
Above, we have used differential entropy to quantify the information gain when making successive measurements on an ensemble of qubits. In the literature one can find several other ways of estimating the state of a measured qubit and the associated 'information gain'. The simplest, but crudest, way is to use the fidelity of the qubit ensemble given to Bob with respect to a target qubit |τ⟩. The fidelity F between the pure qubits |ψ⟩ and |τ⟩ is defined as F = |⟨ψ|τ⟩|² (although some authors use the square root of this number), so that 0 ≤ F ≤ 1. However, the price for the simplicity of the definition is, firstly, that any state √F |τ⟩ + e^{iϑ} √(1 − F) |τ⊥⟩, where |τ⊥⟩ is a state orthogonal to |τ⟩, has the same fidelity F to |τ⟩ irrespective of the phase ϑ. Secondly, the fidelity F and the fidelity 1 − F therefore specify equally well where on the Bloch sphere the state |ψ⟩ is located. For example, both zero and unit fidelity allow one to predict |ψ⟩ perfectly. Thirdly, the average fidelity F between two randomly chosen pure states is 1/2. Thus, when measuring the fidelity 1/4 Bob gets more information about the state |ψ⟩ than if he measures the fidelity 1/2. Fidelity as a measure of 'information gain' is therefore of most use when it is close to unity (or to zero).
A popular method of state estimation is quantum tomography, and to a large extent this procedure follows the measurement strategy outlined in section 6. One measures the qubit ensemble using a number of linearly independent measurement bases (where mutually unbiased bases represent an optimal choice). One could then in principle use minimisation of the measurement entropy to optimally estimate the state |ψ⟩, but due to the logarithm in the definition of the entropy this is mathematically difficult. Instead, the standard method is to use the maximum likelihood method from statistics, combined with the constraints that the result should fulfil physical boundary conditions and be normalised. Quantum tomography does not usually provide any quantitative value of the information gained through the tomography, but in principle the (multi-dimensional) likelihood function contains such information: the more peaked it is, the more information has been gained.
We remark that instead of using differential entropy we could have used some other measure of information gain, e.g. the Kullback-Leibler (K-L) divergence [24]. For an a priori and an a posteriori probability density q(a) and p(a), respectively, the K-L divergence is defined as

D(p‖q) = ∫ p(a) log₂ [p(a)/q(a)] da.

We see that the K-L divergence is similar, but not identical, to the differential entropy difference (8). However, as we have seen, each measurement results in a rather 'gentle' updating of the post-measurement probability density in the considered scenario, so the a priori and a posteriori probability densities will not differ substantially. Therefore, the differential entropy and the K-L divergence will give rather similar numerical values in the considered case. Specifically, the estimated information gain through the first measurement is identical for the two measures, since the a priori probability density q in (31) is a constant in this case. Thus, if we use the K-L divergence as a measure of information gain, we obtain quantitatively slightly different results than when using differential entropy, but the main messages remain the same: with each successive measurement substantially less than one bit of information is reaped, and the average information obtained per measurement is non-increasing. Another information measure of statistical distributions is the Fisher information [25]. When measuring, e.g. along the z-axis, the Fisher information quantifies the amount of information that the outcome of the z-measurement carries about θ. The Fisher information does not have the unit bit, but depends on the parametrisation of θ through z. However, the Cramér-Rao bound [26, 27] states that the inverse of the Fisher information of a probability density provides a lower bound for the variance of the estimate of θ through the measurement of z. Thus, a high Fisher information translates to the possibility of a rather accurate estimate of θ.
Moreover, finding an estimation procedure that saturates the Cramér-Rao bound guarantees the procedure's optimality. Unfortunately, there exists no direct relation between the Fisher information and the information obtained in bits, as the two have different units and different uses.
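The coincidence of the two measures for the first measurement can be checked directly. In the sketch below (our own discretisation, not from the text) a uniform prior over u = cos θ is updated by Bayes' rule after one '+' outcome of a z-measurement; for a constant prior the K-L divergence equals the differential-entropy decrease exactly:

```python
import numpy as np

u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]
q = np.full_like(u, 0.5)              # uniform prior density on [-1, 1]
p = q * (1 + u) / 2                   # Bayes: multiply by P(+|u) = (1 + u)/2
p /= p.sum() * du                     # normalise the posterior

def entropy_bits(f):
    """Differential entropy -∫ f log2(f) du via a simple Riemann sum."""
    m = f > 0
    return -np.sum(f[m] * np.log2(f[m])) * du

m = p > 0
kl = np.sum(p[m] * np.log2(p[m] / q[m])) * du
gain = entropy_bits(q) - entropy_bits(p)
print(round(kl, 4), round(gain, 4))   # both ≈ 0.279 bits
```

Both measures give 1 − 1/(2 ln 2) ≈ 0.279 bits, in line with the roughly quarter-bit gain per early measurement discussed in the conclusions.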
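For the z-measurement considered here the Fisher information can be written out explicitly. With outcome probabilities P(+) = cos²(θ/2) and P(−) = sin²(θ/2), one finds I(θ) = (dp/dθ)² [1/p + 1/(1 − p)] = 1 for every θ, so the Cramér-Rao bound gives Var(θ̂) ≥ 1/N for N independent measurements. A quick check:

```python
import numpy as np

def fisher_info(theta):
    """Fisher information about theta from one z-measurement,
    P(+) = cos^2(theta/2), P(-) = sin^2(theta/2)."""
    p = np.cos(theta / 2) ** 2
    dp = -np.sin(theta) / 2            # d/dtheta of cos^2(theta/2)
    return dp ** 2 * (1 / p + 1 / (1 - p))

for theta in (0.3, 1.0, 2.0, 3.0):
    print(round(fisher_info(theta), 6))  # 1.0 each time
```

That I(θ) is constant in this parametrisation illustrates the point made above: the Fisher information depends on how θ enters the outcome statistics, and carries no unit of bits.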

Conclusions
We have used von Neumann measurements on the simplest quantum system, a pure qubit, to show that caution and care must be exercised when making or discussing claims of information gain as a consequence of quantum measurements. We have discussed three different measurement strategies on individual members of an ensemble of identically prepared qubits and shown that, while each measurement outcome may result in a classical bit of information, the information gained about the preparation of the qubit through Bayes' rule of inference is much smaller. For the initial measurements it is on the order of a quarter of a bit. Thus, it takes four optimal measurements before one has gained at least one bit of information.
We have also shown that the concept of mutually unbiased bases follows naturally from the requirement of maximal information gain. Irrespective of how Bob chooses his first measurement, the first three optimal measurements will always define a set of unitarily equivalent MUBs. When the mutually unbiased bases are exhausted, our analysis shows that an adaptive measurement strategy yields more information than repeated measurements in the MUB bases. For example, the differential entropy for the choice θ = π (which represents a measurement along the z-axis) in figure 7 is higher than for the optimal directions.
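The defining property of the qubit MUBs referred to here (the eigenbases of σz, σx and σy) is that every cross-basis overlap satisfies |⟨e_i|f_j⟩|² = 1/2, so a measurement in one basis predicts nothing about the outcomes in another. A short check:

```python
import numpy as np

s = 1 / np.sqrt(2)
bases = {
    "z": [np.array([1, 0], complex), np.array([0, 1], complex)],
    "x": [np.array([s, s], complex), np.array([s, -s], complex)],
    "y": [np.array([s, 1j * s]), np.array([s, -1j * s])],
}

# Verify |<e_i|f_j>|^2 = 1/2 for every pair of states from different bases
for a in bases:
    for b in bases:
        if a == b:
            continue
        for e in bases[a]:
            for f in bases[b]:
                assert abs(abs(np.vdot(e, f)) ** 2 - 0.5) < 1e-12
print("all cross-basis overlaps equal 1/2")
```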
It is also clear from our example that information is relative. Bob cannot at all predict the outcome, + or −, of any of the five optimal measurements, so the five measurement outcomes represent to him five classical bits. Alice, on the other hand, knowing what state she has prepared, can make more educated guesses about the outcomes of Bob's five measurements if she knows his measurement bases. Should, for instance, any of Bob's measurements have the prepared state as an eigenstate, then Alice can predict with certainty what this measurement's outcome will be. Hence, the sequence of the five measurement outcomes will give Alice less than five classical bits of randomness, since it is impossible that all of Bob's five measurements will result in two equally probable outcomes from Alice's perspective.
We have also seen that the average information gained by each successive, optimal measurement is non-increasing, and generally it is decreasing. This is well in line with any measurement process void of systematic errors. If each individual measurement has some intrinsic uncertainty, e.g. quantified by a variance σ², then the estimate obtained from N independent measurements has the variance σ²/N. Thus, the expected reduction of the variance, and hence the decrease of the differential entropy, will be smaller and smaller for each successive measurement. Only the MUBs are such that a measurement in one basis gives no information about the likelihood of obtaining any of the outcomes of a measurement in another MUB.
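The σ²/N scaling invoked above is easily demonstrated by simulation (the Gaussian noise model and sample sizes below are our own illustrative choices):

```python
import numpy as np

# Variance of the mean of N independent readings, each of variance sigma^2,
# scales as sigma^2 / N: each extra measurement narrows the estimate less
# than the previous one did.
rng = np.random.default_rng(1)
sigma, trials = 1.0, 200_000
for N in (1, 4, 16):
    means = rng.normal(0.0, sigma, size=(trials, N)).mean(axis=1)
    print(N, round(means.var(), 3))    # ≈ 1.0, 0.25, 0.062
```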
Although the information gain per successive measurement approaches zero as the number of measurements tends to infinity, the accumulated information, contained in the inferred probability density, tends to infinity. This is because there is (at least in principle) no limit to how precisely we can get to know the angles θ and φ that define the preparation of the state. As the probability density approaches a delta function on the Bloch sphere, the differential entropy, which initially was finite, tends to negative infinity.
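The divergence of the differential entropy for a narrowing density can be illustrated with a one-dimensional Gaussian stand-in (our own example, not the Bloch-sphere density itself): its entropy is H = (1/2) log₂(2πeσ²) bits, which decreases by log₂ 10 ≈ 3.32 bits for every tenfold narrowing and tends to −∞ as σ → 0.

```python
import numpy as np

def gaussian_entropy_bits(sigma):
    """Differential entropy (bits) of a normal density of width sigma."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)

for sigma in (1.0, 0.1, 0.01):
    # entropy falls without bound as the density approaches a delta function
    print(sigma, round(gaussian_entropy_bits(sigma), 3))
```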
It can be worth noticing that, up to the second measurement, the procedure outlined above can be mimicked by a classical system having two degrees of freedom, e.g. two unfair coins, each having a probability, unknown to Bob, of giving the outcome 'tails' when flipped. However, classical systems that are prepared in a statistical mixture of four definite states can only be (sensibly) measured using a 'basis' defined by these mutually exclusive outcomes. Hence, a classical system will not give Bob the opportunity to measure in, e.g., mutually unbiased bases. We have seen that MUBs result in a more rapid gain of information about how the system is prepared than repeated measurements in the same basis.
We have not discussed the optimal measurement strategy for an ensemble of N identically prepared quantum systems, which, in general but not always, is to make a joint measurement of all N systems. When making joint measurements one can use interference between different qubits. Interference lies at the heart of most quantum information tasks and can often render them effectively parallelised. This is of course not possible if each qubit in the ensemble is measured separately.
A kind of complement to our study is a recent work by Wootters [28] where the goal is to transfer the value of a continuous variable, exemplified by θ, via the outcomes of N identical von Neumann measurements by preparing the state optimally.