Lectures on Computation | Conferencias sobre computación - Richard P. Feynman

Contents | Índice


Preface | Prefacio

1. Introduction to Computers | Introducción a los computadores







2. Computer Organization | Organización del computador


























3. The Theory of Computation | La teoría de la computación






What kinds of problems can such machines do, or not do? It turns out that there are some questions that FSMs cannot answer but that Turing machines can. Why this should be the case is naturally of interest to us.

So why cannot an FSM do something this simple?
Here lies the problem. An arbitrary string requires a machine with an arbitrary - that is, ultimately, infinite - number of states. Hence, no finite state machine will do. What will do, as we shall see, is a Turing machine - for a Turing machine is, essentially, an FSM with infinite memory capacity.
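To make the contrast concrete, here is a minimal sketch (the particular task, checking balanced parentheses, is just an illustrative stand-in for the kind of problem at issue): a fixed table of states cannot count arbitrarily deep nesting, but an unbounded counter - playing the role of the Turing machine's infinite tape - can.

```python
def balanced(s):
    """Check balanced parentheses with an unbounded counter.

    No finite state machine can do this for arbitrary input: the nesting
    depth can exceed any fixed number of states. The counter below, standing
    in for a Turing machine's unlimited memory, simply grows as needed.
    """
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a ')' with nothing left open
                return False
    return depth == 0

assert balanced("((()))") and not balanced("(()")
```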




















Our discussion closely follows Minsky [1967]. [RPF]
By assigning different symbols to 0’s and 1’s - A’s and B’s, say - we can keep track of which were 0’s and 1’s; if we wanted to come along tomorrow and use the file again, we could, only we would find it written in a different alphabet.
Minsky’s solution for a locating Turing machine is shown in Figure 3.16:
True to the spirit of Turing machines, we are going to copy it slowly and laboriously to another part of the tape.
A cute feature of this machine is that it copies the contents of the file into the block containing the target filename on the original tape; that is, the target string is overwritten. (We can do this because we chose to have filenames and contents the same size.)
I said earlier that if you had an effective procedure for doing some computation, then that was equivalent to it being possible in principle to find a Turing machine to do the same computation.
the general recursive functions are Turing computable, and vice versa - so we can take “Turing computable” to be an effective synonym for “computable”.
Functions for which this is not true are called “partial”.
This redefined function is complete in the sense that we can assign some numerical value to it for any x.
But in general, we cannot say in advance when a particular value of x is going to give us trouble! Put another way, it is not possible to construct a computable function which predicts whether or not the machine T_F halts with input x.
This concept, of Turing machines telling us about other Turing machines, is central to the topic of Universal Turing machines to which we now turn.
Our discussion again closely follows Minsky [1967].
Minsky nicely relates this process to what you would do when using a quintuple list and a tape to figure out what a Turing machine does. The universal Turing machine U is just a slower version of you!
The description of U given above, due to Minsky, requires U to have 8 symbols and 23 states.
surely a machine that can do everything should be enormously complicated? The surprising answer is that it’s not! How efficient one can make a UTM is an entertaining question, but has no deep significance.
In other words, D always halts with an answer. Can such a machine exist? The answer is no! We can actually show that D is an impossible dream,
This is a clear contradiction! Going back through our argument, we find that it is our assumption that D exists that is wrong. So there are some computational problems (e.g. determining whether a UTM will halt) that cannot be solved by any Turing machine. This is Turing’s main result.
Consider computable real numbers: by which we mean those whose binary expansions can be printed on a tape, whether the machine halts or not.
We call a set “countable” if we can put its elements in one-to-one correspondence with elements of the set of positive integers; that is, if we can label each set member by a unique integer.
So by “diagonalization” as it is called (referring to the “diagonal” line we can draw through all of the underlined numbers above) we have shown that the real numbers are not countable. GEB XIII, Cantor’s diagonal.
A procedure might take the age of the Universe to complete yet still be technically “effective”. In practice we want procedures that are not just effective but also efficient.
Many problems in “artificial intelligence”, such as face recognition, involve effective procedures that are not efficient - and in some cases, they are not even very effective!
The principle here is that you can know a lot more than you can prove! Unfortunately, it is also possible to think you know a lot more than you actually know. Hence the frequent need for proof.
An effective procedure for this might involve taking all prime numbers up to n^(1/2) and seeing if any divide n; if not, n is prime.
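As a concrete sketch (the function and its details are ours; trial division by every integer up to n^(1/2) is a slight simplification of dividing only by the primes):

```python
def is_prime(n):
    """Effective, though hardly efficient, primality test by trial division."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:        # only divisors up to n**(1/2) need checking
        if n % d == 0:
            return False     # found a divisor, so n is composite
        d += 1
    return True

assert is_prime(97) and not is_prime(91)   # 91 = 7 * 13
```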
In mathematics, as in linguistics, a grammar is basically a set of rules for combining the elements of a language, only the language is a mathematical one (such as arithmetic or algebra).
But remember it took a Turing machine to do this: a finite state machine was not up to it.
We will finish our look at computability with an interesting problem discovered by Post as a graduate student in 1921. GEB I, Emil Post.
The question is this: does this process go on forever, stop, or go on periodically? The last I heard, all tested sequences had either stopped or gone into a loop, but that this should be so generally had not been proved.
4. Coding and Information Theory | Codificación y teoría de la información

Components can let us down chiefly in two ways. Firstly, they may contain faults: these can arise during manufacture and are obviously of extreme importance.
What I want to focus on now is a second way in which an element can let us down. This is when it fails randomly, once in a while, perhaps because of the random Brownian motion of atoms or just irregular noise in a circuit.
and the problem of unreliability was acute: with a million components one could expect a thousand of them to be acting up at any one time!
Indeed, until recently, the problem had ceased to be considered very serious.
Any of these could mean that the message we receive differs from the one sent: this is the so-called “communication problem in the presence of noise”.
Now this isn’t exactly the same situation as with memory - which is like sending a message through time rather than space - but you can see the similarities.
We will start with a look at how we might go about detecting errors, an essential step before we can correct them.
Clearly, all we have to work with is the message: calling up the sender for confirmation defeats the object!
This is an example of coding a message; that is, amending its basic structure in some way to allow for the possibility of error correction (or, of course, for reasons of security).
There are two main shortcomings of this procedure which we should address. Firstly, we might get an error in the parity bit!
Secondly, at best the check only tells us whether an error exists - it does not tell us where that error might be.
Actually, it is a kind of higher-dimensional generalization of the array method.
For any given message, we can ask the question: “How many check bits do I need to not only spot a single error, but also to correct it?” One clever answer to this question was discovered by Hamming, whose basic idea was as follows.
If the syndrome is zero, meaning all parity checks pass, there is no error; if it is non-zero, there is an error, and furthermore the value of the syndrome tells us the precise location of this error.
Each such bit will be a parity check run over a subset of the bits in the full fifteen-bit message.
We assign to each parity check a one if it fails and a zero if it passes, and arrange the resulting bits into a binary number, the syndrome.
Let us look at the rightmost parity check, the far right digit of the syndrome.
To get the second, look at the second digit from the right.
we have narrowed the possible location choices.
We first have to decide where to stick in our parity bits. There is nothing in what we have said so far that tells us whereabouts in the message these must go, and in fact we can put them anywhere. However, certain positions make the encoding easier than others, and we will use one of these.
If we had chosen an apparently more straightforward option, such as placing them all at the left end of the string occupying positions 1 to 4, we would have had to deal with a set of simultaneous equations in a,b,c,d. The important feature of our choice is that the parity positions 1,2,4,8 each appear in only one subset, giving four independent equations.
Note that the cost of these benefits is almost a 50% inefficiency - five check bits for an eleven bit message. As we pointed out earlier however, the inefficiency drops considerably as we increase the message length. For a one-thousand bit message, the inefficiency is a tiny one percent or so.
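A minimal sketch of the single-error-correcting core of this scheme (a 15-bit Hamming code with check bits at positions 1, 2, 4 and 8; the names are ours, and the extra overall parity bit used for double-error detection is left out for brevity):

```python
def syndrome(bits):
    """bits[1..15] of a 15-bit codeword (index 0 unused). Parity check j
    covers every position whose binary form has bit j set; the failed
    checks, read as a binary number, name the erroneous position."""
    s = 0
    for j in range(4):                        # four checks, four syndrome bits
        parity = 0
        for pos in range(1, 16):
            if pos & (1 << j):
                parity ^= bits[pos]
        s |= parity << j                      # this bit is 1 if the check fails
    return s                                  # 0 means no error

def encode(data11):
    """Put 11 data bits in the non-power-of-two positions, then set check
    bits 1, 2, 4, 8 so that every parity check passes."""
    bits = [0] * 16
    data_positions = [p for p in range(1, 16) if p not in (1, 2, 4, 8)]
    for p, d in zip(data_positions, data11):
        bits[p] = d
    for j in range(4):
        parity = 0
        for pos in data_positions:
            if pos & (1 << j):
                parity ^= bits[pos]
        bits[1 << j] = parity
    return bits

word = encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
word[13] ^= 1                  # flip one bit
assert syndrome(word) == 13    # the syndrome points straight at it
```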
We can use Poisson’s Law to get a handle on the probabilities of multiple errors occurring when we send our batches.
Now suppose we have no error detection or correction, so we are expending no cash on insurance.
On average, we should only be able to send a thousand batches before this happens, which is pretty miserable.
However, suppose we have our one percent gadget, with single error correction and double error detection. This system will take care of single errors by itself, and only stop and let me know there is a problem if a double error occurs, once in every million or so batches. It will only fail with a triple error, which turns up in something like every six billion batches. Not a bad rate for a one percent investment!
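The quoted rates can be reproduced with Poisson's Law; in the sketch below, the batch size of 1,000 bits and a bit error rate of 10^-6 are our assumed figures, chosen because they give roughly the numbers cited.

```python
from math import exp, factorial

n, q = 1000, 1e-6        # assumed: 1,000-bit batches, bit error rate 10^-6
lam = n * q              # expected number of errors per batch

def p_errors(k):
    """Poisson probability of exactly k errors in one batch."""
    return exp(-lam) * lam**k / factorial(k)

for k in (1, 2, 3):
    print(f"{k} error(s): roughly one batch in {1 / p_errors(k):,.0f}")
# -> about one in a thousand, one in two million, one in six billion
```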
Issues of efficiency and reliability are understandably central in computer engineering.
Spacecraft communications rely on a kind of voting technique, referred to as “majority logic decisions”.
How many copies you send depends on the expected error rate.
This information is still stored on disks, but individual disks are simply not fast enough to handle the required influx and outflow of data.
Every manufacturer would like to build the perfect disk, one that is error-free - sadly, this is impossible.
The reason you do not usually notice this is that machines are designed to spot flaws and work around them:
However, when a disk is working alongside many others in a parallel processing environment, the momentary hang up as one disk attends to a flaw can screw everything up.
The flaws we are talking about here, in any case, are not really random: they are permanent, fixed on the disk, and hence will turn up in the same place whenever the disk is operating.
The use of this Hamming coding method saved the whole idea of gang disks from going down the drain.
any number will do, but it must be non-zero, or you’ll get into trouble with what follows.

Given this, says Shannon, if no limit is placed on the length of the batches M_c, the residual error rate can be made arbitrarily close to zero.
In practice, however, it might require a large batch size; and a lot of ingenuity. However, in his extraordinarily powerful theorem, Shannon has told us we can do it. Unfortunately, he hasn’t told us how to do it. That’s a different matter!
This obviously makes sense. As the error rate drops, the upper limit on the efficiency increases, meaning that we need fewer codebits per data bit.
The actual proof of this theorem is not easy. I would like to first give a hand-waving justification of it which will give you some insight into where it comes from. Later I will follow a geometrical approach, and after that prove it in another way which is completely different and fun, and involves physics, and the definition of a quantity called information.
We now take the logarithm of both sides. The right hand side we work out approximately, using Stirling’s formula:
Taking the logarithm to base two of the left hand side of the inequality, and dividing both sides by M_c, we end up with Shannon’s inequality.
This inequality tells us that, if we want to code a message M, where the bit error rate is q, so that we can correct k errors, the efficiency of the result cannot exceed the bounds in (4.6).
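A plausible reconstruction of the bound referred to as (4.6), in its standard sphere-packing form (our notation; q ≈ k/M_c is the fraction of bits to be corrected):

```latex
% Counting argument: the 2^{M} possible messages, each owning a "sphere" of
% all code words within k errors of it, must fit inside the 2^{M_c} points
% of the code space:
%   2^{M} \sum_{i=0}^{k} \binom{M_c}{i} \le 2^{M_c}.
% Taking log_2 and applying Stirling's formula to the dominant binomial term
% gives Shannon's inequality for the efficiency:
\[
  \frac{M}{M_c} \;\le\; 1 + q \log_2 q + (1 - q)\log_2(1 - q),
  \qquad q = \frac{k}{M_c}.
\]
```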
Now to ask that our error rate be less than a number N (e.g. 10^-30) is equivalent to demanding that the number of errors we have to correct be less than K’,
A heck of a lot smaller than our 10^-30!

What Shannon has given us is an upper limit on the efficiency.
It’s a terrific mathematical problem.
However, to do this would require a prohibitively long M_c, so long that it is not practical. In fact, a scheme is used in which about one hundred and fifty code bits are sent for each data bit!
Message space, simply, is a space made up of the messages that we want to transmit.

We could easily generalize to a four-bit message, which would have a message space that was a 16-vertex “hypercube”, which unfortunately our brains can’t visualize!
This leads us to introduce a so- called “distance function” on the message space. The one we shall use is called the Hamming distance. The Hamming distance between two points is defined to be the number of bits in which they differ.
The notion of distance is useful for discussing errors. Clearly, a single error moves us from one point in message space to another a Hamming distance of one away; a double error puts us a Hamming distance of two away, and so on.
For a given number of errors e we can draw about each point in our hypercubic message space a “sphere of error”, of radius e, which is such that it encloses all of the other points in message space which could be reached from that point as a result of up to e errors occurring. This gives us a nice geometrical way of thinking about the coding process.
We can build a message space for M_c just as we can for M; of course, the space for M_c will be bigger, having more dimensions and points. Clearly, not every point within this space can be associated one-on-one with points in the M-space; there is some redundancy. This redundancy is actually central to coding.
Note that we can have simple error detection more cheaply; in this case, we can allow acceptable points in M_c to be within e of one another. The demand that points be (2e+1) apart enables us to either correct e errors or detect 2e.
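The distance function and the 2e+1 spacing rule are easy to state in code; a small sketch (the example code and received word are made up):

```python
def hamming_distance(x, y):
    """Number of bit positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

# With code words at least 2e+1 apart, up to e errors still leave the
# received word strictly closer to the original than to any other code
# word, so nearest-neighbour decoding corrects them.
code = ["00000", "11111"]                 # distance 5 = 2*2 + 1, so e = 2
received = "00011"                        # "00000" hit by two errors
nearest = min(code, key=lambda c: hamming_distance(c, received))
assert nearest == "00000"
```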

If we wanted single error correction for this system, we would have to use a space for M_c of four dimensions.
Using this, we can obtain an inequality relating the volume of M_c to that of the spheres.
There is no need to go through the subsequent derivations again; you should be able to see how the proof works out.
You should find that it is actually impossible to detect doubles without bringing in the length of the message.
What the poor parents are doing is converting long English sentences into single bits!
You’re calling long distance but getting the message through costs you nothing because he doesn’t pick up the phone.
So we could exploit the structure of the language to send fewer symbols, and this packing, or compression, of messages is what we’re going to look at now.
A good way to look at the notion of packing is to ask, if we receive a symbol in a message, how surprised should we be by the next symbol?
What we want to guess is how much freedom we have.
But in any case, people will be able to guess the next letter at a much better rate than one in thirty!
And that gives you an idea of how English can be compacted. It also introduces us to the notion of how much information is in a message.
You cannot get a more efficient sending method than this, compressing a whole message down to one number.
This number, the number of bits we minimally need to send to convey as much as we possibly could have conveyed in the N bits of English (or whatever other system being used) is called the information content, or just the information being sent.
that we are coding messages into a binary system and looking at the bare minimum of bits we need to send to get our messages across.
It is possible to give a crude definition of the “average information per symbol” using the notions we have developed.
In general, of course, the precise answer will be horribly difficult to find.
There can also be “long-range correlations” where parts of the string influence others some distance away. This is true in English, where the presence of a word at one point in a message typically affects what other words can appear near it. As we are not being rigorous yet we will not worry about such problems.
We call this ratio the information per symbol of our system, an interpretation that seems clear from the right hand side of (4.23).
In a sense, the amount of information in a message reflects how much surprise we feel at receiving it.
In this respect, information is as much a property of your own knowledge as anything in the message.
This illustrates how “information” is not simply a physical property of a message: it is a property of the message and your knowledge about it.
In this sense, the string contains a lot of “information”. Receiving the message changes your circumstance from not knowing what it was to now knowing what it is; and the more possible messages you could have received, the more “surprised”, or enlightened, you are when you get a specific one. If you like, the difference between your initial uncertainty and final certainty is very great, and this is what matters.
This makes sense, given our claim that the information in a message represents the “surprise” we experience at receiving it.
This is actually quite an unrealistic assumption for most languages.
Combinatorics comes to our rescue, through a standard formula.

We earlier defined information to be the base two logarithm of the number of possible messages in a string.
That definition was based on the case where all messages were equally likely, but you can see that it is a good definition in the unequal probability case too.
Assuming N to be very large, and using Stirling’s approximation, with which you should be familiar by now, we find:
We can therefore obtain the average information per symbol:
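The equation itself is not reproduced in these notes; in its standard form (which is presumably what is meant here), with p_i the probability of symbol i, it reads:

```latex
\[
  \frac{I}{N} \;=\; -\sum_i p_i \log_2 p_i ,
\]
% i.e. each symbol contributes on average \log_2(1/p_i) bits,
% weighted by how often it occurs.
```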

Note how this ties in with our notion of information as “surprise”: the less likely the message to appear, the greater the information it carries.

Incidentally, Shannon called this average information the “entropy”, which some think was a big mistake, as it led many to overemphasize the link between information theory and thermodynamics.
This gives the probability P_m of message m being sent, and we can treat each message as the symbol of some alphabet - labeling each one, rather like the parent-student card we looked at earlier.
These codes are unlike those we’ve considered so far in that they are designed for messages in which the symbol probabilities vary.
In fact, it is possible to invent a non-uniform code that is much more efficient, as regards the space taken up by a message, than the one we have. This will be an example of compression of a code.
The idea is that the symbols will vary in their lengths, roughly inversely according to their probability of appearance, with the most common being represented by a single symbol, and with the upshot that the typical overall message length is shortened.
Continue in this vein until we reach the right hand of the tree, where we have an “alphabet” of two symbols, the original maximally probable one, plus a summed “joint” symbol, built from all the others.
It is worth pointing out that other Huffman codes can be developed by exploiting the ambiguity that occasionally arises when a joint probability at some point equals one of the original probabilities.
We can easily calculate the length saving of this code in comparison with a straight three-bit code.
It has the property that no code word is a prefix of any other code word. A little thought shows that a code for which this is not true is potentially disastrous.
There is an ambiguity due to the fact that the symbols can run into each other. A good, uniquely decodable symbol choice is necessary to avoid this, and Huffman coding is one way forward.
Huffman coding differs from our previous coding methods in that it was developed for compression, not error correction.
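A compact sketch of the construction just described, using Python's heapq (the symbol probabilities below are made-up example values):

```python
import heapq

def huffman(probs):
    """Repeatedly merge the two least probable symbols into a 'joint' symbol;
    reading the 0/1 choices back off the merge tree gives a prefix-free code."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # the two least probable entries
        p2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, i2, merged))
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}   # made-up probabilities
code = huffman(probs)
avg = sum(p * len(code[s]) for s, p in probs.items())
print(code, avg)   # average length 1.75 bits/symbol, versus 2 for a flat code
```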
However, as I have stressed by the example of English, such dependence is extremely common. The full mathematical treatment of source alphabets comprising varying probabilities of appearance and intersymbol influence is quite complex and I will not address it here. However, I would like to give you some flavor of the issues such influence raises, and I will do this by considering predictive coding.

It’s not difficult to see how, if we send this string, we can reconstruct the original message at the other end by using an identical predictor as a decoder. It simply works backwards.
Interspersed between the ones will be long runs of zeroes. The key is this - when sending the message, we do not send these runs: instead we send a number telling us how many zeroes each run contained. We do this in binary, of course.
Predictive coding enables us to compress messages to quite a remarkable degree.
You can get pretty close to Shannon’s limit using compression of this sort.
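A toy sketch of the run-length step just described (the predictor itself is not shown - we simply assume its output is a mostly-zero string of bits - and the fixed 8-bit field for each run length is a made-up choice):

```python
def encode_runs(bits, width=8):
    """Send, for each run of zeroes, only its length as a fixed-width binary
    number (runs longer than 2**width - 1 are not handled in this toy)."""
    out, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        else:
            out.append(format(run, f"0{width}b"))   # length of the zero run
            run = 0
    out.append(format(run, f"0{width}b"))           # the trailing run
    return "".join(out)

def decode_runs(coded, width=8):
    runs = [int(coded[i:i + width], 2) for i in range(0, len(coded), width)]
    bits = []
    for r in runs[:-1]:
        bits.extend([0] * r + [1])
    bits.extend([0] * runs[-1])
    return bits

msg = [0] * 37 + [1] + [0] * 120 + [1] + [0] * 9
assert decode_runs(encode_runs(msg)) == msg
print(len(msg), "bits shrink to", len(encode_runs(msg)))   # 167 -> 24
```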
Ordinarily, information like the oil pressure in a car, the torque in a drive shaft, the temperature variation on the Venusian surface, is continuous: the quantities can take any value.
This is not a matter of fundamental principle; it is actually a practical matter. I will say a few words about it despite the fact it is somewhat peripheral. You could say that the whole course is somewhat peripheral.
The secret of sending the value of S is to approximate it.
Then, all we need do is split the interval [0, 1] up into one hundred slices (usually referred to as “bins”), and transmit information about which slice the value of S is in; in other words, a number between 0 and 100.
This is not uncommon. Usually, such a variable will have values that are not evenly distributed.
(very, very few physical quantities have flat probability distributions).


The basic idea for transmitting S in this general case is the same. We divide the effective range of S into a number of bins with the important difference that we size these bins so that they are all of equal probability


P(s) is just the cumulative probability function of S, the probability that S <= s. It clearly satisfies the inequality (0 <= P <= 1). One well-known statistical property of this function (as you can check) is that its own distribution is flat: that is, if we were to plot the probability distribution of P(s) as a function of s in Figure 4.6, we would see just a horizontal line.
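A small numerical check of this flattening property (the exponential distribution, the 100 bins and the sample count are all made-up example values):

```python
import math
import random

def bin_index(s, n_bins, cdf):
    """Quantize s by applying the cumulative probability P(s), which is
    uniformly distributed on [0, 1], and slicing that interval evenly -
    equivalent to bins of equal probability in s itself."""
    p = cdf(s)
    return min(int(p * n_bins), n_bins - 1)

cdf = lambda s: 1.0 - math.exp(-s)        # assume S exponential with mean 1
counts = [0] * 100
random.seed(0)
for _ in range(100_000):
    counts[bin_index(random.expovariate(1.0), 100, cdf)] += 1
print(min(counts), max(counts))           # every bin holds roughly 1,000 samples
```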
Consideration of such a problem will bring us on to consider the famous Sampling Theorem, another baby of Claude Shannon.
Of course, for a general function, the receiver will have to smooth out the set, to make up for the “gaps”. However, for certain types of continuous function, it is actually possible to sample in such a way as to encode completely the information about the function: that is, to enable the receiver to reconstruct the source function exactly! To understand how it is possible to describe a continuous function with a finite number of numbers, we have to take a quick look at the mathematical subject of Fourier analysis. I will cover this because I think it is interesting;
Recall that, according to Fourier theory, any periodic function f(t) can be written as a sum of trigonometric terms.

Now the typical function (signal) that is encountered in communication theory has a limited bandwidth;

This is the Sampling Theorem.
In other words, if we sampled the function at times spaced π/ν time units apart, we could reconstruct the entire thing from the sample!
Then, the sum (4.41) is no longer infinite and we only need to take a finite number of sample points to enable us to reconstruct f(t).
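In its standard form (our reconstruction, with ν the band limit in angular frequency, matching the π/ν sample spacing quoted above), the reconstruction formula behind the Sampling Theorem reads:

```latex
% A function containing no angular frequencies above \nu is fixed completely
% by its samples taken \pi/\nu apart:
\[
  f(t) \;=\; \sum_{n=-\infty}^{\infty}
    f\!\left(\frac{n\pi}{\nu}\right)
    \frac{\sin(\nu t - n\pi)}{\nu t - n\pi}.
\]
```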
Although I have skated over the mathematical proof of the Sampling Theorem, it is worth pausing to give you at least some feel for where it comes from.

The Fourier transform of the sampled function, F(t), is obtained by the process of “convolution”, which in crude graphical terms involves superposing the graph of Φ with that of X.
There is as much information in one of the Fourier-transformed bumps in Figure 4.11 as there is in the whole of Figure 4.9! As the former transform comes solely from the sampled function F(t), we can see the basic idea of the Sampling Theorem emerging.
Under such circumstances we get what is known as aliasing. The sampling interval will be too coarse to resolve high frequency components in f(t), instead mapping them into low frequency components - their “aliases”.
However, evidence that sampling has occurred often shows up. Maybe the best known is the behavior of wagon wheels in old westerns.
The explanation for this phenomenon lies in inadequate sampling.
To avoid aliasing, we would need to filter out of the signal any unduly high frequencies before we sampled.
Such developments will transform the future.
It seems that the technological world progresses, but real humanistic culture slides in the mud!
5. Reversible Computation and the Thermodynamics of Computing | Computación reversible y la termodinámica de la computación

We want to address the question: how much energy must be used in carrying out a computation?
what is the minimum energy required to carry out a computation?
I am going to illustrate things by concentrating on a particular, very basic physical model of a message being sent.
Since V_2 is smaller than V_1, this quantity is negative, and this is just a result of the convention that work done on a gas, rather than by it, has a minus sign.
We put some in to compress the gas, and conservation of energy says it had to go somewhere.
From the viewpoint of thermodynamics, what we have effected is a “change of state”, from a gas occupying volume V_1 to one occupying volume V_2.
Note that as we are dealing with a finite change here, we have replaced the infinitesimal (gamma) with a finite (delta).
For an irreversible process, the equality is replaced by an inequality, ensuring that the entropy of an isolated system can only remain constant or increase - this is the Second Law of Thermodynamics.
The situation is more fun, too!
What has happened, and this is very subtle, is that my knowledge of the possible locations of the molecule has changed.
and we just average over everything.
Unlike these, it is not a macroscopic property that arises from a sum of microscopic properties. Rather, it is directly related to the probability that the gas be in the configuration in which it is found.
Entropy quantifies this notion.
The gas with molecules going all one way has a W much less than that of the one with a more random - or more disordered - structure, and hence has a lower entropy.
Working isothermally, the momenta of the molecules within the container remain the same (δU = 0), but each molecule has access to fewer possible spatial positions.
The fact that the entropy of our compressed gas has dropped is a reminder that the system is not isolated - we have been draining heat into a heat bath.
You can see that if we compress the volume by a factor of 2, then we halve the number of spatial positions, and hence the number of configurations that the molecule can occupy.
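In symbols, using the relation between entropy and the number of configurations W discussed above (a sketch; the per-molecule bookkeeping is ours):

```latex
% Halving the volume halves W for a single molecule, so
\[
  \Delta S \;=\; k \ln\frac{W_{\text{after}}}{W_{\text{before}}}
          \;=\; k \ln\tfrac{1}{2} \;=\; -k\ln 2 ,
\]
% and, working isothermally (\Delta U = 0), the free energy we must supply is
\[
  \Delta F \;=\; \Delta U - T\,\Delta S \;=\; kT\ln 2
  \quad\text{per molecule, i.e. one "bit" of position information.}
\]
```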
We can now return to the topic of information and see where all this weird physics fits in.
Now if the message is a typical one, for some of these bits we will have no prior knowledge, whereas for others we will - either because we know them in advance, or because we can work them out from correlations with earlier bits that we have examined.
Understanding why this is so is worth dwelling on, as it involves a style of argument often seen in the reversible computing world.
Given our assumptions, it is possible to do so, although the downside is that we would have to take an eternity to do it!
Then we must perform a compression, and this will take free energy, as we discussed earlier, as we are lessening our ignorance of the atom’s position.
The net result is zero work to power our machine.
This realization, that it is the erasure of information, and not measurement, that is the source of entropy generation in the computational process, was a major breakthrough in the study of reversible computation.
An interesting question is: How does the occurrence of an error in a message affect its information content?
The first guess, and one that was a common belief for years, was that there was a minimum amount of energy required for each logical step taken by a machine.
The energy required per step is less than kT log q, less than kT log 2, in fact less than any other number you might want to set - provided you carry out the computation carefully and slowly enough. Ideally, the computation can actually be done with no minimum loss of energy. Perhaps a good analogy is with friction.
However, physicists are very fond of studying certain types of idealized engines, so-called Carnot heat engines in which heat energy is converted into work and back again, for which it is possible to calculate a certain maximum efficiency of operation.
If your computer is reversible, and I’ll say what I mean by that in a moment, then the energy loss could be made as small as you want, provided you work with care and slowly - as a rule, infinitesimally slowly.
Now you can see why I think of this as an academic subject.
When we come to design the Ultimate Computers of the far future, which might have “transistors” that are atom-sized, we will want to know how the fundamental physical laws will limit us.
In such circumstances, the output carries no more information than the input - if we know the input, we can calculate the output and, moreover, the computation is “reversible”.
The fact that there is no gain in information in our abstract “computation” above is the first clue that maybe there’s no loss of entropy involved in a reversible computation. This is actually correct: reversible computers are rather like Carnot engines, where the reversible ones are the most efficient. It will turn out that the only entropy loss resulting from operating our abstract machine comes in resetting it for its next operation.
We will later show that, in principle, such a calculation can be performed for zero energy cost.
I would now like to take a look in more detail at some reversible computations and demonstrate the absence of a minimum energy requirement.
This seems like a dumb sort of computation, as you’re not getting anywhere, but it is a useful introduction to some of the ideas underlying issues of energy dissipation.
In the second case, clearing the tape will cost free energy, but not for the copy:
Simply, there is no more information in the (data plus copy) set than is in just the single data set. Clearing the system should not, therefore, require more free energy in the first case than the second. This is a common type of argument in the reversible computing world.
The height of the hill, the amount of energy needed for the transition to occur, is called the barrier potential.
To do this, we need to be able to manipulate the potential curve; we have to make the other trough energetically more favorable to the dot.
Furthermore, we assume that the depths of the troughs can be altered by some force of interaction between the copier and the model. (Don’t worry if this is all horribly confusing and abstract! All will become clear.) We’ll call this a “tilt” force, since it tilts the graph.


However, if the procedure is graceful enough, the lowering of the barrier, the tilting of the trough and the copying can be done for nothing.
Also it’s nice to work this sort of thing out for yourself: as I said in Chapter One - OK, you’re not the first, but at least you understand it!
The machine that does the copying is an enzyme called RNA polymerase.
In an actual cell, the pyrophosphate concentration is kept low by hydrolysis, ensuring that only the copying process occurs, not its inverse.
Still, 100kT per bit is considerably more efficient than the 10^8 kT thrown away by a typical transistor!
To reiterate: The lesson of this section is that there is no absolute minimum amount of energy required to copy. There is a limit, however, if you want to copy at a certain speed.
It occurs to me that maybe the devices in the machine could all be made reversible, and then we could notice the errors as we go.
That would make this discussion more practical to you and since computing is engineering you might value this! In any case, I shouldn’t make any more apologies for my wild academic interest in the far future.
This will not be a general formula for energy dissipation during computation but it should show you how we go about calculating these things.
The bigger the energy difference E_1 - E_2, the quicker the machine hops from E_1 to E_2, and the faster the computation.
So we can envisage a computer designed so that it proceeds by diffusion, in the sense that it is more likely to move into a state with greater, rather than lower, availability.

In other words, for this process the energy loss per step is equal to the entropy generated in that step, up to the usual temperature factor.
We can provide a nice physical interpretation of this expression, although at the cost of mathematical inaccuracy.
Generally, then, we will always need a certain amount of junk to remind us of the history of the logical operation.
It is this latter feature that makes reversible computing radically different from ordinary, irreversible computing.
This is your first problem in reversible computing - how to handle subroutines.
The balls all move diagonally across a planar grid and obey the laws of idealized classical mechanics (i.e. zero friction and perfectly elastic collisions).

The first, which you would never invent if you were a logician, as it seems a damn silly thing to do, I’ll call a collision gate;

(You might find it an interesting exercise to consider the energy and momentum properties of this gate.)
The second and more important device is a redirection gate.

Note that if one ball is missing, the other just sails right through.



In fact, even the tiniest effects of quantum mechanics get in the way. According to the Uncertainty Principle, we cannot know both the precise location and momentum of a ball, so we cannot drop one perfectly straight.
Don’t forget, even your hand will be shaking from Brownian motion!
So how can we claim to have a physically implementable reversible computer?
but we are here discussing questions of principle, not practicality.
(It’ll put some chemists out of a job, but that’s progress).
But the point is that there are no further limitations on size imposed by quantum mechanics, over and above those due to statistical and classical mechanics.
say “up”, which might correspond to an excited state, and “down”, corresponding to a de-excited state.
6. Quantum Mechanical Computers | Computadores mecanocuánticos

We are here considering ideal machines; the effects of small imperfections will be considered later.
Since the laws of quantum physics are reversible in time, we shall have to consider computing engines which obey such reversible laws.

From a logical point of view, we must consider the wire in detail, for in other systems, and our quantum system in particular, we may not have wires as such.
What is the minimum free energy that must be expended to operate an ideal computer made of such primitives?
It could be greatly reduced if energy could be stored in an inductance, or other reactive element.
it seems ridiculous to argue that even this is too high and the minimum is really essentially zero.
What Bennett pointed out was that this former limit was wrong because it is not necessary to use irreversible primitives. Calculations can be done with reversible machines containing only reversible primitives. If this is done, the minimum free energy required is independent of the complexity or number of logical steps in the calculation.
This is a limit only achieved ideally if you compute with a reversible computer at infinitesimal speed.

The action is reversed by simply repeating it.
It is likewise the sum modulo two of a and b, and can be used to compare a and b, giving a 1 as a signal that they are different. Please notice that this function XOR is itself not reversible.
The AND function is the carry bit for the sum of a and b.

One is the a that we started with, and the other some intermediary quantity that we calculated en route:

This is typical of these reversible systems; they produce not only what you want in output, but also a certain amount of garbage. In this particular case, and as it turns out in all cases, the garbage can be arranged to be, in fact, just the input.
If the problem you are trying to do is reversible, then there might be no extra garbage, but in general, there are some extra lines needed to store up the information which you would need to reverse the operation. In other words, we can make any function that the conventional system can, plus garbage. The garbage contains the information you need to reverse the process. And how much garbage?
This is reversible because knowing the output and the input permits you, of course, to undo everything. This proposition is always reversible.

Overall, then, we have accomplished what we set out to do, and therefore garbage need never be any greater than a repetition of the input data.
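A tiny sketch of the two reversible primitives being discussed (the CONTROLLED NOT, whose second output is the XOR of a and b, and the CONTROLLED CONTROLLED NOT, which delivers AND when its third line starts at 0); in both, repeating the gate undoes it, and the "garbage" is just a copy of the input:

```python
def controlled_not(a, b):
    """Outputs (a, a XOR b); applying it twice restores (a, b)."""
    return a, a ^ b

def controlled_controlled_not(a, b, c):
    """Outputs (a, b, c XOR (a AND b)); with c = 0 the third line is the AND
    (the carry bit), and the first two lines carry the input along as garbage."""
    return a, b, c ^ (a & b)

for a in (0, 1):
    for b in (0, 1):
        assert controlled_not(*controlled_not(a, b)) == (a, b)   # self-inverse
        x, y, z = controlled_controlled_not(a, b, 0)
        assert (x, y, z) == (a, b, a & b)                        # AND + input
```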
We are going to write a Hamiltonian, for a system of interacting parts, which will behave in the same way as a large system in serving as a universal computer.
Our Hamiltonian will describe in detail all the internal computing actions but not, of course, those interactions with the exterior involved in entering the input (preparing the initial state) and reading the output.
What we would have then is that the state, in the mathematical form |a’,b’,c’>, is simply some operation G operating on |a,b,c>. In quantum mechanics, state changing operators are linear operators, and so we’ll suppose that G is linear.


It does not matter whether or not the cursor on the j site has arrived there by going directly from 0 to j, or going further and returning, or going back and forth in any pattern whatsoever, as long as it finally arrived at the state j.
In other words, it represents just the waves which are familiar from the propagation of the tight binding electrons or spin waves in one dimension, and are very well known. There are waves that travel up and down the line, and you can have packets of waves and so forth.
Thus the logical unit can act in a ballistic way.
The line is so long that in a real calculation little irregularities would produce a small probability of scattering, and the waves would not travel exactly ballistically but would go back and forth.
What one then could do, would be to pull the cursor along the program line with an external force.
Under these circumstances we can calculate how much energy will be expended by this external force.
That was the entropy lost per scattering.
This is a formula that was first derived by Bennett.
It is very much like a Carnot engine in which, in order to obtain reversibility, one must operate very slowly. For the ideal machine where p is 0, or where you allow an infinite time, the mean energy loss can be 0.
The Uncertainty Principle, which usually relates some energy and time uncertainty, is not directly a limitation.
It’s a question of probabilities, and so there is a considerable uncertainty in the time at which a calculation will be done. There is no loss associated with the uncertainty of cursor energy; at least no loss depending on the number of calculational steps.
No further limitations are generated by the quantum nature of the computer per se; nothing that is proportional to the number of computational steps.
In other words, there may be small terms in the Hamiltonian besides the ones we’ve written. Until we propose a complete implementation of this, it is very difficult to analyze.
But until we find a specific implementation for this computer, I do not know how to proceed to analyze these effects.
This computer seems to be very delicate and these imperfections may produce considerable havoc.
The time needed to make a step of calculation depends on the strength or the energy of the interactions in the terms of the Hamiltonian.

The complex conjugate reverses this.
We shall build all our circuits and choose initial states so that this circumstance will not arise in normal operation, and the ideal ballistic mode will work.



So therefore H_M is the part of the Hamiltonian representing all the atoms in the box and their external start and terminator sites.

So between s and t we have a new piece of equipment which has the following properties.


These few examples should be enough to show that indeed we can construct all computer functions with our SWITCH and NOT.

What we have done is only to try to imitate as closely as possible the digital machine of conventional sequential architecture.
What can be done, in these reversible quantum systems, to gain the speed available by concurrent operation has not been studied here.
At any rate, it seems that the laws of physics present no barrier to reducing the size of computers until bits are the size of atoms, and quantum behavior holds dominant sway.
7. Physical Aspects of Computation | Aspectos físicos de la computación

In this way we hope that Feynman’s unique ability to offer valuable physical insight into complex physical processes still comes through.
The unifying theme of this course has been what we can and cannot do with computers, and why.
However, we hope that our presentation will be intelligible to those with only a passing acquaintance with electricity and magnetism,
We shall consider the physical phenomena involved in its operation, and how it works in the engineering context of a Field Effect Transistor.
Our current understanding of the electrical properties of metals and other materials is based on the so-called “Band Theory” of solids.
Typically, there will be a discrete energy gap between the filled and conduction bands. The size of this gap largely determines whether our material is to be classified as a conductor or an insulator, as we’ll see.

This minimum energy is called the “band gap energy” and its value largely determines the electrical properties of a substance, as I’ve said.
Note that, due to the exponential in the formula, this transition rate rises rapidly with temperature. Nonetheless, for most insulators, this rate remains negligible right up to near the melting point.
This forms the basis of a device called a diode.
The central region is actually depleted of charge carriers and is referred to as the depletion region.
We say that the voltage reverse-biases the junction: in the current flow condition, the junction is said to be forward-biased. We call this device a junction diode and it has the fundamental property that it conducts in one direction but not the other.

If the doping is of the p-type, we are dealing with so-called nMOS technology; if the substrate is n-type, we have pMOS.
So we have a fantastic device - a switch!
We know that the current Ids is just the charge under the gate divided by the time it takes for the electrons to drift from the source to the drain. This is a standard result in electricity.
we find that the current (charge divided by time) is given by
However, we can see that as long as V_ds is fairly small, the transistor has the interesting property that the current through it is proportional to the applied voltage. In other words, it effectively functions as a resistor (remember V = IR!), with the resistance proportional to 1/V_gs.
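A hedged reconstruction of the estimate being described (our notation: C is the gate capacitance, L the channel length, μ the carrier mobility; the threshold voltage is ignored):

```latex
% Charge under the gate: Q \approx C\,V_{gs}.
% Transit time at drift velocity \mu E, with E \approx V_{ds}/L:
%   \tau \approx L/(\mu E) = L^{2}/(\mu V_{ds}).
\[
  I_{ds} \;=\; \frac{Q}{\tau} \;\approx\; \frac{\mu C}{L^{2}}\,V_{gs}\,V_{ds}
  \qquad\Longrightarrow\qquad
  R \;=\; \frac{V_{ds}}{I_{ds}} \;\propto\; \frac{1}{V_{gs}} .
\]
```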

In this situation the transistor is said to be “forward-biased”.


In this instance, where the level of the drain reservoir is below the level of the partition, the water from the source will simply “waterfall” into the drain, at a rate independent of the actual drain-level.
At such a point, the current flow will become constant, irrespective of V_ds.

Thus, if we directly connect the source to the gate so that each is automatically at the same voltage, we find our transistor acting no longer as a switch but as a resistance

Now you can see by comparison with the standard formula defining capacitance, Q = CV, that the capacitance of this system displays an extremely non-linear relationship with the plate voltage, V.


Now there is another essential property of MOSFETs that is not evident from strictly logical considerations. This is their behavior as amplifiers.

From a logical viewpoint, this is a pretty trivial operation - we have just produced the identity. We’re not doing any computing.
This would indeed be disastrous!
In other words, the output will always represent a definite logical decision, being relatively insensitive to minor power fluctuations along the chain. This circuit is an extremely effective so-called “follower”, which jacks up the power or impedance behind the line (if you like, it is a double amplifier). In a sense, we can control the whole dog just by controlling its tail.
The presence of amplification is essential for any computing technology.
If we can find how long the process takes, and maybe think up ways of speeding it up, we might be able to get better machines.



This is a nice example of how Nature places limitations on our technology!
A detailed analysis concluded that propeller-based machines would not work for speeds in excess of that of sound: there was a “sound barrier”.
So even if our transistors aren’t doing anything, they’re throwing away power!
In the CMOS approach, we employ a mixture of n-type and p-type MOSFETs in our circuitry.

Note that the nMOS depletion mode transistor has effectively been replaced by a conventional p-channel transistor.
The route to -V is cut off by the insulating n-type MOSFET. (I’ll leave it to the reader to see what happens when the input is switched back again.)
This is a remarkable property. In a CMOS inverter, no energy is required to hold a state, just to change it.
The matter of how much energy is required for a logical process was considered in the abstract in Chapter Five, but it is obviously important to get a handle on the practicalities of the matter.
Let’s start by considering in more detail the electrical behavior of a CMOS inverter as part of a chain. This will enable us to examine also the amplification properties of CMOS devices.
The output voltage is an extremely sensitive function of the input since small input changes are magnified many times at the output.

(Remember our analysis in Chapter Five where we saw that kT log 2 was theoretically attainable.)
So we have done the stupid thing of getting from one energy condition to the same energy condition by dumping all the juice out of the circuit into the sewer, and then recharging from the power supply!
We start off at sixty miles an hour, and we end up there, but we dissipate an awful lot of energy in the process.
One suggestion is to store the energy in an inductance, the electrical analogue of inertia.
In the process there is noise, friction, turbulence and whatnot, and energy is dissipated. There is a power loss.
We are back where we started, but we have used up a heck of a lot of energy in getting there!
In this case the pressure would be equalized but this finite time to stability results from the fact that water has inertia.
But now if we want the energy of the water back, we just have to open up the adjoining valve to the adjoining tank when the right-hand tank is at a low ebb.
To implement this in silicon we need the electrical analogue of this and that means we need the analogue of inertia. As I’ve said, for electricity this is inductance.

However, that need not mean we have to abandon the basic idea
If we do that, we will unavoidably lose energy.
There is an analogous principle in electricity: Never open or close a switch when there’s a voltage across it. But that’s exactly what we’ve been doing!
In actual circuits, the clocks are much slower than the transistors (e.g., a factor of fifty to one), and so clocking enables us to save a great deal of energy in our computations.

Perhaps we could define two states, in phase with the power supply (logical one) and out of phase (logical zero).
One of the central discoveries of the previous section, which might be general, is that the energy needed to do the switching, multiplied by the time used for this switching, is a constant - at least for resistive systems. We will call this constant the “dissipated action” (a new phrase I just made up).
Does it have to be so tiny? Well, yes, if we want to go as fast as possible.
- you can’t have everything changing too quickly, or you’ll get a jam.
We want to know this to see if, by redesigning such devices, we can get it down a bit, and perhaps use less energy or less time.

Maybe it will help us to understand what is going on here if we define the product of these terms to itself be a dissipated action - just that dissipated during a single collision. This isn’t forced on us: we’ll just see what happens.
Now this is an awful amount, and we would certainly hope that we can improve things somehow!
Firstly, is it completely silly to put things in parallel? Not at all: it’s good for accuracy.
Putting such parts in parallel and deciding the output on the basis of averaging, or by a majority vote, improves system reliability.
And what about putting parts in series? Well, I’ve thought a lot about this, but have yet to come up with any resulting advantage. It wouldn’t help with reliability - all it does is increase the lag. In fact, I can see no reason for having anything other than s = 1.
Current hardware design stinks! The energy loss is huge and there is no physical reason why we shouldn’t be able to get that down at the same time as speeding things up. So go for it - you’re only up against your imagination not Nature.
The (Et) ideas I’ve put forward here are my own way of looking at these things and might be wrong.
How are transistors actually made? How do we, being so big, get all this stuff onto such tiny chips? The answer is: very, very cleverly - although the basic idea is conceptually quite simple. The whole VLSI approach is a triumph of engineering and industrial manufacture, and it’s a pity that ordinary people in the street don’t appreciate how marvelous and beautiful it all is! The accuracy and skill needed to make chips is quite fantastic. People talk about being able to write on the head of a pin as if it is still in the future, but they have no idea of what is possible today!
We do this very cunningly.
We next bombard the wafer with UV light (or X-rays).
The transistor is just the crossing point of a polysilicon path and a diffusion path! Of course, the two paths do not cross in the sense of making physical contact - there is a layer of insulating oxide between them.

Obviously, some kind of direct contact is needed; otherwise, we would find a capacitor or transistor where the lines cross.
Let us begin by defining a certain unit of length, λ, and express all lengths on the chip in terms of this variable.
The metal wire, however, must be at least 3λ across, to counter the possibility of what is known as “electromigration”, a phenomenon whereby atoms of the metal tend to drift in the direction of the current.
There is a standard heuristic technique for drawing out circuits, one which tells us the topology of the layout, but not its geometry - that is, it tells us what the paths are made of, and what is connected to where; but it does not inform us as to scale, i.e. the relevant lengths of paths and so on.
Since the contents of this memory are to be fixed, we might as well store everything on a Read Only Memory (ROM) device. The only potential hitch in this otherwise straightforward procedure arises from timing: it is conceivable that some instructions could leave the ROM device before the rest, changing the state of the machine and confusing the sensing.
This was how things were done in the early days, carefully building immensely complicated logic circuits, deploying theorems to find the minimum number of gates needed, and so forth, without a ROM in sight.
This output is the set of “what next” instructions corresponding to the particular input.
We now have a device that can manipulate each signal with NOT, AND and OR – in other words, it can represent any logical function whatsoever.
It is a general result that any Boolean function can be factorized in this way.
Incidentally, note, from Figure 7.64 that the generic OR-plane is essentially the AND-plane rotated through a right angle.
This sounds all very straightforward and simple.
Metal, however, has such a low resistance that the load time is relatively much shorter - so if you want to send a signal any great distance, you should put it on metal.
Let it tell us when it’s ready! It carries out its computation, and then sends a signal saying it’s ready to send the data. In this way, the timing is controlled by the computing elements themselves, and not a set of external clocks.
The fact that a decision has to be made introduces the theoretical possibility of a hang-up caused by the data coming in at just such a time that the buffer is not quick enough to make a decision - it can’t make its mind up.
So as we build the machine bigger, the wiring problem becomes more serious.
The problem with this sort of 3-D design, of course, is that for anyone to look at it - to see what’s going on - they have to be able to get inside it, to get a hand or some tools in. At least with two dimensions we can look at our circuits from above!
A natural question to ask is: if we pick a wire at random, what is the chance that it is of a certain length? With Rent’s Rule we can actually have a guess at this, after a fashion.
However, if we just deal with orders of magnitude, we shall assume we can neglect this subtlety.
Note that this quantity is divergent:
If we space our cells a little further apart, the size of the machine must balloon out of proportion.
It used to be said in the early eighties that a good designer, with a bit of ingenuity and hard work, could pack a circuit in such a way as to beat Rent’s Rule.
When it comes to the finished product, Rent’s Rule holds sway, even though it can be beaten for specific circuits. Nowadays, we have “machine packing programs”, semi-intelligent software which attempts to take the contents of a machine and arrange things so as to minimize the space it takes up.
These lectures cover interesting topics such as the physics of optical fibers and the possibilities for optical computers.
Moreover, these lectures, by his choice of topics, also demonstrate the subject areas that he felt were important for the future.
Afterword: Memories of Richard Feynman | Epílogo: Recuerdo de Richard Feynman

This episode illustrates only a small part of the (healthy) culture shock I experienced in California.
It was only after my wife and I were stopped by the police and asked why we were walking on the streets of Pasadena that I understood the paradox that, in California, you had to have a car to buy a car. Another ‘chicken and egg’ problem arose in connection with ‘ID’ - a term we had not encountered before. As a matter of routine, the police demanded to see our ID and of course the only acceptable ID in deepest Pasadena at that time was a California driver’s license.
I tracked Steve down to the seminar room where I saw he was engaged in a debate with a character who looked mildly reminiscent of the used car salesmen I had recently encountered.
https://dle.rae.es/ethos - “ethos”: the set of traits and modes of behavior that make up the character or identity of a person or a community.
Unfortunately for me, my lecture just happened to be a handy vehicle for him to make this point!
There was the obvious distaste of Gell-Mann for the whole approach but that would not have mattered if it had not been for the awkward fact of ‘Feynman’s notebooks’.
Each time Feynman listened, commented and corrected - and then proceeded to derive my ‘new’ results several different ways, pulling in thermodynamics, rotational invariance or what have you, and using all sorts of alternative approaches. He explained to me that once he could derive the same result by a number of different physical approaches he felt more confidence in its correctness. Although this was very educational and stimulating, it was also somewhat dispiriting and frustrating.
I looked at electron-proton scattering when both the electron and proton were ‘polarized’ - with their spins all lined up in the same direction.
Obviously these stories get inflated in the telling but I did ask Feynman about this one since it seemed so out of character to the Feynman I knew.
Certainly he enjoyed making a quick and amusing response.
On one memorable occasion, the speaker started out by writing the title of his talk on the board: “Pomeron Bootstrap”. Feynman shouted out: “Two absurdities” and the room dissolved into laughter.
Feynman could be restrained: on the occasion of another seminar he leaned over to me and whispered “If this guy wasn’t a regular visitor, I would destroy him!”
Feynman credits his “knowing very early on the difference between knowing the name of something and knowing something” to these experiences with his father.
Feynman’s advice to me on that occasion was: “You read too many novels.” He had started out very narrow and focused and only later in life had his interests broadened out.
and in his unique ability to attack physics problems from many different angles.
For me, it was Feynman’s choice of words that made a Feynman lecture such a unique experience.
In the talk he likened smashing two protons together to smashing two watches together: one could look at the gearwheels and all the other bits and pieces that resulted and try to figure out what was happening.
At a summer school in Erice in Italy one summer he was asked a question about conservation laws. Feynman replied: “If a cat were to disappear in Pasadena and at the same time appear in Erice, that would be an example of global conservation of cats. This is not the way cats are conserved. Cats or charge or baryons are conserved in a much more continuous way.”
Instead, he tells us of all the blind alleys and wrong ideas that he had on the way to his great discoveries.
And, like falling in love with a woman, it is only possible if you do not know too much about her, so you cannot see her faults. The faults will become apparent later, but after the love is strong enough to hold you to her.
Later in the lecture Feynman writes: “I suddenly realized what a stupid fellow I am; for what I had described and calculated was just ordinary reflected light, not radiation reaction.”
Johannes Kepler - who was first to write down laws of physics as precise, verifiable statements expressed in mathematical terms.
Kepler summed up his struggle with the words: “Ah, what a foolish old bird I have been!”
That was my moment of triumph in which I realized I really had succeeded in working out something worthwhile.”
In the days when calculations like Slotnick’s could take as much as six months, this was the incident that put ‘Feynman’s diagrams’ on the map.
As he says, “learning how to not fool ourselves is, I’m sorry to say, something that we haven’t specifically included in any particular course that I know of. We just hope you’ve caught on by osmosis.” He concludes with one wish for the new graduates: “the good luck to be somewhere where you are free to maintain the kind of integrity I have described, and where you do not feel forced by a need to maintain your position in the organization, or financial support, or so on, to lose your integrity.”
Feynman was never restricted to research in any one particular field:
It was Feynman’s way of telling the doctor that even in a coma he could hear and think - and that you should always distrust what so-called ‘experts’ tell you!
“He believed in the primacy of doubt, not as a blemish upon our ability to know but as the essence of knowing.”