Biology and computing

Biology and computing used to be at opposite ends of the scientific spectrum. But the deluge of DNA data and the complexity of computer networks have begun a productive conversation between the two, one which may lead to a kind of union.
December 20, 2000

Inside a long, low sliver of glass and steel in upstate New York, the world's most powerful computer is taking shape. It will consist of 256 cube-shaped racks laid out on a 16-by-16 grid. Each rack will contain four circuit boards, each of which will have 36 chips mounted on it. Each of these chips, in turn, will contain 32 processors, each one equivalent in computing power to one of today's fastest desktop PCs. The resulting supercomputer, which is called Blue Gene and is being constructed by IBM at its Thomas J Watson research centre, will thus consist of a total of 1.2m processors. Collectively, they will be capable of performing a million billion operations per second, making Blue Gene over 100 times faster than any computer yet built.
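
As a quick check of those figures (everything below comes straight from the numbers just quoted; nothing else is assumed):

```python
# Checking the figures quoted above: racks, boards, chips and processors
# multiply out to roughly 1.2m processors, and a million billion operations
# per second shared among them works out at a little under a billion
# operations per second each, consistent with a fast desktop PC of 2000.

processors = 256 * 4 * 36 * 32            # racks x boards x chips x processors
total_ops_per_second = 1e15               # "a million billion operations per second"

print(processors)                         # 1,179,648, i.e. about 1.2m
print(total_ops_per_second / processors)  # roughly 8.5e8 operations per second each
```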

And what will this computer be used for when it is finally switched on, sometime in 2003? The answer is surprising. It will not be put to work modelling nuclear explosions, forecasting the weather, cracking codes or analysing collisions between galaxies. Instead, it will be used for biological research.

Although astronomers, meteorologists, mathematicians and physicists have been using computers ever since they were invented, biologists have not. In part, this is because biology does not yield to mathematical abstraction in the way that other sciences do. At least, it did not until recently. But biologists are now drowning in genomic data. The sequencing of the human genome is the best known example; but the gene sequences of other organisms, from bacteria to mice, are also spewing into databases. In order to make sense of it all, biologists need computers: big ones.

While biologists rush to master computing, computer scientists are brushing up on biology. For the realisation that biology is, at root, an information science has coincided with a growth in the complexity of computer systems, notably the internet, to the extent that they increasingly resemble biological systems. At a low enough level, biology resembles computing; and at a high enough level, computing resembles biology.

Computer viruses: more than a metaphor

Computing has been borrowing ideas from biology for some time. We speak routinely of computer viruses, or of a processor as a computer's "brain." Often such analogies are inaccurate. But many researchers have found that rather than trying to solve a difficult computing problem from scratch, it often makes sense to steal ideas from Mother Nature.

Take the problem of network intrusion, where malicious hackers break into a company's network, as recently happened to Microsoft. Preventing such break-ins is hard, because rules about what is or is not allowed are either too lax to prevent attacks, or so strict that they antagonise legitimate users. The problem is one of distinguishing between legitimate actions by users within the network and malicious actions by outsiders. It is, in other words, the same problem as that faced by the immune system. Hence the decision by Stephanie Forrest and her colleagues at the University of New Mexico to build an artificial immune system, called Artis, that deliberately imitates its biological equivalent. It relies on the idea that when someone tries to break into a network, the intrusion changes the pattern of data flow from its usual state.

In a real immune system, cells called lymphocytes are created at random, and those that do not react with any of the body's naturally occurring molecules are released into the blood to search for intruders (the rest are destroyed). Similarly, Artis creates software lymphocytes called detectors, which consist of random strings of binary digits (bits). Most are deleted because they match normal patterns of network traffic, but if one survives for more than two days it is released, like a lymphocyte, to search for intruders. The chances are that it will match only network traffic that deviates from the normal pattern. Haphazard though this approach sounds, it proved effective when tested against simulated network attacks, identifying all of the break-ins with fewer false alarms than competing systems.
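
The mechanism is easier to see in miniature. What follows is not Artis itself but a toy sketch of the same negative-selection idea; the pattern length, the matching rule and every threshold are assumptions chosen purely for illustration.

```python
import random

# A toy rendering of negative selection, not the real Artis: detectors are
# random bit strings, candidates that match normal traffic are culled, and
# the survivors flag anything they do match as anomalous. The pattern
# length, the r-contiguous-bits matching rule and all thresholds below are
# illustrative assumptions.

BITS = 16      # length of each traffic pattern and detector (assumed)
R = 12         # contiguous agreeing bits needed to count as a match (assumed)

def matches(detector, pattern, r=R):
    """True if detector and pattern agree on r contiguous positions."""
    run = 0
    for d, p in zip(detector, pattern):
        run = run + 1 if d == p else 0
        if run >= r:
            return True
    return False

def random_string(n=BITS):
    return [random.randint(0, 1) for _ in range(n)]

def train_detectors(normal_traffic, n_detectors=50):
    """Generate detectors at random; keep only those matching no normal pattern."""
    detectors = []
    while len(detectors) < n_detectors:
        candidate = random_string()
        if not any(matches(candidate, p) for p in normal_traffic):
            detectors.append(candidate)      # the candidate "matures"
    return detectors

def is_anomalous(pattern, detectors):
    """Flag a pattern as a possible intrusion if any detector matches it."""
    return any(matches(d, pattern) for d in detectors)

# "Self" is whatever traffic looked like during normal operation; a mature
# detector can match only traffic unlike the normal sample it was trained on.
normal = [random_string() for _ in range(200)]
detectors = train_detectors(normal)
print(is_anomalous(random_string(), detectors))  # usually False with so few detectors
```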

Dealing with computer viruses is another area where a biological approach makes sense. A computer virus is a small piece of self-replicating code. On its own, on a floppy disk, say, it is harmless; but once it gets inside a computer, it can spread. In this respect it is like a biological virus, which is simply a bit of genetic material wrapped up in a protein coat. Just as a computer virus exploits the environment inside a computer, a real virus hijacks the machinery inside a living cell.

So researchers at IBM have proposed creating an immune system for the internet, which they call the Digital Immune System. When a PC detects a suspicious file, rather than just sounding a warning, it passes the file across the network to a central location for analysis. The file is analysed and used to infect an isolated test network of PCs to find out how the virus inside it works, just as biologists studying viruses grow them in culture. A signature to identify the virus, along with an antidote, is then generated, tested on the infected test network, and finally distributed over the internet to eradicate the original infection and prevent infections elsewhere.
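
As a cartoon of that round trip: nothing below corresponds to IBM's actual system; the suspicious file, the "analysis" step and the scanner are all invented to show the shape of the workflow.

```python
# A caricature of the workflow, not IBM's system: the suspicious file, the
# "analysis" (which here just picks a distinctive byte sequence out of the
# sample) and the scanner are all invented for illustration.

def central_lab_analyse(sample: bytes, length: int = 8) -> bytes:
    """Stand-in for the analysis step: derive a signature from the sample."""
    middle = len(sample) // 2
    return sample[middle:middle + length]

def scan(file_contents: bytes, signatures: list) -> bool:
    """Report an infection if any known signature appears in the file."""
    return any(sig in file_contents for sig in signatures)

# A PC flags a suspicious file and ships it off for analysis...
suspicious = b"...payload...SELF-COPYING-STUB-01...payload..."
signatures = [central_lab_analyse(suspicious)]

# ...and the resulting signature is distributed so every machine can scan for it.
print(scan(b"mail attachment SELF-COPYING-STUB-01 hidden inside", signatures))  # True
print(scan(b"a perfectly ordinary document", signatures))                       # False
```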

There are many other examples of biologically inspired computing. Neural networks, primitive brain-like networks of interconnected neurons simulated in software, are used to teach computers to recognise particular patterns. Then there are genetic algorithms, in which candidate solutions to a problem are represented as strings of symbols; random mutations are applied to them, the most successful candidates are selected, and the cycle repeats. Genetic algorithms have been used on problems from route-planning to engine design. Crucially, both neural networks and genetic algorithms can often find good answers to problems that would take far too long to solve by conventional, exhaustive methods.
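
A bare-bones genetic algorithm fits in a dozen lines. The sketch below is generic rather than taken from any particular system, and its bit-counting fitness function is a stand-in for a real objective such as the length of a route or the efficiency of an engine.

```python
import random

# A minimal genetic algorithm of the kind described above: candidate
# solutions are bit strings, the fittest are kept, and random mutation
# produces the next generation. The fitness function (count the 1s) is a
# toy stand-in for a real objective.

LENGTH, POP, GENERATIONS, MUTATION = 20, 30, 60, 0.05

def fitness(candidate):
    return sum(candidate)            # toy objective: maximise the number of 1s

def mutate(candidate):
    return [1 - b if random.random() < MUTATION else b for b in candidate]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # keep the better half, then refill the population with mutated copies
    survivors = sorted(population, key=fitness, reverse=True)[:POP // 2]
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP - len(survivors))]

best = max(population, key=fitness)
print(fitness(best), best)   # typically close to a string of all 1s
```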

"We believe that computers and software have, without people really noticing it, become far more biological," says Forrest. As a result traditional approaches to software engineering are no longer going to work. "We have to think about managing and controlling these systems more like natural ecosystems," she says. One reason today's computers are so susceptible to viruses and break-ins, she adds, is that the ubiquity of Microsoft's Windows operating system has resulted in what biologists would call a monoculture. One possible way to counter this kind of vulnerability is to make tiny modifications to software, such as adding occasional "do nothing" instructions, that do not affect the way a program works but ensure that different copies are not identical. This would make life harder for virus writers or malicious hackers.

Computers and genomics

As computer scientists take inspiration from biology, so biologists are becoming increasingly reliant on computing. Indeed, the sequencing of the human genome was as much a computational exercise as a biological one. The "shotgun sequencing" method used by Celera, the private company working on the project, involved shattering strands of DNA into tiny fragments, working out the genetic information encoded in each fragment, and then fitting all the small pieces of genetic code together again using computer analysis.
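
The reassembly step can be sketched in a few lines of code. Real assemblers such as Celera's must cope with sequencing errors, repeated stretches of DNA and the two strands of the double helix; the greedy sketch below, working on invented fragments, shows only the core idea of merging overlapping pieces.

```python
# A toy illustration of the reassembly step in shotgun sequencing: repeatedly
# merge the two fragments with the largest overlap until one sequence remains.
# The fragments are invented; real data is messier in every way.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # find the pair with the biggest overlap and merge it into one piece
        n, a, b = max(((overlap(a, b), a, b)
                       for a in frags for b in frags if a != b),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])
    return frags[0]

reads = ["GATTACAGG", "ACAGGTTT", "TTTCCGA"]   # invented fragments
print(assemble(reads))                          # -> "GATTACAGGTTTCCGA"
```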

Tim Hubbard, a computational biologist at the Sanger Centre in Cambridge (who put the human genome on a CD for Prospect's October issue), points out that the amount of sequence data available is now doubling every six months. A single laboratory can now produce 100 gigabytes of data per day, about 20,000 times the volume of data in the complete works of Shakespeare. Simply staying on top of this mountain of data requires constant infusions of new equipment. And to do things like compare one genome (of a human, say) with another (of a mouse, perhaps) will require further improvements in hardware and software.

But the biggest challenge facing computer-wielding biologists at present is the protein-structure problem. Now that the human genome has been sequenced, the next step is to work out what it all does. Since genes are merely instructions for making proteins, that means working out the structure, and hence the function, of the protein associated with each gene. Knowing the function means you can work out, for example, which genes are associated with particular diseases. And knowing the structure of the corresponding protein makes it easier to design drugs to interact with it, since each kind of protein folds up into its own characteristic shape.

The traditional way to determine the structure of a protein is experimentally, by getting the protein to form a crystal and then probing its shape by X-ray crystallography. The problem is that not all proteins can be crystallised. So the hope is that it will be possible to determine protein structures using computers instead. The relevant gene sequence would be fed in, and translated into a ball-and-stick atomic model of the corresponding protein. By working out the forces between the atoms, it ought then to be possible to simulate how the protein folds itself up, and thus determine its characteristic structure.
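
In outline, such a simulation is a loop: work out the forces, nudge every atom a tiny step, repeat. The sketch below is a drastic simplification, a one-dimensional chain of beads joined by springs rather than a real protein force field, and every number in it is an assumption made for illustration.

```python
# A drastically simplified cartoon of the simulation described above: a chain
# of "atoms" joined by spring-like bonds, advanced one small time step at a
# time by computing the forces and updating positions. Real force fields have
# many more terms (bond angles, torsions, electrostatics, water), and all the
# numbers below are illustrative assumptions, not values from any real model.

BOND_LENGTH, STIFFNESS, MASS, DT, DAMPING = 1.0, 100.0, 1.0, 1e-3, 0.999

def step(positions, velocities):
    forces = [0.0] * len(positions)
    for i in range(len(positions) - 1):
        stretch = (positions[i + 1] - positions[i]) - BOND_LENGTH
        forces[i] += STIFFNESS * stretch      # Hooke's law pulls the pair together
        forces[i + 1] -= STIFFNESS * stretch
    for i in range(len(positions)):
        velocities[i] = (velocities[i] + forces[i] / MASS * DT) * DAMPING
        positions[i] += velocities[i] * DT
    return positions, velocities

# A stretched three-atom chain settles back towards its natural bond length.
pos, vel = [0.0, 1.5, 3.0], [0.0, 0.0, 0.0]
for _ in range(10_000):
    pos, vel = step(pos, vel)
print(pos)    # roughly [0.5, 1.5, 2.5]
```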

So far, however, attempts to model protein folding using computers have made little headway. This is either, says Hubbard, because the ball-and-stick models are too crude, or because not enough processing power has been thrown at the problem. The amount of computing muscle required is enormous: the simulation must re-evaluate the protein's structure after every millionth of a billionth of a second of simulated time. Since a protein takes up to a second to fold up, this means a million billion time steps are required, each of which involves calculating the forces between thousands of individual atoms. Even a Cray supercomputer working for three months was only able to model the behaviour of a small protein for a millionth of a second.
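
The arithmetic behind those numbers can be laid out explicitly. The time step, the folding time and the machine speed below are the figures quoted in this article; the atom count and the cost of each pairwise force calculation are rough assumptions.

```python
# Rough scale of the computation: the time step, folding time and Blue Gene's
# speed are the figures quoted in the article; the atom count and the cost of
# each pairwise force calculation are assumed for illustration.

TIME_STEP    = 1e-15    # a millionth of a billionth of a second
FOLDING_TIME = 1.0      # a protein can take up to a second to fold
ATOMS        = 10_000   # "thousands of individual atoms" (assumed value)
OPS_PER_PAIR = 10       # floating-point operations per atom pair per step (assumed)
BLUE_GENE    = 1e15     # a million billion operations per second

steps = FOLDING_TIME / TIME_STEP                         # a million billion steps
ops_per_step = ATOMS * (ATOMS - 1) / 2 * OPS_PER_PAIR    # every pair of atoms
total_ops = steps * ops_per_step

print(f"{total_ops:.1e} operations in all")              # about 5e23
print(f"{total_ops / BLUE_GENE / (3600 * 24 * 365):.0f} years on Blue Gene")
```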

Hence IBM's decision to build Blue Gene, a supercomputer more powerful than anything ever built, which has been designed to simulate protein folding. Even Blue Gene running for a year will not be enough to fold a single protein fully. But by performing thousands of small test runs and analysing how different starting conditions change the way a protein folds, it ought to be possible to start to tease out the rules that govern the process. Ultimately, understanding these rules would make it possible to do more than just work out protein structures. It would also allow scientists to design and test new drugs "in silico," and to probe the causes of diseases such as Creutzfeldt-Jakob disease, which is caused by misbehaving proteins called prions. Once protein behaviour can be modelled, it should be possible to simulate cells, organs, and possibly even whole organisms. (Already, biologists are working on computer models of the human heart, lungs and pancreas.)

Evidently there is just as much scope for computing in biology as there is for biology in computing. But there is a problem: how to find enough people with the right skills. As Sylvia Spengler, a computational biologist at the Lawrence Berkeley National Laboratory in California, gingerly puts it: "Many biologists, if you ask them why they chose the subject, will tell you they had a problem with maths." A report issued last year by America's National Institutes of Health warned of "the alarming gap between the need for computation in biology and the skills and resources available to meet that need." Many people working in the field did not start their careers as biologists; a lot of them have backgrounds in computing or physics and have chosen to switch to what has become one of the most exciting fields in science. But in the long term, what is needed is a new kind of scientist, who is equally at home in the worlds of computing and biology.

The coming fusion

Where will all this lead? The ultimate fusion of biology and computing would be the construction of a molecular computer. If biologists and computer scientists can determine the rules that govern molecular interaction, they may be able to exploit them to build a computer entirely out of biological molecules.

The first steps in this direction have already been taken. The field was inaugurated in 1994 by Leonard Adleman of the University of Southern California, who solved a simple route-planning problem by manipulating strands of DNA in a test tube. He encoded all of the possible answers to the problem in billions of strands of DNA, and used a series of clever steps that relied on biological logic to discard the incorrect answers so that only strands representing the correct answer were left.
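
The logic of the experiment translates directly into conventional code: generate every candidate route, then filter out those that break the rules. In the sketch below the small seven-city graph is invented for illustration rather than taken from Adleman's experiment; the difference, of course, is that he did the generating and filtering with molecules in a test tube rather than with silicon.

```python
from itertools import permutations

# The logical shape of Adleman's experiment, rendered in ordinary code:
# generate every candidate route, then discard those that are not legal paths
# through the graph. The seven-city graph below is invented for illustration.

edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (0, 3), (1, 4), (2, 5)}            # directed edges of a toy graph
cities = range(7)

def is_valid_path(route):
    """A route is valid if every consecutive pair of cities is joined by an edge."""
    return all((a, b) in edges for a, b in zip(route, route[1:]))

# "Mix" all candidate answers (every ordering of the cities that starts at 0
# and ends at 6), then filter out the invalid ones.
candidates = (p for p in permutations(cities) if p[0] == 0 and p[-1] == 6)
answers = [p for p in candidates if is_valid_path(p)]
print(answers)    # the surviving "strands": every legal path visiting each city once
```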

Last year Ehud Shapiro of the Weizmann Institute in Israel took the next step when he unveiled his design for a biological model of a Turing machine, the simplest general-purpose computer. Despite its simplicity, it is a founding principle of computer science that any task that can be solved by an arbitrarily complex computer can also be solved by a Turing machine (though it may take longer). Shapiro's proposed molecular machine would use a long strand of RNA (a chemical cousin of DNA) in place of the paper tape on which a Turing machine reads and writes its symbols, and the whole thing would be immersed in a soup of "rule molecules" that would encode the machine's rules. Shapiro is under no illusion about the difficulties of building such a machine. It would require complete human mastery of the processes that go on inside cells. But perhaps in 30 years, he says, it will be possible.
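
For readers who have not met the device, a Turing machine is easy to render in ordinary software, which is what Shapiro's molecules would do chemically. The rule table below is an arbitrary toy (it merely flips the bits written on its tape) and bears no relation to Shapiro's design.

```python
# A conventional software Turing machine, included to illustrate the model
# that Shapiro's proposed molecular device would implement chemically. The
# states, alphabet and rules below are invented for illustration.

def run_turing_machine(tape, rules, state="start", head=0, blank="_"):
    """rules: (state, symbol) -> (new_state, new_symbol, move), move in {-1, 0, +1}."""
    tape = dict(enumerate(tape))                       # the tape, stored sparsely
    while state != "halt":
        symbol = tape.get(head, blank)
        state, tape[head], move = rules[(state, symbol)]
        head += move
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape.get(i, blank) for i in cells)

# Toy rule table: walk right, flipping 0s and 1s, and halt at the first blank.
rules = {
    ("start", "0"): ("start", "1", +1),
    ("start", "1"): ("start", "0", +1),
    ("start", "_"): ("halt", "_", 0),
}

print(run_turing_machine("0110", rules))   # -> "1001_"
```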

Other researchers, meanwhile, have been experimenting with logic gates and switches (the building blocks of electronic computers) based on biological molecules. So far, progress is roughly equivalent to the construction of the first transistor in 1947. It did not look very impressive and could not do much on its own, but it was a step towards the construction of today's computers, which consist of chips containing millions of minuscule transistors. Tiny though these transistors are, however, they will not be able to go on getting smaller forever. Molecular computers, if they can be built, could be smaller and faster than their electronic counterparts. Molecular memory would make it possible to cram hitherto undreamt-of quantities of information into tiny volumes: a library in a sugar cube.

The future of computing lies in biology, and that of biology in computing. When this marriage is blessed with offspring in the form of a fully biological computer, will the people who build it be computer scientists or biologists? Soon there may be little difference.