The Pattern Hunters
20 September 2006
Discovering patterns is a quintessential part of what we call 'intelligence'.
In modern society we have turned this instinct into an industry, but now it is computers – not people – that look for patterns, because computers are less easily fooled by randomness and better able to sift through gigabytes of information in search of faint – but significant – statistical signals. The discovery of subtle relations in data has countless applications in science, business and technology. For example, it is a pattern-finding algorithm that prepares your recommendations for further reading after each visit to Amazon.com. Patterns in customers’ behaviour are analysed by marketing experts trying to discover a new niche or business opportunity, and election strategists follow much the same approach. Unique patterns in voices, irises and fingerprints are exploited by biometrics professionals to develop the next generation of identification systems – the one that will replace signatures and PIN codes.
But violations of patterns are as useful as patterns themselves. Anomalous behaviour on your account is flagged as suspicious to your credit card company, and the entire credit card system relies on pattern recognition software for fraud detection. It would also be much harder to make sense of the eight billion web pages currently available without specialised software capable of determining the relevance and similarity of those pages, based on statistical patterns in their text and link structure.
In a highly controversial development, governments have turned to pattern analysis to help monitor telecommunications, and to predict the risk level posed by individuals. In the past, attempts were made to predict our behaviour by analysing the bumps on our heads. Later, measurements of bodily proportions were used. Today it could be the patterns of our everyday transactions – all of which are, of course, recorded – that predict the level of risk we pose. One day, such patterns could be sought in our genes.
Scientific research itself seems to be undergoing a revolution, shifting from a hypothesis-driven paradigm to a data-driven one. It is now not uncommon to first gather massive amounts of data – more than any scientist could possibly look through – and then use computers to sift through it in search of interesting relations. This is how genomic projects are done, as well as surveys of the universe and some experiments in physics. Data-driven approaches are also becoming the norm in industrial applications, where massive amounts of data are systematically gathered (and traded) for later analysis by computers.
But despite holding such a strategic position in the modern information-technology society, for a long time pattern analysis remained more of an art than a science. Only recently has a unified theoretical framework started emerging, based on ideas from statistics, artificial intelligence and theoretical computer science. This has led to a new generation of pattern analysis algorithms, one based on mathematical principles rather than loose analogies with biological learning systems. Gone are the neural networks and evolutionary algorithms of the 1980s (as data analysis tools, of course, not as models to understand biology) and in are the new, statistics-based methods. They can already be found in spam filters, medical diagnosis systems, machine vision devices and a hundred other applications. A leap in performance has accompanied this transition.
The new challenges and great potential of pattern analysis are exemplified by the two most important data analysis tasks of this century: the analysis of web content and of genomic datasets. These two fields also represent the main thrust of the new pattern analysis group in Bristol. The web is an awesome repository of information. Buried in it are business leads that companies can exploit, strategic intelligence about competitors, sociological information about public opinion and attitudes to products or policies, and much more. The problem is how to extract this information.
Similarly, modern biology produces enormous quantities of data which are freely available over the internet. Hidden in those datasets are the answers to age-old questions of science (and philosophy), including information about the origin and evolution of modern life forms, as well as answers to pressing medical questions such as how we age. Yet the information is not readily accessible, and various strategies are being devised to extract it. Relations, similarities and anomalies across these various datasets are all of interest to biologists. A new generation of scientists, fluent in the languages of both mathematics and biology, is needed. For this reason the Engineering Mathematics Department has recently introduced a new course in Computational Genomic Algorithms.
At the beginning of the 21st century, the opportunities offered by this new technology are difficult to comprehend. Advances in this field could translate into benefits for science and society that are as yet undreamt of.