As promised, I would like to do a series on Data Mining and Analysis, drawing on my experience of working with large datasets for more than a decade now. It grew out of a series of informal talks I gave to a group of data enthusiasts at a local UNIX user group. Although I am writing this for a general audience, I won't gloss over mathematical concepts or dumb them down. I will try my level best to explain technical ideas using analogies and examples that are easy to understand, and for the technically oriented, I will link to the original research articles and studies for more detail. OK, let us start.
My Journey
My first brush with large datasets happened in the biological sciences. I had accumulated a wealth of experience in mathematical simulation and optimization prior to this, but it was very clear to me that the challenge at hand was going to be very different. Biology had only just begun to become a data-rich discipline, thanks to the Human Genome Project. Whole-genome chips were becoming easily accessible, and there was a race among developed nations to exploit these rapid developments. I found myself smack in the middle of one such ambitious project in the EU. Exciting times, really.

Great leaps in experimental capability often force an overhaul of existing mathematical constructs and the evolution of new ones to make sense of large volumes of data. Suddenly there was an exchange of ideas from image processing, pattern recognition, machine learning and other areas, which led to novel algorithms and toolkits to study the data and drive interpretation. New disciplines were born that took advantage of this crosstalk between fields, and that state of affairs has persisted ever since, which is a good thing. Whole new computing strategies were developed, and people started running their codes in parallel environments. Up until then, the supercomputers at our facility had primarily been used for crash simulations for automobile clients, who had to pay some pretty hefty premiums for the privilege. I was a complete novice at parallel computing at the time, so the prospect of learning to develop for a supercomputing environment was electrifying!
So here I was, faced with the possibility of analyzing time-series gene transcription data from Affymetrix whole-genome chips comprising a little more than 46,000 transcripts. The Affymetrix station came with a suite of tools for simple data analysis; anyone could do that. Other commercial software was also available at the time (Genedata, for instance), but they would charge your soul for a freaking license, so that was ruled out. R's Bioconductor had just started gaining traction; today it has grown enormously and offers a large set of packages for sophisticated analysis of a wide range of biological data, but back then it was in its nascent stages. You could still do a lot of statistical analysis with it, but we simply wanted to do more. So you see, my responsibility was to develop a set of methods and tools not just to statistically analyze the data, but also to capture interactions that could inform the design of molecular-level experiments and eventually pave the way for detailed dynamic modeling. Dynamics was our USP, and that is ultimately where we wanted to go.
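To give a flavour of what that baseline statistical analysis looks like, here is a minimal sketch in Python of a per-transcript differential expression test with a Benjamini-Hochberg correction. The numbers are simulated and the matrix shape, group sizes and cutoff are illustrative assumptions, not the actual data or pipeline from the project.

```python
import numpy as np
from scipy import stats

# Hypothetical transcripts-by-samples expression matrix: rows are transcripts,
# columns are samples (control group followed by treated group). Simulated data.
rng = np.random.default_rng(42)
n_transcripts, n_control, n_treated = 46000, 4, 4
expr = rng.normal(loc=8.0, scale=1.0, size=(n_transcripts, n_control + n_treated))
expr[:500, n_control:] += 1.5  # spike in a differential signal for the first 500 transcripts

control = expr[:, :n_control]
treated = expr[:, n_control:]

# Per-transcript two-sample Welch t-test across the sample axis.
t_stat, p_val = stats.ttest_ind(treated, control, axis=1, equal_var=False)

# Benjamini-Hochberg adjustment to control the false discovery rate.
order = np.argsort(p_val)
ranks = np.empty_like(order)
ranks[order] = np.arange(1, n_transcripts + 1)
q_val = np.minimum(1.0, p_val * n_transcripts / ranks)
# Enforce monotonicity of adjusted p-values along the sorted order.
q_val[order] = np.minimum.accumulate(q_val[order][::-1])[::-1]

significant = np.where(q_val < 0.05)[0]
print(f"{significant.size} transcripts called differentially expressed")
```

This is roughly what the off-the-shelf tools of the day automated for you; the interesting work started where a gene-by-gene test ends, in capturing the interactions between transcripts.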