Posts tagged · General Thoughts


This consumer’s perspective on the current state of the Semantic Web.


I spent the last few days in Boston, attending the 4th annual Conference on Semantics in Healthcare And Life Sciences (CSHALS) 2011. The ideas of the semantic web are really powerful – self-describing data, a promise of easy integration and reuse of data from any place and source in the world (wide web), a promise of a network of facts over which a machine can reason and infer implicit connections… The Semantic Web IS the future. In fact, even in my own research I have tapped into the paradigms of the semantic web – I have attempted to represent the human genome as a network of components, which would allow me to make logical statements about the genomic context of the elements in a specific individual. (NOTE: genes, just one class of genomic elements, make up less than 5% of the human genome! Our genome is destitute of genes, but overabundant in other elements that control when and how much of a specific combination of exons should be transcribed to produce the protein that is most useful to a cell at a point in time.)
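The "network of components" idea can be sketched with plain (subject, predicate, object) triples – no triple store needed for the intuition. All element names and predicates below are hypothetical illustrations, not my actual model:

```python
# Genomic elements as a network of (subject, predicate, object) statements
# over which simple logical queries can run. Names are invented examples.
triples = {
    ("exon_E1",     "part_of",    "gene_BRCA1"),
    ("promoter_P1", "regulates",  "gene_BRCA1"),
    ("enhancer_H1", "regulates",  "gene_BRCA1"),
    ("gene_BRCA1",  "located_on", "chr17"),
    ("repeat_Alu1", "located_on", "chr17"),
}

def objects(subject, predicate):
    """All objects linked to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def subjects(predicate, obj):
    """All subjects linked to `obj` via `predicate`."""
    return {s for s, p, o in triples if p == predicate and o == obj}

# "Genomic context" queries: which elements regulate the gene,
# and which chromosome does it sit on?
print(sorted(subjects("regulates", "gene_BRCA1")))  # ['enhancer_H1', 'promoter_P1']
print(objects("gene_BRCA1", "located_on"))          # {'chr17'}
```

The whole point of RDF is that the schema lives in the data itself: adding a new predicate requires no table migration, just another triple.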

“Promises, promises…” – that is how it turned out to be, and here is why:

  1. The software infrastructure was very tricky to assemble to meet even the most basic needs of my project, despite the published specs of the software products.
  2. The implementation of features promised in the standards lagged behind, or the software vendors plainly refused to include basic functionality such as x > y comparisons!
  3. On top of everything, the performance issues stemming from the fact that semantic web software is still very immature (unlike relational database software, which benefits from decades of optimizations…) rendered my product unusable when populated with just a fraction of the required data…
  4. I lost any hope of ever being able to use a reasoner to query my data, abandoned the purely semantic web, and went back to my trusty (and free) MySQL – without the ability to use a reasoner there was really very little incentive to continue the anguish described in the three previous points.
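For the curious: the "x > y comparison" in point 2 corresponds to a numeric FILTER in SPARQL. A pure-Python stand-in (over invented triples) shows what such a filter computes:

```python
# The SPARQL form of an x > y comparison would be something like:
#
#   SELECT ?gene ?len WHERE {
#       ?gene :length ?len .
#       FILTER (?len > 10000)
#   }
#
# Equivalent filter over hypothetical (subject, predicate, object) triples:
triples = [
    ("gene_A", "length", 2500),
    ("gene_B", "length", 48000),
    ("gene_C", "length", 12000),
]

long_genes = sorted(s for s, p, o in triples if p == "length" and o > 10000)
print(long_genes)  # ['gene_B', 'gene_C']
```

Trivial in any SQL dialect – which is exactly why a triple store refusing to support it was such a deal-breaker.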

The Semantic Web is the future, but unfortunately the future is still quite distant – that was my mindset over a year ago… CSHALS 2011 became an awesome opportunity for me to update that mindset, to learn about the cutting-edge developments in the world of the Semantic Web, to see where the field is heading, and to find out whether my projects could return to the world of the Semantic Web…

Indeed, I learned a lot. The conference started with a set of tutorials providing hands-on guidance for newcomers and veterans of the semantic web alike. We converted some tabular data into RDF, mashed up several resources, and even analyzed some microarray data to find genes involved in Alzheimer’s Disease, which were then integrated with the Gene Ontology and published using the DataPress plugin for WordPress for a cool visualization – all that BEFORE lunch! The tutorials were a lot of fun, but at the same time they were indicative that entry into the world of the semantic web is still not a smooth process for just anyone; the learning curve can be quite steep…
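A toy version of the "tabular data into RDF" exercise can be done in a few lines of stdlib Python. The column names, prefix, and predicates below are invented for illustration, not the tutorial's actual vocabulary:

```python
# Convert CSV rows into Turtle triples. All URIs/predicates are hypothetical.
import csv
import io

csv_text = """gene,chromosome,disease
APP,chr21,Alzheimer's Disease
APOE,chr19,Alzheimer's Disease
"""

prefix = "@prefix ex: <http://example.org/> .\n\n"
lines = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # One subject per row; each column becomes a predicate-object pair.
    lines.append(f"ex:{row['gene']} ex:locatedOn ex:{row['chromosome']} ;")
    lines.append(f"    ex:associatedWith \"{row['disease']}\" .")

turtle = prefix + "\n".join(lines)
print(turtle)
```

The mechanics are easy; the hard part (as the rest of this post argues) is agreeing on which vocabulary those predicates should come from.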

The following two days brought keynote speakers, tech demos, presentations, and, overall, some really interesting points that I will try to discuss in the following passages.

The very first (keynote) presentation of CSHALS, by Toby Segaran, a professional involved in data management issues for years, turned out to be a depressing confirmation that all along I had wanted too much from the semantic web. To summarize his talk, semantic representation is great for modest amounts of data connected by semantic relationships (as opposed to fitting the data into more rigidly structured relational databases). The coolest (and most needed in the field of life sciences) aspects of the semantic web – reasoning and inference – are, Toby stated, still somewhat of a “red herring”, just outside of our reach. And he knows this stuff. He is a data magnate who has dealt with software and web development for real-life data management issues, (now) working for “a search company in California”.

Toby’s talk was immediately followed by Chris Baker’s, the man behind SADI, a revolutionary framework for semantic annotation of software services. Thanks to SADI, a computer can potentially suggest how to proceed with your data – suggest a set of tools that will give you the output you are looking for! In contradiction to Toby, Chris demonstrated the power of reasoners and inference. Specifically, Chris was able to correct the mistakes of scientists who had annotated chemical compounds inaccurately. His reasoner analyzed a very complex ontology of classes of chemical compounds, along with axioms describing the relationships among these compounds. With some iterative ontology creation and validation behind him, Chris showed the ability of a machine to catch human mistakes. Why this disagreement between Toby and Chris? The problem is the size of the data. To be more relevant and useful, a semantic web reasoner needs to be able to deal with HUGE quantities of complex data. Chris’ ontology was relatively tiny – only a few hundred compounds were considered and reasoned over. I would like to see a reasoner that can handle MUCH bigger datasets. After all, if we are integrating/mashing up datasets, we need to be able to deal effectively with large quantities of data that become even more connected, i.e. larger. A domain with just a few hundred entities is a very rare case in Life Sciences today, especially now that scientists have all sorts of cool tools to generate hundreds of millions of high-throughput genomic reads per experiment… Food for thought: just one human genome has 20,000 protein-coding genes (each locus contains a set of exons which can be transcribed and spliced into different transcripts, a process governed by multiple promoters and enhancers elsewhere on the human DNA), some 4,000,000 distinct repetitive elements, and an even greater number of loci transcribed into non-protein-coding RNAs… Chris’ analysis caught four mistakes made by humans in the annotation of just a few hundred chemical compounds. I shudder to think how many mistakes have been made in the current annotation of the entire human genome.
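To make the idea concrete, here is a minimal, hand-rolled sketch of the kind of check a reasoner performs: given class assertions and a disjointness axiom (OWL's `owl:disjointWith`), flag annotations that cannot all be true. The compound names and classes are hypothetical, not from Chris Baker's actual ontology:

```python
# Entity -> set of asserted classes. compound_2 carries a human error:
# it is annotated with two classes declared mutually exclusive below.
assertions = {
    "compound_1": {"organic_acid"},
    "compound_2": {"organic_acid", "inorganic_salt"},
}
# Pairs of classes that no entity may belong to simultaneously.
disjoint = {("organic_acid", "inorganic_salt")}

def inconsistencies(assertions, disjoint):
    """Return entities whose class assertions violate a disjointness axiom."""
    bad = []
    for entity, classes in assertions.items():
        for a, b in disjoint:
            if a in classes and b in classes:
                bad.append(entity)
    return bad

print(inconsistencies(assertions, disjoint))  # ['compound_2']
```

A real reasoner does this over class hierarchies and property axioms rather than flat sets, which is precisely why the cost explodes with dataset size.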

Another contradiction that I found interesting was that between Lawrence Hunter, who works hands-on with ‘omic data, and Eric Neumann, a pioneer and evangelist of the Semantic Web who has worked in many of the interest groups that continue to shape where the semantic web is heading. Lawrence brought up my earlier point about the increasing volumes of data. Soon there will be more than one human genome, as sequencing is becoming so cheap that we are just this close to being able to sequence any genome for about $1,000 (down from the initial >$1,000,000). The problem is that our understanding of the human genome available today is VERY incomplete. We don’t even know the functions of all our 20,000 genes (which make up less than 5% of one human genome)! We, the scientists, simply NEED computers to help us sift through all the available data and be smart about the data that is already pouring into our labs. (In the lab where I work we can fill up 20 terabytes of space in a week with various sequencer data sets, and this doubles when one performs standard analyses such as genome mapping, annotation, etc.) I think the semantic web is crucial in facilitating the “smartness” of our data management.

Eric made a point that I thought was contrary to Lawrence’s and my own about the data volumes. Eric does not see, and is not worried by, the influx of data just yet; he thinks that the future (potential and distant) “data tsunami” will be conquered by data filtering, which would cleanse the data sets down to only the parts that are manageable, “curated” and “interesting”, presumably discarding the rest. This makes me think of the good old days (just a decade ago) when the human genome was made of interesting genes, and all the other “junk” was masked out (i.e. discarded) from analyses by the RepeatMasker tool… What if I would like to use the non-interesting parts of the data (i.e. the “junk”)? I mean, I have done so (sans the semantic web technologies, despite my sincerest attempts), with some interesting insights about the junk! Or, a more generic example: what if I, a scientist, want to use the semantic web and reasoners to actually help me with the filtering of data? Reasoners seem to be ideal for catching contradictions and illogicalities!

In all fairness, Eric brought up many other issues with the core of the semantic web, stating that in its current state the semantic web is unsustainable due to the inefficiency of the DNS queries needed to locate EVERY URI of EVERY data unit on the web… I thought Eric’s discussion of ontology vocabularies being abstract and detached from the tangibility of items was also very interesting. In fact, I would like to believe that our GELO ontology could be a solution that attaches reality to a concept in an ontology…

To make my long story shorter, my observations about the talks bring me to the following

Conclusions about the current state of the Semantic Web Technologies

  1. Software infrastructure assembly and maintenance requires a team of dedicated engineers:
    • At the current state of software, infrastructure, standards, etc., in the world of the semantic web, only big companies such as AstraZeneca or Novartis, or small-companies-which-will-be-bought-by-a-search-company-in-California, are capable of hiring a specialized team of knowledge engineers to deal with just that: engineering knowledge and integrating data, full time. From my own experience it is a full-time occupation: integrating data, keeping track of software updates, maintaining the semantic web software, and tinkering with it so it behaves. Unfortunately, for someone who needs to do research using an integrated resource of many big data sets while also maintaining the unwieldy semantic web aspects of that resource, it is much more productive NOT to use the semantic web at all.
    • Long-term benefits of RDF be damned! I’ll convert my relational database to the semantic web only when
      • I find a reliable, distributed, open-source, preferably free (I am a scientist on a budget) triple store
        • which can load 20+ billion triples
        • AND which can provide a transitive closure over all of them
        • AND maybe provide some of the custom indexing capabilities for which the xSQLs are known
      • OWL 3.1 or even 4.2 emerges and stabilizes enough for software developers to provide FULL feature-set implementations that support my uber triple store described above
      • and, finally, when I find some evidence that the semantic web is helpful to me rather than sabotaging my ability to graduate on time, apply for a grant on time, and deliver my query results this century (all of which I can accomplish using old-fashioned SQL)
    • This is a highly undesirable state, marking semantic web deployment as a costly and thus, in the case of many small labs, impractical luxury. I know of several labs that only recently moved from a spreadsheet approach to a relational database for data storage and management. They are amazed at how much easier certain tasks have become in the lab and how much more flexible and accessible their data has become. The tools to create and manage a relational database are plentiful, and easy to install and maintain. On the other side of the spectrum, the semantic web stores are expensive, and/or provided as quagmires of Java jar files that only the mightiest of hackers have the patience to assemble, set up, configure, program and potentially use (if everything works)… Then again, why should software developers create a usable set of tools if the standards are still uncertain and continue to change?
  2. The semantics (e.g. ontologies, and vocabularies) that are needed to describe the data are either non-existent or scattered throughout the world wide web, some very difficult to find.
    • It is easy to find information about a taxonomy of genes involved in a particular compartment of a cell or a chemical process. Paradoxically, trying to semantically represent the simple “round” shape of a lesion (which a doctor had written in the diagnosis of a patient’s tumor) becomes tricky, as linguists, philosophers and clinicians will all probably need to sit down first and establish what it means for a lesion to be “round” in the context of other lesions. Without a common and extensive vocabulary to be used as metadata, the Semantic Web technologies are crippled. The Semantic Web has proven useful as a solution to data integration problems; however, if two (or more) independent vocabularies are created by groups unaware of each other, the same dataset will potentially end up represented in several different ways. How are 3 different semantic representations of the same dataset really different from having 3 different relational models representing the same ideas? Currently, semantic middleware is created to integrate independent relational databases. Unless an accessible and well-advertised global metadata thesaurus is created, conceptual middleware will have to be created to bridge different semantic interpretations of the same concepts.
  3. I saved the best for last. There are some really cool cases for the use of reasoners out there.
    • The successful uses of reasoners seem to gravitate towards problems that are much easier to contain than the heavy-volume genomic data sets over which I’ve longed to reason.
    • SADI, which I mentioned earlier, can infer the series of well-defined steps required to produce an output from a given input data set. Any software developer can tap into the API and register their service for anyone to use!
    • WINGS, on the other hand, is a semantic workflow management framework that can provide a completely reproducible “methods” section of your publication (finally). What doesn’t WINGS do? The framework can optimize your workflow and split it into batches, with granularity and precision dependent on, for example, how much time is allowed for an analysis! And it plays nicely with GenePattern!
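The "transitive closure" wish in point 1 above can be sketched as a naive fixpoint computation over subclass edges – which also makes the scalability worry obvious: a triple store would have to do this over billions of edges. The class names are hypothetical:

```python
# Direct subclass edges (child, parent). Names are invented examples.
edges = {
    ("enzyme", "protein"),
    ("protein", "macromolecule"),
    ("macromolecule", "chemical_entity"),
}

def transitive_closure(edges):
    """Repeatedly join edges end-to-end until no new pairs appear."""
    closure = set(edges)
    while True:
        derived = {(a, d) for a, b in closure for c, d in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived

closure = transitive_closure(edges)
# The closure now contains implied facts absent from the input:
print(("enzyme", "chemical_entity") in closure)  # True
```

The naive join is O(n²) per pass; production stores use semi-naive evaluation and clever indexing, but the asymptotics are why "20+ billion triples AND closure over all of them" remains a tall order.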

The current development of semantic web technologies seems to be geared towards lowering web development costs and deploying flashy, gimmicky web applications. There are things on the horizon (BigCouch, Triple Map, Knowledge Explorer, the BEL and Genestruct framework, Gruff) that may make the semantic web slightly less of a luxurious nuisance and more of an actual aid in the day-to-day tasks of a working scientist. I am not too optimistic that the scalability/affordability of these products will make them usable for my own tasks within a year; the next decade, however, should definitely render the Semantic Web a necessity ;-).

Looking forward to CSHALS 2012!!!

There, I’ve said it !






It’s been a very busy couple of days; 2010 is, clearly, picking up… Finally, however, this morning I submitted some new compositions to the copyright office and posted some new music online – and that’s even before I had time to think that I don’t have time to have breakfast 😉


It’s already been almost a whole week in 2010…


It’s only been a couple of months since I published my first CD, and I’ve already got over 300 fans on Facebook, and thousands of people have listened to my music according to YouTube! By the end of 2010 I modestly hope to paste another 0 at the end of the current numbers 🙂


shpakOO on his arrival on iTunes


Yesterday evening I typed shpakOO into the iTunes search box – quite accidentally, since I thought I was typing my login information somewhere else – and… my heart jumped a little: “It’s time…” by yours truly appeared in iTunes! I mean, it was bound to happen eventually :-P – according to CDBaby, within a few months… but it’s only been a few days 🙂

New look, new name


As much as I love my web host, 1and1, their selection of WordPress themes was somewhat depressing… In their defense, however, I don’t think “themes” are really something they should worry about in the first place…
