Semantic Search – Fact and Fiction
In two weeks (12th and 13th of November), the IKS project will hold its second workshop, focused on semantic search. IKS is an EU-funded initiative that aims to semantically enhance the next generation of European CMS/KMS.
A few weeks ago the IKS team asked me to moderate the Demonstration Session, which will take place on the second day of the event. I am far from being a “semantic expert”, and given the quality and in-depth knowledge of the speakers I do not pretend to have any special legitimacy for such a role. I accepted with enthusiasm nonetheless: for the last ten years the semantic world has evolved on a track parallel to the one followed by most “traditional CMS”, and I really think it now makes more and more sense to reconcile both worlds. Bringing CMS and semantic folks into the same room is already a good start to help foster communication and collaboration.
The Future of the Information Access Industry
As far as I am concerned, “Semantic Search” is a bit too narrow a label given the importance of the subject. A more general term such as “Information Access Technologies”, similar to the one used by Gartner but applied only to the next generation of semantically driven tools, would perhaps be more appropriate.
The Plate Tectonics of Web Evolution
At first glance the semantic world looks quite simple: after Web 2.0, which was all about harnessing collective intelligence (social media, social networks, …), the next step in the evolution of the Web should be Web 3.0 and the apogee of a semantic world of things.
However, even though this vision is already more than 10 years old, it will probably take another 10 years for the dream to come true. Some “gurus” are therefore already suggesting an intermediate, minor “web release”, the “Web Squared”, to help the transition to this new era of an intelligent internet.
Meanwhile we are also hearing more and more discussion about how best to apply the results of the Web 2.0 era to the search industry. New trends are emerging: social search helps determine the relevance of search results by considering the interactions or contributions of users, and real-time search offers new capabilities for finding information as it is produced.
Finally, user behaviour is also changing. Classical "search input forms" and other "advanced search forms" are now challenged by new mobile devices, mashups and various kinds of social networks. Search interfaces will therefore need to adjust accordingly and find new ways to ease ubiquitous and instant access to information.
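To make the social search idea a little more tangible, here is a minimal sketch in Python of one way a keyword relevance score could be blended with user interactions. Everything in it is an assumption made for illustration: the `Hit` fields, the `social_weight` parameter and the logarithmic boost are hypothetical and are not taken from any of the projects mentioned in this post.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    keyword_score: float  # relevance from a classic keyword engine (assumed 0..1)
    votes: int            # hypothetical count of user ratings, shares or bookmarks


def social_score(hit: Hit, social_weight: float = 0.5) -> float:
    """Boost the keyword score by a dampened function of user interactions."""
    # log1p dampens the boost so that popularity alone cannot dominate relevance.
    return hit.keyword_score * (1.0 + social_weight * math.log1p(hit.votes))


def rerank(hits: list[Hit], social_weight: float = 0.5) -> list[Hit]:
    """Return the hits ordered by the blended score, best first."""
    return sorted(hits, key=lambda h: social_score(h, social_weight), reverse=True)


hits = [
    Hit("Vendor white paper", keyword_score=0.9, votes=0),
    Hit("Community tutorial", keyword_score=0.7, votes=40),
]
for hit in rerank(hits):
    print(f"{hit.title}: {social_score(hit):.2f}")
```

With `social_weight` set to 0 the ranking falls back to the plain keyword order, which is a simple way to compare both behaviours on the same result list.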
(from: http://blogs.zdnet.com/Hinchcliffe/?p=1007)
Various underlying semantic search technologies
To complicate things a bit further, there is not one single technology that will improve the next generation of semantic search platforms, but many.
Improved querying methods on structured data (e.g. SPARQL) sit beside newer probabilistic models (e.g. LSA / PLSA). Human-driven classifications and taxonomies need to be taken into account by automatic meaning-based categorization. Traditional keyword-based searches need to leverage social ratings, sentiment analysis, folksonomies and other crowdsourcing algorithms. And this list does not even include natural language processing and other kinds of technological improvements.
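To illustrate what querying structured data means compared with keyword matching, here is a small sketch in Python. It is not SPARQL and does not use an RDF store: it only mimics, under simplified assumptions, the idea of matching patterns against (subject, predicate, object) statements. The sample data and the `match` and `subjects` helpers are hypothetical.

```python
from typing import Optional

Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical content metadata expressed as (subject, predicate, object).
TRIPLES: list[Triple] = [
    ("doc1", "author", "alice"),
    ("doc1", "topic", "semantic search"),
    ("doc2", "author", "bob"),
    ("doc2", "topic", "semantic search"),
    ("doc3", "author", "alice"),
    ("doc3", "topic", "web analytics"),
]


def match(triples: list[Triple], pattern: Pattern) -> list[Triple]:
    """Return the triples that agree with the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def subjects(triples: list[Triple], pattern: Pattern) -> set[str]:
    """Return the subjects of all triples matching the pattern."""
    return {s for s, _, _ in match(triples, pattern)}


# "Documents written by alice about semantic search":
# the intersection of two patterns that share the same subject.
by_alice = subjects(TRIPLES, (None, "author", "alice"))
about_search = subjects(TRIPLES, (None, "topic", "semantic search"))
print(sorted(by_alice & about_search))
```

A keyword engine would have to guess that "alice" is an author and not a word in the body text; here the relationship is explicit in the data, which is the main point of the structured approach.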
Some existing User Stories
All these technologies are basically trying to address specific user needs. The IKS community is trying to list them in a collection of User Stories, which can be found on the IKS wiki site: http://wiki.iks-project.eu/index.php/User-stories
This is a collaborative project so please feel free to add new ones if you think that some topics are missing.
Eleven semantic search demos
In two weeks we will have the opportunity to attend eleven demonstrations from various established or emerging projects, ranging from proprietary to open source solutions. They are quite representative of the different underlying semantic trends, which result in different ways of embracing and improving access to information.
Each project has introduced itself to the IKS community over the last few months. You can find all the details in the archive of the IKS public mailing list (http://lists.iks-project.eu/pipermail/iks-community/). For convenience, I have put together a short summary below.
Semantic Search Demonstrations
- Trialox (Reto Bachmann-Gmür – New Generation of Semantic RDF-driven WCM)
“I'm working with trialox (http://trialox.org/) a startup founded last year at the University of Zurich. We're working on open source software that makes it easy to develop semantic web enabled applications. Our system is based on OSGi technologies and support various RDF stores as backend.”
- Yahoo! Research (Peter Mika - Researcher and data architect at Yahoo!)
“I work as a researcher and data architect at Yahoo!, based in Barcelona, Spain. Our research lab is part of Yahoo! Research (http://research.yahoo.com). I'm working as a data architect on KR questions related to how we consume and use metadata inside Yahoo. As an example, many of you might have heard of SearchMonkey, which allows site owners and developers to create applications that change the way search results are presented, by using metadata associated with those pages.”
- salsaDev (Stéphane Gamard - Founder and CTO of salsaDev)
“salsaDev uses a technology emerged from language acquisition research at the Rensselaer Polytechnic Institute to index textual information at a conceptual level. Our approach to information access is not a replacement solution, but a high-value added feature: knowledge workers are provided with a sense-centric/meaning-aware access to their relevant content.”
“As part of the Scribo project (http://www.scribo.ws/), we are working on integrating semantic knowledge extractors to semi-automatically enrich the knowledge base with named entities and semantic relationship found in unstructured text content using UIMA components. We plan to integrate a CRFs-based Named Entities extractor trained on multilingual corpora such as wikipedia. CRFs are a machine learning algorithms to perform Natural Language Processing of token sequences.”
- Deri (Dr Axel Polleres - Digital Enterprise Research Institute, National University of Ireland, Galway)
“...This (http://srvgal65.deri.ie/files/iks_search_engine_cloud.pdf) is the architecture that DERI would like to suggest for the IKS Semantic Search Engine. The figure contains a set of CMS sites complying to the best practises of RDF data publishing, which include RDFa, a local schema export (site vocabulary), a SPARQL endpoint. We have worked on a set of modules for Drupal detailed in a technical report at [2], but their features could be generalized to other CMSs.”
- Kiwi (Rolf Sint - Researcher and developer at Salzburg Research)
“The KiWi-System aims to break system boundaries in that it serves as a platform for implementing and integrating many different kinds of social software services.[…] In KiWi the navigation and search of content is a key issue and is realized in several ways. One possibility to navigate within KiWi is a very flexible facetted search, which allows a dynamic configuration of the search facets.”
- Zemanta (Tomaž Šolc - Head of research at Zemanta)
“Zemanta's content suggestion system is the main product of our company - it takes a fragment of plain text as its input and provides images and articles related to the topic of the text as well as relevant tags and automatic explanatory in-text links.[…] From the perspective of semantic search, Zemanta is an interesting example of automatic semantic query construction by extracting key concepts from a longer piece of text. Since to some degree we use external third-party search APIs we also had to address the problem of how to construct traditional keyword queries from semantically annotated text.”
- Trezorix (Sander van der Meulen)
"Our main software product is the RNA Toolset, a semantic web based innovative tools for working with content, metadata and reference structures. The goal of the RNA Toolset is to create an open environment for knowledge workers to create and edit their content, and to enable the knowledge workers to publish the content to a semantically rich search environment.
The roadmap for the development of the RNA Toolset points to implementing a federated Sesame/OWLim RDF layer with RDFS and OWL support as the search platform. Currently we only have RDF configurations in our test environments. In our production environments we've successfully implemented Solr as the search platform, providing superb free text and facet searching. But the lack of relational constructs and inferencing capabilities in Solr force us to move to the richer RDF environment for more complex knowledge systems."
- Sourcesense (Tommaso Teofili - Software engineer)
“I started studying, using and then contributing to Apache UIMA [3] for my graduation thesis (since November 2008), then on August 2009 I gained the committership. At the moment the project is on his way towards the 2.3.0 release and possibly become an Apache TLP. During this period I realized some prototypes of applications using UIMA for semantic search, one of which I am going to show during the workshop.”
- Semantic Technology LAB (Alfio Massimiliano Gliozzo - Researcher)
"My main research topic is hybridizing Information Retrieval, Natural Language Processing and Machine learning approaches with knowledge management tools at scale. One of the applications I'm interested in is Knowledge Retrieval, which is about retrieving structured knowledge relevant for natural language queries. This task can be performed on large RDF/OWL knowledge base. I will present an application of knowledge retrieval. This is a semantic search engine working on an RDF/OWL ontology."
- Semantic MediaWiki (Tran Duc Thanh, Markus Kroetsch - Karlsruher Institut für Technologie)
"At AIFB (Karlsruhe Institute of Technology) I work on storage, query processing, query interface and ranking on integrated collections of structured (RDF) data and text (DB & IR). I will demonstrate the search solutions we have developed. One is a semantic search extension to SMW (http://semanticweb.org/wiki/Special:ATWSpecialSearch) that computes completions and translations of keywords. This results in expressive structured queries that can used to retrieve precise answers from semantic wiki. The other called the Information Workbench (http://iwb.fluidops.com/) supports the lifecycle of “interacting with data”, i.e. from data integration, to semantic search, data manipulation, presentation, visualization up to data publishing”.
Synthesis
There is no doubt that the future of information access will be driven by innovation. As usual, the future will certainly not be black or white but a mixture of Web 1.0 (keywords), Web 2.0 (social), Web Squared (environment / information shadows / real-time) and Web 3.0 (semantic) driven search technologies.
There will be plenty of opportunities for specialized niche players to make the most of one technology or another to better address specific use cases (a best-of-breed approach), as opposed to an all-in-one universal search platform.
For the CMS industry and its customers this will therefore be a challenging period, one that will require quite a lot of change management. For the moment the “information access” industry is characterized by a lack of focus on standardization and a preference for vendor lock-in strategies. I already wrote a blog post about it a few months ago. It won’t be possible for a CMS to integrate directly with the plethora of traditional keyword-driven search engines combined with the new semantic-driven search platforms. Better-defined standards will be needed to give CMSs access to abstracted interfaces, allowing them to (un)plug various open source or proprietary search engines more easily according to each customer's needs, without having to hardcode everything and stick to one single search engine.
Of course, there are already various attempts to abstract and unify access to search engines (e.g. the JSR283 Abstract Query Model and the more recent CMIS API, not counting probable future evolutions of OpenSearch, WebDAV Search, OASIS Search WS and probably a few others that I forgot to mention).
Let’s now see how all these information access technologies, their evolutions or revolutions, and the existing or new standards will adapt to the semantic world, and how all these pieces will fit together to help the next generation of CMS offer better solutions to their millions of end users.
Disclaimer: I have a personal relationship with salsaDev, as a board member.
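The kind of abstracted interface argued for above could look roughly like the following Python sketch. It does not implement JSR283, CMIS, OpenSearch or any other real standard: the `SearchBackend` contract, the two toy backends and the `Cms` class are all hypothetical names invented to show the (un)plug idea, and the "concept" backend is only a stand-in for a real semantic engine.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical contract a CMS could code against instead of one engine."""

    @abstractmethod
    def index(self, doc_id: str, text: str) -> None: ...

    @abstractmethod
    def search(self, query: str) -> list[str]: ...


class KeywordBackend(SearchBackend):
    """Toy in-memory engine: a document matches if it contains every query word."""

    def __init__(self) -> None:
        self._docs: dict[str, set[str]] = {}

    def index(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = set(text.lower().split())

    def search(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return sorted(d for d, tokens in self._docs.items() if words <= tokens)


class ConceptBackend(KeywordBackend):
    """Stand-in for a semantic engine: maps words to concepts before matching."""

    # Tiny hand-made concept table, purely illustrative.
    CONCEPTS = {"cms": "content-management", "wcm": "content-management"}

    def _normalize(self, text: str) -> str:
        return " ".join(self.CONCEPTS.get(w, w) for w in text.lower().split())

    def index(self, doc_id: str, text: str) -> None:
        super().index(doc_id, self._normalize(text))

    def search(self, query: str) -> list[str]:
        return super().search(self._normalize(query))


class Cms:
    """The CMS only knows the abstract contract, never a concrete engine."""

    def __init__(self, backend: SearchBackend) -> None:
        self.backend = backend

    def publish(self, doc_id: str, text: str) -> None:
        self.backend.index(doc_id, text)

    def find(self, query: str) -> list[str]:
        return self.backend.search(query)


for backend in (KeywordBackend(), ConceptBackend()):
    cms = Cms(backend)
    cms.publish("page1", "Open source WCM platform")
    print(type(backend).__name__, cms.find("cms platform"))
```

The CMS code is identical in both runs; only the object passed to the constructor changes, which is exactly the property that hardcoding a single engine takes away.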

http://www.scribd.com/doc/22361134/Semantic-Search-Demo-Booklet