Linked data initiatives aim to publish ever more RDF data sets on the Web and to set RDF links between data items from different sources. Such initiatives enable sophisticated search and query capabilities over crawled data. Nevertheless, the Web remains focused on the interchange of tagged documents, most of which are in HTML.
Semantic annotation consists of assigning to a document, or to parts of it, metadata whose semantics are defined in an ontology. When such semantic annotations are available, reasoning and ontology-based querying allow a better interpretation of document content. However, the number of documents may be huge, and manual annotation is time-consuming. Automating annotation is therefore a key factor for the future Web and for scaling it up.
The aim of the SHIRI project is to develop an unsupervised, ontolexical-based approach for annotating and querying tagged documents. These documents relate to a domain of interest, and their structure is heterogeneous. The proposed approach first extracts terms and named entities and associates them with ontology concepts. Then, a coarse-grained semantic annotation of the tagged elements of the documents is performed. The resulting annotations can then be queried using ontology-based user queries (possibly reformulated). SHIRI uses the W3C standard languages RDF(S) for resource representation and SPARQL for querying.
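
The pipeline described above (extract terms, associate them with ontology concepts, annotate tagged elements, then query) can be illustrated with a minimal sketch. It is not the SHIRI implementation. The lexicon, the `ex:` concept names, the `ex:mentions` property, the node identifiers and the helper names are all hypothetical, and a plain dictionary lookup stands in for term and named entity extraction. Annotations are kept as simple subject-predicate-object tuples rather than in a real RDF store.

```python
import re
from html.parser import HTMLParser

# Hypothetical lexicon mapping surface terms to ontology concepts.
# Both the terms and the "ex:" concept names are illustrative assumptions.
LEXICON = {
    "workshop": "ex:Event",
    "conference": "ex:Event",
    "paris": "ex:Location",
}


class ElementTextCollector(HTMLParser):
    """Collects the text directly contained in each tagged element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, node id) pairs of currently open elements
        self.texts = {}    # node id -> list of text fragments
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        self.counter += 1
        node_id = f"{tag}_{self.counter}"
        self.stack.append((tag, node_id))
        self.texts[node_id] = []

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner elements.
        while self.stack:
            open_tag, _ = self.stack.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.texts[self.stack[-1][1]].append(data)


def annotate(html):
    """Return coarse-grained annotations as (node, property, concept) triples."""
    collector = ElementTextCollector()
    collector.feed(html)
    collector.close()
    triples = set()
    for node_id, fragments in collector.texts.items():
        for word in re.findall(r"\w+", " ".join(fragments).lower()):
            concept = LEXICON.get(word)
            if concept is not None:
                # The annotation attaches to the whole element, not to the word.
                triples.add((node_id, "ex:mentions", concept))
    return triples


def nodes_mentioning(triples, concept):
    """Toy counterpart of a SPARQL query such as:
    SELECT ?node WHERE { ?node ex:mentions ex:Event }
    """
    return sorted(s for s, p, o in triples if p == "ex:mentions" and o == concept)
```

For example, calling `annotate` on `<div><p>The workshop takes place in Paris.</p><p>Registration is open.</p></div>` attaches both `ex:Event` and `ex:Location` to the first paragraph element and nothing to the second, so `nodes_mentioning(triples, "ex:Event")` returns only that first paragraph node. In a real setting the triples would be serialized as RDF and the lookup expressed in SPARQL.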