Semantic Web

From Citizendium, the Citizens' Compendium

Overview

The Semantic Web is a concept, first named by Tim Berners-Lee, for a "web of knowledge" in which data on the world wide web, whether in structured data stores or loosely-structured documents, would be annotated and classified so that machines can access and infer relationships based on the semantic information - that is, what the content means - rather than simply on the matching of text strings.[1] There is also a W3C standards effort[2] related to this concept.

In order to associate meaning with content, the Semantic Web uses structures for identifying, categorizing and linking data. While a web page about soccer might specify how pictures and text should be arranged, what colors and fonts to use, and other presentation data, a similar Semantic Web document would convey the fact that the data pertains to the sport of soccer, perhaps with a list of teams, scores of recent matches, and other data in categorization containers. This representation allows other consumers of the data (mainly programs) to parse and use it in meaningful ways. Unlike current web crawlers, which must catalogue, index, and apply a certain amount of artificial intelligence to derive the meaning of documents on the web, the Semantic Web allows data to be parsed easily for meaning, ultimately resulting in a greater ability to share and discover information.

One interesting challenge facing the Semantic Web is the need not only to transmit data but also to associate metadata with it. Metadata is descriptive information that conveys relationships between data types. To provide a flexible framework capable of transmitting many different types of data, along with the meaning and relationships of that data, the Semantic Web integrates metadata into the format itself. Consumers can then use the embedded metadata to parse and understand data and its inter-relationships, even when the formats and types are dynamic and unpredictable.[3]

What differentiates the Semantic Web from existing data exchange formats is the use of URIs to uniquely identify things and the relationships between things. The problems that Semantic Web technologies try to solve are those involving multiple disparate sources of data: for instance, hooking together train timetables and class timetables, so that a student can automatically plan a travel itinerary without having to match the data manually.

The W3C have put forward a variety of standards built on top of the Resource Description Framework (RDF), a formal semantic model for representing things and the relationships between them.

Competing Visions

The "Semantic Web" concept has evolved under competing understandings and visions. Historically, efforts in Artificial Intelligence, notably Cyc and the Knowledge Interchange Format (KIF), sought to provide a technological backbone for a grand vision of a universal knowledge store enabling intelligent agents to apply human reasoning. Apple's Knowledge Navigator represented a vision of networked hypertext with intelligent agent mediators.[4] Very similarly, the Semantic Web was conceived with these goals in mind, but by embedding and extending existing technologies in the WWW stack while giving a formal and standardized structure to the relationships and means of data exchange of arbitrary data exposed on the web.[5]

These quite distinct provenances have confused a common understanding of the scope and goals of a "Semantic Web." On the one hand, the Semantic Web aims to create a machine-readable web through the coordinated linking of data and knowledge on a massive scale, such that intelligent agents could be devised to provide precise answers to and analysis of queries of arbitrary depth and nuance. On the other hand, it also seeks to improve human interaction and traditional web querying and information retrieval by giving a more formal structure to the web. It aims to do so by establishing connections in an incremental fashion among individual pieces of data both embedded in documents and realized as micro-transactions of activity that are conventionally stored in relational databases. In this way success is defined as the improvement of social welfare through a superior user experience of day-to-day web activities.[6]

Under the latter perspective, the Semantic Web was developed to meet a specific deficiency in web-based communications and is often referred to as Web 3.0.[7] Although well defined in RFCs, HTML is designed to exchange information that is delimited and optimized for presentation. That is, HTML is designed to communicate the appearance of documents within web browsers. This is useful when attempting to create a document that will render in the same form across multiple platforms (or web browsers) but is problematic for transmitting the meaning of data. There are a few HTML features (notably META tags and other document head elements[8]) that convey meaning, but these are precious few.

In this way the Semantic Web is closely tied to microformats which are an alternative way to embed meaning into HTML documents. Microformats use standard HTML tags along with generally agreed upon conventions for attributes, in order to delineate certain data within documents. For instance, microformats can be used to embed contact data or calendar data in web pages for easy integration with other programs. This can allow users of popular calendaring or contact management software to simply click on elements within web pages and import calendar events, or contacts, directly into their calendaring or address book software.[9]
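
As a rough sketch of how a program might consume such markup, the Python example below scans a page for hCard-style class names (vcard, fn, tel). The page content, the name and the phone number are invented, and the HCardReader class is hypothetical: it ignores nesting and most of the real microformats parsing rules, so it should be read as an illustration of the idea rather than a usable parser.

```python
from html.parser import HTMLParser

# Invented sample page using hCard-style class names.
PAGE = """
<div class="vcard">
  <span class="fn">Ada Example</span>
  <span class="tel">555-0100</span>
</div>
"""


class HCardReader(HTMLParser):
    """Collects the text of elements whose class names a contact field."""

    FIELDS = ("fn", "tel", "email", "org")

    def __init__(self):
        super().__init__()
        self.card = {}
        self.current = None  # field whose text content we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Remember the first recognised field name on this element, if any.
        self.current = next((f for f in self.FIELDS if f in classes), None)

    def handle_data(self, data):
        if self.current and data.strip():
            self.card[self.current] = data.strip()
            self.current = None


reader = HCardReader()
reader.feed(PAGE)
print(reader.card)
```

The same HTML still renders normally in a browser; the class names only add a layer that a contact manager or similar program can pick up.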

There is a final perspective that focuses less on the machine-readable component of Semantic Web (linking data in terms of relationships) than the universal metadata cataloging and tagging of existing documents and data for human consumption. This perspective has received less attention, especially as advanced indexing and search tools - both with Google on the web as a whole and in individual curated collections - have largely addressed these needs.

Recent efforts have focused on developing a network of data through shared standards and ontologies, and have largely been developed within loosely-connected "islands" of domains that have been individually-designed with an understanding of the problems to be solved.[10] These types of systems have also been referred to as Cooperative Information Systems.[6] By adopting domain-invariant standards, however, this is perhaps done with the tacit understanding of contributing to a more universal "web of knowledge".

Linked Data

"Linked Data" is a term coined by Tim Berners-Lee to describe the way in which a "Giant Global Graph" of semantic data expressed as triples serves as the core of the Semantic Web. "Linked Data" connotes the shared idea of both exposing data in a standard format (RDF) and establishing individual links between these data. As a separate term it narrows the focus of the Semantic Web from an abstraction to actual data linked between arbitrary things, which are identified by URIs and described by RDF.

Because data are identified, modelled, described and linked in a formal standard, linked data by itself permits browsing, searching and combining different sources and domains of data. Machine crawlers and indexers can be applied to the graph data in a tractable way, and applications can solve sophisticated problems by using the data and its relationships. Humans can interact with the data by browsing and by structured querying via SPARQL and related interfaces (e.g. facets). All this can, and perhaps must, evolve before the Semantic Web enables more advanced agent intelligence.

Semantic Web Technologies

The stack of technologies comprising the Semantic Web infrastructure is largely standard and mature. HTTP URIs identify concepts and objects, RDF (Resource Description Framework) describes a data model, RDF Schema defines vocabularies on RDF elements, OWL expresses ontologies, and SPARQL permits queries over the resultant graph data.

Triplestore

The triple is the data convention used by the Semantic Web and RDF to relate objects and meaning; a database built to store and query triples is called a triplestore. A triple is a rather simple linguistic convention that makes it easy to classify data and make connections. It takes the form "Subject" - "Predicate" - "Object". For example:

Garden location Backyard
Firstrow location Garden
Firstrow plantedWith Beets
Firstrow plantedWith Carrots

Using this standard convention it is easy to catalogue data and to trace relationships between them. For instance, using the above example I can figure out what is planted in the first row of the garden in the backyard by tracing the relationships:

?Garden location Backyard -> finds the Garden I'm looking for
?Firstrow location Garden -> finds the row in the Garden just retrieved
Firstrow plantedWith ?Veggie -> gets the vegetables planted in the first row

This rather simple model makes it possible to define (and query) complex relationships without first having a defined data model. This convention gives semantic web the adaptability to handle evolving dynamic data without constraining that data. This also means that the model doesn't have to be redefined to deal with emerging data types.
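
The tracing steps above can be sketched in Python. This is a toy illustration, not how a real triplestore is implemented: the triples are held in a plain list, and the hypothetical match function treats None as a variable (such as ?Veggie) that matches anything.

```python
# The four example triples from the garden example.
triples = [
    ("Garden", "location", "Backyard"),
    ("Firstrow", "location", "Garden"),
    ("Firstrow", "plantedWith", "Beets"),
    ("Firstrow", "plantedWith", "Carrots"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the given pattern.

    None plays the role of a variable: it matches any value.
    """
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Step 1: what is located in the Backyard?
gardens = [s for s, _, _ in match(predicate="location", obj="Backyard")]

# Step 2: which rows are located in one of those gardens?
rows = [s for g in gardens for s, _, _ in match(predicate="location", obj=g)]

# Step 3: what is planted in those rows?
veggies = [o for r in rows for _, _, o in match(subject=r, predicate="plantedWith")]

print(veggies)  # ['Beets', 'Carrots']
```

Adding a new kind of statement, for example ("Firstrow", "wateredOn", "Monday"), needs no change to the list structure or to match, which is the flexibility the paragraph above describes.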

Triples can be combined into complex graphs of data. RDF graphs can be serialized in several syntaxes, including RDF/XML and N-Triples, a line-based plain-text format that writes one complete triple per line and is convenient for transmitting data across the network. N-Triples contain redundancy, however, because every URI is written out in full on every line, so it is common to use a more compact notation such as Notation3 (N3), which shortens the data with namespace prefixes and by grouping statements that share a subject.
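
The difference between a fully spelled-out serialization and a compact one can be sketched as follows. The namespace http://example.org/garden# and the two helper functions are assumptions made for this illustration; the first function writes one complete statement per line in the style of N-Triples, and the second uses the prefix and grouping abbreviations found in N3 and Turtle. Neither handles literals, escaping or blank nodes.

```python
from collections import defaultdict

BASE = "http://example.org/garden#"  # hypothetical namespace

triples = [
    ("Garden", "location", "Backyard"),
    ("Firstrow", "location", "Garden"),
    ("Firstrow", "plantedWith", "Beets"),
    ("Firstrow", "plantedWith", "Carrots"),
]


def to_ntriples(triples):
    """One full triple per line, every URI written out in full."""
    return "\n".join(
        f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> ." for s, p, o in triples
    )


def to_compact(triples, prefix="ex"):
    """Prefix the namespace once and group statements by subject."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    lines = [f"@prefix {prefix}: <{BASE}> ."]
    for s, preds in grouped.items():
        parts = [
            f"{prefix}:{p} " + " , ".join(f"{prefix}:{o}" for o in objs)
            for p, objs in preds.items()
        ]
        # ";" separates predicates of one subject, "," separates objects.
        lines.append(f"{prefix}:{s} " + " ;\n    ".join(parts) + " .")
    return "\n".join(lines)


long_form = to_ntriples(triples)
short_form = to_compact(triples)
print(long_form)
print()
print(short_form)
```

Both outputs describe the same four statements; the compact form is shorter because the namespace and the repeated subject are each written only once.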

RDFa

Although RDF is compact, it is not easily human-readable. RDFa is a response to the disparity of data presentations between XHTML and RDF. RDFa allows RDF data to be embedded in XHTML content. Using standard XHTML tags like the <span> tag, Semantic Web data can be mixed into XHTML presentation. For example:

<span 
    xmlns:example="http://example.tld/example/0.a" about="#XHTML" 
    typeof="example:Technology" property="example:name" 
    rel="example:supports" resource="#RDFa">XHTML supports RDFa
</span>

This example demonstrates the basics of the RDFa representation of triples. The "xmlns" attribute declares the namespace (for the subjects, predicates, and objects) used for the triple. This namespace URI is abbreviated by the prefix "example:", forming a Compact URI (or CURIE), which is used in the typeof, property and rel attributes. CURIEs allow developers to refer to arbitrarily long URIs without having to type out the entire URI on each reference.

The subject in the above example is XHTML, which is a type of "Technology" defined in our namespace. Technologies have a "name" property, named by the "property" attribute; its value is the text content of the element. We're forming the triple "XHTML supports RDFa", so the predicate is "supports" and is defined by the "rel" attribute. "Supports" is also prefixed by a CURIE, to indicate that it is a defined predicate in our namespace. Finally, the object is specified in the "resource" attribute. This attribute does not use a CURIE, and assumes that the RDFa object was defined elsewhere in the data (as designated by the hash symbol, which is also used for relative linking within HTML pages).
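
To make the mapping from attributes to triples concrete, here is a deliberately simplified Python sketch that reads the span above. The TinyRDFaReader class is hypothetical and is not a conforming RDFa processor: it looks only at single elements carrying an "about" attribute, expands prefixes by plain concatenation, and takes the literal value of "property" from the element's text content.

```python
from html.parser import HTMLParser

SNIPPET = """
<span
    xmlns:example="http://example.tld/example/0.a" about="#XHTML"
    typeof="example:Technology" property="example:name"
    rel="example:supports" resource="#RDFa">XHTML supports RDFa
</span>
"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TinyRDFaReader(HTMLParser):
    """Collects triples from single elements; NOT a conforming RDFa parser."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.pending = None  # (subject, property) waiting for text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Namespace declarations map a prefix to a full URI.
        prefixes = {
            name.split(":", 1)[1]: value
            for name, value in attrs.items()
            if name.startswith("xmlns:")
        }

        def expand(curie):
            # Expansion is plain concatenation of namespace URI and local name.
            prefix, _, local = curie.partition(":")
            return prefixes.get(prefix, prefix + ":") + local

        subject = attrs.get("about")
        if subject is None:
            return
        if "typeof" in attrs:
            self.triples.append((subject, RDF_TYPE, expand(attrs["typeof"])))
        if "rel" in attrs and "resource" in attrs:
            self.triples.append((subject, expand(attrs["rel"]), attrs["resource"]))
        if "property" in attrs:
            self.pending = (subject, expand(attrs["property"]))

    def handle_data(self, data):
        if self.pending and data.strip():
            subject, prop = self.pending
            self.triples.append((subject, prop, data.strip()))
            self.pending = None


reader = TinyRDFaReader()
reader.feed(SNIPPET)
for triple in reader.triples:
    print(triple)
```

The sketch yields three triples for the one span: a type statement, the "supports" link to #RDFa, and the "name" literal taken from the element text.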

Turtle

Turtle, or Terse RDF Triple Language, is another W3C standard notation for expressing triples in Semantic Web technologies.[11]

RDF Schema

The RDF data model makes no assumptions on the vocabularies used to describe object properties. RDF Schema (RDFS) allows for the definition of these vocabularies. The W3C specification defines a vocabulary description language for RDF for this purpose and contributes some high-level RDF vocabularies that may be shared across domains.[12]

In RDFS, Classes and Properties distinguish types of objects from specific, individual objects. Classes form hierarchies of types, and the objects belonging to a class are referred to as its instances. Membership is specified using the rdf:type predicate. The use of classes in RDFS imposes restrictions on what kinds of statements can be made about objects.

For example, say that there are some RDF records describing employees, and every employee has a year of birth. The "when-born" property might expect a person as the subject and a date as the object. In this example, an employee is a type of person, a date is the range (rdfs:range) of the property and person is the domain (rdfs:domain). Properties may also form hierarchies; for example, "year-born" may be a subproperty of "when-born". The RDF/XML for this example could be:

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

<rdfs:Class rdf:about="#person" />

<rdfs:Class rdf:about="#employee">
  <rdfs:subClassOf rdf:resource="#person" />
</rdfs:Class>

<rdf:Property rdf:ID="year-born">
  <rdfs:subPropertyOf rdf:resource="#when-born"/>
  <!-- An example datatype from XML Schema: -->
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"/>
</rdf:Property>

<rdf:Property rdf:ID="when-born">
  <rdfs:domain rdf:resource="#person" />
  <!-- For simplicity, dates are XML literals in this example: -->
  <rdfs:range rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"/>
</rdf:Property>

</rdf:RDF>

As shown in this example, datatypes, such as integers, may be expressed in RDF Schema, although the W3C specification for RDFS does not define any.
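
What such a schema buys a consumer can be sketched with a toy inference loop in Python. This is an assumption-laden illustration rather than a full RDFS reasoner: it applies only three rules (subclass membership, subproperty, and domain), uses short strings in place of URIs, and adds one invented record about a hypothetical employee "alice".

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"

# Schema statements from the example, plus one hypothetical record.
facts = {
    ("employee", SUBCLASS, "person"),
    ("year-born", SUBPROP, "when-born"),
    ("when-born", DOMAIN, "person"),
    ("alice", RDF_TYPE, "employee"),
    ("alice", "year-born", "1980"),
}


def infer(facts):
    """Apply three RDFS-style rules until no new triples appear."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                # Instances of a subclass are instances of the superclass.
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, RDF_TYPE, o2))
                # A statement made with a subproperty also holds
                # for its superproperty.
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # The subject of a property belongs to the property's domain.
                if p2 == DOMAIN and p == s2:
                    new.add((s, RDF_TYPE, o2))
        if new <= known:
            return known
        known |= new


closure = infer(facts)
print(("alice", RDF_TYPE, "person") in closure)    # True
print(("alice", "when-born", "1980") in closure)   # True
```

Neither of the two printed statements was asserted directly; both follow from the schema, which is why a query for all persons would also find employees.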

OWL

OWL stands for the Web Ontology Language. It allows users to specify a set of rules which apply to RDF triples in a richer way than RDF Schema, so that formal ontologies may be defined.

History

In 2004, the first version, OWL, was standardized by the W3C Web Ontology Working Group as part of the W3C's Semantic Web Activity.

In 2009, a second version, OWL 2, was standardized by the W3C OWL Working Group, similarly as part of the Semantic Web Activity. One of the main changes was the introduction of profiles, intended to improve scalability.

Ontology

Ontology is a model for describing sets of types, properties, and relationship types. In computer science it is a language to formally describe the meaning of terminology in web documents.

For OWL purposes, the user defines a set of axioms which place constraints on individuals (belonging to classes) and the types of relationships permitted amongst them.

OWL

Sublanguages

OWL Lite - This version is easier for frame based tools to transition to and easier for reasoning.

OWL DL - This sublanguage corresponds to a description logic and supports decidable reasoning.

OWL Full - This language is an extension of RDF. It allows for classes as instances and modification of RDF and OWL vocabularies.

Syntax

OWL supports two types of syntax - Abstract and RDF/XML.

OWL 2

Profiles

OWL 2 EL - This profile is useful for applications which contain large numbers of properties and classes. This will only support logic with existential quantification.

OWL 2 QL - This is used for applications which use large volumes of instance data. Here, query answering is the most important reasoning task. Queries are rewritten into a standard relational Query Language (QL).

OWL 2 RL - This is best used for applications requiring scalable reasoning. Reasoning is done using a standard Rule Language (RL).

Syntax

OWL2 supports 5 types of syntax:

•Functional-Style

•RDF/XML

•Turtle

•Manchester

•OWL/XML

Semantics

OWL2 supports two types of semantics:

•OWL 2 Direct Semantics

•OWL 2 RDF-Based Semantics

SPARQL

SPARQL stands for SPARQL Protocol and RDF Query Language.[13] SPARQL is a syntactically SQL-like language for querying RDF graphs through pattern matching. SPARQL supports triple patterns, conjunctions, disjunctions, and optional patterns. The query results may be either result sets or RDF graphs. Though similar to SQL, the original SPARQL query language does not support the commands INSERT, UPDATE, or DELETE; update operations were added later in a separate specification, SPARQL 1.1 Update.

Most SPARQL queries contain a set of triple patterns called a basic graph pattern. Triple patterns are like RDF triples, except that the subject, predicate and object may each be a variable. A basic graph pattern matches a subgraph of the RDF data when its variables can be replaced by terms from that subgraph.
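
The idea of matching a basic graph pattern can be sketched in Python without any SPARQL engine. The code below is a toy under stated assumptions: terms are plain strings, names starting with "?" are variables, and the hypothetical solve function joins the patterns one after another, which is far simpler than what real query engines do. It reuses the garden triples from the Triplestore section.

```python
triples = [
    ("Garden", "location", "Backyard"),
    ("Firstrow", "location", "Garden"),
    ("Firstrow", "plantedWith", "Beets"),
    ("Firstrow", "plantedWith", "Carrots"),
]


def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend a binding so that the pattern equals the triple; None on failure."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None
    return binding


def solve(patterns, data):
    """Return every variable binding that satisfies all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in data
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return solutions


# A basic graph pattern: three triple patterns sharing variables.
query = [
    ("?garden", "location", "Backyard"),
    ("?row", "location", "?garden"),
    ("?row", "plantedWith", "?veggie"),
]

for solution in solve(query, triples):
    print(solution["?row"], solution["?veggie"])
```

Each solution corresponds to one matching subgraph; here there are two, one for each vegetable planted in the first row.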

Programming with Semantic Web

Because RDF is an open format, libraries exist for almost every programming language to make it easy for programmers to produce and consume RDF data. Some examples include the RDF.rb[14] library for Ruby and a PEAR[15] RDF package for PHP. ARC is another PHP project that enables RDF and SPARQL support for Linux/Apache/MySQL/PHP (LAMP) stack servers. Java has a number of notable packages that support Semantic Web development, including JRDF[16] and Sesame.

Domain-specific semantic models

Medicine

Semantic models seem to be a major trend in expert support for medicine. As an example of how semantic methodologies are used, consider several isolated concepts, which could be considered "nouns":

One of the notations for relationships is the Unified Medical Language System® (UMLS®). Informally, some of the "verb" semantic relationships among the above could be:

  • beta-adrenergic antagonists TREAT hypertension and benign hand tremor
  • beta-adrenergic antagonists CAUSE bradycardia
  • beta-adrenergic antagonists TRIGGER asthma

"Hypertension" would have a number of other TREAT relations, from drug classes such as thiazide diuretics, angiotensin-converting enzyme (ACE) inhibitors, calcium channel blockers, angiotensin-II receptor blockers, etc.

UMLS is now being extended with formal ontologies.[17]

Semantic Web in CMS

Content management systems (CMS) can benefit greatly from RDF features. RDF is an expressive means by which CMS can both publish and consume data. Because RDF makes data more easily machine readable it is perfect for systems that integrate data (such as CMS).

Drupal

The Drupal content management system is making a big push to include RDF and Semantic Web support as part of the upcoming Drupal 7 release.[18] There is a Drupal group devoted to the Semantic Web as well as a code sprint devoted to the topic. Drupal 7 will automatically include RDFa elements in page presentation. This will mean that new Drupal 7 sites will include RDFa data without any additional overhead, coding, or administration by site administrators. This feature will allow site users to leverage RDFa seamlessly. With a significant and growing share of the CMS market, Drupal's support of the Semantic Web will mean a vast increase in the implementation of RDF.[19] Prior to the release of Drupal 7, Drupal 6 can support RDFa with the use of the RDF module, although this module is currently only in a pre-release phase. Closely related is the SPARQL module, which supports SPARQL queries within Drupal, and which is also in pre-release status.

WordPress

WordPress has several third-party plugins that implement RDF.[20]

MediaWiki

MediaWiki has the Semantic MediaWiki extension to integrate the Semantic Web in a wiki setting.

Other Notable Uses

The BBC made heavy use of semantic web technologies for their internet coverage of the 2010 World Cup games.[21]

Facebook recently announced support for the Open Graph Protocol, which is an RDFa-based implementation of Semantic Web ideas. The Open Graph Protocol has supporting libraries for Java, PHP, Perl, Python, Ruby and JavaScript. Although developed by Facebook, the Open Graph Protocol defines namespaces, including People, Places, Entertainment (movies and music), Groups, and other categories, that can be easily used in a wide range of presentations.

Google has announced support for "Rich Snippets" which appear as summary data in search results (for things like customer reviews, map location, etc.) utilizing RDFa. [22]

DBpedia is a project designed to extract structured data from Wikipedia and expose it as linked data. It is notable for being one of the initial large-scale datastores exposed on the Semantic Web and defining generalized ontologies of data from a variety of domains.

Freebase is a general-purpose crowd-sourced data store that exposes RDF views on some of their data. Freebase Gridworks is an open-source application that utilizes this to aid in cleaning data and linking local, static data to data exposed as RDF on Freebase. It is also a publishing tool for local data; by defining schemas on their data using Freebase URIs users can upload linked data directly to Freebase.

MIT Simile project produces a number of tools that support Semantic Web.

Issues and Criticism

For the Semantic Web to reach its potential, it must overcome a number of technical and social hurdles:

Killer app: Whereas the World Wide Web was popularized by the web browser, no "killer app" has emerged that allows everyone to leverage the power of the Semantic Web easily. Such a catalyst may be necessary for adoption.

Consistency: Links between data must be consistent; that is, they must convey information that doesn't conflict and must use the same naming standards. This requires many parties to repeat the same information. Databases like DBpedia that are sourced from a multitude of public contributors are certain to contain inconsistent information, although the extent varies widely by domain. Further, these sources are less likely to have the requisite formal structure to fully expose them to the broader web.[23]

Completeness: One-way assertions make inferring relationships more difficult and error-prone and two-way browsing impossible.[5] Completeness also requires multiple copies of the same data, rendering inefficient storage of the data.

Privacy: Exposing personal data on an open semantic web, possibly without one's knowledge or consent, may reveal more information than the originator wished to share by doing so. The advent of popular applications and platforms adopting semantic web technologies compounds this already growing concern.

Quality: High standards of data quality must be maintained for the information conveyed therein to be useful and robust in the face of ambiguity and spoofing. It is not a necessary consequence of Semantic Web technologies that it would be easy for semantic web users to be able to discern the quality of the data being used, either explicitly or implicitly.

Precision: Because the Semantic Web purposefully aims to capture the broadest aspects of human knowledge, it is difficult to establish meaning from human-oriented abstractions.[10] For instance, when someone's opinion is presented on the Semantic Web as unvarnished data ("X is a good person"; "Y is a funny movie"), one runs into the problem of interpreting the meaning of those statements: what makes someone good or something funny? Much of the embedded semantic knowledge on the web is not apolitical, and it would be difficult to derive meaning from it automatically. It does enable, however, a more limited use of the data if the query seeks to extract objective information, such as the percentage of reviewers who thought movie Y was funny. For more universal uses, however, the widespread adoption of precise and well-defined ontologies is required.

Domain-Transferability: The same concept in two domains can represent different information that a machine would not differentiate.[10] For example, the concept of "cost" could represent either a budgetary amount or a moral abstraction, as in "the cost of war."

Ambiguity: Much of the knowledge on the web loses information when transformed from its wider setting into a simple subject-predicate-object triple representation. Building context-aware representations of data requires transferring implicit domain knowledge in a way that isn't obvious from standard constructions.

Trust: Related to quality, inferring relationships within and across domains requires trust in the sources of those data. Robust systems need to be developed to cope with the inevitable noise and spoofing should a generally useful Semantic Web materialize. There is also a distinction to be made between "material" and "intellectual" trust; for example, information that is naturally quantifiable, like prices, is more tractable than assertions or distillations of facts made by organizations and individuals.[10] Addressing this issue requires intelligent agents working in coordination with the addition of yet-to-be-standardized mechanisms within the Semantic Web stack of technologies.[24]

User Cognition: Adoption of Semantic Web concepts and technical constructs has been slow to develop, but has quickened over the last few years. Nevertheless, an additional burden is placed on individuals, both technical and non-technical, who wish to contribute meaningfully to the semantic information embedded in their web content.[25] Tools can ease the transition; however, merely the act of explicating these relationships can be a complex and subtle task that requires learning, in detail, their domain-specific representations. Well-meaning individuals may unwittingly attach false or ambiguous claims to their data.

Altruism: Semantic Web adoption itself relies in large part on the altruistic spirit of individuals and organizations. Early adopters are unlikely to see any immediate gains from open semantic publishing. Due to the scope of the project and the difficulties mentioned, attaining a "critical mass" of responsible contributors may prove impossible, preventing the realization of a vision that approaches the expectations of the Artificial Intelligence field.

References

  1. The Semantic Web, Scientific American Magazine, 2001
  2. W3C Semantic Web Frequently Asked Questions. W3C (2010). Retrieved on 2010-07-11.
  3. Segaran, Toby; Colin Evans, Jamie Taylor (2009). Programming the Semantic Web. O'Reilly.
  4. Mui, Chunka (24 October 2011). How Apple Invented The Future (and the iPad) in 1986. Forbes. Retrieved on 12 November 2013.
  5. Berners-Lee, Tim (1998). What the Semantic Web can represent.
  6. Antoniou, Grigoris; Frank van Harmelen (2008). A Semantic Web Primer, 2nd ed. MIT Press.
  7. Entrepreneurs See a Web Guided by Common Sense. New York Times (2006).
  8. The global structure of an HTML document. W3C.
  9. Microformats hCal example. Microformats.org (2010).
  10. Marshall, C. C.; Shipman, F. M. (2003). "Which Semantic Web?". Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, ACM.
  11. Turtle, Terse RDF Triple Language. w3.org.
  12. RDF Vocabulary Description Language 1.0: RDF Schema.
  13. http://www.w3.org/TR/rdf-sparql-query/
  14. RDF library for the Ruby programming language.
  15. RDF library for PHP from PEAR (PHP Extension and Application Repository).
  16. JRDF - An RDF Library in Java.
  17. Burgun, Anita & Olivier Bodenreider, Mapping the UMLS Semantic Network into General Ontologies
  18. The RDFa initiative in Drupal 7, and how it will impact the Semantic Web.
  19. Drupal RDF Mapping API. Drupal.org (2009).
  20. Does Facebook Really Want a Semantic Web?. ReadWriteWeb (2010).
  21. BBC World Cup 2010 dynamic semantic publishing (2010).
  22. Google introduces rich snippets (2009).
  23. wiki.dbpedia.org: Use Cases.
  24. Hartig, Olaf (2009). "Querying Trust in RDF Data with tSPARQL". 6th Annual European Semantic Web Conference, 5-20.
  25. (1999) "Formality Considered Harmful: Experiences, Emerging Themes, and Directions on the Use of Formal Representations in Interactive Systems". Computer Supported Cooperative Work (CSCW) 8 (4): 333-352.