DIADEM background
DIADEM logo

Why …

How do you go about finding a flat in a foreign city? Use a real-estate aggregator; ask around about real-estate agencies; rummage through the local classifieds … Why isn’t this as easy as searching in Google or shopping at Amazon?

Where the object of publishing is text, the web has become the “great equalizer”: anyone can become a publisher. Where the objects are real-world or digital goods, however, this has not happened. Search engines have not been able to provide comprehensive, automated object search. Though anyone can publish a Web site about a flat for rent, finding that offer in a search engine is nearly impossible.

Why is that? Searching for objects is harder than searching for text: we search for objects by their attributes, such as the size of a flat, the number of bathrooms, or the distance to the nearest opera house. To find an object offered on some Web page, we have to recognize that an object occurs on that page, and identify that object’s type and the values of its attributes.

Today’s technology requires those tasks to be performed by the publisher. Yet it puts all the power in the hands of large aggregators such as Google: they decide who gets aggregated and how each publisher has to describe its objects. Publishers have to follow the object types defined by aggregators and provide the attribute values of the offered objects; worse, they have to do this for many aggregators.


What …

Many computer scientists see the answer to the problem of object search in publishing objects using formal vocabularies. Initiatives such as Linked Open Data and the Semantic Web create such vocabularies for publishers to annotate their published objects. Though this provides more choice and freedom to publishers, it is freedom for technology experts only, considering that today publishers overwhelmingly fail to produce even syntactically correct Web pages.

DIADEM takes a bolder view: we believe that the hard and repetitive tasks necessary for object search can be automated, given a number of significant but realistic breakthroughs in automated Web data extraction. DIADEM allows publishers to focus on the object descriptions for humans and transforms these into objects with searchable attributes.

DIADEM’s web extraction is based on the observation that object descriptions for humans occur in a limited set of patterns—at least within a given domain: book descriptions contain title and author, the title usually larger or in bold font. Such patterns of occurrence form the phenomenology of objects in that domain.

We assemble domain knowledge and phenomenological knowledge about a domain of interest. With this knowledge we can automatically analyze arbitrary object descriptions from that domain and identify which objects occur and in which patterns. The result of the analysis is an extraction program, which can be used to automatically extract all similarly published objects and their attributes.
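To make the idea of pattern-based extraction concrete, here is a minimal sketch in Python. It is not DIADEM's actual method or code: the pattern table `FLAT_PATTERNS`, the attribute names, and the function `extract_flat` are all hypothetical, and hand-written regular expressions stand in for the far richer domain and phenomenological knowledge described above. It only illustrates how a human-readable flat description can be turned into an object with searchable attributes.

```python
import re

# Hypothetical "phenomenological" knowledge for a toy real-estate domain:
# each attribute is paired with a textual pattern in which it tends to occur.
# These patterns are illustrative assumptions, not taken from DIADEM.
FLAT_PATTERNS = {
    "bedrooms": re.compile(r"(\d+)\s*bed(?:room)?s?\b", re.IGNORECASE),
    "bathrooms": re.compile(r"(\d+)\s*bath(?:room)?s?\b", re.IGNORECASE),
    "price_gbp": re.compile(r"£\s*([\d,]+)"),
}


def extract_flat(description: str) -> dict:
    """Return the attributes whose patterns occur in a flat description."""
    attributes = {}
    for name, pattern in FLAT_PATTERNS.items():
        match = pattern.search(description)
        if match:
            # Strip thousands separators before converting to an integer.
            attributes[name] = int(match.group(1).replace(",", ""))
    return attributes


def at_least(flats: list, attribute: str, minimum: int) -> list:
    """Keep only the extracted flats whose attribute is known and >= minimum."""
    return [flat for flat in flats if flat.get(attribute, 0) >= minimum]


if __name__ == "__main__":
    offers = [
        "Bright 2 bedroom flat with 1 bathroom, £1,250 pcm",
        "Cosy studio near the river, £800 pcm",
    ]
    flats = [extract_flat(text) for text in offers]
    print(at_least(flats, "bedrooms", 2))
```

Once descriptions are reduced to attribute dictionaries like these, searching by attribute becomes an ordinary filter, which is the step that plain text search cannot offer.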


How …

DIADEM is funded primarily through an “Advanced Investigator Grant” of the European Research Council (ERC). The grant finances the majority of the Oxford team as well as visiting collaborators.


Who …

The idea of DIADEM is the brainchild of Georg Gottlob, Professor of Computer/ing Science at the University of Oxford and the Vienna University of Technology. His current research deals with web data extraction, database theory, query languages, data exchange, and graph-theoretic problem decomposition methods that can be used for recognizing large classes of tractable instances of hard problems. He has recently been elected a Fellow of the Royal Society for his fundamental contributions to both artificial intelligence and database systems.

In Oxford, he has built a team of four postdocs and several Ph.D. students who work primarily or to a large extent on DIADEM. Tim Furche takes care of the day-to-day management of DIADEM and complements Georg's expertise on query and reasoning languages for the web with a twist towards search. Giovanni Grasso brings deep knowledge of ontology languages and of the integration of ontologies and logic programming to the table. Christian Schallhart is our resident expert in software engineering, with a strong background in test case generation, runtime verification, and formal methods, and we count on him to bring method to DIADEM. Giorgio Orsi is associated with DIADEM and will help us with scaling ontology and reasoning languages and Datalog±. For our Ph.D. students see the full team list.

Due to the ambitious and cross-disciplinary nature of DIADEM, we place a lot of emphasis on collaboration with researchers outside DIADEM. For DIADEM associates in Oxford and the rest of the world see the full team list.
