How many woodchucks could a woodchuck chuck? Google may soon finally be able to answer that question for you. Google’s reported rollout of semantic search is causing a lot of buzz. According to a Wall Street Journal report, Google will start rolling this technology out over the next few months. But what is semantic search, really? I’ll give a brief technical description of the concepts here.
Semantic search is a process that infers relationships between things and their properties. It’s a much more powerful analysis (and a computationally more difficult task) than simply indexing a webpage based on the keywords it contains. Semantic search attaches deeper meaning to the words on a page, which can help identify more relevant results and potentially answer your query directly.
Google hasn’t mentioned how or if semantic search will affect rankings. Initially, it’s likely to be an additional “answer” listing type alongside the other universal search results such as videos, images, and news, most likely showing up for more focused, question-type queries.
However, the technology has the potential to significantly alter rankings in the long term and potentially alter how we interact with search engines completely. It’s essential, therefore, to understand what semantic search is. There are 3 major components of semantic search. Let’s examine these here.
1. Entities Rule
Semantic search starts with identifying things, or more specifically, named entities. An entity can be a person, place, corporation, movie, TV show, or even an event. For example, “Person:Barack Obama”, “City:New York City”, or “Movie:The Lord of the Rings”. A named entity is a specifically named occurrence of a thing.
The key technology when dealing with entities is a natural language processing (NLP) function known as named entity recognition (NER). This is a process that parses text and identifies any named entities in it. It can be done by matching against a vocabulary of known names or by using statistical models of sentence parts to detect an entity. NER is not new by any stretch. It’s been a topic of text analytics research for a very long time, and there are in fact many open-source projects dedicated to it.
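To make the vocabulary-based approach concrete, here is a minimal Python sketch. It is only an illustration: the tiny name list (GAZETTEER) and the function name find_entities are assumptions of mine, and a real system would use a far larger vocabulary plus a statistical model for names it has never seen.

```python
import re

# Hypothetical mini-vocabulary of known names mapped to entity types.
GAZETTEER = {
    "Barack Obama": "Person",
    "New York City": "City",
    "The Lord of the Rings": "Movie",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (name, type) pairs for known names found in text, in reading order."""
    spans = []
    # Try longer names first so a long name wins over a shorter name inside it.
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            # Skip a match that overlaps a span we have already accepted.
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in spans)
            if not overlaps:
                spans.append((m.start(), m.end(), name, gazetteer[name]))
    return [(name, etype) for _, _, name, etype in sorted(spans)]


if __name__ == "__main__":
    sentence = "Barack Obama visited New York City to see The Lord of the Rings."
    for name, etype in find_entities(sentence):
        print(etype + ":" + name)
```

Exact string matching like this is fast but brittle: it misses misspellings and nicknames, which is one reason statistical models are used alongside vocabularies.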
Finding and identifying named entities is the basis for semantic search. Google is said to have quietly amassed a database of hundreds of millions of entities, but entities are just a starting point for semantic search.
2. Properties and RDF Triples
Finding and identifying entities is the first step, but the real fun begins when you can determine an entity’s properties. A property is something that describes the entity. For example, the color “blue” for the sky, “actor” as the profession for Harrison Ford, or 1968 as the birth year for Will Smith. The collection of properties describes the entity.
The Resource Description Framework (RDF) is a modeling language that is used to express the properties of entities in a standardized form. This entity/property information forms what is called an RDF statement. RDF statements are expressed in the form of subject-predicate-object, known as an RDF triple. For example:
Andy Warhol : birth place : Pittsburgh, PA
Andy Warhol : profession : Artist
Apple, Inc : product : iPhone
New York, NY : population : 8,391,881
RDF triples add actual information about the entities and constitute a fundamental building block of knowledge. It’s important to recognize that the predicate names the property that describes the entity, and the object is the value of that property. Each entity could have hundreds or even thousands of RDF triples collected from every piece of text that references the entity.
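The triples listed above can be held in a very simple data structure. The Python sketch below is a simplification under my own assumptions: real RDF identifies subjects and predicates with IRIs rather than plain strings, and the Triple class and match function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement, with plain strings standing in for IRIs."""
    subject: str
    predicate: str
    obj: str


# The example triples from the text.
TRIPLES = [
    Triple("Andy Warhol", "birth place", "Pittsburgh, PA"),
    Triple("Andy Warhol", "profession", "Artist"),
    Triple("Apple, Inc", "product", "iPhone"),
    Triple("New York, NY", "population", "8,391,881"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field given; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


if __name__ == "__main__":
    # Everything known about one entity.
    for t in match(TRIPLES, subject="Andy Warhol"):
        print(t.subject, ":", t.predicate, ":", t.obj)
    # A direct "answer" style lookup.
    print(match(TRIPLES, subject="New York, NY", predicate="population")[0].obj)
```

Leaving a field as a wildcard is what turns a pile of statements into something you can query: fix the subject and predicate, and the object is the answer.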
3. Alex, What is a “Knowledge Graph”?
Imagine a vast set of RDF triples extracted from every webpage and database in existence…
Now connect the first entity in each triple (the subject) to the last entity in the triple (the object). This is surprisingly similar to the way one web page links to another web page. The result is an enormous network of interlinked entities and their properties.
This is called a knowledge graph: a vast interconnected network of entities and referenced properties. Google has reportedly amassed this database quietly over the years and seems ready to start using it.
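Here is a rough Python sketch of that linking step, assuming triples are stored as plain (subject, predicate, object) tuples. The function names build_graph and find_path are hypothetical, and the “located in” triple is one I added purely so there is a second hop to follow.

```python
from collections import defaultdict, deque

TRIPLES = [
    ("Andy Warhol", "birth place", "Pittsburgh, PA"),
    ("Andy Warhol", "profession", "Artist"),
    # Extra triple added for illustration so the graph has a two-hop path.
    ("Pittsburgh, PA", "located in", "Pennsylvania"),
]


def build_graph(triples):
    """Link each subject to its objects, labelling every edge with the predicate."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search from start to goal; returns the hops taken, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for hop in find_path(graph, "Andy Warhol", "Pennsylvania"):
        print(" -> ".join(hop))
```

Following edges like this is what lets a graph connect two entities that never appear together in a single statement.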
A great example of a knowledge graph is DBpedia (dbpedia.org). DBpedia is a knowledge graph built on top of Wikipedia, and it provides a very large information graph.
4. Knowledge Graph -> Social Graph
Another great example of a knowledge graph is actually Facebook. Facebook has long touted the “social graph”, which most people associate with interconnected friend lists. However, a social graph is simply a knowledge graph restricted to people. While connecting people to people is an immensely valuable asset, the other valuable component in a social graph is your profile data: your likes, favorite movies, books, artists, etc. This constitutes a knowledge graph of human preferences.
Facebook is certainly a different type of knowledge graph compared to the factual entities and properties derived from Google’s crawl, but it raises a lot of interesting possibilities and may even be more monetizable. Now imagine connecting the human social graph with the factual knowledge graph. This puts a new perspective on Google’s drive to push Google Plus.
All that aside, the true power of a knowledge graph is not to answer simple factual questions. The REAL power a few years ahead lies in the ability to actually infer logical answers to questions not explicitly found in the knowledge graph. This is known as a reasoned query and I’ll discuss it in my next post.
Creating a usable knowledge graph from a closed set of trusted documents like Wikipedia is one thing. Creating a knowledge graph from the general population of billions of webpages is another.
Entity disambiguation is a major problem for knowledge graph construction. How do you know which “Andy Warhol” this document is referring to? It’s a very difficult task. However, you can start to see how Facebook’s profile information may actually help to solve some issues of entity disambiguation. A true and accurate human social graph may actually be a prerequisite for a large scale knowledge graph.
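One simple (and admittedly naive) way to approach disambiguation is to compare the words around a mention with terms already associated with each candidate entity. The Python sketch below assumes exactly that; the candidate list, including the made-up namesake, and the function names are hypothetical and exist only to show the idea.

```python
import re

# Hypothetical candidates, each with a few terms known to be associated with it.
# The second candidate is invented purely for illustration.
CANDIDATES = {
    "Andy Warhol (artist)": {"artist", "pop", "art", "pittsburgh", "painting"},
    "Andy Warhol (hypothetical namesake)": {"plumber", "ohio", "repair"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose known terms overlap most with the context words.

    Returns None when no candidate shares any word with the context.
    """
    words = tokenize(context)
    scores = {name: len(words & terms) for name, terms in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("Andy Warhol was a pop art painter born in Pittsburgh."))
    print(disambiguate("Andy Warhol said hello."))
```

Counting shared words is crude, but it shows why extra context, such as profile data, can help: the more that is known about each candidate, the more evidence there is to separate them.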
Specific names aside, if a webpage mentions “The President”, who does that refer to? The entities are what form the links in the graph, so it’s essential to get them right. Ensuring accurate properties is another key challenge. Are the statements made actually accurate? How much do you trust the source? Funny, but those questions sound familiar for a search engine, don’t they?
I suspect we will see semantic search 1.0 and it will improve over time. The last time I looked Google could still not answer “How many woodchucks could a woodchuck chuck?”. However, they seem to be doing well with “What is the population of new york city?”