Vince Fioramonti had an epiphany back in 2001. He realized that valuable investment information was becoming increasingly available on the Web, and that a growing number of vendors were offering software to capture and interpret that information in terms of its importance and relevance.
"I already had a team of analysts reading and trying to digest financial news on companies," says Fioramonti, a partner and senior international portfolio analyst at Hartford, Conn.-based investment firm Alpha Equity Management. But the process was too slow and results tended to be subjective and inconsistent.
The following year, Fioramonti licensed Autonomy Corp.'s semantic platform, Intelligent Data Operating Layer (IDOL), to process various forms of digital information automatically. Deployment ran into a snag, however: IDOL provided only general semantic algorithms. Alpha Equity would have had to assign a team of programmers and financial analysts to develop finance-specific algorithms and metadata, Fioramonti says. Management scrapped the project because it was too expensive.
(For more information about semantic technologies, including search, see Part 1 of this story, "The semantic Web gets down to business.")
The breakthrough for Alpha Equity came in 2008, when the firm signed up for Thomson Reuters' Machine Readable News. The service collects online news from 3,000 Reuters reporters and from third-party sources such as online newspapers and blogs. It then analyzes and scores the material for sentiment (how the public feels about a company or product), relevance and novelty.
The results are streamed to customers, who include public relations and marketing professionals, stock traders performing automated black box trading and portfolio managers who aggregate and incorporate such data into longer-term investment decisions.
A monthly subscription to the service isn't cheap, Fioramonti says. According to one estimate -- which Thomson Reuters would not comment on -- the cost of real-time data updates is between $15,000 and $50,000 per month. But Fioramonti says the service's value more than justifies the price Alpha Equity pays for it. He says the information has helped boost the performance of the firm's portfolio and it has enabled Alpha Equity to get a jump on competitors. "Thomson Reuters gives us the news and the analysis, so we can continue to grow as a quantitative practitioner," he says.
Alpha Equity's experience is hardly unique. Whether a business decides to build in-house or hire a service provider, it often pays a hefty price to fully exploit semantic Web technology. This is particularly true if the information being searched and analyzed contains jargon, concepts and acronyms that are specific to a particular business domain.
Here's an overview of what's available to help businesses deploy and exploit semantic Web infrastructures, along with a look at what's still needed for the technology to achieve its potential.
The key standards
At the core of Tim Berners-Lee's as-yet-unrealized vision of a semantic Web is federated search. This would enable a search engine, automated agent or application to query hundreds or thousands of information sources on the Web, discover and semantically analyze relevant content, and retrieve exactly the product, answer or information the user was seeking.
Although federated search is catching on -- most notably in Windows 7, which supports it as a feature -- it's a long way from being a Web-wide phenomenon.
To help federated search gain traction, the World Wide Web Consortium (W3C) has developed several key standards that define a basic semantic infrastructure. They include the following:
• SPARQL Protocol and RDF Query Language (SPARQL), which defines a standard language for querying and accessing data.
• Resource Description Framework (RDF) and RDF Schema (RDFS), which describe how information is represented and structured in a semantic ontology (also called a vocabulary).
• Web Ontology Language (OWL), which provides a richer description of the ontology and also includes some RDFS elements.
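To make the division of labor among these standards more concrete, here is a minimal Python sketch. It models RDF-style data as subject-predicate-object triples and answers a simple pattern query over them, which is roughly the job a SPARQL engine performs. The `ex:` names, the sample data and the `match` helper are all hypothetical, and a real deployment would rely on an RDF library and a SPARQL engine rather than this toy.

```python
# Toy illustration only: RDF-style triples as plain (subject, predicate, object) tuples.
# Every "ex:" identifier below is invented for this example.
TRIPLES = [
    ("ex:tv1", "rdf:type", "ex:Television"),
    ("ex:tv1", "ex:screenInches", 46),
    ("ex:tv2", "rdf:type", "ex:Television"),
    ("ex:tv2", "ex:screenInches", 32),
    ("ex:radio1", "rdf:type", "ex:Radio"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard,
    loosely like a variable in a SPARQL triple pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Which resources are televisions?"
televisions = [s for (s, _, _) in match(TRIPLES, predicate="rdf:type", obj="ex:Television")]
print(televisions)
```

In this picture, RDF supplies the triple format, RDFS and OWL supply the shared vocabulary (what `ex:Television` means), and SPARQL supplies the query language.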
The final versions of these standards are supported by leading semantic Web platform vendors such as Cambridge Semantics, Expert System, Revelytix, Endeca, Lexalytics, Autonomy and TopQuadrant.
Major Web search engines, including Google, Yahoo and Microsoft Bing, are starting to use semantic metadata to prioritize search results, and they are beginning to support W3C standards such as RDF.
And enterprise software vendors like Oracle, SAS Institute and IBM are jumping on board, too. Their offerings include Oracle Database 11g Semantic Technologies, SAS Ontology Management and IBM's InfoSphere BigInsights.
Semantic basics
Semantic software uses a variety of techniques to analyze and describe the meaning of data objects and their inter-relationships. These include a dictionary of generic and, often, industry-specific definitions of terms, as well as analysis of grammar and context to resolve language ambiguities such as words with multiple meanings.
The purpose of resolving language ambiguities is to help ensure, for example, that a shopper who does a search using a phrase like "used red cars" will also get results from Web sites that use slightly different terms with similar meanings, such as "pre-owned" instead of "used" and "automobile" instead of "car."
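The "used red cars" case can be sketched in a few lines of Python. This is a deliberately simplified illustration that assumes a tiny hand-built synonym table; the table, the sample listings and the `expand_query` and `matches` helpers are hypothetical, and a commercial semantic engine would also analyze grammar and context rather than just swapping words.

```python
# Hypothetical synonym table; a real semantic platform would draw on a much
# larger dictionary plus grammar and context analysis.
SYNONYMS = {
    "used": {"used", "pre-owned"},
    "cars": {"car", "cars", "automobile", "automobiles"},
}


def expand_query(query):
    """Map each query word to the set of terms treated as equivalent."""
    return [SYNONYMS.get(word, {word}) for word in query.lower().split()]


def matches(query, text):
    """True if the text contains at least one equivalent term for every query word."""
    words = set(text.lower().replace(",", " ").split())
    return all(group & words for group in expand_query(query))


listings = [
    "Pre-owned red automobile, low mileage",
    "New red cars in stock",
    "Used blue car for sale",
]
hits = [text for text in listings if matches("used red cars", text)]
print(hits)
```

Only the first listing satisfies all three query terms, even though it shares just one literal word ("red") with the query.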
Part 1 of this story, "The semantic Web gets down to business," explores the technology's potential uses and paybacks, illustrated with real business cases, including ones involving the use of sentiment analysis. It also provides some best practices and tips from the trenches for anyone planning, or at least considering, a deployment.
W3C standards are designed to resolve inconsistencies in the way various organizations organize, describe, present and structure information, and thereby pave the way for cross-domain semantic querying and federated search.
To illustrate the advantage of using such standards, Michael Lang, CEO of Revelytix, a Sparks, Md.-based maker of ontology-management tools, offers the following scenario: If 200 online consumer electronics retailers used semantic Web standards such as RDF to develop ontologies that describe their product catalogs, Revelytix's software could make that information accessible via a SPARQL query point. Then, says Lang, online shoppers could use W3C-compliant browser tools to search for products across those sites, using queries such as: "Show all flat-screen TVs that are 42-52 inches, and rank the results by price."
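Lang's scenario can be approximated with a short Python sketch. It assumes each retailer exposes its catalog in the same structured form, here plain dictionaries standing in for SPARQL endpoints, so that one query can be applied to all of them. The retailer names, products, prices and the `federated_search` function are invented for illustration and do not represent Revelytix's software.

```python
# Hypothetical catalogs standing in for individual retailers' SPARQL endpoints.
CATALOGS = {
    "retailer-a.example": [
        {"name": "TV A1", "type": "flat-screen TV", "inches": 42, "price": 499.0},
        {"name": "TV A2", "type": "flat-screen TV", "inches": 60, "price": 1299.0},
    ],
    "retailer-b.example": [
        {"name": "TV B1", "type": "flat-screen TV", "inches": 50, "price": 449.0},
        {"name": "Radio B2", "type": "radio", "inches": 0, "price": 39.0},
    ],
}


def federated_search(catalogs, product_type, min_inches, max_inches):
    """Apply the same filter to every catalog and merge the results, cheapest first."""
    results = []
    for retailer, items in catalogs.items():
        for item in items:
            if item["type"] == product_type and min_inches <= item["inches"] <= max_inches:
                results.append((item["price"], retailer, item["name"]))
    return sorted(results)


# "Show all flat-screen TVs that are 42-52 inches, and rank the results by price."
ranked = federated_search(CATALOGS, "flat-screen TV", 42, 52)
for price, retailer, name in ranked:
    print(f"{name} from {retailer}: ${price:.2f}")
```

The query works across sites only because every catalog uses the same field names and units, which is the kind of agreement a shared ontology is meant to provide.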
Search engines and some third-party Web shopping sites offer product comparisons, but those comparisons tend to be limited in terms of the range of attributes covered by a given search. Moreover, shoppers will often find that the data provided by third-party shopping sources is out of date or otherwise incorrect or misleading -- it may not, for example, have accurate information about the availability of a particular size or color. Standards-based querying across the merchants' own Web sites would enable shoppers to compare richer, more up-to-date information provided by the merchants themselves.
The W3C SPARQL Working Group is currently developing a SPARQL Service Description designed to standardize how SPARQL "endpoints," or information sources, present their data, with specific standards for how they describe the types and amount of data they have, says Lee Feigenbaum, vice president of technology at Cambridge Semantics and co-chair of the W3C SPARQL Working Group.
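One way to picture why such descriptions matter: if each endpoint states what kinds of data it holds and how much, a client can decide which endpoints are worth querying. The Python sketch below assumes a made-up description format; the field names (`types`, `triples`), the URLs and the `endpoints_for` helper are illustrative only and are not taken from the SPARQL Service Description vocabulary.

```python
# Hypothetical endpoint descriptions; field names are illustrative and are not
# taken from the SPARQL Service Description vocabulary.
ENDPOINTS = [
    {"url": "http://retailer-a.example/sparql", "types": {"Television", "Camera"}, "triples": 120000},
    {"url": "http://retailer-b.example/sparql", "types": {"Book"}, "triples": 900000},
    {"url": "http://retailer-c.example/sparql", "types": {"Television"}, "triples": 15000},
]


def endpoints_for(wanted_type, endpoints):
    """Pick the endpoints that advertise the wanted type, largest data set first."""
    relevant = [e for e in endpoints if wanted_type in e["types"]]
    ordered = sorted(relevant, key=lambda e: e["triples"], reverse=True)
    return [e["url"] for e in ordered]


targets = endpoints_for("Television", ENDPOINTS)
print(targets)
```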
Building blocks and software tools
Tools, platforms, prewritten components and services are available to help make semantic deployments less time-consuming, less technically complex and (somewhat) less costly. Here's a brief look at some options.
Jena is an open-source Java framework for building semantic Web applications. It includes APIs for RDF, RDFS and OWL, a SPARQL query engine and a rule-based inference engine. Another platform, Sesame, is an open-source framework for storing, inferencing and querying RDF data.
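To give a sense of what a rule-based inference engine does, the following Python sketch applies a single RDFS-style rule to a handful of triples: if a resource belongs to a class, it also belongs to every superclass of that class. This is a toy written for illustration, with invented `ex:` data and a hypothetical `infer_types` function; it is not Jena's or Sesame's API, both of which are Java frameworks with far more capable reasoners.

```python
# Toy forward-chaining inference over triples; not Jena's or Sesame's API.
# All "ex:" identifiers are invented for this example.
FACTS = {
    ("ex:FlatScreenTV", "rdfs:subClassOf", "ex:Television"),
    ("ex:Television", "rdfs:subClassOf", "ex:Product"),
    ("ex:tv1", "rdf:type", "ex:FlatScreenTV"),
}


def infer_types(facts):
    """Apply one RDFS-style rule until nothing new appears:
    (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, p1, c) in list(inferred):
            if p1 != "rdf:type":
                continue
            for (c2, p2, d) in list(inferred):
                is_superclass = p2 == "rdfs:subClassOf" and c2 == c
                if is_superclass and (x, "rdf:type", d) not in inferred:
                    inferred.add((x, "rdf:type", d))
                    changed = True
    return inferred


types_of_tv1 = sorted(
    o for (s, p, o) in infer_types(FACTS) if s == "ex:tv1" and p == "rdf:type"
)
print(types_of_tv1)
```

The input states only that `ex:tv1` is a flat-screen TV; the rule derives that it is also a television and a product, so a query for all products would find it.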
Most leading semantic Web platforms come with knowledge repositories that describe general terms, concepts and acronyms, giving users a running start in creating ontologies. "Customers have conflicting demands: to have the platform be able to come back with accurate answers out of the box, and to have it tailored to their business area," says Seth Redmore, vice president of product management at Lexalytics.
To address that quandary, Lexalytics sells its semantic platform primarily to service provider partners, who then fine-tune it for specific business domains and applications. Thomson Reuters' Machine Readable News service is one example.
Other platform vendors have been rolling out business-specific solutions. Endeca, for example, provides application development toolkits for e-business and enterprise semantic applications, including specific offerings for e-commerce and e-publishing.
There are also tools to automatically incorporate semantic metadata, and W3C standards, into existing bodies of information. For example, Revelytix's Spyder utility automatically transforms both structured and unstructured data to RDF, according to Lang. It then presents, or "advertises," the information on the Web as a SPARQL endpoint that can be accessed by SPARQL-compliant browsers, he adds.
An open-source tool called D2RQ can map selected database content to RDF and OWL ontologies, making the data accessible to SPARQL-compliant applications.
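The general idea of mapping database content to RDF can be sketched with Python's built-in `sqlite3` module. The example below turns each row of a small table into subject-predicate-object triples. The table, the `http://example.org/` base URI and the `rows_to_triples` function are assumptions made for illustration; D2RQ itself uses its own mapping language and is not shown here.

```python
import sqlite3

# Build a throwaway in-memory table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(1, "TV A1", 499.0), (2, "TV B1", 449.0)],
)


def rows_to_triples(connection, table, base_uri):
    """Turn each row into (subject, predicate, object) triples, one per non-key column.
    The table name is interpolated directly, so it must come from trusted code."""
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY id")
    columns = [col[0] for col in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{base_uri}{table}/{record['id']}"  # one URI per row
        for column, value in record.items():
            if column != "id":
                triples.append((subject, f"{base_uri}vocab/{column}", value))
    return triples


triples = rows_to_triples(conn, "products", "http://example.org/")
for triple in triples:
    print(triple)
```

Each row becomes a resource with its own URI and each column becomes a property, which is what lets SPARQL-style queries address data that still lives in a relational database.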
Revelytix sells a W3C-compliant knowledge-modeling tool called Knoodl.com, a wiki-based framework designed to help everyone from technical specialists and subject matter experts to business users collaboratively develop a semantic vocabulary that describes and maps domain-specific information residing on multiple Web sites. Communities of interest can then use Knoodl.com to access, share and refine that knowledge, according to Lang.
Consultancy Dachis Group, for example, has developed what it calls a Social Business Design architecture. Its purpose is to help users collaborate, share ideas, and then narrow down and "expose and make sense of" data within a business organization or other community of relevant individuals, such as customers or partners, says Lee Bryant, managing director of the firm's European operations.
Such offerings can significantly ease the task of developing a semantic infrastructure. For instance, Bouygues Construction used Sinequa's semantic platform, Context Engine, and needed only about six months to do an initial implementation of a semantic system for locating in-house expertise, according to Eric Juin, director of e-services and knowledge management at Bouygues.
Bouygues has since developed a semantic search application that helps knowledge workers quickly find information that resides either on internal systems or on the Web, Juin says.
Context Engine indexed and calculated the relevance of people and concepts in a half-million documents, including meeting minutes, product fact sheets, training materials and project documentation, he says. The platform includes a "generic semantic dictionary" of common words and terms, which it can translate between various languages, according to Juin. For example, a French employee could semantically search a document written in German.
Certain business-specific acronyms and terms have to be added manually -- that's an ongoing process that requires semantic experts to collaborate with business users, Juin says. Over time, however, his group has been adding fewer keyword definitions, because the semantic engine can use other, related words to determine a term's relevance to a specific subject, he says.
The SaaS option
Companies that lack the internal resources to build their own semantic Web infrastructure can follow Alpha Equity's lead and go with a semantic service provided by a third party.
One such provider is Thomson Reuters, which, in addition to its Machine Readable News service, offers a service called OpenCalais through which it creates semantic metadata for customers' submitted content. Customers can deploy that tagged content for search, news aggregation, blogs, catalogs and other applications, according to Thomas Tague, a vice president at Thomson Reuters.
OpenCalais also includes a free toolkit that customers can use to create their own semantic infrastructures and metadata, and to set up links to other Web providers. The service now processes more than 5 million documents per day, according to Tague.
DNA13 (now part of the CNW Group), Lithium Technologies (now the owner of Scout Labs) and Cymfony are among the semantic service providers that query, collect and analyze Web-based news and social media, with an eye toward helping customers in areas such as brand and reputation management, customer relationship management and marketing.
When will the semantic Web really matter?
In a 2010 Pew Research survey of 895 semantic technology experts and stakeholders, 47% of the respondents agreed that Berners-Lee's vision of a semantic Web won't be realized or make a significant difference to end users by the year 2020. On the other hand, 41% of those polled predicted that it would. The remainder did not respond to that question.