
Why Knowledge Bases Are The Next Big Thing

This article is more than 4 years old.

There is an intriguing fallacy (what I call the disconnected data fallacy) that seems pervasive in enterprise circles. It goes something like this:

Most organizations are filled to bursting with databases, most dedicated to supporting one particular application or another. There’s also data in documents, spreadsheets, and other data stores, and supposedly it’s just a matter of laziness that most of this data is not available to other parts of an organization. Buy the right tool (or hire enough open source developers to build the right tool) and you achieve digital nirvana.

There are many problems with this belief.

  • Synchronization of existing data systems is hard, costly, and usually provides very little return for the investment.
  • Each system has different representations of the “same” conceptual data, and in many cases those representations vary significantly from one system to the next.
  • Identity of objects is seldom consistent from one system to the next, let alone one organization to the next.
  • Data warehousing solves a plumbing problem, but very seldom solves the more complex language problem.
  • Politically, trying to consolidate data systems is usually untenable.

Given these issues, you’d think that most companies would recognize that this approach doesn’t work. Despite that, billions (if not trillions) of dollars are spent every year doing the same damn thing, over and over again.


So what’s the best solution? Just give up and assume that digital transformations are not possible? Not really. There are ways that you can transform an organization to work around a consolidated enterprise data model, but it requires recognition of several key caveats:

  • Enterprise data deals with the resources within the enterprise as well as those from outside that impact the enterprise.
  • Enterprise data also includes the categorizations and related metadata about those resources, information which is found far less often in relational databases.
  • Before you can bind other data systems to this data, you have to, for every resource, build a key ring that identifies that data in a given system.
  • A significant amount of the information that is useful in an enterprise data system requires human curation, because it provides human context.
  • Enterprise data needs to be machine-readable, but it also needs to be human readable, queryable, and curatable.
  • Enterprise data modeling is an ongoing process - you do not know what you do not know, but as you learn, your model should be refinable. This means that schemas are fundamentally dynamic.
  • What is often called program logic is almost invariably higher order metadata within the model. These include rules, constraints, and even presentation.
  • The distinction between data and metadata is only one of context.
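The “key ring” caveat above can be sketched in a few lines: one canonical enterprise identifier is mapped to the identifiers that each application system uses for the same resource. This is a minimal illustration in plain Python; the system names and identifier formats here are invented for the example, not drawn from any particular product.

```python
# Minimal sketch of a "key ring": one canonical enterprise identifier
# mapped to the identifiers each application system uses for that
# same resource. All identifiers here are illustrative assumptions.

key_ring = {
    "urn:enterprise:person:1001": {      # canonical enterprise ID
        "crm": "CUST-48213",             # this person's ID in the CRM
        "hr": "E-0092",                  # ID in the HR system
        "billing": "ACCT-7731",          # ID in the billing system
    },
}

def translate(canonical_id, system):
    """Return this resource's identifier in a given application system."""
    return key_ring.get(canonical_id, {}).get(system)

print(translate("urn:enterprise:person:1001", "crm"))  # CUST-48213
```

Once such a ring exists for every resource, cross-system joins reduce to lookups against the canonical identifier rather than brittle string matching.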

These principles differ fairly dramatically from the requirements that are typically placed upon dedicated databases. Put another way, enterprise data is not application data. It fulfills a different need, has a much greater requirement for metadata, and should be handled in a different manner.

What Is A Knowledge Base?

A knowledge base can be thought of as a data encyclopedia that’s specific to an organization, subject domain or location. For a retail outlet, a knowledge base might contain the catalog for that outlet, but it might also include the sales staff, known customers, store information, and even marketing campaigns. A sports franchise may have a knowledge base focused on players, teams, coaches, games, seasons and so forth. An art museum would have exhibits, locales, works of art, artists, collectors, etc.

In all of these cases there are underlying categories of things and relationships between those things. Each thing in the knowledge base has a globally unique identifier, an array of attributes, and typically external relationships that point from one type of thing to another. In the baseball example, such a relationship may be between a player and a team, though temporal knowledge bases will more likely indicate that there is a contract that binds a particular player to a position with a team for a certain period of time.
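The structure described above can be sketched with plain subject-predicate-object triples, here modeled as Python tuples rather than a real triple store. The URIs and property names are illustrative assumptions, not any standard vocabulary.

```python
# A minimal sketch of knowledge-graph assertions as subject-predicate-object
# triples, using plain Python tuples. The identifiers and properties are
# illustrative, not from any particular ontology.

graph = {
    ("urn:player:doe", "rdf:type", "schema:Person"),
    ("urn:team:mets", "rdf:type", "schema:SportsTeam"),
    ("urn:contract:c42", "rdf:type", "ex:Contract"),
    ("urn:contract:c42", "ex:player", "urn:player:doe"),
    ("urn:contract:c42", "ex:team", "urn:team:mets"),
    ("urn:contract:c42", "ex:startYear", "2019"),
    ("urn:contract:c42", "ex:endYear", "2022"),
}

def objects(subject, predicate):
    """All values of a property for a given subject."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(objects("urn:contract:c42", "ex:team"))  # {'urn:team:mets'}
```

Note that the contract itself is a first-class thing with its own global identifier, which is what makes the temporal modeling possible.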

Indeed, one of the most powerful aspects of such knowledge bases is that both kinds of relationships - the direct player-team link and the contract that mediates it - may very well exist in the same database, where a rule can add a property that says a player is on a given team at a specific time if the contract for that player bounds the time in question. It can similarly remove this relationship if this is not true. In a relational database, this can only be accomplished by setting up a specific table with a property that either indicates that player A is on team B or is null; in a semantic knowledge graph, the property is simply not present if the relationship is false.
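A rule of this kind is easy to sketch: the player-team property is derived from the contract, and simply does not exist for years outside the contract’s bounds. The contract data and property names below are illustrative assumptions.

```python
# Sketch of a rule that materializes a player-team relationship only
# when a contract covers the year in question. Contract data and
# property names are illustrative.

contracts = [
    {"player": "urn:player:doe", "team": "urn:team:mets",
     "start": 2019, "end": 2022},
]

def infer_memberships(contracts, year):
    """Derive (player, 'ex:playsFor', team) triples valid in a given year."""
    return {
        (c["player"], "ex:playsFor", c["team"])
        for c in contracts
        if c["start"] <= year <= c["end"]
    }

print(infer_memberships(contracts, 2020))
# For 2025 the contract no longer applies, so the property is simply absent:
print(infer_memberships(contracts, 2025))  # set()
```

In a production system this rule would typically be expressed in SPARQL or an inference language rather than application code, but the principle is the same: absence of the triple, not a null, represents falsity.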

This seemingly simple characteristic dramatically changes how information can be stored, searched and transformed. A property can have more than one value without requiring the creation of an entirely new table. Properties can be annotated, both to provide more comprehensive definitions and in some cases to perform additional logic if the property contains certain values. Information can also be segregated into different collections (confusingly also called graphs), then merged or deleted once the data has been processed.
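Both of these ideas - multi-valued properties and named collections that can be merged or dropped wholesale - can be illustrated with a small sketch. The graph names and triples here are invented for the example.

```python
# Sketch of multi-valued properties and named graphs (collections):
# each triple is stored under a graph name, so a whole batch can be
# merged or deleted at once. Identifiers are illustrative.

from collections import defaultdict

store = defaultdict(set)  # graph name -> set of (s, p, o) triples

# A property can hold several values without creating a new "table":
store["staging"].add(("urn:player:doe", "ex:position", "shortstop"))
store["staging"].add(("urn:player:doe", "ex:position", "second base"))

def merge(store, source, target):
    """Merge one named graph into another, then delete the source."""
    store[target] |= store.pop(source)

merge(store, "staging", "default")
print(len(store["default"]))  # 2
```

The staging-then-merge pattern is a common way to validate incoming data in isolation before it becomes part of the main graph.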

Finally, it is possible with specific graph query languages to move across relationships between objects without necessarily knowing what those relationships are, even to the extent of identifying connections between two objects across an indeterminate number of hops. This kind of analysis is very useful for discovering previously unknown connections, and is very difficult to do using relational databases.
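The multi-hop traversal described above amounts to a graph search in which the predicates are discovered rather than specified. In SPARQL this is done with property paths; the sketch below shows the same idea as a breadth-first search over plain tuples, with illustrative data.

```python
# Sketch of traversing relationships without knowing the predicates in
# advance: a breadth-first search over triples that reports the chain
# of hops linking two resources. Data is illustrative.

from collections import deque

triples = {
    ("urn:player:doe", "ex:playsFor", "urn:team:mets"),
    ("urn:team:mets", "ex:basedIn", "urn:city:nyc"),
    ("urn:city:nyc", "ex:partOf", "urn:state:ny"),
}

def find_path(start, goal):
    """Return a predicate chain from start to goal across any number of hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for s, p, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                queue.append((o, path + [p]))
    return None  # no connection found

print(find_path("urn:player:doe", "urn:state:ny"))
# ['ex:playsFor', 'ex:basedIn', 'ex:partOf']
```

Doing the equivalent in SQL requires recursive joins whose depth must be known or bounded in advance, which is why this kind of discovery is so much harder in relational systems.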

Without digging too much farther into the weeds, knowledge graphs in general are more flexible than relational databases, are able to store, manipulate and delete metadata about data far more efficiently, and can work with data both in tabular form and as rich “documents”.

A knowledge base is then built on top of knowledge graphs - you can think of it as an application for getting at information without needing to know much, if anything, about the structure of that information.

Dynamic Interfaces With Knowledge Graphs

In any organization, there are two competing demands on data systems that can loosely be described as centralization vs. distribution. Centralization involves keeping a tight rein on data structures that are used to describe organizational entities such as products, people, events, locations, organizations and so forth. There is also typically a wealth of information about a resource that doesn’t necessarily fit easily into the numeric view of data that tends to pervade organizations - descriptive content, provenance information, relationships with categories or other resources and similar types of content.

There are two key approaches that can be taken when designing relational data systems. The first is to do baseline data modeling, in effect hard-coding the relationships that exist directly into the database. This describes the approach taken by about 90% of all developers. It tends to make for relatively fast data systems, but once designed, changing this model becomes much more complicated, especially when a large amount of data has been introduced into the system. This is roughly analogous to creating a hardware chip that encodes business logic. Change is expensive in such systems, and it becomes harder the longer the data structures remain undocumented (something that occurs most of the time).

Nearly twenty years ago, Drupal came out, and with it a fairly radical idea was introduced. Every document within Drupal could be treated as a distinct node in a graph structure. You could decorate that node by giving it a type, assigning it properties, and creating presentation views, but the fact that the node existed and the node identifier was uniquely specifiable meant that a Drupal designer could turn nodes into anything. One problem with this, however, was that at the end of the day the database underneath it was still a relational database, and the overhead of building classes by indirection typically began eating up a significant percentage of the overall computing cycles.

Graph data stores are built precisely with this second approach in mind. The data model in this case is “soft” - it is effectively constructed when it is queried, and this information can then be sent to the client application to tell the client what to do with the information. What this implies is actually quite powerful. The overwhelming majority of software applications require an army of programmers to build user interfaces, and should the data model change, significant UI changes follow.

In a knowledge base, however, the model describes the interface. When you change the model, the interfaces should automatically change as a consequence. As there may be hundreds or even thousands of classes involved in an enterprise, this becomes a big factor, given that the typical “screen” for editing a class may cost upwards of one to three thousand dollars to create or change. This is in fact one of the big reasons that software becomes obsolete: changing the model may be fairly trivial, but the knock-on effects of changing the viewers and editors of that model add up quickly and painfully.
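“The model describes the interface” can be made concrete with a small sketch: a form is generated directly from the class definition, so adding or removing a property in the model changes every editor that renders it, with no UI code touched. The schema format and class below are illustrative assumptions.

```python
# Sketch of model-driven UI: one input widget is emitted per modeled
# property, so editors track the model automatically. The schema
# format and the Player class are illustrative.

player_class = {
    "label": "Player",
    "properties": [
        {"name": "name", "datatype": "string"},
        {"name": "position", "datatype": "string", "values_from": "Position"},
        {"name": "debutYear", "datatype": "integer"},
    ],
}

def render_form(cls):
    """Emit one widget per modeled property; properties that draw their
    values from another class become selection lists."""
    widgets = []
    for prop in cls["properties"]:
        kind = "select" if "values_from" in prop else "input"
        widgets.append(f'<{kind} name="{prop["name"]}" type="{prop["datatype"]}">')
    return widgets

for w in render_form(player_class):
    print(w)
```

Adding a new property to `player_class` immediately yields a new widget on the next render, which is exactly the cost saving the paragraph above describes.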

This is similarly true of things like pick-lists, drop downs and multi-item selectors. The model (in a knowledge graph) indicates what class a given property is expecting to draw values from for a drop-down. It can also indicate constraints that make cascading possible - selecting the make, model and trim of a car, for instance, where the selection of one limits the values that appear in the next. This process, called faceting, is a pain to write using traditional UI generation tools, but trivial to build with knowledge bases.
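The cascading car example above reduces to filtering the value space at each step: choosing a make narrows the models, and choosing a model narrows the trims. The catalog below is invented for the sketch.

```python
# Sketch of model-driven faceting: each selection constrains the
# values offered by the next control. The catalog is illustrative.

catalog = {
    ("Acme", "Roadster"): ["Base", "Sport"],
    ("Acme", "Hauler"): ["Base"],
    ("Zenith", "Comet"): ["LX", "EX"],
}

def models_for(make):
    """Models available once a make has been selected."""
    return sorted({model for (mk, model) in catalog if mk == make})

def trims_for(make, model):
    """Trims available once make and model have been selected."""
    return catalog.get((make, model), [])

print(models_for("Acme"))             # ['Hauler', 'Roadster']
print(trims_for("Acme", "Roadster"))  # ['Base', 'Sport']
```

In a knowledge base the constraints driving these filters live in the model itself, so the same faceting logic serves every class rather than being hand-written per screen.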

Semantic systems can more accurately weight text searches, for instance identifying that a label, an abstract description and a full description can each be used for searches, but that you are more likely to get relevant results from labels than from the abstract or even the full content. Since property values can also be text-search indexed, a semantic knowledge base will retrieve content that is more likely what is being searched for, and will do so in roughly the same amount of time as more traditional search engines.
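Field-weighted search can be sketched by giving each field a weight and scoring a document by which fields the term appears in, so a label hit outranks an abstract hit. The weights and documents below are illustrative assumptions.

```python
# Sketch of field-weighted text search: a hit in the label outweighs
# a hit in the abstract, which outweighs one in the full description.
# Weights and documents are illustrative.

WEIGHTS = {"label": 5.0, "abstract": 2.0, "description": 1.0}

docs = [
    {"id": "a", "label": "knowledge graph", "abstract": "", "description": ""},
    {"id": "b", "label": "databases", "abstract": "about knowledge graphs",
     "description": ""},
]

def score(doc, term):
    """Sum the weights of every field the term appears in."""
    return sum(w for field, w in WEIGHTS.items() if term in doc.get(field, ""))

def search(docs, term):
    """Matching documents, highest-weighted field hits first."""
    return sorted((d for d in docs if score(d, term) > 0),
                  key=lambda d: score(d, term), reverse=True)

print([d["id"] for d in search(docs, "knowledge")])  # ['a', 'b']
```

Real semantic stores derive these weights from the model (for example, marking which properties are labels), rather than hard-coding them as done here.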

Finally, while there are some variations out there with regards to graphs, the graph database industry has mostly standardized on a very well defined stack of technologies. This means that knowledge bases can in general work with multiple triple stores and related stacks with minimal rewriting.

The Business Benefits of Knowledge Bases

Given all that, what benefits do knowledge bases have for businesses? There are several, as it turns out:

  • A knowledge base acts as a reference database: it contains products in your catalog, categorizations, people, locations, relevant organizations, and so forth.
  • Every item within the knowledge base has a globally unique identifier.
  • A knowledge base can also contain the identifier for that same resource in other systems, as well as identifying what those systems are. This makes it ideal for managing master data, as well as for translating an identifier in one system into the corresponding identifier in another.
  • Every resource can incorporate a human-legible description (up to and including whole web pages and similar content).
  • A knowledge base can either serve as or integrate with a digital asset management system.
  • Knowledge bases can generate output not only as web content, but also as web services providing JSON, XML, raw text or other content. Because of this, you can use a knowledge base to significantly simplify an often complex and poorly documented services architecture.
  • New properties can be added or removed dynamically without having to get “under the covers”. This makes it easier to start tracing new properties without having to rewrite or even restart the application.
  • Knowledge bases become more valuable over time as the connections become richer and more complex. Rules and scripts can also search for connections as data gets entered and build out those that don’t yet exist.
  • Many knowledge bases are designed to federate. This means that a knowledge base can work across multiple storage devices and can scale as demand increases. It also means that knowledge bases can pull in information dynamically from other data systems, making them a good fit for managing enterprise-level data.
  • It is possible to store both binary and text format content within most knowledge bases. Such content can come from external stores, or be calculated using the query language of the knowledge graph.
  • With a knowledge base, it becomes far easier to maintain provenance trails, to allow access to information at the attribute level, and to duplicate information so that it can be used in other knowledge graphs.
  • Knowledge bases often form the foundation of chatbot and related natural language processing (NLP) systems, and can tie into artificial intelligence and machine learning systems across multiple computer languages.
  • Knowledge bases are also frequently used to hold and consolidate data that may come from spreadsheets and presentation software, and can generate spreadsheets as output.
  • In general, knowledge bases evolve over time. This means that rather than using a predefined model, most knowledge bases tend to have evolving models that change as business requirements change. This translates into lower overall costs and longer utility of such knowledge bases.
  • The general universality of data formats also means that knowledge bases can be readily spawned from existing knowledge bases, and can be merged back together far more easily than is the case with traditional relational systems.
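The web-service point in the list above is worth a sketch: because the data is already a graph of self-describing statements, serving a resource as JSON is a matter of folding its triples into an object, with repeated properties becoming arrays. The product data and property names below are illustrative.

```python
# Sketch of a knowledge base serving a resource as a JSON payload:
# triples about one subject are folded into a dictionary, and
# repeated properties become arrays. Data is illustrative.

import json

triples = {
    ("urn:product:p1", "ex:label", "Widget"),
    ("urn:product:p1", "ex:category", "urn:cat:tools"),
    ("urn:product:p1", "ex:category", "urn:cat:hardware"),
}

def to_json(subject):
    """Collapse a subject's triples into a JSON object."""
    out = {"@id": subject}
    for s, p, o in sorted(triples):
        if s != subject:
            continue
        if p in out:
            prev = out[p]
            out[p] = prev + [o] if isinstance(prev, list) else [prev, o]
        else:
            out[p] = o
    return json.dumps(out, indent=2)

print(to_json("urn:product:p1"))
```

The same fold can just as easily emit XML or raw text, which is what lets one knowledge base stand in for several hand-built service endpoints.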

Knowledge bases are also ideal tools for integrating with data science pipelines: the flexibility of data production and the ability to map between ontologies (data languages) dynamically mean that many of the big headaches involved in data analytics - de-duplication, cleansing, validation, dimensionality reduction, and ensuring consistent meaning in properties and resources, the 90% of the work that most data scientists have to do just to get data into a form that’s useful for analysis - can be done automatically.

MarketWatch has estimated that the semantic knowledge base industry will be worth $33 billion by 2023, with year over year growth of 10% through the rest of the decade.

Uses for Knowledge Bases

The previous section focused on benefits, but some real world examples can really help to elucidate where such knowledge bases may prove useful. The following are examples from real world applications, though the names of companies and specific details have been obscured.

  • In the media space, most of the large production studios have extensive knowledge bases for managing their IP, production inventory and creative teams. These in turn drive campaign marketing, franchise management, contract management and distribution of product globally.
  • Several news agencies now maintain knowledge graphs that help them identify and update content on notable figures, companies, locations and events, reducing overall research time, ensuring a platform for data accuracy and curation and providing cross-referential inferencing to determine potential stories.
  • Several large online retailers now maintain their catalogs in knowledge bases, reducing potential duplication, permitting price changes to propagate quickly through the system and eliminating the need to store contextual information, such as product imagery, in separate data systems. Increasingly these also use machine learning to handle autoclassification of products from millions of distinct vendors into the knowledge graph itself.
  • Most large search engines now use knowledge bases in the back end to surface related content in specialized “card” or data views, to drive categories, and to provide ways to filter content with minimal curation costs. Chances are, if you have made a web query on a search engine, you’ve been using a knowledge graph.
  • Health insurance companies have been exploring knowledge graphs for several years as a way of providing “Customer 360” views to their workforces and customers. This makes it possible to track visits, patients, providers, prescriptions and services, as well as policy changes and coverage. In addition to simplifying the often bewilderingly complex domain of health insurance, such graphs are also used to identify fraud and corruption in the system.
  • Manufacturers often have to deal with a morass of compliance issues across multiple jurisdictions. Knowledge bases are ideal for tracking resources and production processes and then determining, based upon rules based inferencing, what it takes to get a particular product into compliance in a given region or industry.
  • Logistics and supply chain management companies are relying upon knowledge bases to both identify and track resources over their networks, often with the knowledge bases providing consolidated metadata that reduces the overall complexity of more traditional transactional systems.
  • Governments are beginning to move legal codes and tax law into knowledge bases in order to reduce redundancies, identify applicability and determine case law precedent.
  • Pharmaceutical companies have been taking advantage of knowledge bases for some time to track regulatory compliance, to identify side effects and actions, to manage proper labeling, and to track usage, abuse and fraud.
  • Financial services firms have been working with knowledge bases for a while for managing complex financial instruments, ensuring regulatory compliance and assessing profitability.

Despite all this, most companies are just at the very edge of what they could be doing with the technology. One thing that seems to be very compelling with this technology, however, is that even when projects didn’t necessarily meet broad lofty goals, they still provided demonstrable value, something that’s not necessarily been true in other data management sectors.

Summary

Knowledge bases are not panaceas. They are generally good for providing a foundation for managing enterprise level data, because enterprise data has a higher expectation of curation and quality than most application-oriented data projects. They won’t replace other databases in your organization (okay, they might replace quite a few, but not all) but they should end up acting much like the tubas, cellos, tympani and bassoons of an orchestra - they set up the deep knowledge that companies need, then let their developers build off that deep knowledge to reduce the overall complexity of applications.
