Semantic Analysis

This article describes the Semantic Analysis process in Luma Knowledge.

Overview

Semantic Analysis is the process of understanding the content and meaning of an artifact. During Semantic Analysis, Luma Knowledge generates the Knowledge Ontology for the artifact. In Luma Knowledge, the process includes:

  • Document Understanding: This step determines the questions and answers contained in a candidate artifact. If more than one question-and-answer pair is found, the document can be broken into Question Answer (QnA) pairs. These pairs are called FAQs and appear as search results for a user's inquiry.

  • NLP Parsing and Normalization: In this step, the system determines the fundamental meaning of a sentence. The process includes tokenizing the sentence, identifying the part of speech (POS) of each token, and categorizing the tokens into topics, subjects, predicates (action words), and motivations. The topics, subjects, predicates, and motivations are then normalized and labeled. The results are used as search metadata.
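The parsing and normalization step can be pictured with a deliberately small sketch. It is not the Luma Knowledge NLP engine: the hand-written lexicon TOY_POS stands in for a real part-of-speech tagger, the function names are hypothetical, and normalization is reduced to whitespace cleanup because the actual normalization rules are not described in this article. Topics and motivations are left out for the same reason.

```python
import re

# Hypothetical, hand-written lexicon standing in for a real POS tagger.
TOY_POS = {
    "documents": "NOUN",
    "information": "NOUN",
    "contains": "VERB",
}

def tokenize(sentence):
    """Split a sentence into alphanumeric word tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)

def tag(tokens):
    """Attach a POS label to each token; unknown words are tagged 'OTHER'."""
    return [(tok, TOY_POS.get(tok.lower(), "OTHER")) for tok in tokens]

def normalize(term):
    """Assumed normalization: trim and collapse whitespace only."""
    return " ".join(term.split())

def extract_labels(sentence):
    """Label nouns as subjects and verbs as predicates (action words)."""
    tagged = tag(tokenize(sentence))
    return {
        "subjects": [normalize(t) for t, pos in tagged if pos == "NOUN"],
        "predicates": [normalize(t) for t, pos in tagged if pos == "VERB"],
    }

labels = extract_labels("This contains documents and information.")
print(labels)
```

A production engine would rely on a trained tagger and dependency rules rather than a fixed lexicon; the sketch only shows the order of the steps: tokenize, tag, categorize, normalize.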

Luma Knowledge Management uses Azure QnA Maker to identify and generate QnA pairs, and an NLP engine for parsing and metadata generation.

  • Every artifact created in Luma Knowledge through a source document or Web URL goes through the Semantic Analysis process of QnA pair generation and metadata identification.

  • An artifact created manually using the Regular Template (Mini artifact) or the FAQ Template bypasses the QnA pair generation step. The system treats the manually created artifact or FAQ as it is and doesn't try to find QnA pairs. The process only identifies metadata for the artifact.
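The two rules above amount to a simple routing decision. The following is a minimal sketch of that decision only; the creation-method strings and step names are hypothetical labels chosen for illustration and are not Luma Knowledge identifiers.

```python
# Hypothetical labels for how an artifact was created.
GENERATED_FROM_SOURCE = {"source_document", "web_url"}
CREATED_MANUALLY = {"regular_template", "faq_template"}

def analysis_steps(creation_method):
    """Return the Semantic Analysis steps that apply to an artifact."""
    if creation_method in GENERATED_FROM_SOURCE:
        # Source documents and Web URLs go through both steps.
        return ["qna_pair_generation", "metadata_identification"]
    if creation_method in CREATED_MANUALLY:
        # Manually created artifacts and FAQs skip QnA pair generation.
        return ["metadata_identification"]
    raise ValueError(f"unknown creation method: {creation_method}")

print(analysis_steps("web_url"))
print(analysis_steps("faq_template"))
```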

View Semantic Analysis List

Navigate to the Semantic Analysis tab to view the current status of the Semantic Analysis process for every artifact created in Luma Knowledge using a source document or Web URL.

The following information is available for every artifact in the list:

  • Artifact ID #: This is the system identifier for the artifact. The artifact is recognized by this ID throughout the application.

  • Artifact Name: This is the name of the artifact identified by the Curator.

  • Source: This is the source document name or Web URL used to create the artifact in the system.

  • Added on: Date and time when the artifact was added to the system.

  • Status: This indicates the current Semantic Analysis status of the artifact. Click the status of an artifact to view its Status Timeline.

Status Timeline

The Status Timeline is a visual representation of the sub-flows that are triggered as part of the Semantic Analysis of an artifact. The illustration shows the sub-flow in execution, its current status, and the timestamp when the process was triggered.

Let us consider an example artifact to understand the sub-flows in the process. An artifact "UPI" is added by a Curator in Luma Knowledge using a document as a source. As soon as the Curator clicks 'Add to Parsing Queue' for the artifact, it is added to the NLP parsing queue for Ontology generation. The artifact now starts appearing in the Semantic Analysis list.

The system now executes additional NLP sub-processes. Click the 'Status' graph to view the sub-processes the artifact goes through:

The sub-flows are Question-Answering Analysis and Ontology Analysis.

FAQ Generation

FAQ Generation is also called Question-Answering Analysis. The process is intended to determine the questions that can be answered by the source document. On the 'Status Timeline', the Curator can view the FAQ Generation process status and start time.

Each QnA pair generated through the process becomes a child artifact linked to the parent Knowledge Artifact. If the FAQ generation process encounters an error, the Semantic Analysis process fails, and the next steps are not executed. The timeline is updated appropriately to indicate the error.
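The stop-on-failure behavior can be sketched as a sequential runner that records a timeline entry for each sub-flow. This is an illustration under stated assumptions, not the product's implementation: the function name, the status strings ("In Progress", "Complete", "Failed"), and the entry fields are all hypothetical.

```python
from datetime import datetime, timezone

def run_semantic_analysis(steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    timeline = []
    for name, step in steps:
        entry = {
            "step": name,
            "started_at": datetime.now(timezone.utc),  # start time shown on the timeline
            "status": "In Progress",
        }
        timeline.append(entry)
        try:
            step()
        except Exception as exc:
            # Record the error; later steps are never started.
            entry["status"] = "Failed"
            entry["error"] = str(exc)
            return "Failed", timeline
        entry["status"] = "Complete"
    return "Complete", timeline

def failing_faq_generation():
    # Stand-in for a FAQ generation run that hits an error.
    raise RuntimeError("QnA pair generation failed")

overall, timeline = run_semantic_analysis([
    ("FAQ Generation", failing_faq_generation),
    ("Ontology Generation", lambda: None),
    ("Persistence", lambda: None),
])
print(overall, [(e["step"], e["status"]) for e in timeline])
```

Because the runner returns as soon as a step raises, the timeline contains only the failed FAQ Generation entry, which mirrors the rule that the next steps are not executed.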

For successful QnA pair generation, the source document:

  • should follow a Question-Answer format.

  • should be in a supported document format such as .docx, .pdf, .xlsx, or .txt.

  • should be within the permissible document size limit. This is configurable for each document type at the tenant level. Refer to Tenant Configurations for more information.
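A pre-check for the format and size conditions might look like the sketch below. It makes several assumptions: the extension set contains only the formats named above (the article says "such as", so the real list may be longer), the size limits are invented placeholder values standing in for the tenant configuration, and the function name is hypothetical. The Question-Answer format condition is not checked here.

```python
from pathlib import Path

# Formats named in this article; the actual supported list may be longer.
SUPPORTED_EXTENSIONS = {".docx", ".pdf", ".xlsx", ".txt"}

# Placeholder limits in bytes. Real limits come from Tenant Configurations.
ASSUMED_SIZE_LIMITS = {
    ".docx": 5 * 1024 * 1024,
    ".pdf": 5 * 1024 * 1024,
    ".xlsx": 2 * 1024 * 1024,
    ".txt": 1 * 1024 * 1024,
}

def check_source_document(filename, size_bytes, size_limits=ASSUMED_SIZE_LIMITS):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    elif size_bytes > size_limits[ext]:
        problems.append(f"file exceeds the {size_limits[ext]}-byte limit for {ext}")
    return problems

print(check_source_document("upi.docx", 1024))
print(check_source_document("upi.png", 1024))
```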

Ontology Generation

Ontology generation is intended to extract metadata from the document. Metadata represents key information (terms or phrases) that describes the artifact. It enhances searches for artifacts and assists the curation process by identifying known terms. The NLP engine parses the document and identifies metadata elements for the artifact and QnA pairs, using parts of speech and dependency rules. The identified metadata is then normalized and can be used as search metadata for the artifact.

Usually, the Artifact Name and Summary are used to generate the Ontology for the artifact. When using Knowledge Templates, the system also generates metadata from fields that are marked for Ontology generation. For more information, refer to Create and Manage Knowledge Templates.
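As a rough sketch of the field selection described above, the function below gathers the text that would be passed to Ontology generation. The dictionary keys ("name", "summary", "key", "use_for_ontology") and the sample template fields are hypothetical; they only illustrate the idea of a per-field marker.

```python
def ontology_source_text(artifact, template_fields=()):
    """Collect the text fragments used as input for Ontology generation."""
    # Name and Summary are always candidates.
    parts = [artifact.get("name", ""), artifact.get("summary", "")]
    # Template fields contribute only when marked for Ontology generation.
    for field in template_fields:
        if field.get("use_for_ontology"):
            parts.append(artifact.get(field["key"], ""))
    # Drop empty fragments.
    return [part for part in parts if part]

artifact = {
    "name": "UPI",
    "summary": "This contains documents and information.",
    "resolution": "Contact the bank.",
    "internal_notes": "Draft only.",
}
template_fields = [
    {"key": "resolution", "use_for_ontology": True},
    {"key": "internal_notes", "use_for_ontology": False},
]
print(ontology_source_text(artifact, template_fields))
```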

The following metadata elements are identified for each artifact:

Topic

A Topic indicates the primary object (a product or service) of an artifact, typically one of the commonly supported products and services in a domain. It is a tangible attribute and can be a term or a phrase. The Topic identified determines the artifact hierarchy in the Knowledge Graph. If no Topic is identified or an incorrect Topic is identified, the Curator may update the Topic as required. This may occur due to a poorly worded title, in which case the system suggests that the title be modified.

For our artifact, the Topic is "UPI".

Motivation

Motivation is a term or phrase that verbalizes why someone would want to see the artifact.

Action

An action word (predicate) is usually a verb that describes an action performed by the subject. Action words indicate the purpose of the artifact.

For our artifact, the identified Action is "contains".

Subject

The subject of a sentence is the who or what in relation to an action. Subjects are words indicating the subject of an artifact. For example, Password or Username is a component of an application requiring authorization (Windows, etc.). Zero or more subjects may be mentioned in the artifact.

For our artifact, the identified Subjects are "documents" and "information".

Identified terms

Identified terms are attributes such as Provider, which may be an organization that provides a product or service (Topic).

For our artifact, the Provider is "Bank".

Parent Topics

Parent Topic indicates the hierarchy of the Topic. Usually, the parent node of the Topic is identified as the Parent Topic. This helps the Curator understand the location of the artifact if a similar topic exists in another Domain or Hierarchy branch.

On the 'Status Timeline', the Curator can view the Ontology Generation process status and start time.
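The metadata elements above can be gathered into one record. The sketch below is an assumed shape, not the Luma Knowledge data model: the class and field names are hypothetical. It is filled in with the values given for the example artifact; Motivation and Parent Topics stay empty because this article does not state them for "UPI".

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArtifactMetadata:
    """Assumed container for the metadata elements of one artifact."""
    topic: Optional[str] = None
    motivation: Optional[str] = None
    actions: List[str] = field(default_factory=list)
    subjects: List[str] = field(default_factory=list)  # zero or more subjects
    provider: Optional[str] = None                      # an identified term
    parent_topics: List[str] = field(default_factory=list)

upi = ArtifactMetadata(
    topic="UPI",
    actions=["contains"],
    subjects=["documents", "information"],
    provider="Bank",
)
print(upi)
```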

Persistence

The last step in the Semantic Analysis process is Persistence. In this step, all the information identified and created in the earlier steps is saved into Luma Knowledge databases. Luma Knowledge uses three databases to save and maintain information in the system:

  • MySQL database: This is the application database (RDBMS) used to save the Knowledge artifact content and metadata.

  • Neo4j database: This is the graph database used to store the relationships between domains, topics, artifacts, and FAQs. Information in this database shows how each entity in Luma Knowledge is connected or related to the others.

  • Elasticsearch: The Elasticsearch database enables fast searches against the large volume of data available in Luma Knowledge. The database stores artifact content and ontology information.

Once the FAQs and metadata are generated, the information is pushed to the databases. On the 'Status Timeline', the Curator can view the Persistence process status and start time.
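To make the division of responsibilities between the three stores concrete, the sketch below splits one analyzed artifact into a payload per store. It does not call any database driver and does not reflect the actual Luma Knowledge schemas: the function name, the relationship names (HAS_TOPIC, HAS_ARTIFACT, HAS_FAQ), the identifiers "A-1" and "F-1", and the "Banking" domain are all hypothetical.

```python
def plan_persistence(artifact):
    """Split one analyzed artifact into illustrative per-store payloads."""
    artifact_id = artifact["id"]
    topic = artifact["metadata"]["topic"]
    # Graph store: relationships between domain, topic, artifact, and FAQs.
    relationships = [
        (artifact["domain"], "HAS_TOPIC", topic),
        (topic, "HAS_ARTIFACT", artifact_id),
    ]
    relationships += [(artifact_id, "HAS_FAQ", faq["id"]) for faq in artifact["faqs"]]
    return {
        # Relational store: artifact content and metadata.
        "mysql": {"id": artifact_id, "content": artifact["content"], "metadata": artifact["metadata"]},
        "neo4j": relationships,
        # Search index: artifact content and ontology information.
        "elasticsearch": {"id": artifact_id, "content": artifact["content"], "ontology": artifact["metadata"]},
    }

example = {
    "id": "A-1",
    "domain": "Banking",
    "content": "This contains documents and information.",
    "metadata": {"topic": "UPI", "actions": ["contains"]},
    "faqs": [{"id": "F-1"}],
}
plan = plan_persistence(example)
print(plan["neo4j"])
```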

Once the above processes have run successfully, the Semantic Analysis process is marked Complete. The artifact and generated FAQs are now available in the Knowledge Store. The Curator can now verify the metadata and QnA pairs before publishing the artifact. The artifact is available for end-user consumption only once it is approved and published by the Curator.