Web Crawler Utility
The Webcrawler Utility is a software tool that automates the extraction of information from websites through web scraping. It crawls the main page of a website and all of its child pages, generates a knowledge artifact for each page, and stores a fingerprint of each page. On recurring syncs, only pages that have changed since the last crawl are updated, which significantly improves efficiency.
The utility allows users to specify the domain where the artifacts should be created and can be configured using various application properties.
Recursive Crawling: Automatically crawls all child pages from a main page, ensuring comprehensive data extraction.
Knowledge Artifacts: Creates a knowledge artifact in Luma Knowledge for every crawled page.
Change Detection: Identifies changes and syncs only updated content during subsequent crawls.
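The change detection described above relies on page fingerprints. The following Java sketch illustrates the general idea only, comparing a hash of a page's current content with the hash recorded on the previous crawl; the class and method names are hypothetical and this is not the utility's actual implementation.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class FingerprintCheck {

        // Fingerprints recorded during the previous crawl (URL -> SHA-256 hash).
        private final Map<String, String> previousFingerprints = new HashMap<>();

        /** Returns true if the page content has changed since the last crawl. */
        public boolean hasChanged(String url) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // Fingerprint the page by hashing its content.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(body.getBytes());
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            String fingerprint = hex.toString();

            // A page counts as changed when its fingerprint differs from the stored one.
            boolean changed = !fingerprint.equals(previousFingerprints.get(url));
            previousFingerprints.put(url, fingerprint);
            return changed;
        }
    }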
Application Properties
Provide the following details to the utility through its configurable application properties:
Source URL: Defines the starting point for the web crawl.
Luma KM API Token: The API token required to authenticate with Luma KM for artifact creation.
Tenant ID: Identifies the tenant for the Luma KM instance, ensuring knowledge artifacts are stored in the correct location.
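These values are supplied in the utility's configuration file. The property key names and values below are illustrative assumptions rather than the utility's documented keys; refer to the configuration file shipped with the utility for the exact names.

    # Example configuration (illustrative key names and placeholder values)
    source.url=https://www.example.com
    luma.km.api.token=<your Luma KM API token>
    luma.km.tenant.id=<your tenant ID>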
Prerequisites
Before running the Webcrawler Utility, ensure that the following system and software requirements are met:
Operating System: Windows 10 or newer.
Java JDK/JRE: Java JDK/JRE version 17 must be installed to run the utility.
Installation & Configuration
Install Prerequisites: Ensure that you have Windows 10 or higher and Java JDK/JRE 17 installed on your system.
Download the Webcrawler Utility: Obtain the latest version of the Webcrawler Utility from the provided source.
Configure the Utility: Set the required application properties such as the Source URL, Luma KM API Token, and Tenant ID in the utility’s configuration file.
Run the Utility: Execute the utility to begin crawling the specified website and generating knowledge artifacts.
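As an example, a Java utility packaged as an executable JAR is typically started from a command prompt as shown below; the JAR file name is an assumption and may differ in your download.

    java -version            REM should report version 17
    java -jar webcrawler-utility.jar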
Usage
First Run: The utility will crawl the entire website starting from the main page and will generate artifacts for every page found.
Subsequent Runs: During recurring syncs, the utility will detect any changes on the website since the previous crawl and update only the modified pages, reducing redundancy and saving time.
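If you want recurring syncs to run automatically, one option (not described by the utility itself, and shown here only as an assumption) is to schedule the same run command with the Windows Task Scheduler, for example a daily run at 02:00; the task name and JAR path are placeholders.

    schtasks /Create /TN "WebcrawlerSync" /TR "java -jar C:\webcrawler\webcrawler-utility.jar" /SC DAILY /ST 02:00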