
Multithreaded Geo Web Crawler with Java

This article covers Mowglee, a web crawling system that uses geography as its main criterion for crawling websites. Implemented in Core Java, Mowglee operates in a multithreaded mode and, by default, makes use of the robots exclusion protocol (robots.txt), sitemap generation, data classifiers and analyzers, and a generalized application framework.

The Mowglee multithreaded geo web crawler for Java is presently at version 0.02a. We will discuss its low-level design and how to get started using it for web crawling.

What is the Mowglee Geo Web Crawler?

Mowglee is a complete web crawling framework and application that is frequently updated at: http://www.sumithpuri.me/coderanch/mowglee.zip. You can find the instructions to run the program in the readme.pdf file in the docs folder. When setting up Mowglee, keep the crawler and the user-agent names the same (i.e., Mowglee). You can, technically, add your own variant of that name if you like - just make sure they match. For example, if you want to add the word 'developercom' to identify your variant, rename the user-agent to 'mowglee-developercom'.

Before running Mowglee, please make sure you have JDK 1.6+ installed on your system. Some of the classes in Mowglee have their own main() method, but these were written only for unit testing of individual functionalities.

What is Geo Crawling?

Mowglee seeks to improve penetration of a geography in terms of reachability. It relies on the most important, highest-throughput URLs of a specific geography as its starting points for crawling - its crawl homes. Here, throughput means the number of varied links, text, or media items with high data relevance for a given geography. Mowglee was developed using concepts of asynchronous task execution and multithreading in Java.

The main class to run the application is in.co.mowglee.crawl.core.Mowglee. You can also run the bundled JAR file under dist using java -jar mowglee.jar. If you are using JDK 6 for execution (recommended), then you can use jvisualvm for profiling.

Note: In the images later in this article, the class Mowglee is named MowgleeCentral.

Mowglee and Low Level Design

The core of Mowglee's crawling system is a hierarchy of crawlers designed to make the crawling process more efficient. MowgleeCrawl is the class that invokes all of Mowglee's crawl types in order: static crawling, periphery crawling, and site crawling.

The starting class for the crawling process is MowgleeStaticCrawl, which reads the static geographical crawl home - or home page. You can configure more than one crawl home for each geography and begin a MowgleeStaticCrawl process for each of them. It may be easier to visualize this by imagining a very simple representation of a multi-agent system. By default there is a safe waiting period of ten seconds that can be configured as needed. Setting this wait period ensures that all data is available for other running processes and other running threads prior to the main crawl.
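To make this concrete, the sketch below (with hypothetical class and method names; only the ten-second wait and the one-crawl-per-crawl-home idea come from the actual design) shows how one static crawl could be launched per configured crawl home, in the spirit of a very simple multi-agent system:

    import java.util.Arrays;
    import java.util.List;

    public class StaticCrawlLauncher {

        // hypothetical crawl homes configured for one geography
        private static final List<String> CRAWL_HOMES = Arrays.asList(
                "http://www.timesofindia.com", "http://www.example-geo-home.com");

        // the safe waiting period before the main crawl (ten seconds by default)
        private static final long SAFE_WAIT_MILLIS = 10 * 1000;

        public static void main(String[] args) throws InterruptedException {
            for (final String crawlHome : CRAWL_HOMES) {
                // one static crawl per crawl home, like a simple multi-agent system
                new Thread(new Runnable() {
                    public void run() {
                        System.out.println("starting static crawl for " + crawlHome);
                        // placeholder: read the crawl home and seed the periphery crawl
                    }
                }).start();
            }
            // wait so that data is available to other threads before the main crawl
            Thread.sleep(SAFE_WAIT_MILLIS);
        }
    }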

MowgleePeripheryCrawl is the Pass 1 crawler. It deduces the top-level domains from a web page or hyperlink. It is designed to make MowgleeSiteCrawl (Pass 2) much easier and more measurable. The periphery crawl removes any duplicate top-level domains across crawls, so you get a more concentrated effort in the next pass. During Pass 1, we concentrate only on the links and do not focus on the data.
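A minimal sketch of the Pass 1 idea (hypothetical names; a plain HashSet stands in for MowgleeDomainMap) - collect hyperlinks, reduce each to its domain, and discard duplicates:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class PeripheryPass {

        // deduce the top-level domain (host) from a hyperlink
        static String deduceTopLevelDomain(String hyperlink) throws MalformedURLException {
            return new URL(hyperlink).getHost().toLowerCase();
        }

        public static void main(String[] args) throws MalformedURLException {
            String[] hyperlinks = {
                    "http://www.example.com/news/1",
                    "http://www.example.com/sports/2",
                    "http://www.another-site.org/index.html" };

            // a Set keeps the domains unique, so Pass 2 visits each site only once
            Set<String> topLevelDomains = new HashSet<String>();
            for (String hyperlink : hyperlinks) {
                topLevelDomains.add(deduceTopLevelDomain(hyperlink));
            }
            System.out.println("unique domains for pass 2: " + topLevelDomains);
        }
    }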

MowgleeSiteCrawl is the Pass 2 crawler, which instantiates an individual thread pool, using the JDK 6 executor service, for each MowgleeSite. The Mowglee crawl process at this stage is quite extensive and rather intrusive in terms of the types of data it detects. In Pass 2, links are classified by type - protocol, image, video, or audio - and metadata information is collected for the page. The most important part of this phase is analyzing in a dynamic and controlled fashion, as we attempt to increase the thread pool size toward the number of pages on the site.
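The per-site thread pools can be sketched with the JDK executor service as below (hypothetical names and pool sizes; the real class is MowgleeSiteCrawl):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SitePass {

        public static void main(String[] args) {
            String[] sites = { "www.example.com", "www.another-site.org" };

            for (final String site : sites) {
                // one thread pool per site; the pool size could grow toward the page count
                ExecutorService siteCrawlPool = Executors.newFixedThreadPool(4);
                for (int page = 0; page < 4; page++) {
                    siteCrawlPool.submit(new Runnable() {
                        public void run() {
                            // placeholder: read a page, classify its links (protocol,
                            // image, video, audio) and collect page metadata
                            System.out.println("crawling a page of " + site);
                        }
                    });
                }
                siteCrawlPool.shutdown();
            }
        }
    }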

Mowglee is organized as a group of workers within each of these crawling passes. For Pass 1, reading and deduction are performed by MowgleePeripheryWorker. For Pass 2, reading and analyzing links is the job of MowgleeSiteWorker. The helper classes used during this phase include data analyzers, as explained later.

Figure 1: Mowglee — Core Crawler

MowgleePeripheryWorker uses MowgleeDomainMap for storing top-level domains. The most important piece of code can be found in MowgleeUrlStream, which opens a socket to any given URL and reads its contents, as shown in the Java code below:

    InputStream inputStream = null;
    InputStreamReader inputStreamReader = null;

    MowgleeLogger mowgleeLogger = MowgleeLogger.getInstance("FILE");

    try {
        // in static crawl mode the "URL" is a file on disk; otherwise open a socket
        if (crawlMode.equals(MowgleeConstants.MODE_STATIC_CRAWL)) {
            mowgleeLogger.log("trying to open the file from " + httpUrl, MowgleeUrlStream.class);
            inputStream = new FileInputStream(new File(httpUrl));
            inputStreamReader = new InputStreamReader(inputStream);
        } else {
            mowgleeLogger.log("trying to open a socket to " + httpUrl, MowgleeUrlStream.class);
            inputStream = new URL(httpUrl).openStream();
            inputStreamReader = new InputStreamReader(inputStream);
        }

        // read the page contents line by line
        StringBuffer urlContents = new StringBuffer();
        BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
        String currentLine = bufferedReader.readLine();

        while (currentLine != null) {
            urlContents.append(currentLine);
            currentLine = bufferedReader.readLine();
        }

        if (httpUrl != null && httpUrl.trim().length() > 0) {
            MowgleePageReader mowgleePageReader = new MowgleePageReader();
            mowgleePageReader.read(httpUrl, urlContents, crawlMode);

            mowgleeLogger.log("the size of read contents are " + new String(urlContents).trim().length(),
                    MowgleeUrlStream.class);
        }
    } catch (FileNotFoundException e) {
        mowgleeLogger.log("the url was not found on the server due to " + e.getLocalizedMessage(),
                MowgleeUrlStream.class);
    } catch (MalformedURLException e) {
        mowgleeLogger.log("the url was either malformed or does not exist", MowgleeUrlStream.class);
    } catch (IOException e) {
        mowgleeLogger.log("an error occurred while reading the url due to " + e.getLocalizedMessage(),
                MowgleeUrlStream.class);
    } finally {
        try {
            // close the stream in finally so an exception cannot leak the connection
            if (inputStream != null)
                inputStream.close();
        } catch (IOException e) {
            mowgleeLogger.log("an error occurred while closing the connection " + e.getLocalizedMessage(),
                    MowgleeUrlStream.class);
        }
    }

Figure 2: Mowglee — Analyzers.

In Mowglee, there are several analyzers for different types of media and data. MowgleeLinkAnalyzer is the analyzer implemented as part of this codebase; it uses MowgleeSiteMap as the in-memory storage for links within a top-level domain. It maintains a list of visited and collected hyperlinks from all crawled and analyzed URLs within a given top-level domain.
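Conceptually, the analyzer needs two collections: links already visited and links collected but not yet visited. A minimal stand-in (hypothetical names; the real classes are MowgleeLinkAnalyzer and MowgleeSiteMap) might look like:

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;

    public class LinkAnalyzerSketch {

        private final Set<String> visited = new HashSet<String>();        // already analyzed
        private final Queue<String> collected = new LinkedList<String>(); // awaiting a visit

        // record a newly discovered hyperlink unless it was already seen
        public void collect(String hyperlink) {
            if (!visited.contains(hyperlink) && !collected.contains(hyperlink)) {
                collected.add(hyperlink);
            }
        }

        // hand the next link to the crawler and mark it as visited
        public String nextToVisit() {
            String hyperlink = collected.poll();
            if (hyperlink != null) {
                visited.add(hyperlink);
            }
            return hyperlink;
        }
    }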

Figure 3: Mowglee – Filter, Logger, and Utilities.

MowgleeGarbageCollector, a daemon thread, is instantiated and started at the main application runtime. Since a large number of objects are instantiated in each thread, this thread tries to control and enforce internal garbage collection, keeping in mind safe limits of memory usage. MowgleeLogger, meanwhile, provides the abstract class for all logger types in Mowglee. There is also an implementation of the robots exclusion protocol provided in MowgleeRobotsExclusionFilter, which inherits from MowgleeCrawlFilter. Every other filter that is close to the functioning of a crawler system can extend this class.
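To make the filter's role concrete, here is a simplified, self-contained sketch of a robots exclusion check (an illustration only, not the MowgleeRobotsExclusionFilter implementation; it honors only Disallow rules under User-agent: *):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsExclusionSketch {

        // fetch /robots.txt and collect Disallow paths for the wildcard user-agent
        static List<String> disallowedPaths(String site) throws IOException {
            List<String> disallowed = new ArrayList<String>();
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new URL("http://" + site + "/robots.txt").openStream()));
            boolean appliesToUs = false;
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring(11).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    disallowed.add(line.substring(9).trim());
                }
            }
            reader.close();
            return disallowed;
        }

        // a URL passes the filter if its path does not start with a disallowed prefix
        static boolean isAllowed(String path, List<String> disallowed) {
            for (String prefix : disallowed) {
                if (prefix.length() > 0 && path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }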

MowgleeCommonUtils offers numerous common helper functions, including deduceTopLevelDomain(). As per the sitemaps protocol, MowgleeSitemapGenerator is the placeholder for generating sitemaps and is a starting point for a more thorough - or even custom - implementation. Implementations for analyzing images, video, and audio can be added; only the placeholders are provided.
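For reference, a sitemap under the sitemaps protocol is just an XML list of <url> entries, so a minimal generator along these lines (a hypothetical sketch, not the MowgleeSitemapGenerator code) could serve as a starting point:

    import java.util.Arrays;
    import java.util.List;

    public class SitemapSketch {

        // render crawled URLs as a minimal sitemaps-protocol XML document
        static String toSitemapXml(List<String> urls) {
            StringBuilder xml = new StringBuilder();
            xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
            for (String url : urls) {
                xml.append("  <url><loc>").append(url).append("</loc></url>\n");
            }
            xml.append("</urlset>\n");
            return xml.toString();
        }

        public static void main(String[] args) {
            System.out.println(toSitemapXml(Arrays.asList(
                    "http://www.example.com/", "http://www.example.com/news/1")));
        }
    }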

Applications and Uses of Geo Crawlers

The following are some ideal applications for this category of web crawler. It is by no means a complete list.

Governance and Government Enforcement Websites

Enforcing any localized or custom geographical rules is possible using this system. Any type of classification that is done by manual deduction or through the automatic detection of data patterns for administrative purposes can likewise be achieved through the data generated in this manner.

Ranking Sites and Links

Ranking websites and links for a search engine that is localized to a specific geography or to certain areas can be performed using the data provided from Mowglee. Finding relations and click patterns for links within the same geography would be simpler.

Analytics

Data collected by the geo crawler could be fed into third-party and custom tools for deeper analysis. Relevant digital repositories could also conceivably be created from the generated volumes of data.

Keyword Based Advertising

Driving advertising based on geolocation is another important application you could create based on the data collected from this crawler. Use an analyzer to find specific terms, keywords, phrases, and locations, then embed relevant advertising.

Enhancements to Mowglee

The following enhancements may be added to Mowglee's functionality:

Graph Database Storage

Implement the graph database or use a NoSQL graph storage option. This can be achieved using the MowgleePersistenceCore class.
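As one possible starting point, link relationships can first be modeled as an in-memory adjacency list before being handed off to a graph or NoSQL store (a hypothetical sketch; the names are illustrative and MowgleePersistenceCore would do the actual persistence):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class LinkGraphSketch {

        // page URL -> set of URLs it links to
        private final Map<String, Set<String>> edges = new HashMap<String, Set<String>>();

        // record that fromUrl links to toUrl
        public void addLink(String fromUrl, String toUrl) {
            Set<String> outgoing = edges.get(fromUrl);
            if (outgoing == null) {
                outgoing = new HashSet<String>();
                edges.put(fromUrl, outgoing);
            }
            outgoing.add(toUrl);
        }

        public Set<String> outgoingLinks(String fromUrl) {
            Set<String> outgoing = edges.get(fromUrl);
            return outgoing != null ? outgoing : new HashSet<String>();
        }
    }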

Other Analyzers

It is possible to add more analyzers. The base class used to extend is MowgleeAnalyzer. You can refer to the MowgleeLinkAnalyzer to better understand the implementation of the analyzer.
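Assuming MowgleeAnalyzer exposes a method along the lines of analyze(String) - an assumption; refer to MowgleeLinkAnalyzer in the codebase for the actual contract - a new analyzer could be sketched as:

    // a sketch of a custom analyzer; the analyze(...) signature is an assumption,
    // so check MowgleeLinkAnalyzer for the real base-class contract
    public class MowgleeImageAnalyzer extends MowgleeAnalyzer {

        // collect image links (a placeholder heuristic based on the file extension)
        public void analyze(String hyperlink) {
            String lower = hyperlink.toLowerCase();
            if (lower.endsWith(".png") || lower.endsWith(".jpg") || lower.endsWith(".gif")) {
                // store or classify the image link here
            }
        }
    }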

Classifiers

Add a hierarchy of classifiers - or plugins that automatically classify your data based on metrics such as terms, keywords, geographic lingo, or phrases. An example of this is MowgleeReligiousAbusePlugin.

Deducing Complex Relationships

More complex mechanisms for deducing relationships to create more relevance within a specific geography or geolocation can be added. Ignoring links or URLs outside of a geographic region, country, or even continent for a crawl session is possible. At higher levels, like continents, this can be implemented using a coordinated multi-agent system.

Sitemap Generation

You may also use other sitemap generation libraries internally or develop your own custom sitemap generator to enhance this functionality.

Limitations of Mowglee

Mowglee is not guaranteed to create a sitemap that can be used for crawling; it relies on an asynchronous mechanism to ensure that even links not reachable from within the website itself get crawled. Currently, Mowglee has no termination mechanism; a placeholder is provided so you can decide per your own needs. Termination could be based on the number of top-level domains crawled, data volume, number of pages crawled, geography boundaries, types of sites to crawl, or an arbitrary mechanism.

Be sure to shut down other applications running on your system to dedicate all available system resources to Mowglee. The default starting crawl point is http://www.timesofindia.com; you may change it to another website. You may use the mowglee.crawl file to review the crawl analysis; there is no other storage mechanism at present. I suggest keeping this file open in an editor such as EditPlus to continuously monitor its contents.

This article should have given you an excellent grounding in building a layered, multithreaded crawler, especially for applications that need geographically based classifications and affinity. It should also help you save time, by reusing this codebase, when building your own crawlers or applications. You may also want to build the enhancements to Mowglee mentioned above and post them back for the benefit of the entire community!

Sumith Puri is a Principal Java/Java EE Architect and a hard-core Java / Jakarta EE developer with 16 (and counting) years of experience. He is a Senior Member of ACM and IEEE, a DZone Core member, a member of CSI*, a DZone MVB, and a Java Code Geek. He holds a Bachelor of Engineering (Information Science & Engineering) from Sri Revana Siddeshwara Institute of Technology, completed the Executive Program in Data Mining & Analytics at the Indian Institute of Technology, and completed the Executive Certificate Program in Entrepreneurship at the Indian Institute of Management. His certifications include SCJP 1.4, SCJP 5.0, SCBCD 1.3, SCBCD 5.0, BB Spring 2.x*, BB Hibernate 3.x*, BB Java EE 6.x*, Quest C, Quest C++, and Quest Data Structures.