Returning offline web page HTML source as string
Posted By:   Rowan_McGuane
Posted On:   Tuesday, March 22, 2005 01:31 PM


Hi there, I'm pretty new to Java and programming in general. I'm currently designing a search-engine-type application that will search through a group of locally stored, offline web pages. The problem I've run into is how to buffer the HTML source code of a stored web page, tags and all, into a String. There are lots of code examples all over the internet for getting the source or text of an online web page, but I'm finding it difficult to adapt them for offline use. This bit of code is what is used to get the source online, but I'm pretty sure it's not adaptable for my use:

import java.net.*;
import java.io.*;

public class SourceViewer
{
    private static String getPageSource(String urlPage)
    {
        String pageSource = "";
        int letter;

        try
        {
            URL url = new URL(urlPage); // open the URL for reading
            InputStream in = url.openStream();
            // chain the InputStream to a Reader; buffering increases performance
            Reader r = new InputStreamReader(new BufferedInputStream(in));

            while ((letter = r.read()) != -1)
            {
                pageSource = pageSource + (char) letter;
            }
        }
        catch (MalformedURLException ex)
        {
            System.err.println(urlPage + " is not a parseable URL");
        }
        catch (IOException ex)
        {
            System.err.println(ex);
        }

        return pageSource;
    }

    public static void main(String[] args)
    {
        String source = getPageSource("http://www.computing.dcu.ie/");
        System.out.println(source);
    } // end main

} // end SourceViewer
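Since the pages are stored locally, the same read-one-character-at-a-time loop can be pointed at a FileReader instead of a URL stream. A minimal sketch of that adaptation (the path "page.html" is only a placeholder, and a StringBuilder stands in for the repeated String concatenation, which gets slow on large pages):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LocalSourceViewer
{
    // Reads the whole contents of a local HTML file, tags and all,
    // into a String. A StringBuilder avoids the quadratic cost of
    // building the String one character at a time with +.
    public static String getFileSource(String path) throws IOException
    {
        StringBuilder pageSource = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new FileReader(path)))
        {
            int letter;
            while ((letter = r.read()) != -1)
            {
                pageSource.append((char) letter);
            }
        }
        return pageSource.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // "page.html" is a placeholder -- pass the path of one of the
        // locally stored pages on the command line instead.
        String path = args.length > 0 ? args[0] : "page.html";
        System.out.println(getFileSource(path));
    }
}
```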


I've also been trying to fiddle around with this piece of code, which gets the links, not the whole source, from a web page, whether on- or offline:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetLinks
{
    public static void main(String[] args)
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try
        {
            // Create a reader on the HTML content.
            Reader rd = getReader(args[0]);

            // Parse the HTML.
            kit.read(rd, doc, 0);

            // Iterate through the elements of the HTML document.
            ElementIterator it = new ElementIterator(doc);
            javax.swing.text.Element elem;

            while ((elem = it.next()) != null)
            {
                // An element that sits inside an <a> tag carries that tag's
                // attribute set under the HTML.Tag.A key.
                SimpleAttributeSet s = (SimpleAttributeSet)
                    elem.getAttributes().getAttribute(HTML.Tag.A);
                if (s != null)
                {
                    System.out.println(s.getAttribute(HTML.Attribute.HREF));
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        System.exit(1);
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri) throws IOException
    {
        if (uri.startsWith("http:"))
        {
            // Retrieve from Internet.
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        else
        {
            // Retrieve from file.
            return new FileReader(uri);
        }
    }
}
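Another way to reuse the URL-based SourceViewer unchanged for offline pages is to convert the local path into a file: URL first, since url.openStream() can read those just like http: addresses. A small sketch of the conversion (the path "pages/index.html" is a hypothetical example):

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public class FileUrl
{
    // Turns a local file path into a file: URL, which url.openStream()
    // in the existing SourceViewer code can open like any other URL.
    public static URL toFileUrl(String path) throws MalformedURLException
    {
        return new File(path).toURI().toURL();
    }

    public static void main(String[] args) throws MalformedURLException
    {
        // "pages/index.html" is a placeholder for a locally stored page.
        URL url = toFileUrl("pages/index.html");
        System.out.println(url.getProtocol()); // prints "file"
    }
}
```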


I know this is a horribly long post, but any help is much appreciated :)