Source: www.iitr.ac.in
Web Crawler Data Collection Module
News Crawler
• News crawlers focus on retrieving newly published news data.
• A news crawler monitors a predefined set of news sources and captures each news item as soon as it is published.
[Figure: Predefined Set of News Sources -> (crawl every 30 min) -> News URL Downloader -> New URLs -> News Article Downloader -> News Articles -> News Database]
Architecture of News Crawler at IITR
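The polling step in the architecture above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the IITR implementation: the class name `NewsCrawlerSketch`, the method names, and the `example.com` URLs are hypothetical, and the URL downloader is passed in as a plain `Supplier` so that the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of the "crawl every 30 min, keep only new URLs" step.
public class NewsCrawlerSketch {
    // URLs that were already passed on to the article downloader.
    private final Set<String> seenUrls = new LinkedHashSet<>();

    // Returns only the URLs not seen in an earlier crawl, in discovery order.
    public synchronized List<String> filterNewUrls(List<String> discovered) {
        List<String> fresh = new ArrayList<>();
        for (String url : discovered) {
            if (seenUrls.add(url)) { // add() returns false for a known URL
                fresh.add(url);
            }
        }
        return fresh;
    }

    // Runs one crawl immediately and then repeats it every 30 minutes.
    // The caller is responsible for shutting the returned scheduler down.
    public ScheduledExecutorService start(Supplier<List<String>> urlDownloader) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            for (String url : filterNewUrls(urlDownloader.get())) {
                // Placeholder for the article download and the database write.
                System.out.println("New article URL: " + url);
            }
        }, 0, 30, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        NewsCrawlerSketch crawler = new NewsCrawlerSketch();
        // Two simulated crawl rounds; the second one repeats one URL.
        System.out.println(crawler.filterNewUrls(
                Arrays.asList("http://example.com/news/1", "http://example.com/news/2")));
        System.out.println(crawler.filterNewUrls(
                Arrays.asList("http://example.com/news/2", "http://example.com/news/3")));
    }
}
```

In the second round only `http://example.com/news/3` is reported, because the other URL was already seen. A production crawler would presumably persist the seen set (for example in the news database) so that a restart does not re-download old articles.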
16BIT IITR
Web Crawler
A Simple Java Program for
Downloading a Web Page
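The program shown on this slide is not present in the extracted text. A minimal sketch of such a downloader, using only the standard `java.net.URL` class, might look like the following. The class name `DownloadPage`, the `download` helper, and the `http://example.com/` default address are assumptions made for illustration, and the sketch assumes the page is encoded in UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical example: fetch a page and print its HTML source.
public class DownloadPage {
    // Reads the resource behind the URL line by line and returns it as text.
    public static String download(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the stream even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a real URL as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(download(new URL(address)));
    }
}
```

A real crawler would typically also set connection and read timeouts and respect the site's robots.txt, which this sketch leaves out.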
Parsing a Web Page
• Given a web page, we can retrieve its different components by parsing it.
• Many HTML parsers are available, such as Jsoup, Xerces, and NekoHTML.
• The following Java program uses the Jsoup parser to extract hyperlinks from a web page.
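import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.io.File;

public class ExtractLinks {
    public static void main(String[] args) throws IOException {
        // Parse a locally saved HTML file
        File input = new File("data.html");
        // Third argument is the base URI; with an empty base URI,
        // relative links cannot be resolved and "abs:href" yields ""
        Document doc = Jsoup.parse(input, "UTF-8", "");
        // Select every anchor element that has an href attribute
        Elements links = doc.select("a[href]");
        System.out.println("Total Number of Links:" + links.size());
        for (Element link : links) {
            // "abs:href" returns the link as an absolute URL
            System.out.println(link.attr("abs:href"));
        }
    }
}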
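The `abs:href` attribute key used above asks for the absolute form of a link, which only makes sense relative to the URL of the page the link was found on. The sketch below illustrates that resolution step with the standard `java.net.URI` class. It is not Jsoup's own code; the class name `ResolveLinkDemo`, the helper `toAbsolute`, and the `example.com` addresses are hypothetical.

```java
import java.net.URI;

// Hypothetical illustration of resolving href values against a base URI.
public class ResolveLinkDemo {
    // Resolves an href value against the URL of the page it appeared on.
    public static String toAbsolute(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html"; // assumed page URL
        // Relative to the current directory of the page
        System.out.println(toAbsolute(base, "article1.html"));
        // Relative to the site root
        System.out.println(toAbsolute(base, "/sports/today.html"));
        // Already absolute, so it is returned unchanged
        System.out.println(toAbsolute(base, "http://other.example.org/a.html"));
    }
}
```

This prints `http://example.com/news/article1.html`, `http://example.com/sports/today.html`, and `http://other.example.org/a.html`. For a crawler, working with absolute URLs matters because the extracted links are later fetched on their own, without the page they came from.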
Retrieving Article Text
• Many APIs are available for extracting the main content from web pages, such as the Boilerpipe library.
• The following Java program demonstrates the use of Boilerpipe to extract the article text from a news article.
import java.io.PrintWriter;
import java.net.URL;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

public class BoilerplateDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.thehindu.com/news/national/land-acquisition-ordinance-bill-gets-a-burial/article7597517.ece");
        // extractor intended for news-style article pages
        final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
        // choose the operation mode (i.e., highlighting or extraction)
        //final HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
        final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
        // try-with-resources closes the writer even if processing fails
        try (PrintWriter out = new PrintWriter("highlighted.html", "UTF-8")) {
            out.println(hh.process(url, extractor));
        }
        System.out.println("Now open file highlighted.html in your web browser");
    }
}
Article Extractor Data Collection Module
Article Extraction
• Objective: extract the article content from a given news URL.
• News URL:
http://www.hindustantimes.com/world-t20/amitabh-bachchan-to-sing-national-anthem-before-india-pakistan-match/story-QXxnQAvmJsisvIYtSFv33L.html
• Extracted article content:
Bollywood superstar Amitabh Bachchan will sing the National Anthem before the start of the marquee India-Pakistan World Twenty20 cricket match at the Eden Gardens on March 19.
Bachchan has confirmed the development by retweeting a post in his official Twitter handle while sources in the Cricket Association of Bengal today said this was an effort by its president Sourav Ganguly.
“The president was involved and the plan was on for a long time,” CAB sources said.
While the ‘Big B’ will sing the National Anthem in his signature baritone, Pakistan will also make their presence felt with classical singer Shafaqat Amanat Ali who is slated to sing the Pakistani National Anthem.