Distributed Web Crawling Using Apache Spark - Is It Possible?
This is an interesting question that was asked of me when I attended an interview regarding web mining. The question was: is it possible to crawl websites using Apache Spark?
I guessed it should be possible, because of the distributed processing capacity of Spark. After the interview I searched for this, but couldn't find any interesting answer. Is it possible with Spark?
Here is one way to think about it:
Your application gets a set of website URLs as input for your crawler. If you were implementing it as a normal, single-machine application, it might work as follows:
- Split the web pages to be crawled into a list per site, each part small enough to fit well in a single thread. For example: if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
- Assign each base URL (e.g. www.example.com/news/20150401) to a single thread; this is where the actual data fetching happens. Then save the result of each thread to the filesystem. (A rough sketch of this threaded version follows right after this list.)
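As a rough illustration of this single-machine version (my own sketch, not part of the original answer: the splitByDay helper, the example URL, and the file-naming scheme are all assumptions), a plain Scala program along these lines would split the date range into per-day URLs and fetch each one on its own thread:

import java.nio.file.{Files, Paths}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import scala.io.Source

object ThreadedCrawl {

  // Hypothetical helper: one URL per day between start and end (inclusive).
  def splitByDay(base: String, start: LocalDate, end: LocalDate): Seq[String] = {
    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(!_.isAfter(end))
      .map(d => s"$base/${d.format(fmt)}")
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val urls = splitByDay("http://www.example.com/news",
      LocalDate.of(2015, 3, 1), LocalDate.of(2015, 4, 1))

    // One fetch task per base URL, each running on a thread from the default pool.
    val fetches = urls.map { url =>
      Future {
        // Naive fetch: no retries, timeouts, politeness delay or robots.txt handling.
        val page = Source.fromURL(url).mkString
        val fileName = url.replaceAll("[^A-Za-z0-9]", "_") + ".html"
        Files.write(Paths.get(fileName), page.getBytes("UTF-8"))
      }
    }
    Await.ready(Future.sequence(fetches), Duration.Inf) // wait for all downloads
  }
}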
When the application becomes a Spark one, the same procedure applies, just encapsulated in Spark notions: we can customize a CrawlRDD to do the same stuff:
- Split the sites:
def getPartitions: Array[Partition]
is a good place to do the split task.
- Threads to crawl each split:
def compute(part: Partition, context: TaskContext): Iterator[X]
will be spread to all the executors of your application and run in parallel.
- Save the RDD into HDFS.
The final program looks like this:
// Note: X is a placeholder for whatever type represents a crawled page.
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[CrawlPartition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray[Partition]
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseURL = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL: if there is one, fill in nextURL and return true,
        // otherwise return false
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as an X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
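To make the skeleton concrete, here is a minimal, self-contained sketch of my own (not from the original answer) in which X is taken to be a (url, content) pair and each partition holds exactly one page; the URL list, app name, and output path are placeholders:

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical concrete partition: one page URL per partition.
class PagePartition(idx: Int, val url: String) extends Partition {
  override def index: Int = idx
}

// A concrete CrawlRDD where each element is a (url, content) pair.
class SimpleCrawlRDD(urls: Seq[String], sc: SparkContext)
    extends RDD[(String, String)](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    urls.zipWithIndex.map { case (u, i) => new PagePartition(i, u) }.toArray[Partition]

  override def compute(part: Partition, context: TaskContext): Iterator[(String, String)] = {
    val p = part.asInstanceOf[PagePartition]
    // Naive fetch; a real crawler would add timeouts, retries and politeness.
    val content = scala.io.Source.fromURL(p.url).mkString
    Iterator((p.url, content))
  }
}

object SimpleCrawl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleCrawler"))
    val urls = (1 to 30).map(d => f"http://www.example.com/news/201503$d%02d")
    new SimpleCrawlRDD(urls, sc).saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}

Packaged into a jar, this runs like any other Spark application via spark-submit, with each partition's fetch executed as one task on the executors.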