PyWren Web Scraping

I was tasked with scraping information on houses for sale in Massachusetts for my data mining class. The target site was redfin.com, which explicitly does not tolerate web scraping and will serve you a captcha if you exceed some unknown threshold of pages per minute or send a fishy user-agent.

Not to be deterred by a captcha, I used Selenium with ChromeDriver to write a scraper that worked pretty well and, importantly, was not caught by Redfin's detection. Each page took ~2-3 seconds to scrape.
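For context, a minimal sketch of that kind of scraper might look like the following; the user-agent string and the fetch_listing helper are illustrative assumptions, not the actual code from this project:

import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# Present a normal browser user-agent so requests look less fishy (assumed approach).
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
driver = webdriver.Chrome(chrome_options=options)

def fetch_listing(url):
    # Fetch one listing page, pause to stay under the rate threshold, return the rendered HTML.
    driver.get(url)
    time.sleep(3)
    return driver.page_source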

There are ~10,000 houses for sale in Massachusetts right now, and running the scraper to completion would take ((10,000 * 3) / 60) / 60 = 8.3 hours. I was not enthused about leaving the process running on my laptop for 8 hours, so I looked for a better solution.

A few months back I read about pywren, which researchers at Berkeley wrote to create AWS Lambda functions automatically, achieving an impressive 40 TFLOPS. I was less interested in the compute capacity and more interested in the read/write bandwidth, where they reached an impressive 80 GB/s read and 60 GB/s write.

I figured that instead of scraping each page, waiting, and moving on to the next, I could parallelize the whole thing. This was the perfect type of job for pywren: it's embarrassingly parallel and network constrained.

I set up pywren (I found a bug during setup and submitted a pull request) and ran the hello world example from the pywren docs, which looks roughly like this:
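import pywren

def addone(x):
    return x + 1

wrenexec = pywren.default_executor()
futures = wrenexec.map(addone, range(10))  # each call runs in its own Lambda
print([f.result() for f in futures])

(addone is just the toy function from their example.) I then tried running it on my own code, which looked like the following: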

import time

import pywren

start = time.time()
groups = get_url_groups('urls.csv', 40)   # split the 10,840 urls into groups of 40
print(len(groups))
wrenexec = pywren.default_executor()
futures = wrenexec.map(do_urls, groups)   # one Lambda invocation per group
results = pywren.get_all_results(futures)
end = time.time()
print("Took %f seconds..." % (end - start))

This code takes the list of 10,840 urls, splits them into groups of 40, and then maps a function called do_urls over those groups, so each group is passed as the argument to one Lambda invocation. When everything finishes, results contains a list of the return values from each Lambda function that ran.
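Neither get_url_groups nor do_urls is shown here, so the sketch below is my rough guess at their shape; it assumes urls.csv holds one listing URL per row and that the Lambda side fetches pages with plain HTTP requests (the actual parsing of each page is omitted):

import csv
import time
import requests

def get_url_groups(path, group_size):
    # Read listing urls from the CSV (assumed: one url per row) and chunk them.
    with open(path) as f:
        urls = [row[0] for row in csv.reader(f) if row]
    return [urls[i:i + group_size] for i in range(0, len(urls), group_size)]

def do_urls(urls):
    # Runs inside a Lambda: fetch each page, pausing 3 seconds between requests.
    results = []
    for url in urls:
        page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        results.append((url, page.status_code, len(page.text)))
        time.sleep(3)
    return results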

I had 10,840 urls in total, with a 3-second sleep required between page fetches. AWS Lambda restricted me to 300 seconds of run time per invocation and 400 concurrent executions. To maximize concurrency while staying under the 400-process limit, I split the urls into groups of 40. Each group took a minimum of 40 * 3 + 40 * 1.5 = 180 seconds to run (a 3-second pause plus roughly 1.5 seconds of processing per page), and 10,840 / 40 = 271 groups ran as concurrent invocations. The whole job took about 272 seconds, or roughly 4.5 minutes, to complete.
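Spelled out as a quick back-of-the-envelope check (the bounds here are my own illustration of the same numbers):

TOTAL_URLS = 10840
PER_URL_SECONDS = 3 + 1.5                                # 3 s pause + ~1.5 s processing
LAMBDA_TIMEOUT = 300                                     # max run time per invocation
MAX_CONCURRENT = 400                                     # concurrent execution limit

largest_group = int(LAMBDA_TIMEOUT / PER_URL_SECONDS)    # 66 urls fit under the timeout
smallest_group = -(-TOTAL_URLS // MAX_CONCURRENT)        # 28 urls keeps us at 400 or fewer Lambdas
group_runtime = 40 * PER_URL_SECONDS                     # groups of 40 run ~180 seconds each
invocations = TOTAL_URLS // 40                           # 271 concurrent invocations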

The best part is that this level of usage falls within AWS Lambda's free tier. I used roughly 73,000 GB-seconds of execution (271 invocations at ~272 seconds each, at about 1 GB of memory per function), well below the 400,000 GB-second free-tier threshold. I was able to speed up my job from over 8 hours to under 5 minutes, for free. It's like having a superpower.
