How to grep the Internet
Analysts sometimes need to answer questions such as: how many sites run WordPress and how many run Ghost, what coverage Google Analytics has and what coverage Yandex.Metrica has, how often site X links to site Y. The most honest way to answer them is to go through every page on the Internet and count. The idea is not as crazy as it may seem. There is a project called Common Crawl, which publishes a fresh dump of the Internet every month as gzip archives with a total size of ~30 TB. The data lives on S3, so it is usually processed with Amazon's Elastic MapReduce, and there are plenty of instructions on how to do that. But at the current dollar exchange rate this has become a bit pricey, so I would like to share a way to cut the cost roughly in half.

Common Crawl publishes the list of links to the files on S3. For July 2015, for example, it looks like this:
common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00000-ip-10-236-191-2.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00001-ip-10-236-191-2.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00002-ip-10-236-191-2.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00003-ip-10-236-191-2.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00004-ip-10-236-191-2.ec2.internal.warc.gz
...
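Common Crawl also publishes this list as a warc.paths.gz file next to the crawl's data. A minimal sketch for fetching it, assuming the bucket layout used at the time (the article itself does not spell this out), could look like this:
# Hypothetical helper: download the list of WARC chunk paths for one crawl.
import gzip
import urllib.request

crawl = 'CC-MAIN-2015-32'  # the July 2015 crawl shown above
url = ('https://aws-publicdatasets.s3.amazonaws.com/'
       'common-crawl/crawl-data/%s/warc.paths.gz' % crawl)

with urllib.request.urlopen(url) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

print(len(paths))  # roughly 30 000 chunks in a monthly crawl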
Each link points to a file of about 800 MB with roughly the following content:
WARC/1.0
WARC-Type: request
WARC-Date: 2015-08-05T12:38:42Z
WARC-Record-ID: <urn:uuid:886377b3-62eb-4333-950a-85caa9a8bce8>
Content-Length: 322
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:54b96beb-b4cc-4f71-a1bf-b83c72aac9ad>
WARC-IP-Address: 88.151.247.138
WARC-Target-URI: http://0x20.be/smw/index.php?title=Special:RecentChangesLinked&hideanons=1&target=Meeting97
GET /smw/index.php?title=Special:RecentChangesLinked&hideanons=1&target=Meeting97 HTTP/1.0
Host: 0x20.be
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
WARC/1.0
WARC-Type: response
WARC-Date: 2015-08-05T12:38:42Z
WARC-Record-ID: <urn:uuid:17460dab-43f2-4e1d-ad99-cc8cfceb32fd>
Content-Length: 21376
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:54b96beb-b4cc-4f71-a1bf-b83c72aac9ad>
WARC-Concurrent-To: <urn:uuid:886377b3-62eb-4333-950a-85caa9a8bce8>
WARC-IP-Address: 88.151.247.138
WARC-Target-URI: http://0x20.be/smw/index.php?title=Special:RecentChangesLinked&hideanons=1&target=Meeting97
WARC-Payload-Digest: sha1:6Z77MXWXHJYEHC75LGTN3UQMYVJAEPPL
WARC-Block-Digest: sha1:MQ4GSG7X7EU6H26SMF2NS5MADZULHOPK
WARC-Truncated: length
HTTP/1.1 200 OK
Date: Wed, 05 Aug 2015 12:31:01 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.19
X-Content-Type-Options: nosniff
Content-language: en
X-Frame-Options: SAMEORIGIN
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Last-Modified: Fri, 31 Jul 2015 02:16:39 GMT
Content-Encoding: gzip
Connection: close
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-se">
<head>
<meta charset="UTF-8" /><title>Changes related to "Meeting97" - Whitespace (Hackerspace Gent)</title>
...
A pile of headers and not much that is immediately useful. There are special libraries for parsing this format, but I use code like this:
import sys

def grep_analytics(stream):
    url_line = None
    for line in stream:
        if line.startswith('WARC-Target-URI'):
            url_line = line
        elif url_line is not None:
            if 'www.google-analytics.com/analytics.js' in line:
                # Strip the 'WARC-Target-URI: ' prefix and the trailing newline
                yield url_line[17:].rstrip()

if __name__ == '__main__':
    for url in grep_analytics(sys.stdin):
        print(url)
This turns out to be about 10 times faster and easier to deploy. You can run the command directly from your computer:
curl https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438044271733.81/warc/CC-MAIN-20150728004431-00294-ip-10-236-191-2.ec2.internal.warc.gz | gunzip | python grep.py
And get a list of pages that use Google Analytics, completely free of charge. The only problem is that processing one file takes several minutes and has to be repeated ~30 000 times, once per chunk. To speed the calculation up, people typically use Elastic MapReduce and process the chunks in parallel. Note that the workers also pull the data from S3 rather than reading it from HDFS. How much would such a solution cost? That depends on the region and on the machines used.
The region is the easy part: take the one closest to the data, that is "US West". The machines are harder. A very cheap one will take forever; a powerful one will be expensive. To decide, I rented 6 different instance types and ran the command from the example above on each. It turned out that the simplest machine could handle ~10 chunks per hour and the most powerful ~1000. Multiplying out the machine rental prices and the Elastic MapReduce surcharge, I found, interestingly, that it is actually cheaper to take the most powerful machines: the final price comes out lower and the cluster is more compact.
Great, but a price of ~$70 is not much fun, especially if you need to repeat the calculation over several months to track the dynamics. Fortunately, Amazon has a wonderful thing called spot instances: the prices are much lower and change over time.
A very powerful instance type that normally costs $1.68 per hour can be rented for ~$0.3. What's the catch? If someone is willing to pay more, your machine is shut down abruptly, nothing is saved, and it goes to them. For grepping the Internet, spot instances are ideal: even if the computation is interrupted, it is easy to resume. It is a pity that Amazon gives no discount on the Elastic MapReduce surcharge for spot instances.
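To put rough numbers on this, here is a back-of-the-envelope illustration only, using the throughput measured above, the prices just quoted, and the ~$0.3/hour EMR surcharge mentioned in the next paragraph:
chunks = 30000           # WARC chunks in one monthly crawl
chunks_per_hour = 1000   # throughput of the fastest machine measured above
hours = chunks / chunks_per_hour           # ~30 machine-hours in total

on_demand, spot, emr_fee = 1.68, 0.3, 0.3  # $/hour, figures from the article

print(hours * (on_demand + emr_fee))  # ~60$: on-demand + EMR, cf. the ~70$ above
print(hours * (spot + emr_fee))       # ~18$: spot + EMR
print(hours * spot)                   # ~9$: spot only, roughly half as much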
Already much better. But now it feels wrong to pay ~$0.3 for MapReduce when the whole machine costs ~$0.3, especially since neither HDFS nor the reduce step is used. At this point it probably becomes clear how to cut the cost of the calculation in half: roughly speaking, you roll your own back-of-the-envelope MapReduce. For this task that is easy to do. You just need a script that does everything itself: downloads a chunk, unpacks it, greps it, and moves on to the next. Then you run it on several machines with different parameters. I did this through screen like so (a sketch of such a run_grep.py follows the commands):
scp run_grep.py ubuntu@machine1:~
ssh ubuntu@machine1 -t 'screen -d -m python run_grep.py 2015-07 0 1000'
scp run_grep.py ubuntu@machine2:~
ssh ubuntu@machine2 -t 'screen -d -m python run_grep.py 2015-07 1000 2000'
scp run_grep.py ubuntu@machine3:~
ssh ubuntu@machine3 -t 'screen -d -m python run_grep.py 2015-07 2000 3000'
...
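The article does not show run_grep.py itself, so here is a minimal sketch of what it might look like. It assumes, hypothetically, that the list of WARC chunk paths was saved locally as 2015-07.paths (one path per line) and that the grep logic above lives in grep.py as the grep_analytics() generator:
# run_grep.py -- hypothetical sketch of the worker script deployed above.
# Usage: python run_grep.py 2015-07 0 1000
import subprocess
import sys

from grep import grep_analytics  # the generator shown earlier

crawl, start, end = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
base = 'https://aws-publicdatasets.s3.amazonaws.com/'

with open(crawl + '.paths') as f:
    paths = [line.strip() for line in f][start:end]

with open('result-%s-%s.txt' % (start, end), 'a') as out:
    for path in paths:
        # Download and unpack the chunk on the fly, grep the text stream
        pipe = subprocess.Popen('curl -s %s | gunzip' % (base + path),
                                shell=True, stdout=subprocess.PIPE,
                                encoding='utf-8', errors='replace')
        for url in grep_analytics(pipe.stdout):
            out.write(url + '\n')
        pipe.wait()
        # One line per finished chunk, so progress can be monitored later
        with open('progress.log', 'a') as log:
            log.write(path + '\n')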
If you additionally take the trouble to collect logs from the machines, you can even set up simple monitoring.
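For example, a hypothetical minimal poller, assuming each worker writes the progress.log described in the sketch above, could look like this:
# monitor.py -- hypothetical progress poller for the workers.
import subprocess

machines = ['machine1', 'machine2', 'machine3']

for host in machines:
    # Count processed chunks by counting lines in the remote log
    done = subprocess.check_output(
        ['ssh', 'ubuntu@%s' % host, 'wc -l < progress.log || echo 0'])
    print('%s: %s chunks done' % (host, done.decode().strip()))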
