The peaceful botnet

Modern search engines can organize huge amounts of information, letting you quickly find material on almost any topic. But when it comes to finding products in online shops, vacancies in the databases of recruitment agencies, or car offers on dealership sites (in general, searching any catalogued information on the Internet), search engines cannot cope on their own: to answer such queries, in most cases they require sites to upload their catalogues (data feeds) in a special format.

Automatic extraction of facts from catalogues that have no semantic markup is a difficult task, but it is still much easier than extracting facts from arbitrary unstructured text.

Technology


We have developed a technology for building full-fledged catalogue-oriented search engines that do not require data to be submitted in any formalized way: a somewhat unusual crawler is able to extract information from arbitrarily structured web catalogues written in any language. What makes the robot unusual is that it is a JavaScript program which analyzes web pages by loading them into a frame (iframe) through a web proxy. This decision, controversial at first glance, has plenty of benefits.

The JavaScript robot "sees" sites exactly as regular users see them. This lets it handle even sites whose content is partially or completely generated by JavaScript and is therefore inaccessible to a traditional robot. In addition, the ability to emulate various events (button clicks, for example) allows the JavaScript robot not only to view dynamic sites but also to navigate through them.
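To make the idea concrete, here is a minimal sketch of how such a robot could work; the proxy path /proxy?url=… and the "next page" selector are assumptions made for illustration, since the article does not describe them. The page is loaded into a hidden iframe through the proxy (so it becomes same-origin and scriptable), and navigation is done by dispatching synthetic click events.

```typescript
// A minimal sketch, not the actual Maperty robot. The proxy path "/proxy?url=..."
// and the "a.next" selector are illustrative assumptions.

function loadThroughProxy(targetUrl: string): Promise<Document> {
  return new Promise((resolve, reject) => {
    const frame = document.createElement("iframe");
    // The web proxy serves the page from our own origin,
    // so the robot can read and script the frame's DOM.
    frame.src = "/proxy?url=" + encodeURIComponent(targetUrl);
    frame.style.display = "none";
    frame.onload = () => {
      const doc = frame.contentDocument;
      doc ? resolve(doc) : reject(new Error("frame not readable"));
    };
    document.body.appendChild(frame);
  });
}

// Emulate a user click on a pagination link so that content generated by the
// site's own JavaScript is rendered before the robot reads it.
function clickNextPage(doc: Document, selector = "a.next"): boolean {
  const link = doc.querySelector<HTMLAnchorElement>(selector);
  if (!link) return false;
  link.dispatchEvent(new MouseEvent("click", { bubbles: true, cancelable: true }));
  return true;
}
```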

When analyzing web pages, traditional search engines focus on the content and pay little attention to the presentation (design). The robot of a catalogue-oriented search engine has to be even more selective, extracting specific facts from that content. Analyzing catalogues together with their visual appearance, or more precisely with how a user sees them, lets the JavaScript robot do its job better.
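As an illustration of why the rendered view matters, here is a sketch of one possible layout heuristic (an assumption, not the article's actual algorithm): because the robot runs in a real browser, it can use element geometry, for example treating the visually largest image inside a listing block as the likely offer photo.

```typescript
// A hypothetical heuristic: pick the visually largest image in a listing block.
// Only a client that actually renders the page can measure this.
function pickMainImage(listing: Element): HTMLImageElement | null {
  let best: HTMLImageElement | null = null;
  let bestArea = 0;
  for (const img of Array.from(listing.querySelectorAll("img"))) {
    const rect = img.getBoundingClientRect();
    const area = rect.width * rect.height;
    if (area > bestArea) {
      bestArea = area;
      best = img;
    }
  }
  return best;
}
```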

For "army" crawlers require considerable computing resources. Unlike traditional crawlers that requires special software and hardware, JavaScript robot allows you to embed itself directly in a search engine website that gives the possibility to use the computational power of browsers end-users while they work with the site. Something turns out an average between the botnet and the peer-to-peer networks: the website provides the user information — the user helps the site computational capacities.

How does it work?


While creating the technology, we were guided by the following rule: if there is a task that a machine can handle in a large but reasonable amount of time while a person can solve it much faster, the task is still given to the machine, because, first, human time is priceless and, second, our technology lets us use the enormous computing capacity of end users for free.

To connect a new catalogue to the search system, in most cases it is enough to specify the URL of its first page. If no such page exists, the URL of the closest one should be specified instead. The system will load it through the web proxy into a frame and closely watch the actions of the user, who is asked to demonstrate the shortest way to get to the beginning of the catalogue. A demonstration may also be required if the site uses an unusual navigation scheme. In other words, connecting a catalogue is only as complicated as it is for a regular user to reach its first page.
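One way such a demonstration could be captured (an assumption; the article does not describe the recording format) is to listen for the operator's clicks inside the proxied frame and store each step as a CSS path that the robot can later replay to reach the catalogue's first page.

```typescript
// A sketch of recording a navigation demonstration as a list of CSS paths.
// The representation of a "step" is an assumption made for illustration.

function cssPath(el: Element): string {
  const parts: string[] = [];
  for (let node: Element | null = el; node && node.parentElement; node = node.parentElement) {
    const index = Array.from(node.parentElement.children).indexOf(node) + 1;
    parts.unshift(`${node.tagName.toLowerCase()}:nth-child(${index})`);
  }
  return parts.join(" > ");
}

function recordDemonstration(frameDoc: Document, steps: string[]): void {
  // Capture phase, so the click is seen even if the site handles it itself.
  frameDoc.addEventListener("click", (e) => {
    const target = e.target as Element | null;
    if (target) steps.push(cssPath(target));
  }, true);
}
```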
Every new site is then studied: the robot identifies its structural features. This helps it to pick out the necessary data more reliably in the future and also makes it possible to track changes in the site's design and respond to them adequately. This step is fully automatic and requires no human intervention.

Once the study is finished, the robot proceeds to extract information, relying on the data obtained by training on previously connected catalogues. If these data are not sufficient and the robot cannot identify and extract all the necessary facts, the system makes it possible to further train the robot on the new catalogue: the problematic page is opened through the web proxy, and the user only needs to point out (and perhaps clarify) the facts the system failed to detect.
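For illustration, assume that what the robot accumulates for a site can be represented as simple field-to-selector rules (the article does not say how the learned knowledge is actually stored); extraction then amounts to applying the rules, and a page where some fields come back empty is handed to a human.

```typescript
// A sketch under the assumption that per-site knowledge is a map of
// field name -> CSS selector; the real learned representation is not described.

interface ExtractionRules { [field: string]: string; }  // e.g. { price: ".offer .price" }

function extractFacts(doc: Document, rules: ExtractionRules): Record<string, string | null> {
  const facts: Record<string, string | null> = {};
  for (const [field, selector] of Object.entries(rules)) {
    const el = doc.querySelector(selector);
    facts[field] = el ? (el.textContent ?? "").trim() : null;  // null => fact not found
  }
  return facts;
}

// Pages with missing facts are queued so a person can point them out in the
// proxied frame; the corrected selectors then extend the rules.
function needsHumanHelp(facts: Record<string, string | null>): boolean {
  return Object.values(facts).some(value => value === null);
}
```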

It may seem that our technology requires constant human intervention. It does not; I have described the worst-case scenario. With each newly connected catalogue the system becomes smarter, and less and less human involvement is needed.

Security


Loading potentially unsafe web catalogue pages into the end user's browser could lead to the computer being infected with malicious software. We are aware of the seriousness of this problem. For the moment the crawler is disabled in the Internet Explorer family of browsers (and on mobile platforms too, but for different reasons). We are also working on validating the loaded resources with the Google Safe Browsing API. And since the robot opens all pages only through the web proxy, the latter obviously has the ability to analyze their contents. We are now considering various options for how best to use this opportunity to protect end users.
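A check of this kind could look roughly as follows on the proxy side; the Safe Browsing Lookup API is shown here in its current v4 form, and the key handling and blocking policy are assumptions rather than details from the article.

```typescript
// A sketch of a proxy-side Safe Browsing lookup before a page is served.
// API version, threat types and error handling are illustrative assumptions.

async function isUrlSafe(url: string, apiKey: string): Promise<boolean> {
  const res = await fetch(
    `https://safebrowsing.googleapis.com/v4/threatMatches:find?key=${apiKey}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        client: { clientId: "catalog-proxy", clientVersion: "0.1" },
        threatInfo: {
          threatTypes: ["MALWARE", "SOCIAL_ENGINEERING"],
          platformTypes: ["ANY_PLATFORM"],
          threatEntryTypes: ["URL"],
          threatEntries: [{ url }],
        },
      }),
    },
  );
  const body = await res.json();
  // The API returns an empty object when the URL matches no threat lists.
  return !body.matches || body.matches.length === 0;
}
```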

On the other hand, nothing prevents a user from trying to fake the robot's work and send the server false information. To avoid fraud, each task is considered completed only when the same result has been received from several robots running on different computers.
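The confirmation rule can be sketched as a simple quorum check; the threshold of three robots and the idea of comparing results by a hash are assumptions for illustration.

```typescript
// A sketch of the anti-fraud rule: a task is confirmed only when the same
// result has come from enough robots running on different machines.

interface Submission { clientId: string; resultHash: string; }

function isTaskConfirmed(submissions: Submission[], quorum = 3): boolean {
  const votes = new Map<string, Set<string>>();  // result hash -> distinct clients
  for (const s of submissions) {
    const clients = votes.get(s.resultHash) ?? new Set<string>();
    clients.add(s.clientId);
    votes.set(s.resultHash, clients);
  }
  return Array.from(votes.values()).some(clients => clients.size >= quorum);
}
```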

Better to see once than to hear a hundred times


To demonstrate the technology in action, we have created the search engine Maperty, a map of real estate offered for rent. At the moment Maperty is merely a test bench for our experiments and also serves to demonstrate the technology: while a user works with the map, the search engine downloads and processes new offers from real estate agencies.

Right now the map is empty, but about 10 thousand apartment offers in Moscow, St. Petersburg, Kiev, Belarus, Estonia, Poland and Ireland are waiting to be processed. We hope that with the help of Habr users the job will be done in just a few hours, but the system architecture is such that data is written to the map not instantly but in portions whose size gradually grows towards the end of the day.

We invite everyone interested to www.maperty.ru. The robot is activated only in Chrome, Firefox, Safari and Opera, and analyzing a single offer takes the robot about one minute, so if you want to take part in our little experiment, please do not close the browser window when you see an empty map.

What was used (and where)


Google App Engine for Java (the robot's server, the Maperty server);
Google Web Toolkit (the robot's interface, the Maperty interface);
Google Maps API (the Maperty interface);
Google Geocoding API (converting addresses into map coordinates in the robot);
Google Language API (helps the robot);
Google Safe Browsing API (web proxy; in development);
Yahoo! Finance API (currency conversion in Maperty).
Article based on information from habrahabr.ru
