The Twi Journal

For a long time I did not dare to write on Habr, not least because the project was technically unstable. Now that the work is organized (I sincerely hope so) and we have received some recognition in the form of a grant from Yuri Milner and Pavel Durov, I am ready to present the project to the Habr community.

image

My name is Nikita Likhachev, and I want to tell you about The Twi Journal. It is a newspaper based on automatic analysis of Russian-speaking Twitter.

Project idea


The idea is to build a robot that can analyze the stream of the Russian-speaking segments of Twitter, Instagram and Foursquare, then present this content in a convenient form on one site and, to diversify its distribution, send it on to other social networks. Some people are curious about what is happening on Twitter but are reluctant to leave VKontakte or Facebook. Others simply have too little time to follow everything and want to take in the agenda in ten minutes.

Sample objectivity


The project in no way claims absolute objectivity, because we have taken the liberty of excluding from indexing the accounts that post stale jokes and other content carrying no useful information. We also ignore mass followers (those who follow everyone indiscriminately) and people who inflate their rating with bots. The first database was collected by hand, by adding top bloggers to the white list:

If the blogger Navalny is on the white list and passes the "not a mass follower" check, the people he follows are automatically added to our database.

Now the database keeps growing both by hand and automatically, because the robot finds new users in retweets made by accounts already in the database. So far nobody has accused us of incomplete coverage, because an important topic cannot slip past every user in our database.

Data processing


Information


The robot, which we call Adam, collects all tweets from the indexed accounts and divides them into several types: ordinary tweets; tweets with links to third-party resources and media; tweets with links to well-known photo hosting services; and tweets with links to video.
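
To make the split into types more concrete, here is a minimal sketch in Python (the project itself is written in PHP, so this is not Adam's actual code). It assumes links have already been expanded; the photo hosts are the ones named further down in this post with their domain suffixes guessed, and the video hosts are purely an assumption for illustration.

```python
import re
from urllib.parse import urlparse

# Photo hosts follow the list given later in the post; the exact domain
# suffixes and the whole video list are assumptions made for this sketch.
PHOTO_HOSTS = {"twitpic.com", "yfrog.com", "pic.twitter.com",
               "flickr.com", "lockerz.com", "instagr.am"}
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com"}  # assumed

URL_RE = re.compile(r"https?://\S+")


def classify_tweet(text):
    """Return 'video', 'photo', 'link' or 'plain' for a tweet's text."""
    hosts = set()
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        hosts.add(host)
    if hosts & VIDEO_HOSTS:
        return "video"
    if hosts & PHOTO_HOSTS:
        return "photo"
    if hosts:
        return "link"  # any other third-party resource, e.g. a media article
    return "plain"
```

A tweet with several links gets a single type here, with video taking priority over photo; that ordering is a choice of the sketch, not something the robot is known to do.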

Thus, the main page displays popular tweets and parsed links to media articles with their number of mentions, while photos and videos get separate sections:

image

We are constantly trying to come up with algorithms that help readers get the maximum amount of fresh information in a short time. For video, for example, we limit the upload date so that the most recent clips are displayed first. The robot also tracks Twitter reactions to each video and displays them as comments:

image

User rating


Using our database, we aim to build a reasonably objective rating of Russian microbloggers, split into users, corporate accounts and media. The rating combines several indicators in a single formula: the average number of mentions of the user, the retweets of their posts, and the number of followers in relation to the number of followings/lists in which the user is included.
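
The post does not give the formula itself, so the following is only an illustration of how such indicators could be folded into one number. The weights, the field names and the linear form are all hypothetical, and lists are left out because the text does not say how they enter the calculation.

```python
from dataclasses import dataclass


@dataclass
class Account:
    avg_mentions: float  # average number of mentions of the user
    avg_retweets: float  # average number of retweets of the user's posts
    followers: int
    following: int


def rating(account, w_mentions=1.0, w_retweets=1.0, w_ratio=1.0):
    """Hypothetical weighted sum; not the formula used by The Twi Journal."""
    # Followers relative to followings: a mass follower ends up with a low ratio.
    ratio = account.followers / max(account.following, 1)
    return (w_mentions * account.avg_mentions
            + w_retweets * account.avg_retweets
            + w_ratio * ratio)
```

With equal activity, an account that follows ten times more people than follow it scores lower than one with the opposite ratio, which is the effect the followers-to-followings indicator is meant to capture.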

All Twitter ratings can be divided into two kinds: those that require you to log in to participate and those that do not. The first are considered more objective because they have access to information about mentions and retweets of the user. But they have a major drawback: the most popular bloggers never authenticate in them, either out of distrust or because they see no need. The second kind does not have this drawback but is rarely objective, since it is almost always based only on the number of followers, tweets and, perhaps, the age of the account. We have tried to combine the best of both kinds of rating.
image

Foursquare place rating


The rating is built in real time: it shows the places that are popular in the city right now. It is calculated as follows: every 25 minutes a robot starts and builds a grid of points inside predefined city borders (for Moscow only the center and a couple of kilometers around it are checked). For each point, popular places within a radius of two kilometers are requested through the Foursquare API.
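
The grid step can be sketched roughly like this. The bounding box in the usage note, the 2 km spacing and the function name are assumptions for illustration; the actual Foursquare request for each point is deliberately left out.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def build_grid(south, west, north, east, step_km=2.0):
    """Return (lat, lon) points covering a bounding box at roughly step_km spacing."""
    points = []
    lat_step = step_km / KM_PER_DEGREE_LAT
    lat = south
    while lat <= north:
        # One degree of longitude shrinks with the cosine of the latitude.
        lon_step = step_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
        lon = west
        while lon <= east:
            points.append((round(lat, 5), round(lon, 5)))
            lon += lon_step
        lat += lat_step
    return points
```

For example, `build_grid(55.70, 37.55, 55.80, 37.70)` covers a box roughly around central Moscow. With a search radius of two kilometers per point, a spacing of about two kilometers makes neighboring circles overlap, so no place between points is missed, at the cost of seeing some places more than once.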

image

A little about the technology


The project currently runs on a single server. The whole project (including the daemons) is written in PHP. We use MySQL and MongoDB (for the spots where write speed is critical); the insert performance of InnoDB is more than enough for us, and most selects from the database are cached with memcached. In general, memcached is ideal for us, since we have to process a large amount of data that can be cached without loss of timeliness. This made it possible to reduce the generation time of the main page to 40 ms (I am afraid to predict how the site will behave under a probable Habr effect).
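
The caching of selects follows the usual "look in the cache first, query on a miss" pattern. Below is a minimal sketch of that pattern in Python, with a small in-memory class standing in for memcached; the class, the function names and the TTL values are assumptions, not the project's code.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for memcached with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


def cached_query(cache, key, ttl, run_query):
    """Return the cached result for key, running the query only on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

Note that a query returning `None` is never cached in this sketch, so it would hit the database every time.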

Recently we began using Gearman to parallelize tasks such as processing tweets and calculating the rating, and to run background tasks such as saving images to Amazon S3.

Robot Adam checks for feed updates at intervals of 15 to 180 minutes, depending on the time of day. Since materials gain popularity gradually rather than all at once, it is important for us to keep tracking them for some time after publication. It is at this point that we parse each tweet into components: text, links, images and videos. All links are expanded if they are shortened, and their content is transformed by functions similar to Reader in Safari (in the style of Readability).
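
As a rough illustration of the first part of that parsing, the sketch below separates a tweet's text from its links. The function name and the returned structure are assumptions; expanding shortened links and sorting them into images and videos would need network requests and per-host handlers, so they are not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def split_tweet(raw):
    """Split a raw tweet into its plain text and the list of links it contains."""
    links = []
    for match in URL_RE.findall(raw):
        url = match.rstrip(".,;:!?)")  # drop punctuation glued to the link
        if url not in links:           # keep the first occurrence only
            links.append(url)
    text = " ".join(URL_RE.sub("", raw).split())
    return {"text": text, "links": links}
```

For instance, `split_tweet("Rally today http://example.com/a photos http://twitpic.com/b.")` yields the text `"Rally today photos"` and the two links without the trailing period.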

Image processing supports the photo hosting services twitpic, yfrog, pic.twitter.com, flickr, lockerz and instagr.am. For each one we wrote a simple API handler that finds the preview picture, the author and the caption. For some photo hosts we had to use undocumented features. Fortunately, programmers quite often think alike, especially when naming methods and their parameters.

image

Development plans


We are now running a variety of experiments. For example, we plan to launch The Twi Football. In this project we want to try live coverage of matches based on analysis of Russian-speaking Twitter. The project will be a testing ground for technologies that we will later use in the main project: the server receives fans' opinions directly from Twitter through the Streaming API (new tweets with the teams' hashtags will appear on our site faster than on Twitter's own search page).

In our free time we play around with our logo:

image

But seriously, we want to try scaling the project to other countries. We will start, of course, with the USA (we have bought the domain twijournal.com). If it takes off there, we will move on to other countries. Time is short, because the money given to us by Durov and Milner is running out quite fast, even though we are not living lavishly.

In our wildest dreams we will be able to build similar media on top of other social networks and then combine them all into one big content aggregator. But for now these are just dreams.

The Twi Journal

P.S. What if a developer or a journalist from another country who wants to work with us is reading this post? Just in case, we leave our email here: editors@tjournal.ru
Article based on information from habrahabr.ru
