• Stars
    star
    124
  • Rank 288,207 (Top 6 %)
  • Language
    Java
  • License
    MIT License
  • Created about 11 years ago
  • Updated 11 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

How to spot first stories on Twitter using Storm.

First Story Detection on Twitter using Storm

You can find a detailed description of the code logic at this blogpost.

This project tries to detect first stories (a.k.a new events) on Twitter as they happen.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter combined with the large number of Twitter users prove it suitable for first stories extraction. Towards this direction, this project deals with a distributed real-time first story detection (FSD) using Twitter on top of Storm. Specifically, I try to identify the first document in a stream of documents, which discusses about a specific event. Let’s have a look into the implementation of the methods used.

Implementation summary

Upon a new tweet arrival, the tweet text is split into words and represented into the vector space. Then it is compared with the N tweets it is most likely to be similar to, using a method called Locality Sensitive Hashing. This data clustering method is used because it dramatically reduces the number of comparisons the tweet will need to find the N nearest neighbors and will be explained in detail later below. Having computed the distances with all near neighbor candidates, the tweet with the closest distance is assigned as the nearest. If the distance is above a certain threshold, the new tweet is considered a first story. A detailed version of the summary above will follow in the description of the bolts which act as the logical units of the application.

Documentation

Algorithm's explanation and the code logic can be found on the wiki.

Storm

Storm is a distributed real-time computation system which can guarantee data processing, high fault-tolerance and horizontal scaling to significantly large amounts of input data. It simplifies the implementation of parallel tasks by providing a programming interface suitable for stream processing and continuous computation. Having such a volume of input tweets streamed in real-time, FSD seemed like an ideal use case for Storm framework which can scale up the work by adding more resources.

If Storm is new to you I highly recommend this tutorial to familiarize yourself with the basic elements such as spouts, bolts, functions, groupings, aggregations and generally the streaming logic. The codebase depends heavily on them. However, the logic is the same and can be implemented outside Storm too.

How to run

This project uses Maven to build and run. Install Maven (preferably version 3.x) by following the Maven installation instructions.

Tweets.txt is just a sample file, you should run the crawler first and gather a number of tweets to process. You need to fill in the twitter4j.properties file with your own credentials. You can also connect a Twitter spout similar to this instead of reading off a file.

To run the tweets crawler type the following in the command line: mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=crawler.Crawler

To compile and run the project in local mode, type the following commands while being on the project root directory
mvn package
mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=trident.FirstStoryDetection

To run in production cluster, first package the code into a jar by running
mvn package.
This will package your code into a jar at the path target/First-Story-Detection-{version}-jar-with-dependencies.jar.
Then you can submit your jar to the cluster using the storm client:
storm jar target/First-Story-Detection-1.0-SNAPSHOT-jar-with-dependencies.jar trident.FirstStoryDetection fsd

Please note that in the production cluster mode, you require a DRPCClient to feed the topology with tweets and get results.

Authors

Contributing

Have you found any issues?
Are there any features you want to improve (e.g. english crawler)? Even documentation or testing.

Please contact me at [email protected] or create a new Issue. Pull requests are always welcome.

License

Copyright Β© 2013 Michael Vogiatzis
See License for licensing information