Twitter Tools

This repo holds a collection of tools for the TREC Microblog tracks, which officially ended in 2015. The track mailing list can be found at [email protected].

Archival Documents

API Access

The Microblog tracks in 2013 and 2014 used the "evaluation as a service" (EaaS) model, where teams interact with the official corpus via a common API. Although the evaluation has ended, the API is still available for researcher use.

To request access to the API, follow these steps:

  1. Fill out the API usage agreement.
  2. Email the usage agreement to [email protected].
  3. After NIST receives your request, you will receive an access token from NIST.
  4. The code for accessing the API can be found in this repository. The endpoint of the API itself (i.e., hostname and port) will be provided by NIST.

Getting Started

The main Maven artifact for the TREC Microblog API is twitter-tools-core. The latest releases of Maven artifacts are available at Maven Central.

You can clone the repo with the following command:

$ git clone https://github.com/lintool/twitter-tools.git

Once you've cloned the repository, change directory into twitter-tools-core and build the package with Maven:

$ cd twitter-tools-core
$ mvn clean package appassembler:assemble

For more information, see the project wiki.

Replicating TREC Baselines

One advantage of the TREC Microblog API is that it is possible to deploy a community baseline whose results are replicable by anyone. The raw results are simply the output of the API unmodified. The baseline results are the raw results that have been post-processed to remove retweets and break score ties by reverse chronological order (earliest first).
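The post-processing described above can be sketched as follows. This is a minimal illustration, not the repository's RunQueriesBaselineThrift code: the `Hit` class, its `retweet` flag, and the use of tweet ids as a proxy for posting time are all assumptions made for the example, and the tie-break direction is exposed as a parameter rather than fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the baseline post-processing; not the code in this repository.
class BaselineSketch {

    // Hypothetical stand-in for one result returned by the API.
    static final class Hit {
        final long id;          // tweet id, assumed to increase with posting time
        final double score;     // retrieval score
        final boolean retweet;  // assumed flag marking a retweet

        Hit(long id, double score, boolean retweet) {
            this.id = id;
            this.score = score;
            this.retweet = retweet;
        }
    }

    // Drops retweets, then sorts by descending score. Hits with equal scores
    // are ordered by tweet id: newest first or earliest first, as requested.
    static List<Hit> postProcess(List<Hit> raw, boolean newestFirst) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : raw) {
            if (!h.retweet) {
                kept.add(h);
            }
        }
        Comparator<Hit> byScoreDesc = Comparator.comparingDouble((Hit h) -> h.score).reversed();
        Comparator<Hit> byId = Comparator.comparingLong((Hit h) -> h.id);
        kept.sort(byScoreDesc.thenComparing(newestFirst ? byId.reversed() : byId));
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> raw = Arrays.asList(
            new Hit(101L, 2.5, false),
            new Hit(205L, 2.5, false),  // same score as 101, so the tie-break applies
            new Hit(150L, 3.1, true),   // retweet, removed
            new Hit(120L, 1.0, false));
        for (Hit h : postProcess(raw, true)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Calling `postProcess(raw, true)` on the sample list in `main` drops the retweet and places the newer of the two tied tweets ahead of the older one; passing `false` reverses the order of the tied pair only.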

To run the raw results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.raw.txt

And to run the baseline results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesBaselineThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.baseline.txt

Note that trec_eval is included in twitter-tools/etc (it just needs to be compiled), and the qrels are stored in twitter-tools/data (they just need to be uncompressed), so you can evaluate as follows:

../etc/trec_eval.9.0/trec_eval ../data/qrels.microblog2011.txt run.microblog2011.raw.txt

Similar commands will allow you to replicate runs for TREC 2012 and TREC 2013. With trec_eval, you should get exactly the following results:

MAP         raw      baseline
TREC 2011   0.3050   0.3576
TREC 2012   0.1751   0.2091
TREC 2013   0.2044   0.2532
TREC 2014   0.3090   0.3924

P30         raw      baseline
TREC 2011   0.3483   0.4000
TREC 2012   0.2831   0.3311
TREC 2013   0.3761   0.4450
TREC 2014   0.5145   0.6182
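
For orientation, the two measures reported here are mean average precision (MAP) and precision at rank 30 (P30). The sketch below shows how each is computed for a single topic; it is an illustration only, and trec_eval remains the reference implementation. The class and method names are hypothetical, relevance is treated as binary, and the ranking is assumed to contain no duplicate document ids. MAP is the mean of the per-topic average precision over all topics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of per-topic precision at k and average precision.
class MetricsSketch {

    // Precision at cutoff k: relevant documents among the top k, divided by k.
    // A ranking shorter than k is treated as padded with non-relevant documents.
    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranking.size()); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    // Average precision: sum of the precision at each rank holding a relevant
    // document, divided by the total number of relevant documents for the topic.
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranking = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d9"));
        System.out.println("P@2 = " + precisionAt(ranking, relevant, 2));
        System.out.println("AP  = " + averagePrecision(ranking, relevant));
    }
}
```

In the sample data, two of the three relevant documents are retrieved (at ranks 1 and 3), so the average precision is (1/1 + 2/3) / 3 = 5/9; the unretrieved relevant document lowers the score because the denominator counts it.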

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work is supported in part by the National Science Foundation under award IIS-1218043. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the National Science Foundation.
