Repository Details

Scientific Document Summarization Corpus and Annotations from the WING NUS group.

README

**LaySumm is NOT covered by the CC BY 4.0 license. Please do not email us about Elsevier's LaySumm; we are unable to respond.**


This work is licensed under a Creative Commons Attribution 4.0 International License, EXCEPT for the following files, which are closed source under strict copyright enforced by Elsevier Labs. We hold no accountability for these:


(scisumm-corpus @ https://github.com/WING-NUS/scisumm-corpus)

This package contains a release of training and test topics to aid in the development of computational linguistics summarization systems.

CL-SciSumm

The CL-SciSumm Shared Task is run on the CL-SciSumm corpus and is composed of three sub-tasks in automatic research paper summarization on a new corpus of research papers. A training corpus with summaries for 1,040 topics, and 40 topics for the citance-to-reference-span identification (provenance identification) tasks, has been released. A test corpus of 20 topics is held out as a blind test set. The topics comprise ACL Computational Linguistics and Natural Language Processing research papers, their citing papers, and three output summaries each. The three output summaries are: the traditional authors' summary of the paper (the abstract), the community summary (the collection of citation sentences, or 'citances'), and a human summary written by a trained annotator. Within the corpus, each citance is also mapped to its referenced text in the reference paper and tagged with the information facet it represents.

The manually annotated training set of 40 articles and their citing papers (Tasks 1a and 1b), human-written summaries for them (1,040 documents), and a further 1,000-document corpus (ScisummNet), an auto-annotated noisy dataset with several thousand article and citing-paper pairs (to aid in training deep learning models), are readily available for download and can be used by participants. This data can be found in /data/Training-Set-2019/Task1/From-Training-Set-2018 and /data/Training-Set-2019/Task2/From-Training-Set-2018.

The last edition of CL-SciSumm was CL-SciSumm 2020. The gold test data used for 2020, 2019, and 2018 is now publicly available in this repo. You are welcome to use it for your evaluations and for paper submissions to any conference or journal, or in your theses.

In 2020, we did not add any new training data.

In 2019, we introduced 1,000 document sets that were automatically annotated for use as training data. This training data was generated following Nomoto (2018) and can be found in /data/Training-Set-2019/Task1/From-ScisummNet-2019. Note that the auto-annotated data is available only for Task 1a; no discourse facet is provided for the classification task (Task 1b). We recommend using the auto-annotated data only for training 'reference span selection' models for Task 1a, and the manually annotated training data from the 40 document sets for Task 1b.

Further, for Task 2, the 1,000 summaries released as part of ScisummNet (Yasunaga et al., 2019) have been included as human summaries to train on. This data can be found in /data/Training-Set-2019/Task2/From-ScisummNet-2019.

The test set of 20 articles is available in /data/Test-Set-2018. This is a blind test set; that is, the ground truth is withheld. System outputs on the test set should be submitted to the task organizers for collation of the final results presented at the workshop.
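Given the directory layout above, a small helper can gather the key files for one training topic. This is only a sketch: the `topic_files` helper is hypothetical, and the sub-directory names (Reference_XML, Citance_XML, Annotation) and the .annv3.txt suffix are taken from this README; check your local checkout for exact casing.

```python
import tempfile
from pathlib import Path

def topic_files(topic_dir):
    """Collect the key files for one CL-SciSumm training topic.

    Sub-directory names and the .annv3.txt annotation suffix follow the
    layout described in this README; treat this as an illustrative loader,
    not official tooling.
    """
    topic_dir = Path(topic_dir)
    return {
        "reference_xml": sorted((topic_dir / "Reference_XML").glob("*.xml")),
        "citance_xml": sorted((topic_dir / "Citance_XML").glob("*.xml")),
        "annotations": sorted((topic_dir / "Annotation").glob("*.annv3.txt")),
    }

# Demo on a throwaway directory mimicking one topic (IDs are made up):
demo = Path(tempfile.mkdtemp()) / "C00-2123"
for sub, name in [("Reference_XML", "C00-2123.xml"),
                  ("Citance_XML", "W04-0101.xml"),
                  ("Annotation", "C00-2123.annv3.txt")]:
    (demo / sub).mkdir(parents=True)
    (demo / sub / name).touch()
files = topic_files(demo)
```

Iterating this helper over each topic directory under, e.g., /data/Training-Set-2019/Task1/From-Training-Set-2018 yields the inputs needed for Tasks 1a and 1b.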

For more details, see the Contents section at the bottom of this README. To learn how this corpus was constructed, please see ./docs/corpusconstruction.txt

Last edition's proceedings:

If you use the data and publish, please let us know and cite our CL-SciSumm 2019 task overview paper:

@inproceedings{,
  title={Overview and Results: CL-SciSumm Shared Task 2019},
  author={Chandrasekaran, Muthu Kumar and Yasunaga, Michihiro and Radev, Dragomir and Freitag, Dayne and Kan, Min-Yen},
  booktitle={Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2019)},
  year={2019}
}

Overview

CL-SciSumm ran as a shared task at EMNLP 2020, SIGIR 2019, 2018, and 2017, and JCDL 2016, with the Pilot Task conducted as part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014).

The task is on automatic paper summarization in the Computational Linguistics (CL) domain. The output summaries are of two types: faceted summaries of the traditional self-summary (the abstract) and the community summary (the collection of citation sentences, or 'citances'). We also group the citances by the facets of the text that they refer to.

The task is defined as follows:

Given: A topic consisting of a Reference Paper (RP) and up to 10 Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified.

  • Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).
  • Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.
  • Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.
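The task input/output can be sketched with a few lightweight records. All class and field names here are illustrative, not taken from any official tooling: a citance carries its Task 1a output (the cited RP sentence IDs) and its Task 1b output (the facet label).

```python
from dataclasses import dataclass, field

@dataclass
class Citance:
    """One citation text span found in a citing paper (CP)."""
    cp_id: str                    # ID of the citing paper
    text: str                     # the citance text as found in the CP
    cited_spans: list = field(default_factory=list)  # Task 1a: RP sentence IDs
    facet: str = ""               # Task 1b: one of the predefined facets

@dataclass
class Topic:
    """A reference paper (RP) plus the citances pointing at it."""
    rp_id: str
    citances: list

# Example (made-up IDs): one citance mapped to two consecutive RP
# sentences and tagged with a "method" facet.
t = Topic("C00-2123", [
    Citance("W04-0101", "...as shown by X et al. ...", [12, 13], "method"),
])
```

Task 2 would then consume the collected `cited_spans` across all citances of a topic to build a structured summary of at most 250 words.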

Evaluation: Task 1 is scored by the overlap of text spans, measured in number of sentences, between the system output and the gold standard. Task 2 is scored using the ROUGE family of metrics between (i) the system output and the gold-standard summary built from the reference spans, and (ii) the system output and the abstract of the reference paper.
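The sentence-overlap scoring for Task 1 can be sketched as precision/recall/F1 over sentence IDs. This is a minimal illustration of the idea; the official evaluation scripts may differ in details such as micro- vs. macro-averaging across citances.

```python
def task1a_scores(system_sents, gold_sents):
    """Sentence-level overlap between system and gold cited text spans.

    A sketch of the scoring idea described above: overlap is counted in
    sentences, from which precision, recall, and F1 are derived.
    """
    sys_set, gold_set = set(system_sents), set(gold_sents)
    overlap = len(sys_set & gold_set)
    precision = overlap / len(sys_set) if sys_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# System selected RP sentences {12, 13, 14}; gold span is {13, 14}.
p, r, f = task1a_scores({12, 13, 14}, {13, 14})
```

Here two of the three system sentences are correct and the gold span is fully covered, so precision is 2/3, recall is 1.0, and F1 is 0.8.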

Contents

This is the open repository for the Scientific Document Summarization Corpus and Annotations contributed to the public by the Web IR / NLP Group (WING-NUS) at the National University of Singapore, with generous support from Microsoft Research Asia.

./README.md

This file.

./FAQ2018

Frequently asked questions on the 2018 shared task, including updates to the corpus and annotation format from the previous edition.

./README2014.md
./README2016.md
./README2017.md
./README2018.md
./README2019.md
./README2020.md

README files for the previous editions of the shared task hosted at BIRNDL@SIGIR 2018, BIRNDL@SIGIR 2017, BIRNDL@JCDL 2016 and TAC 2014.

./docs/corpusconstruction.txt

A readme detailing the rules and steps followed to create the document corpus by randomly sampling documents from the ACL Anthology corpus and selecting their citing papers.

./docs/annotation_naming_convention.txt

Describes the naming convention followed to identify annotation files for each training topic in ./data/???-????_TRAIN/Annotation/

./docs/annotation_rules.txt

Rules followed to resolve difficult cases in annotation. It can serve as a synopsis of the larger annotation guidelines. For the detailed annotation guidelines, please refer to the details hosted at http://www.nist.gov/tac/2014/BiomedSumm/

./docs/sources/*.csv

References for each of the papers for each of the topics, one file per topic.

./data/Training-Set-2019/Task?/From-Training-Set-2018/???-????
./data/Training-Set-2019/Task?/From-ScisummNet-2019/???-????

Directories containing the Documents, Summaries, and Annotations for each topic, one directory per Topic ID.

./data/Training-Set-2019/Task1/From-Training-Set-2018/???-????/Documents_PDF/

This directory contains the source documents for the topic (1 RP and up to 10 CPs), one file per paper, in the original PDF format.

./data/Training-Set-2019/Task1/From-Training-Set-2018/???-????/Reference_XML/
./data/Training-Set-2019/Task1/From-ScisummNet-2019/???-????/Reference_XML/

This directory contains the source document for the RP of the topic, in XML format with UTF-8 character encoding. The file corresponds to the similarly named PDF file in Documents_PDF/. All annotations and offsets for the topic are with respect to the XML files in this directory. All the files were created from the PDFs using Adobe Acrobat.
Note that there were OCR errors in reading several of the files, and the annotators often had to manually edit the converted text files. Research groups are free to use alternative parsing tools on the provided PDFs if they are found to perform better.
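Since all annotation offsets refer to these XML files, a minimal loader can map sentence IDs to sentence text. This sketch assumes sentence elements of the form `<S sid="...">` as in the released files; verify the element and attribute names against your copy of the corpus.

```python
import xml.etree.ElementTree as ET

def load_sentences(xml_text):
    """Map sentence ID (sid) -> sentence text for a Reference_XML document.

    Assumes <S sid="..."> sentence elements anywhere under the root,
    which is the markup this sketch expects in the released XML files.
    """
    root = ET.fromstring(xml_text)
    return {s.get("sid"): (s.text or "").strip() for s in root.iter("S")}

# Tiny synthetic example in the assumed markup:
sample = ('<PAPER><S sid="0">A title.</S>'
          '<ABSTRACT><S sid="1">First sentence.</S></ABSTRACT></PAPER>')
sents = load_sentences(sample)
```

With such a mapping, a cited text span given as a list of sentence IDs can be resolved back to the RP text it covers.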

./data/Training-Set-2019/Task1/From-Training-Set-2018/???-????/CITANCE_XML/

This directory contains the source document for the CPs of the topic in xml format in UTF-8 character encoding. Each file corresponds to the similarly named pdf file above.

./data/Training-Set-2019/Task1/From-Training-Set-2018/???-????/Annotation/
./data/Training-Set-2019/Task1/From-ScisummNet-2019/???-????/Annotation/

This directory contains the annotation files for the topic, from 3 different annotators.
Please DO NOT use older annotations; only use .annv3.txt for the 2016 Shared Task.
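A record in these annotation files can be read with a small parser. This is a hedged sketch assuming the pipe-separated `Field Name: value` line format of the released .annv3.txt files; field names and exact formatting may vary across editions, so inspect a file before relying on it.

```python
def parse_annotation_line(line):
    """Split one pipe-separated annotation record into a field dict.

    Assumes records of the form "Field: value | Field: value | ...";
    only the first colon in each segment separates the field name from
    its value.
    """
    fields = {}
    for part in line.split(" | "):
        if ":" in part:
            key, _, value = part.partition(":")
            fields[key.strip()] = value.strip()
    return fields

# Synthetic record with illustrative field names and made-up IDs:
rec = parse_annotation_line(
    "Citance Number: 1 | Reference Article: C00-2123.txt | "
    "Discourse Facet: Method_Citation"
)
```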

./data/Training-Set-2019/Task2/From-Training-Set-2018/???-????/summary/
./data/Training-Set-2019/Task2/From-ScisummNet-2019/???-????/summary/

The summary task (Task 2) is an optional "bonus" task which participants may want to attempt. This directory contains the two kinds of summaries: (i) the abstract, and (ii) human-written summaries of the reference paper.

Annotation

Given a reference paper (RP) and 10 or more citing papers (CPs), annotators from the University of Hyderabad were instructed to find citations to the RP in the CPs. Annotators followed the instructions in SciSumm-annotation-guidelines.pdf to mark the Citation Text, Citation Marker, Reference Text, and Discourse Facet for each citation of the RP found in a CP.

Organisers' Contacts

Please open GitHub issues for further information or to report a bug or a fix for the corpus.

Contacts for maintainers of the corpus:

We provide links for Laysumm here and are not associated with it:

README for Lay Summarization Task (LaySumm 2020)

The task description and a sample dataset can be found here in this GitHub repo.

LaySumm


This README was updated from README2020 by Muthu Kumar Chandrasekaran in 2021. For revision information, check source code control logs.

More Repositories

1. sequicity (Python, 154 stars): Source code for the ACL 2018 paper "Sequicity: Simplifying Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures" by Wenqiang Lei et al.
2. JD2Skills-BERT-XMLC (Python, 52 stars): Code and dataset for Bhola et al. (2020), "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework".
3. SWING (Ruby, 39 stars): The Summarizer from the Web IR / NLP Group (WING), hence SWING, is a modular, state-of-the-art automatic extractive text summarization system. It is used as the basis for summarization research at the National University of Singapore and performs as one of the leading automatic summarization systems in the international TAC competition, getting high marks for the ROUGE evaluation measure.
4. cs6101 (JavaScript, 37 stars): The Web IR / NLP Group (WING)'s public reading group at the National University of Singapore.
5. slsql (Python, 26 stars): Code for the EMNLP 2020 paper "Re-examining the Role of Schema Linking in Text-to-SQL".
6. SciAssist (Python, 18 stars).
7. SSID (Java, 18 stars): Student Submission Integrity Diagnosis.
8. Kairos (Java, 18 stars): Kairos combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with metadata fields for individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. It uses a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and Conditional Random Fields (CRFs) to extract individual paper metadata from such websites. The crawler is built on top of the popular open-source crawler Nutch.
9. Prastava (Ruby, 12 stars): 100% pure Ruby recommendation system (CF/CBF/hybrid).
10. ELCo (Python, 11 stars): The dataset and official implementation for "The ELCo Dataset: Bridging Emoji and Lexical Composition" @ LREC-COLING 2024.
11. JavaRAP (9 stars): An implementation of the classic Resolution of Anaphora Procedure (RAP) given by Lappin and Leass (1994). It resolves third-person pronouns and lexical anaphors, and identifies pleonastic pronouns. The original purpose of the implementation was to provide anaphora resolution results to our TREC 2003 Q&A system.
12. ResearchTrends (Python, 8 stars): Source code for the COLING 2018 paper "Identifying Emergent Research Trends by Key Authors and Phrases" by Shenhao Jiang et al.
13. RelatedWorkSummarizationDataset (HTML, 6 stars): Dataset for the paper: Cong Duy Vu Hoang and Min-Yen Kan (2010), "Towards Automated Related Work Summarization", in Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, pp. 427-435.
14. RAZ (Perl, 6 stars): Robust Argumentative Zoning, the home page for the argumentative zoning for raw text project, a collaboration between NUS and the University of Cambridge. This work follows Teufel's thesis to zone (label) sentences with six different rhetorical functions for scholarly discourse. The download comes with Teufel's original analysis and markup of 80 cmp-lg articles.
15. ChairVisE (Vue, 5 stars): To be edited.
16. ir-seminar (HTML, 5 stars): The Web IR / NLP Group (WING)'s IR Seminar at the National University of Singapore.
17. discoling (Python, 5 stars): Source code for the AAAI 2018 paper "Linguistic Properties Matter for Implicit Discourse Relation Recognition: Combining Semantic Interaction, Topic Continuity and Attribution" by Wenqiang Lei et al.
18. texWordCount (TeX, 5 stars): A Perl script to help count words in LaTeX. LGPL.
19. SciSWING (Ruby, 4 stars): Scientific document summarizer from the Web IR / NLP Group (WING), NUS.
20. chatongpt (CSS, 4 stars): Chat on GPT public event, 18 April 2023.
21. PyTorchCRF (Python, 4 stars): A work-in-progress repository to develop a stand-alone lightweight CRF layer in PyTorch.
22. nlp-seminar (JavaScript, 4 stars): The Web IR / NLP Group (WING)'s NLP Seminar at the National University of Singapore.
23. RSScrawler-1 (Python, 2 stars).
24. WING-LDA (Python, 2 stars): LDA group project.
25. FCKeyphrase (C++, 2 stars): SciVerse application for document keyphrase extraction.
26. Word-News-Android (JavaScript, 2 stars): Word News Android client.
27. WordNews (Python, 2 stars): WordNews Chinese/English language learning system with a Chrome extension and Android backends.
28. search-engine-wrapper (Java, 2 stars): A Java wrapper framework for unifying programmatic access to search engines. A convenience class is also included for downloading the files at the URLs in the search engine results. The package contains an API as well as a command-line application.
29. cubit (JavaScript, 1 star): Google Scholar analytics package (server backend and embeddable JavaScript).
30. SSNLP-2019 (CSS, 1 star): Static Jekyll website for the Singapore Symposium on Natural Language Processing.
31. ACL-Anthology-Codebase (Perl, 1 star): Script and code for running the older version of the ACL Anthology.
32. PDTB-scorer (Java, 1 star).
33. domadapter (Python, 1 star).
34. Elsevier-KP (JavaScript, 1 star): Keyphrase extraction (base version for Elsevier; Elsevier-KP). Link below not working yet.
35. NeuralQuestionGeneration (1 star): WING-NUS (Pan Liangming's) re-implementation of Serban et al.'s 2016 ACL work "Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus".
36. DICOMER (Ruby, 1 star): DIscourse COherence Model for Evaluating Readability, a package for evaluating the coherence of text using a discourse matrix representation augmented with discourse hierarchy structure. Part of the deliverables from Lin et al.'s 2012 ACL paper "Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation".
37. TESLA-S (Java, 1 star): TESLA-S: Evaluating Summary Content. An adaptation of the popular TESLA evaluation metric for summarization content evaluation. Part of the deliverables from Lin et al.'s 2012 ACL paper "Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation".
38. ETD-Parsing (Python, 1 star): Repository for a shared project on electronic theses and dissertation parsing, a collaboration between Virginia Tech, Old Dominion University, and the National University of Singapore.