Description of the corpus built in the CAST project
The corpus we built is an enhanced annotated corpus which differs from the majority of available resources in the amount of information it contains. In addition to marking the importance of sentences, we indicate which parts can be removed from sentences marked as essential or important, and we use a separate label for sentences that are not significant enough to be marked as important in their own right, but that must be considered because they contain information essential for understanding the content of other sentences marked as essential/important. These two types of information give an insight into the conciseness and coherence of summaries, respectively. Our corpus also contains annotations for linked sentences, where both sentences are considered important or essential, but one relies on the other to be completely understood.
The texts included in our corpus were taken from the Reuters Corpus, supplemented by a few popular science texts from the British National Corpus. The following table summarises the statistics of our corpus.
|                                                 | Reuters | Popular science (BNC) | Total   |
|-------------------------------------------------|---------|-----------------------|---------|
| No. of texts selected                           | 147     | 16                    | 163     |
| No. of texts annotated by at least 2 annotators | 31      | 12                    | 43      |
| No. of texts annotated by at least 3 annotators | 7       | -                     | 7       |
| Total words annotated                           | 117,378 | 28,095                | 145,473 |
| Total sentences annotated                       | 5,214   | 1,370                 | 6,584   |
The texts with multiple annotations were used to measure interannotator agreement. We found that the annotators agreed less on which particular sentences to mark as important than on the information contained in the sentences they did mark. Several sets of texts in the corpus deal with the same story, so the annotations on these have been analysed separately to study how features such as the order of information and the space devoted to it affect the annotation process. We found that both lexical and structural choices within the full texts have a definite impact on which information is marked as important. For the annotation process we used four annotators: three graduate students and one postgraduate. Three of the annotators were native English speakers, and the fourth had advanced knowledge of English. Before the annotation process began, the annotation guidelines (written specifically for the annotation of this corpus) were explained to the annotators. For more information about the results obtained from an analysis of the corpus, see Hasler et al. (2003).
The corpus is encoded in XML. Given that an XML-encoded file can be quite difficult to read and annotate directly, we used a multi-purpose annotation tool, PALinkA. In addition to helping with the marking process, the tool indicates throughout annotation what proportion of the file has been selected, and records the time taken to annotate a file. The tool is easy to use, even for non-experts: to mark a unit of text, the annotator uses the mouse to indicate the boundaries of the unit, the tag assigned to it, and whatever attributes the tag requires. To avoid errors, some attributes, such as unique IDs and references, are determined automatically by the tool.
There are three types of information marked in the corpus: the importance of a sentence (essential or important), links between sentences, and sections which can be removed from marked sentences. The importance of a sentence is marked using the <EXTRACT ID="XXX"> tag, where the ID attribute identifies the tag uniquely. This tag has another attribute, IMP, which indicates the importance of the sentence and can take the values ESSENTIAL, for sentences considered essential by the annotators, and IMPORTANT, for those deemed important. The IMP attribute also has a third value, REFERRED, which indicates that a sentence is neither essential nor important, but is required for the comprehension of a marked sentence. A sentence that is not marked in any way is considered unimportant. Links between sentences are marked by an empty tag, <LINK REFERRED="XXX"/>, which indicates that the EXTRACT tag containing it is linked to the EXTRACT tag whose ID is given by the REFERRED attribute. Sections of the selected sentences which are redundant or irrelevant are indicated by the <REMOVE> tag. All these tags have an optional COMMENT attribute, where the annotators can provide comments on the annotation process.
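The scheme described above can be illustrated with a short, hypothetical fragment. The sentence text and ID values below are invented for illustration; only the tag and attribute names (EXTRACT, IMP, LINK, REMOVE) come from the annotation scheme, and the helper functions are merely a sketch of how such annotations might be consumed, not part of the corpus tools.

```python
# Sketch of reading the CAST annotation scheme with Python's standard library.
# The sample text and IDs are invented; tag/attribute names follow the scheme.
import xml.etree.ElementTree as ET

SAMPLE = """
<TEXT>
  <EXTRACT ID="1" IMP="ESSENTIAL">The company reported record profits,
    <REMOVE>according to a statement released on Tuesday</REMOVE>.</EXTRACT>
  <EXTRACT ID="2" IMP="IMPORTANT"><LINK REFERRED="3"/>Its shares rose sharply
    after the announcement.</EXTRACT>
  <EXTRACT ID="3" IMP="REFERRED">The announcement was made in London.</EXTRACT>
</TEXT>
"""

root = ET.fromstring(SAMPLE)

def summary_sentences(root, keep=("ESSENTIAL", "IMPORTANT")):
    """Collect marked sentences, dropping <REMOVE> spans (conciseness)."""
    out = []
    for ex in root.iter("EXTRACT"):
        if ex.get("IMP") not in keep:
            continue
        parts = [ex.text or ""]
        for child in ex:
            if child.tag != "REMOVE":       # keep text of other children
                parts.append(child.text or "")
            parts.append(child.tail or "")  # text after the child tag
        out.append(" ".join("".join(parts).split()))
    return out

def linked_ids(root):
    """Map each EXTRACT ID to the IDs it depends on via <LINK> (coherence)."""
    links = {}
    for ex in root.iter("EXTRACT"):
        refs = [link.get("REFERRED") for link in ex.findall("LINK")]
        if refs:
            links[ex.get("ID")] = refs
    return links
```

In this sketch, `summary_sentences(root)` returns the essential and important sentences with their <REMOVE> spans stripped, while `linked_ids(root)` shows that sentence 2 depends on the REFERRED sentence 3 for its comprehension.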