The corpus

The corpus used for the project on Language, Computers and Style

This corpus was built from several sources. For copyright reasons, some areas are restricted, which means that you will not be able to see the full texts. To see the texts, please click here to register. Once registered, you will also be able to build frequency lists. Because of the amount of computation required, we cannot grant full access to people who are not involved in this project. If you require a certain list, please contact Constantin Orasan.


The structure of the corpus


Newspaper articles about business and commerce
Files from BNC and WBE were included in this category. They include business articles from newspapers or sites specialised in business news distribution. In some cases a file from BNC contains more than one article.
Newspaper articles about science
Files from BNC were used for this section. They are news items about scientific discoveries and articles that popularise science. In most cases a file contains more than one article.
Newspaper articles about politics
Files from BNC were used for this section. They include political news.
Scientific papers discussing methods
The main source for this section was JAIR. The files in this category are journal articles that propose new methods for different problems.
Scientific papers which review and evaluate methods
Files from BNC and JAIR were included. They discuss and evaluate scientific methods. The two categories of scientific papers are kept separate because a paper that puts forward a method is slightly different from one that mainly evaluates methods or reviews the literature of a domain.
Business reports and letters to shareholders
Files from WBE and BNC were used for this category. They are annual reports and letters to shareholders. Some of the files could be included in Newspaper articles about business and commerce. However, most of them were not meant for publication in newspapers; instead, they appear in the news sections of different companies' websites.
Leaflets
Files from BNC were used. Most of them are advertising and instructional leaflets.
Instructional texts and manuals
Files from BNC were included. They are instructional and DIY texts. In the case of magazine articles, several are included in a file.

Click here for the list of files
For different frequency lists click here
To compare two different lists click here
To compare several lists click here


For n-gram lists click here
To compare several n-gram lists click here







Page designed and maintained by Constantin Orasan