The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora

Bibliographic Details
Published in:	Language Resources and Evaluation, Vol. 43, no. 3, pp. 209-226
Main Authors:	Baroni, Marco; Bernardini, Silvia; Ferraresi, Adriano; Zanchetta, Eros
Format:	Journal Article
Language:	English
Published:	Dordrecht: Springer Netherlands (Springer Nature B.V.), 01-09-2009
Subjects:	Adjectives; Annotated corpora; Applied linguistics; Computational linguistics; Computer Science; Corpus construction; Corpus linguistics; Dictionaries; English; English language; General-purpose linguistic resources; German; German language; Internet; Italian; Italian language; Italy; Language; Language and Literature; Languages; Linguistic resources; Linguistics; Nouns; Search engines; Social Sciences; United Kingdom; URLs; Vocabulary; WaCky; Web as corpus; Webs; Websites; Words

Description
Summary:	This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
ISSN:	1574-020X 1572-8412 1574-0218
DOI:	10.1007/s10579-009-9081-4