# The PROIEL treebank
The PROIEL Treebank is a treebank of ancient Indo-European languages, including Latin and Ancient Greek. It uses a refined version of dependency grammar and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. On this site you will find our official, versioned releases of the treebank and pointers to further information.
The PROIEL Treebank is one of three treebanks that use the same annotation system, follow the same principles and are available under the same license. The PROIEL Treebank covers Ancient Greek and Latin, as well as the translations of the New Testament into Gothic, Classical Armenian and Old Church Slavonic. The TOROT Treebank covers Old Church Slavonic, Old Russian and Middle Russian, while the ISWOC Treebank includes texts in Old English, Old French, Portuguese and Spanish. The complete collection currently has 928,185 tokens, all of which have been manually annotated with morphological and syntactic analyses. Parts of the treebank also have information-structure annotation, and the New Testament texts include text alignment.
Our releases contain the treebank in our own PROIEL XML format. PROIEL XML is the authoritative format for PROIEL-style treebanks and the only one that provides access to all the annotation we have, but for ease of use we also include the treebank as CoNLL-X and CoNLL-U files. We have a command-line utility that can be used to convert PROIEL XML to various other formats (see the documentation for examples), including formats used for training taggers. For more complex tasks you can use our libraries for Ruby. See our developer pages for more information.
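For readers who want to inspect the CoNLL-U files without extra tooling, the following is a minimal sketch in Python using only the standard library. It relies on the general CoNLL-U conventions (ten tab-separated columns per token, `#` comment lines, a blank line between sentences) and not on anything specific to our releases. The function name `read_conllu` and the two-token sample are invented for illustration; the sample is not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name.

    Hypothetical helper for illustration. All values are kept as strings,
    and multiword-token or empty-node lines are passed through unchanged.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. sent_id) are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # Flush a final sentence that is not followed by a blank line.
        yield sentence


# Invented sample for illustration only; not real treebank data.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["lemma"], token["head"], token["deprel"])
```

To read a file from a release, pass an open file object instead of the sample, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points at one of the CoNLL-U files. Annotation that exists only in PROIEL XML is not available this way.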
Releases are hosted on GitHub.
If you use the treebank, please cite as:
Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.), Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34.
See also the following articles for further details:
Dag T. T. Haug, Marius L. Jøhndal, Hanne M. Eckhoff, Mari Johanne Hertzenberg and Angelika Müth. 2009. 'Computational and Linguistic Issues in Designing a Syntactically Annotated Parallel Corpus of Indo-European Languages'. Traitement automatique des langues 50 (2): 17-45.
Dag T. T. Haug, Hanne M. Eckhoff, Marek Majer and Eirik Welo. 2009. 'Breaking down and putting back together: analysis and synthesis of New Testament Greek'. Journal of Greek Linguistics 9 (1): 56-92.
Hanne Eckhoff, Kristin Bech, Gerlof Bouma, Kristine Eide, Dag Haug, Odd Einar Haugen and Marius Jøhndal. 2017. 'The PROIEL treebank family: a standard for early attestations of Indo-European languages'. Language Resources and Evaluation.
The following texts are currently included in the PROIEL Treebank:

| Text | Language |
| --- | --- |
| The Greek New Testament | Ancient Greek |
| Herodotus, Histories | Ancient Greek |
| Sphrantzes, Chronicles | Ancient Greek |
| Caesar, The Gallic War | Latin |
| Cicero, De officiis | Latin |
| Cicero, Letters to Atticus | Latin |
| Palladius, Opus agriculturae | Latin |
| The Armenian New Testament | Classical Armenian |
| The Gothic Bible | Gothic |
| Codex Marianus | Old Church Slavonic |

Please see the data files in the release distribution for complete contributor details and editorial notes.
The treebank was started as part of the research project Pragmatic Resources in Old Indo-European Languages, which was financed by the Norwegian Research Council. It originally comprised the New Testament in Ancient Greek and its translations into Latin, Old Church Slavonic, Gothic and Classical Armenian. The treebank has since been expanded to include ancient Indo-European texts in general and has spawned the TOROT and ISWOC treebanks, which are complementary to the PROIEL Treebank.
We are constantly expanding the treebank. The following texts are in the pipeline and will be included in an upcoming release:

| Text | Language |
| --- | --- |
| Plautus, opera omnia | Latin |
| Terence, opera omnia | Latin |

# Annotation system
The morphosyntactic annotation scheme is described in the Annotation Guide.