Syntacticus is an umbrella project for the PROIEL Treebank, the Tromsø Old Russian and OCS Treebank (TOROT) and the Information Structure and Word Order Change in Germanic and Romance Languages (ISWOC) Treebank, which all use the same annotation system and share similar linguistic priorities.
Syntacticus provides easy access to around a million morphosyntactically annotated tokens from 10 early Indo-European languages.
| Language | Number of tokens |
|---|---|
| Old Church Slavonic | 140,276 |
We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of the texts we are currently working on, which you can find on the PROIEL Treebank and the TOROT Treebank pages. This is extremely time-consuming work, so please be patient!
The focus for Syntacticus at the moment is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the current documentation is inadequate! New features and better integration with our development toolchain are also on the horizon.
# Annotation principles
In Syntacticus each text has been split into words, and then each word has been
- lemmatised (i.e. linked to its dictionary entry),
- assigned a part of speech (i.e. classified as noun, verb etc.),
- assigned morphological features (e.g. tagged with its case form or its tense), and
- given a syntactic function and linked to one or more other words (e.g. the subject of a verb has been labelled a subject and linked to the verb).
This has all been done manually by a language specialist and then verified by another specialist.
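The annotation layers listed above can be pictured as one record per word. The following is a minimal sketch, not the format Syntacticus actually stores: the class name `AnnotatedWord`, its field names and the morphological feature names are all assumptions made for illustration. It shows one thing that lemmatisation makes possible, namely collecting every attested form of a dictionary entry.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotatedWord:
    """Hypothetical per-word record; field names are illustrative only."""
    form: str            # the word as it appears in the text
    lemma: str           # the dictionary entry the word is linked to
    part_of_speech: str  # e.g. "noun", "verb"
    morphology: Dict[str, str] = field(default_factory=dict)


def forms_by_lemma(words: List[AnnotatedWord]) -> Dict[str, List[str]]:
    """Group the attested word forms under their dictionary entry."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        grouped[word.lemma].append(word.form)
    return dict(grouped)


# Three Latin words, annotated by hand for this example.
words = [
    AnnotatedWord("puella", "puella", "noun",
                  {"case": "nominative", "number": "singular"}),
    AnnotatedWord("amat", "amo", "verb",
                  {"person": "3", "number": "singular", "tense": "present"}),
    AnnotatedWord("amant", "amo", "verb",
                  {"person": "3", "number": "plural", "tense": "present"}),
]

print(forms_by_lemma(words))
# {'puella': ['puella'], 'amo': ['amat', 'amant']}
```

The syntactic layer (function and link to another word) is left out here to keep the record small.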
You can use this information in a number of ways. For example, if you know Latin but need help understanding the structure of a complex sentence, you can look up the specialist's analysis of that sentence.
The lemmatisation, parts of speech and morphology broadly speaking follow the same principles as standard reference grammars of Indo-European languages. In some situations we have adopted a different approach, which is more in line with modern formal linguistic thinking. This is the case in particular for various function words (such as subordinators, subjunctions, particles and interjections), which reference grammars tend to disagree on.
The syntactic annotation is based on the principles of dependency grammar. Each word is assigned a function, called a relation, and then linked to its head. For the English sentence John loves Mary, for example, John would have the relation subject and its head would be the verb loves because it is the subject of that verb. Mary would be object and its head would also be loves.
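As a rough illustration of the John loves Mary example, each word can be stored with its relation and a pointer to its head. This is a sketch under assumptions: the `Word` class, the integer ids and the label `predicate` for the verb are invented here, and only `subject` and `object` come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical dependency record; not the Syntacticus data model."""
    id: int
    form: str
    relation: str
    head_id: Optional[int]  # None marks the word that has no head


sentence = [
    Word(1, "John", "subject", 2),
    Word(2, "loves", "predicate", None),  # "predicate" is an assumed label
    Word(3, "Mary", "object", 2),
]


def dependents(words: List[Word], head_id: int) -> List[Word]:
    """Return the words whose head is the word with the given id."""
    return [w for w in words if w.head_id == head_id]


def describe(words: List[Word]) -> List[str]:
    """Spell out each word's relation and head in plain text."""
    by_id = {w.id: w for w in words}
    lines = []
    for w in words:
        if w.head_id is None:
            lines.append(f"{w.form}: {w.relation} (root)")
        else:
            lines.append(f"{w.form}: {w.relation} of {by_id[w.head_id].form}")
    return lines


for line in describe(sentence):
    print(line)
# John: subject of loves
# loves: predicate (root)
# Mary: object of loves
```

Because every word points to a single head, the words dependent on loves can be recovered with `dependents(sentence, 2)`, which returns the records for John and Mary.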
Our version of dependency grammar is heavily influenced by Lexical-Functional Grammar. This concerns in particular the granularity of argument and non-argument relations and how to distinguish between them, but we have also imported principles for annotating more complex linguistic structures such as raising and control.
The annotation system is documented in our annotation guide. (Note that the present guide is a compilation of several individual documents, some of which were written quite some time ago. We are in the process of editing and updating these documents, but for now they will have to do!)
The New Testament texts in Syntacticus have been aligned with the Ancient Greek original. This means that you can browse them side-by-side and see how each word in a translation relates to the Ancient Greek original. This feature is not fully implemented on syntacticus.org, and if you cannot wait for the complete implementation to be ready you should consult our raw data releases.
All treebank data and other linguistic resources available from Syntacticus have been made available to you by the copyright holders under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 or Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. In practice, this means you are free to use the data in a non-commercial setting as long as you provide complete attribution. You may also extract a subset of the data or derive a new data set by processing data from Syntacticus, but you must then make it freely available under the same license.
If you use this data in academic work, we ask that you cite the publication that the treebank editor has listed on their website. Please see the pages for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank for this information.
You can also link directly to texts, sentences, dictionaries and lemmas. To do this, click on the yellow Details button and copy the permanent link to the page. This link includes information about the version of the data that you have accessed.
The linguistic data you find here is the product of many people's work. Some of it has been supported by funding bodies; other parts are the product of volunteer efforts by specialists. You can find detailed information about contributors and copyright holders for each linguistic resource by clicking on the yellow Details button on text and dictionary pages. The same page also explains the provenance of the electronic text that the resource builds on and any restrictions associated with it.
# Raw data and developer resources
Raw data can be downloaded from the pages of the constituent treebanks, and some of the data has also been converted to Universal Dependencies 2.0.
We also provide a toolchain and libraries for reading and manipulating raw treebank data. Some of this is documented in our Development guide, and the code is found in our GitHub repositories https://github.com/proiel and https://github.com/mlj. (If you're curious, the code for the Syntacticus website is also available.)
| PROIEL | Raw data, Latin UD 2.0, Ancient Greek UD 2.0, Old Church Slavonic UD 2.0, Gothic UD 2.0 |
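Universal Dependencies treebanks are normally distributed as CoNLL-U files, which are plain text with ten tab-separated columns per word. The sketch below reads that format with the Python standard library only. It is not part of the Syntacticus toolchain: the function name `parse_conllu` is hypothetical, and the sample sentence is hand-written for illustration rather than taken from any of the treebanks.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of word records."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns: {line!r}")
        token = dict(zip(FIELDS, columns))
        # Skip multiword token ranges (e.g. "1-2") and empty nodes ("1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample; tabs are built explicitly so they survive copying.
SAMPLE = "\n".join([
    "# text = John loves Mary",
    "\t".join(["1", "John", "John", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "loves", "love", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "Mary", "Mary", "PROPN", "_", "_", "2", "obj", "_", "_"]),
    "",
])

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["deprel"], token["head"])
```

In CoNLL-U a head value of `0` marks the root of the sentence. The relation labels in the UD conversions follow the Universal Dependencies inventory, so they differ from the labels used in the native annotation.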
# Learn more!
If you have questions not covered here, you can talk to us on Gitter and we will try to reply as soon as possible.