Download 78k Txt

There are a number of data files involved in this challenge. Each type of file is available for each language.

NEW (2009-02-25): All random word pairs files have been updated so that they correspond to the new evaluation scripts. In addition, small modifications have also been made to the Arabic word lists and gold standard samples.

Word list (input)

First and foremost, there is a list of word forms. The words have been extracted from a text corpus, and each word in the list is preceded by its frequency in the corpus used. For instance, a subset of the supplied English word list looks like this:

    ...
    1 barefoot's
    2 barefooted
    6699 feet
    653 flies
    2939 flying
    1782 foot
    64 footprints
    ...

Result file (output, i.e., what to submit)

The participants' task is to return a list containing exactly the same words as in the input, with morpheme analyses provided for each word. The list returned shall not contain the word frequency information.

A submission for the above English words may look like this:

    barefoot's BARE FOOT +GEN
    barefooted BARE FOOT +PAST
    feet FOOT +PL
    flies FLY_N +PL, FLY_V +3SG
    flying FLY_V +PCP1
    foot FOOT
    footprints FOOT PRINT +PL
    ...

There are a number of things to note about the result file:

Each line of the file contains a word (e.g., "feet") separated from its analysis (e.g., "FOOT +PL") by one TAB character. The word needs to look exactly as it does in the input; no capitalization or change of character encoding is allowed. The analysis contains morpheme labels separated by spaces. The order in which the labels appear does not matter; e.g., "FOOT +PL" is equivalent to "+PL FOOT". The labels are arbitrary: e.g., instead of using "FOOT" you might use "morpheme784" and instead of "+PL" you might use "morpheme2".
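The line format described above (word, one TAB, space-separated labels, comma-separated alternatives as in the "flies" example) can be sketched in Python as follows. This is only an illustration, not part of the official Challenge tools; the function names parse_result_line and format_result_line are hypothetical. Because label order does not matter, each analysis is stored as a sorted tuple so that "FOOT +PL" and "+PL FOOT" compare equal.

```python
from typing import List, Tuple

# One analysis = its labels in sorted order (label order is irrelevant in the file).
Analysis = Tuple[str, ...]


def parse_result_line(line: str) -> Tuple[str, List[Analysis]]:
    """Split one result-file line into the word and its alternative analyses."""
    # The word is everything before the first TAB; the rest is the analysis field.
    word, _, analysis_field = line.rstrip("\n").partition("\t")
    alternatives: List[Analysis] = []
    for alternative in analysis_field.split(","):
        labels = alternative.split()  # labels are separated by spaces
        if labels:
            alternatives.append(tuple(sorted(labels)))
    return word, alternatives


def format_result_line(word: str, alternatives: List[List[str]]) -> str:
    """Build one result-file line: word, TAB, alternatives joined by ', '."""
    return word + "\t" + ", ".join(" ".join(labels) for labels in alternatives)


if __name__ == "__main__":
    print(parse_result_line("flies\tFLY_N +PL, FLY_V +3SG"))
    print(format_result_line("feet", [["FOOT", "+PL"]]))
```

Note that the word itself is passed through untouched, in line with the requirement that it must look exactly as it does in the input.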
However, we strongly recommend using intuitive labels where possible, since they make it easier for anyone to get an idea of the quality of the result by looking at it.

If a word has several interpretations, all interpretations should be supplied: e.g., the word "flies" may be the plural form of the noun "fly" (insect) or the third person singular present tense form of the verb "to fly". The alternative analyses must be separated using a comma, as in: "FLY_N +PL, FLY_V +3SG". The existence of alternative analyses makes the task challenging, and we leave it to the participants to decide how much effort they will put into this aspect of the task. In English, for instance, in order to get a perfect score, it would be necessary to distinguish the different functions of the ending "-s" (plural or person ending) as well as the different parts-of-speech of the stem "fly" (noun or verb). As the results will be evaluated against reference analyses (our so-called gold standard), it is worth reading about the guiding principles used when constructing the gold standard.

As far as we understand, you can use any characters in your morpheme labels except whitespace and comma (,). However, we cannot guarantee that the evaluation scripts will work properly if your labels contain some "strange" characters.

Text corpus for English, Finnish, German and Turkish

The word list (input data) has been constructed by collecting word forms occurring in a text corpus. The text corpora have been obtained from the Wortschatz collection at the University of Leipzig (Germany). We used the plain text files (sentences.txt for each language); the corpus sizes are 3 million sentences for English, Finnish and German, and 1 million sentences for Turkish. For English, Finnish and Turkish we use preliminary corpora, which have not yet been released publicly at the Wortschatz site.
The corpora have been preprocessed for the Morpho Challenge (tokenized, lower-cased, some conversion of character encodings).

Participants who wish to do so can use the corpora to get information about the context in which the different words occur.

We are most grateful to the University of Leipzig for making these resources available to the Challenge, and in particular we thank Stefan Bordag for his kind assistance.

Text corpus for Arabic

NEW: This year we try a different data set, the Quran, which is somewhat smaller (only 78K words) but also has a vowelized version (as well as the unvowelized one). The text data has also been made available.

In Arabic, the participants can try to analyze the vowelized words, the unvowelized words, or both. They will be evaluated separately against the vowelized or the unvowelized gold standard analysis, respectively.

For all Arabic data, the Arabic script is provided as well as the Roman script (Buckwalter transliteration). However, we can only evaluate morpheme analyses submitted in Roman script, sorry.

We are most grateful to Majdi Sawalha and Eric Atwell from the University of Leeds for making this data available to the Challenge and for their kind assistance in preparing it to meet the Challenge file formats.

Sawalha, Majdi; Atwell, Eric. 2008. Comparative evaluation of Arabic language morphological analysers and stemmers. In: Proceedings of COLING 2008, 22nd International Conference on Computational Linguistics.

We also acknowledge the Computational Linguistics Group at the University of Haifa, who supplied their tagged database.

Gold standard morpheme analyses

The desired "correct" analyses for a random sample of circa 500 words are supplied for each language.
These samples can be used for visual inspection and as a development test set (in order to get a rough estimate of the performance of the participants' morpheme-analyzing algorithm).

The format of the gold standard file is exactly the same as that of the result file to be submitted. That is, each line contains a word and its analysis. The word is separated from the analysis by a TAB character. Morpheme labels in the analysis are separated from each other by a space character. For some words there are multiple correct analyses. These alternative analyses are separated by a comma (,). Examples:

    Language               Examples
    English                baby-sitters baby_N sit_V er_s +PL
                           indoctrinated in_p doctrine_N ate_s +PAST
    Finnish                linuxiin linux_N +ILL
                           makaronia makaroni_N +PTV
    German                 choreographische choreographie_N isch +ADJ-e
                           zurueckzubehalten zurueck_B zu be halt_V +INF
    Turkish                kontrole kontrol +DAT
                           popUlerliGini popUler +DER_lHg +POS2S +ACC, popUler +DER_lHg +POS3 +ACC3
    Arabic vowelized       Al>aroDi 'rD faEl 'arD +Noun +Triptotic +Sg +Fem +Gen +Def
    Arabic non-vowelized   Al>rD 'rD fEl 'rD +Noun +Triptotic +Sg +Fem +Gen +Def

The English and German gold standards are based on the CELEX database. The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The Turkish gold-standard analyses have been obtained from a morphological parser developed at Boğaziçi University; it is based on Oflazer's finite-state machines, with a number of changes. We are indebted to Ebru Arısoy for making the Turkish gold standard available to us. For Arabic, each line of the gold standard contains the word, the root, the pattern, and then the morphological and part-of-speech analysis.

The morphological analyses are morpheme analyses. This means that only grammatical categories that are realized as morphemes are included.
For instance, for none of the languages will you find a singular morpheme for nouns or a present-tense morpheme for verbs, because these grammatical categories do not alter or add anything to the word form, in contrast to, e.g., the plural form of a noun (house vs. house+s), or the past tense of verbs (help vs. help+ed, come vs. came).

The morpheme labels that correspond to inflectional (and sometimes also derivational) affixes have been marked with an initial plus sign (e.g., +PL, +PAST). This is due to a feature of the evaluation script: in addition to the overall performance statistics, evaluation measures are also computed separately for the labels starting with a plus sign and those without an initial plus sign. It is thus possible to make an approximate assessment of how accurately affixes are analyzed vs. non-affixes (mostly stems). If you use the same naming convention when labeling the morphemes proposed by your algorithm, these statistics will also be available for your output (see the evaluation page for more information).

The morpheme labels that have not been marked as affixes (no initial plus sign) are typically stems. These labels consist of an intuitive string, usually followed by an underscore character (_) and a part-of-speech tag, e.g., "baby_N", "sit_V". In many cases, especially in English, the same morpheme can function as different parts-of-speech; e.g., the English word "force" can be a noun or a verb. In the majority of these cases, however, if there is only a difference in syntax (and not in meaning), the morpheme has been labeled as either a noun or a verb, throughout.
For instance, the "original" part-of-speech of "force" is a noun, and consequently both noun and verb inflections of "force" contain the morpheme "force_N":

    force force_N
    force's force_N GEN
    forced force_N +PAST
    forces force_N +3SG, force_N +PL
    forcing force_N +PCP1

Thus, there is not really a need for your algorithm to distinguish between different meanings or syntactic roles of the discovered stem morphemes. However, in some rare cases, if the meanings of the different parts-of-speech do differ clearly, there are two variants, e.g., "train_N" (vehicle), "train_V" (to teach), "fly_N" (insect), "fly_V" (to move through the air). But again, if there are ambiguous meanings within the same part-of-speech, these are not marked in any way, e.g., "fan_N" (device for producing a current of air) vs. "fan_N" (admirer). This notation is a consequence of using CELEX and FINTWOL as the sources for our gold standards. We could have removed the part-of-speech tags, but we decided to leave them there, since they carry useful information without making the task significantly more difficult. There are no part-of-speech tags in the Turkish gold standard.

Random word pairs file

If you want to carry out a small-scale evaluation yourself using the gold standard sample, you need to download a randomly generated so-called word pairs file for each language to be tested. Read more about this on the evaluation page.

Character encoding

In the source data used for the different languages, there is variation in how accurately certain distinctions are made when letters are rendered. This makes it hard to apply a unified character encoding scheme for all the languages (such as UTF-8). Thus, the following encodings have been used, in which all letters are encoded as one-byte (8-bit) characters:

English: Standard text. All words are lower-cased, also proper names.

Finnish: ISO Latin 1 (ISO 8859-1). The Scandinavian special letters å, ä, ö (as well as other letters occurring in loan words, e.g., ü, é, à) are rendered as one-byte characters.
All words are lower-cased, also proper names.

German: Standard text. All words are lower-cased, including all nouns. The German umlaut letters are rendered as the corresponding non-umlaut letter followed by "e", e.g., "laender" (Länder), "koennte" (könnte), "fuer" (für). Double-s is rendered as "ss", e.g., "strasse" (Straße). This coarse encoding is due to the fact that CELEX, the source for the morphological gold standard, utilizes this scheme. Note, however, that in the data you may see special letters encoded using ISO Latin 1 in some loan words, e.g., "société", "l'unità" (these words are not included in CELEX and their analyses will not be evaluated).

Turkish: Standard text. All words are lower-cased. The letters specific to the Turkish language are replaced by capital letters of the standard Latin alphabet, e.g., "açıkgörüşlülüğünü" is spelled "aCIkgOrUSlUlUGUnU".

Arabic: All words in Roman script are presented in Buckwalter transliteration. The Arabic script is encoded in UTF-8.

Download data for Competition 1

Columns: Language, Word list, Text corpus, Sample of gold standard, Random word pairs file.

English: Text, Text gzipped, Text gzipped, Text, Text
Finnish: Text, Text gzipped, Text gzipped, Text, Text
German: Text, Text gzipped, Text gzipped, Text, Text
Turkish: Text, Text gzipped, Text gzipped, Text, Text
Arabic vowelized: Text, Arabic script, Text gzipped, Arabic script gzipped, Text gzipped, Arabic script gzipped, Text, Text
Arabic non-vowelized: Text, Arabic script, Text gzipped, Arabic script gzipped, Text gzipped, Arabic script gzipped, Text, Text

Instead of downloading each file separately, you can download the whole package (including all Competition 1, 2 and 3 data), either as a tar file: morphochal09data.tar (638 MB; unpack using "tar xf") or as a zip file (639 MB).

Download data for Competition 2

Participation in competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants for competition 1 in information retrieval experiments.
Data from CLEF will be used. However, because the information retrieval evaluation texts are different from the training texts of competition 1, slightly better IR performance may be obtained by also submitting analyses of the words that do not occur in the word lists of competition 1. The joined word lists can be downloaded below.

Columns: Language, Word list, Text corpus.

English: Text, Text gzipped, See the paragraph below
Finnish: Text, Text gzipped, See the paragraph below
German: Text, Text gzipped, See the paragraph below

Participants who wish to use the full text corpora in order to get information about the context in which the different words occur should contact the organizers for more information on how to register with CLEF to obtain the full texts.

If there are participants who wish to submit morpheme analyses for words in their actual context (competition 2b), they will need to request the full texts, too.

If you need the full texts, please contact the organizers for details on how to fill in and submit the CLEF Registration Form and CLEF End-User Agreement. The deadline for this registration is 1 May, 2009.

NOTE: If you do not participate in competition 2b and do not need the full texts in order to submit the unsupervised morpheme analysis for competition 2, it is enough to download the data available on this page.

Download data for Competition 3

In order to participate in competition 3, participants must submit analyses of the words in the Europarl corpus. Two languages, Finnish and German, are included in this competition. The result file must be in the same format as in competitions 1 and 2. However, giving several interpretations per word is not recommended, as only one can be applied. If alternatives are given, we will use only the first one.
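Since only the first alternative is used in competition 3, a result file prepared for competition 1 could be reduced beforehand. The following Python lines are a minimal sketch of that reduction, not an official script; the function name keep_first_analysis is hypothetical, and the sketch assumes the word and its analysis are separated by one TAB as in the result file format.

```python
def keep_first_analysis(line: str) -> str:
    """Return the result-file line with only its first alternative analysis."""
    # Split at the first TAB so that the word itself is never modified.
    word, tab, analysis_field = line.rstrip("\n").partition("\t")
    # Alternatives are comma-separated; keep only the first one.
    first = analysis_field.split(",", 1)[0].strip()
    return word + tab + first


if __name__ == "__main__":
    print(keep_first_analysis("flies\tFLY_N +PL, FLY_V +3SG"))
```

Which alternative comes first is then up to the participant's own algorithm, so it may be worth ordering alternatives by confidence before applying this step.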
The word lists can be downloaded below.

Columns: Language, Word list, Text corpus.

Finnish: Text, Text gzipped, Corpus archive (45 MB)
German: Text, Text gzipped, Corpus archive (54 MB)

Warning: The list of words contains many numbers and various special characters, which may cause problems if not taken into account. You can preprocess the data if needed, but make sure that the words in the result file are exactly as they were given. Exception: It is allowed (and recommended) to change comma (,) to uppercase C. This is necessary especially if your algorithm gives alternative analyses.

You are free to use the data sets from competitions 1 and 2 in addition to the Europarl set to obtain the analyses. Also, you do not need to return an analysis for every word in the Europarl word list. Those that have no analysis will be treated as having a single morpheme, the word itself. (Note, however, that Europarl has a large number of words not appearing in the other data sets, so it is not recommended to discard it entirely.)

Participants who wish to use the full text corpora can use the provided corpus files. The gzipped tar archive contains several hundred text files (with names such as ep-98-01-13.txt). You must return a set of files that is otherwise the same (same number of lines, same order of lines, including the empty lines), but in which words are replaced by their analyses. Both morphemes and words should be separated by a single space. (I.e., there is no need to distinguish word breaks from other morpheme breaks.)
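The corpus rewriting described above can be sketched as follows. This is an illustration under stated assumptions, not the organizers' tooling: the names analyze_corpus_line and analyze_corpus_text are hypothetical, tokens are assumed to be whitespace-separated and to match the word list entries exactly, and the analyses dictionary is assumed to hold a single analysis string per word. A word without an entry is kept as it is, mirroring the rule that such words are treated as a single morpheme.

```python
from typing import Dict


def analyze_corpus_line(line: str, analyses: Dict[str, str]) -> str:
    """Replace every word on one corpus line by its analysis."""
    rewritten = []
    for token in line.split():
        # No analysis available: the word itself serves as its only morpheme.
        rewritten.append(analyses.get(token, token))
    # Words and morphemes are all separated by a single space.
    return " ".join(rewritten)


def analyze_corpus_text(text: str, analyses: Dict[str, str]) -> str:
    """Rewrite a whole file, keeping line count, line order and empty lines."""
    return "\n".join(analyze_corpus_line(line, analyses) for line in text.split("\n"))


if __name__ == "__main__":
    toy_analyses = {"feet": "FOOT +PL", "flies": "FLY_N +PL"}  # toy data
    print(analyze_corpus_text("the feet\n\nflies\n", toy_analyses))
```

Splitting on "\n" rather than iterating over non-empty lines is what keeps empty lines and the total number of lines unchanged.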
