Apologies if you have received this e-mail more than once.
== Quranic Arabic Corpus Version 0.2 ==
Version 0.2 was released today, Monday 1st February 2010. The Quranic Arabic Corpus is an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology. The research project is organized at the University of Leeds, and is part of the Arabic language computing research group within the School of Computing, supervised by Eric Atwell.
This project aims to provide a richly annotated linguistic resource for researchers wanting to study the original Arabic language of the Quran. Each day on average, the website receives 10,000 page views and over 1,500 visitors from 135 different countries world-wide. Following user feedback, a new version of the corpus is now available with several improvements to both the website and the annotated linguistic data:
== Synopsis of New Features ==
- Syntactic treebank now includes chapter 2 of the Quran
- Visual ontology with 300 concepts and 350 logical relations
- Named entity tagging, with 6,000 Arabic words in the Quran identified
- Higher accuracy for part-of-speech tagging and morphological analysis
- New parts-of-speech for particles (PRO/prohibition, SUP/supplemental)
- Improved English terminology for corresponding Arabic grammar terms
- Fixed typos in interlinear translation
- Fixed missing last verses in data download files
- Easier and quicker navigation with direct verse selection
- Search page now shows entire verses in Arabic and English
- Improved message board security with user sign-in and registration
== Linguistic Improvements ==
- The syntactic treebank uses dependency graphs to visualize the parsed syntactic structure for Arabic verses in the Quran. Previously, the treebank covered approximately 5,000 words (surat l-fatihah and the last two juz of the Quran). In version 0.2, the treebank has been extended to include chapter 2 (surat l-baqarah) and now covers over 11,000 Arabic words in the Quran with 2,500 dependency graphs. See: http://corpus.quran.com/treebank.jsp
- The ontology of Quranic concepts is the largest new feature to be added in this release. This shows a visual map of the names of people, places and other entities mentioned in the Quran (http://corpus.quran.com/ontology.jsp). Relationships between entities are encoded using predicate logic (e.g. father/son, instance/subclass, part-of). At present, this is a basic ontology to enable a further planned step of analysis, pronoun resolution. A brief webpage has been written about each of the 300 concepts in the ontology, providing a short synopsis, as well as showing predicate logic relations. Users can add comments to each ontology concept page. It is hoped that over time the ontology will grow into a small specialized wiki of Quranic topics, formalized using machine-readable predicate logic. Each page in the ontology is hyperlinked to the closest corresponding page in Wikipedia, where applicable. A topic concordance of concepts is also available (http://corpus.quran.com/topics.jsp) which allows users to click through to easily find verse references for each concept in the ontology.
- Named entity tagging in the Quranic corpus involves identifying specific Arabic words (or spans of words) in verses, and mapping these to well-defined formal concepts in the ontology. The word-by-word grammatical annotation scheme on the website has been extended to show links to the ontology. So far, 6,000 Arabic words have been tagged as named entities and have been mapped to concepts. These include all proper nouns in the corpus, as well as names of other specific locations, places, animals and important events mentioned in the Quran.
- A detailed linguistic review has been completed of all messages on the message board. This has left 339 messages open for further discussion, with 2,842 messages now resolved and archived. Version 0.2 of the corpus incorporates many improvements and suggestions from volunteer annotators on how grammatical tagging might be improved. This has resulted in much higher accuracy in the online grammatical analysis for each Arabic word.
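For readers who would like a concrete picture of the predicate logic relations mentioned for the ontology above, the following is a minimal sketch in Python. It is illustrative only: the tuple layout, the function names and the handful of example facts are assumptions made for this example, and do not reflect the actual data format or contents of the corpus ontology.

```python
from collections import defaultdict

# Hypothetical facts in the form (predicate, subject, object).
# These are illustrative assumptions, not data taken from the corpus.
FACTS = [
    ("instance", "Ibrahim", "Prophet"),
    ("instance", "Ismail", "Prophet"),
    ("subclass", "Prophet", "Human"),
    ("subclass", "Human", "Living Creation"),
    ("father", "Ibrahim", "Ismail"),
]


def build_index(facts):
    """Index facts by (predicate, subject) for quick lookup of objects."""
    index = defaultdict(set)
    for predicate, subject, obj in facts:
        index[(predicate, subject)].add(obj)
    return index


def classes_of(entity, index):
    """Return every class an entity belongs to.

    Starts from the entity's direct 'instance' links and then follows
    'subclass' links transitively.
    """
    found = set()
    pending = list(index.get(("instance", entity), ()))
    while pending:
        cls = pending.pop()
        if cls not in found:
            found.add(cls)
            pending.extend(index.get(("subclass", cls), ()))
    return found


INDEX = build_index(FACTS)
```

With these example facts, calling classes_of("Ibrahim", INDEX) collects "Prophet" directly and then the broader classes reached through the subclass links. A planned step such as pronoun resolution could, in principle, use this kind of lookup to check whether a candidate referent is of a compatible type, although how the project will actually do this is not stated here.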
== Data Download Improvements ==
- Version 0.2 of the corpus changes the part-of-speech tags for particles, in order to achieve higher accuracy with regard to traditional Arabic grammatical analysis (i'rab). Previously, the SUP tag was used for the rare surprise particle; this has now been renamed to SUR/surprise. The SUP tag is now used for a new category, SUP/supplemental (harf za'id), and a second new tag, PRO/prohibition, has also been introduced. The latter is required to correctly distinguish negative particles (NEG = harf nafee) from particles of prohibition (PRO = harf nahee). Proper noun tagging has also been improved. Completion of the initial draft of the ontology has allowed for a clearer view on what should be tagged as a proper noun, based on grammatical as well as semantic considerations.
- English terminology on the website has been improved for corresponding Arabic grammatical terms. The syntactic treebank now uses clearer English terminology and phrase tagging for jumlah fi'liya / ismiyah (VS / NS = verbal / nominal sentence). Previously these were named "verb phrase" and "noun phrase" which may have led to some confusion. There is also improved terminology for the rarer Quranic verbal nouns, e.g. "imperative verbal noun" instead of just "imperative noun" for "ism fi'il amr".
- Some typos have been fixed in the interlinear English translation. This includes correcting some of the places where words have been doubled up, as well as fixing missing occurrences of the word "zakah". There are quite likely to be more improvements to be made in the interlinear translation with regard to accuracy against traditionally accepted sources of translation into English. Comments are more than welcome via the message board.
- The data download files for version 0.2 of the corpus have been updated to include all these new improvements. The issue of missing last verses when downloading data has also now been fixed.
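Because the SUP tag changes meaning between versions (surprise in the earlier release, supplemental in version 0.2), anyone holding annotations based on the earlier tag set may want to rename old SUP tags before mixing them with version 0.2 data. The sketch below shows one way this could be done in Python. The mapping and function names are assumptions made for this example, and it does not read the actual download file format.

```python
# Hypothetical mapping from earlier tags to version 0.2 tags.
# Only the rename described above is covered: old SUP (surprise) becomes SUR.
OLD_TO_NEW = {"SUP": "SUR"}


def migrate_tag(tag):
    """Translate a single tag from the earlier tag set to version 0.2."""
    return OLD_TO_NEW.get(tag, tag)


def migrate_tags(tags):
    """Translate a sequence of earlier tags, leaving unknown tags unchanged."""
    return [migrate_tag(tag) for tag in tags]
```

Note that a simple rename like this cannot produce the new PRO/prohibition or SUP/supplemental tags. Those distinctions depend on grammatical analysis of each word, so they need to be taken from the version 0.2 data itself.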
== Website Improvements ==
- A drop-down verse list has been introduced across the website. This allows for easier and quicker navigation with direct verse selection. This was a feature often requested by regular website users.
- The search page now shows entire verses in Arabic and English. When searching for a word or using the concordance functionality, previously only a list of matching words would be displayed. Now, each search result highlights the matching Arabic word and shows its entire verse in context. A corresponding English translation for each verse is also displayed when searching, using the Sahih International translation. Website users also have the option of using 8 different English translations for wider context, including the word-by-word interlinear translation.
- The message board now has improved security with user sign-in and registration. The Quranic Arabic Corpus website receives many regular visitors, including young students who use the website to learn about Arabic grammar and to find out more about the Quran. This registration process is intended to protect our users from spam, and to prevent other unsuitable or potentially harmful messages from being posted to the message board. Users can now also post messages to each of the 300 ontology concept pages, so that hopefully this new content can be improved and extended over time.
- Non-technical interview with the Muslim Post (January 2010) - http://corpus.quran.com/interview.jsp
- Linguistic academic paper (for submission) - "Kais Dukes and Tim Buckwalter. A Dependency Treebank of the Quran using Traditional Arabic Grammar." - http://corpus.quran.com/publications.jsp
== Feedback ==
Any feedback on version 0.2 of the Quranic Arabic Corpus is more than welcome. The Quranic Arabic Corpus is made freely available under the GNU General Public License and the corpus terms-of-use.
-- Kais Dukes
Language Research Group
School of Computing
University of Leeds
http://corpus.quran.com - The Quranic Arabic Corpus
comp-quran at comp.leeds.ac.uk - Computational Quranic Arabic discussion list