The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with ArabicWeb16 Data Challenge

http://edinburghnlp.inf.ed.ac.uk/workshops/OSACT3/

Workshop Description

Given the success of the first and second workshops on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT) at LREC 2014 and LREC 2016, whose papers have received 77 citations to date, the third workshop aims to encourage researchers and practitioners of Arabic language technologies, including computational linguistics (CL), natural language processing (NLP), and information retrieval (IR), to share and discuss their research efforts, corpora, and tools. The workshop will also give special attention to the wide variety of initiatives for the creation, use, and evaluation of Arabic as one of the Asian Language Resources and Technologies, a hot topic of LREC 2018. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on a new Arabic Data Challenge track.

Data Challenge Track

This year, we are introducing ArabicWeb16, a new Web dataset that is suitable for many research projects. ArabicWeb16 is a public Web crawl of 150M Arabic Web pages, crawled over the month of January 2016, with high coverage of dialectal Arabic (about 21%) as well as Modern Standard Arabic (MSA). One goal of the workshop is to define shared challenges using this dataset. We encourage submissions describing experiments for research tasks on the dataset. These include (but are not limited to) training word embeddings, deduplication, cross-dialect search, question answering, dialect detection, knowledge-base population, entity search, blog search, text classification, and spam detection. Further details, including instructions on how to obtain the dataset, can be found here: https://sites.google.com/view/arabicweb16

Motivation

In the NLP, CL, and IR communities, Arabic is considered to be relatively resource-poor compared to English.
This situation was thought to be the reason for the limited number of corpus-based studies in Arabic. However, recent years have witnessed the emergence of new, largely free Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools. Moreover, this workshop will introduce, for the first time, the “ArabicWeb16” dataset, a new Web dataset that is suitable for many research projects, and use it in a shared challenge in which participants report experiments on research tasks using the dataset.

Topics of Interest

Corpora

● Surveying and critiquing the design of available Arabic corpora and their associated processing tools.

● Making available new annotated corpora for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.

● Evaluating the use of crowdsourcing platforms for Arabic data annotation.

Tools and Technologies

● Language education, e.g., L1 and L2.

● Language modeling and word embeddings.

● Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.

● Sentiment analysis, dialect identification, and text classification

● Dialect translation

ArabicWeb16 Data Challenge

● Language modeling, word embeddings.

● Dialect detection, Cross-dialect search.

● Entity search, Blog search, Deduplication, Spam detection.

● Question answering, Knowledge-base population.

● Text Classification

Important Dates

● Submission deadline: 15 January 2018

● Notification of acceptance: 15 February 2018

● Final submission of manuscripts: 25 February 2018

● Workshop date: Tuesday, 8 May 2018

Submission Guidelines

The language of the workshop is English, and submissions should follow the LREC 2018 paper submission instructions (http://lrec2018.lrec-conf.org/en/submission/authors-kit/). All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format to the STAR system. When submitting a paper from the STAR page, authors will be asked to provide essential information about resources (in a broad sense, i.e. technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).

Identify, Describe and Share your LRs!

● Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

● As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2018 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Organizing Committee

● Hend Al-Khalifa, King Saud University, KSA

● Walid Magdy, University of Edinburgh, UK

● Kareem Darwish, Qatar Computing Research Institute, Qatar

● Tamer Elsayed, Qatar University, Qatar

Program Committee (Tentative)

● Nizar Habash, New York University Abu Dhabi, UAE

● Mona Diab, George Washington University, USA

● Waleed Ammar, Allen Institute for Artificial Intelligence, USA

● Wajdi Zaghouani, Carnegie Mellon University, Qatar

● Mahmoud El-Haj, Lancaster University, UK

● Khaled Bashir Shaban, Qatar University, Qatar

● Wassim El-Hajj, American University of Beirut, Lebanon

● Ayah Zirikly, George Washington University, USA

● Irina Temnikova, Qatar Computing Research Institute, Qatar

● Shady Elbassuoni, American University of Beirut, Lebanon

● Abeer Aldayel, King Saud University, KSA

● Khaled Shaalan, The British University in Dubai, UAE

● Almoataz B. Elsaid, Cairo University, Egypt

● Ahmed Mourad, RMIT University, Australia

● Hassan Sawaf, Amazon, USA

● Fethi Bougares, Université du Maine, France

● Nada Ghneim, Higher Institute for Applied Science and Technology, Syria

● Amal Alsaif, Al-Imam Muhammad ibn Saud Islamic University

● Maha Althobaiti, Taif University, KSA

More names to come . . .

