[Corpora-List] Fwd: GermEval 2020 Task 1 on the Prediction of Intellectual Ability and Personality Traits from Text: 1st Call for Participation

Chris Brew christopher.brew at gmail.com
Thu Dec 5 22:17:20 CET 2019

I think there is a parallel with the ASAP tasks that were created by the Hewlett Foundation in 2012 (https://www.kaggle.com/c/asap-aes, https://www.kaggle.com/c/asap-sas). These competitions drew broad participation, including from me. In retrospect, I still think my participation was justified, but I also understand people who chose not to participate, on moral grounds, and even those who may say that the Hewlett Foundation should not have run the tasks. It'll take a few paragraphs to say why.

The Hewlett Foundation explains the motivation in the following terms:

One of the key roadblocks to advancing school-based curricula focused on critical thinking and analytical skills is the expense associated with scoring tests to measure those abilities. For example, tests that require essays and other constructed responses are useful tools, but they typically are hand scored, commanding considerable time and expense from public agencies. So, because of those costs, standardized examinations have increasingly been limited to using “bubble tests” that deny us opportunities to challenge our students with more sophisticated measures of ability.

Hewlett is opening the field of automated student assessment to you. We want to induce a breakthrough that is both personally satisfying and game-changing for improving public education.

Prima facie, it is hard to argue against this goal. But I think a deeper look is needed. There were and are other motivations in play.

When the competitions were announced, I was working at the Educational Testing Service (ETS). The whole reason why ETS, which is a non-profit, exists is to advance responsible educational testing, and the organization has a stellar record of attention to bias and fairness in testing. ETS is also rigorous and careful in looking at use cases and possible unintended consequences. It does deploy automated essay grading at scale, and is clear-eyed about the strengths and weaknesses of the techniques used.

The so-called "bubble tests" which Hewlett criticizes are excellent tools as far as they go. Psychometricians know a lot about how to make them fair and informative, provided that they have time to study the exact settings and populations with which they are going to be used. And NLP research is scrutinized to ensure that it matches the standards set by responsible use of bubble tests. Although it is certainly true that tests using constructed responses such as essays have a more natural connection to real-world skills than do bubble tests, if such tests are going to be used, they have to be used judiciously, and with an eye to unintended consequences.

In 2011-2012 the political context of the ASAP tasks was roughly the following:

- The US Federal Education Department was offering extra school funding to state education departments, but making it contingent on demonstrated commitment to serious assessment. Personally, I think this contingency was an awful idea, and even at the time it was pretty easy to see how it would misfire. The level of funding was such that few state education departments were ever going to turn it down, no matter what they thought about the merits of a headlong revolution in curriculum and assessment.

- State chief education officers were responding to this by looking at their options for getting large-scale in-school assessments done. There was a lot of consultation, and two large consortia were investigating possibilities. I believe, but cannot say for sure, that the Hewlett Foundation was working closely with the consortia in formulating the task. It is clear from the task that the foundation had access to resources that could only have been made available by high-level co-operation with state education departments. It might be too much to say that the foundation was a proxy for the interests of the state education departments, but I think the statement is directionally correct.

- In-school testing was clearly going to require test design resources and approaches at a larger scale than existed at the time. Chief education officers were almost certainly wary of being totally dependent on big players such as Pearson and/or ETS. One of the explicit goals of the competitions was to bring in new approaches and new players.

In the event, as far as I know, ETS did not mount a major effort to do large-scale in-school testing. I like to think that this is because ETS realized that doing responsible in-school testing is very difficult or even impossible at the quality level it aspires to. Other providers made a different choice, for sure, so these tests are widely deployed.

The difficulties of deploying these tests at all include robustness to language variation. Both individual differences and differences between linguistic subcommunities must be considered. In a tightly controlled setting it is possible to do a decent job on this, but it requires vigilance, comprehensive testing, and a determination to drop any questions (or even whole tests) that are not working as they should. The same is true for other sources of bias; skimping on the necessary quality control is a disaster waiting to happen.

The College Board (for the SAT) and ETS (for the GRE) attempt to provide tests that avoid these biases, and the SAT and GRE are, rightly, carefully monitored for bias by the institutions that consider using them. I understand that many institutions would rather not rely on the GRE or SAT as much as they do, because of the possibility of unacceptable systematic biases, whether gender-based, racial, or economic. A major contributor to economic bias is the availability of effective but expensive coaching: by design, the tests are not supposed to be coachable, but that doesn't stop the coaching companies. I totally understand why a college that can afford to run and take responsibility for an alternative admissions process would decide not to pay attention to the GRE or SAT at all. The fact that the tests were at one point a positive contribution to reducing bias does not mean that they still are now.

Additional difficulties of large scale include the rate at which new questions must be developed, because a question that has been exposed to a broad audience must be assumed to have leaked, especially if deployment is to schools rather than in a tightly moderated test center. More fundamentally, it was obvious that despite protestations to the contrary, mission creep would occur.

The promise was made, and I think honored, that the tests would not drive educational decisions about individual students; but in my view they had predictable bad effects elsewhere. In particular, it was very predictable that test results would be used in teacher evaluations, in evaluations of schools, and in funding decisions made by educational authorities. This is a classic case of what happens when what was designed as a metric becomes a target. Teaching to the test is not automatically a bad thing if the test is well formulated and well integrated with a supporting curriculum; but if the tests are not great, the risk that teaching to the test will crowd out appropriate educational activities is very real. Unfortunately, this has happened, which is why I think the best thing to do at this stage is to drop in-school standardized tests completely. They cause stress to students and teachers, distort the curriculum, and provide misleading and dangerous pseudo-data about how well schools are doing.

So I think it is reasonable to follow the chain of responsibility from the Education Department through the state education officers and the Hewlett Foundation to the participants, and to say that by participating, people like me were giving credibility to large-scale testing, and that this testing is a bad thing. The counterpoints are (a) I am not sure that automated grading of free-form constructed responses is actually part of the main testing program (it shouldn't be; I hope it isn't), and (b) free-form constructed responses are a great and important instrument, and there are lots of genuinely low-stakes settings in which they have value. I myself worked with Lydia Liu and colleagues on a system for in-class feedback on short-answer science questions, and really don't see a downside to this capability. Decent-quality feedback from a computer system is a useful complement to excellent but delayed feedback from a teacher. Care is needed to make sure that students understand how fallible the automated system is, and know what they are getting and not getting.

Apologies for the length. My conclusion, such as it is, is that even in retrospect I don't know whether the ASAP tasks were a good thing for the world as a whole. I decline to spell out much about what I think the detailed parallels are with the GermEval task.


On Wed, Dec 4, 2019 at 2:04 PM Emily M. Bender <ebender at uw.edu> wrote:

> Thank you, Jacob, for this reply. This task seems irresponsible/poorly
> conceived to me. Before designing such a task, I think it is imperative to
> consider its use cases: When and why would we want to predict IQ scores or
> high school grades from text? Given the high potential for any such system
> to learn preexisting biases (themselves the result of structural
> discrimination in society), what are the likely impacts, especially on
> already marginalized populations?
> Emily
> On Wed, Dec 4, 2019 at 10:34 AM Jacob Eisenstein <jacobe at gmail.com> wrote:
>> As a community, we should think carefully about whether it is appropriate
>> to work with IQ test results as data, and what the applications of this
>> research might be.
>> In the United States, there is considerable evidence that IQ tests are
>> racially biased. In the past, courts have excluded IQ tests from
>> educational placement in California for precisely this reason. I wonder if
>> there is research on this topic in the German context.
>> It is not difficult to imagine that the outcome of this shared task would
>> be a set of technologies that encode spurious correlations between
>> estimates of intelligence and the linguistic features of specific racial
>> groups. If such a system were trained on data that already contains biases,
>> there is a risk that this bias would be not only entrenched but amplified.
>> And even if the IQ test statistics are not themselves biased, an NLP system
>> that predicts IQ from text could introduce bias, if there is an unmeasured
>> confound that is statistically associated with both IQ and race.
>> I hope that these issues will receive serious consideration from the
>> organizers and participants in the task.
>> Jacob Eisenstein
>> On Wed, Dec 4, 2019 at 8:27 AM Dirk Johannßen <
>> johannssen at informatik.uni-hamburg.de> wrote:
>>> *GermEval 2020 Task 1 on the Prediction of Intellectual Ability and
>>> Personality Traits from Text*
>>> *1st Call for Participation*
>>> We invite interested parties from academia and industry to participate
>>> in this shared task. Further information can be found here:
>>> https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-psychopred.html
>>> The validity of high school grades as a predictor of academic success is
>>> controversial. Researchers have found indications that linguistic features
>>> such as function words used in a prospective student's writing perform
>>> better in predicting academic success (Pennebaker et al., 2014).
>>> During an aptitude test, participants are asked to write freely
>>> associated texts in response to provided questions and images. Trained
>>> psychologists can predict behavior, long-term development, and subsequent
>>> success from those expressions. Paired with an IQ test and provided high
>>> school grades, the prediction of intellectual ability from text can be
>>> investigated. Such an approach would go beyond mere text classification
>>> and could reveal insightful psychological traits.
>>> Operant motives are unconscious intrinsic desires that can be measured
>>> by implicit or operant methods, such as the Operant Motive Test (OMT) or
>>> the Motive Index (MIX). During the OMT and MIX, participants are
>>> asked to write freely associated texts in response to provided questions
>>> and images. Trained psychologists label these textual answers with one of
>>> five motives and corresponding levels. The identified motives allow
>>> psychologists to predict behavior, long-term development, and subsequent
>>> success. For our task, we provide extensive amounts of textual data from
>>> both the OMT and the MIX, paired with IQ scores and high school grades
>>> (MIX) and labels (OMT).
>>> With this task, we aim to foster research within this context. The task
>>> focuses on classifying German psychological text data to predict the IQ
>>> scores and high school grades of college applicants, as well as performing
>>> speaker identification based on the same image descriptions.
>>> *Tasks*
>>> This shared task consists of two subtasks, described below. Participants
>>> are free to participate in either one of them or both.
>>> *- Subtask 1*: Prediction of Intellectual Ability. The task is to
>>> predict measures of intellectual ability based solely on text. For this,
>>> z-standardized high school grades and IQ scores of college applicants are
>>> summed and globally ranked. The goal of this subtask is to reproduce this
>>> ranking; systems are evaluated by the Pearson correlation coefficient
>>> between the system ranking and the gold ranking.
>>> For the final results, participants of this shared task will be provided
>>> with the MIX_text only and are asked to reproduce the ranking of each
>>> student relative to all students in a collection (i.e., within the test
>>> set).
>>> The data is delivered in two files, one containing participant data,
>>> the other containing sample data, each being connected by a student ID. The
>>> rank in the sample data reflects the averaged performance relative to all
>>> instances within the collection (i.e. within train / test / dev), which is
>>> to be reproduced for the task.
>>> *- Subtask 2*: Classification of the Operant Motive Test (OMT). Operant
>>> motives are unconscious intrinsic desires that can be measured by implicit
>>> or operant methods, such as the Operant Motive Test (OMT) (Kuhl and
>>> Scheffer, 1999). During the OMT, participants are asked to write freely
>>> associated texts in response to provided questions and images. An exemplary
>>> illustration can be found in the Data area. Trained psychologists label
>>> these textual answers with one of four motives. The identified motives
>>> allow psychologists to predict behavior, long-term development, and
>>> subsequent success.
>>> For this shared task, participants will be provided with the OMT_text and
>>> are asked to predict the motive and level of each instance. Success will
>>> be measured by the macro-averaged F1-score.
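The macro-averaged F1-score mentioned above computes an F1 per class and then averages with equal weight per class, so rare motive classes count as much as frequent ones. A minimal sketch, with hypothetical placeholder labels rather than the actual OMT motive inventory:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1, averaged with equal class weight."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical motive labels for five instances (placeholders only).
gold_labels = ["A", "A", "L", "M", "F"]
pred_labels = ["A", "L", "L", "M", "F"]
score = macro_f1(gold_labels, pred_labels)
```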
>>> *Data*
>>> Since 2011, the private university of applied sciences NORDAKADEMIE
>>> has administered an aptitude test for college applicants, in which
>>> participants state their high school performance, take an IQ test, and
>>> complete a psychometric test called the Motive Index (MIX). The MIX
>>> measures so-called implicit or operant motives by having participants
>>> answer questions about images like the one displayed below, such as "who
>>> is the main person and what is important for that person?" and "what is
>>> that person feeling?". Furthermore, participants answer the question of
>>> what motivated them to apply.
>>> The data consists of a unique ID per entry, one ID per participant, the
>>> applicants' majors and high school grades, and IQ scores, with one
>>> textual expression attached to each entry. High school grades and IQ
>>> scores are z-standardized for privacy protection. In total there are 2,595
>>> participants, who produced 77,850 unique MIX answers. The shortest textual
>>> answers consist of 3 words, the longest of 42, and on average there are
>>> roughly 15 words per textual answer, with a standard deviation of 8 words.
>>> The available data set has been collected and hand-labeled by
>>> researchers of the University of Trier. More than 14,600 volunteers
>>> participated in answering questions to 15 provided images. The pairwise
>>> annotator intraclass correlation was r = .85 on the Winter scale (Winter,
>>> 1994). The length of the answers ranges from 4 to 79 words with a mean
>>> length of 22 words and a standard deviation of roughly 12 words.
>>> Submissions for the validation set are accepted via the Codalab page and
>>> published on a leaderboard from January 1st. From May 1st, we will start
>>> the final evaluation phase of the task by providing the gold labels of the
>>> validation set, which can be used as additional training data.
>>> Additionally, the test set samples will be provided, for which we accept
>>> submissions until June 1st.
>>> More information can be found on the task's webpage:
>>> https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-psychopred.html
>>> *Important Dates*
>>> - 01-Dec-2019: Release of trial data and systems
>>> - 01-Jan-2020: Release of training data (train + validation)
>>> - 08-May-2020: Release of test data
>>> - 01-Jun-2020: Final submission of test results
>>> - 03-Jun-2020: Submission of description paper
>>> - 04-11-Jun-2020: Peer reviewing: participants are expected to review
>>> other participants' system descriptions
>>> - 12-Jun-2020: Notification of acceptance and reviewer feedback
>>> - 18-Jun-2020: Camera-ready deadline for system description papers
>>> - 23-Jun-2020: Workshop in Zurich, Switzerland at the KONVENS 2020 and
>>> SwissText joint conference
>>> The shared task will be accompanied by a pre-conference workshop of the
>>> Conference on Natural Language Processing ("Konferenz zur Verarbeitung
>>> natürlicher Sprache", KONVENS) hosted on June 23, 2020, in Zürich
>>> (https://swisstext-and-konvens-2020.org/).
>>> *Workshop Proceedings*
>>> Description papers will appear in online workshop proceedings.
>>> Participants who submit a description paper will be asked to register at
>>> the workshop and present their system as a poster or in an oral
>>> presentation (depending on the number of submissions).
>>> *Organizers*
>>> The shared task is organized by Dirk Johannßen, Chris Biemann, Steffen
>>> Remus and Timo Baumann from the Language Technology group of the University
>>> of Hamburg (https://www.inf.uni-hamburg.de/en/inst/ab/lt/home.html),
>>> as well as David Scheffer from NORDAKADEMIE Elmshorn, Nicola Baumann
>>> from the Universität Trier, and Gudula Ritz from Impart GmbH
>>> (Germany).
>>> *GermEval*
>>> GermEval is a series of shared task evaluation campaigns that focus on
>>> Natural Language Processing for the German language. GermEval has been
>>> conducted four times since 2014 in co-location with KONVENS/GSCL
>>> conferences. For an overview of the currently conducted tasks, visit
>>> https://swisstext-and-konvens-2020.org/shared-tasks/
>>> --
>>> Dirk Johannßen
>>> Universität Hamburg
>>> Department of Informatics
>>> Language Technology Group (LT)
>>> Vogt-Kölln-Straße 30
>>> 22527 Hamburg
>>> Room: F-412
>>> johannssen at informatik.uni-hamburg.de
>>> http://lt.informatik.uni-hamburg.de
>>> http://www.uni-hamburg.de
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> https://mailman.uib.no/listinfo/corpora
> --
> Emily M. Bender (she/her)
> Howard and Frances Nostrand Endowed Professor
> Department of Linguistics
> Faculty Director, CLMS
> University of Washington
> Twitter: @emilymbender
