Monday, 29 October 2018

Top 5 CLEF 2018

A view of the Palais des Papes, in Avignon



All directions lead to CLEF

The 2018 edition of the Cross-Language Evaluation Forum was held in Avignon last September. The venue was awesome: a lovely city guarded by its medieval wall and marked by the Popes' palace. Here is my selection of the top research, covering the conference track and some of the nine labs.

Disclaimer: as usual, this article is biased towards my current interests and my taste!

Arriving at the rooms for CLEF was not trivial, but we made it to the large, warm room and to the cloister where the posters were exhibited (even in combination with a wine-tasting session!).

 

Top 1: CheckThat! lab

The CheckThat! lab proposed two tasks around political debates. In the first task, a system had to prioritise the claims in a political debate according to their check-worthiness. In the second, it had to determine whether a claim was factually true, half-true, or false.

For the first task, participants used various supervised models (e.g., MLPs, SVMs) on a variety of features. The most successful models used bags of words, POS tags, verbal forms, negations, and word embeddings, among others. Indeed, Hansen et al.'s contrastive model performed best (the task allowed one primary and two contrastive submissions): a recurrent neural network with attention fed with word embeddings, dependencies, and POS tags. The task was also available in Arabic, although this language attracted less attention. Ghanem et al. approached it using the LIWC lexicons, after machine-translating them into Arabic.
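
To give a flavour of such feature-based approaches, here is a minimal sketch of a check-worthiness ranker using TF-IDF bag-of-words features and a linear SVM whose decision scores order the sentences. It is an illustrative baseline on made-up data, not any participant's actual system.

# Minimal check-worthiness ranking sketch (illustrative only, not a CLEF system).
# Debate sentences are scored with a linear SVM over TF-IDF bag-of-words
# features and then ranked by decision score, most check-worthy first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data: 1 = check-worthy claim, 0 = not check-worthy.
train_sentences = [
    "Unemployment fell by 3 percent last year.",
    "Our opponent voted against the bill four times.",
    "Good evening everyone, thank you for having me.",
    "Let me tell you a story about my hometown.",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X_train = vectorizer.fit_transform(train_sentences)

classifier = LinearSVC()
classifier.fit(X_train, train_labels)

# Rank new debate sentences by the SVM margin (higher = more check-worthy).
debate = [
    "Crime has doubled since they took office.",
    "I am delighted to be here tonight.",
]
scores = classifier.decision_function(vectorizer.transform(debate))
for score, sentence in sorted(zip(scores, debate), reverse=True):
    print(f"{score:+.2f}  {sentence}")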

For the second task, supervised models were used again, this time after retrieving supporting information from the Web. The model of Wang et al. performed best, by a clear margin. One of the main reasons for this difference is that they added 4k extra instances from PolitiFact to train their model. They used a neural network (and an SVM) in which the claim is fed concatenated with the evidence from five retrieved snippets. Another interesting aspect of this model is that they dropped the intermediate instances from PolitiFact, after finding that these were not helping the model. This idea of dropping ambiguous intermediate instances is in line with the findings of Barrón-Cedeño et al.: intermediate classes, which often intend to absorb doubtful instances, do not help.
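
As a rough sketch of the general recipe (claim concatenated with retrieved evidence, fed to a supervised classifier, optionally dropping the intermediate class), here is a toy example; the Web retrieval step is replaced by placeholder snippets and nothing here reproduces Wang et al.'s actual model.

# Sketch of evidence-based claim verification (illustrative, not Wang et al.'s model):
# each claim is concatenated with retrieved snippets and fed to a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DROP_INTERMEDIATE = True  # drop "half-true", in line with the finding reported above

# Hypothetical training claims with already-retrieved snippets (placeholders).
examples = [
    ("The country has the lowest taxes in its history.",
     ["Tax rates reached historical lows in the 1950s."], "false"),
    ("The law was passed in 2009.",
     ["The act was signed into law in 2009."], "true"),
    ("Exports grew somewhat last year.",
     ["Export figures show mixed results."], "half-true"),
]
if DROP_INTERMEDIATE:
    examples = [e for e in examples if e[2] != "half-true"]

# Concatenate each claim with its evidence snippets into a single input string.
texts = [claim + " [SEP] " + " ".join(snippets) for claim, snippets, _ in examples]
labels = [label for _, _, label in examples]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

new_claim = "The law was passed in 2009."
new_snippets = ["Congress approved the bill in 2009."]
print(model.predict([new_claim + " [SEP] " + " ".join(new_snippets)]))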

Next year, CheckThat! will focus on the same two tasks, but the second one will be divided into its components: ranking relevant documents, evaluating their usefulness for judging a claim, and actually labelling the claim. (It is worth noting that, when Noriko Kando described the QA Lab PoliInfo task at NTCIR, she stressed its similarity to CheckThat!: it is about summarising political arguments as supportive or not, in Japanese.)
The room right before starting the CheckThat! session

Top 2: PAN lab

This year, and indeed since 2016, PAN has focused entirely on authorship analysis. This makes sense as, according to the numbers presented, around 1k papers on this topic have been published over the last ten years. Three tasks were offered this year.

Task 1. Author identification, divided into cross-domain authorship attribution and style change detection (i.e., identifying single- vs multi-authored documents). The former used fan-fiction material, and most well-performing models used SVMs on n-grams. Whereas this year the setting was closed, next year it will be open, including "unknown" as a possible class. In the latter task, that of identifying whether a text was written by a single author or by several, participants used sliding windows of various lengths and a variety of features, including the average word frequency class. Custódio and Paraboni used an SVR on fixed n-grams and what they call distortion n-grams: apparently, they substitute everything but punctuation, spaces, and diacritised characters.
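
Here is a minimal sketch of that distortion idea combined with character n-grams and an SVM; the masking rule and the toy corpus are my own assumptions, not Custódio and Paraboni's implementation.

# Sketch of authorship attribution on "distorted" character n-grams
# (inspired by the idea above; not Custódio and Paraboni's actual system).
# The distortion masks letters and digits, keeping punctuation, spaces and
# diacritised (non-ASCII) characters, so mostly style-bearing symbols remain.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def distort(text: str) -> str:
    # Replace plain ASCII letters and digits with '*', keep everything else.
    return re.sub(r"[A-Za-z0-9]", "*", text)

# Hypothetical toy corpus: two candidate authors, one document each.
docs = ["Well, I suppose it was inevitable...", "She said: no, never again!"]
authors = ["author_A", "author_B"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vectorizer.fit_transform(distort(d) for d in docs)

clf = LinearSVC().fit(X, authors)
unknown = "Well... I said it was fine, I suppose."
print(clf.predict(vectorizer.transform([distort(unknown)])))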

Task 2. Author profiling on Twitter. It was offered in Arabic, English, and Spanish, and included images.

Task 3. Author obfuscation; i.e., paraphrasing a text to cause an authorship identification model to fail. The organisers evaluated different authorship attribution models on texts before and after being obfuscated by the participants. Kocher and Savoy applied a set of rules, mostly to substitute conjunctions, prepositions, adverbs, and punctuation.
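
As a toy illustration of such rule-based obfuscation, the snippet below substitutes a few invented function words and punctuation marks; it is not Kocher and Savoy's actual rule set.

# Toy rule-based author obfuscation in the spirit of the approach described
# above; the substitution table is invented for the example.
import re

SUBSTITUTIONS = {
    "but": "however",      # conjunctions
    "because": "since",
    "on": "upon",          # prepositions
    "very": "really",      # adverbs
    ";": ",",              # punctuation
}

def obfuscate(text: str) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(0)
        return SUBSTITUTIONS.get(token.lower(), token)
    # Replace whole words and the listed punctuation marks only.
    pattern = r"\b(?:" + "|".join(w for w in SUBSTITUTIONS if w.isalpha()) + r")\b|;"
    return re.sub(pattern, replace, text, flags=re.IGNORECASE)

print(obfuscate("I left early because it was very late; but nobody noticed."))
# -> "I left early since it was really late, however nobody noticed."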

As has been the tradition since 2013, only software submissions were allowed. In 2019, PAN will include a task on discriminating between bot and human Twitter accounts.
Paolo and Kiko, giving the certificate to the best-performing team in PAN task 1

Top 3: the keynotes

Julio Gonzalo talked about bias in system evaluation. Throughout his talk, he played a game about the differences between humans and computer scientists. He argued that people (humans) do not understand the difference between correlation and causality, and that nowadays research is driven by publishability rather than curiosity. He also expressed his concerns about current evaluation settings. For instance, tie-breaking is not well defined, which makes evaluation ecosystems fragile: a tiny change in these decisions can change scores significantly and, what is worse, can change participant rankings (a toy illustration of this appears after the list below). According to Julio, we should do the following to improve our research ecosystem:
  • encourage the publication of negative results, text collections, and replication results;
  • reject the typical blind application of machine learning package X to problem Y;
  • request authors to test their proposed models on at least three collections; and 
  • publish the procedure (e.g., software) and not simply text (paper).
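
As a toy illustration of the tie-breaking point above, the following snippet ranks three hypothetical teams with two equally reasonable tie-breaking rules and obtains two different leaderboards; the scores and timestamps are invented.

# Toy illustration of how tie-breaking alone can reshuffle a leaderboard
# (scores and submission times are made up, not from any actual evaluation).
scores = {"team_A": 0.71, "team_B": 0.68, "team_C": 0.68}
submitted = {"team_A": "10:02", "team_B": "11:45", "team_C": "09:30"}

# Same scores, two "reasonable" tie-breaking rules, different rankings.
by_name = sorted(scores, key=lambda t: (-scores[t], t))
by_submission = sorted(scores, key=lambda t: (-scores[t], submitted[t]))

print(by_name)        # ['team_A', 'team_B', 'team_C']
print(by_submission)  # ['team_A', 'team_C', 'team_B']
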
He promoted UNED's evaluation platform, highlighted voting mechanisms that take participants' biases into account and try to turn them into wisdom, and mentioned the paper by Cañamares and Castells at SIGIR 2018.

Julio Gonzalo's research incentive bias
Gabriella Pasi talked about the evaluation of both search engines and recommender systems. She stressed two things: (i) in recommender systems not only similarity but also dissimilarity is important, and (ii) nDCG and MAP are often reported as if they were independent, but they are not: they correlate. Besides this, she argued that precision and recall are often overshadowed by combining them into the F-measure, even though they are informative by themselves!
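
As a tiny, made-up illustration of her point, two systems can share the same F-measure while behaving very differently:

# Two hypothetical systems with identical F1 but opposite precision/recall
# profiles, illustrating why reporting only F can hide relevant information.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

system_1 = (0.90, 0.50)  # high precision, low recall
system_2 = (0.50, 0.90)  # low precision, high recall

for name, (p, r) in [("system_1", system_1), ("system_2", system_2)]:
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1(p, r):.3f}")
# Both print F1=0.643, yet the systems behave very differently.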

Top 4: CENTRE Lab

This is a joint lab that brings together CLEF, NTCIR, and TREC. The main objective is to repeat outcomes from the literature, even under different conditions. They are interested in three aspects of research in our field, which are tied to ACM's badging policy:
  1. Replicability - Implement an identical algorithm, apply it to an identical dataset, and reproduce the reported results.
  2. Reproducibility - Apply the same algorithm to a different dataset. 
  3. Generalizability - Predict the performance of a model on a fresh dataset.
The session started with a keynote by Martin Potthast on improving the reproducibility of computer science. He stressed what we know well: many details and technical issues are missing from research papers (the typical "not enough details in the paper"). The same happens with data. On a personal note, it is even worse when data is released but incomplete, as it creates the false impression that everything necessary is available when it is not. Still on partial releases, he stressed the issue of code: sometimes it is released, but without any documentation (perhaps due to the typical big-company claim that a person should be able to understand code on their own). Among further pointers, he referred to V. Stodden's top barriers to sharing content, to DKPro Lab, and to their own TIRA.

They welcomed our feedback! Participants will replicate the best of 2018
Unfortunately, only one participant took up the challenge this year, but the organisers are re-focusing and recharging batteries for next year. Instead of choosing some interesting papers, the best papers from this year's tasks that run again next year will be included. There is a form where people can propose papers for next year.


Top 5: other labs and the conference

A number of other interesting tasks took place.

The Early Risk Prediction on the Internet lab aimed at developing models to identify whether an individual suffers from depression or anorexia on the basis of their online posts. An interesting aspect of this task is the chronological prediction: the organisers released weekly data dumps with new posts from the different users (somewhat similar to the temporal summarisation track at TREC). Hence, a system could either determine that a person was suffering from one of the conditions or wait to see more data.
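
A minimal sketch of that incremental decision loop could look as follows, with an invented cue lexicon and threshold standing in for a real model.

# Sketch of the chronological decision setting: posts arrive in weekly chunks
# and the system may either emit a decision or wait for more evidence.
# The scoring function and threshold below are invented for illustration.
from typing import Iterable, List, Optional

RISK_WORDS = {"hopeless", "worthless", "exhausted"}  # hypothetical cue lexicon
THRESHOLD = 2  # hypothetical: emit an alert once enough cues accumulate

def score(post: str) -> int:
    return sum(word in post.lower() for word in RISK_WORDS)

def process_weekly_dumps(weekly_posts: Iterable[List[str]]) -> Optional[str]:
    accumulated = 0
    for week, posts in enumerate(weekly_posts, start=1):
        accumulated += sum(score(p) for p in posts)
        if accumulated >= THRESHOLD:
            return f"alert emitted at week {week}"  # decide early
        # otherwise: wait and read the next weekly dump
    return None  # no decision by the end of the stream

weeks = [["Nice day today."], ["Feeling exhausted lately."],
         ["Everything seems hopeless."]]
print(process_weekly_dumps(weeks))  # -> alert emitted at week 3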

ImageCLEF focused on three tasks: (i) visual question answering in the medical domain, (ii) ImageCLEFcaption, and (iii) ImageCLEFtuberculosis. They used CrowdAI, which they described as an open-source Kaggle, to run the lab. An offshoot of ImageCLEF is LifeCLEF, which is interested in biodiversity, animal identification, and geolocation prediction.

At the main conference, John Samuel presented Analyzing and Visualizing Translation Patterns of Wikidata Properties. He stressed the use of Wikidata to generate Wikipedia content, an ongoing initiative in the Catalan and Basque Wikipedia editions, among others. He also mentioned wdprop, a toolkit for "understanding and improving multilingual and collaborative ontology development on Wikidata", which he claims can be used both for vandalism detection and for recommendation. Finally, he mentioned an interesting paper: Linguistic influence patterns within the global network of Wikipedia language editions.
They claimed they had been jamming. It was hard to believe!


One of the local jazz bands

Extra: the cultural program

CLEF 2018 featured a plethora of cultural activities, including tango and theatre lessons... and a lot of music, all at the Théâtre des Halles. The concerts are worth mentioning, as they combined local jazz bands and our own Enrique Amigó, Victor Fresno, and Julio Gonzalo.


What is coming next year

The 20th edition of CLEF will be held in Lugano. Among the novelties:
  • Demos will be integrated into the program.
  • Various labs will return with the same or similar tasks, among them PAN, ImageCLEF, and CheckThat!.
  • A new lab, ProtestNews, is being organised. Its purpose is to determine whether a news article covers a riot, a demonstration, or, in general, social movements.
Ali Hurriyetoglu presenting ProtestNews
 
That was the end of CLEF this year and, as usual, I closed with some good local food.

Last CLEF 2018 supper at L'Épicerie
