Monday, 29 October 2018

Top 5 CLEF 2018

A view of the Palais des Papes, in Avignon



All directions lead to CLEF

The 2018 edition of the Cross-Language Evaluation Forum was held in Avignon last September. The venue was awesome: the city is guarded by its medieval wall and dominated by the Popes' palace. Here is my selection of top research, including the conference track and some of the nine labs.

Disclaimer: as usual, this article is biased towards my current interests and my taste!

Arriving at the CLEF rooms was not trivial, but we made it to the large, warm room and to the cloister where the posters were exhibited (even in combination with a wine tasting session!).

 

Top 1: CheckThat! lab

The CheckThat! lab proposed two tasks around political debates. In the first task, a system had to prioritise the claims in a political debate according to their check-worthiness. In the second task, a system had to determine if a claim was factually true, half-true, or false. 

For the first task, participants used various supervised models (e.g., MLPs, SVMs) on a wide variety of features. The most successful models used bags of words, POS tags, verbal forms, negations, and word embeddings, among others. Indeed, Hansen et al.'s contrastive model performed best (the task allowed for one primary and two contrastive submissions): a recurrent neural network with attention that took word embeddings, dependencies, and POS tags as input. The task was available in Arabic as well, although this language attracted less attention. Ghanem et al. approached it using the LIWC lexicons, after machine-translating them into Arabic.
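To make the flavour of such systems concrete, here is a minimal sketch (mine, not Hansen et al.'s code) of a recurrent scorer with attention over word embeddings; in a real system the token representations would be enriched with POS and dependency features, and the per-claim scores would be used to rank the debate sentences by check-worthiness.

```python
# A minimal sketch of a check-worthiness scorer: a bidirectional GRU with
# additive attention over token states, producing one score per sentence.
# Randomly initialised embeddings keep the example self-contained.
import torch
import torch.nn as nn

class CheckWorthinessScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)   # attention weights per token
        self.output = nn.Linear(2 * hidden_dim, 1)      # check-worthiness score

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded sentences
        states, _ = self.encoder(self.embedding(token_ids))     # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attention(states), dim=1)  # (batch, seq, 1)
        context = (weights * states).sum(dim=1)                 # weighted sentence vector
        return torch.sigmoid(self.output(context)).squeeze(-1)  # score in [0, 1]

# Toy usage: score two (padded) sentences and rank them by check-worthiness.
model = CheckWorthinessScorer(vocab_size=5000)
batch = torch.randint(1, 5000, (2, 20))
print(model(batch))
```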

For the second task, supervised models were used again, this time after gathering supporting information from the Web. The model of Wang et al. performed the best, by a clear margin. One of the main reasons for this difference is that they added 4k extra instances from PolitiFact to train their model. They used a NN (and an SVM) in which the claim is fed in concatenated with the evidence from 5 retrieved snippets. Another interesting aspect of this model is that they opted for dropping the intermediate instances from PolitiFact, after finding out that these were not contributing properly to the model. This idea of dropping ambiguous intermediate instances is in line with the findings of Barrón-Cedeño et al.: intermediate classes, which often intend to absorb doubtful instances, do not help.
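As a hedged illustration of that setup (not Wang et al.'s actual architecture), the sketch below concatenates a claim with a handful of retrieved snippets and feeds the result to a supervised classifier; an SVM over TF-IDF features stands in for their neural network, and the snippets are assumed to come from a prior Web search step.

```python
# Claim + retrieved evidence, fed together to a supervised veracity classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def claim_with_evidence(claim, snippets, k=5):
    """Concatenate a claim with the text of its top-k supporting snippets."""
    return claim + " " + " ".join(snippets[:k])

# Toy training data: (claim, snippets, label) triples; labels are invented.
train = [
    ("The unemployment rate doubled last year",
     ["Official figures show a small decrease in unemployment"], "false"),
    ("The law was passed in 2009",
     ["The bill was signed into law in 2009"], "true"),
]
X = [claim_with_evidence(c, s) for c, s, _ in train]
y = [label for _, _, label in train]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(X, y)
print(clf.predict([claim_with_evidence("The law was passed in 2009",
                                        ["It was signed into law in 2009"])]))
```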

Next year CheckThat! will focus on the same two tasks, but the second one will be divided into its different components: ranking relevant documents, evaluating their usefulness for judging a claim, and actually labeling the claim. (It is worth noting that when Noriko Kando described the QA Lab PoliInfo task at NTCIR, she stressed that it is similar to CheckThat!: it is about summarising political arguments as supportive or not, in Japanese.)
The room right before starting the CheckThat! session

Top 2: PAN lab

This year, and indeed since 2016, PAN focused 100% on authorship analysis. This makes sense as, according to the numbers presented, ~1k papers on this topic have been published over the last ten years.  Three tasks were offered this year.

Task 1. Author identification, divided into cross-domain authorship attribution and style change detection (i.e., telling single-authored from multi-authored documents). The former used fan fiction material, and most of the well-performing models used SVMs on n-grams. Whereas this year the setting was closed, next year it will be open, including "unknown" as a possible class. In the latter task, that of identifying whether a text has been written by a single author or by multiple ones, participants used sliding windows of various lengths and a variety of features, including the average word frequency class. Custódio and Paraboni used an SVR on fix n-grams and what they call distortion n-grams: apparently, they mask everything but punctuation, spaces, and diacritised characters.
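The distortion idea can be illustrated in a few lines of Python (my reading of the description above, not Custódio and Paraboni's code): every plain alphanumeric character is masked, so that the character n-grams fed to the regressor capture style rather than topic.

```python
# Text "distortion": keep punctuation, whitespace and diacritised characters,
# mask everything else, then extract character n-grams from the result.
import unicodedata

def distort(text, mask="*"):
    out = []
    for ch in text:
        if ch.isspace() or not ch.isalnum():
            out.append(ch)                      # keep spaces and punctuation
        elif unicodedata.decomposition(ch):     # diacritised letters (é, ñ, ...)
            out.append(ch)
        else:
            out.append(mask)                    # mask plain letters and digits
    return "".join(out)

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "¡Qué rápido escribió el autor, sin dudarlo!"
print(distort(sample))
print(char_ngrams(distort(sample), n=3)[:5])
```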

Task 2. Author profiling on Twitter. It was offered in Arabic, English, and Spanish, and included images.

Task 3. Author obfuscation; i.e., paraphrasing a text to cause an authorship identification model to fail. They evaluated different authorship attribution models on texts before and after being obfuscated by the participants. Kocher and Savoy applied a bunch of rules, mostly to substitute conjunctions, prepositions, adverbs, and punctuation.

As has been the tradition since 2013, only software submissions were allowed. In 2019, PAN will include a task on discriminating between bot and human Twitter accounts.
Paolo and Kiko, giving the certificate to the best-performing team in PAN task 1

Top 3: the keynotes

Julio Gonzalo talked about bias in system evaluation. During his talk, he continuously played a game about the differences between humans and computer scientists. He argued that people (humans) do not understand the difference between correlation and causality, and that nowadays research is driven by publishability rather than curiosity. He expressed his concerns about current evaluation settings. For instance, tie breaking is not well defined, which makes evaluation ecosystems fragile: a tiny change in these decisions can change scores significantly and, what is worse, can change participant rankings (see the toy example below). According to Julio, we should do the following to improve our research ecosystem: 
  • encourage the publication of negative results, text collections, and replication results;
  • reject the typical blind application of machine learning package X to problem Y;
  • request authors to test their proposed models on at least three collections; and 
  • publish the procedure (e.g., software) and not simply text (paper).
He promoted UNED's evaluation platform, highlighted voting mechanisms that take the participants' bias into account and try to turn it into wisdom, and mentioned the paper by Cañamares and Castells at SIGIR 2018.
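To see how fragile tie breaking can make a ranking-based metric, here is a toy example (mine, not from the talk): two documents receive the same retrieval score, and average precision changes substantially depending on which one the evaluation script happens to place first.

```python
# Average precision under two different tie-breaking decisions for d1 and d2,
# which are assumed to have received identical retrieval scores.
def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant)

relevant = {"d1"}
print(average_precision(["d1", "d2", "d3"], relevant))  # tie broken in favour of d1 -> 1.0
print(average_precision(["d2", "d1", "d3"], relevant))  # same scores, other order -> 0.5
```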

Julio Gonzalo's research incentive bias
Gabriella Pasi talked about the evaluation of both search engines and recommender systems. She stressed two things: (i) in recommender systems not only similarity but also dissimilarity is important, and (ii) nDCG and MAP are often reported as if they were independent, but they are not (they correlate). Besides this, she argued that precision and recall are often shadowed by combining them into the F-measure, even though they are relevant by themselves!
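A tiny numeric example of that last point: two systems with opposite precision/recall trade-offs end up with exactly the same F1, which is why reporting only the combined measure can hide what is actually going on.

```python
# Two very different systems, one identical F1 score.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.90, 0.30))   # precise but low-recall system  -> 0.45
print(f1(0.30, 0.90))   # high-recall but noisy system   -> 0.45
```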

Top 4: CENTRE Lab

This is a joint lab that brings together CLEF, NTCIR, and TREC. The main objective is to repeat outcomes from the literature, even if under different conditions. They are interested in three aspects of research in our field, which are tied to ACM's badging policy:
  1. Replicability - Implement an identical algorithm, apply it to an identical dataset, and reproduce the reported results.
  2. Reproducibility - Apply the same algorithm to a different dataset. 
  3. Generalizability - Predict the performance of a model on a fresh dataset.
The session started with a keynote by Martin Potthast on improving the reproducibility of computer science. He stressed what we know well: many details and technical issues are missing from research papers (the typical "not enough details in the paper"). The same happens with data. On a personal note, it is even worse when data is released but incomplete, as it produces the false sensation that everything necessary is available when it is not. Still on partial releases, he stressed the issue of code: sometimes it is released, but without any kind of documentation (perhaps due to the typical big-company claim that a person should be able to understand code on their own). Among additional pointers, he referred to V. Stodden's top barriers to sharing content, to DKPro Lab, and to their own TIRA.

They welcomed our feedback! Participants will replicate the best of 2018
Unfortunately, only one participant took the challenge this year, but they are re-focusing and recharging batteries for next year. Instead of choosing some interesting papers, the best papers from this year's tasks that will run again next year will be included. There is a form where people can propose papers for next year.


Top 5: other labs and the conference

A number of other interesting tasks took place.

The Early Risk Prediction on the Internet lab aimed at developing models to identify whether an individual suffers from depression or anorexia on the basis of his/her online posts. An interesting aspect of this task is that prediction is chronological: they released weekly data dumps with new posts from the different users (somehow similar to the temporal summarisation track at TREC). Hence, the system could either determine that a person was suffering from one of the ailments or wait to see more data.
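A rough sketch of that chronological setting is below (my own illustration, not an eRisk baseline): after each weekly dump the system either emits a decision for a user or waits for more evidence. The trigger-word scorer and the thresholds are placeholders for a real classifier.

```python
# Early-risk loop: accumulate weekly posts, decide as soon as confidence is high.
def risk_score(posts):
    """Placeholder scorer: fraction of posts containing a trigger word."""
    triggers = {"hopeless", "worthless", "starving"}
    flagged = sum(any(t in p.lower() for t in triggers) for p in posts)
    return flagged / max(len(posts), 1)

def early_risk_decision(weekly_dumps, high=0.6, low=0.2):
    history = []
    for week, posts in enumerate(weekly_dumps, start=1):
        history.extend(posts)
        score = risk_score(history)
        if score >= high:
            return ("at risk", week)      # confident enough to decide early
        if score <= low and week == len(weekly_dumps):
            return ("not at risk", week)  # only decide "negative" at the end
    return ("undecided", len(weekly_dumps))

dumps = [["I feel fine today", "everything is hopeless"],
         ["I feel worthless", "I am starving"]]
print(early_risk_decision(dumps))   # waits one week, then decides
```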

ImageCLEF focused on three tasks: (i) visual question answering in the medical domain, (ii) ImageCLEF caption, and (iii) ImageCLEF tuberculosis. They used CrowdAI, which they described as an open-source Kaggle, to run the lab. A branch of ImageCLEF is LifeCLEF, which is interested in biodiversity, animal identification, and geolocation prediction.

At the main conference, John Samuel presented Analyzing and Visualizing Translation Patterns of Wikidata Properties. There, he stressed the ongoing effort to turn Wikidata into Wikipedia contents ---an initiative already running in the Catalan and the Basque Wikipedia editions, among others. He also mentioned wdprop, a toolkit for "understanding and improving multilingual and collaborative ontology development on Wikidata", and claimed it can be used both for vandalism detection and for recommendation. Finally, he mentioned an interesting paper: Linguistic influence patterns within the global network of Wikipedia language editions.
They claimed they had been jamming. It was hard to believe!


One of the local jazz bands

Extra: the cultural program

CLEF 2018 featured a plethora of cultural activities, including tango and theatre lessons... and a lot of music, everything at the Théâtre des Halles. The concerts were worth mentioning, as they combined local jazz bands and our own Enrique Amigó, Victor Fresno, and Julio Gonzalo.


What is coming next year

The 20th edition of CLEF will be held in Lugano. Among the novelties:
  • Demos will be integrated into the program.
  • Various labs will appear again with the same or similar tasks; among them, PAN, ImageCLEF, and CheckThat!.
  • A new lab called ProtestNews is being organised. Its purpose is to determine whether a news article covers a riot, a demonstration or, in general, social movements.
Ali Hurriyetoglu presenting ProtestNews
 
That was the end of CLEF this year and, as usual, I closed with some good local food.

Last CLEF 2018 supper at L'Épicerie

Wednesday, 18 April 2018

Top 5 ECIR 2018

The 40th edition of the European Conference on Information Retrieval was held in late March in Grenoble, France. Besides the amazing landscape of the French Alps, the great French cuisine, and awesome craft beer, the conference had a fine selection of up-to-date IR research. Here is my selection of the top 5 research outcomes and events.

Disclaimer: this article is biased towards my current interests and my taste!

A view of Place Victor Hugo in Grenoble

 

 Top 1. The NewsIR Workshop

Researchers from a couple of companies (Signal Media and Factmata) and three universities (UNED, U. of Chile, and U. of Sheffield) organised the second edition of this workshop on the trends of news-focused IR. The one-day workshop came in the form of short 10-minute talks (with a poster session), two keynotes, a round table, and a discussion session.

During the opening session, the evolution of corpora in this genre was stressed, from the early Reuters 1997 collections up to the Signal 1M corpus, with 1M entries from online newspapers and blogs. The ongoing efforts on analysing media bias and verification, as well as the difference between delivering news and delivering actionable intelligence, were also stressed.

In his keynote titled AI & Automated News: Implications on Trust, Bias, and Credibility, Edgar Meij (Bloomberg) started by giving some big numbers about Bloomberg: 500 stories/second, and more news reporters on staff than The New York Times, The Washington Post, and The Chicago Tribune together. He discussed systems for the automatic generation of news reports, including companies such as Automated Insights. According to Meij, most of these models are based on templates: a news story is templated and the specific information from an event (e.g., a match) is filled in to produce the article. Ongoing efforts, such as the natural language generation conference, are trying to go beyond this approach. This is a hot topic, and recent research tries to explore how the general public perceives machine-generated reports, including work by Graefe et al. (2016) and Wolker and Powell (2018). Bloomberg is now trying to generate documentary-like videos out of tweets, something that resembles Qlusty.
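As a toy illustration of that template-filling approach (my own minimal example, nothing to do with Bloomberg's actual system), a fixed story template is populated with the structured facts of an event:

```python
# Template-based generation: the story skeleton is fixed, only the facts vary.
MATCH_TEMPLATE = (
    "{home} beat {away} {home_goals}-{away_goals} on {date}. "
    "{scorer} scored the decisive goal in the {minute}th minute."
)

event = {
    "home": "Avignon FC", "away": "Grenoble United",
    "home_goals": 2, "away_goals": 1,
    "date": "12 September 2018", "scorer": "Martin", "minute": 87,
}

print(MATCH_TEMPLATE.format(**event))
```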

In the second keynote, “Every tool is better than nothing”?: The use of dashboards in journalistic work, Peter Tolmie discussed some technology for journalism. Part of the talk described the outcomes of the Pheme project on assessing the veracity of claims online. Pheme generated many resources on rumours and social media analysis, with a multilingual emphasis.

The people from the Webis group were the most active in this edition of NewsIR and offered three talks:
  • In A Plan for Ancillary Copyright, Potthast et al. discussed the ongoing legal situation in Germany which, apparently following in the steps of Spain, is trying to regulate how search engines and other websites re-use contents generated by others (do not forget that such regulations caused the end of Google News in Spain, among other consequences). This kind of initiative tries to guarantee that only news publishers are allowed to publish their text (not even snippets could be shown in search engines, and definitely not for commercial use). Beyond the national initiatives, the EC is debating an ancillary copyright, and the JRC carried out a study about it (but it was not published because the conclusions were not what the EC expected!). The fact seems to be that Taraborelli (2015)'s reuse paradox worries the large content-generation companies: a good snippet returned after a search could be enough to fulfill the user's information need and hence prevent her from actually visiting the website (thus harming its income). It seems like this could actually harm relatively small companies.
The Clash of Titans in Ancillary Copyright
  • In Shaping the Information Nutrition Label, they discussed the different dimensions one could show to the user to characterise an article. Taking advantage of the "typical" dimensions used to describe the nutritional values of food (e.g., carbohydrates, sugar, proteins), they discussed quality dimensions, such as readability or verbosity, and proposed an iconography to display them. This research builds upon Norbert Fuhr et al. (2017)'s An Information Nutritional Label for Online Documents. Interestingly, they link these dimensions to Aristotle's categories of perception.
A snapshot of Shaping the Information Nutrition Label to display the quality of a document

  • Finally, in Cross-Reading News, S. Syed et al. described a system to assist journalists in their work. The system allows a journalist to search based on named entities and select topics to obtain an automatically-generated summary (an article) and a candidate title. This generated text is then shown to the journalist for editing and is eventually sent for review.
In Visualizing Polarity-based Stances of News Websites, M. Yoshioka et al. focused on the 2016 US election campaign and tried to identify the polarity of news articles towards H. Clinton and D. Trump. They relied on the Google-supported GDELT (also using its positive, neutral, and negative judgments as gold standard) to build their dataset and performed document-level judgments on it.

In Qlusty: Quick and Dirty Generation of Event Videos from Written Media Coverage, A. Barrón-Cedeño et al. presented some preliminary efforts to identify events, diversify their points of view, and present them to the user as a short overview video. For event identification they used DBSCAN clustering on doc2vec representations. For diversification they played with the DBSCAN-generated clusters and ranked the articles according to their distance to the rest of the cluster elements. Here is an instance of the generated videos (a new blog entry will soon arrive with further details about Qlusty).
The poster about Qlusty (starring T. Brady, G. Bündchen, and the Got Süt ad).
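A schematic version of that pipeline follows, under two assumptions of mine: the doc2vec vectors are already computed (clustered random vectors stand in for them here), and "most central articles first" is a reasonable reading of the diversification step.

```python
# Cluster article vectors with DBSCAN, then rank articles inside each cluster
# by their average cosine distance to the other cluster members.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 100))                        # three synthetic "events"
doc_vectors = np.vstack([c + 0.05 * rng.normal(size=(15, 100)) for c in centers])

labels = DBSCAN(eps=0.3, min_samples=3, metric="cosine").fit_predict(doc_vectors)

for cluster_id in sorted(set(labels) - {-1}):              # -1 is DBSCAN noise
    members = np.where(labels == cluster_id)[0]
    dists = cosine_distances(doc_vectors[members])
    centrality = dists.mean(axis=1)                        # avg distance to the rest
    ranked = members[np.argsort(centrality)]
    print(f"event {cluster_id}: articles {ranked.tolist()}")
```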

In Exploring Significant Interactions in Live News, Schubert et al. described an interesting system that identifies how different entities act together in the news and hence can compose an event. They have a nice running demo to explore what is going on in the news live.

In On Temporally Sensitive Word Embeddings for News Information Retrieval, Taewon Yoon et al. tried to address an interesting question: how often should we update our embeddings? The question comes from the fact that, with new events, vocabulary never seen before appears and might increase the number of OOVs, making it harder for embedding-based retrieval models to catch up (and even for those based on idf-like frequencies). They cite an interesting paper on using the centroids of word embeddings (CentIDF) for QA; a sketch of that idea follows below.
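The centroid trick can be sketched in a few lines (my own illustration, with random vectors standing in for pretrained embeddings and hand-picked idf values): a document becomes the idf-weighted average of its word vectors, OOV words are simply skipped, and matching is done by cosine similarity.

```python
# Represent documents and queries as idf-weighted centroids of word vectors.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["election", "vote", "minister", "goal", "match", "player"]
embeddings = {w: rng.normal(size=50) for w in vocab}       # stand-in for pretrained vectors
idf = {"election": 2.0, "vote": 1.5, "minister": 1.8,
       "goal": 1.2, "match": 1.0, "player": 1.1}

def centroid(text):
    vecs = [idf[w] * embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)  # OOV words are skipped

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

docs = ["the minister called an election", "the player scored a late goal"]
query = "election vote"
print(sorted(docs, key=lambda d: cosine(centroid(query), centroid(d)), reverse=True))
```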

The last standard talk presented the TREC 2018 News Track, which proposes two tasks. In background linking, a model should retrieve articles reported before the current one that provide context to better understand it. In entity ranking, a model should rank the entities in an article according to their importance. Further information is available on the website, and a call for participation should be out by April/May.

The workshop closed with an open discussion and round tables. Multiple topics were mentioned, including robotic generation of articles, virality of news, prediction of events, multi-document summarisation (also using Wikipedia as ground truth), hyperpartisan identification, timeline generation, and opinion similarity. Regarding multilinguality, Andreas Spitz came out with a phrase worth quoting: "no matter how many languages you cover, you won't cover all the opinions".

Not in NewsIR, but in the main conference, Meladianos et al. presented An Optimization Approach for Sub-Event Detection and Summarization in Twitter. In this case, the event is already given, as it comes from a specific time span and associated hashtags. Having this as input, they identify the relevant sub-events (e.g., the most interesting actions in a football match). They build a graph for each tweet (similar to word-level PageRank) and merge them into a larger graph. The minimisation function represents the similarity between the events.

 

Top 2. Community question answering

The problem of selecting appropriate answers and retrieving similar questions in community-driven question and answering forums remains hot.

In Medical Forum Question Classification Using Deep Learning, Raksha Jalan et al. tried to identify the intent of a question. They relied on the ICHI 2016 Healthcare Data Analytics Challenge to run their experiments (8k questions for training, 3k for testing). They crawled medhelp.org to retrieve some extra instances (although they mentioned others, such as Mayo Clinic and WebMD) and generated new supervised data by self-learning; that is, they classified the unlabeled data with a model trained on the labeled data and retrained on the union of both. They referred to throttling balancing with lookups as the strategy they followed, and called this weak supervision. Their model is a bi-LSTM, and the preprocessing of the texts includes hyperlink removal.
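The self-learning loop can be sketched as follows (a compact stand-in, not the authors' exact procedure or their bi-LSTM): train on the labelled questions, label the crawled ones, keep the confident predictions as pseudo-labels, and retrain on the union. The example categories and the confidence threshold are made up for illustration.

```python
# Self-training: pseudo-label the crawled questions and retrain on the union.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled = ["what drug treats migraine", "is this rash contagious",
            "which doctor should I see for back pain", "dose of ibuprofen for children"]
labels = ["treatment", "diagnosis", "referral", "treatment"]
unlabelled = ["can I take paracetamol with this", "do I need a specialist for this"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labelled, labels)

proba = model.predict_proba(unlabelled)
confident = np.max(proba, axis=1) >= 0.4                   # illustrative threshold
pseudo_texts = [t for t, ok in zip(unlabelled, confident) if ok]
pseudo_labels = [model.classes_[i] for i, ok in zip(np.argmax(proba, axis=1), confident) if ok]

model.fit(labelled + pseudo_texts, labels + pseudo_labels)  # retrain on the union
print(model.predict(["which specialist treats migraines"]))
```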

D. Cohen and B. Croft contributed their part to the topic by presenting A Hybrid Embedding Approach to Noisy Answer Passage Retrieval, for which they experimented with Yahoo's Webscope L4 and L6 and WebAP from TREC 2004.

Not exactly question answering, but a research work relevant for this kind of data was Bringing back structure to free text email conversations with recurrent neural networks, by T. Repke and R. Krestel. Questions and answers in community forums are full of noisy text, greetings, and other sections that are irrelevant for the relevance-estimation task (we tried to address this problem in the past). E-mails are similar: they contain different sections. Besides rule-based approaches, which achieve accuracy values close to 70%, various supervised models have been proposed. Lampert et al. (EMNLP 2009) proposed Zebra, which relies on an SVM to classify each line into one of several classes, obtaining 93% accuracy (Carvalho (2004) had obtained 99% with Jangada, but on a pretty standardised dataset; ~64% on actual data). The newly proposed Quagga email zoning model consists of a CNN for line encoding. The input is a matrix of one-hot character representations (the presenter mentioned that this representation does not work for Arabic or Chinese; I am not sure why this would be the case for the former). One encoder is used for greetings, signals(?), and signatures. The results on the Enron and ASF datasets (they generated the latter from Apache public emails) are state of the art and hold even across corpora. An additional contribution of this research is a publicly-available annotation tool.
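A sketch of that line-encoding idea (mine, not the actual Quagga implementation): each line of an email is turned into a matrix of one-hot character vectors and a small 1-D CNN predicts its zone. The character set and the zone labels below are illustrative.

```python
# Character-level 1-D CNN that assigns an email zone to each line.
import torch
import torch.nn as nn

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789 .,@-:"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARS)}   # 0 = unknown/padding
ZONES = ["body", "greeting", "signature"]

def encode_line(line, max_len=80):
    ids = [CHAR_TO_ID.get(c, 0) for c in line.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.nn.functional.one_hot(torch.tensor(ids),
                                       num_classes=len(CHARS) + 1).float()

class LineZoner(nn.Module):
    def __init__(self, n_chars=len(CHARS) + 1, n_zones=len(ZONES)):
        super().__init__()
        self.conv = nn.Conv1d(n_chars, 64, kernel_size=5, padding=2)
        self.classify = nn.Linear(64, n_zones)

    def forward(self, x):                                  # x: (batch, max_len, n_chars)
        h = torch.relu(self.conv(x.transpose(1, 2)))       # (batch, 64, max_len)
        h = h.max(dim=2).values                            # max-pool over characters
        return self.classify(h)                            # zone logits per line

model = LineZoner()
lines = torch.stack([encode_line("Hi John,"), encode_line("Best regards, Anna")])
print(model(lines).argmax(dim=1))   # untrained, so the predicted zones are random
```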

 

Top 3. Topic modeling

The sessions on topic modeling included two interesting papers. In Predicting topics in scholarly papers, Al Bahrainian et al. built on top of their previous paper Modeling discrete dynamic topics to propose K2RE. Their aim is to track topical changes through time and to predict which topics will remain alive in the future (recency). They focus on scientific papers and use a dataset consisting of 6k papers from NIPS. Their model uses LDA for topic computation and Pearson correlation between topics. This could be used for other genres and problems (e.g., personalised recommender systems?).
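A rough sketch of those two ingredients (LDA topics plus Pearson correlation between their temporal profiles) is below; it is my own toy, not the K2RE model, and the tiny corpus and year labels are invented.

```python
# LDA topics per paper, aggregated into per-year topic profiles, then correlated.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

papers = ["neural networks for image recognition",
          "convolutional networks for vision tasks",
          "topic models for document collections",
          "bayesian topic models and inference",
          "reinforcement learning for control",
          "policy gradients in reinforcement learning"]
years = np.array([2014, 2015, 2014, 2016, 2015, 2016])

counts = CountVectorizer(stop_words="english").fit_transform(papers)
doc_topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)

# Per-year topic strength: average topic weight of the papers in each year.
profiles = np.vstack([doc_topics[years == y].mean(axis=0) for y in np.unique(years)])

# Pearson correlation between the topic trajectories over the years.
print(np.corrcoef(profiles.T))
```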

Another interesting paper was Topic Lifecycle on Social Networks: Analyzing the Effects of Semantic Continuity and Social Communities, by Dey et al. The authors describe the lifecycle of a topic in social media: emergence, spread, and subsidence. Their model consists of two main components:
  1. topic cluster formation:
    1. concatenate all tweets with a common hashtag to produce one single document;
    2. compute GloVe Twitter embeddings (after removing hashtags);
    3. average them to represent the hashtag;
    4. compute k-means using cosine similarity;
    5. build temporal relationships by measuring overlaps with documents appearing before and after (at day-level granularity).
  2. Build topic timelines using the followership network.
Following Ardon et al. (CIKM 2013), they consider a hashtag as a topic (a rough sketch of that representation step is given below). They also mentioned some interesting research, in particular the hashtag-based k-spectral centroid of Yang and Leskovec (WSDM 2011) and their Stanford Large Network Dataset Collection (SNAP).
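Here is that sketch of the hashtag-as-topic representation (steps 1.1-1.4 above; my own approximation, not Dey et al.'s code): tweets sharing a hashtag are merged into one pseudo-document, represented by the average of their word vectors, and clustered with k-means on length-normalised vectors as a stand-in for cosine similarity. Random vectors replace the pretrained GloVe Twitter embeddings.

```python
# Hashtag pseudo-documents -> averaged word vectors -> k-means clusters.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
glove = defaultdict(lambda: rng.normal(size=25))   # stand-in for GloVe Twitter vectors

tweets = [("#worldcup", "great game tonight"),
          ("#worldcup", "what a goal"),
          ("#elections", "polls open tomorrow"),
          ("#elections", "go out and vote")]

# 1.1 one pseudo-document per hashtag; 1.2-1.3 average the word vectors
docs = defaultdict(list)
for tag, text in tweets:
    docs[tag].append(text)
hashtags = list(docs)
vectors = np.vstack([np.mean([glove[w] for w in " ".join(docs[t]).split()], axis=0)
                     for t in hashtags])

# 1.4 k-means on length-normalised vectors (approximating cosine similarity)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(hashtags, labels)))
```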

Top 4. Reproducibility

As in every conference for quite a few years now, reproducibility was a big topic. The paper opening the discussion was Reproducing a Neural Question Answering Architecture Applied to the SQuAD Benchmark Dataset: Challenges and Lessons Learned, by Alexander Dur et al. The main idea of this paper, as the title clearly states, is "let us try to obtain exactly the same results that a given paper reports". The conclusion was worrying, but somehow not surprising: none of their submitted Docker containers managed to reproduce the results reported in Wang et al.'s Gated self-matching networks for reading comprehension and question answering (even though at some stage they managed to exchange emails with the authors, who eventually stopped replying!). The authors provided some guidelines to describe network architectures, including the topology and the learning algorithm. The take-home message was the holy grail of reproducibility in our field: release the source code. Some informal discussion started after this talk on what should be done with such papers. The proposals ranged from doing nothing, through flagging the paper as non-reproducible, up to post-rejecting it.

This research gave me the idea for a nice paper: take a bunch of papers describing alternative models for the same task, ask a number of students to implement them, and publish a paper with the outcome. Later, Silvello et al. presented Statistical Stemmers: A Reproducibility Study. The authors asked their students to implement stemmers from different papers, thoroughly analysed the outcome, and discussed what made the models reproducible or not. This research obtained the ECIR 2018 best paper award.

Top 5.  The keynotes

Radim Rehurek, the founder of RaRe Technologies ---world-famous for gensim, the Python library for topic modelling--- offered a nice keynote: Anatomy of an idea: mixing open source, research and business. After showing the place in Thailand where gensim ("generate similar") was born, he explained the impact the toolkit has had, including 900+ citations (by the way, gensim also works for large data streaming and online learning). Among nice anecdotes and experiences, Radim stressed the importance of good practices in the development of any technology: put whatever has been done into small demos, blog posts, and web prototypes; do unit testing; use logging for everything (what is being trained, what the data is). He used some nice illustrations from CommitStrip.
Fernando Diaz (Spotify) was awarded the Karen Spärck Jones Prize and presented the keynote The Harsh Reality of Production Information Access Systems. Among the different aspects he touched upon, he mentioned a couple of interesting references.
The beginning of Fernando Diaz's Karen Spärck Jones Prize keynote.

Gabriella Kazai (Microsoft) presented Challenges in building IR evaluation pipelines. She mentioned Evangelos Kanoulas' A short survey on search evaluation and discussed the classical TREC Cranfield framework: corpus -> query -> labels -> metrics. She stressed that people from different places have different interests: some may like personalisation, others prefer trending topics. She discussed the Lumi news app, which delivers not only news but all kinds of content. In Lumi, users are incentivised to feed the system with their info (e.g., tweets) because this results in better recommendations.

The evaluation carried out on Lumi.

Extra. 


I found some posters particularly interesting as well.
The end of ECIR and the farewell from Grenoble came with another of the typical local dishes: ravioles.

