The 40th edition of the European Conference on Information Retrieval was held in late March in Grenoble, France. Besides the amazing landscape of the French Alps, the great French cuisine, and awesome craft beer, the conference had a fine selection of up-to-date IR research. Here is my selection of the top 5 research outcomes and events.
Disclaimer: this article is biased towards my current interests and my taste!
*A view of Place Victor Hugo in Grenoble*
Top 1. The NewsIR Workshop
Researchers from a couple of companies (Signal Media and Factmata) and three universities (UNED, U. of Chile, and U. of Sheffield) organised the second edition of this workshop on the trends of news-focused IR. The one-day workshop came in the form of short 10-minute talks (with a poster session), two keynotes, a round table, and a discussion session.
During the opening session, the evolution of corpora in this genre was stressed, from the early 1997 Reuters collections up to the Signal 1M corpus, with 1M entries from online newspapers and blogs. The ongoing efforts on analysing media bias and verification were also highlighted, as was the difference between delivering news and delivering actionable intelligence.
In his keynote titled AI & Automated News: Implications on Trust, Bias, and Credibility, Edgar Meij (Bloomberg) started by giving some big numbers about Bloomberg: 500 stories/second, and it hosts more news reporters than The New York Times, The Washington Post, and The Chicago Tribune together. He discussed systems for the automatic generation of news reports, including companies such as Automated Insights. According to Meij, most of these models are based on templates: a news story is templated and the specific information from an event (e.g., a match) is filled in to produce the report. Ongoing efforts, such as the natural language generation conference, are trying to go beyond this approach. This is a hot topic, and recent work explores how the general public perceives machine-generated reports, including Graefe et al. (2016) and Wolker and Powell (2018). Bloomberg is now trying to generate documentary-like videos out of tweets, something that resembles Qlusty.
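As a toy illustration of that template-filling idea (my own made-up example, not any company's actual pipeline):

```python
# Toy template-based report generation: a story skeleton whose slots are
# filled with the structured data of an event. Template and event data are
# invented for illustration only.
TEMPLATE = ("{home} beat {away} {home_goals}-{away_goals} on {date}; "
            "{scorer} scored the decisive goal in the {minute}th minute.")

event = {
    "home": "Grenoble", "away": "Lyon",
    "home_goals": 2, "away_goals": 1,
    "date": "25 March 2018", "scorer": "Dupont", "minute": 87,
}

print(TEMPLATE.format(**event))
```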
In the second keynote, “Every tool is better than nothing”?: The use of dashboards in journalistic work, Peter Tolmie discussed some technology for journalism. Part of the talk described the outcomes of the Pheme project on assessing the veracity of claims online. Pheme generated many resources on rumours and social media analysis, with a multilingual emphasis.
The people from the Webis group were the most active in this edition of NewsIR, offering three talks:
- In A Plan for Ancillary Copyright, Potthast et al. discussed the ongoing legal situation in Germany which, apparently following the steps of Spain, is trying to regulate how search engines and other websites re-use content generated by others (do not forget that such regulations caused the end of Google News in Spain, among other consequences). This kind of initiative tries to guarantee that only news publishers are allowed to publish their text (not even snippets could be shown in search engines, and definitely not for commercial use). Beyond the national initiatives, the EC is debating an ancillary copyright, and the JRC did a study about it (but it was not published because the conclusions were not what the EC expected!). The fact seems to be that Taraborelli (2015)'s reuse paradox worries the large content-generation companies: a good snippet returned after a search could be enough to fulfill the user's information need and hence prevent her from actually visiting the website (harming its income). It seems like this could actually harm relatively small companies.
*The Clash of Titans in Ancillary Copyright*
- In Shaping the Information Nutrition Label, they discussed the different dimensions one could show to the user to characterise an article. Taking advantage of the "typical" dimensions used to describe the nutrition values of food (e.g., carbohydrates, sugar, proteins), they discussed quality dimensions, such as readability or verbosity, and proposed an iconography to display them (a toy sketch of two such dimensions follows this list). This research builds upon Norbert Fuhr et al. (2017)'s An Information Nutritional Label for Online Documents. Interestingly, they link these dimensions to Aristotle's categories of perception.
*A snapshot of Shaping the Information Nutrition Label to display the quality of a document.*
- Finally, in Cross-Reading News, S. Syed et al. described a system to assist journalists in their work. The system allows a journalist to search based on named entities and select topics to obtain an automatically-generated summary (an article) and a candidate title. This generated text is then shown to the journalist for editing before eventually being sent for review.
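Coming back to the information nutrition label: here is a toy sketch (my own simplification, not the authors' implementation) of how two such quality dimensions could be approximated:

```python
import re

def readability(text):
    """Rough readability proxy: shorter sentences and shorter words read easier."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return (len(words) / max(len(sentences), 1),              # avg. sentence length
            sum(len(w) for w in words) / max(len(words), 1))  # avg. word length

def verbosity(text):
    """Rough verbosity proxy: a low type-token ratio suggests repetitive prose."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return 1 - len(set(words)) / max(len(words), 1)

article = "Search engines index the web. The web is indexed by search engines."
print(readability(article), verbosity(article))
```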
In Visualizing Polarity-based Stances of News Websites, M. Yoshioka et al. focused on the 2016 US election campaign and tried to identify the polarity of news articles towards H. Clinton and D. Trump. They relied on the Google-supported GDELT (also using its positive, neutral, and negative judgments as a gold standard) to build their dataset and performed document-level judgments on them.
In Qlusty: Quick and Dirty Generation of Event Videos from Written Media Coverage, A. Barrón-Cedeño et al. presented some preliminary efforts to identify events, diversify their points of view, and present them to the user as a short overview video. For event identification they used DBSCAN clustering on doc2vec representations. For diversification they played with the DBSCAN-generated clusters and ranked the articles according to their distance to the rest of the cluster elements. Here is an instance of the generated videos (a new blog entry will soon arrive with further details about Qlusty).
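A minimal sketch of that pipeline as I understood it (my reconstruction, not the authors' code; assumes gensim >= 4 and scikit-learn, and uses a toy three-article corpus):

```python
# Cluster news articles into events with DBSCAN over doc2vec vectors, then
# rank each cluster's articles by average cosine distance to the rest.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical tokenised articles; a real run would use thousands of news items.
articles = [
    "earthquake hits coastal town overnight".split(),
    "strong quake shakes coastal region residents evacuated".split(),
    "football final ends in dramatic penalty shootout".split(),
]

# 1) Learn doc2vec representations.
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(articles)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vectors = np.array([model.dv[i] for i in range(len(articles))])

# 2) Event identification: DBSCAN with cosine distance.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)

# 3) Diversification: within each cluster, rank articles by their mean cosine
#    distance to the other cluster members (most central first).
for event in set(labels) - {-1}:
    idx = np.where(labels == event)[0]
    dists = cosine_distances(vectors[idx]).mean(axis=1)
    print("event", event, "ranked articles:", idx[np.argsort(dists)].tolist())
```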
*The poster about Qlusty (starring T. Brady, G. Bündchen and the Got Süt ad).*
In Exploring Significant Interactions in Live News, Schubert et al. described an interesting system that identifies how different entities act together in the news and hence can compose an event. They have a nice running demo to explore what is going on in the news live.
In On Temporally Sensitive Word Embeddings for News Information Retrieval, Taewon Yoon et al. tried to address an interesting question: how often should we update our embeddings? The question stems from the fact that, with new events, previously unseen vocabulary appears and increases the number of OOVs, making it harder for embedding-based retrieval models (and even those based on idf-like frequencies) to catch up. They cite an interesting paper on using the centroids of word embeddings (CentIDF) for QA.
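For context, the centroid idea amounts to representing a text by an idf-weighted average of its word vectors, which is exactly where OOV terms hurt. A minimal sketch, with made-up embeddings and idf values:

```python
import numpy as np

def centroid(tokens, embeddings, idf):
    """idf-weighted centroid of word embeddings; OOV tokens are simply skipped,
    which is the weakness the paper worries about for fresh news vocabulary."""
    vecs, weights = [], []
    for tok in tokens:
        if tok in embeddings:
            vecs.append(embeddings[tok])
            weights.append(idf.get(tok, 1.0))
    if not vecs:
        return None
    return np.average(np.array(vecs), axis=0, weights=weights)

# Hypothetical toy embeddings and idf values.
emb = {"election": np.array([0.1, 0.9]), "result": np.array([0.2, 0.7])}
idf = {"election": 2.3, "result": 1.1}
print(centroid(["election", "result", "brexit"], emb, idf))  # "brexit" is OOV here
```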
The last standard talk presented the TREC 2018 News Track, which proposes two tasks. In background linking, a model should retrieve articles, reported before the current one, that provide context to better understand it. In entity ranking, a model should rank the entities in an article according to their importance. Further information is available on the website and a call for participation should be available by April/May.
The workshop closed with an open discussion and round tables. Multiple topics were mentioned, including robotic generation of articles, virality of news, prediction of events, multi-document summarisation (also using Wikipedia as ground truth), hyperpartisan identification, timeline generation, and opinion similarity. Regarding multilinguality, Andreas Spitz came up with a phrase worth quoting: "no matter how many languages you cover, you won't cover all the opinions".
Not in NewsIR, but in the main conference, Meladianos et al. presented An Optimization Approach for Sub-Event Detection and Summarization in Twitter. In this case, the event is already given, as it comes from a specific time span and associated hashtags. Having this as an input, they identify the relevant sub-events (e.g., the most interesting actions in a football match). They build a graph for each tweet (similar to word-level PageRank) and merge them into a larger graph. The function being minimised represents the similarity between the events.
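My rough reading of the graph construction, as a sketch (not the authors' code): each tweet yields a small word co-occurrence graph, and the per-tweet graphs are merged by summing edge weights:

```python
import itertools
import networkx as nx

def tweet_graph(tokens):
    """Fully connect the words of one tweet: a tiny word co-occurrence graph."""
    g = nx.Graph()
    for u, v in itertools.combinations(set(tokens), 2):
        g.add_edge(u, v, weight=1)
    return g

def merge(graphs):
    """Merge per-tweet graphs by summing edge weights."""
    merged = nx.Graph()
    for g in graphs:
        for u, v, d in g.edges(data=True):
            w = merged[u][v]["weight"] + d["weight"] if merged.has_edge(u, v) else d["weight"]
            merged.add_edge(u, v, weight=w)
    return merged

tweets = [["goal", "messi", "barcelona"], ["messi", "penalty", "goal"]]
G = merge(tweet_graph(t) for t in tweets)
print(G.edges(data=True))
```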
Top 2. Community question answering
The problem of selecting appropriate answers and retrieving similar questions in community-driven question answering forums remains hot.
In Medical Forum Question Classification Using Deep Learning, Raksha Jalan et al. tried to identify the intent of a question. They relied on the ICHI 2016 Healthcare Data Analytics Challenge to run their experiments (8k questions for training, 3k for testing). They crawled medhelp.org to retrieve some extra instances (although they mentioned others, such as Mayo Clinic and WebMD) and generated new supervised data by self-training. That is, they classified the unlabeled data with a model trained on labeled data and retrained on the union of both. They referred to the strategy they followed as throttling balancing with lookups and called this weak supervision. Their model is a bi-LSTM, and the preprocessing of the texts includes hyperlink removal.
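A generic sketch of that self-training loop (my own simplification: a logistic regression stands in for their bi-LSTM, and a plain confidence threshold stands in for the throttling/balancing details; questions and labels are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9):
    """Train on labeled data, pseudo-label the confident unlabeled instances,
    and retrain on the union of both."""
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_lab), y_lab)

    proba = clf.predict_proba(vec.transform(X_unlab))
    confident = proba.max(axis=1) >= threshold
    pseudo_y = clf.classes_[proba.argmax(axis=1)]

    X_all = list(X_lab) + [x for x, keep in zip(X_unlab, confident) if keep]
    y_all = list(y_lab) + list(pseudo_y[confident])
    vec_all = TfidfVectorizer()
    clf_all = LogisticRegression(max_iter=1000).fit(vec_all.fit_transform(X_all), y_all)
    return clf_all, vec_all

# Hypothetical medical-forum questions and intent labels.
labeled_q = ["does aspirin help headaches", "flu vaccine side effects"]
labels = ["treatment", "side-effects"]
unlabeled_q = ["can ibuprofen treat a headache", "vaccine reaction symptoms"]
clf, vec = self_train(labeled_q, labels, unlabeled_q, threshold=0.5)
print(clf.predict(vec.transform(["headache treatment options"])))
```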
D. Cohen and B. Croft contributed their part to the topic by presenting A Hybrid Embedding Approach to Noisy Answer Passage Retrieval, for which they experimented with Yahoo's Webscope L4 and L6 and WebAP from TREC 2004.
Not exactly question answering, but a research work relevant for this kind of data was Bringing back structure to free text email conversations with recurrent neural networks, by T. Repke and R. Krestel. Questions and answers in community forums are full of noisy text, greetings, and other sections that are irrelevant for the relevance estimation task (we tried to address this problem in the past). E-mails are similar: they contain different sections. Besides rule-based approaches, which achieve accuracy values close to 70%, various supervised models have been proposed. Lampert et al. (EMNLP 2009) proposed Zebra, which relies on an SVM to classify each line into one of several classes, obtaining 93% accuracy (Carvalho, 2004, had obtained 99% with Jangada, but on a pretty standardised dataset; ~64% on actual data). The currently-proposed Quagga email zoning model consists of a CNN for line encoding. The input is a matrix of one-hot character representations (the presenter mentioned that this representation does not work for Arabic or Chinese; not sure why this is the case for the former). One encoder is used for greetings, signals(?), and signatures. The results on the Enron and ASF datasets (they generated the latter from Apache public emails) are state of the art and good even across corpora. An additional contribution of this research is a publicly-available annotation tool.
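For illustration, here is a toy character-level CNN line classifier in Keras (my interpretation of the general idea, not the Quagga architecture itself; hyper-parameters and class count are made up):

```python
# Toy character-level CNN for classifying email lines (e.g. body / greeting /
# signature). One-hot character input, as in the talk; ASCII only, which is
# why such a representation does not cover Chinese and similar scripts.
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, N_CHARS, N_CLASSES = 80, 128, 3  # line length, ASCII chars, zone classes

def encode_line(line):
    """One-hot encode a single email line at character level."""
    x = np.zeros((MAX_LEN, N_CHARS), dtype="float32")
    for i, ch in enumerate(line[:MAX_LEN]):
        x[i, ord(ch) % N_CHARS] = 1.0
    return x

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, N_CHARS)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```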
Top 3. Topic modeling
The sessions on topic modeling included two interesting papers. In Predicting topics in scholarly papers, Al Bahrainian et al. built on top of their previous paper Modeling discrete dynamic topics to propose K2RE. Their aim is to track topical changes through time and to predict which topics will remain alive in the future (recency). They focus on scientific papers and use a dataset consisting of 6k papers from NIPS. Their model uses LDA for topic computation and Pearson correlation between topics. This could be used for other genres and problems (e.g., personalised recommender systems?).
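A minimal sketch of those two building blocks, LDA topics plus Pearson correlation between topic trajectories over the years, under my own simplifying assumptions and a made-up toy corpus:

```python
import numpy as np
from gensim import corpora, models
from scipy.stats import pearsonr

# Hypothetical toy corpus of paper token lists grouped by year.
papers = [["neural", "network", "training"], ["topic", "model", "lda"],
          ["neural", "embedding", "training"], ["lda", "topic", "coherence"],
          ["deep", "neural", "training"], ["topic", "model", "inference"]]
years = [2014, 2014, 2015, 2015, 2016, 2016]

dictionary = corpora.Dictionary(papers)
bows = [dictionary.doc2bow(p) for p in papers]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(bow):
    """Full topic-weight vector for one paper."""
    dist = np.zeros(2)
    for topic, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[topic] = weight
    return dist

# Average topic weights per year -> one trajectory per topic.
per_year = {}
for bow, year in zip(bows, years):
    per_year.setdefault(year, []).append(topic_vector(bow))
series = np.array([np.mean(per_year[y], axis=0) for y in sorted(per_year)])

# Pearson correlation between the two topic trajectories (with only two topics
# this is trivially close to -1; a real run uses many topics).
r, _ = pearsonr(series[:, 0], series[:, 1])
print("topic correlation:", r)
```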
Another interesting paper was Topic Lifecycle on Social Networks: Analyzing the Effects of Semantic Continuity and Social Communities, by Dey et al. The authors describe the lifecycle of a topic in social media: emergence, spread, and subsidence. Their model consists of the following components (a rough sketch of the clustering step follows this list):
- topic cluster formation:
  - concatenate all tweets with a common hashtag to produce one single document
  - compute GloVe Twitter embeddings (after removing hashtags)
  - average them to represent the hashtag
  - cluster with k-means using cosine similarity
- build temporal relationships by measuring overlaps with documents appearing before and after (at day-level granularity)
- build topic timelines using the fellowship network
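Here is the announced sketch of the topic-cluster-formation step (my reconstruction, not the authors' code; scikit-learn's KMeans is Euclidean, so the vectors are L2-normalised to approximate cosine, and the GloVe embeddings below are made-up toy vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def hashtag_vector(tweets, glove):
    """Concatenate a hashtag's tweets and average the GloVe vectors of its words."""
    words = [w for t in tweets for w in t.split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0)

# Hypothetical pre-loaded GloVe Twitter embeddings: {word: vector}.
glove = {"quake": np.array([0.3, 0.8]), "rescue": np.array([0.4, 0.7]),
         "match": np.array([0.9, 0.1]), "goal": np.array([0.8, 0.2])}

hashtags = {"#earthquake": ["quake hits town", "rescue teams deployed"],
            "#worldcup":   ["great match today", "what a goal"]}

# One averaged vector per hashtag, L2-normalised, then k-means.
X = normalize(np.array([hashtag_vector(tw, glove) for tw in hashtags.values()]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(hashtags, labels)))
```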
Following Ardon et al. (CIKM 2013), they consider a hashtag as a topic. They also mentioned some interesting research, in particular the k-spectral centroid clustering of hashtags by Yang and Leskovec (WSDM 2011) and their Stanford Large Network Dataset Collection (SNAP).
Top 4. Reproducibility
As in every conference for quite a few years now, reproducibility was a big topic. The paper opening the discussion was Reproducing a Neural Question Answering Architecture Applied to the SQuAD Benchmark Dataset: Challenges and Lessons Learned, by Alexander Dur et al. The main idea of this paper, as the title clearly states, is "let us try to obtain exactly the same results that a given paper reports". The conclusion was worrying, but somewhat unsurprising: none of their submitted Docker containers managed to reproduce the results reported in Wang et al.'s Gated self-matching networks for reading comprehension and question answering (even though at some stage they managed to exchange emails with the authors, who eventually stopped replying!). The authors provided some guidelines to describe network architectures, including the topology and the learning algorithm. The take-home message was the holy grail of reproducibility in our field: release the source code. Some informal discussion started after this talk on what should be done with such papers. The proposals ranged from doing nothing to post-rejecting the paper, passing through flagging it as non-reproducible.
This research gave me the idea for a nice paper: taking a bunch of papers describing alternative models for the same task, asking a number of students to implement them, and publishing a paper with the outcome.
Later, Silvello et al. presented Statistical Stemmers: A Reproducibility Study. The authors asked their students to implement stemmers from different papers, thoroughly analysed the outcome, and discussed what made the models reproducible or not. This research obtained the ECIR 2018 best paper award.
Top 5. The keynotes
Radim Rehurek, the founder of RaRe Technologies ---world-famous for gensim, the Python library for topic modelling--- offered a nice keynote: Anatomy of an idea: mixing open source, research and business. After showing the place in Thailand where gensim ("generate similars") was born, he explained the impact the toolkit has had, including 900+ citations (by the way, gensim also works for large-scale data streaming and online learning; see the sketch below). Among nice anecdotes and experiences, Radim stressed the importance of good practices in the development of any technology: put whatever has been done into small demos, blog posts, and web prototypes. Do unit testing and use logging for everything (what is being trained, what the data is). He used some nice illustrations from commitstrip and mentioned wstein.org.
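As a reminder of what that streaming style looks like, a minimal gensim sketch (the file name is hypothetical): the corpus is just an iterable that yields one bag-of-words at a time, so nothing has to fit in memory.

```python
from gensim import corpora, models

class StreamedCorpus:
    """Streams one document (one line of the file) at a time."""
    def __init__(self, path, dictionary):
        self.path, self.dictionary = path, dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:                      # one document per line
                yield self.dictionary.doc2bow(line.lower().split())

# Hypothetical corpus file "docs.txt"; both passes stream from disk.
dictionary = corpora.Dictionary(line.lower().split() for line in open("docs.txt"))
lda = models.LdaModel(StreamedCorpus("docs.txt", dictionary),
                      id2word=dictionary, num_topics=10)
```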
*Managing risk strategies according to R. Rehurek.*
Fernando Diaz (Spotify) was granted the Karen Spärck Jones Award and presented the keynote The Harsh Reality of Production Information Access Systems. Among the different aspects he touched upon, he mentioned a couple of interesting references, including:
- Nicholas J. Belkin and Stephen E. Robertson. Some ethical and political implications of theoretical research in information science. It talks about monetisation and propaganda in information access.
- Lucas D. Introna and Helen Nissenbaum. Shaping the Web: Why the politics of search engines matters.
*The beginning of Fernando Diaz's Karen Spärck Jones Award keynote.*
Gabriella Kazai (Microsoft) presented Challenges in building IR evaluation pipelines. She mentioned Evangelos Kanoulas' A short survey on search evaluation and discussed the classical TREC Cranfield framework: corpus -> query -> labels -> metrics. She stressed that people from different places have different interests: some may like personalisation, others prefer trending topics. She also discussed the Lumi news app, which delivers not only news but all kinds of content. In Lumi, users are incentivised to feed the system with their info (e.g., tweets) because this results in better recommendations.
*The evaluation carried out on Lumi.*
Extra.
I found some of the posters particularly interesting as well.
The end of ECIR and the farewell from Grenoble came with another of the typical local dishes: ravioles.