Key points:
ref:
| Title | Compression-Based Selective Sampling for Learning to Rank |
| --- | --- |
| Authors | Rodrigo M. Silva, Guilherme C. M. Gomes, Mário S. Alvim, Marcos A. Gonçalves |
| Year | 2016 |
| Keywords | L2R; IR; AL; Active Learning; semi-supervised learning; transductive learning |
| Abstract | Learning to rank (L2R) algorithms use a labeled training set to generate a ranking model that can later be used to rank new query results. These training sets are very costly and laborious to produce, requiring human annotators to assess the relevance or order of the documents in relation to a query. Active learning (AL) algorithms are able to reduce the labeling effort by actively sampling an unlabeled set and choosing data instances that maximize the effectiveness of a learning function. But AL methods require constant supervision, as documents have to be labeled at each round of the process. In this paper, we propose that certain characteristics of unlabeled L2R datasets allow for an unsupervised, compression-based selection process to be used to create small and yet highly informative and effective initial sets that can later be labeled and used to bootstrap a L2R system. We implement our ideas through a novel unsupervised selective sampling method, which we call Cover, that has several advantages over AL methods tailored to L2R. First, it does not need an initial labeled seed set and can select documents from scratch. Second, selected documents do not need to be labeled as the iterations of the method progress, since it is unsupervised (i.e., no learning model needs to be updated). Thus, an arbitrarily sized training set can be selected without human intervention depending on the available budget. Third, the method is efficient and can be run on unlabeled collections containing millions of query-document instances. We run various experiments with two important L2R benchmarking collections to show that the proposed method allows for the creation of small, yet very effective training sets. It achieves full training-like performance with less than 10% of the original sets selected, outperforming the baselines in both effectiveness and scalability. |
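
The abstract only sketches the idea behind Cover (unsupervised, compression-based selection of informative query-document instances) and does not spell out the algorithm. The snippet below is a minimal illustrative sketch of the general compression-based sampling idea, assuming a greedy criterion that prefers candidates whose addition most increases the compressed size of the already-selected pool (i.e., least redundant ones); the function names and the use of zlib are illustrative assumptions, not the paper's actual implementation.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length of the zlib-compressed byte representation of `text`."""
    return len(zlib.compress(text.encode("utf-8")))


def select_by_compression(candidates, budget):
    """Greedy, unsupervised selection of a small, diverse subset.

    `candidates` is a list of strings, each a textual representation of a
    query-document instance (e.g., its feature vector serialized as text).
    At each step, the candidate that adds the most "new information" to the
    already-selected pool -- measured as the increase in compressed size --
    is picked.  This only illustrates compression-based sampling in general;
    it is not the Cover algorithm from the paper.
    """
    selected = []
    pool_text = ""
    pool_size = compressed_size(pool_text)
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best_idx, best_gain = None, float("-inf")
        for i, cand in enumerate(remaining):
            gain = compressed_size(pool_text + cand) - pool_size
            if gain > best_gain:
                best_idx, best_gain = i, gain
        chosen = remaining.pop(best_idx)
        selected.append(chosen)
        pool_text += chosen
        pool_size = compressed_size(pool_text)
    return selected


# Toy usage: near-duplicate documents add little extra compressed size,
# so the greedy pass favours diverse instances.
docs = [
    "q1 relevance features 0.1 0.2 0.3",
    "q1 relevance features 0.1 0.2 0.3",   # near-duplicate
    "q2 completely different feature pattern 9.9 8.8",
]
print(select_by_compression(docs, budget=2))
```

Because no labels or learning model are involved in this selection loop, an arbitrarily sized subset can be picked up front and sent for annotation in a single batch, which is the property the abstract emphasizes over round-by-round active learning.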