In the context of our regular workshop on Building and Using Comparable Corpora, we ran a shared task on identifying comparable corpora on the Web. More information about the task is available from: https://comparable.limsi.fr/bucc2015/bucc2015-task.html
The workshop proceedings with the participating systems and the results are now available from: http://www.aclweb.org/anthology/W/W15/W15-34.pdf
In order to promote further research on this topic, the gold-standard resources with a standardised train/test split have been made available to everyone: http://corpus.leeds.ac.uk/serge/BUCC/
Feel free to use this set for any research task involving comparable corpora. The standard reference is:
@InProceedings{sharoff-zweigenbaum-rapp:2015:BUCC,
author = {Sharoff, Serge and Zweigenbaum, Pierre and Rapp, Reinhard},
title = {BUCC Shared Task: Cross-Language Document Similarity},
booktitle = {Proceedings of the Eighth Workshop on Building and Using Comparable Corpora},
month = {July},
year = {2015},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {74--78},
url = {http://www.aclweb.org/anthology/W15-3411}
}
Best wishes, Serge