Efficient Parallelization of the Google Trigram Method for Document Relatedness Computation


Bibliographic Details
Published in: 2015 44th International Conference on Parallel Processing Workshops, pp. 98-104
Main Authors: Xinxin Kou; Jie Mei; Zhimin Yao; Andrew Rau-Chaplin; Aminul Islam; Abidalrahman Moh'd; Evangelos Milios
Format: Conference Proceeding
Language: English
Published: IEEE 01-09-2015
Description
Summary: Finding pairwise document relatedness plays an important role in a variety of Natural Language Processing problems. The Google Trigram Method (GTM) is a corpus-based unsupervised method that can be used to capture word relatedness and document relatedness. It has been shown that GTM can be used to build high-quality document relatedness applications. However, implementing GTM for pairwise document relatedness computation on a large document set is challenging because of its high computational complexity. This paper presents time- and space-efficient methods for computing pairwise document relatedness using GTM. To improve performance, algorithm engineering, data structure enhancement, and parallel computing methods are applied. Two parallel methods are discussed: a shared-memory multicore implementation and a distributed-memory Hadoop implementation. Both parallel methods accelerate the pairwise document relatedness computation using GTM by an order of magnitude.
ISSN: 1530-2016, 2375-530X
DOI: 10.1109/ICPPW.2015.42
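
Illustrative sketch: the summary describes spreading the pairwise computation over multiple cores in shared memory. The Python fragment below is a minimal sketch of that partitioning only, not the authors' implementation. The function names `relatedness` and `pairwise_relatedness` are hypothetical, and the score is a placeholder (Jaccard overlap of word sets) standing in for GTM, whose details this record does not give. A thread pool is used to keep the example short; in CPython, a real speedup for CPU-bound scoring would likely require processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def relatedness(doc_a, doc_b):
    """Placeholder score (Jaccard overlap of word sets); NOT the GTM measure."""
    a, b = set(doc_a.split()), set(doc_b.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_relatedness(docs, workers=4):
    """Score every unordered pair (i, j) with i < j, spreading pairs over workers."""
    # n documents yield n * (n - 1) / 2 pairs to score.
    pairs = list(combinations(range(len(docs)), 2))

    def score(pair):
        i, j = pair
        return (i, j), relatedness(docs[i], docs[j])

    # All workers read the same in-memory document list (shared memory).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score, pairs))


docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
scores = pairwise_relatedness(docs, workers=2)
```

Because n documents produce n(n-1)/2 pairs, the pair list grows quadratically with the collection size, which is why distributing the pairs across workers matters for large document sets.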