Efficient Parallelization of the Google Trigram Method for Document Relatedness Computation
Published in: | 2015 44th International Conference on Parallel Processing Workshops pp. 98 - 104 |
---|---|
Main Authors: | , , , , , , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE 01-09-2015 |
Subjects: | |
Summary: | Finding pairwise document relatedness plays an important role in a variety of Natural Language Processing problems. The Google Trigram Method (GTM) is a corpus-based unsupervised method that can be used to capture word relatedness and document relatedness. It has been shown that GTM can be applied to construct high-quality document relatedness applications. However, its high computational complexity makes GTM challenging to implement for pairwise document relatedness computation on large document sets. This paper presents time- and space-efficient methods for computing pairwise document relatedness using GTM. To improve performance, algorithmic engineering, data structure enhancement, and parallel computing methods are applied. Two parallel methods are discussed in this paper: a shared-memory multicore implementation and a distributed-memory Hadoop implementation. Both parallel methods provide an order-of-magnitude speedup in the pairwise document relatedness computation using GTM. |
---|---|
ISSN: | 1530-2016 2375-530X |
DOI: | 10.1109/ICPPW.2015.42 |