[Corpora-List] SVD on high-dimension data

Yannick Versley versley at sfs.uni-tuebingen.de
Tue Mar 6 16:11:00 CET 2007


> I have large (1 million by 1 million) term-term matrices. What SVD
> packages work with such massive datasets? I have tried Matlab and
> SVDPACKC without much success.

Both Matlab and SVDPACK(C), which reads the Harwell-Boeing format, store
matrices sparsely. This means that the dimensionality (= number of terms)
does not really matter; what matters is the number of non-zero entries. To
solve your problem, you could either:
- raise the constants in the SVDPACKC source code that cap the dimensionality
and the number of non-zero entries, then run the SVD on a machine with lots
of memory.
Ted Pedersen's SenseClusters software uses SVDPACKC and its documentation
gives good advice regarding the values that you need to tweak.
- try to reduce the number of terms and/or the number of non-zero entries.
A sensible first step would be to throw away terms that occur fewer than 5
times in your corpus and, if the matrix is still too big, to throw away all
entries below a certain threshold (e.g. all entries with a value of only 1).
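
To make the second option concrete, here is a rough sketch of the pruning
followed by a truncated SVD. It uses Python with SciPy instead of Matlab or
SVDPACKC, which is my substitution for illustration only. The function name
prune_cooccurrence, the separate term_counts array of corpus frequencies and
the toy numbers are all made up for the example; only the cutoffs (5
occurrences, entries of 1) come from the suggestion above.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds


def prune_cooccurrence(matrix, term_counts, min_term_count=5, min_entry=2):
    """Shrink a square term-term matrix before running an SVD on it.

    matrix      -- sparse term-term matrix (hypothetical input)
    term_counts -- corpus frequency of each term, same order as the rows
    Returns the pruned CSR matrix and the indices of the terms that were kept.
    """
    matrix = sp.csr_matrix(matrix, dtype=np.float64)

    # Step 1: drop terms that occur fewer than min_term_count times.
    keep = np.flatnonzero(np.asarray(term_counts) >= min_term_count)
    pruned = matrix[keep][:, keep].tocsr()

    # Step 2: drop entries below the threshold (min_entry=2 removes the 1s).
    pruned.data[pruned.data < min_entry] = 0
    pruned.eliminate_zeros()
    return pruned, keep


# Toy data: 4 terms, term 1 occurs only 3 times in the (imaginary) corpus.
counts = np.array([10, 3, 7, 6])
dense = np.array([
    [6, 1, 4, 1],
    [1, 2, 2, 0],
    [4, 2, 5, 3],
    [1, 0, 3, 4],
])

pruned, keep = prune_cooccurrence(sp.csr_matrix(dense), counts)

# Memory and run time now depend on pruned.nnz, not on the number of terms.
print("kept terms:", keep, "non-zero entries:", pruned.nnz)

# Truncated SVD: only the k largest singular triplets are computed.
# k must be smaller than both matrix dimensions.
u, s, vt = svds(pruned, k=2)
print("singular values:", s)
```

On real data you would pick k in the low hundreds at most and check
pruned.nnz after each pruning step to see whether the matrix has become
small enough for your machine.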

Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352
