[Corpora-List] Search tool for XCES-encoded parallel corpora?

Mickel Grönroos mickel.gronroos at masterin.com
Fri Sep 23 14:51:00 CEST 2005


Hello!

I am looking for a corpus search tool that could be used for querying a
parallel corpus tagged in XCES format. All operating systems and programming
languages will do. Does anybody know if such a tool exists, or do I need to
code it myself?

Basically, what I want to be able to say is something like: "Look for the
word X in language A using my set of sentence alignment files N. Show me all
sentences in language A where X occurs, together with their aligned
sentences in language B."

What I have is three files: one with the text in language A, another
with the text in language B, and finally a file with the alignment markup
aligning the A sentences with the B sentences.

This is what it looks like:

exampledoc_A.xml:
[...]
<p id="p1">
<s id="p1s1">Aktia nostaa Prime-korkoaan.</s>
<s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt
nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>
</p>
[...]

exampledoc_B.xml:
[...]
<p id="p1">
<s id="p1s1">Aktia höjer sin Prime-ränta.</s>
<s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att
höja Prime-räntan med 0,5 procentenheter.</s>
</p>
[...]

examplealign.xml:
[...]
<translations>
<translation trans.loc="exampledoc_A.xml" wsd="iso-8859-1" lang="fi"
xml:lang="fi" n="1" />
<translation trans.loc="exampledoc_B.xml" wsd="iso-8859-1" lang="sv"
xml:lang="sv" n="2" />
</translations>
[...]
<linkList>
<linkGrp targType="s">
<link>
<align xlink:href="#p1s1" />
<align xlink:href="#p1s1" />
</link>
<link>
<align xlink:href="#p1s2" />
<align xlink:href="#p1s2" />
</link>
</linkGrp>
</linkList>
[...]
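
If I end up coding it myself, reading the links could look roughly like the
Python sketch below. It is only a sketch under assumptions: the function name
read_links and the <wrapper> root element are made up for illustration (the
real enclosing elements are elided above), the elements are assumed not to sit
in a default namespace, the xlink prefix is assumed to be bound to the standard
XLink namespace, and the first <align> in each <link> is assumed to point into
the n="1" document and the second into the n="2" document.

```python
import xml.etree.ElementTree as ET

# ElementTree expands prefixed attributes to "{namespace-uri}localname".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def read_links(align_xml):
    """Return a list of (id_in_first_doc, id_in_second_doc) pairs."""
    root = ET.fromstring(align_xml)
    pairs = []
    for link in root.iter("link"):
        # Strip the leading "#" so the value matches the <s id="..."> attribute.
        ids = [a.get(XLINK_HREF, "").lstrip("#") for a in link.findall("align")]
        if len(ids) == 2:  # only 1:1 links are handled in this sketch
            pairs.append((ids[0], ids[1]))
    return pairs


# Hypothetical wrapper element; only the linkList part comes from my file.
SAMPLE = """<wrapper xmlns:xlink="http://www.w3.org/1999/xlink">
<linkList>
<linkGrp targType="s">
<link>
<align xlink:href="#p1s1" />
<align xlink:href="#p1s1" />
</link>
<link>
<align xlink:href="#p1s2" />
<align xlink:href="#p1s2" />
</link>
</linkGrp>
</linkList>
</wrapper>"""

print(read_links(SAMPLE))
```

Links that are not 1:1 are silently skipped here; a real tool would have to
decide how to represent them.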

I want to be able to say:

xces_search --searchlanguage=sv 'höjer' examplealign.xml

What I want to get is:
Aktia höjer sin Prime-ränta.
Aktia nostaa Prime-korkoaan.
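
To make the intended behaviour concrete, here is a minimal Python sketch of
such a search. None of it is an existing tool: the function names, the <text>
and <wrapper> root elements, and the search_is_first parameter are my own
assumptions, as is the idea that the first <align> in each <link> refers to
the n="1" document. It works on strings rather than files to stay short.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def sentences(doc_xml):
    """Map each <s> id to its text, with internal line breaks collapsed."""
    root = ET.fromstring(doc_xml)
    return {s.get("id"): " ".join("".join(s.itertext()).split())
            for s in root.iter("s")}


def xces_search(word, search_doc, other_doc, align_xml, search_is_first=True):
    """Return (matching sentence, aligned sentence) pairs for a whole word."""
    src, tgt = sentences(search_doc), sentences(other_doc)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for link in ET.fromstring(align_xml).iter("link"):
        ids = [a.get(XLINK_HREF, "").lstrip("#") for a in link.findall("align")]
        if len(ids) != 2:  # this sketch only handles 1:1 links
            continue
        src_id, tgt_id = ids if search_is_first else ids[::-1]
        if pattern.search(src.get(src_id, "")):
            hits.append((src[src_id], tgt.get(tgt_id, "")))
    return hits


# Sample data from my files, wrapped in hypothetical root elements.
FI = """<text><p id="p1">
<s id="p1s1">Aktia nostaa Prime-korkoaan.</s>
<s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt
nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>
</p></text>"""
SV = """<text><p id="p1">
<s id="p1s1">Aktia höjer sin Prime-ränta.</s>
<s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att
höja Prime-räntan med 0,5 procentenheter.</s>
</p></text>"""
ALIGN = """<wrapper xmlns:xlink="http://www.w3.org/1999/xlink">
<link><align xlink:href="#p1s1" /><align xlink:href="#p1s1" /></link>
<link><align xlink:href="#p1s2" /><align xlink:href="#p1s2" /></link>
</wrapper>"""

# Swedish is the second document here, hence search_is_first=False.
for hit, aligned in xces_search("höjer", SV, FI, ALIGN, search_is_first=False):
    print(hit)
    print(aligned)
```

This only matches exact word forms, so a search for 'höjer' would not find
'höja' in the second sentence.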

Any ideas?

Best regards,

Mickel Grönroos

--
Mickel Grönroos, project manager, mickel.gronroos at masterin.com, +358 9 2517
4562
Master's Innovations Ltd., Tekniikantie 14, FIN-02150 Espoo, Finland,
www.masterin.com





