Open Parallel Corpus (OPUS)

Author: Dongkwang Shin | Posted: 13.12.12 | Views: 245

http://opus.lingfil.uu.se/bin/opuscqp.pl

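On the site above, select one of the open corpora listed below, choose the search language (shown as abbreviations, e.g. en for English, fr for French), and enter a query in the search box.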

Corpora

(http://opus.lingfil.uu.se/ An open corpus built from open sources such as subtitles; the files can also be downloaded.)

OPUS includes a variety of parallel corpora. Here is some information about the main corpora in the collection:

ECB - Translated websites from the European Central Bank

EMEA - European Medicines Agency documents
A parallel corpus made out of PDF documents from the European Medicines Agency, in 22 European languages, originally extracted from http://www.emea.europa.eu/. All files are automatically converted from PDF to plain text using pdftotext with the command line arguments -layout -nopgbrk -eol unix. There are some known problems with tables and multi-column layouts - some of them are fixed in the current version.
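
The conversion step described above can be sketched in Python as follows. This is only an illustration, not the script OPUS actually used: the helper names `build_pdftotext_command` and `convert_pdf` and the file name `example.pdf` are hypothetical, and only the three command line arguments quoted above come from the text. It assumes the `pdftotext` tool is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for pdftotext using the flags quoted above."""
    return [
        "pdftotext",
        "-layout",        # keep the original physical layout of the text
        "-nopgbrk",       # do not insert page-break characters between pages
        "-eol", "unix",   # write Unix line endings
        str(pdf_path),
        str(txt_path),
    ]


def convert_pdf(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


if __name__ == "__main__":
    # Hypothetical input file, for illustration only.
    source = Path("example.pdf")
    if source.exists():
        convert_pdf(source, source.with_suffix(".txt"))
```

Because `-layout` preserves the physical page layout, tables and multi-column pages are the places where the output is most likely to need manual checking, which matches the known problems mentioned above.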

EU bookshop - publications from the EU bookshop

EUconst - The European constitution

EUROPARL - European Parliament Proceedings, release v3, Sept 27, 2007
A parallel corpus that was extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). Approximately 40 million words per language. The main intended use is to aid statistical machine translation research.

More information can be found at http://www.statmt.org/europarl/. The main difference in this release vs. the first release in 2002 and second release in 2003 is that it is larger and it comes with improved processing tools that allow the creation of parallel corpora between any two of the 11 languages. Some data is now tagged with the original language the text was spoken in.

EUROPARL v7 - version 7 of the Europarl corpus

Croatian-English WaC - crawled web data

KDE4 - KDE4 localization files (v.2)

KDEdoc - KDE manuals

MBS - parallel newspaper texts from Belgium

MultiUN - multilingual publications by the United Nations

Open Office - A collection of documents from http://www.openoffice.org/
The original documentation of the office package OpenOffice.org (http://www.openoffice.org/) contains 2014 English documents which have been partly translated into 5 languages: French, Spanish, Swedish, German, and Japanese. The original documentation in English comprises about 500,000 words and translations contain between 400,000 and 500,000 words per language. All documents have been tokenized and, except for the Spanish part, tagged with parts of speech. The English part of the corpus has been marked with syntactic chunks as well.

OfisPublik - a small Breton-French corpus

Open Subtitles - A collection of documents from http://www.opensubtitles.org/

OpenSubtitles2011 - A much larger collection of subtitles from the same source

OpenSubtitles2012 - A new improved version of OpenSubtitles2011

PHP - A collection of PHP manuals
Originally extracted from http://se.php.net/download-docs.php. The original documents are written in English and have been partly translated into 21 languages. The original manuals contain about 500,000 words. The amount of actually translated texts varies for different languages between 50,000 and 380,000 words. The corpus is rather noisy and may include parts from the English original in some of the translations. The corpus is tokenized and each language pair has been sentence aligned.
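
One simple way to reduce this kind of noise is to drop aligned pairs whose translation is identical to the English source. The sketch below is an assumption for illustration, not a method described by OPUS; the function name `drop_untranslated_pairs` and the sample sentences are hypothetical.

```python
def drop_untranslated_pairs(pairs):
    """Keep only (source, target) pairs whose target differs from the source.

    The comparison ignores case and surrounding whitespace. This is a
    heuristic: it also removes segments that are legitimately identical
    in both languages, such as function names or code samples.
    """
    kept = []
    for source, target in pairs:
        if source.strip().lower() != target.strip().lower():
            kept.append((source, target))
    return kept


if __name__ == "__main__":
    sample = [
        ("Returns the length of the string .",
         "Gibt die Länge der Zeichenkette zurück ."),
        ("This function is not documented .",
         "This function is not documented ."),
    ]
    for source, target in drop_untranslated_pairs(sample):
        print(source, "|||", target)
```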

Regeringsförklaringen - a small test corpus (Swedish government policies)

SETIMES - South East European Times
A parallel corpus of news articles in the Balkan languages, originally extracted from http://www.setimes.com.
The corpus is PUBLIC DOMAIN, but if you use it in your work, please cite: Francis M. Tyers and Murat Alperen (2010), "South-East European Times: A parallel corpus of the Balkan languages"

SETIMES2 - a cleaner version of SETIMES

SPC - Stockholm Parallel Corpora
A collection of parallel corpora collected by Hercules Dalianis and his research group for bilingual dictionary construction. More information in: Hercules Dalianis, Hao-chun Xing, Xin Zhang: Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction, In Proceedings of LREC2010 (source: http://people.dsv.su.se/~hercules/SEC/) and Konstantinos Charitakis (2007): Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG, In Proceedings of NODALIDA 2007 (source: http://dspace.utlib.ee/xmlui/bitstream/handle/10062/2576/stud-Charitakis-10.pdf?sequence=1)

Tatoeba - user-generated translations in many languages

TedTalks - translated subtitles of TED talks (only Croatian - English so far)

TEP - movie subtitles in Farsi and English

UN - another collection of publications from the United Nations

WikiSource - a test collection (Bible) from WikiSource (Swedish - English only so far)

Download formats

For each corpus, the following data is available for download at the OPUS website:
Complete download of all corpus files

Monolingual sample files
Bilingual sample files
Sentence alignments in XCES format
Gzipped tar-archives of corpus files in XML
Translation memory files in TMX format
Plain text files (MOSES/GIZA++)
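
As an illustration of how one of these formats might be consumed, here is a minimal Python sketch that reads sentence pairs from a TMX document. It assumes the generic TMX layout (`tu` units containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the function name `read_tmx_pairs` and the embedded sample are hypothetical, and actual OPUS files may use different language codes or extra attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample for illustration; not taken from an OPUS download.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file .</seg></tuv>
      <tuv xml:lang="fr"><seg>Ouvrez le fichier .</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs from TMX text."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # Skip units that lack either of the requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "fr"):
        print(source, "|||", target)
```

For large downloads, parsing the whole file in memory may be impractical; an incremental parser such as `xml.etree.ElementTree.iterparse` would be the usual alternative.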

For some corpora, there are also word alignments available from here: http://opus.lingfil.uu.se/wordalign/
