Overview

GreynirCorrect can be used in three modes, depending on your requirements.

  • You can use it as a Python package to proofread text at the token (word) level, correcting and annotating individual errors in the token stream.

  • It can also perform more extensive grammar checking at the sentence level, returning a list of annotations for each sentence.

  • Finally, it can be used as a command-line tool, at the token level, to consume an input file stream (or stdin) and write to a corrected output file stream (or stdout).

These modes are described below. The Reference section contains further detail.

Token-level correction

GreynirCorrect can tokenize text and return an automatically corrected token stream. This catches token-level errors, such as spelling errors and erroneous fixed phrases (?að ýmsu leitiað ýmsu leyti), but not grammatical errors. The returned token objects are annotated with explanations of each correction or suggestion.

Token-level correction is relatively fast.

Full grammar analysis

GreynirCorrect can analyze text grammatically by attempting to parse each sentence in turn, after token-level correction. The parsing is done according to Greynir’s context-free grammar for Icelandic, augmented with additional production rules for common grammatical errors (?Manninum á verkstæðinu vantaði hamarManninn á verkstæðinu vantaði hamar). The analysis returns each sentence along with a set of annotations (errors and suggestions) that apply to spans (consecutive tokens) within the sentence, in addition to individual token annotations.

Full grammar analysis of sentences is slower than token-level correction.

Command-line tool

GreynirCorrect can be invoked as a command-line tool to perform token-level correction. The command is correct infile.txt outfile.txt, where infile.txt and outfile.txt are the input and output filenames, respectively.

The command-line tool is further documented here.

Examples

To perform token-level correction from Python code:

from reynir_correct import tokenize
g = tokenize("Af gefnu tilefni fékk fékk daninn vilja sýnum "
    "framgengt í auknu mæli.")
for tok in g:
    print("{0:10} {1}".format(tok.txt or "", tok.error_description))

Output:

         Orðasambandið 'Af gefnu tilefni' var leiðrétt í 'að gefnu tilefni'
gefnu
tilefni
fékk       Endurtekið orð ('fékk') var fellt burt
Daninn     Orð á  byrja á hástaf: 'daninn'
vilja      Orðasambandið 'vilja sýnum framgengt' var leiðrétt í 'vilja sínum framgengt'
sínum
framgengt
í          Orðasambandið 'í auknu mæli' var leiðrétt í 'í auknum mæli'
auknum
mæli
.

To perform full spelling and grammar analysis of a sentence from Python code:

from reynir_correct import check_single
sent = check_single("Páli, vini mínum, langaði að horfa á sjónnvarpið.")
for annotation in sent.annotations:
    print("{0}".format(annotation))

Output:

000-004: P_WRONG_CASE_þgf_þf Á líklega  vera 'Pál, vin minn' / [Pál , vin minn]
009-009: S004   Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið'
>>> sent.tidy_text

Output:

'Páli, vini mínum, langaði að horfa á sjónvarpið.'

Note that the annotation.start and annotation.end properties (here start is 0 and end is 4) contain the indices of the first and last tokens to which the annotation applies. P_WRONG_CASE_þgf_þf and S004 are error codes.