Reference

This section describes the following functions and classes:

tokenize()
check_single()
check()
check_with_stats()
CorrectToken
Annotation

Importing GreynirCorrect

After installing the reynir-correct package (see Installation), import it using:

import reynir_correct as grc

If you only want to do token-level checking, the simplest method is to import only the tokenize() function (documented below):

from reynir_correct import tokenize

Similarly, if you only want straightforward checking of single sentences, you can import only the check_single() function (documented below):

from reynir_correct import check_single

The tokenize() function

tokenize(text: Union[str, Iterable[str]], **options) → Iterator[CorrectToken]

Consumes a text stream and returns a generator of instances of CorrectToken, corrected or annotated as the case may be.

Parameters

text – A text string, or an iterable of strings.
options –
Tokenizer options can be passed via keyword arguments, as in g = tokenize(text, convert_numbers=True). See the documentation for the Tokenizer package for further information.

Two boolean flags directly affect the correction process. Setting only_ci=True tells the checker to look only for context-independent errors. Setting apply_suggestions=True makes the checker more aggressive in turning suggestions into corrections.

Returns

A generator of tokens, where each token is an instance of the CorrectToken class.

Example:

from reynir_correct import tokenize
g = tokenize("Maðurin borðaði aldrey Danskan hammborgara")
for t in g:
    if t.txt:
        print(f"{t.txt:12} {t.error_code:8} {t.error_description}")

Output:

Maðurinn     S004   Orðið 'Maðurin' var leiðrétt í 'Maðurinn'
borðaði
aldrei       S001   Orðið 'aldrey' var leiðrétt í 'aldrei'
danskan      Z001   Orð á að byrja á lágstaf: 'Danskan'
hamborgara   S004   Orðið 'hammborgara' var leiðrétt í 'hamborgara'

The check_single() function

check_single(sentence: str) → _Sentence

Analyzes the spelling and grammar of a sentence, returning an instance of the _Sentence class. The _Sentence class is described in the Greynir documentation. GreynirCorrect adds the annotations property to the _Sentence object, which returns a list of Annotation instances applying to the sentence.

Parameters

sentence – The sentence to analyze, as a string. If the string contains more than one sentence, only the first one is analyzed.

Returns

A _Sentence object.

Example:

from reynir_correct import check_single
s = check_single("Ég dreimi um að leita af mindinni")
for a in s.annotations:
    print(a)

Output (showing token span, error code, error description and suggested replacement):

000-000: P_WRONG_CASE_nf_þf Á líklega að vera 'Mig' / [Mig]
001-001: S004   Orðið 'dreimi' var leiðrétt í 'dreymi'
004-005: P001   'leita af' á sennilega að vera 'leita að' / [um að leita að myndinni]
006-006: S004   Orðið 'mindinni' var leiðrétt í 'myndinni'

The check() function

check(text: str, *, split_paragraphs: bool = False) → Iterable[_Paragraph]

Returns a generator of checked paragraphs of text (instances of the _Paragraph class), with each of those being a generator of checked sentences with annotations. Sentences are parsed and checked “on demand”, just before being returned from the generator.

Parameters

text – The text to analyze, as a string. It may contain multiple paragraphs and sentences.
split_paragraphs – If set to True, the text will be split into paragraphs at each newline.

Returns

A generator of _Paragraph instances.

The check_with_stats() function

check_with_stats(text: str, *, split_paragraphs: bool = False) → Dict

Returns a dictionary with the results of a grammar and spelling check on the given text. This is a synchronous call, i.e. it does not return until the entire text has been processed.

Parameters

text – The text to analyze, as a string. It may contain multiple paragraphs and sentences.
split_paragraphs – If set to True, the text is automatically split into paragraphs between empty lines.

Returns

A dictionary with the following keys and values:

paragraphs: A list of lists of _Sentence objects, each having the annotations property containing a list of Annotation objects.
num_tokens: The total number of tokens processed.
num_sentences: The number of sentences found in the text.
num_parsed: The number of sentences that were successfully parsed.
ambiguity: A float holding the weighted average ambiguity of the parsed sentences. The ambiguity of a sentence is defined as the n-th root of the number of possible parse trees for the sentence, where n is the number of tokens in the sentence.
parse_time: A float with the wall clock time, in seconds, spent on tokenizing and parsing the sentences.

The CorrectToken class

class CorrectToken

The CorrectToken class replaces the default tokenizer.Tok named tuple normally returned by the Tokenizer. By way of duck typing, it replicates the kind, txt and val properties of the Tok tuple. It then adds a number of properties to access error codes and annotations on the token, as described here:

error_description(self) → str: Returns the description of the error associated with the token, or an empty string if there is no error.

error_code(self) → str: Returns the code of the error associated with the token, or an empty string if there is no error.

error_suggestion(self) → str: Returns the text of a suggested replacement for the text of this token, or an empty string if there is no error.

error_span(self) → int: Returns the number of consecutive tokens, starting with this one, that are affected by the same error. In most cases this is 1, meaning that there are no additional affected tokens.

The Annotation class

class Annotation

The Annotation class represents an annotation of a token span within a sentence. An annotation describes a correction that has already been applied to the sentence, or a suggested correction.

__str__(self) → str: Returns a string representation of the annotation. This is intended mainly for debugging and development purposes.

start(self) → int: Returns the index of the first token to which the annotation applies. Token indices are 0-based.

end(self) → int: Returns the index of the last token to which the annotation applies. Token indices are 0-based.

code(self) → str: Returns an error or warning code for the annotation. If the code ends with "/w", it is a warning.

text(self) → str: Returns a brief, human-readable description of the annotation.

detail(self) → str: Returns a more detailed, human-readable description of the annotation.

suggest(self) → str: Returns a suggested replacement for the text within the token span to which the annotation applies. This applies only to suggested corrections, i.e. when the correction has not already been applied to the sentence.

The _Paragraph class

class _Paragraph: The _Paragraph class is described in the Greynir documentation.

The _Sentence class

class _Sentence: The _Sentence class is described in the Greynir documentation.
