Reference
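This section describes the following functions and classes: tokenize(), check_single(), check(), check_with_stats(), CorrectToken, Annotation, _Paragraph and _Sentence.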
Importing GreynirCorrect
After installing the reynir-correct package (see Installation), import it using:
import reynir_correct as grc
If you only want token-level checking, the simplest approach is to import only the tokenize() function (documented below):
from reynir_correct import tokenize
Similarly, if you only want straightforward checking of single sentences, you can import only the check_single() function (documented below):
from reynir_correct import check_single
The tokenize() function
tokenize(text: Union[str, Iterable[str]], **options) → Iterator[CorrectToken]
Consumes a text stream and returns a generator of CorrectToken instances, corrected or annotated as the case may be.
- Parameters
  text – A text string, or an iterable of strings.
  options – Tokenizer options can be passed via keyword arguments, as in g = tokenize(text, convert_numbers=True). See the documentation for the Tokenizer package for further information. Two boolean flags directly affect the correction process. Setting only_ci=True tells the checker to look only for context-independent errors. Setting apply_suggestions=True makes the checker more aggressive in turning suggestions into corrections.
- Returns
  A generator of tokens, where each token is an instance of the CorrectToken class.
Example:
from reynir_correct import tokenize

g = tokenize("Maðurin borðaði aldrey Danskan hammborgara")
for t in g:
    # Skip tokens that carry no text
    if t.txt:
        print(f"{t.txt:12} {t.error_code:8} {t.error_description}")
Output:
Maðurinn     S004     Orðið 'Maðurin' var leiðrétt í 'Maðurinn'
borðaði
aldrei       S001     Orðið 'aldrey' var leiðrétt í 'aldrei'
danskan      Z001     Orð á að byrja á lágstaf: 'Danskan'
hamborgara   S004     Orðið 'hammborgara' var leiðrétt í 'hamborgara'
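The loop above can be wrapped in a small helper that keeps only the tokens carrying an error. The following is a minimal sketch and not part of GreynirCorrect: the helper name `errors_only` and the `StandInToken` class are hypothetical, and the stand-in mimics only the `txt`, `error_code` and `error_description` properties documented for `CorrectToken`, so that the example runs without the package installed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StandInToken:
    """Hypothetical stand-in exposing a few documented CorrectToken properties."""
    txt: str
    error_code: str = ""
    error_description: str = ""


def errors_only(tokens: Iterable) -> List[Tuple[str, str, str]]:
    """Return (text, code, description) for each token with a non-empty error code."""
    return [
        (t.txt, t.error_code, t.error_description)
        for t in tokens
        if t.txt and t.error_code
    ]


# Stand-in data modelled on the first three lines of the output shown above.
sample = [
    StandInToken("Maðurinn", "S004", "Orðið 'Maðurin' var leiðrétt í 'Maðurinn'"),
    StandInToken("borðaði"),
    StandInToken("aldrei", "S001", "Orðið 'aldrey' var leiðrétt í 'aldrei'"),
]
found = errors_only(sample)
for txt, code, description in found:
    print(f"{txt:12} {code:8} {description}")
```

With the package installed, the same helper should accept the generator returned by tokenize(text), optionally called with only_ci=True to restrict checking to context-independent errors.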
The check_single() function
check_single(sentence: str) → _Sentence
Analyzes the spelling and grammar of a sentence, returning an instance of the _Sentence class. The _Sentence class is described in the Greynir documentation. GreynirCorrect adds the annotations property to the _Sentence object, which returns a list of Annotation instances applying to the sentence.
- Parameters
  sentence – The sentence to analyze, as a string. If the string contains more than one sentence, only the first one is analyzed.
- Returns
  A _Sentence object.
Example:
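from reynir_correct import check_single

s = check_single("Ég dreimi um að leita af mindinni")
# Each annotation prints as: token span, error code, description and suggestion
for a in s.annotations:
    print(a)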
Output (showing token span, error code, error description and suggested replacement):
000-000: P_WRONG_CASE_nf_þf Á líklega að vera 'Mig' / [Mig]
001-001: S004 Orðið 'dreimi' var leiðrétt í 'dreymi'
004-005: P001 'leita af' á sennilega að vera 'leita að' / [um að leita að myndinni]
006-006: S004 Orðið 'mindinni' var leiðrétt í 'myndinni'
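Each annotation refers to an inclusive, 0-based token span, so the affected words can be recovered by slicing the token texts of the sentence. The following is a minimal sketch under stated assumptions: `span_text` and `StandInAnnotation` are hypothetical names, the stand-in carries only the `start`, `end` and `code` properties documented for `Annotation`, and the token texts are split by hand rather than read from a `_Sentence` object.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StandInAnnotation:
    """Hypothetical stand-in with the start, end and code properties of Annotation."""
    start: int
    end: int
    code: str


def span_text(tokens: Sequence[str], ann) -> str:
    """Join the token texts covered by an annotation (the end index is inclusive)."""
    return " ".join(tokens[ann.start : ann.end + 1])


# Token texts of the example sentence, split by hand for illustration.
tokens: List[str] = "Ég dreimi um að leita af mindinni".split()

# Spans and codes copied from the output shown above.
annotations = [
    StandInAnnotation(0, 0, "P_WRONG_CASE_nf_þf"),
    StandInAnnotation(1, 1, "S004"),
    StandInAnnotation(4, 5, "P001"),
    StandInAnnotation(6, 6, "S004"),
]
spans = [(a.code, span_text(tokens, a)) for a in annotations]
for code, text in spans:
    print(f"{code}: {text}")
```

The `+ 1` in the slice matters: `end` names the last affected token, whereas a Python slice stops before its upper bound.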
The check() function
check(text: str, *, split_paragraphs: bool = False) → Iterable[_Paragraph]
Returns a generator of checked paragraphs of text (instances of the _Paragraph class), each of which is in turn a generator of checked sentences with annotations. Sentences are parsed and checked “on demand”, just before being returned from the generator.
- Parameters
  text – The text to analyze, as a string. It may contain multiple paragraphs and sentences.
  split_paragraphs – If set to True, the text will be split into paragraphs at each newline.
- Returns
  A generator of _Paragraph instances.
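Because check() yields paragraphs that in turn yield sentences, consuming its result takes two nested loops. The sketch below flattens that structure into a single lazy stream of annotations. The names `iter_annotations` and `StandInSentence` are hypothetical, and nested lists stand in for the `_Paragraph` generators so that the example runs without the package installed.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class StandInSentence:
    """Hypothetical stand-in exposing only the annotations property."""
    annotations: List[str] = field(default_factory=list)


def iter_annotations(paragraphs: Iterable[Iterable]) -> Iterator:
    """Lazily yield every annotation, paragraph by paragraph, sentence by sentence."""
    for paragraph in paragraphs:
        for sentence in paragraph:
            yield from sentence.annotations


# Two paragraphs; plain strings stand in for Annotation instances.
paragraphs = [
    [StandInSentence(["S004"]), StandInSentence()],
    [StandInSentence(["P001", "Z001"])],
]
flat = list(iter_annotations(paragraphs))
print(flat)
```

With the package installed, `iter_annotations(check(text))` should behave the same way, and since the helper is itself a generator, sentences would only be parsed as the consuming loop reaches them.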
The check_with_stats() function
check_with_stats(text: str, *, split_paragraphs: bool = False) → Dict
Returns a dictionary with the results of a grammar and spelling check on the given text. This is a synchronous call, i.e. it does not return until the entire text has been processed.
- Parameters
  text – The text to analyze, as a string. It may contain multiple paragraphs and sentences.
  split_paragraphs – If set to True, the text is automatically split into paragraphs between empty lines.
- Returns
  A dictionary with the following keys and values:
  paragraphs: A list of lists of _Sentence objects, each having the annotations property containing a list of Annotation objects.
  num_tokens: The total number of tokens processed.
  num_sentences: The number of sentences found in the text.
  num_parsed: The number of sentences that were successfully parsed.
  ambiguity: A float with the weighted average of the ambiguity of the parsed sentences. Ambiguity is defined as the n-th root of the number of possible parse trees for the sentence, where n is the number of tokens in the sentence.
  parse_time: A float with the wall clock time, in seconds, spent on tokenizing and parsing the sentences.
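The ambiguity definition above can be checked by hand: a sentence of 4 tokens with 16 possible parse trees has an ambiguity of 16^(1/4) = 2. The sketch below computes that figure, along with a parse success ratio read from a result dictionary. All helper names are hypothetical, the dictionary is hand-built with the documented keys and made-up values, and weighting each sentence by its token count is an assumption, since the description above does not state which weights are used.

```python
from typing import Dict, List, Tuple


def sentence_ambiguity(num_trees: int, num_tokens: int) -> float:
    """n-th root of the number of parse trees, where n is the token count."""
    return num_trees ** (1.0 / num_tokens)


def weighted_ambiguity(sentences: List[Tuple[int, int]]) -> float:
    """Average ambiguity over (num_trees, num_tokens) pairs.

    ASSUMPTION: each sentence is weighted by its token count.
    """
    total_tokens = sum(n for _, n in sentences)
    if total_tokens == 0:
        return 0.0
    return sum(sentence_ambiguity(t, n) * n for t, n in sentences) / total_tokens


def parse_ratio(stats: Dict) -> float:
    """Fraction of sentences that were successfully parsed."""
    if stats["num_sentences"] == 0:
        return 0.0
    return stats["num_parsed"] / stats["num_sentences"]


# Hand-built stand-in for a check_with_stats() result (values are made up).
stats = {
    "paragraphs": [],
    "num_tokens": 12,
    "num_sentences": 4,
    "num_parsed": 3,
    "ambiguity": 1.5,
    "parse_time": 0.25,
}

single = sentence_ambiguity(16, 4)
combined = weighted_ambiguity([(16, 4), (1, 4)])
ratio = parse_ratio(stats)
print(f"{single:.2f} {combined:.2f} {ratio:.2f}")
```

An unambiguous sentence has exactly one parse tree and therefore an ambiguity of 1, which is the lower bound of the measure.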
The CorrectToken class
class CorrectToken
The CorrectToken class replaces the default tokenizer.Tok named tuple normally returned by the Tokenizer. By way of duck typing, it replicates the kind, txt and val properties of the Tok tuple. It then adds a number of properties to access error codes and annotations on the token, as described here:
- error_description(self) → str
  Returns the description of the error associated with the token, or an empty string if there is no error.
- error_code(self) → str
  Returns the code of the error associated with the token, or an empty string if there is no error.
- error_suggestion(self) → str
  Returns the text of a suggested replacement for the text of this token, or an empty string if there is no error.
- error_span(self) → int
  Returns the number of consecutive tokens, starting with this one, that are affected by the same error. In most cases this is 1, meaning that there are no additional affected tokens.
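The error_span property makes it possible to group a token with the following tokens that are affected by the same error. The following is a minimal sketch: `group_by_span` and `StandInToken` are hypothetical names, the stand-in mimics only the documented `txt`, `error_code` and `error_span` properties, and the sample tokens and the codes `CODE_A` and `CODE_B` are invented placeholders rather than real GreynirCorrect output.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StandInToken:
    """Hypothetical stand-in exposing a few documented CorrectToken properties."""
    txt: str
    error_code: str = ""
    error_span: int = 1


def group_by_span(tokens: Sequence) -> List[Tuple[str, str]]:
    """Return (joined text, error code) for each error, covering error_span tokens."""
    groups = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.error_code:
            # Guard against a span below 1 so the loop always advances.
            span = max(1, tok.error_span)
            text = " ".join(t.txt for t in tokens[i : i + span])
            groups.append((text, tok.error_code))
            i += span
        else:
            i += 1
    return groups


# Invented sample: the second token's error also covers the token after it.
sample = [
    StandInToken("Hann"),
    StandInToken("leitaði", "CODE_A", 2),
    StandInToken("af"),
    StandInToken("bókinni", "CODE_B"),
]
groups = group_by_span(sample)
for text, code in groups:
    print(f"{code}: {text}")
```

Tokens inside a span are skipped after being grouped, so the helper does not depend on whether those tokens carry an error code of their own.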
The Annotation class
class Annotation
The Annotation class represents an annotation of a token span within a sentence. An annotation describes a correction that has already been applied to the sentence, or a suggested correction.
- __str__(self) → str
  Returns a string representation of the annotation. This is intended mainly for debugging and development purposes.
- start(self) → int
  Returns the index of the first token to which the annotation applies. Token indices are 0-based.
- end(self) → int
  Returns the index of the last token to which the annotation applies. Token indices are 0-based.
- code(self) → str
  Returns an error or warning code for the annotation. If the code ends with "/w", it is a warning.
- text(self) → str
  Returns a brief, human-readable description of the annotation.
- detail(self) → str
  Returns a more detailed, human-readable description of the annotation.
- suggest(self) → str
  Returns a suggested replacement for the text within the token span to which the annotation applies. This only applies to suggested corrections, i.e. if the correction has not already been applied to the sentence.
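Since warning codes are marked by a trailing "/w", annotations can be separated into errors and warnings with a simple suffix test. The following is a minimal sketch: `split_by_severity` and `StandInAnnotation` are hypothetical names, the stand-in carries only the `code` and `text` properties, and `X999/w` is a made-up placeholder rather than a real GreynirCorrect code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StandInAnnotation:
    """Hypothetical stand-in with the code and text properties of Annotation."""
    code: str
    text: str = ""


def split_by_severity(annotations: Iterable) -> Tuple[List, List]:
    """Split annotations into (errors, warnings) based on the "/w" code suffix."""
    errors: List = []
    warnings: List = []
    for ann in annotations:
        if ann.code.endswith("/w"):
            warnings.append(ann)
        else:
            errors.append(ann)
    return errors, warnings


sample = [
    StandInAnnotation("S004"),
    StandInAnnotation("X999/w"),  # placeholder warning code
    StandInAnnotation("P001"),
]
errors, warnings = split_by_severity(sample)
print([a.code for a in errors], [a.code for a in warnings])
```

With the package installed, passing the annotations list of a checked sentence to the helper should work the same way.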
The _Paragraph class
class _Paragraph
The _Paragraph class is described in the Greynir documentation.
The _Sentence class
class _Sentence
The _Sentence class is described in the Greynir documentation.