The mission of vogella is to provide everyone with developer knowledge in plain, simple English, whether through online tutorials, books or training materials. Improving text quality is a constant endeavour. It has to be done, because flawed texts are hard and annoying to read. Everyone who has written a longer work knows that proofreading spelling, grammar and style can be a pain. It screams for automation. However, the simple spell-check that comes with most tools is often not smart enough to detect your errors. There has to be a solution, and there is!
This is where LanguageTool comes in. It is an Open Source project designed to analyze texts in various languages. What compiler warnings and errors are to programming languages, LanguageTool is to the ones humans understand and speak. It is written in platform-independent Java and comes with plug-ins for word processors (such as LibreOffice), browsers (like Mozilla Firefox) and various IDEs (including experimental support for Eclipse). It can even run headless, either on the command line or as a web service accessed through a REST API.
It is developed in the open on GitHub by a global community, and you can help, too. Luckily, you don’t have to be a computational linguist. Just collect your common mistakes and turn them into new rules, either by defining them in XML or by writing more complex ones directly in Java. Have a look at the following very basic example to get an idea:
<rule id="VOGELLA_BRANDING" name="vogella lower-case spelling">
  <pattern case_sensitive='yes'>
    <token>Vogella</token>
  </pattern>
  <message>Always use a lower-case spelling for the company name.</message>
  <url>http://www.vogella.com/company/</url>
  <short>Branding styleguide</short>
  <example type='incorrect'><marker>Vogella</marker> GmbH</example>
  <example type='correct'>vogella GmbH</example>
</rule>
The opening rule tag gives the rule a unique identifier and a human-readable name. Next is the most important part: the pattern that detects the error. Here it is just the single word “Vogella” in a case-sensitive match. Then come the messages that explain the possible problem to the user: a short one for right-click menus and a longer one for tooltips. The examples are evaluated during automated software tests to check that everything works as promised. Optionally, a web link can be added for reference. It should back up the rule with a credible source.
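To make the mechanics tangible, here is a minimal Python sketch of what such a single-token rule boils down to. It is not how LanguageTool is implemented; the names `Match`, `check_branding` and `EXAMPLES` are made up for this illustration, and the tokenizer is a crude regular expression. It also shows how the two examples from the rule can double as tests.

```python
import re
from dataclasses import dataclass
from typing import List

MESSAGE = "Always use a lower-case spelling for the company name."


@dataclass
class Match:
    offset: int   # character position where the flagged token starts
    length: int   # number of flagged characters
    message: str  # explanation shown to the user


def check_branding(text: str) -> List[Match]:
    """Toy stand-in for the VOGELLA_BRANDING rule: flag the token 'Vogella'."""
    matches = []
    # \w+ is a crude tokenizer: every run of word characters is one token.
    for token in re.finditer(r"\w+", text):
        # Plain string comparison is case-sensitive, like case_sensitive='yes'.
        if token.group() == "Vogella":
            matches.append(Match(token.start(), len(token.group()), MESSAGE))
    return matches


# The rule's own examples, reused as a tiny test suite:
# (sentence, should the rule fire?)
EXAMPLES = [("Vogella GmbH", True), ("vogella GmbH", False)]

for sentence, should_fire in EXAMPLES:
    assert bool(check_branding(sentence)) == should_fire
```

The offset and length mark the span that a front end would underline, which is the job of the marker element in the incorrect example.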
As you can see, instead of teaching computers how human language works (which is very hard), we tell them how to spot the errors we meatballs make ourselves. Internally, LanguageTool splits every sentence into tokens, such as single words or the start and end of the sentence. It then tries to guess the type of each word (adjective, noun, verb, …) and its inflection (e.g., singular vs. plural, the tense, …) from large internal databases and tags the user input with this metadata. Have a look at the text analysis to try it for yourself. This makes it possible to detect flaws in word order that a simple spell-check can never find, as with this more advanced rule:
<rule>
  <pattern>
    <token postag="VB[DPZ]?" postag_regexp="yes">
      <exception inflected="yes" regexp="yes">have|be|will</exception>
      <exception postag="NNS?" postag_regexp="yes"/>
    </token>
    <marker>
      <token regexp="yes">always|hardly|never|often|rarely|seldom|sometimes|usually</token>
    </marker>
    <token>
      <exception postag="VB.*|SENT_END" postag_regexp="yes"/>
    </token>
  </pattern>
  <message>The adverb '\2' is usually put before the verb '\1'.</message>
  <example type="incorrect">Rarely I click the ads.</example>
  <example type="correct">I rarely click the ads.</example>
</rule>
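The idea behind this pattern can be sketched in a few lines of Python. This is a toy under loud assumptions, not LanguageTool's engine: the mini-lexicon `LEXICON`, the helper `tag` and the function `misplaced_adverbs` are invented for illustration, every word gets exactly one tag, and the exception for words that can also be read as nouns is left out. The sketch slides a three-token window over the sentence and looks for a main verb, followed by a frequency adverb, followed by something that is not a verb.

```python
import re

# Hypothetical mini-lexicon: word -> (part-of-speech tag, base form).
# A real tagger derives this from large dictionaries.
LEXICON = {
    "she": ("PRP", "she"),
    "reads": ("VBZ", "read"),
    "is": ("VBZ", "be"),
    "books": ("NNS", "book"),
    "happy": ("JJ", "happy"),
}
ADVERBS = {"always", "hardly", "never", "often",
           "rarely", "seldom", "sometimes", "usually"}
AUXILIARIES = {"have", "be", "will"}  # base forms excluded by the first exception


def tag(sentence):
    """Split a sentence into lower-cased words and attach (tag, base form)."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [(w, *LEXICON.get(w, ("UNKNOWN", w))) for w in words]


def misplaced_adverbs(sentence):
    """Return (adverb, verb) pairs where the adverb directly follows a main verb."""
    tokens = tag(sentence)
    hits = []
    # Three-token window, mirroring the three token elements of the pattern.
    # A window needs a third token, so an adverb at the sentence end is ignored.
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        verb, verb_tag, base_form = first
        adverb = second[0]
        next_tag = third[1]
        is_main_verb = (re.fullmatch(r"VB[DPZ]?", verb_tag) is not None
                        and base_form not in AUXILIARIES)
        if is_main_verb and adverb in ADVERBS and not next_tag.startswith("VB"):
            hits.append((adverb, verb))
    return hits


print(misplaced_adverbs("She reads often books."))
print(misplaced_adverbs("She often reads books."))
```

The first call reports the adverb together with the verb it should precede, which corresponds to the `\2` and `\1` references in the message. The second call finds nothing, because the token in front of the adverb is not a verb.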
LanguageTool can be used to check Wikipedia, as reported at FOSDEM, to proofread your final academic assignment (fun fact: this project started as a diploma thesis itself) or to create validated technical documents in a controlled language at work in industry.
Matthias Mailänder is the junior editor at vogella GmbH who likes the spirit of Open Source. After establishing an internal workflow that fixes errors as early as possible, he decided it was time to give back to the community and has been granted committer rights in the LanguageTool project to help maintain the English language rules.