General Architecture for Text Engineering (GATE) Introduction
General Architecture for Text Engineering (GATE) is a suite of tools written in Java, used for human language processing, analysis, and information extraction. GATE is open-source and free, released under the GNU Lesser General Public License (LGPL).
GATE is used in many different language processing tasks and applications, such as web mining, information extraction, recruitment, and decision support.
Brief History of GATE
GATE was originally developed at the University of Sheffield, England, and initially released in 1995. GATE development has been continuous since the initial release, and is still ongoing, with the latest stable release of GATE being version 8.1, dated June 2, 2015.
The core development work is done by the GATE research team, with support from many community contributors.
GATE currently supports analysis for the following languages: English, Spanish, Chinese, Arabic, Bulgarian, French, German, Hindi, Italian, Cebuano, Romanian, and Russian.
GATE accepts text input in several formats, including TXT, HTML, XML, DOC, and PDF. Supported data stores are Java serialization, PostgreSQL, Lucene, and Oracle.
For the relational databases among them, GATE interacts with the database using the Java Database Connectivity (JDBC) API.
After years of development, GATE is now a stable and mature human language processing solution that includes a desktop client for developers, a workflow-based web application, a Java library, an architecture, and a polished process.
GATE Developer is an Integrated Development Environment (IDE) providing a graphical user interface (GUI) for the creation of human language processing software components.
GATE Developer comes with a bundled Information Extraction (IE) component set called A Nearly-New Information Extraction System (ANNIE).
ANNIE is a set of information extraction components comprising a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named entity transducer, and a coreference tagger.
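To make the idea of a component pipeline more concrete, here is a minimal sketch in plain Java of how processing stages can pass stand-off annotations (a type plus character offsets) to one another. It is not GATE's API and not how ANNIE is implemented: the `ToyPipeline` class, its `Annotation` type, the regular expressions, and the two-entry location list are all assumptions made up for illustration, and only a tokenizer, a gazetteer, and a sentence splitter are imitated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: this is NOT GATE's API or ANNIE's implementation.
class ToyPipeline {

    // A stand-off annotation: a type plus character offsets into the text.
    static final class Annotation {
        final String type;
        final int start;
        final int end;

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // Returns the stretch of text this annotation covers.
        String cover(String text) {
            return text.substring(start, end);
        }
    }

    // Hypothetical gazetteer list: a fixed set of known location names.
    static final Set<String> LOCATIONS =
            new HashSet<>(Arrays.asList("Sheffield", "England"));

    // Tokenizer stage: one Token annotation per run of letters or digits.
    static List<Annotation> tokenize(String text) {
        List<Annotation> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            tokens.add(new Annotation("Token", m.start(), m.end()));
        }
        return tokens;
    }

    // Gazetteer stage: mark tokens whose text appears in the list.
    static List<Annotation> lookup(String text, List<Annotation> tokens) {
        List<Annotation> found = new ArrayList<>();
        for (Annotation t : tokens) {
            if (LOCATIONS.contains(t.cover(text))) {
                found.add(new Annotation("Location", t.start, t.end));
            }
        }
        return found;
    }

    // Sentence splitter stage: naive split after '.', '!' or '?'.
    static List<Annotation> splitSentences(String text) {
        List<Annotation> sentences = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]?").matcher(text);
        while (m.find()) {
            int start = m.start();
            // Skip leading whitespace so the span starts at the first word.
            while (start < m.end() && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            if (start < m.end()) {
                sentences.add(new Annotation("Sentence", start, m.end()));
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "GATE was developed in Sheffield. It is written in Java.";
        List<Annotation> tokens = tokenize(text);
        for (Annotation a : lookup(text, tokens)) {
            System.out.println(a.type + " [" + a.start + "," + a.end + "): " + a.cover(text));
        }
        for (Annotation s : splitSentences(text)) {
            System.out.println(s.type + ": " + s.cover(text));
        }
    }
}
```

Each stage reads the text (and, for the gazetteer, the tokenizer's output) and returns new annotations instead of modifying the text, which is the general idea behind chaining such components. Real components are far more sophisticated than these regular expressions.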
GATE Teamware is a web-based management platform for collaborative language annotation and curation.
With GATE Teamware, you can use a distributed workforce for language processing: annotators view, add, and edit text annotations through its web interface. The same web-based interface is used for project setup, tracking, and management.
If you are interested in running GATE Teamware, the easiest way to get it is to buy a pre-configured, ready-to-run GATE Teamware virtual server from GATE Cloud. GATE Teamware is open-source, with its code hosted on SourceForge.
GATE Embedded is GATE's language processing class library, implemented in Java. It is an object-oriented framework used in all GATE systems, and it forms the core of GATE Developer.
GATE Embedded allows you to add language processing functions to your own applications. This is a very useful tool for programmers and is available as a set of Java archives (JARs).
GATE is one of the most popular human language processing tools, and it has the largest community of users out of all similar software solutions. Its widespread use and long development history have made GATE a stable, efficient, and comprehensive language processing solution.
GATE is used in science for experiments with language computation, where it provides for repeatability of experiments, quantitative evaluation, and measurement and collaboration.
In education, GATE is often used for examples and exercises in natural language engineering courses.
Business uses of GATE include analyzing customer feedback, annotating and searching scientific documents in pharmaceutical research, and processing captions in massive image libraries in media and journalism.
If you would like to try GATE, it's simple. Just download and run the GATE installer, and follow the detailed installation instructions. GATE is a cross-platform solution, so it can run on any system supporting Java.
If you work with computation tasks involving human language processing, you should take a more detailed look at GATE and some of the following resources:
The GATE Homepage is a good place to start. You can find the GATE user manual and other useful documentation, as well as GATE support and installation files, demos, and so on.
The GATE Public Wiki is also accessible from the GATE homepage, but we single it out because of its many useful examples and its content from the GATE training courses.
The American National Corpus website has a short tutorial on basic GATE usage.
Books that cover human language processing and GATE are quite rare, but the ones that are available are useful and popular. We recommend the following books:
Text Processing with GATE (2011) by Cunningham, Maynard, and Bontcheva: this book includes a guide to using GATE Developer and GATE Embedded, and chapters on all major areas of functionality, such as processing multiple languages and large collections of unstructured text, as well as a complete plugin documentation. Most of the book content originates from the online GATE user guide.
Building Search Applications: Lucene, Lingpipe, and Gate (2008) by Manu Konchady: this book is a practical guide to building search applications using open-source software. Lucene, LingPipe, and Gate are popular open source tools to build powerful search applications. Building Search Applications describes functions from GATE that include entity extraction, part of speech tagging, sentence extraction, and text tokenization.
Introduction to Linguistic Annotation and Text Analytics (Synthesis Lectures on Human Language Technologies) (2009) by Graham Wilcock: this book provides a basic introduction to linguistic annotation and text analytics. The two main text analytics architectures, GATE and UIMA, are described and compared, with practical exercises showing how to configure and customize them.
GATE is a popular and mature solution. It is backed by a large and active community, which makes it likely to be around for years to come.
However, GATE is not for everyone. Its use is restricted to several relatively small niches. On the other hand, its use in said niches is widespread. GATE's flexibility allows its use in a myriad of industries and organizations, ranging from big pharma to education.
Best of all, if you are not convinced you need it, you can try GATE at no cost. If you like it, you can deploy it in commercial projects like other open-source software, as long as you comply with the terms of the LGPL.
Further Reading and Resources
We have more guides, tutorials, and infographics related to coding and development:
Java: Introduction, How to Learn, and Resources: if you're going to use GATE, you will want to check out this introduction to the Java programming language.
Scala Programming Introduction: learn all about Scala, a JVM language sometimes described as a new and improved Java.
Prolog Resources: Prolog was specially designed to do natural language processing.
How to Avoid Falling in Love with a Chatbot
Interested in natural language processing? Learn all about its history in How to Avoid Falling in Love with a Chatbot. It's come a long way.