
8 Great Python Libraries for Natural Language Processing

Natural language processing, or NLP for short, is best described as "AI for speech and text." The magic behind voice commands, speech and text translation, sentiment analysis, text summarization, and many other linguistic applications and analyses, natural language processing has been dramatically improved through deep learning.

The Python language provides a convenient front end to all varieties of machine learning, including NLP. In fact, there is an embarrassment of NLP riches to choose from in the Python ecosystem. In this article we'll explore the NLP libraries available for Python - their use cases, their strengths, their weaknesses, and their general level of popularity.

Note that some of these libraries provide high-level versions of the same functionality provided by others, making that functionality easier to use at the expense of some precision or performance. You should choose a library that suits both your expertise and the type of project.

CoreNLP

The CoreNLP library - a Stanford University product - is designed as a production-ready natural language processing solution that can provide NLP prediction and analysis at scale. CoreNLP is written in Java, but several Python packages and APIs are available, including a native Python NLP library called StanfordNLP.

CoreNLP includes a wide range of language tools - part-of-speech tagging, named entity recognition, parsing, sentiment analysis, and much more. It was designed to be agnostic of human language and currently supports Arabic, Chinese, French, German, and Spanish (with Russian, Swedish, and Danish support available from third parties) in addition to English. CoreNLP also includes a web API server, a convenient way to serve predictions without too much extra work.

The easiest starting point for CoreNLP's Python wrappers is StanfordNLP, the reference implementation created by the Stanford NLP Group. In addition to being well documented, StanfordNLP is regularly maintained. Many of the other Python libraries for CoreNLP have not been updated in a while.
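
If you want to try the StanfordNLP wrapper, a minimal session looks something like the sketch below. The pipeline defaults and attribute names are taken from the StanfordNLP documentation, so treat this as a rough illustration rather than a recipe; the download step fetches several hundred megabytes of models.

# Minimal StanfordNLP sketch (model and attribute names assumed from its docs).
import stanfordnlp

stanfordnlp.download('en')      # one-time download of the English models
nlp = stanfordnlp.Pipeline()    # defaults to the English pipeline

doc = nlp("Stanford built CoreNLP in Java, but Python wrappers exist.")
for sentence in doc.sentences:
    for word in sentence.words:
        # Print each token with its universal part-of-speech tag
        print(word.text, word.upos)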

CoreNLP can also be used by way of NLTK, a major Python NLP library discussed below. As of version 3.2.3, NLTK includes interfaces to CoreNLP in its parser. Just make sure you use the correct API.
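
For example, with a CoreNLP server already running locally on its default port, NLTK's CoreNLP wrapper can be used roughly as follows (the server URL is the CoreNLP default; this is a sketch, not a full setup guide):

# Sketch: using NLTK's CoreNLP interface against a locally running server.
# Assumes you have started the CoreNLP server separately on port 9000.
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

tokens = list(parser.tokenize("CoreNLP speaks several languages."))
print(tokens)

# Constituency parse of the same sentence
parse_tree = next(parser.raw_parse("CoreNLP speaks several languages."))
parse_tree.pretty_print()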

The obvious downside of CoreNLP is that you'll need some familiarity with Java to get it up and running, but that's nothing a careful reading of the documentation can't overcome. Another hurdle could be CoreNLP's licensing: the entire toolkit is licensed under the GPLv3, so any use in proprietary software that you distribute to others requires a commercial license.

Gensim

Gensim does only two things, but it does them extremely well. Its focus is statistical semantics: analyzing documents for their structure, then scoring other documents on their similarity.

Gensim can work with very large bodies of text by streaming documents to its analysis engine and learning from them incrementally and without supervision. It can create several types of models, each suited to a different scenario: Word2Vec, Doc2Vec, FastText, and Latent Dirichlet Allocation.
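
As a rough sketch, training a Word2Vec model on an iterable of tokenized sentences takes only a few lines. The toy corpus below is far too small to learn anything meaningful; in real use, the iterable would stream sentences from disk.

# Sketch: training a small Word2Vec model with Gensim.
from gensim.models import Word2Vec

# In practice this would be a streaming iterable over a large corpus,
# e.g. a class whose __iter__ yields one tokenized sentence at a time.
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["gensim", "learns", "word", "vectors", "from", "text"],
    ["word", "vectors", "capture", "semantic", "similarity"],
]

model = Word2Vec(sentences, min_count=1)
print(model.wv.most_similar("vectors", topn=3))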

Gensim's extensive documentation contains tutorials and instructions that explain important concepts and illustrate them with practical examples. General recipes are also available on the Gensim GitHub repo.

NLTK

The Natural Language Toolkit, or NLTK for short, is one of the best-known and most powerful Python libraries for processing natural language. Many corpora (data sets) and trained models are ready to use with NLTK, so you can start experimenting with NLTK right away.

As stated in the documentation, NLTK offers a wide variety of tools for working with text: "classification, tokenization, stemming, tagging, parsing, and semantic reasoning". It can also work with some third-party tools to enhance its functionality.
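
A quick taste of those tools, assuming the relevant NLTK data packages have been downloaded first (the download calls below handle that):

# Sketch: tokenizing, tagging, and stemming with NLTK.
import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the required data packages
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "NLTK remains a classic toolkit for exploring natural language."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))                        # part-of-speech tags
print([PorterStemmer().stem(t) for t in tokens])   # crude stems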

Keep in mind that NLTK was created by and for an academic research audience. It was not designed to serve NLP models in a production environment. The documentation is also somewhat sparse; even the how-tos are thin. There is also no 64-bit binary; you will need to install the 32-bit edition of Python to use it. Finally, NLTK isn't the fastest library, but it can be sped up with parallel processing.

If you are determined to leverage what's inside NLTK, you might start with TextBlob instead (discussed below).

Pattern

When all you need to do is scrape a popular website and analyze what you find, reach for Pattern. This natural language processing library is far smaller and narrower than the other libraries covered here, but that also means it is focused on doing one common job really well.

Pattern has built-in functionality for scraping a number of popular web services and sources (Google, Wikipedia, Twitter, Facebook, generic RSS, and so on), all of which are available as Python modules. You don't have to reinvent the wheel to get data from those sites, with all their individual quirks. You can then perform a variety of common NLP operations on the data, such as sentiment analysis.
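
As an illustration, a Wikipedia lookup followed by sentiment scoring might look like the sketch below. The class and function names are taken from Pattern's documentation, and the library's Python 3 support has historically been uneven, so verify against the version you have installed.

# Sketch: fetching text with pattern.web and scoring it with pattern.en.
from pattern.web import Wikipedia
from pattern.en import sentiment

article = Wikipedia().search("Natural language processing")
text = article.plaintext()

# sentiment() returns a (polarity, subjectivity) pair
polarity, subjectivity = sentiment(text[:1000])
print(polarity, subjectivity)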

Pattern also exposes some of its lower-level functionality, allowing you to use NLP functions, n-gram search, vectors, and graphs directly if you wish. It also has a built-in helper library for working with popular databases (MySQL, SQLite, and MongoDB), making it easy to work with tabular data stored from previous sessions or obtained from third parties.

Polyglot

Polyglot, as the name suggests, enables natural language processing applications that process multiple languages at the same time.

The NLP capabilities in Polyglot mirror those found in other NLP libraries: tokenization, named entity recognition, part-of-speech tagging, sentiment analysis, word embeddings, and so on. For each of these operations, Polyglot provides models that work with the languages in question.

Note that Polyglot's language support varies greatly from feature to feature. For example, the tokenization system supports nearly 200 languages (mainly because it uses the Unicode text segmentation algorithm), and sentiment analysis supports 136 languages, but part-of-speech tagging only supports 16.
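
A hedged sketch of Polyglot's Text API, assuming the English embedding, NER, and part-of-speech models have already been fetched with Polyglot's downloader:

# Sketch: basic Polyglot usage. Assumes the English embedding, NER, and POS
# models were downloaded beforehand with Polyglot's downloader.
from polyglot.text import Text

text = Text("Polyglot was built at Stony Brook University in New York.")
print(text.entities)     # named entities
print(text.pos_tags)     # (word, part-of-speech) pairs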

PyNLPI

PyNLPI (pronounced "pineapple") has only a basic roster of natural language processing functions, but it also offers some truly useful data conversion and data processing features for NLP data formats.

Most of the NLP functions in PyNLPI are intended for basic tasks like tokenization or n-gram extraction, along with some statistical functions useful in NLP like Levenshtein distance between strings or Markov chains. These functions are implemented in pure Python for simplicity, so production-level performance is unlikely.
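
To give a sense of the kind of function involved, here is a generic pure-Python Levenshtein (edit) distance implementation for illustration - not PyNLPI's own code, just the sort of utility it bundles alongside its NLP helpers:

# Illustration only: a plain-Python Levenshtein (edit) distance.
def levenshtein(a, b):
    # previous[j] holds the edit distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # prints 3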

However, PyNLPI shines when it comes to working with some of the more exotic data types and formats that have emerged in the NLP space. PyNLPI can read and process the GIZA, Moses++, SoNaR, Taggerdata, and TiMBL data formats, and it devotes an entire module to working with FoLiA, the XML document format used to annotate language resources such as corpora (bodies of text used for translation or other analysis).

PyNLPI is a definite go-to whenever you are working with those data formats.

SpaCy

SpaCy, which draws on Python for convenience and Cython for speed, is billed as "industrial-strength natural language processing". Its developers claim it compares favorably to NLTK, CoreNLP, and other competitors in terms of speed, model size, and accuracy. SpaCy's main drawback is that it is relatively new, so it covers only English and a handful of other (mainly European) languages. That said, SpaCy has already reached version 2.2 as of this writing.

SpaCy includes just about every feature found in those competing frameworks: part-of-speech tagging, dependency parsing, named entity recognition, tokenization, sentence segmentation, rule-based matching operations, word vectors, and much more. SpaCy also includes optimizations for GPU operations - both to speed up computation and to keep data on the GPU to avoid copying.
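
A minimal SpaCy session, assuming the small English model has been installed first:

# Sketch: tagging, parsing, and named entity recognition with SpaCy.
# Assumes the small English model was installed beforehand with:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaCy was created by Explosion in Berlin.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # token, POS tag, dependency label

print([(ent.text, ent.label_) for ent in doc.ents])  # named entities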

SpaCy's documentation is excellent. A setup wizard generates command-line installation commands for Windows, Linux, and macOS, as well as for various Python environments (pip, conda, etc.). Language models are installed as Python packages, so they can be tracked as part of an application's dependency list.

TextBlob

TextBlob is a friendly front end to the Pattern and NLTK libraries, wrapping both in easy-to-use, high-level interfaces. With TextBlob, you spend less time wrestling with the intricacies of Pattern and NLTK and more time getting results.

TextBlob smooths the way by using native Python objects and syntax. The quickstart examples show how texts to be processed are simply treated as strings, and common NLP methods such as part-of-speech tagging are available as methods on those string objects.
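
A taste of that string-like interface (TextBlob needs its corpora fetched once with python -m textblob.download_corpora):

# Sketch: TextBlob treats the text like a slightly smarter string.
# Assumes the corpora were fetched once with: python -m textblob.download_corpora
from textblob import TextBlob

blob = TextBlob("TextBlob makes common NLP tasks feel almost effortless.")
print(blob.tags)           # part-of-speech tags
print(blob.noun_phrases)   # noun phrase extraction
print(blob.sentiment)      # (polarity, subjectivity)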

Another benefit of TextBlob is that you can lift the hood and change its functionality as you grow more confident. Many default components, such as the sentiment analysis system or the tokenizer, can be swapped out as needed. You can also create high-level objects that combine components - this sentiment analyzer, that classifier, and so on - and reuse them with minimal effort. In this way, you can quickly prototype something with TextBlob and refine it later.
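
For instance, swapping the default pattern-based sentiment analyzer for TextBlob's NaiveBayesAnalyzer is a one-line change; the Bayesian analyzer is trained on a movie-review corpus, so its output format and results differ.

# Sketch: swapping in TextBlob's NaiveBayesAnalyzer for sentiment analysis.
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob("The plot was thin, but the acting carried the film.",
                analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)   # Sentiment(classification=..., p_pos=..., p_neg=...)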