New pre-trained language models for Spark NLP

Abstract:

State-of-the-art NLP capabilities are available today to the open source community for English, for Mandarin Chinese, and to some extent for the major European languages (German, French, Italian, Spanish, Portuguese). Spark NLP currently has pre-trained models for English, Italian, French, and German. The goal of this project, which can be divided or scoped based on the priorities of the funders and contributors, is to create pre-trained NLP models and pipelines for the remaining most widely spoken languages in the world: Chinese (Mandarin & Cantonese), Hindi, Urdu, Arabic, Malay, Russian, Portuguese, Bengali, Punjabi, Telugu, and Javanese. Contributors who are native speakers of each language are highly preferred, so that language-specific features and dialects (e.g. of Arabic, Mandarin, Portuguese) can be accounted for and, where warranted, reflected in separate models.

Timeline:

With one or two engineers working on the project, the timeline would be either 6 or 12 months, depending on what gets funded.

Technical Approach:

For each language, the basic required functionality is a tokenizer, stemmer, lemmatizer, part-of-speech tagger, spell correction model, and named entity recognition model. The deliverables should include any required language-specific code, pre-trained word embeddings, pre-trained models, and pre-trained NLP pipelines. Examples and documentation are required as well.
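
The per-language requirements above amount to a language-by-component checklist. The following is a minimal sketch of how such a checklist could be tracked, not part of Spark NLP itself: the names `REQUIRED_COMPONENTS`, `TARGET_LANGUAGES`, `LanguageStatus`, and `work_plan` are hypothetical and chosen only for illustration, and the component identifiers are shorthand for the items listed in this proposal.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the six components this proposal requires per language.
REQUIRED_COMPONENTS = (
    "tokenizer",
    "stemmer",
    "lemmatizer",
    "pos_tagger",
    "spell_checker",
    "ner",
)

# Target languages as listed in the abstract (Chinese split into its two named varieties).
TARGET_LANGUAGES = (
    "Mandarin", "Cantonese", "Hindi", "Urdu", "Arabic", "Malay",
    "Russian", "Portuguese", "Bengali", "Punjabi", "Telugu", "Javanese",
)


@dataclass
class LanguageStatus:
    """Tracks which required components have been delivered for one language."""
    language: str
    done: set = field(default_factory=set)

    def missing(self):
        # Preserve the order of REQUIRED_COMPONENTS so reports are stable.
        return [c for c in REQUIRED_COMPONENTS if c not in self.done]

    def is_complete(self):
        return not self.missing()


def work_plan(statuses):
    """Map every target language to the components still outstanding.

    Languages with no recorded status are treated as not started.
    """
    by_language = {s.language: s for s in statuses}
    return {
        lang: by_language.get(lang, LanguageStatus(lang)).missing()
        for lang in TARGET_LANGUAGES
    }


if __name__ == "__main__":
    # Illustrative input only: no real project status is implied.
    plan = work_plan([LanguageStatus("Hindi", {"tokenizer", "stemmer"})])
    for language, outstanding in plan.items():
        print(f"{language}: {', '.join(outstanding) or 'complete'}")
```

A checklist of this shape makes it straightforward to scope the work by language or by component, which matches the proposal's note that the project can be divided according to funder and contributor priorities.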