Definition
Natural Language Processing (NLP) is the field of computer science that studies and develops algorithms and models enabling computers to process, understand, and generate natural human language (text and speech). NLP combines computational linguistics with machine learning to solve practical problems that require semantic understanding.
The discipline is interdisciplinary: linguistics, computer science, cognitive psychology, and statistics converge on how to formalize language and teach systems to handle it.
Fundamental Components
Morphology and Syntax: analysis of linguistic structure.
- Tokenization: segmentation of text into tokens (words, subwords)
- Part-of-speech tagging: identification of nouns, verbs, adjectives
- Parsing: extraction of syntactic structures (dependency trees)
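The first of these steps can be illustrated with a minimal regex-based word tokenizer (a toy sketch; production systems use trained subword tokenizers such as BPE):

```python
import re

def tokenize(text: str) -> list[str]:
    # Runs of word characters form one token; each punctuation mark
    # becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Computers don't understand text directly."))
# ['Computers', 'don', "'", 't', 'understand', 'text', 'directly', '.']
```

Even a contraction like "don't" already forces a design decision: this naive rule splits it into three tokens, whereas subword tokenizers learn such splits from data.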
Semantics: meaning.
- Word sense disambiguation: which meaning of a word is intended?
- Relationship extraction: which entities are correlated and how?
- Semantic role labeling: who does what to whom?
Pragmatics and Discourse: context and intention.
- Coreference resolution: which pronouns refer to which entities?
- Sentiment analysis: what is the emotional tone of the text?
- Entailment: does one sentence logically imply another?
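As a concrete illustration of why pragmatics is hard, consider a minimal lexicon-based sentiment scorer (the word lists are invented for this sketch):

```python
# Invented toy lexicons; real systems use learned representations.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def sentiment(text: str) -> str:
    # Count positive vs. negative words and compare.
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))  # positive
```

Sarcasm ("great, just what I needed") inverts the lexical signal entirely, which is exactly the kind of context-dependence that pragmatics studies.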
Main NLP Tasks
Understanding (Discriminative):
- Classification: sentiment, spam detection, topic categorization
- Named Entity Recognition (NER): identification of people, places, organizations
- Relation extraction: extraction of relations between entities
- Question Answering: answering questions about text
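A toy rule-based NER shows both the idea and why hand-written rules are fragile (the pattern below is illustrative only):

```python
import re

def naive_ner(text: str) -> list[str]:
    # Toy rule: any run of capitalized words is an "entity".
    # Fails on sentence-initial words, lowercase brands, all-caps acronyms, etc.
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

print(naive_ner("he met Ada Lovelace in New York"))
# ['Ada Lovelace', 'New York']
```

Statistical and neural taggers replace this brittle pattern with features or representations learned from annotated corpora.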
Generation (Generative):
- Machine translation: text from one language to another
- Summarization: synthesis of long documents
- Text generation: creation of coherent text (articles, creative writing)
- Dialogue: conversational systems
Structured Prediction:
- Tagging: assigning labels to sequences (POS tagging, NER, chunking)
- Parsing: extraction of structures (syntax trees, dependency graphs)
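Structured prediction can be made concrete with the classic most-frequent-tag baseline for POS tagging, sketched here on an invented three-sentence corpus:

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN")],  # "runs" is ambiguous
]

counts = defaultdict(Counter)
for sentence in train:
    for word, tag_ in sentence:
        counts[word][tag_] += 1

def tag(words: list[str]) -> list[str]:
    # Assign each word its most frequent training tag; unknown words -> NOUN.
    return [counts[w].most_common(1)[0][0] if w in counts else "NOUN"
            for w in words]

print(tag(["the", "dog", "runs"]))  # ['DET', 'NOUN', 'VERB']
```

Despite ignoring context entirely, this baseline is a surprisingly strong starting point for POS tagging; sequence models (HMMs, CRFs, neural taggers) improve on it by modeling neighboring tags.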
Methodological Evolution
Era 1: Rule-based (1950s-1980s): hand-written rule systems. Fragile, limited to restricted domains.
Era 2: Statistical NLP (1990s-2010s): probabilistic models (HMM, CRF, SVM). Manual feature engineering, but better generalization.
Era 3: Neural NLP (2010s): recurrent (LSTM, GRU) and convolutional neural networks. Automatic feature learning. Breakthroughs with sequence-to-sequence models.
Era 4: Pre-trained models / Transformers (2018+): BERT, GPT, T5. Pre-training on web-scale data. Dominant paradigm today.
Benchmarks and Evaluation
GLUE (General Language Understanding Evaluation): 9 tasks. Top models now exceed the human baseline, and the benchmark is widely considered “solved” by LLMs.
SuperGLUE: a deliberately harder successor to GLUE. It resisted longer, though top models have since matched or exceeded the human baseline here as well.
SQuAD (Stanford Question Answering Dataset): machine reading comprehension. Recent models score above 90 F1 (the standard metric, alongside exact match).
MTEB (Massive Text Embedding Benchmark): 56 tasks of retrieval, clustering, classification. Comprehensive benchmark for embedding models.
WMT (Workshop on Machine Translation): benchmark for translation. BLEU is the standard metric (its correlation with human quality judgments is imperfect).
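BLEU's core ingredient is modified n-gram precision (candidate n-gram counts clipped by reference counts); the full score combines n = 1..4 with a brevity penalty. A minimal sketch of the precision term:

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int) -> float:
    # Modified precision: clip each candidate n-gram count by its reference count.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print(ngram_precision(cand, ref, 1))  # 0.833... (5 of 6 unigrams match after clipping)
```

Clipping is what prevents a degenerate candidate like "the the the the" from scoring perfectly on unigrams.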
Use Cases
Chatbots and Assistants: conversational agents, FAQ answering, customer support automation.
Content analysis: analysis of customer feedback, social media monitoring, content moderation.
Information extraction: structured extraction from unstructured documents (contracts, articles).
Search and Ranking: semantic search (vs. keyword matching), ranking results by relevance.
Machine translation: automatic translation between languages.
Document classification: automatic document categorization.
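Semantic search typically embeds queries and documents as vectors and ranks by cosine similarity. A hand-rolled sketch with invented 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a trained model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented toy "embeddings" standing in for a real embedding model's output.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I get my money back"
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # refund policy
```

Unlike keyword matching, the query shares no words with "refund policy"; the ranking comes entirely from vector proximity.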
Practical Considerations
Data requirements: modern NLP requires abundant data (millions of examples for specific tasks). Transfer learning (fine-tuning pre-trained models) mitigates this for some tasks.
Language diversity: models are often trained predominantly on English. Coverage of other languages (e.g. Italian, and especially low-resource languages) remains challenging. Multilingual models (mBERT, XLM-RoBERTa) typically underperform strong monolingual models on a per-language basis.
Pragmatic ambiguity: sarcasm, idioms, and referential ambiguity remain difficult. Humans disambiguate using world knowledge and situational context; models often lack this grounding.
Interpretability: LLMs are black boxes. Understanding why a model makes a prediction is hard. Research in explainability is active (attention weights, SHAP, LIME).
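One simple, model-agnostic explainability technique is occlusion: remove each token and measure how the prediction score changes. A sketch with an invented stand-in scoring function:

```python
# score() is a stand-in for a real model's output (invented for this sketch).
def score(words: list[str]) -> int:
    return sum(w in {"great", "love"} for w in words)

def explain(words: list[str]) -> dict[str, int]:
    # Occlusion: each token's attribution = score drop when it is removed.
    base = score(words)
    return {w: base - score([x for x in words if x != w]) for w in words}

print(explain("i love this great phone".split()))
# {'i': 0, 'love': 1, 'this': 0, 'great': 1, 'phone': 0}
```

Perturbation-based methods such as LIME generalize this idea by fitting a local surrogate model over many such perturbations.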
Common Misconceptions
“Modern NLP is ‘solved’ by LLMs”
Partially true. Stylized benchmarks (such as GLUE) have reached human-level accuracy, but realistic settings (domain shift, adversarial examples, linguistically rich phenomena) remain difficult. Zero-shot generalization has improved but is imperfect.
“NLP models ‘understand’ language”
No. They operate on statistical representations. They have no inner world, consciousness, or understanding in the cognitive sense. They produce plausible output without awareness.
“Once trained, the NLP model solves any linguistic task”
No. Transfer learning mitigates data scarcity, but specialization still matters: a model fine-tuned on news writes in journalistic style and may underperform on legal text.
Related Terms
- LLM: the dominant modern instance of NLP, generative and trained at web scale
- Transformer: dominant architecture in contemporary NLP
- Embeddings: vector representations of text
- Tokenization: fundamental NLP preprocessing
- RAG: retrieval pattern extending NLP capabilities with external knowledge
Sources
- Jurafsky, D. & Martin, J.H. (2024). Speech and Language Processing (3rd edition draft). Stanford University (standard textbook)
- Lewis-Kraus, G. (2016). The Great AI Awakening. The New York Times Magazine
- Papers with Code - NLP Benchmarks: aggregation of benchmarks and state-of-the-art results
- EMNLP: primary conference for NLP research