18CSE359T - NATURAL LANGUAGE PROCESSING UNIT 1 - 4M
4M:
Explain in detail the various steps in NLP
Lexical analysis is the process of dividing the text into paragraphs, sentences and words, and analyzing the structure of words
Syntactic analysis is the process of analyzing natural language sentences against the rules of a formal grammar
Semantic analysis is the process of understanding the meaning and interpretation of words, signs and sentence structure
Discourse integration is the process of interpreting a sentence in the context of the sentences that come before and after it
Pragmatic analysis is the process of reinterpreting what was said in terms of what was actually meant, using real-world knowledge
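The steps above can be sketched as a toy pipeline. All of the lookup tables and function names below are illustrative stand-ins, not a real NLP library:

```python
# Toy sketch of the first NLP stages: tokenization (lexical analysis),
# POS lookup (groundwork for syntactic analysis), and a tiny semantic
# lexicon. The tables are illustrative only.

POS_LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}
SEMANTICS = {"dog": "domestic animal", "barks": "makes a sharp sound"}

def tokenize(text):
    """Lexical analysis: split raw text into lowercase word tokens."""
    return text.lower().rstrip(".!?").split()

def tag(tokens):
    """Assign a part of speech to each token (toward syntactic analysis)."""
    return [(t, POS_LEXICON.get(t, "UNK")) for t in tokens]

def interpret(tokens):
    """Semantic analysis: map content words to meanings we know."""
    return {t: SEMANTICS[t] for t in tokens if t in SEMANTICS}

tokens = tokenize("The dog barks.")
```

A real system replaces each table with a learned model, but the staged flow is the same.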
Morphology and types with example
Morphology:
Morphology is the domain of linguistics that analyzes the internal structure of words
Terms:
Morpheme: the smallest meaning-bearing unit of a word
Stem: the core morpheme that carries the main meaning of the word
Affix: a morpheme attached to a stem (prefix or suffix)
Types:
Inflection: after adding the suffix, the word stays in the same word class
Eg: Horse - Horses
Eat - Eating
Like - Likes
Derivation: after adding the suffix, the word changes to another word class
Eg: happy (adj) - happi + ness (noun)
Nation / national/ nationalist/ nationalism
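The inflection/derivation split can be shown with a small suffix table. The lists below are a tiny illustrative sample, not an exhaustive inventory of English suffixes:

```python
# Toy suffix tables distinguishing inflection (word class preserved)
# from derivation (word class changes). Illustrative sample only.

INFLECTIONAL = {"s", "es", "ing", "ed"}       # horse -> horses, eat -> eating
DERIVATIONAL = {"ness", "al", "ist", "ism"}   # happy -> happiness, nation -> national

def morphology_type(word, suffix):
    """Classify a suffixed word form as inflection or derivation."""
    assert word.endswith(suffix)
    if suffix in INFLECTIONAL:
        return "inflection"   # same word class after suffixation
    if suffix in DERIVATIONAL:
        return "derivation"   # different word class after suffixation
    return "unknown"
```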
What is lemmatization and why is it preferred over stemming?
Lemmatization refers to the morphological analysis of words, which aims to remove inflectional endings
It helps in returning the base or dictionary form of the word, known as lemma
Eg: going/went/goes - go
Why is lemmatization preferred?
A stemming algorithm works by simply cutting the suffix from the word, which can produce non-words
Lemmatization returns the lemma, the base dictionary form of the word
Stemming is a general operation, while lemmatization is an intelligent operation in which the proper form is looked up in a dictionary
Hence lemmatization helps in forming better machine learning features
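The contrast can be made concrete with a crude suffix-chopping stemmer next to a dictionary lemmatizer. The lemma table here is a tiny illustrative sample, not a real dictionary:

```python
# Crude suffix-chopping stemmer vs dictionary-based lemmatizer.
# LEMMAS is an illustrative fragment of a dictionary.

LEMMAS = {"going": "go", "went": "go", "goes": "go", "better": "good"}

def crude_stem(word):
    """Blindly chop a common suffix (may yield non-words, misses irregulars)."""
    for suf in ("ing", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

def lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)
```

Note that the stemmer cannot relate the irregular form "went" to "go", while the dictionary lookup can, which is exactly why lemmatization gives cleaner features.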
What is POS tagging? Steps to pick the right POS
The POS tagging problem is to determine the POS tag for a particular instance of a word
POS - Parts of Speech
Methods to assign POS to words:
Rule based tagging
Learning based:
Stochastic models: HMM tagging, maximum entropy tagging
Rule learning: transformation based tagging
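A minimal sketch of the stochastic idea is the "most frequent tag" baseline: for each word, pick the tag it was seen with most often in training data. The tiny tagged corpus below is illustrative only:

```python
# "Most frequent tag" baseline, the simplest stochastic POS tagger:
# count (word, tag) pairs in training data, then tag each word with
# its most frequently observed tag. Toy corpus for illustration.
from collections import Counter, defaultdict

tagged_corpus = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
                 ("can", "VERB"), ("rust", "VERB")]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def pos_tag(word):
    """Return the most frequently observed tag, or UNK for unseen words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "UNK"
```

HMM taggers improve on this by also modeling tag-to-tag transition probabilities rather than looking at each word in isolation.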
Various types of language models
Broadly two types:
Statistical language models: these models use traditional statistical methods and certain linguistic rules to learn the probability distribution of words
Neural language models: these models use neural networks to learn the probability distribution of words
Pure statistical models:
N-grams
Exponential models
Skip-gram models
Neural models:
Recurrent neural networks
Transformer-based models
Large language models
I prefer bigrams over 7 grams. Justify
N-grams:
They are basically sets of co-occurring words within a given window; the window moves one word forward each time an n-gram is computed
Applications:
Machine translation
Speech recognition
Spell correction
Bigrams:
If a model considers only the previous word to predict the current word, it is called a bigram model
7-grams:
If the model considers the previous six words to predict the current word, it is called a 7-gram model
Why are bigrams better?
7-grams have drawbacks such as sparsity (most 7-word sequences never occur in the training data), high computational cost and diminishing returns
Bigrams, on the other hand, have reduced complexity and still capture local context
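The sparsity argument shows up even in a tiny corpus. A minimal sketch, counting n-grams over a made-up sentence:

```python
# Count n-grams in a toy corpus: bigram counts accumulate (the table
# stays dense), while almost every 7-gram occurs exactly once (sparse).
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n forward one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigrams = Counter(ngrams(tokens, 2))
sevengrams = Counter(ngrams(tokens, 7))
```

Here "the cat" is already seen twice, but every 7-gram is unique, so a 7-gram model would have no repeated evidence to estimate probabilities from.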
Multiword Expression
MWE:
Crosses word boundaries
Is lexically, semantically, syntactically, pragmatically and/or statistically idiosyncratic
Eg: traffic signal, green card, kick the bucket, spill the beans
A large fraction of the entries in English lexical resources are MWEs (about 41% of WordNet entries)
Conventional grammars and parsers often fail to handle them
Idiosyncrasies:
Statistical: traffic signal, good morning (semantically decomposable)
Lexical: ad hoc, ad hominem (borrowed from other languages)
Syntactic: wine and dine (exhibits peculiar syntactic behavior)
Semantic: spill the beans - reveal secret (figurative/metaphorical usage)
Basic tasks in MWE:
Extract collocations
Establish linguistic validity of collocation
Measure semantic decompositionality of the MWE
Various approaches to find collocation
A collocation is an expression consisting of two or more words that corresponds to a conventional way of saying things
Types:
Grammatical
Semantic
Flexible
Approaches:
Selection of collocations by frequency
Selection based on mean and variance of the distance between focal word and collocating word
Hypothesis testing
Mutual information
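The mutual information approach can be sketched directly from its definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))): a pair that co-occurs far more often than chance gets a high score. The counts below are invented for illustration:

```python
# Pointwise mutual information (PMI), a standard collocation measure.
# The counts passed in below are an illustrative toy example.
import math

def pmi(count_xy, count_x, count_y, n_pairs, n_words):
    """PMI of a word pair, estimated from raw corpus counts."""
    p_xy = count_xy / n_pairs       # joint probability of the pair
    p_x = count_x / n_words         # marginal probability of word x
    p_y = count_y / n_words         # marginal probability of word y
    return math.log2(p_xy / (p_x * p_y))

# A pair like "traffic signal" that co-occurs much more often than
# chance predicts gets a strongly positive PMI.
score = pmi(count_xy=8, count_x=10, count_y=10, n_pairs=1000, n_words=1000)
```

Pairs of frequent words that co-occur less than chance would predict score below zero, so PMI ranks candidate collocations above incidental co-occurrences.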
What is stemming? Which stemmer can be used for preprocessing?
Stemming:
It is a method of normalizing words in NLP
It is a technique in which the words in a sentence are reduced to their root form (stem) to shorten lookup
Eg: the root word is ‘eat’ and its variations are ‘eats’, ‘eating’, ‘eaten’ and so on
Porter Stemmer:
A simple algorithm that doesn’t require a large on-line lexicon
Used in information retrieval
Removal of inflectional ending from the words
Eg: relational -> relate, motoring -> motor, grasses -> grass
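A few Porter-style suffix rules are enough to reproduce the examples above. This is a simplified sketch, not the full Porter stemmer, which applies several ordered phases of rules with measure conditions:

```python
# Simplified Porter-style suffix rewriting: apply the first matching
# (suffix, replacement) rule. Rule order matters ("sses" before "s").
# Sketch only; the real Porter algorithm has multiple rule phases.

RULES = [("ational", "ate"), ("ing", ""), ("sses", "ss"), ("s", "")]

def porter_like_stem(word):
    """Apply the first suffix rule that matches the word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```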
Justify King - Male + Female = Queen
Word vectors:
Word vectors represent words as multidimensional continuous floating-point numbers, where semantically similar words are mapped to nearby points in a geometric space
This means that words such as wheel and engine should have similar word vectors to the word vector of car whereas the word banana should be quite distant
The beauty of representing words as vectors is that they lend themselves to mathematical operations
For example, we can add and subtract vectors
In this canonical example, by using word vectors we can determine that King - Man + Woman = Queen
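The arithmetic can be demonstrated with hand-made 2-dimensional vectors, one axis loosely encoding "royalty" and one "gender". Real embeddings learn hundreds of dimensions from data; these numbers are purely illustrative:

```python
# Toy 2-d word vectors (hand-crafted for illustration):
# axis 0 ~ royalty, axis 1 ~ maleness.

vectors = {
    "king":  [0.9, 0.9],   # royal, male
    "queen": [0.9, 0.1],   # royal, female
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],   # unrelated distractor word
}

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    def dist(word):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[word]))
    return min((w for w in vectors if w not in (a, b, c)), key=dist)
```

Here king - man + woman lands exactly on the queen vector: subtracting man removes the "male" component and adding woman restores the "female" one, while the "royal" component is untouched.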