Week 4: Natural Language Processing > 4a At the Intersection of Language and Data Science > 4a Video
- I’m a Professor in the Computer Science Department, and I’m also the Director of the Data Science Institute.
- You will undoubtedly have noticed that a lot of the information out there is language, and furthermore, that it’s in text form.
- We’re going to be taking the data science point of view in looking at these problems.
- So this falls into the general area of media, where, within data science, we look at how we can develop approaches and tools to enhance communication and interaction within communities.
- We’ll be looking at a variety of different kinds of media in order to do this, taking information from news and social media.
- So we’ll take data, and that data is often labeled with answers.
- We’re going to be extracting features from the text data, and this is where natural language processing comes into play for providing the technology to do this, to extract features.
- In the work that we’re talking about during these classes, we’re interested in what data is available for learning and what features yield good predictions.
- So which of the features that we use have an impact on the results? And which features, if used in combination, can yield higher accuracy on the task?
- So we’re dealing with a variety of different kinds of media.
- So online discussion and personal narrative contain more information about individuals, and they often contain more subjective information such as opinions or emotions.
- Now if you and I were to look at the language of these different genres, we would have no trouble telling what was what.
- That’s a harder problem for the computer, and we need to think about different approaches for our systems when we are dealing with different kinds of language.
- We’ve got the kinds of abbreviations and language that give things away.
- In the next example, “Hurricane Sandy churned about 290 miles off the mid-Atlantic coast Sunday night, with the National Hurricane Center reporting that the monster storm was expected to come ashore with near-hurricane-force winds and potentially ‘life-threatening’ storm surge flooding.”
- A lot of the software that we’re developing right now is to be able to look at the problem of crisis informatics, developing tools that can handle different kinds of texts about storms.
Week 4: Natural Language Processing > 4b NLP: News > 4b Video
- I think, at this point in the series, you probably have heard about how we used large amounts of annotated news to train systems to detect things like the part of speech for words, so whether a particular word is a verb or a noun.
- If we look more closely, we can see one of the sentences- the first sentence from the summary.
- This leads many summarization researchers to use an approach that’s called sentence selection.
- What they do is they read the input articles, they use different kinds of features associated with the different sentences- for example, how frequent the words are in each of the sentences- to select salient sentences and to string these sentences together to form the summary.
- So we may select a sentence with a pronoun and the reader may not know what the pronoun refers to.
- Or we may put two sentences side by side that weren’t meant to go side by side and it can read incoherently.
- So another approach also looks at doing sentence selection but then editing the sentences so that they are more appropriate for use in the summary.
- So the first sentence may be selected, “the US Geological Survey said a strong earthquake hit the coast,” but we’re missing information.
- What coast? Where did it happen? And so we would edit the reference in this sentence.
- We also look at transforming text by making it shorter, removing information from the sentence that is less relevant.
- Sometimes we take phrases from two different sentences that are important but don’t by themselves form a whole sentence, and we fuse them together to make a new generated sentence for the summary.
- So for the case of making sentences shorter, this is called “text compression.”
- So this is a sentence that we extracted during the time of Hurricane Sandy, and we can see that it would be nice to have this pop up as an alert- particularly if we were close to 57th Street- telling us about the crane collapse that had just happened.
- It’s a relatively small set, contains about 3,000 sentence pairs.
- Each element in the set is a pair of an input sentence, which is relatively long, and the output for that sentence, which is a compression or a shortened sentence from the input.
- As output we have a single sentence also with a parse attached to it.
- We do joint inference over the choice of words that should appear in the output, over the ordering of n-grams- that is, sequences of multiple words- in the output, and over the choice of what dependency structure remains.
- So the first feature that we’re talking about is word choice, and we’re wondering which words would signal importance, and thus we’d need to retain them.
- And then we also have words that are maybe not as common, but still fairly common, like “flying towards Europe.”
- There are also words in this sentence that are very unusual, very surprising.
- These are all unusual words that you wouldn’t see in your day-to-day life.
- In addition to words, we also would want to take into account word sequences.
- You could be scrambling an egg, you could be late to class and scrambling to get there on time, but if you have “air force fighters scrambled,” all four words together, that’s pretty significant.
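The intuition above- that rare words and rare word sequences signal importance- can be approximated by measuring how surprising each word and n-gram is relative to a background corpus. The sketch below is purely illustrative (the smoothing scheme and toy corpus are assumptions, not the model described in the lecture).

```python
import math
from collections import Counter

def surprisal_features(sentence_tokens, background_sentences):
    """Score how surprising each word and each bigram in the sentence is,
    relative to a background corpus; rarer items get higher scores and are
    better candidates to retain when compressing."""
    unigrams, bigrams = Counter(), Counter()
    for sent in background_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    uni_total = sum(unigrams.values()) + len(unigrams) + 1   # add-one smoothing
    bi_total = sum(bigrams.values()) + len(bigrams) + 1

    word_scores = {w: -math.log((unigrams[w] + 1) / uni_total)
                   for w in sentence_tokens}
    bigram_scores = {bg: -math.log((bigrams[bg] + 1) / bi_total)
                     for bg in zip(sentence_tokens, sentence_tokens[1:])}
    return word_scores, bigram_scores

# "scrambled" should look far more surprising than "the" against this tiny corpus.
background = [["the", "plane", "landed", "at", "the", "airport"],
              ["the", "storm", "hit", "the", "coast", "overnight"]]
words, pairs = surprisal_features(["the", "air", "force", "fighters", "scrambled"],
                                  background)
print(words["scrambled"], words["the"])
```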
- That tells us what role phrases are playing in a sentence so that we can better understand their relative importance to each other.
- So in this input sentence we have two verbs, we have “intercept” and we also have “imposed,” and both of them have as their object something to do with Libya, whether it be the country Libya or a Libyan airliner.
- So we might think that, since both of the important verbs in the sentence target Libyan something, these phrases that include the verb and the target Libya should be included in our summary.
- So this kind of thinking is important: when we’re given a problem like text compression, with some input and output, we need to think about what features would play a role and what the intuition is behind why they would play a role.
- For compression we’re going to have as input a single sentence, and as output we’re going to have a sentence with only the salient information remaining.
- We drop all the words that aren’t boxed and we end up with the sentence, “production was closed at Ford for Christmas.”
- We use exactly the same model to do sentence fusion.
- As output we’re going to have a single sentence in which common information- information that is signaled by both sentences- should occur.
- So we’re no longer going to use pairs of sentences, one long and one short.
- For this task, fusion, we use a dataset created from summarization evaluations, where we have a summary created by people and the input sentences from which the person created the summary.
- So we have this sort of model where we have multiple sentences as input and a fused sentence as output.
- For example, we see “17 percent” in the top sentence and we see “17 percent” in the bottom sentence, and so we retain it.
- Previously, people had used individual words and n-grams for this task.
- So if we added in dependency relations, could we get an improvement in how well the system performed? And if we looked at the chart over here, the top line is the performance of the system with dependency relations and the two bottom lines show two previous systems without dependency relations, one sequential and one joint.
- We can see that we do get an improvement in how well we can reproduce the sentences that people produce when they do the task.
- We’re taking sentences that have been written by people, and we’re modifying them.
- When it was first picked up by the press, we had a very disastrous sentence appear where, on the death of the Queen Mother in England, Newsblaster had Queen Elizabeth attending her own funeral.
Week 4: Natural Language Processing > 4c NLP: Online Discussion Forums > 4c Video
- OK so let’s move on now to the next type of data that we looked at.
- We’re interested in whether we can use data that’s drawn from different kinds of social media in order to answer questions.
- In this particular example, we see that there is no word overlap between the words of the query and the words in the answer.
- So the question used the words effect, Hurricane Sandy, and New York City.
- So here for data, we used a number of different kinds of data.
- First, we gathered manually annotated data through crowdsourcing.
- Then, for all of the documents that were returned, we took each sentence in each document and paired it with the query.
- We gave each query-sentence pair to Turkers on Mechanical Turk and asked them to tell us, yes or no, whether the sentence was relevant to the query.
- It ran for three weeks and we still barely had enough data.
- So we might have to go through hundreds of sentences to get them marked before we got even one that is relevant.
- Then we realized gee, we could augment this with data from Newsblaster.
- Then we have the summary that follows and actually many of those sentences would be relevant to the query.
- Then we may get some sentences in our summary- we’re not doing it by hand- that are relevant.
- So this gives us, for a given question, sentences that are relevant.
- Given the fact that we have large, noisy data, we decided to use the semi-supervised learning approach.
- We start with several simple classifiers that are trained on the seed data, which we know is good data.
- It’s not enough data to produce a good system.
- We’re going to use an approach where we take several different classifiers.
- Or we look at the overlap between the named entities in the relevant or irrelevant sentence and the query.
- Or we look at how semantically related the words are between the answer and the query.
- This is in the form of pairs, where we have a query, like Hurricane Sandy, and a sentence, like the one that we have been talking about all along, Hurricane Sandy churned and so forth.
- So let’s take the same query, Sandy, and a sentence.
- Now we have our classifiers. The first classifier that we’re going to look at is the keyword classifier.
- It would use, for example, the number of keywords that overlap between the sentence and the query.
- So this produces a keyword classifier which we then run on a new data set, our larger, noisy data set.
- So for Hurricane Sandy we might have another sentence that came from our summary that might be, Sandy made landfall in New York at midnight.
- This, our classifier would label as a positive one.
- OK. Now these feed into our second classifier, which Jessica’s going to tell us about.
- The benefit of co-training is that we can get a lot more labeled data than we otherwise would be able to.
- Then we feed them into the keyword classifier, which produces some more examples of plus and minus.
- Then we get another classifier, say the semantic classifier, and it will take both of these and also the output of the keyword classifier and then it will in turn try to label this noisy data set again.
- Where the query is still Sandy, but because this is the semantic classifier, it’s going to look for synonyms in addition to just straight-up keyword match.
- So we might get a sentence like, the storm churned, instead of Sandy churned.
- This classifier will be able to capture that and label it as a positive example.
- Now these new examples will feed back into the keyword classifier.
- These two classifiers will take turns until they can classify the whole dataset.
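The co-training loop just described can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the two "views", the toy synonym table, and the confidence-based selection are stand-ins for the actual keyword and semantic classifiers, and the seed data is assumed to contain both labels.

```python
from sklearn.linear_model import LogisticRegression

# Toy synonym table standing in for a real lexical resource (e.g. WordNet).
SYNONYMS = {"sandy": {"storm", "hurricane"}}

def keyword_view(query, sentence):
    """View 1: raw word overlap between the query and the sentence."""
    q, s = set(query.lower().split()), set(sentence.lower().split())
    return [len(q & s)]

def semantic_view(query, sentence):
    """View 2: overlap after expanding the query words with synonyms."""
    q = set(query.lower().split())
    q |= {syn for w in q for syn in SYNONYMS.get(w, set())}
    return [len(q & set(sentence.lower().split()))]

def co_train(seed, unlabeled, rounds=3, k=5):
    """seed: (query, sentence, label) triples; unlabeled: (query, sentence) pairs.
    Each view is trained on the shared labeled set, labels the pool, and its
    most confident predictions are added back for the other view to learn from."""
    labeled, pool = list(seed), list(unlabeled)
    views = [(LogisticRegression(), keyword_view),
             (LogisticRegression(), semantic_view)]
    for _ in range(rounds):
        for clf, view in views:
            clf.fit([view(q, s) for q, s, _ in labeled],
                    [lab for _, _, lab in labeled])
            if not pool:
                continue
            probs = clf.predict_proba([view(q, s) for q, s in pool])
            confident = sorted(range(len(pool)), key=lambda i: probs[i].max(),
                               reverse=True)[:k]
            for i in sorted(confident, reverse=True):   # pop from the end first
                q, s = pool.pop(i)
                labeled.append((q, s, clf.classes_[probs[i].argmax()]))
    return [clf for clf, _ in views]
```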
Week 4: Natural Language Processing > 4d NLP: News and Online Discussion Forums > 4d Video
- OK, let’s move on to another example where we use both news and online discussion forums.
- As I’ve indicated before, we were motivated in this by Hurricane Sandy, as we all sat in New York and wondered exactly what was going to happen.
- Because each time the editors made an addition, this could be a new fact that was added about the event at that point in time.
- If we look at it more closely, we can see that some of the sentences are relevant.
- It’s not something that we would even want to consider whether it’s relevant or irrelevant, and we would want to remove it.
- Where the first sentence tells us that a strong earthquake has hit, where it hit, and sort of what has happened.
- The second sentence gives us information that is typically provided about an earthquake, the magnitude of the quake, and exactly where it was centered.
- We’re going to work in time slices, where each time slice is an hour.
- At each time we want to predict- of all the input sentences that we have in that hour, we want to predict which ones are salient.
- Because we’re particularly interested in disaster, we’re going to use some features that are disaster specific, and we think could be important here.
- We want to remove sentences that are redundant with what was reported in earlier time slices.
- So we then cluster and select example sentences from each cluster for time T. So I’m going to talk about predicting salience.
- These include things like sentence length, punctuation count, number of capitalized words, and the number of synonyms, hypernyms, or hyponyms shared with the event type.
- So if you have a sentence with a lot of those, you would have all the information you might like.
- So Jessica thought that capitalized words are important because they tell us about places and people’s names and so forth.
- The number of capitalized words can help- so we do see places like Nicaragua and Mexico City being highlighted.
- If we have a lot of capitalized words, this can actually indicate we have a low salience sentence because we have a lot of these spurious words added.
- What about synonyms, hypernyms, and hyponyms? Jessica, what do you think? I think these would be important for the same reason we saw a little bit earlier with the co-training.
- If you have a sentence that just says, the storm did this.
- Whether it’s a disaster, as opposed to a storm, as opposed to Hurricane Sandy, you want to capture all levels of granularity.
- That is a feature that gives us a clue that the sentence is important.
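A minimal sketch of what features like these could look like in code, assuming NLTK with the WordNet corpus installed; the feature names and the WordNet expansion are illustrative choices, not the system's exact definitions.

```python
import string
from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet data installed

def related_terms(event_type):
    """Synonyms, hypernyms, and hyponyms of the event-type word (e.g. 'earthquake')."""
    terms = set()
    for syn in wn.synsets(event_type):
        for related in [syn] + syn.hypernyms() + syn.hyponyms():
            terms.update(name.lower().replace("_", " ")
                         for name in related.lemma_names())
    return terms

def salience_features(sentence, event_type):
    """Surface features of the kind mentioned above for one candidate sentence."""
    tokens = sentence.split()
    related = related_terms(event_type)
    return {
        "length": len(tokens),
        "punctuation_count": sum(ch in string.punctuation for ch in sentence),
        "capitalized_words": sum(t[0].isupper() for t in tokens if t),
        "event_type_terms": sum(t.lower().strip(string.punctuation) in related
                                for t in tokens),
    }

print(salience_features(
    "A strong earthquake struck off the Pacific coast of Guatemala on Wednesday.",
    "earthquake"))
```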
- So we build a language model for earthquakes, one for hurricanes, and so forth, each built from related Wikipedia articles for a particular kind of disaster.
- With a domain-specific language model, you’d get sentences that are more likely to be about the particular type of disaster you’re looking for.
- So we see with a generic news corpus, that it would give a high score to our medium and high salient sentences, and a low score to our low salience one, because it’s not very grammatical.
- Whereas with a domain-specific language model, we’d get a high score for the high salience sentence, because it’s the kind of sentence that is typical of disasters.
- So our domain-specific language model tells us a sentence that is typical for a certain kind of disaster.
- Our geographic features would tell us this is a sentence that is good for this specific disaster that we are now looking at.
- We get coordinates for the locations that are referred to, and a mean distance from those locations that are mentioned to the current event.
- We measure the mean distance to the Guatemalan earthquake, and we see that Nicaragua is closer, so it gets ranked higher.
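This geographic feature amounts to averaging great-circle distances from the mentioned locations to the event. The sketch below uses approximate, illustrative coordinates; the geocoding step that produces them is assumed.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def mean_distance_to_event(mentioned_coords, event_coords):
    """Mean distance from the locations mentioned in a sentence to the event;
    smaller values suggest the sentence is about this specific disaster."""
    if not mentioned_coords:
        return float("inf")
    return (sum(haversine_km(c, event_coords) for c in mentioned_coords)
            / len(mentioned_coords))

# Approximate, illustrative coordinates: the Guatemala quake vs. two mentions.
guatemala_quake = (13.9, -91.9)
print(mean_distance_to_event([(12.1, -86.3)], guatemala_quake))  # Nicaragua: closer
print(mean_distance_to_event([(19.4, -99.1)], guatemala_quake))  # Mexico City: farther
```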
- So let’s say for example in Hurricane Sandy, where we were monitoring things like flooding, and fires, in Breezy Point, we suddenly had a crane dangling in midtown Manhattan.
- Then all of a sudden it appeared multiple times, so we have a burst of words related to the crane.
- Our ultimate goal in all of this is to be able to identify when a new event hits, and then to be able to identify subsequent related events that occur afterwards, like the Manhattan blackout, the Breezy Point fire, or the public transit outage.
- We’ve gotten more accurate sentences when we use salience.
- So as in the question that I showed first, there are certain things that only people on the ground can see, and we’d like to capture those.
- So we’re looking at correlating news reports on disasters with reactions expressed in social media across events, across time so that we can study what language in news causes a reaction in social media that indicates people are prepared or are preparing for the disaster.
Week 4: Natural Language Processing > 4e NLP: Personal Narrative > 4e Video
- I’ve used here as the example Malcolm X. So The Autobiography of Malcolm X illustrates the kind of story we want to capture.
- It’s a very compelling story of life and death.
- It’s the kind of story that we hope to capture.
- So how is personal narrative different? Well, it’s a coherent telling of the story from beginning to end.
- “It was incredibly windy, but the rain hadn’t been that bad.” We then have a part of the story which is the sequence of complicating actions.
- “By 10:00 PM, the skies lit up in the purple and blue brilliance, and the power started to go out here and there.”
- And now, we have, finally, what we’re calling, after a linguist named Labov, the reportable event.
- So we’re working on developing a system that can identify the reportable event in the story.
- If we do, our description of the reportable event could serve as a summary for what is the story about.
- What kind of data do we have that we can use? And we didn’t immediately have a manually annotated data set to use.
- The person who is doing the work here- and we’ll talk about it in a few minutes- thought of using data from Reddit.
- We used the Ask Reddit subreddit, where we have posts like “What’s your creepiest real-life story?” And from posts like this, we gathered, we harvested, 3,000 stories.
- So we used distant supervision to automatically label the remainder of the stories.
- Often Reddit stories have a tl;dr associated with them.
- We used semantic distance between tl;dr and the sentences of the Reddit story to determine the sentences that were closest to that tl;dr.
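A minimal sketch of this distant supervision step. A bag-of-words cosine similarity stands in here for the actual semantic distance measure, which is not specified in the lecture; the toy story and tl;dr are illustrative.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def label_by_tldr(story_sentences, tldr):
    """Distant supervision: the sentence closest to the tl;dr gets label 1
    (most reportable event), everything else gets label 0."""
    tldr_tokens = tldr.lower().split()
    sims = [cosine(s.lower().split(), tldr_tokens) for s in story_sentences]
    best = max(range(len(sims)), key=sims.__getitem__)
    return [(s, int(i == best)) for i, s in enumerate(story_sentences)]

story = ["I went upstairs to get a tool.",
         "Ocean waves broke the steel door lock and flooded the basement.",
         "We spent a month cleaning up."]
print(label_by_tldr(story, "tl;dr: waves flooded our basement"))
```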
- So we assumed the comments would most likely be about the climax of the story.
- We thought about what features would be important, and based it on work from linguistic theory.
- One linguist, at Penn, notes that stories are typically about change.
- Polanyi, who has done work on computational narrative, notes that the turning point in the story is marked by change in formality in style or in emphasis.
- So we looked at a number of different features that could represent change over the course of the story.
- So a word like happy or beautiful would have a high pleasantness score, whereas a word like difficult would have a low pleasantness score.
- We can see, if we sum the scores of the words in the different sentences, that at the most reportable event (MRE), which is shown in red and is this sentence here, we have a peak in activeness- the most active words in the story- and a minimum in pleasantness.
- So these are clues to us that we’re at the MRE of the story.
- So of the 3,000 stories, about 500 were read by humans and annotated with which sentences conveyed the most reportable event.
- So in distant supervision, we have the remaining 2,500 stories, which are labeled automatically using the heuristics that she described.
- Self-training can also be called bootstrapping, where you take your seed set, you train your classifier on it, and then you use it to label everything that remains.
- You take some portion of that- you can choose randomly or you can go by the confidence of the classifier- and then you take that back, feed it back in as training data.
- One weakness of this is that you can imagine if you’re a biased person and you only talk to other biased people, then you’re going to be more and more biased.
- So to counteract increasing the biases that are already in this classifier, we also require that when a classifier labels the 2,500 extra stories, that these new labels have to agree with the distant supervision heuristic labels before we’re going to use them.
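A minimal sketch of this self-training loop with the agreement constraint, assuming a scikit-learn-style classifier with fit / predict_proba / classes_; the round and batch-size parameters are illustrative.

```python
def self_train(clf, seed_X, seed_y, pool_X, heuristic_y, rounds=5, k=50):
    """Self-training (bootstrapping): on each round, train on the current
    labeled set, label the pool, and add back only confident predictions that
    AGREE with the distant supervision heuristic labels, so the classifier's
    own biases are not amplified."""
    X, y = list(seed_X), list(seed_y)
    pool = list(zip(pool_X, heuristic_y))
    for _ in range(rounds):
        clf.fit(X, y)
        if not pool:
            break
        probs = clf.predict_proba([x for x, _ in pool])
        order = sorted(range(len(pool)), key=lambda i: probs[i].max(), reverse=True)
        accepted = set()
        for i in order:
            predicted = clf.classes_[probs[i].argmax()]
            if len(accepted) < k and predicted == pool[i][1]:  # must agree with heuristic
                X.append(pool[i][0])
                y.append(predicted)
                accepted.add(i)
        pool = [pool[i] for i in range(len(pool)) if i not in accepted]
    return clf
```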
Week 4: Natural Language Processing > 4f NLP: Novels > 4f Video
- If we look at the kind of language used in 19th century novels, again, we can see that it is quite different.
- So we can see that this kind of conversation is really unique to novels.
- We had a set of 60 novels, which we took from different genres and different authors.
- We worked together with a professor from comparative literature, and we were looking at whether, through analysis of the novels, we could provide evidence for or against literary theory.
- We did this by extracting social networks from the literature.
- We built the networks- a network for each novel- based on the conversation that happened within the novel.
- The theory here is that in the early part of the 19th century, we had a lot of novels which took place in a rural setting.
- So they had fewer people and everyone knew each other.
- Towards the end of the century, we had a move towards the city, where we had many more characters.
- So we see two quotes here, one from Franco Moretti, who says, “At 10 or 20 characters, it was possible to include distant and openly hostile groups.”
- And another quote from Terry Eagleton, who says, “in a large community, our encounters consist of seeing rather than speaking.”
- So the question we raised was, can we show empirically that conversational networks with fewer people are more closely connected?
- To do this, we had to develop a system which could automatically do what’s called quote attribution.
- That is, identify each quote and identify who said it.
- We used quote adjacency in the novel as a heuristic for detecting conversations.
- We can only get about half of the conversation pairs in the novel.
- So we looked at the impact of network size and how it related to the theory from comparative literature.
- The theory tells us that as the number of named characters increases, we would expect to find the same or less total speech.
- As the number of named characters increases, we would expect to find lower density in the network.
- We would expect to find the same or fewer cliques in the network.
- We thought, well maybe we shouldn’t be looking at named characters.
- In larger networks, people know more of their neighbors.
- We did find that the text’s perspective dominates the shape of the network.
- So for third-person tellings, we saw a significant increase in the normalized number of quotes, in the average degree of the network, in the density of the graph, and in the rate of 3-cliques.
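These network quantities are straightforward to compute once the conversational network is built. Below is a minimal sketch using networkx; the toy pairs and the normalization choices are illustrative assumptions.

```python
import networkx as nx

def conversation_metrics(quote_pairs, n_quotes, n_characters):
    """quote_pairs: (speaker, addressee) pairs inferred from quote adjacency.
    Returns the quantities compared across narrative perspectives above."""
    g = nx.Graph()
    g.add_edges_from(quote_pairs)
    n = max(g.number_of_nodes(), 1)
    return {
        "normalized_quotes": n_quotes / max(n_characters, 1),
        "average_degree": sum(dict(g.degree()).values()) / n,
        "density": nx.density(g),
        "three_cliques": sum(nx.triangles(g).values()) // 3,
    }

# Toy first-person-style network: everything routed through one narrator.
pairs = [("David", "Peggotty"), ("David", "Steerforth"), ("David", "Agnes")]
print(conversation_metrics(pairs, n_quotes=40, n_characters=4))
```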
- Well, if we think about it when we have a first person novel, only the person who is narrating the novel has access to who he’s talking to.
- So our hypothesis here is that first person narrators are not privy to other characters’ conversations.
- We can see this in the networks that were constructed.
- So shown here is an example of the social network constructed for Jane Austen’s novel, Persuasion.
- We can see connections between Anne and most of the other characters.
- We can see connections between the other characters.
- When we go to what’s called the closed third narrative- so this is drawn from Lady Audley’s Secret by Braddon- here the novel is told from the perspective of Robert.
- So the conversation primarily occurs between Robert and other characters.
- Occasionally, we can see other characters talking to each other.
- When we go to a first person narrative- and here is the network for Dickens’s David Copperfield- here we see how the “I” dominates.
- We really cannot see much more than who the first person narrator speaks to.
- Unidirectional ones, where somebody sees somebody else but the other person does not see them.
- We’re also looking at a change over the course of the novel.
- So it’s quite possible in some of these novels placed in the city, that at the beginning people don’t speak much and don’t know each other.
- “He went upstairs to get a tool and in those few seconds, ocean waves broke the steel door lock and flooded the basement six feet high in minutes.”
- And moving ahead, a month later, in work that we are also doing- which I didn’t talk about here today- with scientific journal articles, we may also have summaries of articles that come out that look at the impact of what’s happened on the world.
Week 4: Natural Language Processing > 4g Application of Natural Language Processing > 4g Video
- Hi. My name is Michael Collins, and I’m going to be teaching a segment of the class on natural language processing.
- So what is natural language processing? Natural language processing, broadly speaking, is about getting computers to do intelligent things with human languages.
- So in natural language understanding, a computer takes some text as input, and in some sense understands that text or interprets it and does useful things.
- In natural language generation, a computer can actually be generating language.
- So let’s first go through some applications to illustrate the kind of problems that are solved in natural language processing.
- So one key application, going back to the start of computer science and artificial intelligence, is the problem of machine translation.
- So this is getting computers to automatically translate between languages.
- These really revolutionized the field and led to a very different way of looking at this problem.
- A second key problem in natural language processing is information extraction.
- Information extraction is the problem of taking some raw, unstructured text- for example, the text I just showed you here- and pulling out useful information from that text into a database-style structured representation.
- In dialog systems, we want to be able to actually talk to a machine and interact with it using natural language, with the goal of achieving some task.
- In addition to these various NLP applications, there are also basic problems in natural language processing which underpin many of these application areas.
- Actually, in my segment of the course, we’ll go over a couple of these problems, and we’ll talk about solutions to these problems.
- So one of the most basic NLP problems is the problem of what’s called tagging, or sometimes called sequence learning.
- Abstractly, the problem here is to map a sequence of words- here, actually, we just have a sequence of letters- to a tagged output, where the tagged output has a tag associated with each word in the output.
- Many problems in language processing, and actually in other fields, can be formulated as tagging or sequence learning problems.
- As a first example, which is perhaps the earliest problem that was tackled using machine learning or statistical approaches in natural language processing, the first problem is part of speech tagging.
- The problem here is to map each word in a sentence to some underlying part of speech.
- A second example of a sequence labeling or tagging problem is the problem of named entity recognition.
- So this is the problem of taking some text and identifying all of the people, companies, locations, and so on, all of the proper nouns- so proper entities- in that text, and identifying the type of each entity.
- This can be reduced to a tagging problem where, for example, we’ve tagged the word Boeing as the start of a company, and Co. as the continuation of the company, and Wall Street is the location, and so on and so on- again, a surprisingly challenging task, largely due to ambiguity.
- Another core problem which we’ll look at in this segment is what’s called natural language parsing.
- So the problem here is to take some sentence as input- for example, Boeing is located in Seattle- and to produce a linguistic analysis of the sentence, which looks like basically a rooted tree with labels at the internal nodes in the tree.
- The first thing we’ll talk about is tagging problems.
- We’ll talk about a very important class of models for tagging problems, namely log-linear models.
Week 4: Natural Language Processing > 4h Tagging Problems and Log-Linear Models 1 > 4h Video
- We’re going to talk about an extremely widely used class of models- log-linear models- that are actually used for tagging problems and many other problems in NLP and other fields.
- So I’m first going to go over the tagging problem and give some example problems to address and characteristics of the problem.
- So as a first example problem, let’s look at a part of speech tagging, which is a sort of canonical example of the tagging problem in natural language processing.
- So the problem here is to take a sentence as input and then, for each word in the sentence, to produce an output corresponding to the part of speech for that particular word.
- So we’ve tagged “profits” as a noun, “soared” as a verb, “at” as a preposition, and so on and so on.
- It’s typical, in English at least, to have an inventory of maybe 50, roughly, possible part of speech tags for each word in the sentence.
- In English and in many other languages, words are quite ambiguous with respect to their part of speech.
- The word profits is a noun in this particular sentence.
- Very often we need quite powerful contextual information to disambiguate a word for its part of speech.
- A second challenge is that we are often going to see words that are very infrequently seen in the training data we used to train our models.
- So if we look at this particular sentence, there’s this word, Mulally, which is an extremely rare word.
- It’s probably a word that many of you have never seen before.
- We require machines to be able to nevertheless recover the underlying part of speech tag for this word.
- Here’s a second example of a very important tagging problem in natural language.
- So the problem here is to take some sentences input and to identify various named entities, such as companies, locations, and people within this input sentence.
- It’s very easy to map this to a word by word tagging problem.
- I use NA to refer to a word which is not part of any entity.
- Then, for each entity type, we have a start tag and a continue tag.
- You can see that these words like Boeing Co., Wall Street, Alan Mulally, make use of these various tags, which are used to find these named entity boundaries.
- Around a million words of text, where human annotators have actually gone in and annotated the underlying part of speech for each word in a sentence.
- Our goal is going to be to, from this training set, learn a function or an algorithm that takes completely new sentences and maps them to the underlying tag sequences.
- These problems with ambiguity and rare words quickly lead to a very complex set of rules being required.
- What people found was that it’s much easier to get annotators to annotate data, for say, part of speech tag data, and then to learn the function automatically from these annotated examples.
- So if you look at a word like can in English, this is again, an ambiguous word.
- So there’s a strong source of local preference, which is that this word, can, in general is more likely to be a modal verb than a noun, OK. But that can of course be overridden by context.
- Where determiner is a word like the or a, OK. So that’s a contextual cue.
- So if I have the sentence, the trash can is in the garage, the word can has a local preference of being a modal verb.
- We would want to tag the word can as a noun here, OK. So we’ll see that the log-linear models that we’ll introduce soon definitely make use of both of these types of constraints.
Week 4: Natural Language Processing > 4i Tagging Problems and Log-Linear Models 2 > 4i Video
- In tagging problems, yi is a complete tagging for the entire sentence.
- The task is, given this set of supervised examples, this set of labeled examples, to learn a function f that maps a new input x to an underlying label f of x. So again, in our case, x would be a sentence, and f of x would be the sequence of tags for that sentence.
- So we’re going to learn a function that maps an xy pair to a probability, corresponding to the conditional probability of the label y given the input x. Then, given any new test input, we define f of x to be simply the y that has maximal probability p of y given x. So it’s very intuitive: for a given input x, simply take the most probable label, or in our case, tag sequence, for that input x. Log-linear models, which we’ll describe next, fall precisely into this category of conditional probabilistic models.
- Ti is the ith tag in the sentence, and it’s the tag for the ith word we see in the sentence.
- We’re going to use a log linear model to define a distribution, a conditional distribution, over tag sequences for a particular input sentence.
- So this model is going to take any tag sequence paired with a word sequence, and return a conditional probability.
- Finally for a new input sentence, w1 through n, we will return the most probable, the highest probability tag sequence under the model.
- So once we’ve learned this model in so-called decoding of an input sentence, we’ll search for the tag sequence that has highest probability under the model.
- The number of such tag sequences will typically grow exponentially fast with respect to the length of the sentence.
- If we have an input sentence of length n, and each word has two possible tags, there are 2 to the power n possible tag sequences.
- So for any appreciable sentence length, brute force search through all possible tag sequences is not going to be possible.
- So here I have p of t1 through tn, conditioned on w1 through wn. Here I have a product from j equals 1 to n, and at each point I have the conditional probability of t sub j, conditioned on the entire input w1 through wn and on the previous j minus 1 tags.
- So we’ve essentially decomposed this probability into a product of individual terms where at each point, we have a probability of t sub j, given the entire input sentence, and the previous entire sequence of tags t1 through tj minus 1.
- I have t sub j, but now I condition only on tj minus 2 and tj minus 1, the previous two tags in the sequence.
- So I’ve made an independence assumption that, once I condition on the entire input sequence w1 through wn and on the previous two tags, tj minus 2 and tj minus 1, tags earlier in the sentence are not relevant.
- One is that typically the last two tags carry by far the most information about the next tag.
- Secondly, it makes search for the highest probability tag sequence easier.
- It essentially allows us to use dynamic programming, a technique from computer science, to find the most likely tag sequence under the model quite efficiently.
- Here, just to stress again, the independence assumption made in the model: once we condition on the word sequence and on the last two part-of-speech tags, the earlier tags are irrelevant.
- We’re going to try to model the conditional distribution over possible tags in that position.
- So here I have a set of possible tags corresponding to, actually, singular nouns, plural nouns, transitive verbs, intransitive verbs, prepositions, determiners, and so on.
- So there are many possible tags at this position for this word, “base”.
- The previous tag is JJ, which is an adjective in this particular part-of-speech tag set.
- So you can imagine all kinds of information being used when we’re trying to model the conditional distribution over tags at this point.
- So a little bit of notation, we’re going to make use of the term history for a bundle of information corresponding to the context used at a particular tagging decision.
- T minus 2, t minus 1, w1 through n, and i, where t minus 2, t minus 1 are the previous two tags in the input.
- So a history basically captures the contextual information used in tagging.
- So in the tagging problem, or the part of speech tagging problem, this would be a set of 50 possible part of speech tags.
- One such question might be: is the word being tagged “base”? And is the label being proposed a verb? Return 1 only if both conditions are true.
- So f1 looks at a history in conjunction with a tag and returns 1 if the word being tagged, w sub i, is the word “base”, and the tag being proposed is Vt, which corresponds to, say, a transitive verb.
- So this looks at a particular word in conjunction with a particular tag.
- Potentially one feature for every possible word paired with every possible tag.
- Even if we have hundreds of thousands of features which look at different words in conjunction with some of the different tags, only one of these features will take the value 1 versus 0.
- The tag being proposed is vbg, which corresponds to gerund verb.
- Now again, we would typically introduce a very large number of features, maybe looking at all prefixes and suffixes up to some length, say 4, in combination with all possible tags.
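A minimal sketch of how indicator feature templates of this kind can be spelled out in code. The template names and the exact set are illustrative rather than a full Ratnaparkhi-style feature list.

```python
def extract_features(history, tag):
    """history = (t_minus2, t_minus1, words, i): the previous two tags, the full
    word sequence, and the position being tagged. Returns the names of the
    indicator features that take value 1 for this (history, tag) pair."""
    t2, t1, words, i = history
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    feats = [
        f"WORD:{word}_TAG:{tag}",
        f"TRIGRAM:{t2}_{t1}_{tag}",
        f"BIGRAM:{t1}_{tag}",
        f"UNIGRAM:{tag}",
        f"PREVWORD:{prev_word}_TAG:{tag}",
    ]
    for k in range(1, 5):  # prefixes and suffixes up to length 4
        feats.append(f"PREFIX:{word[:k]}_TAG:{tag}")
        feats.append(f"SUFFIX:{word[-k:]}_TAG:{tag}")
    return feats

# Tagging "base" with Vt, with previous tags DT and JJ:
print(extract_features(("DT", "JJ", ["an", "important", "base"], 2), "Vt"))
```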
Week 4: Natural Language Processing > 4j Tagging Problems and Log-Linear Models 3 > 4j Video
- I’m actually going to show you a full set of features used by Adwait Ratnaparkhi in a paper in 1996 that really introduced log-linear models for tagging.
- These features, actually, at least for English part-of-speech tagging, are still close to state-of-the-art.
- We’d have spelling features for all prefixes and suffixes up to some length.
- So f103 is 1 if the tag trigram t minus 2, t minus 1, t is equal to determiner, adjective, transitive verb.
- So this feature will capture a particular sequence of three tags.
- It will fire if a particular subsequence of three tags is seen in the input.
- Again, that’s going to be good evidence about whether the tag Vt is likely or unlikely in this particular context.
- A so-called bigram feature looks at just a pair of tags, JJ and Vt. And the so-called unigram feature looks at just the tag alone.
- That’s the simplest feature really that only looks at the tag Vt being predicted.
- We can also have features which look at the previous word, in this case, there in conjunction with the tag.
- In practice, you probably would have thousands of features of each of these different types.
- The next question is, well how do we turn this into a model? How do we make use of these features in a probabilistic model? And the idea will be to use these features directly as the evidence when calculating these conditional distributions.
- This is going to list a weight for every possible feature.
- V sub 2 is going to be a weight for feature 2 and so on.
- Intuitively, if a value is strongly positive, it means that particular feature is an indicator of a very likely context tag pair or xy pair.
- If a parameter value is very negative, it’s an indicator that the underlying feature indicates a very bad context tag pair.
- Then I have the inner product between v and f. So the inner product is just a standard inner product between the parameter vector and the feature vector.
- If we think of the features as all being 0-1 valued, it’s basically a sum of the weights corresponding to the features which take value 1.
- So every time I get a feature that takes value 1, I’m going to add in its parameter value that inner product, v dot f. So again, if that’s going to be strongly positive, that’s going to indicate that this particular xy pair is very likely.
- So this is just a transform from the score v dot f, which can be interpreted as a positive or negative score across likely or unlikely combinations, basically turning those scores into a distribution over possible labels y. But at a high level, the crucial idea is this idea of features, which capture potentially important information about the input x paired with the output y, and then these parameters v that allow us to learn positive or negative weights associated with each feature.
- So a critical question is where do we learn these parameter vectors v. How do we learn them? And we’re going to assume, as I said earlier, that we have training data consisting, in our case, of sentences labeled with tag sequences.
- So the learning algorithm is going to take that training data as input and return a parameter vector v as the output.
- Each yi is the correct tag in that particular history.
- We now need to, having trained the model, search for the tag sequence that maximizes the conditional probability t1 through tn given w1 through wn.
- I’m going to assume that this conditional distribution takes the form that I showed you earlier where we’ve used the chain rule and independence assumptions to give us a product of terms of the tag ti condition on the previous two tags, the entire sentence, and the position.
- The first thing to note is that, as I said before, we’re searching over potentially an exponential number of tag sequences.
- That is, exponential with respect to the sentence length n. And so brute force enumeration of all possible tag sequences is not going to be possible.
- That is essentially applicable because this assumption that each tag depends only on the previous two tags allows us to do something clever in search, where we can avoid brute force enumeration of all tag sequences and instead use this technique called dynamic programming.
- So for part of speech tagging in English, results are pretty high, easily around 97% accuracy, at least on newswire text, which is a fairly well-behaved genre.
- So to summarize the segment, the key ideas in log-linear taggers.
- The second step is to estimate this conditional probability of a tag ti given a history using a log-linear model.
- Finally for a given test sentence, w1 through wn, we can use dynamic programming, often called the Viterbi algorithm, to find the highest scoring tag sequence for that particular sentence.
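A minimal sketch of Viterbi decoding for a trigram tagger. The scoring function is assumed to come from a trained model; the toy scorer and tag set below are illustrative only.

```python
import math
from itertools import product

def viterbi(words, tags, log_prob):
    """Viterbi decoding for a trigram tagger. log_prob(t2, t1, words, i, t)
    should return log p(t | t2, t1, words, i), e.g. from a trained log-linear
    model. Returns the highest-scoring tag sequence for the sentence."""
    n = len(words)
    pi = {(0, "*", "*"): 0.0}   # pi[(i, u, v)]: best score with tags u, v at positions i-1, i
    bp = {}
    for i in range(1, n + 1):
        prev_states = [(u, v) for (j, u, v) in pi if j == i - 1]
        for (u, v), t in product(prev_states, tags):
            score = pi[(i - 1, u, v)] + log_prob(u, v, words, i - 1, t)
            if score > pi.get((i, v, t), -math.inf):
                pi[(i, v, t)] = score
                bp[(i, v, t)] = u
    _, u, v = max((s, k[1], k[2]) for k, s in pi.items() if k[0] == n)
    seq = [None] * (n + 1)
    seq[n], seq[n - 1] = v, u
    for i in range(n, 2, -1):
        seq[i - 2] = bp[(i, seq[i - 1], seq[i])]
    return seq[1:]

# Toy scorer standing in for a trained model: prefer NOUN after DET.
def toy_log_prob(t2, t1, words, i, t):
    return 0.0 if (t1, t) == ("DET", "NOUN") else -1.0

print(viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], toy_log_prob))
```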
- The key property of log-linear models, really the reason they became so popular, is due to flexibility in the features they can use.
Week 4: Natural Language Processing > 4k Syntax and Parsing > 4k Video
- As input we take some sentence, and as output we produce what is called a parse tree or a syntactic structure.
- We’ll see shortly that this is a linguistic representation that basically shows how a sentence is broken down.
- It shows the kind of hierarchical relationships between words and phrases in that sentence.
- It’s actually very similar or very closely related to the kind of sentence diagrams that many people, at least in the US, produce in high school.
- The Penn Wall Street Journal Treebank, which was one of the original resources of this type, consists of around 50,000 sentences with associated parse trees, or syntactic structures.
- A team of linguists have actually gone through, and sentence by sentence, assigned what they deem to be the correct syntactic structure for each sentence.
- They’re simply the part of speech for each word in the sentence.
- If we look at the higher levels on the tree, we start to see groups of one or more words grouped into phrases.
- So if we look at an internal node in the tree, for example, the nodes labeled NP, we’ll see that they dominate some substring of the original sentence.
- So we can just read off from the tree the fact that we have noun phrases.
- So a verb phrase is typically formed by a verb followed by a noun phrase.
- Finally we have sentences which are full sentences.
- Often created by a noun phrase followed by a verb phrase.
- So we start to see a grouping of phrases into high level pieces of structure corresponding to larger substrings within a sentence, OK. Noun phrases, verb phrases, and sentences in this case.
- So the final thing that these parse trees encode is a set of useful relationships between phrases or words in a sentence.
- If we look at this particular sentence, we know that the burglar is the subject of robbed.
- The implication is the burglar is the person who did the robbing, in this case, OK. And there is a grammatical relationship between the phrase burglar and between verb robbed.
- There is a verb direct object relationship between robbed and the apartment, in this particular sentence.
- That’s crucial in making a first step towards understanding the meaning of a sentence.
- Because, these grammatical relationships are very closely related to the underlying meaning of a sentence.
- There’s a much more complex set of grammatical relationships, and much more complex set of grammatical relations within this particular sentence.
- So if we take a simple sentence like, IBM bought Lotus, in English, the paraphrase in Japanese, the word order would be, IBM Lotus bought.
- You might think it’s simple to, when you are translating a sentence, to recover this difference in word order between the two languages.
- Things quickly become complex when we look at longer sentences.
- The verb, said, has now gone to the end of the sentence because Japanese is generally gonna have the verb at the end.
- Again, as sentences get longer it really quickly becomes complex to define the difference in word order between sentences in the source and target languages.
- So I have this sentence, he drove down the street in the car.
- The phrase, in the car, is a so-called prepositional phrase.
- Prepositional phrases crop up all the time, and they can, in general, modify either nouns or verbs.
- There’s actually an ambiguity in the sentence.
- So if I say, he drove down the street in the car, it’s pretty clear that the prepositional phrase, in the car, is modifying the verb, drove.
- So I drove down the street, which was in the car, OK. And so that’s a pretty representative case of prepositional phrase attachment ambiguity.
- Where a prepositional phrase can modify either a verb, drove, or a noun, street.
- These types of ambiguities arise everywhere in sentences.
- As we’ll see soon, they conspire to produce a great number of analyses for many sentences.
- The alternative structure again has a prepositional phrase, in the car, but it now modifies the noun phrase, the street.
- So here’s a very simple example sentence, she announced a program to promote safety in trucks and vans.
- OK. If we go to the prepositional phrase, we have the prepositional phrase here, in trucks and vans.
- An alternative analysis, again corresponding to a prepositional phrase attachment ambiguity, is for in trucks and vans to be modifying the verb, promote.
- You can find many other sources of ambiguity in the sentence leading to a large number of analyses.
Week 4: Natural Language Processing > 4l Dependency Parsing > 4l Video
- So here’s a dependency parse for the sentence John saw Mary.
- So we’ll use a convention where each dependency can be represented by a pair of integers, h and m, where h is the index of what’s called the headword, and that is going to be the word at the start of a directed arc.
- That’s the dependency between word 0, root, and word 2, saw.
- We’re going to follow this convention where a dependency is a pair, h comma m, where h and m are indices into the sentence.
- So the dependency parsing problem is simply to take a sentence as input and to recover a dependency parse as the output.
- Of course, for a given sentence, there are typically many possible dependency structures.
- It’s easy enough to prove that the number of dependency structures for a sentence of length n will typically grow exponentially fast with respect to the sentence length n. And so our task is going to be to essentially choose between these different dependency structures.
- So over the last decade or so, there’s been a considerable amount of interest in dependency parsing.
- In a conference in 2006, 19 different groups developed dependency parsing models for 12 languages- Arabic, Chinese, Czech, Danish and so on.
- You can see from this that in 2006, there were at least 12 languages with treebanks- that is, resources which pair sentences with annotated dependency structures, from which we could learn machine learning systems for the problem.
- So we can build dependency parsing models for many languages of interest.
- So now we’re going to talk about how to develop a statistical machine learning approach to dependency parsing.
- The critical idea will be to define a function that gives a score for any dependency structure.
- So by giving a score to every possible dependency structure, we now rank the dependency structures for a particular sentence in order of plausibility.
- So if we go back to this particular example, we might have several possible dependency structures for a sentence.
- So the critical question is, how do we define a model that takes a dependency parse as input and returns a score as the output? So here is a very simple scoring function which is the basis of many scoring functions that are used in dependency parsing models.
- So this is going to be the score of a dependency parse.
- Remember, a dependency parse consists of a set of dependencies, where each dependency is a pair h and m, and h and m are indices into the sentence representing a directed arc between the word at position h and the word at position m. We’ll use x to denote the sentence that is being parsed.
- We have one score for each dependency in the dependency parse.
- It takes the identity of the sentence, and it takes the indices h and m representing a particular dependency, and it returns a feature vector representation of that particular dependency.
- So the score of this entire dependency parse is going to be a sum of scores, one for each dependency.
- So we have one score for each dependency in the structure.
- So as we saw in log linear models for tagging, each of these features- we have little m features in the model- each feature is going to be a function that, in this case, looks at a sentence and looks at the location of a particular dependency.
- They’ll ask various questions about the particular dependency of interest.
- So this particular feature is going to ask a question about the dependency.
- It’s going to ask about the headword and the modifier word involved in a particular dependency.
- Now, if we look at the scoring function, v1 in this case can be interpreted as a score for a dependency between saw and John.
- So if this is high, it indicates that it’s very plausible to have a dependency between saw and John.
- If v1 is strongly negative, it indicates that this particular type of dependency is dispreferred.
- So we would typically run a tagging model to first recover the part of speech of each word in the sentence, and then our dependency parsing features can make use of part of speech tags as well as words in the sentence.
- So in this case we don’t just consider the word at the start of a dependency arc and the word at the end.
- We can also consider words within a one or two-word window of the start and end point of a particular dependency.
- A final set of features which are very useful are so-called in-between features, which look not just at the two words involved in dependency, but they also look at what kinds of words or what parts of speech are seen between those two words.
- Because the context, again, can be a very strong indicator of where a dependency is likely or not to be seen between two particular words.
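A minimal sketch of this first-order (arc-factored) scoring idea: the score of a parse is the sum over its arcs of a dot product between a weight vector and arc features. The feature templates, toy weights, and example parse are illustrative assumptions.

```python
def score_parse(words, tags, arcs, weights):
    """First-order (arc-factored) scoring: the score of a dependency parse is
    the sum over its (head, modifier) arcs of w . f(x, h, m).
    'arcs' is a set of (h, m) index pairs; index 0 is the artificial root."""
    padded_words = ["*root*"] + words
    padded_tags = ["*root*"] + tags
    total = 0.0
    for h, m in arcs:
        feats = [
            f"HEADWORD:{padded_words[h]}_MODWORD:{padded_words[m]}",
            f"HEADTAG:{padded_tags[h]}_MODTAG:{padded_tags[m]}",
            f"DIRECTION:{'right' if m > h else 'left'}_DISTANCE:{abs(h - m)}",
        ]
        total += sum(weights.get(f, 0.0) for f in feats)
    return total

# Toy example for "John saw Mary": root -> saw, saw -> John, saw -> Mary.
weights = {"HEADWORD:saw_MODWORD:John": 2.0, "HEADWORD:saw_MODWORD:Mary": 1.5,
           "HEADWORD:*root*_MODWORD:saw": 1.0}
print(score_parse(["John", "saw", "Mary"], ["NNP", "VBD", "NNP"],
                  {(0, 2), (2, 1), (2, 3)}, weights))
```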
- The simple first order models I just showed you score close to 91% accuracy in recovering individual dependencies.
- So we can look at the output from a dependency parser and simply look at what proportion of dependencies between words are correctly recovered.
- A second order dependency model is slightly more complex.
- It basically means that instead of just looking at single dependencies, we’re also going to look at pairs of dependencies, like this.
- Finally, the most recent results- which essentially build on the techniques I’ve shown you but use much more powerful features and dependency structures- might easily achieve over 94% accuracy.
- All of these numbers are on the Penn Wall Street Journal Treebank, so these are representative of the accuracy on Newswire text using these dependency structures for English.
Week 4: Natural Language Processing > 4m Machine Translation 1 > 4m Video
- OK, so in this final segment of the session, I’m going to talk about machine translation.
- Machine translation is the problem of getting computers to automatically translate between languages.
- The idea is as follows- in many language pairs of interest, we actually have access to large quantities of example translations.
- The basic idea is that given these example translations, we’re going to learn a model that automatically translates between languages.
- The earliest work that seriously pursued this approach was work at IBM on translation between French and English.
- This gives us many example source sentences, example translations.
- So they used a data set consisting of around two million sentences of French-English example translations.
- So this is kind of an incredible idea- that you could simply take example translations and from that, learn a translation model that correctly or accurately translates future examples.
- The idea actually goes back- way back- to Warren Weaver, who suggested applying statistical and cryptoanalytic techniques to translation in the late ’40s. So what I’m going to do is give you a very brief overview of some important techniques used to actually build these translation models.
- These techniques are now widely used in modern machine translation systems.
- So here’s an example French-English translation.
- What we’ve done here is for each English word, we’ve identified a single French word in the French sentence for which the English word is a translation.
- So “the” is aligned to “le”, “council” is aligned to a similar-sounding word in French, “stated” is aligned to its French counterpart, and so on, and so on.
- You can imagine that if you have millions of sentences like this, you’re going to start to see strong evidence that certain words are translations of other words.
- Firstly, just frequency of co-occurring- so if I look at the English word “council” and its French translation- this second word here in the French- those two words are going to be seen frequently in sentences that are translations of each other.
- That’s going to be a rich source of evidence that those two words are actually translations of each other.
- In the crudest form, this is going to say that words close to the start of the English sentence are often translated to words close to the start in the French sentence.
- There are much more sophisticated positional models that try to say structurally which parts of one sentence are likely to be translations of the other.
- Once we combine those two types of evidence, we can start to build up strong evidence for which words are translations of each other.
- It’s basically learned to recover these alignments purely from millions of sentences of example translation data.
- And actually, the translation systems I’ll describe are called phrase-based translation systems.
- So in phrase-based translation systems, the crucial resource behind these systems is a dictionary which pairs phrases in one language with phrases in another language.
- Of course, again, these probabilities are going to be invaluable in devising a translation model between German and English.
- So the final thing I want to do is give you a sketch of how, once we’ve learned a phrase lexicon of the type I’ve just shown you, how we can employ this in translation.
- The basic idea is that we’re going to translate a sentence in some foreign language- in this case German- into some target language- in this case English- by essentially laying down the English words in left-to-right order, at each point taking some segment of the German sentence, looking up its translation in some phrase table, and translating it to some phrase in English.
- So in this case, we might first choose the translation of “heute” in German to the English “today.”
- So at each point, we basically just pick a sub-segment of the German which hasn’t yet been translated, look up a phrase in the phrase table, and make a translation to that phrase.
- The final thing I want to talk about is how probabilities or score is associated with each of these translation moves.
- So in this particular case, we’re extending the English translation by a single word, “today.”
- And that has an associated log probability, which is the conditional probability of seeing the word “today” given that the previous two words were star, star.
- The language model score estimates the probability of the word “debating” conditioned on the previous two English words we produced, that is, “shall” and “be.”
- The second term gives the conditional probability of the verb “diskutieren” in German conditioned on the English word “debating” that’s being generated.
- This can be thought of as a measure of how likely we have this particular translation.
- So we’ve actually skipped over six words in the input in making this particular translation.
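A minimal sketch of how the score for one such translation move could be assembled from a language model term, a phrase translation term, and a distortion penalty. The toy scoring functions are illustrative stand-ins for trained models, not a real decoder.

```python
import math

def extend_score(prev_bigram, english_phrase, german_phrase,
                 lm_logprob, phrase_logprob, distortion_weight, jump):
    """Score for one move in a phrase-based decoder: the language-model score
    of the new English words given the previous two, the log-probability of
    the German phrase given the English phrase, and a distortion penalty for
    how far we jumped in the source sentence."""
    w1, w2 = prev_bigram
    lm = 0.0
    for word in english_phrase:
        lm += lm_logprob(w1, w2, word)
        w1, w2 = w2, word
    return lm + phrase_logprob(german_phrase, english_phrase) + distortion_weight * abs(jump)

# Toy stand-ins for the trained language and phrase models (illustrative only).
toy_lm = lambda w1, w2, w: math.log(0.1)
toy_phrase = lambda german, english: math.log(0.5)

# Extending an empty translation with "today" for the German "heute", no jump.
print(extend_score(("*", "*"), ("today",), ("heute",), toy_lm, toy_phrase, -0.5, 0))
```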
- We’re going to use some kind of search method- often a method called beam search- to attempt to find the highest scoring translations- the most plausible translation for a particular input.
- I want to talk about one method which makes use of syntax, which really ties together these last couple of segments in the class- the segment on syntax and machine translation.
- So this is one very simple way of using those syntactic structures- for example, dependency structures- to improve the performance of machine translation system.
- Then we look at the grammatical relations in the parse tree, and reorder them to try to recover a word order that’s much more similar to the target language- in our case, for example, English.
- So we’re essentially going to take German sentences, and move the words around so we end up with sentences which are again using German vocabulary, but are much closer to English in word order.
- We parse the German, apply a set of rules, and in this reordered German, we actually have something that looks much closer to English in word order.
Week 4: Natural Language Processing > 4n Machine Translation 2 > 4n Video
- OK. So having defined features in the model, let’s now talk about how we can make use of these features to derive a probability distribution, specifically, a conditional distribution over the set of possible tags for a given history.
- So in the general case of a log-linear model, we have some set of possible inputs, which we use script X to denote this, and we have a finite set of possible labels, script Y, and our aim is to define a conditional probability py given x for any xy pair.
- In NLP, it’s very common for these features to be indicator functions that basically ask a question about an xy pair and return either 0 or 1.
- In addition to the feature vector, we’re going to have a parameter vector v, which is also in m dimensions and is basically going to store all of the parameters of the model.
- So for each feature f sub k, we’re going to have a parameter or weight v sub k. And that parameter could be positive or negative.
- Given the definitions of f and v, we define the probability distribution in a particular way.
- So we want to have the conditional probability of y conditioned on x, and I’ll use a semicolon here followed by v. This expression can be interpreted as the conditional probability of y given x under parameter values v, and this is going to be the ratio of two terms.
- Then each vk is basically going to be the weight associated with this feature.
- So whenever a feature is 1, we’re going to add in this weight.
- Intuitively, if the v sub k is strongly positive, it indicates that that particular feature being true indicates that x and y is a very plausible pair.
- So we exponentiate v dot f. And then in the denominator, I have a sum over all possible labels y prime, and then I basically have the same thing, e to the v dot f of x, y prime.
- So of course we want this property that if we sum over all labels, py given x under parameters v is equal to 1, and this normalization constant essentially ensures this is true.
- You can see if we sum over y here, we end up with a ratio of terms where the numerator and denominator are the same, and so we just end up with probability 1 in this case.
- So basically in a nutshell, we first score each decision using v dot f. We then exponentiate and we normalize and we end up with a probability distribution.
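The score-exponentiate-normalize recipe just described can be written down directly. This is a minimal sketch in which the feature function, weights, and example are illustrative; the max-subtraction is a standard numerical-stability trick, not part of the model itself.

```python
import math

def log_linear_prob(x, y, labels, features, v):
    """p(y | x; v) = exp(v . f(x, y)) / sum over y' of exp(v . f(x, y')).
    'features(x, y)' returns the names of the indicator features that take
    value 1 for the pair, and 'v' maps feature names to weights."""
    def score(label):
        return sum(v.get(f, 0.0) for f in features(x, label))
    scores = {label: score(label) for label in labels}
    m = max(scores.values())                      # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - m) for label, s in scores.items()}
    return exp_scores[y] / sum(exp_scores.values())

# Toy example: one indicator feature per (word, tag) pair.
features = lambda word, tag: [f"WORD:{word}_TAG:{tag}"]
weights = {"WORD:base_TAG:Vt": 1.2, "WORD:base_TAG:NN": 0.3}
print(log_linear_prob("base", "Vt", ["Vt", "NN", "DT"], features, weights))
```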
- How do I end up- How do I get these weights? I’m just going to give you a very, very high level overview of how we actually learn the parameter values from data.
- The training set consists of xy pairs; in the tagging case, it consists of histories combined with the true tag for each of those histories.