edu.washington.cs.knowitall
Class Sentence

java.lang.Object
  extended by edu.washington.cs.knowitall.sequence.SimpleLayeredSequence
      extended by edu.washington.cs.knowitall.sequence.BIOLayeredSequence
          extended by edu.washington.cs.knowitall.nlp.ChunkedSentence
              extended by edu.washington.cs.knowitall.Sentence
All Implemented Interfaces:
edu.washington.cs.knowitall.sequence.LayeredSequence, TokenSequence, XmlSerializable, Serializable

public class Sentence
extends edu.washington.cs.knowitall.nlp.ChunkedSentence
implements TokenSequence, Serializable, XmlSerializable

A representation of a sentence. This class extends ChunkedSentence to support types, lemmas, and various serialization methods.

Author:
schmmd
See Also:
Serialized Form

Field Summary
 Long id
           
 String originalText
           
protected  List<com.google.common.collect.TreeMultimap<String,Type>> typeLookup
           
 
Fields inherited from class edu.washington.cs.knowitall.nlp.ChunkedSentence
NP_LAYER, POS_LAYER, TOKEN_LAYER
 
Constructor Summary
Sentence(edu.washington.cs.knowitall.nlp.ChunkedSentence chunked, String originalText)
           
Sentence(edu.washington.cs.knowitall.nlp.ChunkedSentence chunked, String originalText, Iterable<String> norms)
           
Sentence(Long id, String originalText, List<String> tokens, Iterable<String> norms, List<String> posTags, List<String> npChunkTags)
           
Sentence(Long id, String originalText, String[] tokens, String[] norms, String[] posTags, String[] npChunkTags)
           
Sentence(String originalText, List<String> tokens, Iterable<String> norms, List<String> posTags, List<String> chunkTags)
           
Sentence(String originalText, String[] tokens, String[] norms, String[] posTags, String[] chunkTags)
           
 
Method Summary
 void addExtraction(Iterable<edu.washington.cs.knowitall.nlp.extraction.ChunkedBinaryExtraction> extractions)
          Add multiple extractions to this sentence.
 void addExtraction(RelationExtraction extraction)
          Add an extraction to this sentence.
 void addExtractions(Iterable<RelationExtraction> extractions)
          Add multiple extractions to this sentence.
static String convertGroup(edu.washington.cs.knowitall.commonlib.regex.Match.Group<Token> group)
           
 boolean equals(Object that)
           
 List<RelationExtraction> extractions()
          The extractions in this sentence.
static Iterable<RelationExtraction> extractions(Iterable<Sentence> sentences)
           
static List<Sentence> fromDocument(org.jdom.Document document)
          Deserialize sentences from an XML document.
static Sentence fromXmlElement(org.jdom.Element e)
           
 Long getId()
           
 List<String> getLemmas()
          The lemmas of this sentence.
 List<String> getLemmas(edu.washington.cs.knowitall.commonlib.Range range)
          The lemmas of this sentence, constrained to the specified range.
 edu.washington.cs.knowitall.commonlib.Range getRange()
           
 edu.washington.cs.knowitall.commonlib.Range getRange(String string)
           
 List<Type> getTypes()
           
 int hashCode()
           
static edu.washington.cs.knowitall.commonlib.regex.RegularExpression<Token> makeRegex(String regex)
          Compiles a regular expression over the tokens in a sentence into an NFA.
 void tag(Iterable<Type> types)
          Add a collection of types to this sentence.
 void tag(Type type)
          Add a type to this sentence.
 String toString()
           
 org.jdom.Element toXmlElement()
           
 List<Type> types()
          The types associated with this sentence.
 List<Token> zip()
          Represent this sentence as a list of tokens (instead of an object that contains a separate array for each field).
 List<Token> zip(edu.washington.cs.knowitall.commonlib.Range range)
          Represent a range in this sentence as a list of tokens.
 
Methods inherited from class edu.washington.cs.knowitall.nlp.ChunkedSentence
clone, getChunkTag, getChunkTags, getChunkTags, getChunkTags, getChunkTagsAsString, getNpChunkRanges, getPosTag, getPosTags, getPosTags, getPosTags, getPosTagsAsString, getPosTagsAsString, getPosTagsAsString, getSubSequence, getSubSequence, getToken, getTokenRange, getTokens, getTokens, getTokens, getTokensAsString, getTokensAsString, getTokensAsString, toOpenNlpFormat
 
Methods inherited from class edu.washington.cs.knowitall.sequence.BIOLayeredSequence
addSpanLayer, addSpanLayerRanges, getSpans, getSpans, getSubSequence, getSubSequence, isSpanLayer
 
Methods inherited from class edu.washington.cs.knowitall.sequence.SimpleLayeredSequence
addLayer, addLayer, addLayer, get, getLayer, getLayerAsString, getLayerAsString, getLayerAsString, getLayerNames, getLength, getNumLayers, hasLayer
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface edu.washington.cs.knowitall.TokenSequence
getChunkTags, getPosTags, getTokens, getTokensAsString
 

Field Detail

id

public final Long id

originalText

public final String originalText

typeLookup

protected final List<com.google.common.collect.TreeMultimap<String,Type>> typeLookup
Constructor Detail

Sentence

public Sentence(edu.washington.cs.knowitall.nlp.ChunkedSentence chunked,
                String originalText,
                Iterable<String> norms)
         throws edu.washington.cs.knowitall.sequence.SequenceException
Throws:
edu.washington.cs.knowitall.sequence.SequenceException

Sentence

public Sentence(edu.washington.cs.knowitall.nlp.ChunkedSentence chunked,
                String originalText)

Sentence

public Sentence(String originalText,
                String[] tokens,
                String[] norms,
                String[] posTags,
                String[] chunkTags)
         throws edu.washington.cs.knowitall.sequence.SequenceException
Throws:
edu.washington.cs.knowitall.sequence.SequenceException

Sentence

public Sentence(String originalText,
                List<String> tokens,
                Iterable<String> norms,
                List<String> posTags,
                List<String> chunkTags)
         throws edu.washington.cs.knowitall.sequence.SequenceException
Throws:
edu.washington.cs.knowitall.sequence.SequenceException

Sentence

public Sentence(Long id,
                String originalText,
                String[] tokens,
                String[] norms,
                String[] posTags,
                String[] npChunkTags)
         throws edu.washington.cs.knowitall.sequence.SequenceException
Throws:
edu.washington.cs.knowitall.sequence.SequenceException

Sentence

public Sentence(Long id,
                String originalText,
                List<String> tokens,
                Iterable<String> norms,
                List<String> posTags,
                List<String> npChunkTags)
Method Detail

fromDocument

public static List<Sentence> fromDocument(org.jdom.Document document)
Deserialize sentences from an XML document.

Parameters:
document - the document to deserialize
Returns:
the resulting sentence objects

toString

public String toString()
Overrides:
toString in class edu.washington.cs.knowitall.nlp.ChunkedSentence

equals

public boolean equals(Object that)
Overrides:
equals in class edu.washington.cs.knowitall.sequence.SimpleLayeredSequence

hashCode

public int hashCode()
Overrides:
hashCode in class edu.washington.cs.knowitall.sequence.SimpleLayeredSequence

zip

public List<Token> zip()
Represent this sentence as a list of tokens (instead of an object that contains a separate array for each field). This is used by the regular expression library.

The list is cached for speed.

Specified by:
zip in interface TokenSequence

zip

public List<Token> zip(edu.washington.cs.knowitall.commonlib.Range range)
Represent a range in this sentence as a list of tokens. The returned object is a view into the cached list of the entire sentence.
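
The behavior described for the two zip methods (parallel layers combined into one token list, cached, with ranges served as views) can be pictured with the following minimal sketch. It does not use the library's classes: SketchToken, ZipSketch, and the half-open (start, end) index pair standing in for Range are all hypothetical names chosen for illustration, and the real implementation may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the library's Token class; the fields are assumptions.
final class SketchToken {
    final String word;
    final String pos;
    final String chunk;

    SketchToken(String word, String pos, String chunk) {
        this.word = word;
        this.pos = pos;
        this.chunk = chunk;
    }

    @Override
    public String toString() {
        return word + "/" + pos + "/" + chunk;
    }
}

// Hypothetical stand-in for a sentence stored as parallel layers.
final class ZipSketch {
    private final List<String> words;
    private final List<String> posTags;
    private final List<String> chunkTags;
    private List<SketchToken> cached; // built on the first call to zip()

    ZipSketch(List<String> words, List<String> posTags, List<String> chunkTags) {
        if (words.size() != posTags.size() || words.size() != chunkTags.size()) {
            throw new IllegalArgumentException("all layers must have the same length");
        }
        this.words = words;
        this.posTags = posTags;
        this.chunkTags = chunkTags;
    }

    // Combine the parallel layers into one token per position, caching the result.
    List<SketchToken> zip() {
        if (cached == null) {
            List<SketchToken> tokens = new ArrayList<>(words.size());
            for (int i = 0; i < words.size(); i++) {
                tokens.add(new SketchToken(words.get(i), posTags.get(i), chunkTags.get(i)));
            }
            cached = Collections.unmodifiableList(tokens);
        }
        return cached;
    }

    // Tokens in [start, end), returned as a view into the cached list (no copy).
    List<SketchToken> zip(int start, int end) {
        return zip().subList(start, end);
    }
}
```

In this sketch, repeated calls to zip() return the same list instance, and zip(start, end) shares its token objects with that list rather than rebuilding them.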

Parameters:
range - the range of tokens to include
Returns:
a view of the tokens in the specified range

getRange

public edu.washington.cs.knowitall.commonlib.Range getRange(String string)

types

public List<Type> types()
The types associated with this sentence.

Returns:
the types associated with this sentence

tag

public void tag(Iterable<Type> types)
Add a collection of types to this sentence.

Parameters:
types - the types to add

tag

public void tag(Type type)
Add a type to this sentence.

Parameters:
type - the type to add

getRange

public edu.washington.cs.knowitall.commonlib.Range getRange()
Returns:
the range of this sentence

getLemmas

public List<String> getLemmas()
The lemmas of this sentence. Lemmas are normalizations of the token strings.

Specified by:
getLemmas in interface TokenSequence

getLemmas

public List<String> getLemmas(edu.washington.cs.knowitall.commonlib.Range range)
The lemmas of this sentence, constrained to the specified range.

Parameters:
range - the range of tokens whose lemmas are returned
Returns:
the lemmas within the specified range

addExtraction

public void addExtraction(RelationExtraction extraction)
Add an extraction to this sentence.

Parameters:
extraction - the extraction to add

addExtractions

public void addExtractions(Iterable<RelationExtraction> extractions)
Add multiple extractions to this sentence.

Parameters:
extractions - the extractions to add

addExtraction

public void addExtraction(Iterable<edu.washington.cs.knowitall.nlp.extraction.ChunkedBinaryExtraction> extractions)
Add multiple extractions to this sentence. The ReVerb-style extractions are converted into instances of RelationExtraction.

Parameters:
extractions - the ReVerb-style extractions to convert and add

extractions

public List<RelationExtraction> extractions()
The extractions in this sentence.

Returns:
the extractions in this sentence

fromXmlElement

public static Sentence fromXmlElement(org.jdom.Element e)

toXmlElement

public org.jdom.Element toXmlElement()
Specified by:
toXmlElement in interface XmlSerializable

extractions

public static Iterable<RelationExtraction> extractions(Iterable<Sentence> sentences)

makeRegex

public static edu.washington.cs.knowitall.commonlib.regex.RegularExpression<Token> makeRegex(String regex)
Compiles a regular expression over the tokens in a sentence into an NFA. The pattern language is redundant in its expressiveness, largely because it supports pattern matching on the fields of each token. This is not strictly necessary, but it is an optimization and a shorthand (i.e. <pos="NNPS?"> is equivalent to <pos="NNP" | pos="NNPS"> and to (?:<pos="NNP"> | <pos="NNPS">)).

Here are some equivalent examples:

  1. <pos="JJ">* <pos="NNP.">+
  2. <pos="JJ">* <pos="NNPS?">+
  3. <pos="JJ">* <pos="NNP" | pos="NNPS">+
  4. <pos="JJ">* (?:<pos="NNP"> | <pos="NNPS">)+
Note that (3) and (4) are not preferred for efficiency reasons. Regex OR (in example (4)) should only be used on multi-token sequences.
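
The field-level shorthand can be checked in isolation with plain java.util.regex. The sketch below is only an illustration: PosShorthandSketch is a hypothetical class, and it assumes that a field pattern must match the whole tag (matches() rather than find()), which this page does not state.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
final class PosShorthandSketch {
    // Field-level shorthand: a single token test whose value is itself a regex.
    static final Pattern SHORTHAND = Pattern.compile("NNPS?");
    // The same test written as an explicit alternation.
    static final Pattern EXPLICIT = Pattern.compile("NNP|NNPS");

    // True when both patterns accept or both reject the tag (assumes full-tag matching).
    static boolean agree(String tag) {
        return SHORTHAND.matcher(tag).matches() == EXPLICIT.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] tags = {"NNP", "NNPS", "NN", "NNS", "JJ"};
        for (String tag : tags) {
            System.out.println(tag + " shorthand=" + SHORTHAND.matcher(tag).matches()
                    + " explicit=" + EXPLICIT.matcher(tag).matches());
        }
    }
}
```

Under that assumption, both patterns accept exactly the tags NNP and NNPS, so a single field test can replace an alternation over whole tokens.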

The regular expressions support named groups (<name>: ... ), unnamed groups (?: ... ), and capturing groups ( ... ). The operators allowed are +, ?, *, and |. The logic expressions (which describe each token) allow grouping '( ... )', not '!', or '|', and and '&'.
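
To make the token-level semantics concrete, the sketch below hand-codes example (2), <pos="JJ">* <pos="NNPS?">+, over a list of POS tags. It is not the library's NFA and does not call makeRegex: TokenPatternSketch, matchAt, and find are hypothetical names, and each field pattern is assumed to match the whole tag. Greedy scanning without backtracking is sufficient here only because no tag can satisfy both token tests.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical hand-written matcher for one fixed pattern; not the library's NFA.
final class TokenPatternSketch {
    private static final Pattern ADJECTIVE = Pattern.compile("JJ");
    private static final Pattern PROPER_NOUN = Pattern.compile("NNPS?");

    // Longest match of JJ* NNPS?+ beginning at start; returns the exclusive end index, or -1.
    static int matchAt(List<String> posTags, int start) {
        int i = start;
        // <pos="JJ">* : zero or more adjectives
        while (i < posTags.size() && ADJECTIVE.matcher(posTags.get(i)).matches()) {
            i++;
        }
        int firstNoun = i;
        // <pos="NNPS?">+ : one or more proper nouns
        while (i < posTags.size() && PROPER_NOUN.matcher(posTags.get(i)).matches()) {
            i++;
        }
        return i > firstNoun ? i : -1;
    }

    // Leftmost match as a half-open span {start, end}, or null when nothing matches.
    static int[] find(List<String> posTags) {
        for (int start = 0; start < posTags.size(); start++) {
            int end = matchAt(posTags, start);
            if (end >= 0) {
                return new int[] {start, end};
            }
        }
        return null;
    }
}
```

For the tag sequence DT JJ NNP NNPS VBD, find returns the span from index 1 (inclusive) to index 4 (exclusive), covering the adjective and both proper nouns.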

Parameters:
regex - the regular expression to compile
Returns:
the compiled regular expression

convertGroup

public static String convertGroup(edu.washington.cs.knowitall.commonlib.regex.Match.Group<Token> group)

getId

public Long getId()

getTypes

public List<Type> getTypes()
Specified by:
getTypes in interface TokenSequence


Copyright © 2011 University of Washington CSE. All Rights Reserved.