Class TextExtractor

java.lang.Object
it.unimi.dsi.parser.callback.DefaultCallback
it.unimi.dsi.parser.callback.TextExtractor
All Implemented Interfaces:
Callback

@Deprecated public class TextExtractor extends DefaultCallback
Deprecated.
This class is obsolete and kept around for backward compatibility only.
A callback extracting text and titles.

This callback extracts all the text in a page, together with its title. The resulting text is available through text, and the title through title.

Note that text and title are never trimmed.
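The behaviour described above can be approximated with a small, self-contained sketch. It does not use the library: TextExtractorSketch, its StringBuilder fields, and the String-based startElement/endElement signatures are hypothetical stand-ins for this class, its MutableString fields, and the Element-based callbacks. The handling of flow breaks (a single space) is an assumption, not a statement of what the library does.

```java
public class TextExtractorSketch {
    // Hypothetical stand-ins for the public fields `text` and `title`.
    public final StringBuilder text = new StringBuilder();
    public final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    // Reset all state so the same instance can be reused for another document.
    public void startDocument() {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
    }

    // Element names are plain strings here; the real callback receives Element objects.
    public boolean startElement(String name) {
        if (name.equalsIgnoreCase("title")) inTitle = true;
        return true;
    }

    public boolean endElement(String name) {
        if (name.equalsIgnoreCase("title")) inTitle = false;
        return true;
    }

    public boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        // Assumption: a flow break between two text runs is rendered as one space.
        if (flowBroken && text.length() > 0) text.append(' ');
        // The data is copied and nothing is trimmed; the input array is never modified.
        text.append(characters, offset, length);
        if (inTitle) title.append(characters, offset, length);
        return true;
    }

    public static void main(String[] args) {
        TextExtractorSketch extractor = new TextExtractorSketch();
        extractor.startDocument();
        extractor.startElement("title");
        char[] titleChars = " Home ".toCharArray();
        extractor.characters(titleChars, 0, titleChars.length, true);
        extractor.endElement("title");
        char[] bodyChars = "Hello".toCharArray();
        extractor.characters(bodyChars, 0, bodyChars.length, true);
        // Brackets make the untrimmed whitespace visible.
        System.out.println("[" + extractor.title + "]");
        System.out.println("[" + extractor.text + "]");
    }
}
```

Because nothing is trimmed, a caller that needs clean strings has to trim the results itself after parsing.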

  • Field Details

    • text

      public final MutableString text
      Deprecated.
      The text resulting from the parsing process.
    • title

      public final MutableString title
      Deprecated.
      The title resulting from the parsing process.
  • Constructor Details

    • TextExtractor

      public TextExtractor()
      Deprecated.
  • Method Details

    • configure

      public void configure(BulletParser parser)
      Deprecated.
      Configure the parser to parse text.
      Specified by:
      configure in interface Callback
      Overrides:
      configure in class DefaultCallback
    • startDocument

      public void startDocument()
      Deprecated.
      Description copied from interface: Callback
      Receive notification of the beginning of the document.

      The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.

      Specified by:
      startDocument in interface Callback
      Overrides:
      startDocument in class DefaultCallback
    • characters

      public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
      Deprecated.
      Description copied from interface: Callback
      Receive notification of character data inside an element.

      You must not write into characters, as the array could be passed around to many callbacks.

      flowBroken will be true iff the flow was broken before characters. This feature makes it possible to quickly extract the text in a document without looking at the elements.

      Specified by:
      characters in interface Callback
      Overrides:
      characters in class DefaultCallback
      Parameters:
      characters - an array containing the character data.
      offset - the start position in the array.
      length - the number of characters to read from the array.
      flowBroken - whether the flow is broken at the start of characters.
      Returns:
      true to keep the parser parsing, false to stop it.
    • endElement

      public boolean endElement(Element element)
      Deprecated.
      Description copied from interface: Callback
      Receive notification of the end of an element. Warning: unless specific decorators are used, in general a callback will just receive notifications for elements whose closing tag appears explicitly in the document.

      This method will never be called for elements without closing tags, even if such a tag is found.

      Specified by:
      endElement in interface Callback
      Overrides:
      endElement in class DefaultCallback
      Parameters:
      element - the element whose closing tag was found.
      Returns:
      true to keep the parser parsing, false to stop it.
    • startElement

      public boolean startElement(Element element, Map<Attribute,MutableString> attrMapUnused)
      Deprecated.
      Description copied from interface: Callback
      Receive notification of the start of an element.

      For simple elements, this is the only notification that the callback will ever receive.

      Specified by:
      startElement in interface Callback
      Overrides:
      startElement in class DefaultCallback
      Parameters:
      element - the element whose opening tag was found.
      attrMapUnused - a map from Attributes to MutableStrings.
      Returns:
      true to keep the parser parsing, false to stop it.
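
The boolean returned by characters, startElement and endElement tells the parser whether to continue. The sketch below illustrates that contract in isolation, under stated assumptions: CharacterCallback, deliver and firstChars are hypothetical names, the driver loop stands in for the parser, and every chunk is treated as following a flow break. It is not the library's parsing loop.

```java
import java.util.Arrays;
import java.util.List;

public class StopEarlySketch {
    /** Hypothetical, reduced version of the character-data notification. */
    interface CharacterCallback {
        boolean characters(char[] characters, int offset, int length, boolean flowBroken);
    }

    /**
     * Hypothetical driver standing in for the parser: delivers chunks until
     * the callback returns false, and returns how many chunks were delivered.
     */
    static int deliver(List<String> chunks, CharacterCallback callback) {
        int delivered = 0;
        for (String chunk : chunks) {
            char[] buffer = chunk.toCharArray();
            delivered++;
            // Assumption of this sketch: each chunk follows a flow break.
            if (!callback.characters(buffer, 0, buffer.length, true)) break;
        }
        return delivered;
    }

    /** Collects text until at least `limit` characters are gathered, then asks to stop. */
    static String firstChars(List<String> chunks, int limit) {
        StringBuilder text = new StringBuilder();
        deliver(chunks, (characters, offset, length, flowBroken) -> {
            if (flowBroken && text.length() > 0) text.append(' ');
            // Copy the data; the callback never writes into the array it receives.
            text.append(characters, offset, length);
            return text.length() < limit; // false stops the driver
        });
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(firstChars(chunks, 6));
    }
}
```

Returning false as soon as enough text has been gathered lets a callback avoid processing the rest of a long document.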