Class BulletParser
The bullet parser has been written with two specific goals in mind: web crawling and targeted data extraction from massive web data sets. To be usable in such environments, a parser must obey a number of restrictions:
- it should avoid excessive object creation (which, for instance, forbids a significant usage of Java strings);
- it should tolerate invalid syntax and recover reasonably; in fact, it should never throw exceptions;
- it should perform actual parsing only on a settable feature subset: there is no reason to parse the attributes of a P element while searching for links;
- it should parse HTML as a regular language, and leave context-free properties (e.g., stack maintenance and repair) to suitably designed callbacks.
Thus, the bullet parser is in fact not a parser. It is a bunch of spaghetti code that analyses a stream of characters, pretending that it is an (X)HTML document. It takes a very defensive attitude towards the character stream it is parsing, but at the same time it forgives all typical (X)HTML mistakes.
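To make the last restriction concrete, here is a minimal sketch of how a callback might maintain and repair the element stack that the scanner deliberately ignores. The class ElementStackSketch and its methods are hypothetical and are not part of this package; plain strings are used only for brevity, whereas real code would use the shared Element objects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical callback-side helper: keeps the open-element stack that the scanner does not maintain. */
final class ElementStackSketch {
    private final Deque<String> open = new ArrayDeque<>();
    /** Closing events actually emitted downstream, including synthesised ones. */
    final List<String> closed = new ArrayList<>();

    /** Called for every start tag reported by the scanner. */
    void startElement(String name) {
        open.push(name);
    }

    /** Called for every end tag; repairs mis-nesting instead of failing. */
    void endElement(String name) {
        if (!open.contains(name)) return; // stray closing tag: ignore it
        // Implicitly close everything opened after the matching element.
        while (!open.isEmpty()) {
            String top = open.pop();
            closed.add(top);
            if (top.equals(name)) break;
        }
    }

    /** Number of elements currently open. */
    int depth() {
        return open.size();
    }
}
```

With this policy, a stray closing tag is dropped, and closing an outer element also closes any inner elements left open. Other repair policies are equally possible; that choice belongs to the callback, not to the scanner.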
The bullet parser is officially StringFree™. MutableStrings are used for internal processing, and Java strings are used only to return attribute values. All internal maps are reference-based maps from fastutil, which helps to further accelerate the parsing process.
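The idea behind reference-based maps and reusable buffers can be sketched with the standard library alone. In the sketch below, java.util.IdentityHashMap stands in for the fastutil reference maps and StringBuilder stands in for MutableString; the class ReferenceMapSketch and the key objects HREF and SRC are hypothetical.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class ReferenceMapSketch {
    /** Hypothetical canonical attribute objects, standing in for the shared instances of a factory. */
    static final Object HREF = new Object();
    static final Object SRC = new Object();

    /** Identity-keyed map: lookups compare keys with == and never call equals() or hashCode(). */
    static final Map<Object, StringBuilder> VALUES = new IdentityHashMap<>();

    /** Stores a value by refilling a reusable buffer rather than allocating a new String. */
    static void put(Object attribute, char[] text, int offset, int length) {
        StringBuilder buffer = VALUES.computeIfAbsent(attribute, k -> new StringBuilder());
        buffer.setLength(0);
        buffer.append(text, offset, length);
    }
}
```

Because every key is a shared singleton, identity comparison is sufficient, and once a buffer exists for an attribute, later values are copied into it instead of creating new objects.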
HTML data
The bullet parser uses attributes and methods of HTMLFactory, Element, Attribute and Entity. Thus, for instance, whenever an element is to be passed around it is one of the shared objects contained in Element (e.g., Element.BODY).
Callbacks
The result of the parsing process is the invocation of a callback. The callback interface of the bullet parser closely resembles SAX2, but it has some additional methods targeted at (X)HTML, such as Callback.cdata(it.unimi.dsi.parser.Element,char[],int,int), which returns characters found in a CDATA section (e.g., a stylesheet).
Each callback must configure the parser by requesting the analysis and the callbacks it requires. A callback that wants to extract and tokenise text, for instance, will certainly require parseText(true), but not parseTags(true). On the other hand, a callback wishing to extract links will need to parse certain attribute types selectively.
A more precise description follows.
Writing callbacks
The first important issue is what has to be requested of the parser. A newly created parser does not invoke any callback. It is up to every callback to add features so that it can do its job. Remember that since many callbacks can be composed, you must always add features, never remove them; moreover, your callbacks must be ready to be invoked with features they did not request (e.g., attribute types added by another callback).
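The "add, never remove" rule can be illustrated with a small self-contained model. Everything in the sketch below (FeatureFlagsSketch, Configurer, compose) is hypothetical and only mimics the idea of callbacks configuring a shared set of flags; it is not the BulletParser API.

```java
import java.util.List;

/** Hypothetical stand-in for a parser's feature flags; not the BulletParser API. */
final class FeatureFlagsSketch {
    boolean parseTags, parseText;

    /** Hypothetical callback-side hook: each callback requests what it needs. */
    interface Configurer {
        void configure(FeatureFlagsSketch flags);
    }

    /** A link extractor needs tags. */
    static final Configurer LINKS = flags -> flags.parseTags = true;

    /** A tokeniser needs text; it only adds its own flag and leaves the others alone. */
    static final Configurer TEXT = flags -> flags.parseText = true;

    /** Anti-pattern: clears a flag that another callback may depend on. */
    static final Configurer GREEDY_TEXT = flags -> {
        flags.parseText = true;
        flags.parseTags = false;
    };

    /** Starts from all-false flags and lets every callback add its features in turn. */
    static FeatureFlagsSketch compose(List<Configurer> callbacks) {
        FeatureFlagsSketch flags = new FeatureFlagsSketch();
        for (Configurer c : callbacks) c.configure(flags);
        return flags;
    }
}
```

Composing LINKS with TEXT leaves both flags set, while composing LINKS with GREEDY_TEXT silently disables tag parsing and breaks the link extractor. The same model shows why a callback may see events it did not ask for: TEXT will be running alongside tag parsing that LINKS requested.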
The following parse features may be configured; most of them are just boolean features, a.k.a. flags. Unless otherwise specified, all flags are set to false by default (e.g., by default the parser will not parse tags):
- tags (parseTags(boolean) method): whether tags should be parsed;
- attributes (parseAttributes(boolean) and parseAttribute(Attribute) methods): whether attributes should be parsed (of course, setting this flag is useless if you are not parsing tags); note that setting this flag will just activate the attribute parsing feature, but you must also register every attribute whose value you want to obtain;
- text (parseText(boolean) method): whether text should be parsed; if this flag is set, the parser will call the Callback.characters(char[], int, int, boolean) method for every text chunk found;
- CDATA sections (parseCDATA(boolean) method): whether CDATA sections (stylesheets & scripts) should be parsed; if this flag is set, the parser will call the Callback.cdata(Element,char[],int,int) method for every CDATA section found.
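Text and CDATA handlers receive their data as a character array plus an offset and a length, so a handler can work directly on the document buffer without creating strings. The following sketch shows a handler written in that style. WordCounterSketch is hypothetical, and its characters method takes only three arguments: the fourth, boolean argument of Callback.characters is not described in this section and is therefore left out.

```java
/** Hypothetical text handler: receives chunks as (array, offset, length) and never builds a String. */
final class WordCounterSketch {
    /** Total number of words seen so far. */
    int words;

    /** Counts maximal runs of letters or digits in text[offset .. offset + length). */
    void characters(char[] text, int offset, int length) {
        boolean inWord = false; // each chunk is treated independently
        for (int i = offset; i < offset + length; i++) {
            boolean wordChar = Character.isLetterOrDigit(text[i]);
            if (wordChar && !inWord) words++;
            inWord = wordChar;
        }
    }
}
```

A caller would pass the same document array for every chunk and vary only the offset and length, which is what keeps object creation low on large crawls.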
Invoking the parser
After setting the parser callback, you just call parse(char[], int, int).
-
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected Reference2ObjectMap<Attribute,MutableString> | attrMap | Deprecated. A map from attributes to attribute values. |
| protected Callback | callback | Deprecated. The callback of this parser. |
| protected static final TextPattern | CLOSED_CDATA | Deprecated. Closed section (conditional, CDATA, etc.). |
| protected static final TextPattern | CLOSED_COMMENT | Deprecated. Closed comment. |
| protected static final TextPattern | CLOSED_PERCENT | Deprecated. Closed ASP or similar tag. |
| protected static final TextPattern | CLOSED_PIC | Deprecated. Closed processing instruction. |
| protected static final TextPattern | CLOSED_SECTION | Deprecated. Closed section (conditional, etc.). |
| final ParsingFactory | factory | Deprecated. The parsing factory used by this parser. |
| protected static final int | HEXADECIMAL | Deprecated. The base for non-decimal entity. |
| protected char | lastEntity | Deprecated. The character represented by the last scanned entity. |
| protected static final int | MAX_DEC_ENTITY_LENGTH | Deprecated. The maximum number of digits of a decimal numeric entity. |
| protected static final int | MAX_ENTITY_VALUE | Deprecated. The maximum Unicode value accepted for a numeric entity. |
| protected static final int | MAX_HEX_ENTITY_LENGTH | Deprecated. The maximum number of digits of a hexadecimal numeric entity. |
| protected static final char[] | NONSPACE_WHITESPACE | Deprecated. An array containing the non-space whitespace. |
| protected boolean | parseAttributes | Deprecated. Whether we should parse attributes. |
| protected boolean | parseCDATA | Deprecated. Whether we should invoke the CDATA section handler. |
| | parsedAttributes | Deprecated. An externally visible, immutable subset of attributes whose values will be actually parsed. |
| protected ReferenceArraySet<Attribute> | parsedAttrs | Deprecated. The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true). |
| protected boolean | parseTags | Deprecated. Whether we should parse tags. |
| protected boolean | parseText | Deprecated. Whether we should invoke the text handler. |
| protected static final TextPattern | SCRIPT_CLOSE_TAG_PATTERN | Deprecated. Closing tag for a script element. |
| protected static final char[] | SPACE | Deprecated. An array, parallel to NONSPACE_WHITESPACE, containing spaces. |
| protected static final int | STATE_BEFORE_END_TAG_NAME | Deprecated. Scanning a closing tag. |
| protected static final int | STATE_BEFORE_START_TAG_NAME | Deprecated. Scanning attribute name/value pairs. |
| protected static final int | STATE_IN_END_TAG | Deprecated. Scanning a closing tag. |
| protected static final int | STATE_IN_START_TAG | Deprecated. Scanning attribute name/value pairs. |
| protected static final int | STATE_TEXT | Deprecated. Scanning text. |
| protected static final TextPattern | STYLE_CLOSE_TAG_PATTERN | Deprecated. Closing tag for a style element. |
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| BulletParser() | Deprecated. Creates a new bullet parser using the default factory HTMLFactory.INSTANCE. |
| BulletParser(ParsingFactory factory) | Deprecated. Creates a new bullet parser. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected char | entity2Char(MutableString name) | Deprecated. Returns the character corresponding to a given entity name. |
| protected int | handleMarkup(char[] text, int pos, int end) | Deprecated. Handles markup. |
| protected int | handleProcessingInstruction(char[] text, int pos, int end) | Deprecated. Handles processing instruction, ASP tags etc. |
| void | parse(char[] text) | Deprecated. Analyze the text document to extract information. |
| void | parse(char[] text, int offset, int length) | Deprecated. Analyze the text document to extract information. |
| BulletParser | parseAttribute(Attribute attribute) | Deprecated. Adds the given attribute to the set of attributes to be parsed. |
| boolean | parseAttributes() | Deprecated. Returns whether this parser will parse attributes. |
| BulletParser | parseAttributes(boolean parseAttributes) | Deprecated. Sets the attribute parsing flag. |
| boolean | parseCDATA() | Deprecated. Returns whether this parser will invoke the CDATA-section handler. |
| BulletParser | parseCDATA(boolean parseCDATA) | Deprecated. Sets the CDATA-section handler flag. |
| boolean | parseTags() | Deprecated. Returns whether this parser will parse tags and invoke element handlers. |
| BulletParser | parseTags(boolean parseTags) | Deprecated. Sets whether this parser will parse tags and invoke element handlers. |
| boolean | parseText() | Deprecated. Returns whether this parser will invoke the text handler. |
| BulletParser | parseText(boolean parseText) | Deprecated. Sets the text handler flag. |
| protected void | replaceEntities(MutableString s, MutableString entity, boolean loose) | Deprecated. Replaces entities with the corresponding characters. |
| protected int | scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity) | Deprecated. Searches for the end of an entity. |
| BulletParser | setCallback(Callback callback) | Deprecated. Sets the callback for this parser, resetting at the same time all parsing flags. |
-
Field Details
-
STATE_TEXT
protected static final int STATE_TEXT
Deprecated. Scanning text.
-
STATE_BEFORE_START_TAG_NAME
protected static final int STATE_BEFORE_START_TAG_NAME
Deprecated. Scanning attribute name/value pairs.
-
STATE_BEFORE_END_TAG_NAME
protected static final int STATE_BEFORE_END_TAG_NAME
Deprecated. Scanning a closing tag.
-
STATE_IN_START_TAG
protected static final int STATE_IN_START_TAG
Deprecated. Scanning attribute name/value pairs.
-
STATE_IN_END_TAG
protected static final int STATE_IN_END_TAG
Deprecated. Scanning a closing tag.
-
MAX_ENTITY_VALUE
protected static final int MAX_ENTITY_VALUE
Deprecated. The maximum Unicode value accepted for a numeric entity.
-
HEXADECIMAL
protected static final int HEXADECIMAL
Deprecated. The base for non-decimal entity.
-
MAX_HEX_ENTITY_LENGTH
protected static final int MAX_HEX_ENTITY_LENGTH
Deprecated. The maximum number of digits of a hexadecimal numeric entity.
-
MAX_DEC_ENTITY_LENGTH
protected static final int MAX_DEC_ENTITY_LENGTH
Deprecated. The maximum number of digits of a decimal numeric entity.
-
SCRIPT_CLOSE_TAG_PATTERN
Deprecated. Closing tag for a script element.
-
STYLE_CLOSE_TAG_PATTERN
Deprecated. Closing tag for a style element.
-
NONSPACE_WHITESPACE
protected static final char[] NONSPACE_WHITESPACE
Deprecated. An array containing the non-space whitespace.
-
SPACE
protected static final char[] SPACE
Deprecated. An array, parallel to NONSPACE_WHITESPACE, containing spaces.
-
CLOSED_COMMENT
Deprecated. Closed comment. It should be "-->", but mistakes are common.
-
CLOSED_PERCENT
Deprecated. Closed ASP or similar tag.
-
CLOSED_PIC
Deprecated. Closed processing instruction.
-
CLOSED_SECTION
Deprecated. Closed section (conditional, etc.).
-
CLOSED_CDATA
Deprecated. Closed section (conditional, CDATA, etc.).
-
factory
Deprecated. The parsing factory used by this parser.
-
callback
Deprecated. The callback of this parser.
-
attrMap
Deprecated. A map from attributes to attribute values.
-
parseText
protected boolean parseText
Deprecated. Whether we should invoke the text handler.
-
parseCDATA
protected boolean parseCDATA
Deprecated. Whether we should invoke the CDATA section handler.
-
parseTags
protected boolean parseTags
Deprecated. Whether we should parse tags.
-
parseAttributes
protected boolean parseAttributes
Deprecated. Whether we should parse attributes.
-
parsedAttrs
Deprecated. The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true).
-
parsedAttributes
Deprecated. An externally visible, immutable subset of attributes whose values will be actually parsed.
-
lastEntity
protected char lastEntity
Deprecated. The character represented by the last scanned entity.
-
-
Constructor Details
-
BulletParser
Deprecated. Creates a new bullet parser.
-
BulletParser
public BulletParser()
Deprecated. Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.
-
-
Method Details
-
parseText
public boolean parseText()
Deprecated. Returns whether this parser will invoke the text handler.
Returns:
whether this parser will invoke the text handler.
-
parseText
Deprecated. Sets the text handler flag.
Parameters:
parseText - the new value.
Returns:
this parser.
-
parseCDATA
public boolean parseCDATA()
Deprecated. Returns whether this parser will invoke the CDATA-section handler.
Returns:
whether this parser will invoke the CDATA-section handler.
-
parseCDATA
Deprecated. Sets the CDATA-section handler flag.
Parameters:
parseCDATA - the new value.
Returns:
this parser.
-
parseTags
public boolean parseTags()
Deprecated. Returns whether this parser will parse tags and invoke element handlers.
Returns:
whether this parser will parse tags and invoke element handlers.
-
parseTags
Deprecated. Sets whether this parser will parse tags and invoke element handlers.
Parameters:
parseTags - the new value.
Returns:
this parser.
-
parseAttributes
public boolean parseAttributes()
Deprecated. Returns whether this parser will parse attributes.
Returns:
whether this parser will parse attributes.
-
parseAttributes
Deprecated. Sets the attribute parsing flag.
Parameters:
parseAttributes - the new value for the flag.
Returns:
this parser.
-
parseAttribute
Deprecated. Adds the given attribute to the set of attributes to be parsed.
Parameters:
attribute - an attribute that should be parsed.
Returns:
this parser.
Throws:
IllegalStateException - if parseAttributes(true) has not been invoked on this parser.
-
setCallback
Deprecated. Sets the callback for this parser, resetting at the same time all parsing flags.
Parameters:
callback - the new callback.
Returns:
this parser.
-
entity2Char
Deprecated. Returns the character corresponding to a given entity name.
Parameters:
name - the name of an entity.
Returns:
the character corresponding to the entity, or an ASCII NUL if no entity with that name was found.
-
scanEntity
Deprecated. Searches for the end of an entity.
This method will search for the end of an entity starting at the given offset (the offset must correspond to the ampersand).
Real-world HTML pages often contain hundreds of misplaced ampersands, due to the unfortunate idea of using the ampersand as query separator (please use the comma in new code!). All such ampersands should be specified as &amp;. If named entities are delimited using a transition from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter loose is false, named entities can be delimited only by whitespace or by a comma.
Parameters:
a - a character array containing the entity.
offset - the offset at which the entity starts (the offset must point at the ampersand).
length - an upper bound to the maximum returned position.
loose - if true, named entities can be terminated by any non-alphabetical character (instead of whitespace or comma).
entity - a support mutable string used to query ParsingFactory.getEntity(MutableString).
Returns:
the position of the last character of the entity, or -1 if no entity was found.
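The difference between strict and loose delimiting can be sketched as follows. This is not the implementation of scanEntity: the class EntityScanSketch and the method scanNamedEntity are hypothetical, the method takes an exclusive end index rather than the length parameter described above, it handles named entities only, and it does not look the name up in a ParsingFactory. It also assumes that a semicolon always terminates an entity, which the description above does not state.

```java
final class EntityScanSketch {
    /**
     * Returns the index of the last character of a named entity starting at offset
     * (which must point at '&'), or -1 if none is recognised before end (exclusive).
     * Assumption: a semicolon always terminates the entity; otherwise, in strict mode
     * only whitespace or a comma does, while in loose mode any non-alphabetical
     * character does.
     */
    static int scanNamedEntity(char[] a, int offset, int end, boolean loose) {
        end = Math.min(end, a.length);
        if (offset < 0 || offset >= end || a[offset] != '&') return -1;
        int i = offset + 1;
        while (i < end && Character.isLetter(a[i])) i++;
        if (i == offset + 1) return -1; // no name after the ampersand
        if (i == end) return -1;        // input exhausted before any delimiter
        char c = a[i];
        if (c == ';') return i;         // the semicolon belongs to the entity
        if (loose || Character.isWhitespace(c) || c == ',') return i - 1;
        return -1;                      // e.g. "&b=2" inside a query string
    }
}
```

On the query string x?a=1&b=2, the strict mode rejects the ampersand, while the loose mode reports b as a candidate entity name. This is the kind of false positive the description warns about; checking the name against the known entities would reduce such matches further.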
-
replaceEntities
Deprecated. Replaces entities with the corresponding characters.
This method will modify the mutable string s so that all legal occurrences of entities are replaced by the corresponding character.
Parameters:
s - a mutable string whose entities will be replaced by the corresponding characters.
entity - a support mutable string used by scanEntity(char[], int, int, boolean, MutableString).
loose - a parameter that will be passed to scanEntity(char[], int, int, boolean, MutableString).
-
handleMarkup
protected int handleMarkup(char[] text, int pos, int end)
Deprecated. Handles markup.
Parameters:
text - the text.
pos - the first character in the markup after <!.
end - the end of text.
Returns:
the position of the first character after the markup.
-
handleProcessingInstruction
protected int handleProcessingInstruction(char[] text, int pos, int end)
Deprecated. Handles processing instruction, ASP tags etc.
Parameters:
text - the text.
pos - the first character in the markup after <%.
end - the end of text.
Returns:
the position of the first character after the processing instruction.
-
parse
public void parse(char[] text)
Deprecated. Analyze the text document to extract information.
Parameters:
text - a char array of text to be parsed.
-
parse
public void parse(char[] text, int offset, int length)
Deprecated. Analyze the text document to extract information.
Parameters:
text - a char array of text to be parsed.
offset - the offset in the array from which the parsing will begin.
length - the number of characters to be parsed.
-