Pypes - Components

Components

The real power behind pypes, and flow-based programming in general, is the modular architecture it provides. This modularity allows application logic to be broken down into smaller, more manageable components that act as building blocks. By themselves, components don’t provide much functionality, but when combined they can create complex data flow networks.

Below is a list of components provided with pypes and Visual Design Studio. If you’re even remotely familiar with Python, creating custom components is extremely easy as well.

Pypes
The core pypes framework ships with a few components mainly for the sake of testing and examples.

•Null - Works like Unix /dev/null
•TextFileInputReader - Reads file input in a line-oriented manner
•StringInputReader - Reads string input (used for passing in string data)
•ConsoleOutputWriter - Writes output to stdout
•Grep - Like Unix grep command
•Sort - Like Unix sort command
•BinarySplit - Splits the stream sending a copy out each output port
•Uniq - Like Unix uniq command
•Cut - Like Unix cut command
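
To give a feel for how the Unix-like components above compose, here is a minimal sketch in plain Python. It does not use the pypes API; the functions grep, sort_lines, and uniq are hypothetical stand-ins that only mimic the behavior described for Grep, Sort, and Uniq.

```python
import re
from itertools import groupby


def grep(lines, pattern):
    """Yield only the lines that match the regular expression (like Unix grep)."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def sort_lines(lines):
    """Return the lines in sorted order (like Unix sort)."""
    return sorted(lines)


def uniq(lines):
    """Collapse runs of adjacent duplicate lines (like Unix uniq)."""
    return (key for key, _ in groupby(lines))


lines = ["pear", "apple", "plum", "apple", "apricot"]
# Chain the stages, much as components are wired together in a flow network.
result = list(uniq(sort_lines(grep(lines, r"^a"))))
print(result)
```

Each stage consumes the output of the previous one, which is the same idea as connecting an output port to an input port.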

Visual Design Studio
The pypes Visual Design Studio offers a wider and more versatile set of components. VDS also provides a plugin architecture that allows new components to easily be added to the system. Components are organized into several different categories.
Adapters
Adapters are special components that adapt the incoming data into a model that the pypes framework can understand. They are used to convert incoming streams of data into “packets” of information that all other components can then operate on in a uniform manner.
CsvReader
Description: Converts CSV (comma-separated values) files into a stream of packets.
NOTE: Assumes the dialect is Excel (see http://docs.python.org/library/csv.html#dialects-and-formatting-parameters for details on dialects). There are plans to provide additional dialects with separators other than commas, etc.
Parameters: None
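
As a rough illustration of what converting a CSV file into packets might look like, the sketch below uses Python's standard csv module with the Excel dialect. Plain dicts stand in for packets here, and csv_to_packets is a hypothetical helper, not the actual CsvReader implementation.

```python
import csv
import io


def csv_to_packets(stream):
    """Yield one dict per CSV row, keyed by the column names in the header row."""
    reader = csv.DictReader(stream, dialect="excel")
    for row in reader:
        yield dict(row)


# An in-memory file keeps the example self-contained.
sample = io.StringIO("name,city\nAda,London\nGrace,New York\n")
packets = list(csv_to_packets(sample))
print(packets)
```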

HTML
Description: Converts HTML files into packets. Uses BeautifulSoup and is pretty simple at the moment. There are plans to do some extensive work here on providing a rule based language that can control the parser at runtime (sort of like Kapow). This would allow the user to specify which elements of the incoming HTML to pull out based on matching CSS classes, element names, or regular expressions.
Parameters: None

PDF
Description: Converts PDF files into packets. This is a pure Python implementation. If you’re working with high volumes or very large PDF documents, you should consider using a C/C++ wrapped implementation in order to get better performance.
Parameters: None
Word2007
Description: Converts Microsoft Word 2007 documents into packets. We only support 2007 (or 2008 on Mac) because these documents use the new open standard. There are converters for previous versions of Word online but we don’t package any because they are written in C/C++ or Java. You should be able to find one that fits your needs and wrap it in a component.
Parameters: None
XML
Description: Converts XML content into packet streams. A mapping file is used to instruct the XML parser on which elements/attributes to extract and where to place them in the packet. This implementation uses cElementTree and is extremely fast as far as XML parsers go (about 2x as fast as libxml). Batch submissions are supported (need to provide an example/tutorial of this).
Parameters:
•MapFile: The mapfile containing the mappings to be used (see sample-xmlmap.xml under etc/ for an example)
Transformers
Transformers typically mutate packet data or packet fields. For instance, the CaseNormalizer component would set a standard case for a defined packet field. Likewise, the CopyField component would create a copy of the specified packet field.
CaseNormalizer
Description: Converts the data in a single packet field to one standard case; lower, upper, or title.
Parameters:
•Field: The packet field to operate on
•Operation: The type of normalization logic to use; choices are Lowercase, Uppercase, or TitleCase
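
The Field and Operation parameters can be pictured with a small sketch. This is an assumption about the behavior rather than the component's real code: packets are modelled as plain dicts and normalize_case is a hypothetical helper.

```python
def normalize_case(packet, field, operation):
    """Rewrite packet[field] in the requested case; leave other fields untouched."""
    operations = {
        "Lowercase": str.lower,
        "Uppercase": str.upper,
        "TitleCase": str.title,
    }
    if field in packet:
        packet[field] = operations[operation](packet[field])
    return packet


packet = normalize_case({"name": "ada LOVELACE", "id": "A1"}, "name", "TitleCase")
print(packet)
```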
CopyField
Description: Copies the data from one packet field to another
Parameters:
•SrcField: The packet field to be copied
•DstField: The packet field acting as the destination (if the destination field already contains data, it is overwritten; if the field does not exist, it is created).

DateFormatter
Description: Formats a packet field containing a date string into the specified date format string. Useful for normalizing date strings.
Parameters:
•InFormat: The current format of the string (see http://docs.python.org/library/datetime.html#strftime-behavior for details)
•Field: The field containing the data to be formatted
•OutFormat: The desired output format to be used
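
The InFormat and OutFormat parameters take the format codes from Python's datetime module. A minimal sketch of the conversion, assuming a dict as the packet and a hypothetical format_date helper, could look like this:

```python
from datetime import datetime


def format_date(packet, field, in_format, out_format):
    """Parse packet[field] using in_format and rewrite it using out_format."""
    parsed = datetime.strptime(packet[field], in_format)
    packet[field] = parsed.strftime(out_format)
    return packet


packet = format_date(
    {"published": "03/15/2009"}, "published", "%m/%d/%Y", "%Y-%m-%d"
)
print(packet)
```

Note that strptime raises ValueError when the field does not match InFormat, so a real component would need to decide how to handle such packets.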

DeleteField
Description: Deletes a specified field from the packet
Parameters:
•Field: The packet field to be deleted

GeoCode
Description: Provides the latitude & longitude for a given address or postal code contained in a packet field.
Parameters:
•AddressField: The packet field containing the address to be used for the geocoding process.

RenameField
Description: Renames a packet field
Parameters:
•NewName: The new name of the packet field
•OldName: The packet field to be renamed

SetField
Description: Sets a static value to a specified packet field
Parameters:
•Field: The packet field where the static data should be placed
•Value: The static value you wish to place into the field

XMLMapper
Description: Maps XML content into packet fields using a subset of XPath operations. A mapping file is used to instruct the XML parser on which elements/attributes to extract and where to place them in the packet. This implementation uses cElementTree and is extremely fast as far as XML parsers go (about 2x as fast as libxml).
Parameters:
•InputField: The packet field containing the XML payload/content
•MapFile: The map file to use for all mappings (see sample-xmlmap.xml under etc/ for an example)
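
The general idea of mapping XML content into packet fields can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The format of the real map file is not shown here, so the mapping below is a hypothetical dict of destination field to element path, and map_xml is an illustrative helper rather than the component's code.

```python
import xml.etree.ElementTree as ET


def map_xml(packet, input_field, mapping):
    """Copy the text of each mapped element into the named packet field."""
    root = ET.fromstring(packet[input_field])
    for dst_field, path in mapping.items():
        value = root.findtext(path)
        if value is not None:  # skip paths that match nothing
            packet[dst_field] = value
    return packet


xml_payload = "<book><title>Flow</title><author><name>Ada</name></author></book>"
# Hypothetical mapping: packet field name -> path relative to the root element.
mapping = {"title": "title", "author": "author/name", "isbn": "isbn"}
packet = map_xml({"xml": xml_payload}, "xml", mapping)
print(packet["title"], packet["author"])
```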
Filters
Filter components perform filtering operations on a stream of packets. For instance, the Drop component will exclude (destroy) packets matching specified characteristics.
Drop
Description: Destroys a packet based on matching field criteria (if field contains a value of ‘X’ then drop)
Parameters:
•Field: The packet field to check
•DropValue: The field value to be matched (if the field matches this value the packet is destroyed)
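
A filter of this kind can be pictured as a generator that skips matching packets. The sketch below assumes an exact equality match between the field and DropValue, which may differ from the component's actual matching rule; drop is a hypothetical helper and packets are plain dicts.

```python
def drop(packets, field, drop_value):
    """Yield every packet except those whose field equals drop_value."""
    for packet in packets:
        if packet.get(field) == drop_value:
            continue  # the packet is discarded and never reaches downstream
        yield packet


stream = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "spam"},
    {"id": 3},
]
kept = list(drop(stream, "status", "spam"))
print(kept)
```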
Operators
Operator components perform operations on a packet stream such as splitting or merging it.
Merge
Description: Merges the two incoming packets into one single packet (useful for combining packets).
NOTE: this DOES NOT merge streams, it merges actual packets in a stream. I’m working on a Collate component that will merge streams.
Parameters: None

Split
Description: Splits the current packet stream by cloning the packets and sending on both outputs
Parameters: None
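
Cloning matters because the two outputs may be modified independently downstream. As a sketch of the idea only (split is a hypothetical function and the outputs are modelled as two lists, not real ports):

```python
import copy


def split(packets):
    """Send each packet to the first output and a deep copy to the second."""
    out_a, out_b = [], []
    for packet in packets:
        out_a.append(packet)
        out_b.append(copy.deepcopy(packet))
    return out_a, out_b


out_a, out_b = split([{"id": 1, "tags": ["new"]}])
# Changing a packet on one branch leaves its clone on the other branch intact.
out_a[0]["tags"].append("seen")
print(out_a, out_b)
```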
Extractors
Extractors perform extraction logic on packet data. For instance, the EmailExtractor would identify email addresses contained in a specified packet field (free text) and copy them into a new field. This is useful in search indexes where these values can then be used for faceted browsing.
AddressExtractor
Description: Uses a regular expression to identify street addresses
Parameters:
•SrcField: The source packet field to perform the extraction on
•DstField: The destination packet field where the extracted data will be stored

EmailExtractor
Description: Uses a regular expression to identify email addresses
Parameters:
•SrcField: The source packet field to perform the extraction on
•DstField: The destination packet field where the extracted data will be stored
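
A minimal sketch of regex-based extraction follows. The pattern is a deliberately simple assumption and not the expression the component actually uses; it will miss some valid addresses. The helper extract_emails is hypothetical, and packets are plain dicts.

```python
import re

# Simplified pattern for illustration; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(packet, src_field, dst_field):
    """Store every email address found in src_field as a list in dst_field."""
    packet[dst_field] = EMAIL_RE.findall(packet.get(src_field, ""))
    return packet


packet = extract_emails(
    {"body": "Contact ada@example.com or grace@example.org."}, "body", "emails"
)
print(packet["emails"])
```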
Publishers
Publishers are network end points. They typically provide a mechanism whereby the packet data is converted back into a format that consuming applications expect. For instance, the Solr publisher converts packets into a representation that can then be used to index the data. Likewise, the CsvWriter would produce a comma-separated file containing the data.
Note: A packet does not necessarily represent a single producible unit. The CsvReader would convert each row of the file into a packet and the CsvWriter would reconstruct a final output document that reassembles all of these packets. The same can be said of the Solr publisher which would assemble the packets into a batch submission.
CsvWriter
Description: Produces a CSV file based on a packet stream
Parameters:
•Fields: The packet fields to be used as column values (a value of _originals_ can be used to map all original fields assuming that the incoming data was produced by the CsvReader -- because then the Writer knows the column names used)
•File: The name of the output file (should be a full path; otherwise the file location is relative to the directory where pypesvds was installed)
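
To illustrate how a stream of packets might be reassembled into one CSV document, here is a sketch using the standard csv module. It writes to an in-memory buffer instead of a file, does not implement the _originals_ shortcut, and write_csv is a hypothetical helper rather than the CsvWriter code.

```python
import csv
import io


def write_csv(packets, fields, stream):
    """Write a header row, then one row per packet using only the named fields."""
    writer = csv.DictWriter(stream, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for packet in packets:
        writer.writerow(packet)


# newline="" is what the csv documentation recommends for output streams.
buffer = io.StringIO(newline="")
write_csv([{"name": "Ada", "city": "London", "id": 7}], ["name", "city"], buffer)
output = buffer.getvalue()
print(output)
```

Fields that are not listed (id above) are ignored, and a listed field that is missing from a packet is written as an empty value.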

Debug
Description: Used to dump packets to the console (must start pypesvds in non-daemon mode). Packet info is formatted and readable.
TODO: This should really dump data to the log file.
Parameters: None

FastXML
Description: Publishes packet streams to FAST (ESP) XML so that the data can be indexed for search
Parameters:
•Encoding: The encoding to use for the output
•OutputDirectory: The directory where the output files should be stored.

Null
Description: Sends packet stream to /dev/null essentially destroying it. Useful for testing a system without actually committing or publishing the data to the actual end-point. Debug can do this as well but can be very verbose. The Null component will simply log a message stating that it received data.
Parameters: None

Solr
Description: Publishes packet streams to SOLR for indexing.
TODO: Need to add support for SOLR multi-core systems (see http://wiki.apache.org/solr/CoreAdmin)
Parameters:
•Host: The host of the SOLR indexing master
•Port: The port of the SOLR indexing master