Copyright © 2009. All rights reserved.

Components

  1. The real power behind pypes, and flow-based programming in general, is the modular architecture it provides. This modularity allows application logic to be broken down into smaller, more manageable components that act as building blocks. By themselves, components don’t provide much functionality, but when combined they can create complex data flow networks.


  1. Below is a list of components provided with pypes and Visual Design Studio. If you’re even remotely familiar with Python, creating custom components is easy as well.


  1. Pypes

  2. The core pypes framework ships with a few components mainly for the sake of testing and examples.


  1. Null - works like Unix /dev/null

  2. TextFileInputReader - Reads file input in a line-oriented manner

  3. StringInputReader - Reads string input (used for passing in string data)

  4. ConsoleOutputWriter - Writes output to stdout

  5. Grep - Like Unix grep command

  6. Sort - Like Unix sort command

  7. BinarySplit - Splits the stream, sending a copy out of each output port

  8. Uniq - Like Unix uniq command

  9. Cut - Like Unix cut command
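
To illustrate how small components combine into a flow, here is a minimal sketch in plain Python. It does not use the pypes API; the functions grep, sort, and uniq below are hypothetical stand-ins that only mimic the behavior of the Unix commands the components are named after, chained together as generators.

```python
import re

def grep(lines, pattern):
    """Pass through only the lines matching the regular expression."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

def sort(lines):
    """Sorting needs the whole stream, so it is consumed before emitting."""
    return iter(sorted(lines))

def uniq(lines):
    """Drop a line when it equals the line immediately before it."""
    previous = object()  # sentinel that never equals a real line
    for line in lines:
        if line != previous:
            yield line
        previous = line

lines = ["pear", "apple", "apple pie", "pear", "plum"]
# Equivalent in spirit to: grep '^p' | sort | uniq
result = list(uniq(sort(grep(lines, r"^p"))))
```

Each stage only knows about the stream it receives, which is what makes the stages interchangeable building blocks.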


  1. Visual Design Studio

  2. The pypes Visual Design Studio offers a wider and more versatile set of components. VDS also provides a plugin architecture that allows new components to easily be added to the system. Components are organized into several different categories.


  3. Adapters

  4. Adapters are special components that adapt the incoming data into a model that the pypes framework can understand. They are used to convert incoming streams of data into “packets” of information that all other components can then operate on in a uniform manner.


  5. CsvReader

  6. Description: Converts CSV (comma-separated values) files into a stream of packets.

  7. NOTE: Assumes the dialect is Excel (see http://docs.python.org/library/csv.html#dialects-and-formatting-parameters for details on dialects). There are plans to provide additional dialects with separators other than commas, etc.

  8. Parameters: None
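
As a rough illustration of what converting CSV into packets can look like, the sketch below uses Python's standard csv module with the Excel dialect. It is not the CsvReader source; representing a packet as a plain dict and treating the first row as column names are assumptions made for the example.

```python
import csv
import io

def csv_to_packets(stream):
    """Yield one dict per CSV row, keyed by the header row (Excel dialect)."""
    reader = csv.DictReader(stream, dialect="excel")
    for row in reader:
        yield dict(row)

# An in-memory file stands in for a real CSV file here.
sample = io.StringIO("name,city\nAda,London\nLinus,Helsinki\n")
packets = list(csv_to_packets(sample))
```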


  1. HTML

  2. Description: Converts HTML files into packets. Uses BeautifulSoup and is fairly simple at the moment. There are plans to do extensive work on a rule-based language that can control the parser at runtime (somewhat like Kapow). This would allow the user to specify which elements of the incoming HTML to pull out based on matching CSS classes, element names, or regular expressions.

  3. Parameters: None


  1. PDF

  2. Description: Converts PDF files into packets. This is a pure Python implementation. If you’re working with high volumes or very large PDF documents, you should consider using a C/C++ wrapped implementation to get better performance.

  3. Parameters: None


  4. Word2007

  5. Description: Converts Microsoft Word 2007 documents into packets. We only support 2007 (or 2008 on Mac) because these documents use the new open standard. There are converters for previous versions of Word online but we don’t package any because they are written in C/C++ or Java. You should be able to find one that fits your needs and wrap it in a component.

  6. Parameters: None


  7. XML

  8. Description: Converts XML content into packet streams. A mapping file is used to instruct the XML parser on which elements/attributes to extract and where to place them in the packet. This implementation uses cElementTree and is extremely fast as far as XML parsers go (about 2x as fast as libxml). Batch submissions are supported (an example/tutorial of this still needs to be provided).

  9. Parameters:

  10. MapFile: The mapfile containing the mappings to be used (see sample-xmlmap.xml under etc/ for an example)



  11. Transformers

  12. Transformers typically mutate packet data or packet fields. For instance, the CaseNormalizer component would set a standard case for a defined packet field. Likewise, the CopyField component would create a copy of the specified packet field.


  13. CaseNormalizer

  14. Description: Converts the data in a single packet field to one standard case; lower, upper, or title.

  15. Parameters:

  16. Field: The packet field to operate on

  17. Operation: The type of normalization logic to use; choices are Lowercase, Uppercase, or TitleCase
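
The three operations map naturally onto Python string methods. The following is a minimal sketch under that assumption, with a packet modeled as a plain dict; the function name normalize_case is hypothetical and is not part of pypes.

```python
# Operation names follow the choices listed above; the mapping to
# str methods is an assumption made for this sketch.
OPERATIONS = {
    "Lowercase": str.lower,
    "Uppercase": str.upper,
    "TitleCase": str.title,
}

def normalize_case(packet, field, operation):
    """Return a copy of the packet with one field converted to the given case."""
    result = dict(packet)
    if field in result:
        result[field] = OPERATIONS[operation](result[field])
    return result

packet = normalize_case({"name": "ada LOVELACE"}, "name", "TitleCase")
```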


  18. CopyField

  19. Description: Copies the data from one packet field to another

  20. Parameters:

  21. SrcField: The packet field to be copied

  22. DstField: The packet field acting as the destination (if the destination field already contains data, it is overwritten; otherwise the field is created).


  1. DateFormatter

  2. Description: Reformats a packet field containing a date string into the specified output format. Useful for normalizing date strings.

  3. Parameters:

  4. InFormat: The current format of the string (see http://docs.python.org/library/datetime.html#strftime-behavior for details)

  5. Field: The field containing the data to be formatted

  6. OutFormat: The desired output format to be used
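
Since both formats use the strftime/strptime directives linked above, the conversion can be sketched with the standard datetime module. The packet-as-dict model and the function name format_date are assumptions for illustration only.

```python
from datetime import datetime

def format_date(packet, field, in_format, out_format):
    """Parse the field with in_format and rewrite it using out_format."""
    result = dict(packet)
    parsed = datetime.strptime(result[field], in_format)
    result[field] = parsed.strftime(out_format)
    return result

packet = format_date(
    {"published": "03/15/2009"}, "published", "%m/%d/%Y", "%Y-%m-%d"
)
```

Note that strptime raises ValueError when the string does not match InFormat; how the real component handles that case is not stated here.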


  1. DeleteField

  2. Description: Deletes a specified field from the packet

  3. Parameters:

  4. Field: The packet field to be deleted


  1. GeoCode

  2. Description: Provides the latitude & longitude for a given address or postal code contained in a packet field.

  3. Parameters:

  4. AddressField: The packet field containing the address to be used for the geocoding process.


  1. RenameField

  2. Description: Renames a packet field

  3. Parameters:

  4. NewName: The new name of the packet field

  5. OldName: The packet field to be renamed


  1. SetField

  2. Description: Sets a static value to a specified packet field

  3. Parameters:

  4. Field: The packet field where the static data should be placed

  5. Value: The static value you wish to place into the field
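
The field-level transformers described above (CopyField, DeleteField, RenameField, and SetField) are each a small dictionary operation if a packet is modeled as a plain dict. That model, and the function names below, are assumptions for this sketch rather than the pypes implementation.

```python
def copy_field(packet, src_field, dst_field):
    """Copy src_field into dst_field, overwriting or creating it."""
    result = dict(packet)
    if src_field in result:
        result[dst_field] = result[src_field]
    return result

def delete_field(packet, field):
    """Remove the field if it is present."""
    result = dict(packet)
    result.pop(field, None)
    return result

def rename_field(packet, old_name, new_name):
    """Move the value stored under old_name to new_name."""
    result = dict(packet)
    if old_name in result:
        result[new_name] = result.pop(old_name)
    return result

def set_field(packet, field, value):
    """Place a static value into the field."""
    result = dict(packet)
    result[field] = value
    return result

packet = {"title": "Flow", "tmp": "x"}
packet = copy_field(packet, "title", "heading")
packet = delete_field(packet, "tmp")
packet = rename_field(packet, "title", "name")
packet = set_field(packet, "source", "demo")
```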


  1. XMLMapper

  2. Description: Maps XML content into packet fields using a subset of XPath operations. A mapping file is used to instruct the XML parser on which elements/attributes to extract and where to place them in the packet. This implementation uses cElementTree and is extremely fast as far as XML parsers go (about 2x as fast as libxml).

  3. Parameters:

  4. InputField: The packet field containing the XML payload/content

  5. MapFile: The map file to use for all mappings (see sample-xmlmap.xml under etc/ for an example)
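
The idea of mapping XML into packet fields can be sketched with xml.etree.ElementTree, which supports a limited XPath subset. The real map file format is defined by sample-xmlmap.xml and is not reproduced here; the FIELD_MAP dict, the function name map_xml, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: packet field name -> ElementTree path expression.
FIELD_MAP = {"title": "head/title", "author": ".//author"}

def map_xml(packet, input_field, field_map):
    """Parse the XML held in input_field and copy mapped values into the packet."""
    result = dict(packet)
    root = ET.fromstring(result[input_field])
    for field, path in field_map.items():
        value = root.findtext(path)
        if value is not None:  # skip mappings that match nothing
            result[field] = value
    return result

xml_payload = (
    "<doc><head><title>Flow</title></head>"
    "<body><author>Kim</author></body></doc>"
)
packet = map_xml({"payload": xml_payload}, "payload", FIELD_MAP)
```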


  6. Filters

  7. Filter components perform filtering operations on a stream of packets. For instance, the Drop component will exclude (destroy) packets matching specified characteristics.


  8. Drop

  9. Description: Destroys a packet based on matching field criteria (if field contains a value of ‘X’ then drop)

  10. Parameters:

  11. Field: The packet field to check

  12. DropValue: The field value to be matched (if the field matches this value the packet is destroyed)
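
A filter of this kind can be sketched as a generator that skips matching packets. Packets are modeled as dicts and the match is assumed to be exact equality; the function name drop is hypothetical.

```python
def drop(packets, field, drop_value):
    """Yield every packet except those whose field equals drop_value."""
    for packet in packets:
        if packet.get(field) == drop_value:
            continue  # the packet is discarded
        yield packet

stream = [{"status": "ok"}, {"status": "spam"}, {"id": 3}]
kept = list(drop(stream, "status", "spam"))
```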


  13. Operators

  14. Operator components perform operations on a packet stream such as splitting or merging it.


  15. Merge

  16. Description: Merges the two incoming packets into a single packet (useful for combining packets).

  17. NOTE: This does NOT merge streams; it merges actual packets in a stream. I’m working on a Collate component that will merge streams.

  18. Parameters: None


  1. Split

  2. Description: Splits the current packet stream by cloning the packets and sending on both outputs

  3. Parameters: None
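
The difference between the two operators can be sketched as follows, with packets modeled as dicts. How the real Merge component resolves fields present in both packets is not documented here; letting the second packet win is an assumption, as are the function names.

```python
import copy

def merge(first, second):
    """Combine two packets into one (assumption: second wins on key clashes)."""
    merged = dict(first)
    merged.update(second)
    return merged

def split(packets):
    """Send each packet to one output and an independent clone to the other."""
    out_a, out_b = [], []
    for packet in packets:
        out_a.append(packet)
        out_b.append(copy.deepcopy(packet))
    return out_a, out_b

merged = merge({"id": 1, "title": "Flow"}, {"id": 1, "tags": ["etl"]})
out_a, out_b = split([merged])
out_b[0]["tags"].append("changed")  # does not affect the packet on out_a
```

Cloning with a deep copy keeps downstream components on one output from seeing changes made on the other.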


  4. Extractors

  5. Extractors perform extraction logic on packet data. For instance, the EmailExtractor identifies email addresses contained in a specified packet field (free text) and copies them into a new field. This is useful in search indexes, where these values can then be used for faceted browsing.


  6. AddressExtractor

  7. Description: Uses a regular expression to identify street addresses

  8. Parameters:

  9. SrcField: The source packet field to perform the extraction on

  10. DstField: The destination packet field where the extracted data will be stored


  1. EmailExtractor

  2. Description: Uses a regular expression to identify email addresses

  3. Parameters:

  4. SrcField: The source packet field to perform the extraction on

  5. DstField: The destination packet field where the extracted data will be stored
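
Regex-based extraction from a free-text field can be sketched as below. The pattern is a deliberately simple approximation and is not the expression used by the EmailExtractor component; the function name and the choice to store matches as a list are also assumptions.

```python
import re

# Simplified pattern for illustration; real-world email matching is looser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(packet, src_field, dst_field):
    """Copy every email address found in src_field into dst_field."""
    result = dict(packet)
    result[dst_field] = EMAIL_RE.findall(result.get(src_field, ""))
    return result

packet = extract_emails(
    {"body": "Contact ada@example.com or linus@example.org."},
    "body",
    "emails",
)
```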


  6. Publishers

  7. Publishers are network end points. They typically provide a mechanism whereby the packet data is converted back into a format that consuming applications expect. For instance, the Solr publisher converts packets into a representation that can then be used to index the data. Likewise, the CsvWriter would produce a comma-separated file containing the data.


  8. Note: A packet does not necessarily represent a single producible unit. The CsvReader would convert each row of the file into a packet and the CsvWriter would reconstruct a final output document that reassembles all of these packets. The same can be said of the Solr publisher which would assemble the packets into a batch submission.


  9. CsvWriter

  10. Description: Produces a CSV file based on a packet stream

  11. Parameters:

  12. Fields: The packet fields to be used as column values (a value of _originals_ can be used to map all original fields assuming that the incoming data was produced by the CsvReader -- because then the Writer knows the column names used)

  13. File: The name of the output file (should be a full path; otherwise the file location is relative to the directory where pypesvds was installed)
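
Reassembling a packet stream into a single CSV document can be sketched with csv.DictWriter. This is not the CsvWriter source: packets are modeled as dicts, the function name write_csv is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_csv(packets, fields, stream):
    """Write the chosen packet fields as columns, one row per packet."""
    writer = csv.DictWriter(
        stream, fieldnames=fields, dialect="excel", extrasaction="ignore"
    )
    writer.writeheader()
    for packet in packets:
        writer.writerow(packet)  # fields missing from a packet are left empty

buffer = io.StringIO()
write_csv(
    [{"name": "Ada", "city": "London", "tmp": 1}, {"name": "Linus"}],
    ["name", "city"],
    buffer,
)
rows = buffer.getvalue().splitlines()
```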


  1. Debug

  2. Description: Used to dump packets to the console (pypesvds must be started in non-daemon mode). Packet info is formatted and readable.

  3. TODO: This should really dump data to the log file.

  4. Parameters: None


  1. FastXML

  2. Description: Publishes packet streams to FAST (ESP) XML so that the data can be indexed for search

  3. Parameters:

  4. Encoding: The encoding to use for the output

  5. OutputDirectory: The directory where the output files should be stored.


  1. Null

  2. Description: Sends the packet stream to /dev/null, essentially destroying it. Useful for testing a system without actually committing or publishing the data to the real endpoint. Debug can do this as well but can be very verbose. The Null component simply logs a message stating that it received data.

  3. Parameters: None


  1. Solr

  2. Description: Publishes packet streams to SOLR for indexing.

  3. TODO: Need to add support for SOLR multi-core systems (see http://wiki.apache.org/solr/CoreAdmin)

  4. Parameters:

  5. Host: The host of the SOLR indexing master

  6. Port: The port of the SOLR indexing master
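
For orientation, Solr's XML update format wraps documents in an add element, so a batch of packets can be serialized as sketched below. This is not the Solr publisher's code: packets are modeled as dicts, the function names are hypothetical, and the /solr/update path is an assumed default that may differ per installation. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

def build_solr_add(packets):
    """Serialize a batch of packets as a single Solr <add> message."""
    add = ET.Element("add")
    for packet in packets:
        doc = ET.SubElement(add, "doc")
        for name, value in packet.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def update_url(host, port):
    """Build the update URL from the Host and Port parameters (assumed path)."""
    return "http://%s:%s/solr/update" % (host, port)

body = build_solr_add([{"id": "1", "title": "Flow"}])
url = update_url("localhost", 8983)
```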