Copyright © 2009. All rights reserved.
Frequently Asked Questions
What made you decide to write pypes?
My day job involves architecting and engineering enterprise search platforms for some of the most prominent organizations on the globe. A large part of that work involves moving, scrubbing, and preparing data to be indexed and made searchable. Pypes is a by-product of my experiences. I wanted/needed an application that was completely agnostic to whatever backend search platform the customer decided to go with, as well as something that provided me with truly reusable components. At the same time, I needed something that scaled well, as it’s not uncommon to process millions of documents at a time.
What is Flow-Based Programming?
Flow-Based Programming (FBP) is a programming paradigm that defines applications as networks of "black box" processes, which exchange data across predefined one-way connections. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. It is thus naturally component-oriented. See J. Paul Morrison’s book for a detailed explanation of FBP.
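To make the "black box" idea concrete, here is a minimal sketch in plain Python. It is not the pypes API: the names source, strip_whitespace, lowercase, and connect are hypothetical, and the one-way connections are modeled as ordinary iterators. It only illustrates how the same unchanged components can be rewired into different applications.

```python
def source(items):
    # Feeds the raw items into the network, one at a time.
    for item in items:
        yield item


def strip_whitespace(stream):
    # Black box: knows nothing about what is upstream or downstream.
    for text in stream:
        yield text.strip()


def lowercase(stream):
    # Another black box with the same one-input, one-output shape.
    for text in stream:
        yield text.lower()


def connect(stream, *components):
    # Wire components together in order; each output feeds the next input.
    for component in components:
        stream = component(stream)
    return stream


docs = ["  Hello World ", " FLOW-Based  "]

# Two different "applications" built from the same unchanged components.
app_one = list(connect(source(docs), strip_whitespace, lowercase))
app_two = list(connect(source(docs), lowercase))
```

Neither component was modified to build the second application; only the wiring changed.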
What’s the difference between pypes and Visual Design Studio?
Pypes is a programming framework that enables application developers to use flow-based programming techniques in the applications they design. Visual Design Studio is an application built using the pypes framework. It allows non-developers (as well as developers) to visually design flow-based applications using friendly and familiar techniques like drag and drop. If you’re a developer, you can also write your own custom components and easily make them available within Visual Design Studio. If you’re not a developer, you can download additional components written by third parties and add them as well.
Do I need to install both pypes and Visual Design Studio?
No. Since Visual Design Studio was built using pypes, the installation process will install pypes for you. If you are an application developer looking to leverage pypes to build your own application then you just need to download pypes (though Visual Design Studio might serve as a good reference).
Why did you choose to use Stackless Python?
I chose Python because it was the best language for the job. Too often programmers gravitate to a specific language and then use it for everything. I believe in using the right language for the task at hand. Python is well suited for text processing, which is important when designing a data-flow system (as opposed to a control-flow system). The idea behind FBP is a data-centric model, from which we infer the need for substantial text processing functionality. Python allows developers to rapidly code and deploy new components without the need for verbose XML configurations and build semantics.
Another strong argument for Python is its coroutine implementation (i.e., generators). Although the new generator syntax allows values to be passed in, I felt I needed something more. In my design, I wanted a push model, and Python generators really enforce more of a pull style (i.e., data isn’t pushed into the system but rather pulled or triggered by the last component requesting data). Stackless Python was a natural solution that fully implements coroutines. It’s a wonderful enhancement to the Python interpreter and, much like FBP, it’s an old-style programming paradigm that is making a comeback.
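The pull/push distinction can be sketched with standard Python generators. This is only an illustration and uses neither Stackless Python nor pypes; the names numbers, doubled, doubler, and collector are hypothetical. In the first half nothing runs until the final consumer asks for data. In the second half the generator syntax that accepts values (send) is used so that the producer drives the flow.

```python
def numbers():
    # Pull style: this body only advances when a consumer asks for a value.
    for n in (1, 2, 3):
        yield n


def doubled(upstream):
    for n in upstream:
        yield n * 2


# The final list() call is what pulls every value through the chain.
pulled = list(doubled(numbers()))


def collector(results):
    # Push style: wait at `yield` until some value is sent in.
    while True:
        value = yield
        results.append(value)


def doubler(target):
    while True:
        value = yield
        target.send(value * 2)  # forward the result downstream


pushed = []
sink = collector(pushed)
next(sink)    # advance to the first `yield` so it can receive values
stage = doubler(sink)
next(stage)   # same priming step for the middle stage

for n in (1, 2, 3):
    stage.send(n)  # the producer pushes data into the network
```

Both halves compute the same values; what differs is which end of the chain initiates each step.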
The first language I looked at was Erlang but it seemed to lack the more mature text processing libraries offered by Python. I also had to consider the community of developers and I’d speculate that more folks are familiar with Python than Erlang.
What the heck are these coroutines you keep mentioning?
Coroutines are a generalization of the more common subroutine. A subroutine (method, function, etc.) has one entry point and one exit point. The routine is called, it executes some logic, and it returns to the portion of code that originally called it. Coroutines, on the other hand, have multiple entry and exit points. How is this useful? It allows a routine to pause execution at any point and later resume exactly where it left off. Don’t threads do the same thing? Yes, a thread can exhibit similar behavior, but there are some fundamental differences. For starters, threads require more system resources to function, so an application can only create a limited number of threads before exhausting its resources. Switching between threads is also more time consuming than switching between coroutines. In a flow-based application we expect to switch between various components quite often (billions of times in some cases). Lastly, threads can operate in parallel, which opens the door to deadlocks, race conditions, and other problems generally associated with traditional thread models.
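As a rough illustration of "multiple entry and exit points", the sketch below uses a plain Python generator as a stand-in for a coroutine. It does not use Stackless Python, and the name component is hypothetical. Each yield is an exit point, and each next() call re-enters the routine exactly where it paused.

```python
def component(log):
    log.append("start")
    yield                      # first exit point: pause here
    log.append("resumed once")
    yield                      # second exit point
    log.append("resumed twice")


log = []
routine = component(log)

next(routine)                  # enter: runs up to the first yield
log.append("caller does other work")
next(routine)                  # re-enter right after the first yield
log.append("caller again")
next(routine, None)            # run to the end; the default avoids StopIteration
```

After this runs, log interleaves the routine's steps with the caller's, which shows that the routine kept its position between calls.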
Do coroutines operate in parallel?
Generally speaking, no. Coroutines use cooperative rather than preemptive multitasking. This means that the executing code will relinquish use of the CPU to allow another routine to run. It is possible to preempt a coroutine but that control is handled within the application (as opposed to inside the OS) and we’re still guaranteed that no two coroutines are running simultaneously (whereas at the OS level our application has no real control).
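Cooperative multitasking can be sketched with a tiny round-robin scheduler. This is an assumption-laden illustration built on standard generators, not how Stackless Python or pypes schedules work; worker and run_round_robin are hypothetical names. Each task gives up the CPU voluntarily by yielding, and only one task ever runs at a time.

```python
from collections import deque


def worker(name, steps, trace):
    for step in range(steps):
        trace.append((name, step))
        yield  # voluntarily hand control back to the scheduler


def run_round_robin(tasks):
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run the task until its next yield
        except StopIteration:
            continue            # task finished; do not reschedule it
        queue.append(task)      # still has work; go to the back of the line


trace = []
run_round_robin([worker("a", 2, trace), worker("b", 3, trace)])
```

The scheduling decision lives entirely in application code (run_round_robin), so the order of the entries in trace is deterministic rather than decided by the operating system.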
Does pypes offer any kind of parallelism?
Yes. Pypes does leverage parallelism on systems that are capable. What do you mean by capable? Well, contrary to popular belief, a single CPU (or core) system is not able to achieve any sort of parallelism. It may seem like many processes are running concurrently, but I assure you that the CPU is only executing one instruction at a time. If a system has multiple cores and/or CPUs, then pypes will execute code in parallel on each one (provided the user has configured pypes to do so). When people think about processes executing in parallel, they tend to think in terms of latency (how quickly a task is completed). Pypes was written so that throughput is maximized. In contrast to latency, throughput is the measure of how much work a system can complete over a given window of time. Of course, this is a classic argument in computer science (latency vs. throughput), with the most notable example being RISC vs. CISC architectural designs. In terms of hardware, this gap has been bridged over the past decade and the argument is becoming moot. In the case of pypes, improved throughput is simpler and less risky to achieve than improved latency. Less risky means fewer bugs, which means a better user experience, and that should be a high priority in any design goal.
Python, isn’t that a scripting language? How well can it perform?
I’m not crazy about the phrase “scripting language” because it’s vague in today’s context. A more appropriate term would be “interpreted language”. With that said, Java is also interpreted. Python is admittedly slower than Java in most cases, but Java’s speed comes at a cost. Java is much more verbose than Python and it also uses a static typing system. Python is much more dynamic (important in achieving modularity) and offers richer metaprogramming techniques.
Still, it’s hard to measure the speed of one language against another. What’s generally done is that the same piece of logic is implemented in various languages and timed. One of the ingenious characteristics of Python is that you can rewrite computationally intensive portions of code in C/C++. Many of the core libraries shipped with Python are actually implemented in C, making them faster than their Java counterparts. Don’t believe me? Test out the XML adapter/mapper shipped with Visual Design Studio. It’s about twice as fast as libxml, which is a fast XML library written in C.
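The "same logic, timed two ways" approach can be tried with the standard timeit module. The sketch below says nothing about Java or libxml; it only compares a hand-written Python loop (the hypothetical pure_python_sum) with the built-in sum, which is implemented in C in the standard CPython interpreter. Actual timings vary by machine, so none are claimed here.

```python
import timeit


def pure_python_sum(values):
    # The same logic as the built-in sum(), written as an explicit loop.
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100000))

# Time each implementation over the same input and repetition count.
python_time = timeit.timeit(lambda: pure_python_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

# Confirm both versions really compute the same answer before comparing speed.
same_result = pure_python_sum(values) == sum(values)

print("pure Python loop: %.4f s" % python_time)
print("built-in sum():   %.4f s" % builtin_time)
```

Checking same_result first matters: a timing comparison is only meaningful when both versions do the same work.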
The bottom line is this: in my experience, most data processing tasks are an aggregate of many small and simple tasks. For more complex processing needs (or even interfacing with legacy C libs) you can always drop into C/C++ and achieve great performance where necessary. This, however, will be the rarer case, and using pure Python allows for rapid development and more maintainable code, which should be a priority of any design.