shteynbuk

Python

From Java to Python

Several years ago, while I was primarily focused on Java development, our team needed a tool for data manipulation and analysis. After we evaluated our options, the Python library Pandas emerged as the clear winner. Pandas is practically a world of its own; learning it arguably requires more effort than learning Python itself. I ended up using Pandas extensively and even contributed to the Pandas (v0.21.0) and pandas-datareader (v0.5.0) projects. More details about these contributions are in the Open Source entry.

Before this, I had used Python occasionally, but my knowledge of the language was fairly superficial. That didn’t stop me from contributing to a Python-based open source project many years ago. My contribution focused on interoperability with a C++ library, so it was arguably more C++ than Python.

Over time, my exposure to Python deepened. I’ve worked with various Python modules, packages, and tools, including multithreading and concurrency. My first multi-threaded Python program ran slower than the single-threaded version; this was well before Python 3.13. Coming from Java, I found this unexpected.

It turned out that CPython, the reference Python interpreter, uses a Global Interpreter Lock (GIL) because its internals are not thread-safe, so only one thread executes Python bytecode at a time. (Python 3.13 added an optional, experimental free-threaded build without the GIL.) With the GIL in place, multithreading pays off mainly for I/O-bound programs.

In our case, the solution was to use multiprocessing, which scales well for distributed processing. Another solution would be to implement computationally intensive tasks in C or C++ and call them through Cython.
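The difference between the two approaches can be seen with a small experiment. The following is a minimal sketch, not our production code: `count_primes` and `run_with` are hypothetical names, and the job sizes are arbitrary. On a GIL-enabled CPython build, the thread pool is expected to take roughly as long as running the jobs one after another, while the process pool can use several cores; actual timings depend on the machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_with(executor_cls, jobs, workers=4):
    """Run all jobs on the given executor; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [30_000] * 4  # four identical CPU-bound jobs (arbitrary size)
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, elapsed = run_with(cls, jobs)
        print(f"{cls.__name__}: {elapsed:.2f}s")
```

Both executors share the same interface, so switching between threads and processes is a one-line change.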

Because some of our tasks turned out to be I/O bound, we added a task execution section to the system configuration. The user can configure the task execution engine, such as the number of processes and/or threads, and run experiments to find the optimal numbers. Initially, we built a custom solution using Python’s standard library, and later added Redis and RQ. We’re also evaluating distributed frameworks like Spark and Dask.
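A configurable execution engine of this kind can be sketched with the standard library alone. This is only an illustration under assumed names: the `TASK_EXECUTION` dictionary, its `engine` and `workers` keys, and the helper functions are hypothetical and do not reflect our actual configuration format.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical configuration section; the key names are illustrative only.
TASK_EXECUTION = {"engine": "thread", "workers": 8}

# Maps the configured engine name to a standard-library executor class.
ENGINES = {"thread": ThreadPoolExecutor, "process": ProcessPoolExecutor}


def make_executor(config):
    """Build an executor from a config dict with 'engine' and 'workers' keys."""
    try:
        executor_cls = ENGINES[config["engine"]]
    except KeyError:
        raise ValueError(f"unknown engine: {config.get('engine')!r}") from None
    return executor_cls(max_workers=config["workers"])


def run_tasks(func, items, config):
    """Apply func to every item on the configured engine, keeping input order."""
    with make_executor(config) as pool:
        return list(pool.map(func, items))
```

With this shape, an experiment to find a good worker count amounts to calling `run_tasks` with different `config` values and timing each run.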

We also use Python as a portable scripting language, with scripts running on both Linux and Windows.

Transitioning from Python 2 to Python 3

The migration from Python 2 to Python 3 was a rather big project. Our first step was to make the codebase compatible with both versions, eliminating the need to maintain two separate codebases. This process also gave us the opportunity to review, clean, and refactor the code, especially sections that had been written when we were new to Python.

The transition was made smoother because we had been preparing for it, and all new code was written with Python 3 compatibility in mind whenever possible.

A more detailed description of our transition experience deserves its own entry. Here I just want to mention one side effect of the transition: code readability. After large-scale code conversion, especially when using automated tools, the resulting code is often less readable. Code readability is, after all, one of the main reasons Python has become so popular.

A significant part of the conversion was an intermediate step, making the code compatible with both Python 2 and Python 3. This step introduced some extra code that is no longer necessary. We are very happy with the results and plan to clean up the remaining transitional code when time permits.
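To illustrate the kind of transitional code meant here, below is a minimal sketch of a compatibility shim. It is not taken from our codebase; the names `PY2`, `text_type`, and `to_text` are hypothetical and only show the general pattern.

```python
import sys

# True only when running under Python 2; always False on Python 3.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name exists only on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as text on both Python 2 and Python 3 (hypothetical helper)."""
    if isinstance(value, bytes):
        # On Python 2, plain str is bytes, so it is decoded here as well.
        return value.decode(encoding)
    return text_type(value)
```

Once Python 2 support is dropped, a shim like this reduces to `str` plus an occasional `bytes.decode` call, which is the sort of cleanup that remains on our list.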

What's Next

In future posts, I’ll illustrate different Python features with examples from my projects. A significant portion of my Python development involves data processing, web development, and scripting. Some of my data processing projects might require a separate entry, for example, extracting financial data from EDGAR regulatory filings (XBRL format, unstructured text).

First Text Analysis Python Project was my first unstructured text analysis project.

Some samples of Python code are on GitHub Gist:

Hedge Fund Investment Projection Program

This program calculates projections for when hedge fund investors will receive their investments over time, with most calculations performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.

NYSE trading days for backtesting

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, for backtesting or pattern recognition, there is often a need to access NYSE trading days from the past several years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.

This website was created using Pelican, a static site generator written in Python.

Below are some of the Python libraries and tools that I have worked with:

Pandas, NumPy, Beautiful Soup, Requests, Selenium, Cython, SciPy, NLTK, Matplotlib, pytest, unittest, Django, Flask, Redis, RQ, Pelican, reStructuredText, Sphinx.

Web Development

In the early days of web development, web pages were generated on the server, and the browser’s role was to display them. Today, that responsibility is shared between the server and the client (browser), with ongoing discussions about which tasks belong where. A key aspect of this division involves how the client and server communicate. My experience has primarily been on the server side, but I’ve also done a fair amount of client-side development.

Additionally, I have worked extensively with web scraping. While it isn’t web development in the traditional sense, it involves analyzing websites and extracting data from them — a process that overlaps with web development skills. As websites become more complex and dynamic, scraping them becomes increasingly challenging.

Below are some of the libraries and tools that I have worked with:

Django, Flask, Redis, RQ, Celery, RabbitMQ, Docker, Heroku, Netlify, Pelican, HTML, CSS, JavaScript, jQuery, AJAX, Bootstrap, reStructuredText, REST, Beautiful Soup, Jsoup, Selenium, XML, JSON, YAML, JBoss, Tomcat, Glassfish.

C++

Below is a brief overview of several C++ projects I was responsible for. Information about these projects is already publicly available.

Client Infrastructure Library Wrapper

The underlying infrastructure was a complex company-wide system that integrated three key components: a persistence layer, a shared cache, and a messaging component. The purpose of the wrapper was to hide this complexity from end users, simplifying infrastructure usage and enabling easier integration with third-party libraries and legacy systems.

For more information, see Presentations.

Database Design and DB Library Wrapper

All database-related activities were my responsibility, covering both data management and code. We used the c-tree Plus library as an embedded database engine running in the application’s process address space.

To simplify and standardize database access, I developed an application-specific Object-to-Datastore Mapping library, essentially a wrapper around c-tree. This wrapper was used by the entire development team for all database calls.

Later, I added SQL support to the library, enabling integration with Sybase as an SQL database. The transition from c-tree to Sybase was smooth, and developers had the flexibility to choose between an embedded database and a Sybase server, depending on the needs of each project.

Beyond these projects, I participated in various C++ discussion groups, provided C++ consultations, and taught C++ classes. I also moderated the NYC Design Patterns Study Group and posted summaries of our discussions; most of the code samples were in C++.

After reading an early edition of Scott Meyers’ Effective C++, I reached out to the author with my comments. This led to an extensive email correspondence, which continued after the publication of his next book, More Effective C++. Scott Meyers cited me in subsequent editions of both Effective C++ and More Effective C++.

Below are some of the C++ libraries and tools I have worked with. For memory management, I often relied on smart pointers.

STL, Rogue Wave libraries, Orbix, CORBA, TIBCO, c-tree Plus

Presentations

These presentations describe a wrapper around a client C++ infrastructure library.

The underlying infrastructure was a complex company-wide system that integrated three key components: a persistence layer, a shared cache, and a messaging component. The purpose of the wrapper was to hide this complexity from end users, simplifying infrastructure usage and enabling easier integration with third-party libraries and legacy systems.

The design was influenced by STL adapter classes, IBM's collection reference classes, and the Orbix TIE approach. This approach was later recognized in the External Polymorphism Pattern paper as an independent discovery of a variant of that pattern.

These two presentations describe the same project from different angles. The UP'97 (international conference on Using Patterns) presentation is more detail-oriented and includes extensive code samples, while the COOTS'97 (Conference on Object-Oriented Technologies) presentation focuses more on design aspects.

The final sentence of the COOTS'97 presentation sparked a heated debate involving participants such as Bjarne Stroustrup, Scott Meyers, Doug Lea, and Andrew Koenig.

COOTS'97 - HTML PDF

UP'97 - HTML PDF

Development Environment

Currently, most of my development is done on Linux and/or Windows.

Version control is Git. In the past, I have used Mercurial, Subversion, and CVS.

These days, most of my scripting is in Python and sometimes in Ant. The main reason for this is that Python and Ant scripts can be executed on Linux and Windows. I have also used Perl and Groovy. I use the system shell when I have to; portability has its limitations, and sometimes, you just need to use the system shell. On several projects, I was responsible for setting up and managing the development environment.

Most of the time I use PyCharm, sometimes Eclipse, PyDev, Visual Studio Code, IntelliJ IDEA, or any of the available text editors, depending on the environment.

Open Source

GitHub

First Text Analysis Python Project was my first unstructured text analysis project.

GitHub Gist

My open-source contributions were in several programming languages. The tools and languages used for each contribution are listed at the end of their descriptions.

Pandas and pandas-datareader

While working on a stock screening application using Pandas, I encountered issues that only appeared with specific data values. Initially, I implemented a workaround in my code that functioned correctly but was cumbersome and difficult to maintain. However, this did not address existing code that might behave inconsistently depending on the data. To fix the issue, I contributed changes to Pandas itself as well as to pandas-datareader, which is now maintained as a standalone project.

My contributions included:

Improvements to missing value handling, unit tests, and documentation.
A signature-preserving decorator for compatibility with Python 2.
API-level changes, including updates to documentation and “What’s New” notes.
A temporary workaround for a known issue, along with active participation in related discussions.
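As background for the signature-preserving decorator mentioned in the list above: on Python 3, `functools.wraps` records the wrapped function so that `inspect.signature` reports the original parameters, whereas Python 2 had no such built-in support. The sketch below shows the Python 3 behavior only; it is not the code contributed to Pandas, and `log_calls` and `fill_missing` are hypothetical names.

```python
import functools
import inspect


def log_calls(func):
    """Decorator that counts calls while keeping func's metadata."""
    @functools.wraps(func)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@log_calls
def fill_missing(values, fill_value=0):
    """Replace None entries with fill_value."""
    return [fill_value if v is None else v for v in values]


# inspect.signature follows __wrapped__, so the original parameters are shown.
print(inspect.signature(fill_missing))  # (values, fill_value=0)
```

Without `functools.wraps`, the reported signature would be the wrapper's generic `(*args, **kwargs)`, which hides the real parameters from documentation tools and interactive help.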

Python, Cython, pytest, Sphinx, reStructuredText.

Pandas commits

https://github.com/pandas-dev/pandas/commits?author=olegshteynbuk

pandas-datareader commits

https://github.com/pydata/pandas-datareader/commits?author=OlegShteynbuk

Pandas pull requests

https://github.com/pandas-dev/pandas/pull/16589
https://github.com/pandas-dev/pandas/pull/16534

Pandas issues

https://github.com/pandas-dev/pandas/issues/16606
https://github.com/pandas-dev/pandas/issues/16471

pandas-datareader pull requests

https://github.com/pydata/pandas-datareader/pull/357
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line, so the GitHub GUI incorrectly displays it as not merged; the source code correctly shows it as merged: https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24)

pandas-datareader issues

https://github.com/pydata/pandas-datareader/issues/363
https://github.com/pydata/pandas-datareader/issues/342

Arelle

Arelle pull requests

https://github.com/Arelle/Arelle/pull/98

geWorkbench

At the time, I was at the Center for Computational Biology and Bioinformatics (C2B2) at Columbia University. Most of my work focused on geWorkbench, a Java-based open-source desktop application for integrated genomics. I was involved in design, new feature development, enhancements, and bug fixes.

I also contributed to server-side work, primarily on a grid-based application called CaArray, an open-source array data management system that is accessible both through the web and programmatically. My responsibilities with CaArray included installation and administration.

Some tasks required advanced calculus and mathematical modeling.

Java, C++/C, Swing, JBoss, Tomcat, Ant.

GitHub Gist has Python and Java code.

Hedge Fund Investment Projection Program

This program calculates projections for when hedge fund investors will receive their investments over time, with most calculations performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.

NYSE trading days for backtesting

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, for backtesting or pattern recognition, there is often a need to access NYSE trading days from the past several years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.

Java program to perform several dataset operations

This Java program performs basic operations on datasets stored in CSV (comma-separated) files. It reads the dataset into a dataframe to perform various operations.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Other contributions

Leo outlining editor

Submitted a bug fix for integration with Aspell, a spell-checking library for C++. Several years ago, Leo switched from Aspell to PyEnchant.

Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a temporary workaround.

Java.

Filed bug reports against Swing, JBoss, and several other Java tools and libraries.

DB - SQL, NoSQL

Based on project requirements, I have worked with SQL (relational) and NoSQL (non-relational) databases.

In Python, I frequently used Pandas to load data from databases into DataFrames for processing and export.

My experience in Java includes using APIs like JDBC and ORMs such as Hibernate for database access and management.

In C++, I have worked with the c-tree Plus database library, which supports custom index structures and can be configured to operate in either SQL or NoSQL mode.

Below is a description of how c-tree Plus was used in one of my projects.

All database-related activities were my responsibility, covering both data management and code. We used the c-tree Plus library as an embedded database engine running in the application’s process address space.

To simplify and standardize database access, I developed an application-specific Object-to-Datastore Mapping library, essentially a wrapper around c-tree. This wrapper was used by the entire development team for all database calls.

Later, I added SQL support to the library, enabling integration with Sybase as an SQL database. The transition from c-tree to Sybase was smooth, and developers had the flexibility to choose between an embedded database and a Sybase server, depending on the needs of each project.

SQL

Many of the projects I worked on required handling SQL at various levels of complexity.

In one project, I implemented a conversion from Oracle and PL/SQL to Python, Pandas, and MySQL. While Pandas was used, a significant portion of the work was implemented in SQL scripts.

In another project involving PostgreSQL, I wrote a database schema conversion tool, implemented a synchronization module between the application cache and the database, and wrote triggers and stored procedures in PL/pgSQL, C, and Python. PostgreSQL’s support for multiple stored procedure languages is a very nice feature. In this project, replacing PL/pgSQL procedures with Python-based logic resulted in simpler code.

NoSQL

My experience with NoSQL databases includes object-oriented (OO) databases, graph databases, and in-memory key–value stores.

An OO database was used for a Quartz-based project.

In a project involving a graph database (Neo4j), I was assigned a ticket to resolve a severe performance issue. After some investigation, I found that depending on the data, one Neo4j query was taking nearly an hour to complete, while other queries executed in seconds. It appeared to be a Neo4j issue likely fixed in later releases, as we were using an outdated version. I found a workaround that reduced the query time from about an hour to several seconds. Neo4j’s comprehensive and well-written documentation proved to be very helpful.

Below are some of the databases I have worked with:

PostgreSQL, Oracle, MySQL, SQLite, Sandra, Neo4j, Sybase, SQL Server, Redis

Tea

Tea was always a big part of my life.

Like many in the Western world, I started with black tea, which is known as red tea in China. After trying nearly every type of tea, I settled on Pu-erh tea, which in China is often called black tea.

Pu-erh is earthy and warm, with an aroma and taste that can vary significantly depending on where it’s grown (even the slope of the mountain matters), how it’s processed, stored, and aged, and which factory produced it, much like wine in that sense.

Another great thing about pu-erh is that it can be aged, and in many cases, it gets better with time — like scotch. Isn't that great? Well, most of the time it is. Occasionally you get a miss — a tea that actually tasted better when it was young.

I drink other types of tea too, just not as often. If I ever run out of pu-erh (which is unlikely — I probably have enough to last me the rest of my life), I’d reach for oolong or green tea next, followed by black and white.

Most of my tea comes from China, and I usually brew it using a slightly modified version of Gongfu cha. I’ll be posting photos of my collection soon — I’m currently in the process of retasting my teas and updating my notes.

And in case you haven’t noticed: the bootstrap theme for this site is inspired by tea.

Java

My Java experience is evenly split between Core Java and J2EE.

In Core Java, I have utilized various packages and tools, including collections, multithreading, and concurrency. A partial list of the Java packages and libraries I’ve used is included at the end of this entry.

Many Java applications begin as single-threaded, but as performance demands grow, they often evolve into multi-threaded applications. Java provides robust concurrency APIs.

In J2EE, my primary focus has been on server-side development, but I also have significant experience with client-side development. On the client side, I have developed both desktop-based and browser-based clients. For desktop applications, I have used Swing. My browser-based clients also included non-Java technologies such as JavaScript, HTML, and CSS. On the server side, I’ve worked with various application servers and libraries — a partial list can be found at the end of this entry.

I’ve also worked with XML processing, integrated various data feeds, created and consumed web services, and extracted data from websites (web scraping). I was also responsible for a migration from Java 8 to Java 11.

While working with Java, I’ve been involved in both new application development and maintaining legacy systems. Probably a separate entry is needed for a more detailed description of the projects.

You can find a few Java code samples on my GitHub Gist.

Java Program for Dataset Operations

This Java program performs basic operations on datasets stored in CSV (comma-separated) files. It reads the dataset into a dataframe to perform various operations.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Below are some of the libraries and tools that I have worked with:

Core Java, Java EE (J2EE), JSP, Spring, Hibernate, EclipseLink, JPA, EJB, JMS, MQ, TIBCO, Twitter4J, Jsoup, Tomcat, JBoss, Glassfish, Log4j, Guice, JUnit, Mockito, jQuery, Apache Axis, JXPath, JAXB, Swing, Restlet, Ant, Maven.

Miscellaneous Development

This page is a temporary space for various topics I’ve worked on, listed in no particular order. These don’t yet have dedicated blog entries, but I hope to organize and publish them soon.

Multi-threaded and distributed programming, concurrency. Java vs Python: multithreading and multiprocessing, including the Global Interpreter Lock (GIL)

Messaging, distributed cache - Redis, RQ, Celery, RabbitMQ, JMS, TIBCO, IBM MQ, Oracle Coherence. On one project, we used Qpid for testing and IBM MQ in production.

RSS and Twitter feeds processing.

Science and Math; data structures and algorithms - Pandas, NumPy, SciPy.

Decision support system for railroad network management. This system includes a mathematical model that describes the flow of railroad cars within the railroad region.

Lexical analyzer and parser development.

Sysadmin and database admin tasks. Internal Maven repository setup.

Troubleshooting and resolving production issues.

Working with legacy code, maintenance, enhancement, refactoring.

Other languages - Fortran, Assembler, Pascal.

Design, Design Patterns.

Publications - decision support systems, operating systems, numerical solutions of partial differential equations.

Taught programming and math as a part-time job.

Moderated the NYC Design Patterns Study Group, where I posted summaries of the discussions.

Technical Writing

Technical writing is sometimes described as simplifying the complex (techwhirl.com), a goal I also strive for in my code. There’s a saying that explaining something helps you understand it better, and in my experience, writing documentation often deepens my understanding of the codebase.

If code has value, it will eventually be read and modified by others. Documentation that clarifies the code's purpose or structure is helpful, even years later when you or someone else revisits the project.

The same applies to user guides. A well-written guide allows users to work with a tool or library without needing to dig into the source code. For example, in a J2EE project, I was responsible for migrating the build system to Maven. I set up the project structure, including build and deployment processes, and configured an internal Maven repository. Writing user guides not only aided team migration to Maven but also reduced the number of Maven-related questions I had to answer.

Most of my technical writing has been internal and is not publicly available, but some of my work - including content on this site and documentation for open source contributions - is public.

Although technical writing has never been my full-time role, I enjoy the creative process of writing. After completing several academic writing courses at Columbia University, I passed the Level 10 exam.

The tools I use for technical writing include:

reStructuredText, Markdown, Sphinx, wiki.

Testing

I have experience with both functional and unit testing, and there was a period in my career when testing was my primary responsibility.

Ideally, every piece of code should be tested. But in practice — especially in languages like Python, where prototyping and experimentation are common — writing tests for throwaway code isn’t always efficient. Additionally, deadlines often lead to testing being the first casualty.

One context where testing is rigorously enforced is Open Source development. This is especially true for Pandas, where every change is expected to include a corresponding test case. My contributions to Pandas taught me a great deal about Python testing best practices.

In C++, I developed a client/server test bed for system testing of infrastructure components during the period when testing was my primary role.

In both Python and Java, I’ve written a substantial number of unit tests, but they were mostly standard unit tests, similar to those commonly found online.

Below are some of the libraries and tools that I have worked with:

pytest, unittest, mocking, Selenium, JUnit, Mockito, PowerMock

Ant Script to Update/Install Eclipse

This Ant script, originally written several years ago and previously hosted on this site, is now available on GitHub.

At the time the script was created, Eclipse was still relatively new. While it offered extensibility, many essential features were left to plugin developers. As a result, Eclipse users often found themselves waiting anxiously for new builds or updated versions of key plugins.

However, updating Eclipse — or its plugins — was anything but straightforward. There were many undocumented or poorly documented rules, and plugin packaging conventions varied widely. Manual installation was common, and sometimes updates introduced compatibility issues that could render a workspace unusable. The script addressed all these issues and was easily customizable; sometimes you just need a command line tool.

It's hard to estimate how widely the script was used, but there were signs of usage: user emails with questions, feature requests, suggestions, and code contributions. The script was reviewed in several blogs and newsgroups and, for a time, ranked at the top of Google search results for "Ant script." At the time, Apache Ant itself was gaining popularity as a build tool, an alternative to the UNIX make utility, and this script extended Ant’s utility.

The following text was written alongside the script’s initial release. While some statements may now seem dated, they’ve been left untouched to preserve the context and spirit of the time:

Eclipse is a great IDE. It is relatively new and the speed with which Eclipse team introduces new features is amazing. However, my first update to a new Eclipse build was rather time consuming; and from reading news I realized that other developers were struggling with updates too. The script simplified updates and, over time, evolved to incorporate some other related activities. All the documentation is in the Readme file. One chapter in the Readme file, named What is the Right Way to Update?, is different from the other chapters as this chapter is not about how to use the script but rather why and when to use it.

You can download a zip file from this site or just browse Readme file online. Latest changes are in Release Notes. It is worth noting that Eclipse comes with the Update Manager but it seems to be for major releases and not for builds. The primary audience for this script is developers who update Eclipse quite often.

Read What Others Have Said:

http://youarenumber6.blogspot.com/2004/08/departmental-eclipse-with-ant.html

http://www.jroller.com/page/dorodok/20030106
