Open Source

First Text Analysis Python Project: my first unstructured text analysis project.

My open-source contributions span several programming languages. The tools and languages used for each contribution are listed at the end of its description.

Pandas and pandas-datareader

While working on a stock screening application using Pandas, I encountered issues that appeared only with specific data values. Initially, I implemented a workaround in my own code; it functioned correctly but was cumbersome and difficult to maintain. The workaround also did nothing for other existing code that might behave inconsistently depending on the data. To fix the issue at its source, I contributed changes to Pandas itself as well as to pandas-datareader, which is now maintained as a standalone project.

My contributions included:

Improvements to missing value handling, unit tests, and documentation.
A signature-preserving decorator for compatibility with Python 2.
API-level changes, including updates to documentation and “What’s New” notes.
A temporary workaround for a known issue, along with active participation in related discussions.

Python, Cython, pytest, Sphinx, reStructuredText.
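To illustrate what "signature-preserving" means for a decorator, here is a minimal sketch in modern Python. It is not the code contributed to Pandas: the names `log_calls` and `fill_missing` are hypothetical. In Python 3, `functools.wraps` sets `__wrapped__` and `inspect.signature` follows it, so the original signature is reported; Python 2 had neither feature, which is why extra work was needed there.

```python
import functools
import inspect


def log_calls(func):
    """Hypothetical decorator: counts calls while keeping func's metadata."""

    @functools.wraps(func)  # copies __name__ and __doc__, sets __wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@log_calls
def fill_missing(values, default=0):
    """Replace None entries with a default value."""
    return [default if v is None else v for v in values]


print(fill_missing.__name__)            # fill_missing
print(inspect.signature(fill_missing))  # (values, default=0)
print(fill_missing([1, None, 3]))       # [1, 0, 3]
```

Without `functools.wraps`, the decorated function would report the name `wrapper` and the signature `(*args, **kwargs)`, which breaks generated documentation and introspection.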

Pandas commits

https://github.com/pandas-dev/pandas/commits?author=olegshteynbuk

pandas-datareader commits

https://github.com/pydata/pandas-datareader/commits?author=OlegShteynbuk

Pandas pull requests
https://github.com/pandas-dev/pandas/pull/16589
https://github.com/pandas-dev/pandas/pull/16534

Pandas issues
https://github.com/pandas-dev/pandas/issues/16606
https://github.com/pandas-dev/pandas/issues/16471

pandas-datareader pull requests
https://github.com/pydata/pandas-datareader/pull/357
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line; the GitHub GUI incorrectly displays it as not merged, but the source code shows it as merged: https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24)

pandas-datareader issues
https://github.com/pydata/pandas-datareader/issues/363
https://github.com/pydata/pandas-datareader/issues/342

Arelle
Arelle pull requests
https://github.com/Arelle/Arelle/pull/98

geWorkbench
At the time, I was at the Center for Computational Biology and Bioinformatics (C2B2) at Columbia University.
Most of my work focused on geWorkbench, a Java-based open-source desktop application for integrated genomics.
On geWorkbench, I worked on design, new features, enhancements, and bug fixes.
I also contributed to server-side work, primarily on a grid-based application called CaArray, an open-source array data management system accessible both through the web and programmatically. My responsibilities with CaArray included installation and administration.
Some tasks required advanced calculus and mathematical modeling.
Java, C++/C, Swing, JBoss, Tomcat, Ant.

GitHub Gist has Python and Java code.

Hedge Fund Investment Projection Program

This program projects when hedge fund investors can expect to get their investments back over time, with most calculations performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.
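As a rough illustration of how withdrawal restrictions turn into a projected date, the sketch below combines three restrictions that are common in hedge fund terms: a lock-up period, a notice period, and quarterly redemption dates. It is not the program's code. The function names, the parameters, and the choice of these three restrictions are assumptions, and it uses only the standard library rather than Pandas to stay self-contained.

```python
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by whole months; the day is clamped to 28 for simplicity."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def quarter_ends(year):
    """Assumed redemption dates: calendar quarter-ends."""
    return [date(year, 3, 31), date(year, 6, 30),
            date(year, 9, 30), date(year, 12, 31)]


def earliest_redemption(invested, lockup_months, notice_days, today):
    """First quarter-end that is past the lock-up and leaves time for notice."""
    unlocked = add_months(invested, lockup_months)
    notice_served = today + timedelta(days=notice_days)
    threshold = max(unlocked, notice_served)
    # December 31 of the threshold year always qualifies, so next() finds a date.
    return next(q for q in quarter_ends(threshold.year) if q >= threshold)


# Hypothetical tranche: invested 2023-01-15, 12-month lock-up, 45 days' notice.
print(earliest_redemption(date(2023, 1, 15), 12, 45, today=date(2023, 12, 1)))
# 2024-03-31
```

Here both the lock-up and the notice period end on 2024-01-15, so the first available quarter-end is 2024-03-31. Real terms often add further conditions (gates, fees, partial redemptions) that this sketch ignores.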

NYSE trading days for backtesting

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, for backtesting or pattern recognition, there is often a need to access NYSE trading days from the past several years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.
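The idea of deriving trading days from holiday rules can be sketched as follows. This is not the forked code and it does not use the dateutil rule-set API; it is a standard-library illustration covering only two holidays (Independence Day with weekend observance, and Thanksgiving). All function names are hypothetical, and a real calendar needs the complete rule set, including rules that changed over the years.

```python
from datetime import date, timedelta


def observed(holiday):
    """Move a fixed-date holiday off the weekend: Saturday -> Friday, Sunday -> Monday."""
    if holiday.weekday() == 5:
        return holiday - timedelta(days=1)
    if holiday.weekday() == 6:
        return holiday + timedelta(days=1)
    return holiday


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def sample_holidays(year):
    """Two illustrative holidays only, not the full NYSE list."""
    return {
        observed(date(year, 7, 4)),    # Independence Day
        nth_weekday(year, 11, 3, 4),   # Thanksgiving: 4th Thursday of November
    }


def trading_days(start, end):
    """Weekdays from start to end (inclusive) that are not sample holidays."""
    holidays = set()
    for year in range(start.year, end.year + 1):
        holidays |= sample_holidays(year)
    days, day = [], start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            days.append(day)
        day += timedelta(days=1)
    return days


print([d.isoformat() for d in trading_days(date(1986, 6, 30), date(1986, 7, 7))])
# ['1986-06-30', '1986-07-01', '1986-07-02', '1986-07-03', '1986-07-07']
```

July 4, 1986 fell on a Friday, so it drops out of the list along with the weekend that follows it.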

Java program to perform several dataset operations

This Java program performs basic operations on datasets stored in CSV (comma-separated) files. It reads the dataset into a dataframe to perform various operations.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Other contributions

Leo outlining editor

Submitted a bug fix for integration with Aspell, a spell-checking library for C++. Several years ago, Leo switched from Aspell to PyEnchant.

Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a temporary workaround.

Java.

Filed bug reports against Swing, JBoss, and several other Java tools and libraries.


Ant Script to Update/Install Eclipse

This Ant script, originally written several years ago and previously hosted on this site, is now available on GitHub.

At the time the script was created, Eclipse was still relatively new. While it offered extensibility, many essential features were left to plugin developers. As a result, Eclipse users often found themselves waiting anxiously for new builds or updated versions of key plugins.

However, updating Eclipse — or its plugins — was anything but straightforward. There were many undocumented or poorly documented rules, and plugin packaging conventions varied widely. Manual installation was common, and sometimes updates introduced compatibility issues that could render a workspace unusable. The script addressed all these issues and was easily customizable; sometimes you just need a command line tool.

It's hard to estimate how widely the script was used, but there were signs that it was in use: user emails with questions, feature requests, suggestions, and code contributions. The script was reviewed in several blogs and newsgroups, and for a time it ranked at the top of Google search results for "Ant script." At the time, Apache Ant itself was gaining popularity as a build tool (an alternative to the UNIX make utility), and this script extended Ant’s usefulness.

The following text was written alongside the script’s initial release. While some statements may now seem dated, they’ve been left untouched to preserve the context and spirit of the time:

Eclipse is a great IDE. It is relatively new and the speed with which Eclipse team introduces new features is amazing. However, my first update to a new Eclipse build was rather time consuming; and from reading news I realized that other developers were struggling with updates too. The script simplified updates and, over time, evolved to incorporate some other related activities. All the documentation is in the Readme file. One chapter in the Readme file, named What is the Right Way to Update?, is different from the other chapters as this chapter is not about how to use the script but rather why and when to use it.

You can download a zip file from this site or just browse Readme file online. Latest changes are in Release Notes. It is worth noting that Eclipse comes with the Update Manager but it seems to be for major releases and not for builds. The primary audience for this script is developers who update Eclipse quite often.

Read What Others Have Said:

http://youarenumber6.blogspot.com/2004/08/departmental-eclipse-with-ant.html

http://www.jroller.com/page/dorodok/20030106
