baloo.core package

baloo.core.frame module

class baloo.core.frame.DataFrame(data=None, index=None)[source]

Bases: baloo.core.generic.BinaryOps, baloo.core.generic.BalooCommon

Weld-ed pandas DataFrame.

Examples

>>> import baloo as bl
>>> import numpy as np
>>> from collections import OrderedDict
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', [1, 0, 2]))))
>>> df.index  # repr
RangeIndex(start=0, stop=3, step=1)
>>> df  # repr
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[a: int64, b: int64])
>>> print(df.evaluate())  # omitting evaluate would trigger exception as index is now an unevaluated RangeIndex
       a    b
---  ---  ---
  0    5    1
  1    6    0
  2    7    2
>>> print(len(df))
3
>>> print((df * 2).evaluate())
       a    b
---  ---  ---
  0   10    2
  1   12    0
  2   14    4
>>> print((df * [2, 3]).evaluate())
       a    b
---  ---  ---
  0   10    3
  1   12    0
  2   14    6
>>> print(df.min().evaluate())
<BLANKLINE>
---  --
a     5
b     0
>>> print(df.mean().evaluate())
<BLANKLINE>
---  --
a     6
b     1
>>> print(df.agg(['var', 'count']).evaluate())
         a    b
-----  ---  ---
var      1    1
count    3    3
>>> df.rename({'a': 'c'})
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[c: int64, b: int64])
>>> df.drop('a')
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[b: int64])
>>> print(df.reset_index().evaluate())
       index    a    b
---  -------  ---  ---
  0        0    5    1
  1        1    6    0
  2        2    7    2
>>> print(df.set_index('b').evaluate())
  b    a
---  ---
  1    5
  0    6
  2    7
>>> print(df.sort_values('b').evaluate())
       a    b
---  ---  ---
  1    6    0
  0    5    1
  2    7    2
>>> df2 = bl.DataFrame({'b': np.array([0, 2])})
>>> print(df.merge(df2, on='b').evaluate())
  b    index_x    a    index_y
---  ---------  ---  ---------
  0          1    6          0
  2          2    7          1
>>> df3 = bl.DataFrame({'a': [1., -999., 3.]}, bl.Index([-999, 1, 2]))
>>> print(df3.dropna().evaluate())
        a
----  ---
-999    1
   2    3
>>> print(df3.fillna({'a': 15}).evaluate())
        a
----  ---
-999    1
   1   15
   2    3
>>> print(bl.DataFrame({'a': [0, 1, 1, 2], 'b': [1, 2, 3, 4]}).groupby('a').sum().evaluate())
  a    b
---  ---
  0    1
  2    4
  1    5
Attributes:
index
dtypes

Series of NumPy dtypes present in the DataFrame with index of column names.

columns

Index of the column names present in the DataFrame in order.

iloc

Retrieve Indexer by index.

__getitem__(item)[source]

Select from the DataFrame.

Supported functionality exemplified below.

Examples

>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> print(df['a'].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[['a']].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[df['a'] < 7].evaluate())
       a
---  ---
  0    5
  1    6
__init__(data=None, index=None)[source]

Initialize a DataFrame object.

Note that (unlike pandas) there’s currently no index inference or alignment between the indexes of any Series passed as data. That is, all data, be it raw or Series, inherits the index of the DataFrame. Alignment is currently restricted to setitem.

Parameters:
data : dict, optional

Data as a dict of str -> np.ndarray or Series or list.

index : Index or RangeIndex or MultiIndex, optional

Index linked to the data; it is assumed to be of the same length. RangeIndex by default.

__len__()[source]

Eagerly get the length of the DataFrame.

Note that if the length is unknown (such as for WeldObjects), it will be eagerly computed.

Returns:
int

Length of the DataFrame.

__setitem__(key, value)[source]

Add/update DataFrame column.

Note that for raw data, the length is NOT checked against the length of the DataFrame, since the length may not be known before evaluation. Columns of different lengths are therefore possible when using raw data, which might lead to unexpected behavior. To avoid this, wrap the data in a Series; setitem is then more expensive because the data is aligned on the index. Conversely, if the indexes are known to match and the data has the same length as the DataFrame, it is more efficient to setitem using the raw data.

Parameters:
key : str

Column name.

value : numpy.ndarray or Series

If a Series, the data will be aligned based on the index of the DataFrame, i.e. df.index left join sr.index.

Examples

>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> df['b'] = np.arange(3)
>>> print(df.evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1
  2    7    2
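
The left-join alignment described for Series values can be pictured in plain Python. This is only a sketch of the semantics, not Baloo's implementation: align_left is a hypothetical helper, and the -999 sentinel for labels with no match is an assumption based on the dropna example in the class docstring.

```python
def align_left(df_index, sr_index, sr_values, missing=-999):
    """Reorder sr_values to follow df_index (df.index left join sr.index)."""
    # Map each Series label to its value; assumes the Series labels are unique.
    lookup = dict(zip(sr_index, sr_values))
    # Labels absent from the Series receive the (assumed) missing-value sentinel.
    return [lookup.get(label, missing) for label in df_index]


aligned = align_left([0, 1, 2], [2, 0, 5], [20, 10, 50])
print(aligned)  # [10, -999, 20]
```

Raw data skips this lookup entirely, which is why it is cheaper but unchecked.
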
agg(aggregations)[source]

Perform multiple aggregations at once, optimized.

Parameters:
aggregations : list of str

Which aggregations to perform.

Returns:
DataFrame

DataFrame with the aggregations per column.

astype(dtype)[source]

Cast DataFrame columns to given dtype.

Parameters:
dtype : numpy.dtype or dict

Dtype or column_name -> dtype mapping to cast columns to. Note index is excluded.

Returns:
DataFrame

DataFrame with cast columns.

columns

Index of the column names present in the DataFrame in order.

Returns:
Index
drop(columns)[source]

Drop 1 or more columns. Any column which does not exist in the DataFrame is skipped, i.e. not removed, without raising an exception.

Unlike Pandas’ drop, this is currently restricted to dropping columns.

Parameters:
columns : str or list of str

Column name or list of column names to drop.

Returns:
DataFrame

A new DataFrame without these columns.

drop_duplicates(subset=None, keep='min')[source]

Return DataFrame with duplicate rows (excluding index) removed, optionally only considering subset columns.

Note that the row order is NOT maintained due to hashing.

Parameters:
subset : list of str, optional

Which columns to consider.

keep : {‘+’, ‘*’, ‘min’, ‘max’}, optional

What to select from the duplicate rows. These correspond to the possible merge operations in Weld. Note that ‘+’ and ‘*’ might produce unexpected results for strings.

Returns:
DataFrame

DataFrame without duplicate rows.

dropna(subset=None)[source]

Remove missing values according to Baloo’s convention.

Parameters:
subset : list of str, optional

Which columns to check for missing values in.

Returns:
DataFrame

DataFrame with no null values in columns.

dtypes

Series of NumPy dtypes present in the DataFrame with index of column names.

Returns:
Series
empty

Check whether the data structure is empty.

Returns:
bool
evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)[source]

Evaluates by creating a DataFrame containing evaluated data and index.

See LazyResult

Returns:
DataFrame

DataFrame with evaluated data and index.

fillna(value)[source]

Returns DataFrame with missing values replaced with value.

Parameters:
value : {int, float, bytes, bool} or dict

Scalar value to replace missing values with. If dict, replaces missing values only in the key columns with the value scalar.

Returns:
DataFrame

With missing values replaced.
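
The difference between a scalar and a dict argument can be sketched in plain Python. This is an illustration of the described behavior only, not Baloo's code: fillna_sketch is a hypothetical helper working on a dict of lists, and MISSING = -999.0 is an assumption taken from the dropna/fillna example in the class docstring.

```python
MISSING = -999.0  # assumed sentinel for a missing value


def fillna_sketch(columns, value):
    """Replace MISSING in a dict of lists with a scalar or a per-column dict."""
    filled = {}
    for name, data in columns.items():
        if isinstance(value, dict):
            if name not in value:
                # Columns not named in the dict are left untouched.
                filled[name] = list(data)
                continue
            replacement = value[name]
        else:
            replacement = value
        filled[name] = [replacement if x == MISSING else x for x in data]
    return filled


columns = {'a': [1.0, MISSING, 3.0], 'b': [MISSING, 2.0, 3.0]}
print(fillna_sketch(columns, {'a': 15}))  # {'a': [1.0, 15, 3.0], 'b': [-999.0, 2.0, 3.0]}
print(fillna_sketch(columns, 0.0))        # {'a': [1.0, 0.0, 3.0], 'b': [0.0, 2.0, 3.0]}
```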

classmethod from_pandas(df)[source]

Create baloo DataFrame from pandas DataFrame.

Parameters:
df : pandas.DataFrame
Returns:
DataFrame
groupby(by)[source]

Group by certain columns, excluding index.

To group by an index column as well, call reset_index first.

Parameters:
by : str or list of str

Column(s) to groupby.

Returns:
DataFrameGroupBy

Object encoding the groupby operation.
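
What a grouped sum computes can be sketched with a plain dictionary. This is a conceptual illustration, not how Baloo or Weld implement it; groupby_sum is a hypothetical helper, and the input mirrors the groupby example in the class docstring.

```python
from collections import defaultdict


def groupby_sum(keys, values):
    """Sum values per distinct key, in the spirit of df.groupby('a').sum()."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


result = groupby_sum([0, 1, 1, 2], [1, 2, 3, 4])
# Sort before printing: the group order of a hash-based groupby should not be relied upon.
print(sorted(result.items()))  # [(0, 1), (1, 5), (2, 4)]
```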

head(n=5)[source]

Return DataFrame with first n values per column.

Parameters:
n : int

Number of values.

Returns:
DataFrame

DataFrame containing the first n values per column.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.head(2).evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1
iloc

Retrieve Indexer by index.

Supported iloc functionality exemplified below.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.array([1, 0, 2])))))
>>> print(df.iloc[0:2].evaluate())
       a    b
---  ---  ---
  0    5    1
  1    6    0
>>> print(df.iloc[bl.Series(np.array([0, 2]))].evaluate())
       a    b
---  ---  ---
  0    5    1
  2    7    2
join(other, on=None, how='left', lsuffix=None, rsuffix=None, algorithm='merge', is_on_sorted=True, is_on_unique=True)[source]

Database-like join this DataFrame with the other DataFrame.

Currently assumes that the on columns are sorted and that the on-column(s) values are unique! Future work will handle the other cases.

Note there’s no automatic cast if the type of the on columns differs.

Check DataFrame.merge() for more details.

Parameters:
other : DataFrame

With which to merge.

on : str or list or None, optional

The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.

how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional

Which kind of join to do.

lsuffix : str, optional

Suffix to use on columns that overlap from self.

rsuffix : str, optional

Suffix to use on columns that overlap from other.

algorithm : {‘merge’, ‘hash’}, optional

Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.

is_on_sorted : bool, optional

Whether the on columns are known to be already sorted, in which case a faster algorithm can be employed.

is_on_unique : bool, optional

Whether the on-column values are known to be unique, in which case a faster algorithm can be employed.

Returns:
DataFrame

DataFrame containing the merge result, with the on columns as index.

keys()[source]

Retrieve column names as Index, i.e. for axis=1.

Returns:
Index

Column names as an Index.

merge(other, how='inner', on=None, suffixes=('_x', '_y'), algorithm='merge', is_on_sorted=False, is_on_unique=True)[source]

Database-like join this DataFrame with the other DataFrame.

Currently assumes the on-column(s) values are unique!

Note there’s no automatic cast if the type of the on columns differs.

Algorithms and limitations:

  • Merge algorithms: merge-join or hash-join. Typical pros and cons apply when choosing between the two. Merge-join should be used on fairly equally-sized DataFrames, while a hash-join is better when one of the DataFrames is (much) smaller.
  • Limitations:
    • Hash-join requires the (smaller) hashed DataFrame (more precisely, the on columns) to contain no duplicates!
    • Merge-join requires the on-columns to be sorted!
    • Unsorted data can only be sorted on a single column! (current Weld limitation)
  • Sortedness. If the on-columns are sorted, merge-join does not need to sort the data, so it can be significantly faster. Do add is_on_sorted=True if this is known to be true!
  • Uniqueness. If the on-columns data contains duplicates, the algorithm is more complicated, i.e. slower. Also, hash-join cannot be used on a hashed (smaller) DataFrame with duplicates. Do add is_on_unique=True if this is known to be true!
  • Setting the above 2 flags incorrectly, e.g. is_on_sorted to True when data is in fact not sorted, will produce undefined results.
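
The trade-offs listed above are easier to see with the two algorithms written out. The following is a textbook sketch in plain Python, not the Weld code Baloo generates; merge_join and hash_join are hypothetical helpers that return pairs of matching row positions for an inner join on unique keys.

```python
def merge_join(left_keys, right_keys):
    """Inner merge-join; both inputs must be sorted and hold unique keys."""
    i, j, pairs = 0, 0, []
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] == right_keys[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif left_keys[i] < right_keys[j]:
            i += 1  # left key has no partner; advance left
        else:
            j += 1  # right key has no partner; advance right
    return pairs


def hash_join(left_keys, right_keys):
    """Inner hash-join; the right side is hashed and must hold unique keys."""
    table = {key: j for j, key in enumerate(right_keys)}
    return [(i, table[key]) for i, key in enumerate(left_keys) if key in table]


right_b = [0, 2]
print(merge_join([0, 1, 2], right_b))  # [(0, 0), (2, 1)]
print(hash_join([1, 0, 2], right_b))   # [(1, 0), (2, 1)]
```

The merge-join walks both inputs once but is only correct on sorted keys, which is why a wrong is_on_sorted flag gives undefined results. The hash-join accepts the left keys in any order but silently keeps a single row per key on the hashed side, hence the no-duplicates requirement.
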
Parameters:
other : DataFrame

With which to merge.

how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional

Which kind of join to do.

on : str or list or None, optional

The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.

suffixes : tuple of str, optional

Suffixes to append to columns not in on that have the same name in both DataFrames.

algorithm : {‘merge’, ‘hash’}, optional

Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.

is_on_sorted : bool, optional

Whether the on columns are known to be already sorted, in which case a faster algorithm can be employed. If False, the DataFrame will first be sorted by the on columns.

is_on_unique : bool, optional

Whether the on-column values are known to be unique, in which case a faster algorithm can be employed.

Returns:
DataFrame

DataFrame containing the merge result, with the on columns as index.

rename(columns)[source]

Returns a new DataFrame with renamed columns.

Currently a simplified version of Pandas’ rename.

Parameters:
columns : dict

Old names to new names.

Returns:
DataFrame

With columns renamed, if found.

reset_index()[source]

Returns a new DataFrame with previous index as column(s).

Returns:
DataFrame

DataFrame with the new index a RangeIndex of its length.

set_index(keys)[source]

Set the index of the DataFrame to be the keys columns.

Note this means that the old index is removed.

Parameters:
keys : str or list of str

Which column(s) to set as the index.

Returns:
DataFrame

DataFrame with the index set to the column(s) corresponding to the keys.

sort_index(ascending=True)[source]

Sort the index of the DataFrame.

Currently MultiIndex is not supported since Weld is missing multiple-column sort.

Note this is an expensive operation (brings all data to Weld).

Parameters:
ascending : bool, optional
Returns:
DataFrame

DataFrame sorted according to the index.

sort_values(by, ascending=True)[source]

Sort the DataFrame based on a column.

Unlike Pandas, one can sort by data from both index and regular columns.

Currently possible to sort only on a single column since Weld is missing multiple-column sort. Note this is an expensive operation (brings all data to Weld).

Parameters:
by : str or list of str

Column names to sort.

ascending : bool, optional
Returns:
DataFrame

DataFrame sorted according to the column.
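
Sorting a frame by one column amounts to computing an ordering from that column and applying it to the index and to every other column. The sketch below shows this in plain Python on the data from the class example; it illustrates the idea only and is not Baloo's implementation.

```python
index = [0, 1, 2]
a = [5, 6, 7]
b = [1, 0, 2]

# Row positions ordered by the values of column 'b' (an argsort).
order = sorted(range(len(b)), key=lambda i: b[i])

sorted_index = [index[i] for i in order]
sorted_a = [a[i] for i in order]
sorted_b = [b[i] for i in order]
print(sorted_index, sorted_a, sorted_b)  # [1, 0, 2] [6, 5, 7] [0, 1, 2]
```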

tail(n=5)[source]

Return DataFrame with last n values per column.

Parameters:
n : int

Number of values.

Returns:
DataFrame

DataFrame containing the last n values per column.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.tail(2).evaluate())
       a    b
---  ---  ---
  1    6    1
  2    7    2
to_csv(filepath, sep=', ', header=True, index=True)[source]

Save DataFrame as csv.

Parameters:
filepath : str
sep : str, optional

Separator used between values.

header : bool, optional

Whether to save the header.

index : bool, optional

Whether to save the index columns.

Returns:
None
to_pandas()[source]

Convert to pandas DataFrame.

Note the data is expected to be evaluated.

Returns:
pandas.DataFrame
values

Alias for data attribute.

Returns:
dict

The internal dict data representation.

baloo.core.series module

class baloo.core.series.Series(data=None, index=None, dtype=None, name=None)[source]

Bases: baloo.weld.lazy_result.LazyArrayResult, baloo.core.generic.BinaryOps, baloo.core.generic.BitOps, baloo.core.generic.BalooCommon

Weld-ed pandas Series.

Examples

>>> import baloo as bl
>>> import numpy as np
>>> sr = bl.Series([0, 1, 2])
>>> sr
Series(name=None, dtype=int64)
>>> sr.index
RangeIndex(start=0, stop=3, step=1)
>>> sr = sr.evaluate()
>>> sr  # repr
Series(name=None, dtype=int64)
>>> print(sr)  # str
<BLANKLINE>
---  --
  0   0
  1   1
  2   2
>>> sr.index
Index(name=None, dtype=int64)
>>> print(sr.index)
[0 1 2]
>>> len(sr)  # eager computation
3
>>> sr.values
array([0, 1, 2])
>>> (sr + 2).evaluate().values
array([2, 3, 4])
>>> (sr - bl.Index(np.arange(3))).evaluate().values
array([0, 0, 0])
>>> print(sr.max().evaluate())
2
>>> print(sr.var().evaluate())
1.0
>>> print(sr.agg(['min', 'std']).evaluate())
<BLANKLINE>
---  --
min   0
std   1
Attributes:
index
dtype
name
iloc

Retrieve Indexer by index.

__getitem__(item)[source]

Select from the Series.

Supported selection functionality exemplified below.

Examples

>>> sr = bl.Series(np.arange(5, dtype=np.float32), name='Test')
>>> sr = sr[sr > 0]
>>> sr
Series(name=Test, dtype=float32)
>>> print(sr.evaluate())
       Test
---  ------
  1       1
  2       2
  3       3
  4       4
>>> sr = sr[(sr != 1) & ~(sr > 3)]
>>> print(sr.evaluate())
       Test
---  ------
  2       2
  3       3
>>> print(sr[:1].evaluate())
       Test
---  ------
  2       2
__init__(data=None, index=None, dtype=None, name=None)[source]

Initialize a Series object.

Parameters:
data : numpy.ndarray or WeldObject or list, optional

Raw data or Weld expression.

index : Index or RangeIndex or MultiIndex, optional

Index linked to the data; it is assumed to be of the same length. RangeIndex by default.

dtype : numpy.dtype or type, optional

Desired NumPy dtype for the elements. If type, it must be a NumPy type, e.g. np.float32. If data is np.ndarray with a dtype different to dtype argument, it is astype’d to the argument dtype. Note that if data is WeldObject, one must explicitly astype to convert type. Inferred from data by default.

name : str, optional

Name of the Series.

agg(aggregations)[source]

Perform multiple aggregations at once, optimized.

Parameters:
aggregations : list of str

Which aggregations to perform.

Returns:
Series

Series with resulting aggregations.

apply(func, mapping=None, new_dtype=None, **kwargs)[source]

Apply an element-wise UDF to the Series.

There are currently 6 options for using a UDF. The first 4 are lazy; the other 2 are eager and require the use of the raw decorator:

  • One of the predefined functions in baloo.functions.
  • Implementing a function which encodes the result. kwargs are automatically passed to it.
  • Pure Weld code and mapping.
  • Weld code and mapping along with a dynamically linked C++ lib containing the UDF.
  • Using a NumPy function, which however is EAGER and hence requires self.values to be raw. Additionally, not all NumPy functions support kwargs, so the raw decorator must be used to strip away weld_type.
  • Implementing an eager function with the same precondition as above. Use the raw decorator to check this.
Parameters:
func : function or str

Weld code as a str to encode or function from baloo.functions.

mapping : dict, optional

Additional mappings in the weld_template to replace on execution. self is added by default to refer to this Series.

new_dtype : numpy.dtype, optional

Specify the new dtype of the result Series. If None, it assumes it’s the same dtype as before the apply.

Returns:
Series

With UDF result.

Examples

>>> import baloo as bl
>>> sr = bl.Series([1, 2, 3])
>>> weld_template = 'map({self}, |e| e + {scalar})'
>>> mapping = {'scalar': '2L'}
>>> print(sr.apply(weld_template, mapping).evaluate())
<BLANKLINE>
---  --
  0   3
  1   4
  2   5
>>> weld_template2 = 'map({self}, |e| e + 3L)'
>>> print(sr.apply(weld_template2).evaluate())
<BLANKLINE>
---  --
  0   4
  1   5
  2   6
>>> print(bl.Series([1., 4., 100.]).apply(bl.sqrt).evaluate())  # lazy predefined function
<BLANKLINE>
---  --
  0   1
  1   2
  2  10
>>> sr = bl.Series([4, 2, 3, 1])
>>> print(sr.apply(bl.sort, kind='q').evaluate())  # eager wrapper over np.sort (which uses raw decorator)
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(np.sort, kind='q')).evaluate())  # np.sort directly
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(lambda x: np.sort(x, kind='q'))).evaluate())  # lambda also works, with x = np.array
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4

# check tests/core/cudf/* and tests/core/test_series.test_cudf for C UDF example
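
The role of mapping in the Weld-template examples can be sketched as placeholder substitution. This is an assumption about the mechanism, shown with Python's own string formatting; the identifier '_inp0' standing in for this Series is made up for illustration and is not a name Baloo is known to use.

```python
weld_template = 'map({self}, |e| e + {scalar})'
mapping = {'scalar': '2L'}

# 'self' is filled in automatically according to the docs above;
# '_inp0' is a hypothetical identifier for the Series data.
substitutions = {'self': '_inp0', **mapping}
weld_code = weld_template.format_map(substitutions)
print(weld_code)  # map(_inp0, |e| e + 2L)
```

Note that values in mapping are Weld literals written as strings ('2L' is a 64-bit integer literal), not Python numbers.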

dropna()[source]

Returns Series without null values according to Baloo’s convention.

Returns:
Series

Series with no null values.

evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)[source]

Evaluates by creating a Series containing evaluated data and index.

See LazyResult

Returns:
Series

Series with evaluated data and index.

Examples

>>> sr = bl.Series(np.arange(3)) > 0
>>> weld_code = sr.values  # accessing values now returns the weld code as a string
>>> sr = sr.evaluate()
>>> sr.values  # now it is evaluated to raw data
array([False,  True,  True])
fillna(value)[source]

Returns Series with missing values replaced with value.

Parameters:
value : {int, float, bytes, bool}

Scalar value to replace missing values with.

Returns:
Series

With missing values replaced.

classmethod from_pandas(series)[source]

Create baloo Series from pandas Series.

Parameters:
series : pandas.Series
Returns:
Series
head(n=5)[source]

Return Series with first n values.

Parameters:
n : int

Number of values.

Returns:
Series

Series containing the first n values.

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.head(2).evaluate())
<BLANKLINE>
---  --
  0   0
  1   1
iloc

Retrieve Indexer by index.

Supported iloc functionality exemplified below.

Returns:
_ILocIndexer

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.iloc[2].evaluate())
2
>>> print(sr.iloc[0:2].evaluate())
<BLANKLINE>
---  --
  0   0
  1   1
>>> print(sr.iloc[bl.Series(np.array([0, 2]))].evaluate())
<BLANKLINE>
---  --
  0   0
  2   2
str

Get access to string functions.

Returns:
StringMethods

Examples

>>> sr = bl.Series([b' aB ', b'GoOsfrABA'])
>>> print(sr.str.lower().evaluate())
<BLANKLINE>
---  ---------
  0   ab
  1  goosfraba
tail(n=5)[source]

Return Series with the last n values.

Parameters:
n : int

Number of values.

Returns:
Series

Series containing the last n values.

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.tail(2).evaluate())
<BLANKLINE>
---  --
  1   1
  2   2
to_pandas()[source]

Convert to pandas Series.

Returns:
pandas.Series
unique()[source]

Return unique values in the Series.

Note that because it is hash-based, the result will NOT be in the same order (unlike pandas).

Returns:
LazyArrayResult

Unique values in random order.
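
Because the order of the result is not guaranteed, code that checks or consumes unique values should not depend on it. A minimal plain-Python illustration of the same situation, using a set as the hash-based container (not Baloo code):

```python
values = [3, 1, 3, 2, 1]

# A set is hash-based too, so its iteration order should not be relied upon.
unique_values = set(values)

# Sort (or compare as sets) when a deterministic result is needed.
print(sorted(unique_values))  # [1, 2, 3]
```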

baloo.core.generic module

class baloo.core.generic.BalooCommon[source]
empty

Check whether the data structure is empty.

Returns:
bool
evaluate()[source]

Evaluate by returning object of the same type but now containing raw data.

values

The internal data representation.

class baloo.core.generic.BinaryOps[source]
class baloo.core.generic.BitOps[source]
class baloo.core.generic.IndexCommon[source]
name

Name of the Index.

Returns:
str

baloo.core.groupby module

class baloo.core.groupby.DataFrameGroupBy(df, by: list)[source]

Object encoding a groupby operation.

max()[source]
mean()[source]
min()[source]
prod()[source]
size()[source]
std()[source]
sum()[source]
var()[source]

baloo.core.strings module

class baloo.core.strings.StringMethods(data)[source]
capitalize()[source]

Convert first character to uppercase and remainder to lowercase.

Returns:
Series
contains(pat)[source]

Test if pat is included within elements.

Parameters:
pat : str
Returns:
Series
endswith(pat)[source]

Test if elements end with pat.

Parameters:
pat : str
Returns:
Series
find(sub, start=0, end=None)[source]

Test if elements contain substring.

Parameters:
sub : str
start : int, optional

Index to start searching from.

end : int, optional

Index to stop searching at.

Returns:
Series
get(i)[source]

Extract i’th character of each element.

Parameters:
i : int
Returns:
Series
lower()[source]

Convert all characters to lowercase.

Returns:
Series
replace(pat, rep)[source]

Replace first occurrence of pat with rep in each element.

Parameters:
pat : str
rep : str
Returns:
Series
slice(start=None, stop=None, step=None)[source]

Slice substrings from each element.

Note that negative step is currently not supported.

Parameters:
start : int
stop : int
step : int
Returns:
Series
split(pat, side='left')[source]

Split each element once, from the left, and select which side to return.

Note this is unlike pandas split in that it essentially combines the split with a select.

Parameters:
pat : str
side : {‘left’, ‘right’}

Which side of the split to select and return in each element.

Returns:
Series
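
The combined split-and-select can be pictured with Python's bytes.partition. This sketches the described semantics only; split_once is a hypothetical helper, and what Baloo returns when pat does not occur in an element is not stated here, so the behavior in that case is purely an assumption of this sketch.

```python
def split_once(element, pat, side='left'):
    """Split element at the first occurrence of pat and return one side."""
    left, _, right = element.partition(pat)
    # If pat is absent, partition yields (element, b'', b''); Baloo may differ.
    return left if side == 'left' else right


print(split_once(b'key=value=x', b'=', 'left'))   # b'key'
print(split_once(b'key=value=x', b'=', 'right'))  # b'value=x'
```
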
startswith(pat)[source]

Test if elements start with pat.

Parameters:
pat : str
Returns:
Series
strip()[source]

Strip whitespace from start and end of each element.

Note that it currently only strips spaces (ASCII 32), not tabs or end-of-line characters.

Returns:
Series
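
The space-only behavior differs from Python's default strip, which removes all ASCII whitespace. The comparison below uses plain Python bytes to show the difference; passing b' ' to strip is used here as a stand-in for the documented behavior, not as Baloo's implementation.

```python
element = b' \tpadded\n '

# Stripping only spaces (ASCII 32) leaves the tab and newline in place.
print(element.strip(b' '))  # b'\tpadded\n'

# Python's default strip removes all ASCII whitespace.
print(element.strip())      # b'padded'
```
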
upper()[source]

Convert all characters to uppercase.

Returns:
Series