baloo.core package¶

Subpackages¶

baloo.core.indexes package

baloo.core.frame module¶

class baloo.core.frame.DataFrame(data=None, index=None)[source]¶

Bases: baloo.core.generic.BinaryOps, baloo.core.generic.BalooCommon

Weld-ed pandas DataFrame.

See also

pandas.DataFrame: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Examples

>>> import baloo as bl
>>> import numpy as np
>>> from collections import OrderedDict
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', [1, 0, 2]))))
>>> df.index  # repr
RangeIndex(start=0, stop=3, step=1)
>>> df  # repr
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[a: int64, b: int64])
>>> print(df.evaluate())  # omitting evaluate would trigger exception as index is now an unevaluated RangeIndex
       a    b
---  ---  ---
  0    5    1
  1    6    0
  2    7    2
>>> print(len(df))
3
>>> print((df * 2).evaluate())
       a    b
---  ---  ---
  0   10    2
  1   12    0
  2   14    4
>>> print((df * [2, 3]).evaluate())
       a    b
---  ---  ---
  0   10    3
  1   12    0
  2   14    6
>>> print(df.min().evaluate())
<BLANKLINE>
---  --
a     5
b     0
>>> print(df.mean().evaluate())
<BLANKLINE>
---  --
a     6
b     1
>>> print(df.agg(['var', 'count']).evaluate())
         a    b
-----  ---  ---
var      1    1
count    3    3
>>> df.rename({'a': 'c'})
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[c: int64, b: int64])
>>> df.drop('a')
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[b: int64])
>>> print(df.reset_index().evaluate())
       index    a    b
---  -------  ---  ---
  0        0    5    1
  1        1    6    0
  2        2    7    2
>>> print(df.set_index('b').evaluate())
  b    a
---  ---
  1    5
  0    6
  2    7
>>> print(df.sort_values('b').evaluate())
       a    b
---  ---  ---
  1    6    0
  0    5    1
  2    7    2
>>> df2 = bl.DataFrame({'b': np.array([0, 2])})
>>> print(df.merge(df2, on='b').evaluate())
  b    index_x    a    index_y
---  ---------  ---  ---------
  0          1    6          0
  2          2    7          1
>>> df3 = bl.DataFrame({'a': [1., -999., 3.]}, bl.Index([-999, 1, 2]))
>>> print(df3.dropna().evaluate())
        a
----  ---
-999    1
   2    3
>>> print(df3.fillna({'a': 15}).evaluate())
        a
----  ---
-999    1
   1   15
   2    3
>>> print(bl.DataFrame({'a': [0, 1, 1, 2], 'b': [1, 2, 3, 4]}).groupby('a').sum().evaluate())
  a    b
---  ---
  0    1
  2    4
  1    5

Attributes:

index
dtypes: Series of NumPy dtypes present in the DataFrame with index of column names.
columns: Index of the column names present in the DataFrame in order.
iloc: Retrieve Indexer by index.

__getitem__(item)[source]¶

Select from the DataFrame.

Supported functionality exemplified below.

Examples

>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> print(df['a'].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[['a']].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[df['a'] < 7].evaluate())
       a
---  ---
  0    5
  1    6

__init__(data=None, index=None)[source]¶

Initialize a DataFrame object.

Note that, unlike pandas, there is currently no index inference or alignment between the indexes of any Series passed as data. That is, all data, whether raw or Series, inherits the index of the DataFrame. Alignment is currently restricted to setitem.

Parameters:

data : dict, optional: Data as a dict of str -> np.ndarray or Series or list.
index : Index or RangeIndex or MultiIndex, optional: Index linked to the data; it is assumed to be of the same length. RangeIndex by default.

__len__()[source]¶

Eagerly get the length of the DataFrame.

Note that if the length is unknown (such as for WeldObjects), it will be eagerly computed.

Returns:	int Length of the DataFrame.

__setitem__(key, value)[source]¶

Add/update DataFrame column.

Note that raw data is NOT checked for having the same length as the DataFrame, since the length may not be known before evaluation. Columns of different lengths are therefore possible when using raw data, which might lead to unexpected behavior. To avoid this, use the more expensive setitem by wrapping the data in a Series. Conversely, if the indexes are known to match and the data has the same length as the DataFrame, setitem with raw data is more efficient.

Parameters:

key : str: Column name.
value : numpy.ndarray or Series: If a Series, the data will be aligned based on the index of the DataFrame, i.e. df.index left join sr.index.

Examples

>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> df['b'] = np.arange(3)
>>> print(df.evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1
  2    7    2
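
The left-join alignment described for Series values can be pictured in plain Python. This is only a sketch of the semantics, not Baloo's Weld implementation: `align_left` and the `MISSING` placeholder are hypothetical names, and how Baloo actually marks unmatched rows is not stated in this reference.

```python
# Sketch of "df.index left join sr.index" alignment (illustrative only).
MISSING = None  # hypothetical placeholder for labels with no match


def align_left(df_index, sr_index, sr_values):
    """Return sr_values reordered to follow df_index; unmatched labels get MISSING."""
    lookup = dict(zip(sr_index, sr_values))
    return [lookup.get(label, MISSING) for label in df_index]


aligned = align_left([0, 1, 2], [2, 0], [20, 10])
print(aligned)  # [10, None, 20]
```

Building the lookup table is the extra cost that raw data avoids, which is why the raw path is cheaper when the indexes are already known to match.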

agg(aggregations)[source]¶

Multiple aggregations optimized.

Parameters:	aggregations : list of str Which aggregations to perform.
Returns:	DataFrame DataFrame with the aggregations per column.

astype(dtype)[source]¶

Cast DataFrame columns to given dtype.

Parameters:	dtype : numpy.dtype or dict Dtype or column_name -> dtype mapping to cast columns to. Note index is excluded.
Returns:	DataFrame With cast columns.

columns¶

Index of the column names present in the DataFrame in order.

Returns:	Index

drop(columns)[source]¶

Drop 1 or more columns. Any column which does not exist in the DataFrame is skipped, i.e. not removed, without raising an exception.

Unlike Pandas’ drop, this is currently restricted to dropping columns.

Parameters:	columns : str or list of str Column name or list of column names to drop.
Returns:	DataFrame A new DataFrame without these columns.
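
The skip-if-absent behavior can be sketched on a plain dict of columns. This is an illustration of the documented semantics only; `drop_columns` is a hypothetical helper, not part of Baloo.

```python
# Sketch: dropping columns while silently ignoring unknown names.
def drop_columns(data, columns):
    """Return a new dict without `columns`; unknown names are ignored."""
    to_drop = {columns} if isinstance(columns, str) else set(columns)
    return {name: col for name, col in data.items() if name not in to_drop}


data = {'a': [5, 6, 7], 'b': [1, 0, 2]}
print(drop_columns(data, 'a'))         # {'b': [1, 0, 2]}
print(drop_columns(data, ['a', 'z']))  # {'b': [1, 0, 2]}
```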

drop_duplicates(subset=None, keep='min')[source]¶

Return DataFrame with duplicate rows (excluding index) removed, optionally only considering subset columns.

Note that the row order is NOT maintained due to hashing.

Parameters:

subset : list of str, optional: Which columns to consider.
keep : {‘+’, ‘*’, ‘min’, ‘max’}, optional: What to select from the duplicate rows. These correspond to the possible merge operations in Weld. Note that ‘+’ and ‘*’ might produce unexpected results for strings.
Returns:	DataFrame DataFrame without duplicate rows.
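
One way to picture the `keep` options is as a reduction over the rows that share a key. The sketch below is plain Python under the assumption that `keep` is applied to the values of rows with the same subset key; the source only implies this, and `drop_duplicates_sketch` and `MERGE_OPS` are hypothetical names.

```python
# Sketch: collapse duplicate keys with a merge operation (illustrative only).
import operator

MERGE_OPS = {'+': operator.add, '*': operator.mul, 'min': min, 'max': max}


def drop_duplicates_sketch(keys, values, keep='min'):
    """Merge the values of rows sharing a key using the `keep` operation."""
    merge = MERGE_OPS[keep]
    merged = {}
    for key, value in zip(keys, values):
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged


keys = [1, 2, 1, 2]
values = [5, 7, 3, 9]
print(drop_duplicates_sketch(keys, values, keep='min'))  # {1: 3, 2: 7}
print(drop_duplicates_sketch(keys, values, keep='+'))    # {1: 8, 2: 16}
```

A Python dict keeps insertion order, whereas the documented method makes no ordering promise. Applying `'+'` to strings would concatenate them in this sketch, which hints at why that option is flagged as surprising for string data.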

dropna(subset=None)[source]¶

Remove missing values according to Baloo’s convention.

Parameters:	subset : list of str, optional Which columns to check for missing values in.
Returns:	DataFrame DataFrame with no null values in columns.

dtypes¶

Series of NumPy dtypes present in the DataFrame with index of column names.

Returns:	Series

empty¶

Check whether the data structure is empty.

Returns:	bool

evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)[source]¶

Evaluates by creating a DataFrame containing evaluated data and index.

See LazyResult

Returns:	DataFrame DataFrame with evaluated data and index.

fillna(value)[source]¶

Returns DataFrame with missing values replaced with value.

Parameters:	value : {int, float, bytes, bool} or dict Scalar value to replace missing values with. If dict, replaces missing values only in the key columns with the value scalar.
Returns:	DataFrame With missing values replaced.
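
The difference between a scalar and a dict `value` can be sketched on a plain dict of columns. The `-999` sentinel is inferred from the class-level `dropna`/`fillna` example and is an assumption here, as is the hypothetical helper `fillna_sketch`; Baloo's actual convention may differ by dtype.

```python
# Sketch: scalar vs per-column replacement of a missing-value sentinel.
MISSING = -999  # assumption: sentinel inferred from the class-level example


def fillna_sketch(data, value):
    """Replace MISSING in every column (scalar) or only in the dict's key columns."""
    per_column = value if isinstance(value, dict) else {name: value for name in data}
    return {
        name: [per_column[name] if x == MISSING and name in per_column else x
               for x in column]
        for name, column in data.items()
    }


data = {'a': [1.0, -999.0, 3.0], 'b': [-999, 4, 5]}
print(fillna_sketch(data, {'a': 15}))  # {'a': [1.0, 15, 3.0], 'b': [-999, 4, 5]}
print(fillna_sketch(data, 0))          # {'a': [1.0, 0, 3.0], 'b': [0, 4, 5]}
```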

classmethod from_pandas(df)[source]¶

Create baloo DataFrame from pandas DataFrame.

Parameters:	df : pandas.DataFrame
Returns:	DataFrame

groupby(by)[source]¶

Group by certain columns, excluding index.

To group by an index column as well, call reset_index first.

Parameters:	by : str or list of str Column(s) to groupby.
Returns:	DataFrameGroupBy Object encoding the groupby operation.
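
As a rough mental model, a groupby followed by `sum` behaves like a hash-based accumulation. The sketch below is plain Python, not Baloo's Weld code, and `groupby_sum` is a hypothetical helper; it reuses the data from the class-level groupby example.

```python
# Sketch: hash-based group-and-sum over two parallel columns.
def groupby_sum(keys, values):
    """Sum `values` per distinct key."""
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    return totals


result = groupby_sum([0, 1, 1, 2], [1, 2, 3, 4])
print(result)  # {0: 1, 1: 5, 2: 4}
```

The class-level example prints the groups in the order 0, 2, 1, so the order of groups in the result should not be relied upon.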

head(n=5)[source]¶

Return DataFrame with first n values per column.

Parameters:	n : int Number of values.
Returns:	DataFrame DataFrame containing the first n values per column.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.head(2).evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1

iloc¶

Retrieve Indexer by index.

Supported iloc functionality exemplified below.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.array([1, 0, 2])))))
>>> print(df.iloc[0:2].evaluate())
       a    b
---  ---  ---
  0    5    1
  1    6    0
>>> print(df.iloc[bl.Series(np.array([0, 2]))].evaluate())
       a    b
---  ---  ---
  0    5    1
  2    7    2

join(other, on=None, how='left', lsuffix=None, rsuffix=None, algorithm='merge', is_on_sorted=True, is_on_unique=True)[source]¶

Database-like join this DataFrame with the other DataFrame.

Currently assumes that the on columns are sorted and that their values are unique! Future work will handle the other cases.

Note there’s no automatic cast if the type of the on columns differs.

Check DataFrame.merge() for more details.

Parameters:

other : DataFrame: With which to merge.
on : str or list or None, optional: The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.
how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional: Which kind of join to do.
lsuffix : str, optional: Suffix to use on columns that overlap from self.
rsuffix : str, optional: Suffix to use on columns that overlap from other.
algorithm : {‘merge’, ‘hash’}, optional: Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.
is_on_sorted : bool, optional: If we know that the on columns are already sorted, can employ faster algorithm.
is_on_unique : bool, optional: If we know that the values are unique, can employ faster algorithm.

Returns:

DataFrame: DataFrame containing the merge result, with the on columns as index.

keys()[source]¶

Retrieve column names as Index, i.e. for axis=1.

Returns:	Index Column names as an Index.

merge(other, how='inner', on=None, suffixes=('_x', '_y'), algorithm='merge', is_on_sorted=False, is_on_unique=True)[source]¶

Database-like join this DataFrame with the other DataFrame.

Currently assumes the on-column(s) values are unique!

Note there’s no automatic cast if the type of the on columns differs.

Algorithms and limitations:

Merge algorithms: merge-join or hash-join. Typical pros and cons apply when choosing between the two. Merge-join should be used on fairly equally-sized DataFrames, while a hash-join is better when one of the DataFrames is (much) smaller.

Limitations:
- Hash-join requires the (smaller) hashed DataFrame (more precisely, the on columns) to contain no duplicates!
- Merge-join requires the on-columns to be sorted!
- For unsorted data, only a single column can be sorted! (current Weld limitation)

Sortedness: If the on-columns are sorted, merge-join does not need to sort the data, so it can be significantly faster. Do add is_on_sorted=True if this is known to be true!

Uniqueness: If the on-columns data contains duplicates, the algorithm is more complicated and hence slower. Also, hash-join cannot be used on a hashed (smaller) DataFrame with duplicates. Do add is_on_unique=True if this is known to be true!

Setting the above 2 flags incorrectly, e.g. is_on_sorted to True when the data is in fact not sorted, will produce undefined results.
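
The two join strategies can be compared with a textbook sketch on plain lists of keys. This is not Baloo's Weld implementation: `merge_join` and `hash_join` are hypothetical helpers that perform an inner join on unique keys only, and the last call shows how unsorted input silently yields a wrong answer.

```python
# Sketch: inner merge-join vs hash-join on lists of unique keys.
def merge_join(left, right):
    """Join two sorted lists of unique keys by advancing two cursors."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def hash_join(left, right):
    """Join by hashing the (smaller, duplicate-free) right side."""
    hashed = set(right)
    return [key for key in left if key in hashed]


left = [0, 1, 2]
right = [0, 2]
print(merge_join(left, right))       # [0, 2]
print(hash_join(left, right))        # [0, 2]
print(merge_join([1, 0, 2], right))  # [2]  (unsorted input loses the match on 0)
```

The merge-join walks both inputs once, which suits similarly sized, sorted data; the hash-join pays to hash one side and then probes it, which favors a small hashed side.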

Parameters:

other : DataFrame: With which to merge.
how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional: Which kind of join to do.
on : str or list or None, optional: The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.
suffixes : tuple of str, optional: To append on columns not in on that have the same name in the DataFrames.
algorithm : {‘merge’, ‘hash’}, optional: Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.
is_on_sorted : bool, optional: If we know that the on columns are already sorted, can employ faster algorithm. If False, the DataFrame will first be sorted by the on columns.
is_on_unique : bool, optional: If we know that the values are unique, can employ faster algorithm.

Returns:

DataFrame: DataFrame containing the merge result, with the on columns as index.

rename(columns)[source]¶

Returns a new DataFrame with renamed columns.

Currently a simplified version of Pandas’ rename.

Parameters:	columns : dict Old names to new names.
Returns:	DataFrame With columns renamed, if found.

reset_index()[source]¶

Returns a new DataFrame with previous index as column(s).

Returns:	DataFrame DataFrame with the new index a RangeIndex of its length.

set_index(keys)[source]¶

Set the index of the DataFrame to be the keys columns.

Note this means that the old index is removed.

Parameters:	keys : str or list of str Which column(s) to set as the index.
Returns:	DataFrame DataFrame with the index set to the column(s) corresponding to the keys.

sort_index(ascending=True)[source]¶

Sort the index of the DataFrame.

Currently MultiIndex is not supported since Weld is missing multiple-column sort.

Note this is an expensive operation (brings all data to Weld).

Parameters:	ascending : bool, optional
Returns:	DataFrame DataFrame sorted according to the index.

sort_values(by, ascending=True)[source]¶

Sort the DataFrame based on a column.

Unlike Pandas, one can sort by data from both index and regular columns.

Currently possible to sort only on a single column since Weld is missing multiple-column sort. Note this is an expensive operation (brings all data to Weld).

Parameters:

by : str or list of str: Column names to sort.
ascending : bool, optional
Returns:	DataFrame DataFrame sorted according to the column.
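
Sorting by a single column reorders the index and every other column with the same permutation. The sketch below shows that idea in plain Python using the data from the class-level `sort_values` example; `sort_rows_by` is a hypothetical helper and does not reflect how Weld performs the sort.

```python
# Sketch: sort all columns and the index by one column's values.
def sort_rows_by(index, columns, by, ascending=True):
    """Return (index, columns) reordered by the values in columns[by]."""
    order = sorted(range(len(index)), key=lambda i: columns[by][i],
                   reverse=not ascending)
    new_index = [index[i] for i in order]
    new_columns = {name: [col[i] for i in order] for name, col in columns.items()}
    return new_index, new_columns


index, columns = sort_rows_by([0, 1, 2], {'a': [5, 6, 7], 'b': [1, 0, 2]}, by='b')
print(index)    # [1, 0, 2]
print(columns)  # {'a': [6, 5, 7], 'b': [0, 1, 2]}
```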

tail(n=5)[source]¶

Return DataFrame with last n values per column.

Parameters:	n : int Number of values.
Returns:	DataFrame DataFrame containing the last n values per column.

Examples

>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.tail(2).evaluate())
       a    b
---  ---  ---
  1    6    1
  2    7    2

to_csv(filepath, sep=', ', header=True, index=True)[source]¶

Save DataFrame as csv.

Parameters:

filepath : str
sep : str, optional: Separator used between values.
header : bool, optional: Whether to save the header.
index : bool, optional: Whether to save the index columns.
Returns:	None

to_pandas()[source]¶

Convert to pandas DataFrame.

Note the data is expected to be evaluated.

Returns:	pandas.DataFrame

values¶

Alias for data attribute.

Returns:	dict The internal dict data representation.

baloo.core.series module¶

class baloo.core.series.Series(data=None, index=None, dtype=None, name=None)[source]¶

Bases: baloo.weld.lazy_result.LazyArrayResult, baloo.core.generic.BinaryOps, baloo.core.generic.BitOps, baloo.core.generic.BalooCommon

Weld-ed Pandas Series.

See also

pandas.Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Examples

>>> import baloo as bl
>>> import numpy as np
>>> sr = bl.Series([0, 1, 2])
>>> sr
Series(name=None, dtype=int64)
>>> sr.index
RangeIndex(start=0, stop=3, step=1)
>>> sr = sr.evaluate()
>>> sr  # repr
Series(name=None, dtype=int64)
>>> print(sr)  # str
<BLANKLINE>
---  --
  0   0
  1   1
  2   2
>>> sr.index
Index(name=None, dtype=int64)
>>> print(sr.index)
[0 1 2]
>>> len(sr)  # eager computation
3
>>> sr.values
array([0, 1, 2])
>>> (sr + 2).evaluate().values
array([2, 3, 4])
>>> (sr - bl.Index(np.arange(3))).evaluate().values
array([0, 0, 0])
>>> print(sr.max().evaluate())
2
>>> print(sr.var().evaluate())
1.0
>>> print(sr.agg(['min', 'std']).evaluate())
<BLANKLINE>
---  --
min   0
std   1

Attributes:	index dtype name `iloc` Retrieve Indexer by index.

__getitem__(item)[source]¶

Select from the Series.

Supported selection functionality exemplified below.

Examples

>>> sr = bl.Series(np.arange(5, dtype=np.float32), name='Test')
>>> sr = sr[sr > 0]
>>> sr
Series(name=Test, dtype=float32)
>>> print(sr.evaluate())
       Test
---  ------
  1       1
  2       2
  3       3
  4       4
>>> sr = sr[(sr != 1) & ~(sr > 3)]
>>> print(sr.evaluate())
       Test
---  ------
  2       2
  3       3
>>> print(sr[:1].evaluate())
       Test
---  ------
  2       2

__init__(data=None, index=None, dtype=None, name=None)[source]¶

Initialize a Series object.

Parameters:

data : numpy.ndarray or WeldObject or list, optional: Raw data or Weld expression.
index : Index or RangeIndex or MultiIndex, optional: Index linked to the data; it is assumed to be of the same length. RangeIndex by default.
dtype : numpy.dtype or type, optional: Desired Numpy dtype for the elements. If type, it must be a NumPy type, e.g. np.float32. If data is np.ndarray with a dtype different to dtype argument, it is astype’d to the argument dtype. Note that if data is WeldObject, one must explicitly astype to convert type. Inferred from data by default.
name : str, optional: Name of the Series.

agg(aggregations)[source]¶

Multiple aggregations optimized.

Parameters:	aggregations : list of str Which aggregations to perform.
Returns:	Series Series with resulting aggregations.

apply(func, mapping=None, new_dtype=None, **kwargs)[source]¶

Apply an element-wise UDF to the Series.

There are currently 6 options for using a UDF. The first 4 are lazy; the other 2 are eager and require the use of the raw decorator:

- One of the predefined functions in baloo.functions.
- Implementing a function which encodes the result. kwargs are automatically passed to it.
- Pure Weld code and mapping.
- Weld code and mapping along with a dynamically linked C++ lib containing the UDF.
- Using a NumPy function, which however is EAGER and hence requires self.values to be raw. Additionally, NumPy does not support kwargs in (all) functions, so the raw decorator must be used to strip away weld_type.
- Implementing an eager function with the same precondition as above. Use the raw decorator to check this.

Parameters:

func : function or str: Weld code as a str to encode or function from baloo.functions.
mapping : dict, optional: Additional mappings in the weld_template to replace on execution. self is added by default to refer to this Series.
new_dtype : numpy.dtype, optional: Specify the new dtype of the result Series. If None, it is assumed to be the same dtype as before the apply.
Returns:	Series With UDF result.

Examples

>>> import baloo as bl
>>> sr = bl.Series([1, 2, 3])
>>> weld_template = 'map({self}, |e| e + {scalar})'
>>> mapping = {'scalar': '2L'}
>>> print(sr.apply(weld_template, mapping).evaluate())
<BLANKLINE>
---  --
  0   3
  1   4
  2   5
>>> weld_template2 = 'map({self}, |e| e + 3L)'
>>> print(sr.apply(weld_template2).evaluate())
<BLANKLINE>
---  --
  0   4
  1   5
  2   6
>>> print(bl.Series([1., 4., 100.]).apply(bl.sqrt).evaluate())  # lazy predefined function
<BLANKLINE>
---  --
  0   1
  1   2
  2  10
>>> sr = bl.Series([4, 2, 3, 1])
>>> print(sr.apply(bl.sort, kind='q').evaluate())  # eager wrapper over np.sort (which uses raw decorator)
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(np.sort, kind='q')).evaluate())  # np.sort directly
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(lambda x: np.sort(x, kind='q'))).evaluate())  # lambda also works, with x = np.array
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4

# check tests/core/cudf/* and tests/core/test_series.test_cudf for C UDF example

dropna()[source]¶

Returns Series without null values according to Baloo’s convention.

Returns:	Series Series with no null values.

evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)[source]¶

Evaluates by creating a Series containing evaluated data and index.

See LazyResult

Returns:	Series Series with evaluated data and index.

Examples

>>> sr = bl.Series(np.arange(3)) > 0
>>> weld_code = sr.values  # accessing values now returns the weld code as a string
>>> sr = sr.evaluate()
>>> sr.values  # now it is evaluated to raw data
array([False,  True,  True])
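
The lazy-then-evaluate pattern can be illustrated without Weld. The class below is a hypothetical stand-in written for this page, not Baloo's LazyResult: it only records work when an operation is applied and runs it when `evaluate()` is called.

```python
# Sketch: deferring computation until evaluate() is called.
class LazySketch:
    """Hypothetical lazy value: stores a thunk and runs it on evaluate()."""

    def __init__(self, thunk):
        self._thunk = thunk

    def __add__(self, scalar):
        # Building the new node performs no computation yet.
        return LazySketch(lambda: [x + scalar for x in self._thunk()])

    def evaluate(self):
        return self._thunk()


calls = []


def load():
    calls.append('load')  # records when the data is actually touched
    return [0, 1, 2]


expr = LazySketch(load) + 2
print(calls)            # []
evaluated = expr.evaluate()
print(evaluated)        # [2, 3, 4]
print(calls)            # ['load']
```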

fillna(value)[source]¶

Returns Series with missing values replaced with value.

Parameters:	value : {int, float, bytes, bool} Scalar value to replace missing values with.
Returns:	Series With missing values replaced.

classmethod from_pandas(series)[source]¶

Create baloo Series from pandas Series.

Parameters:	series : pandas.Series
Returns:	Series

head(n=5)[source]¶

Return Series with first n values.

Parameters:	n : int Number of values.
Returns:	Series Series containing the first n values.

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.head(2).evaluate())
<BLANKLINE>
---  --
  0   0
  1   1

iloc¶

Retrieve Indexer by index.

Supported iloc functionality exemplified below.

Returns:	_ILocIndexer

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.iloc[2].evaluate())
2
>>> print(sr.iloc[0:2].evaluate())
<BLANKLINE>
---  --
  0   0
  1   1
>>> print(sr.iloc[bl.Series(np.array([0, 2]))].evaluate())
<BLANKLINE>
---  --
  0   0
  2   2

str¶

Get access to string functions.

Returns:	StringMethods

Examples

>>> sr = bl.Series([b' aB ', b'GoOsfrABA'])
>>> print(sr.str.lower().evaluate())
<BLANKLINE>
---  ---------
  0   ab
  1  goosfraba

tail(n=5)[source]¶

Return Series with the last n values.

Parameters:	n : int Number of values.
Returns:	Series Series containing the last n values.

Examples

>>> sr = bl.Series(np.arange(3))
>>> print(sr.tail(2).evaluate())
<BLANKLINE>
---  --
  1   1
  2   2

to_pandas()[source]¶

Convert to pandas Series.

Returns:	pandas.Series

unique()[source]¶

Return unique values in the Series.

Note that because it is hash-based, the result will NOT be in the same order (unlike pandas).

Returns:	LazyArrayResult Unique values in random order.

baloo.core.generic module¶

class baloo.core.generic.BalooCommon[source]¶

empty¶

Check whether the data structure is empty.

Returns:	bool

evaluate()[source]¶: Evaluate by returning object of the same type but now containing raw data.

values¶: The internal data representation.

class baloo.core.generic.BinaryOps[source]¶

class baloo.core.generic.BitOps[source]¶

class baloo.core.generic.IndexCommon[source]¶

name¶

Name of the Index.

Returns:	str name

baloo.core.groupby module¶

class baloo.core.groupby.DataFrameGroupBy(df, by: list)[source]¶

Object encoding a groupby operation.

max()[source]¶

mean()[source]¶

min()[source]¶

prod()[source]¶

size()[source]¶

std()[source]¶

sum()[source]¶

var()[source]¶

baloo.core.strings module¶

class baloo.core.strings.StringMethods(data)[source]¶

capitalize()[source]¶

Convert first character to uppercase and remainder to lowercase.

Returns:	Series

contains(pat)[source]¶

Test if pat is included within elements.

Parameters:	pat : str
Returns:	Series

endswith(pat)[source]¶

Test if elements end with pat.

Parameters:	pat : str
Returns:	Series

find(sub, start=0, end=None)[source]¶

Test if elements contain substring.

Parameters:

sub : str
start : int, optional: Index to start searching from.
end : int, optional: Index at which to stop searching.
Returns:	Series

get(i)[source]¶

Extract i’th character of each element.

Parameters:	i : int
Returns:	Series

lower()[source]¶

Convert all characters to lowercase.

Returns:	Series

replace(pat, rep)[source]¶

Replace first occurrence of pat with rep in each element.

Parameters:

pat : str
rep : str
Returns:	Series
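
For comparison, Python's built-in `bytes.replace` replaces every occurrence unless a count is given. The snippet below uses only the standard library to show the first-occurrence behavior described here; it does not call Baloo.

```python
# Sketch: first-occurrence replacement with the standard library.
element = b'abab'
first_only = element.replace(b'ab', b'X', 1)  # count=1 limits to the first match
everything = element.replace(b'ab', b'X')     # default replaces all matches
print(first_only)  # b'Xab'
print(everything)  # b'XX'
```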

slice(start=None, stop=None, step=None)[source]¶

Slice substrings from each element.

Note that negative step is currently not supported.

Parameters:

start : int
stop : int
step : int
Returns:	Series

split(pat, side='left')[source]¶

Split once each element from the left and select a side to return.

Note this is unlike pandas split in that it essentially combines the split with a select.

Parameters:

pat : str
side : {‘left’, ‘right’}: Which side of the split to select and return in each element.
Returns:	Series
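
The split-then-select semantics resemble Python's `bytes.partition`. The sketch below is an analogy in plain Python, with `split_select` as a hypothetical helper; what Baloo returns when `pat` is absent is not documented here, so that case is not claimed.

```python
# Sketch: split once from the left, then keep one side.
def split_select(element, pat, side='left'):
    """Split `element` at the first `pat` and return the chosen side."""
    left, _, right = element.partition(pat)
    return left if side == 'left' else right


print(split_select(b'a-b-c', b'-', side='left'))   # b'a'
print(split_select(b'a-b-c', b'-', side='right'))  # b'b-c'
```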

startswith(pat)[source]¶

Test if elements start with pat.

Parameters:	pat : str
Returns:	Series

strip()[source]¶

Strip whitespace from start and end of each element.

Note it currently only looks for whitespace (ASCII 32), not tabs or EOL.

Returns:	Series
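
The space-only restriction can be contrasted with Python's default `bytes.strip`, which also removes tabs and newlines. The snippet uses only the standard library to mimic the documented behavior; it is not Baloo code.

```python
# Sketch: stripping only ASCII 32 vs all ASCII whitespace.
element = b' \taB \n'
spaces_only = element.strip(b' ')  # mimics the documented behavior: spaces only
all_whitespace = element.strip()   # Python default: spaces, tabs, newlines, ...
print(spaces_only)     # b'\taB \n'
print(all_whitespace)  # b'aB'
```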

upper()[source]¶

Convert all characters to uppercase.

Returns:	Series
