baloo.core package
Subpackages
baloo.core.frame module
-
class baloo.core.frame.DataFrame(data=None, index=None)
Bases: baloo.core.generic.BinaryOps, baloo.core.generic.BalooCommon
Weld-ed pandas DataFrame.
See also
Examples
>>> import baloo as bl
>>> import numpy as np
>>> from collections import OrderedDict
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', [1, 0, 2]))))
>>> df.index  # repr
RangeIndex(start=0, stop=3, step=1)
>>> df  # repr
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[a: int64, b: int64])
>>> print(df.evaluate())  # omitting evaluate would trigger exception as index is now an unevaluated RangeIndex
       a    b
---  ---  ---
  0    5    1
  1    6    0
  2    7    2
>>> print(len(df))
3
>>> print((df * 2).evaluate())
       a    b
---  ---  ---
  0   10    2
  1   12    0
  2   14    4
>>> print((df * [2, 3]).evaluate())
       a    b
---  ---  ---
  0   10    3
  1   12    0
  2   14    6
>>> print(df.min().evaluate())
<BLANKLINE>
---  --
a     5
b     0
>>> print(df.mean().evaluate())
<BLANKLINE>
---  --
a     6
b     1
>>> print(df.agg(['var', 'count']).evaluate())
         a    b
-----  ---  ---
var      1    1
count    3    3
>>> df.rename({'a': 'c'})
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[c: int64, b: int64])
>>> df.drop('a')
DataFrame(index=RangeIndex(start=0, stop=3, step=1), columns=[b: int64])
>>> print(df.reset_index().evaluate())
       index    a    b
---  -------  ---  ---
  0        0    5    1
  1        1    6    0
  2        2    7    2
>>> print(df.set_index('b').evaluate())
  b    a
---  ---
  1    5
  0    6
  2    7
>>> print(df.sort_values('b').evaluate())
       a    b
---  ---  ---
  1    6    0
  0    5    1
  2    7    2
>>> df2 = bl.DataFrame({'b': np.array([0, 2])})
>>> print(df.merge(df2, on='b').evaluate())
  b    index_x    a    index_y
---  ---------  ---  ---------
  0          1    6          0
  2          2    7          1
>>> df3 = bl.DataFrame({'a': [1., -999., 3.]}, bl.Index([-999, 1, 2]))
>>> print(df3.dropna().evaluate())
        a
----  ---
-999    1
   2    3
>>> print(df3.fillna({'a': 15}).evaluate())
        a
----  ---
-999    1
   1   15
   2    3
>>> print(bl.DataFrame({'a': [0, 1, 1, 2], 'b': [1, 2, 3, 4]}).groupby('a').sum().evaluate())
  a    b
---  ---
  0    1
  2    4
  1    5
Attributes: -
__getitem__(item)
Select from the DataFrame.
The supported functionality is shown in the examples below.
Examples
>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> print(df['a'].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[['a']].evaluate())
       a
---  ---
  0    5
  1    6
  2    7
>>> print(df[df['a'] < 7].evaluate())
       a
---  ---
  0    5
  1    6
-
__init__(data=None, index=None)
Initialize a DataFrame object.
Note that, unlike pandas, there is currently no index inference or alignment between the indexes of any Series passed as data. That is, all data, whether raw or Series, inherits the index of the DataFrame. Alignment is currently restricted to __setitem__.
Parameters: - data : dict, optional
Data as a dict of str -> np.ndarray or Series or list.
- index : Index or RangeIndex or MultiIndex, optional
Index linked to the data; it is assumed to be of the same length. RangeIndex by default.
-
__len__()
Eagerly get the length of the DataFrame.
Note that if the length is not yet known (such as for WeldObjects), it is computed eagerly.
Returns: - int
Length of the DataFrame.
-
__setitem__(key, value)
Add or update a DataFrame column.
Note that for raw data this does NOT check that the length matches the DataFrame, because the length may not be known before evaluation. Columns of different lengths are therefore possible when using raw data, which might lead to unexpected behavior. To avoid this, wrap the data in a Series and use the more expensive, aligned setitem. Conversely, if the indexes are known to match and the data has the same length as the DataFrame, setting the raw data directly is more efficient.
Parameters: - key : str
Column name.
- value : numpy.ndarray or Series
If a Series, the data will be aligned based on the index of the DataFrame, i.e. df.index left join sr.index.
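The left-join alignment described for a Series value can be pictured with a small plain-Python sketch. This is only an illustration of the idea and not Baloo's Weld-based implementation; the helper name align_left and the None placeholder for unmatched labels are assumptions made for this example.

```python
def align_left(frame_index, series_index, series_values, missing=None):
    """Reorder series_values to follow frame_index (a left join on the index)."""
    lookup = dict(zip(series_index, series_values))
    # Labels of the frame that are absent from the Series get the placeholder.
    return [lookup.get(label, missing) for label in frame_index]


# The frame index drives the result: one value per frame label, in frame order.
aligned = align_left([0, 1, 2], [2, 0], [20, 10])
print(aligned)  # [10, None, 20]
```

Raw data skips this lookup entirely, which is why it is cheaper but offers no protection against mismatched lengths.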
Examples
>>> df = bl.DataFrame(OrderedDict({'a': np.arange(5, 8)}))
>>> df['b'] = np.arange(3)
>>> print(df.evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1
  2    7    2
-
agg(aggregations)
Perform multiple aggregations in an optimized manner.
Parameters: - aggregations : list of str
Which aggregations to perform.
Returns: - DataFrame
DataFrame with the aggregations per column.
-
astype(dtype)
Cast DataFrame columns to the given dtype.
Parameters: - dtype : numpy.dtype or dict
Dtype or column_name -> dtype mapping to cast columns to. Note index is excluded.
Returns: - DataFrame
DataFrame with cast columns.
-
columns
Index of the column names present in the DataFrame, in order.
Returns: - Index
-
drop(columns)
Drop one or more columns. Any column that does not exist in the DataFrame is skipped without raising an exception.
Unlike pandas' drop, this is currently restricted to dropping columns.
Parameters: - columns : str or list of str
Column name or list of column names to drop.
Returns: - DataFrame
A new DataFrame without these columns.
-
drop_duplicates(subset=None, keep='min')
Return DataFrame with duplicate rows (excluding index) removed, optionally only considering subset columns.
Note that the row order is NOT maintained due to hashing.
Parameters: - subset : list of str, optional
Which columns to consider.
- keep : {'+', '*', 'min', 'max'}, optional
What to select from the duplicate rows. These correspond to the possible merge operations in Weld. Note that '+' and '*' might produce unexpected results for strings.
Returns: - DataFrame
DataFrame without duplicate rows.
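One way to read the keep options is as a reduction applied to rows that share the same subset key. The plain-Python sketch below illustrates that reading only; it is an assumption about the semantics, not Baloo's Weld implementation, and the names MERGE_OPS and drop_duplicates_sketch are hypothetical.

```python
import operator

# Assumed mapping from the keep option to a binary merge operation.
MERGE_OPS = {'+': operator.add, '*': operator.mul, 'min': min, 'max': max}


def drop_duplicates_sketch(rows, subset, keep='min'):
    """Collapse rows (dicts) sharing the same subset key, merging other columns."""
    combine = MERGE_OPS[keep]
    merged = {}
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key in merged:
            prev = merged[key]
            merged[key] = {col: (val if col in subset else combine(prev[col], val))
                           for col, val in row.items()}
        else:
            merged[key] = dict(row)
    return list(merged.values())


rows = [{'a': 1, 'b': 5}, {'a': 1, 'b': 3}, {'a': 2, 'b': 7}]
deduped_min = drop_duplicates_sketch(rows, ['a'], keep='min')
deduped_sum = drop_duplicates_sketch(rows, ['a'], keep='+')
print(deduped_min)  # [{'a': 1, 'b': 3}, {'a': 2, 'b': 7}]
print(deduped_sum)  # [{'a': 1, 'b': 8}, {'a': 2, 'b': 7}]
```

A Python dict happens to preserve insertion order; a hash-based merger in Weld gives no such guarantee, which matches the note that row order is not maintained.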
-
dropna(subset=None)
Remove missing values according to Baloo's convention.
Parameters: - subset : list of str, optional
Which columns to check for missing values in.
Returns: - DataFrame
DataFrame with no null values in columns.
-
dtypes
Series of NumPy dtypes present in the DataFrame with index of column names.
Returns: - Series
-
empty
Check whether the data structure is empty.
Returns: - bool
-
evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)
Evaluate by creating a DataFrame containing the evaluated data and index.
See LazyResult.
Returns: - DataFrame
DataFrame with evaluated data and index.
-
fillna(value)
Return a DataFrame with missing values replaced by value.
Parameters: - value : {int, float, bytes, bool} or dict
Scalar value to replace missing values with. If a dict, missing values are replaced only in the columns named by its keys, each with the corresponding scalar.
Returns: - DataFrame
DataFrame with missing values replaced.
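The scalar and dict forms of value can be sketched in plain Python as follows. This is only an illustration, not Baloo's implementation: the MISSING sentinel of -999 is an assumption inferred from the class-level example, and fillna_sketch is a hypothetical helper operating on a dict of lists.

```python
MISSING = -999  # assumption: sentinel inferred from the DataFrame example above


def fillna_sketch(columns, value):
    """columns: dict of name -> list; value: scalar, or dict of name -> scalar."""
    # A scalar applies to every column; a dict applies only to the columns it names.
    per_column = value if isinstance(value, dict) else {name: value for name in columns}
    return {
        name: [per_column[name] if name in per_column and item == MISSING else item
               for item in data]
        for name, data in columns.items()
    }


data = {'a': [1.0, -999.0, 3.0], 'b': [-999, 2, 3]}
filled_dict = fillna_sketch(data, {'a': 15})
filled_scalar = fillna_sketch(data, 0)
print(filled_dict)    # {'a': [1.0, 15, 3.0], 'b': [-999, 2, 3]}
print(filled_scalar)  # {'a': [1.0, 0, 3.0], 'b': [0, 2, 3]}
```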
-
classmethod from_pandas(df)
Create a baloo DataFrame from a pandas DataFrame.
Parameters: - df : pandas.DataFrame
Returns: - DataFrame
-
groupby(by)
Group by certain columns, excluding index.
To group by an index column as well, call reset_index first.
Parameters: - by : str or list of str
Column(s) to groupby.
Returns: - DataFrameGroupBy
Object encoding the groupby operation.
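As a rough picture of what a grouped sum computes, the plain-Python sketch below accumulates values per key in a dict. It is not Baloo's Weld-based implementation, and groupby_sum is a hypothetical helper used only for illustration.

```python
from collections import defaultdict


def groupby_sum(keys, values):
    """Sum the values that share the same key."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


# Same data as the groupby example in the DataFrame class docstring.
grouped = groupby_sum([0, 1, 1, 2], [1, 2, 3, 4])
print(grouped)  # {0: 1, 1: 5, 2: 4}
```

The class-level example shows the groups coming back in a non-sorted order (0, 2, 1), so callers should not rely on the order of the resulting groups.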
-
head(n=5)
Return DataFrame with first n values per column.
Parameters: - n : int
Number of values.
Returns: - DataFrame
DataFrame containing the first n values per column.
Examples
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.head(2).evaluate())
       a    b
---  ---  ---
  0    5    0
  1    6    1
-
iloc
Retrieve Indexer by index.
The supported iloc functionality is shown in the examples below.
Examples
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.array([1, 0, 2])))))
>>> print(df.iloc[0:2].evaluate())
       a    b
---  ---  ---
  0    5    1
  1    6    0
>>> print(df.iloc[bl.Series(np.array([0, 2]))].evaluate())
       a    b
---  ---  ---
  0    5    1
  2    7    2
-
join(other, on=None, how='left', lsuffix=None, rsuffix=None, algorithm='merge', is_on_sorted=True, is_on_unique=True)
Database-like join of this DataFrame with the other DataFrame.
Currently assumes that the on columns are sorted and that their values are unique! Future work will handle the other cases.
Note there is no automatic cast if the types of the on columns differ.
See DataFrame.merge() for more details.
Parameters: - other : DataFrame
With which to merge.
- on : str or list or None, optional
The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.
- how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional
Which kind of join to do.
- lsuffix : str, optional
Suffix to use on columns that overlap from self.
- rsuffix : str, optional
Suffix to use on columns that overlap from other.
- algorithm : {‘merge’, ‘hash’}, optional
Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.
- is_on_sorted : bool, optional
If the on columns are known to be already sorted, a faster algorithm can be employed.
- is_on_unique : bool, optional
If the values are known to be unique, a faster algorithm can be employed.
Returns: - DataFrame
DataFrame containing the merge result, with the on columns as index.
-
keys()
Retrieve column names as Index, i.e. for axis=1.
Returns: - Index
Column names as an Index.
-
merge(other, how='inner', on=None, suffixes=('_x', '_y'), algorithm='merge', is_on_sorted=False, is_on_unique=True)
Database-like join of this DataFrame with the other DataFrame.
Currently assumes that the values of the on column(s) are unique!
Note there is no automatic cast if the types of the on columns differ.
Algorithms and limitations:
- Merge algorithms: merge-join or hash-join. The typical pros and cons apply when choosing between the two: merge-join suits DataFrames of roughly equal size, while hash-join is better when one of the DataFrames is (much) smaller.
- Limitations:
  - Hash-join requires the (smaller) hashed DataFrame (more precisely, its on columns) to contain no duplicates!
  - Merge-join requires the on columns to be sorted!
  - Unsorted data can only be sorted on a single column! (current Weld limitation)
- Sortedness. If the on columns are already sorted, merge-join does not need to sort the data, so it can be significantly faster. Do pass is_on_sorted=True if this is known to be true!
- Uniqueness. If the on columns contain duplicates, the algorithm is more complicated, i.e. slower. Also, hash-join cannot be used when the hashed (smaller) DataFrame contains duplicates. Do pass is_on_unique=True if this is known to be true!
- Setting either of the above two flags incorrectly, e.g. is_on_sorted=True when the data is in fact not sorted, will produce undefined results.
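The trade-off between the two algorithms can be seen in a minimal plain-Python sketch of an inner join on unique keys. This is not Baloo's Weld code; the functions merge_join and hash_join and the (key, value) pair representation are assumptions made for illustration.

```python
def merge_join(left, right):
    """Inner join of two key-sorted lists of (key, value) pairs with unique keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key == right_key:
            out.append((left_key, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # advance whichever side is behind
        else:
            j += 1
    return out


def hash_join(left, right):
    """Inner join that hashes `right`; needs no ordering, but right keys must be unique."""
    table = dict(right)
    return [(key, value, table[key]) for key, value in left if key in table]


left = [(0, 6), (1, 5), (2, 7)]    # sorted by key
right = [(0, 'x'), (2, 'y')]       # the smaller side
merged = merge_join(left, right)
hashed = hash_join(left, right)
print(merged)  # [(0, 6, 'x'), (2, 7, 'y')]
print(hashed)  # [(0, 6, 'x'), (2, 7, 'y')]
```

The merge variant walks both inputs once but silently produces wrong output on unsorted input, which mirrors the warning about setting is_on_sorted incorrectly. The hash variant builds a table from the other (here, right) side, so duplicates there would overwrite each other.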
Parameters: - other : DataFrame
With which to merge.
- how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional
Which kind of join to do.
- on : str or list or None, optional
The columns from both DataFrames on which to join. If None, will join on the index if it has the same name.
- suffixes : tuple of str, optional
Suffixes to append to columns that are not in on and share a name across the two DataFrames.
- algorithm : {‘merge’, ‘hash’}, optional
Which algorithm to use. Note that for ‘hash’, the other DataFrame is the one hashed.
- is_on_sorted : bool, optional
If the on columns are known to be already sorted, a faster algorithm can be employed. If False, the DataFrame will first be sorted by the on columns.
- is_on_unique : bool, optional
If the values are known to be unique, a faster algorithm can be employed.
Returns: - DataFrame
DataFrame containing the merge result, with the on columns as index.
-
rename(columns)
Return a new DataFrame with renamed columns.
Currently a simplified version of pandas' rename.
Parameters: - columns : dict
Old names to new names.
Returns: - DataFrame
DataFrame with the columns renamed, if found.
-
reset_index()
Return a new DataFrame with the previous index as column(s).
Returns: - DataFrame
DataFrame whose new index is a RangeIndex of its length.
-
set_index(keys)
Set the index of the DataFrame to be the keys columns.
Note this means that the old index is removed.
Parameters: - keys : str or list of str
Which column(s) to set as the index.
Returns: - DataFrame
DataFrame with the index set to the column(s) corresponding to the keys.
-
sort_index(ascending=True)
Sort the index of the DataFrame.
Currently MultiIndex is not supported since Weld is missing multiple-column sort.
Note this is an expensive operation (brings all data to Weld).
Parameters: - ascending : bool, optional
Returns: - DataFrame
DataFrame sorted according to the index.
-
sort_values(by, ascending=True)
Sort the DataFrame based on a column.
Unlike pandas, one can sort by data from both index and regular columns.
Currently it is only possible to sort on a single column, since Weld is missing multiple-column sort. Note this is an expensive operation (brings all data to Weld).
Parameters: - by : str or list of str
Column name(s) to sort by.
- ascending : bool, optional
Returns: - DataFrame
DataFrame sorted according to the column.
-
tail(n=5)
Return DataFrame with last n values per column.
Parameters: - n : int
Number of values.
Returns: - DataFrame
DataFrame containing the last n values per column.
Examples
>>> df = bl.DataFrame(OrderedDict((('a', np.arange(5, 8)), ('b', np.arange(3)))))
>>> print(df.tail(2).evaluate())
       a    b
---  ---  ---
  1    6    1
  2    7    2
-
to_csv(filepath, sep=', ', header=True, index=True)
Save the DataFrame as CSV.
Parameters: - filepath : str
- sep : str, optional
Separator used between values.
- header : bool, optional
Whether to save the header.
- index : bool, optional
Whether to save the index columns.
Returns: - None
-
to_pandas()
Convert to a pandas DataFrame.
Note the data is expected to be evaluated.
Returns: - pandas.DataFrame
-
values
Alias for data attribute.
Returns: - dict
The internal dict data representation.
-
baloo.core.series module
-
class baloo.core.series.Series(data=None, index=None, dtype=None, name=None)
Bases: baloo.weld.lazy_result.LazyArrayResult, baloo.core.generic.BinaryOps, baloo.core.generic.BitOps, baloo.core.generic.BalooCommon
Weld-ed pandas Series.
See also
Examples
>>> import baloo as bl
>>> import numpy as np
>>> sr = bl.Series([0, 1, 2])
>>> sr
Series(name=None, dtype=int64)
>>> sr.index
RangeIndex(start=0, stop=3, step=1)
>>> sr = sr.evaluate()
>>> sr  # repr
Series(name=None, dtype=int64)
>>> print(sr)  # str
<BLANKLINE>
---  --
  0   0
  1   1
  2   2
>>> sr.index
Index(name=None, dtype=int64)
>>> print(sr.index)
[0 1 2]
>>> len(sr)  # eager computation
3
>>> sr.values
array([0, 1, 2])
>>> (sr + 2).evaluate().values
array([2, 3, 4])
>>> (sr - bl.Index(np.arange(3))).evaluate().values
array([0, 0, 0])
>>> print(sr.max().evaluate())
2
>>> print(sr.var().evaluate())
1.0
>>> print(sr.agg(['min', 'std']).evaluate())
<BLANKLINE>
---  --
min   0
std   1
Attributes:
- index
- dtype
- name
- iloc : Retrieve Indexer by index.
-
__getitem__(item)
Select from the Series.
The supported selection functionality is shown in the examples below.
Examples
>>> sr = bl.Series(np.arange(5, dtype=np.float32), name='Test')
>>> sr = sr[sr > 0]
>>> sr
Series(name=Test, dtype=float32)
>>> print(sr.evaluate())
       Test
---  ------
  1       1
  2       2
  3       3
  4       4
>>> sr = sr[(sr != 1) & ~(sr > 3)]
>>> print(sr.evaluate())
       Test
---  ------
  2       2
  3       3
>>> print(sr[:1].evaluate())
       Test
---  ------
  2       2
-
__init__(data=None, index=None, dtype=None, name=None)
Initialize a Series object.
Parameters: - data : numpy.ndarray or WeldObject or list, optional
Raw data or Weld expression.
- index : Index or RangeIndex or MultiIndex, optional
Index linked to the data; it is assumed to be of the same length. RangeIndex by default.
- dtype : numpy.dtype or type, optional
Desired NumPy dtype for the elements. If a type, it must be a NumPy type, e.g. np.float32. If data is an np.ndarray with a dtype different from the dtype argument, it is cast (astype) to the argument dtype. Note that if data is a WeldObject, one must explicitly astype to convert the type. Inferred from data by default.
- name : str, optional
Name of the Series.
-
agg(aggregations)
Perform multiple aggregations in an optimized manner.
Parameters: - aggregations : list of str
Which aggregations to perform.
Returns: - Series
Series with resulting aggregations.
-
apply(func, mapping=None, new_dtype=None, **kwargs)
Apply an element-wise UDF to the Series.
There are currently 6 options for using a UDF. The first 4 are lazy; the other 2 are eager and require the use of the raw decorator:
- One of the predefined functions in baloo.functions.
- Implementing a function which encodes the result. kwargs are automatically passed to it.
- Pure Weld code and mapping.
- Weld code and mapping along with a dynamically linked C++ lib containing the UDF.
- Using a NumPy function, which however is EAGER and hence requires self.values to be raw. Additionally, NumPy does not support kwargs in (all) functions, so the raw decorator must be used to strip away weld_type.
- Implementing an eager function with the same precondition as above. Use the raw decorator to check this.
Parameters: - func : function or str
Weld code as a str to encode or function from baloo.functions.
- mapping : dict, optional
Additional mappings in the weld_template to replace on execution. self is added by default to refer to this Series.
- new_dtype : numpy.dtype, optional
Specify the new dtype of the result Series. If None, it assumes it’s the same dtype as before the apply.
Returns: - Series
Series with the UDF result.
Examples
>>> import baloo as bl
>>> import numpy as np
>>> sr = bl.Series([1, 2, 3])
>>> weld_template = 'map({self}, |e| e + {scalar})'
>>> mapping = {'scalar': '2L'}
>>> print(sr.apply(weld_template, mapping).evaluate())
<BLANKLINE>
---  --
  0   3
  1   4
  2   5
>>> weld_template2 = 'map({self}, |e| e + 3L)'
>>> print(sr.apply(weld_template2).evaluate())
<BLANKLINE>
---  --
  0   4
  1   5
  2   6
>>> print(bl.Series([1., 4., 100.]).apply(bl.sqrt).evaluate())  # lazy predefined function
<BLANKLINE>
---  --
  0   1
  1   2
  2  10
>>> sr = bl.Series([4, 2, 3, 1])
>>> print(sr.apply(bl.sort, kind='q').evaluate())  # eager wrapper over np.sort (which uses raw decorator)
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(np.sort, kind='q')).evaluate())  # np.sort directly
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
>>> print(sr.apply(bl.raw(lambda x: np.sort(x, kind='q'))).evaluate())  # lambda also works, with x = np.array
<BLANKLINE>
---  --
  0   1
  1   2
  2   3
  3   4
# check tests/core/cudf/* and tests/core/test_series.test_cudf for C UDF example
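To make the role of mapping concrete, the sketch below shows how placeholders in a Weld template string could be filled in, with self added automatically. It only illustrates the substitution idea using Python's str.format; whether Baloo substitutes this way internally is an assumption, and both render_weld_template and the object name _inp0 are hypothetical.

```python
def render_weld_template(template, self_name, mapping=None):
    """Fill the {placeholders} of a Weld template; `self` is always provided."""
    placeholders = dict(mapping or {})
    placeholders['self'] = self_name  # hypothetical name of the Series' Weld input
    return template.format(**placeholders)


code = render_weld_template('map({self}, |e| e + {scalar})', '_inp0', {'scalar': '2L'})
print(code)  # map(_inp0, |e| e + 2L)

code_no_mapping = render_weld_template('map({self}, |e| e + 3L)', '_inp0')
print(code_no_mapping)  # map(_inp0, |e| e + 3L)
```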
-
dropna()
Return a Series without null values, according to Baloo's convention.
Returns: - Series
Series with no null values.
-
evaluate(verbose=False, decode=True, passes=None, num_threads=1, apply_experimental=True)
Evaluate by creating a Series containing the evaluated data and index.
See LazyResult.
Returns: - Series
Series with evaluated data and index.
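The lazy-then-evaluate pattern can be pictured with a tiny plain-Python stand-in. This is a conceptual sketch only, not Baloo's LazyResult or Weld machinery; the LazyValue class and the load function are hypothetical.

```python
class LazyValue:
    """Holds a deferred computation that runs only when evaluate() is called."""

    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self._done = False

    def __gt__(self, scalar):
        # Builds a new deferred computation; nothing is executed here.
        return LazyValue(lambda: [item > scalar for item in self.evaluate()])

    def evaluate(self):
        if not self._done:
            self._cache = self._compute()
            self._done = True
        return self._cache


calls = []


def load():
    calls.append('load')
    return [0, 1, 2]


mask = LazyValue(load) > 0
calls_before = list(calls)   # still empty: the comparison was only recorded
result = mask.evaluate()     # the work happens here
print(calls_before, result, calls)  # [] [False, True, True] ['load']
```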
Examples
>>> sr = bl.Series(np.arange(3)) > 0
>>> weld_code = sr.values  # accessing values now returns the weld code as a string
>>> sr = sr.evaluate()
>>> sr.values  # now it is evaluated to raw data
array([False,  True,  True])
-
fillna(value)
Return a Series with missing values replaced by value.
Parameters: - value : {int, float, bytes, bool}
Scalar value to replace missing values with.
Returns: - Series
With missing values replaced.
-
classmethod from_pandas(series)
Create a baloo Series from a pandas Series.
Parameters: - series : pandas.Series
Returns: - Series
-
head(n=5)
Return Series with first n values.
Parameters: - n : int
Number of values.
Returns: - Series
Series containing the first n values.
Examples
>>> sr = bl.Series(np.arange(3))
>>> print(sr.head(2).evaluate())
<BLANKLINE>
---  --
  0   0
  1   1
-
iloc
Retrieve Indexer by index.
The supported iloc functionality is shown in the examples below.
Returns: - _ILocIndexer
Examples
>>> sr = bl.Series(np.arange(3))
>>> print(sr.iloc[2].evaluate())
2
>>> print(sr.iloc[0:2].evaluate())
<BLANKLINE>
---  --
  0   0
  1   1
>>> print(sr.iloc[bl.Series(np.array([0, 2]))].evaluate())
<BLANKLINE>
---  --
  0   0
  2   2
-
str
Access string functions.
Returns: - StringMethods
Examples
>>> sr = bl.Series([b' aB ', b'GoOsfrABA'])
>>> print(sr.str.lower().evaluate())
<BLANKLINE>
---  ---------
  0  ab
  1  goosfraba
baloo.core.generic module
baloo.core.groupby module
baloo.core.strings module
-
class baloo.core.strings.StringMethods(data)
-
capitalize()
Convert first character to uppercase and remainder to lowercase.
Returns: - Series
-
contains(pat)
Test if pat is included within elements.
Parameters: - pat : str
Returns: - Series
-
find(sub, start=0, end=None)
Test if elements contain substring.
Parameters: - sub : str
- start : int, optional
Index to start searching from.
- end : int, optional
Index at which to stop searching.
Returns: - Series
-
replace(pat, rep)
Replace first occurrence of pat with rep in each element.
Parameters: - pat : str
- rep : str
Returns: - Series
-
slice(start=None, stop=None, step=None)
Slice substrings from each element.
Note that a negative step is currently not supported.
Parameters: - start : int
- stop : int
- step : int
Returns: - Series
-
split(pat, side='left')
Split each element once from the left and select a side to return.
Note this is unlike pandas split in that it essentially combines the split with a select.
Parameters: - pat : str
- side : {‘left’, ‘right’}
Which side of the split to select and return in each element.
Returns: - Series
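The "split once, then select a side" behavior resembles Python's built-in partition. The sketch below shows that analogy for a single bytes element; it is not Baloo's Weld implementation, split_once is a hypothetical helper, and how Baloo treats elements that do not contain pat is not stated here, so the fallback shown is simply partition's own behavior.

```python
def split_once(element, pat, side='left'):
    """Split element at the first occurrence of pat and return one side."""
    left, _sep, right = element.partition(pat)
    if side == 'left':
        return left
    if side == 'right':
        return right
    raise ValueError("side must be 'left' or 'right'")


left_part = split_once(b'a-b-c', b'-', side='left')
right_part = split_once(b'a-b-c', b'-', side='right')
print(left_part, right_part)  # b'a' b'b-c'
```

Only the first occurrence is used, so the right side keeps any later separators intact.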
-