File utils

The module provides convenience functions for dealing with files, directories and paths.

Overview

Reading

read_file_lines(f[, strip, remove_empty, nlines]) Reads all lines of a given file.
read_file_first_line(f) Reads the first line of a given file f.
read_pickle_file(f) Reads a pickled object from a file f.

Writing

write_lines_file(lns, f[, linedel]) Writes lines into a file.
write_pickle_file(e, f) Writes an object e as a pickle into a file f.

File extensions

file_ext(f) Returns the extension of a file f.
set_ext(f[, ext]) Changes the file extension of path f to ext by replacing the text starting at the last period.
file_complete_ext(f) Returns the complete extension of a file f.
set_complete_ext(f[, ext]) Changes the file extension of path f by changing the text after the first period in its basename to ext.
remove_complete_ext(f) Removes the complete extension of a path f.

Directory listing

list_dir(d[, match, sort]) Lists files and directories that match a given pattern.
walk_dir(d[, match, follow_symlinks]) Walks a given directory and returns files/dirs whose basename matches a given pattern.

Paths

abs_path(path, d) Returns an absolute POSIX path of path given a working directory d.
prepend_dir(path, d) Inserts a directory d just before the last component of the path path.

Directories

safe_create_dir(d) Creates a directory, if it does not already exist.

Sizes

file_size(f) Returns the size of a file f, in bytes.
dir_size(d[, recursive]) Returns the total size of the files in a directory d, in bytes.

Utilities

file_frame(fs[, bf, bfne, size, stat]) Constructs a DataFrame from file names, having selected columns.
random_hash([n]) Generates n random bytes and represents them in a hexadecimal format.
daul.file_utils.abs_path(path, d)

Returns an absolute POSIX path of path given a working directory d.

>>> abs_path('dir/file', '/base')
'/base/dir/file'
daul.file_utils.as_posix_path(path)

Returns the path path as a POSIX path.

>>> as_posix_path(r'C:\Windows')
'C:/Windows'
>>> as_posix_path('/home/user')
'/home/user'
>>> as_posix_path(r'C:\Windows/system32')
'C:/Windows/system32'
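
The two path helpers above, abs_path() and as_posix_path(), can be approximated with the standard library. The following is a minimal sketch rather than the daul implementation: the `_sketch` names are hypothetical, and it assumes that every backslash is a separator and that the result should be normalized.

```python
import posixpath


def as_posix_path_sketch(path):
    # Assumption: every backslash is a path separator.
    return path.replace('\\', '/')


def abs_path_sketch(path, d):
    # Resolve `path` against the working directory `d` using POSIX rules.
    # An already absolute `path` is returned unchanged (apart from normalization).
    return posixpath.normpath(posixpath.join(d, as_posix_path_sketch(path)))
```

Because only posixpath is used, the sketch behaves the same on every platform.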
daul.file_utils.dir_size(d, recursive=False)

Returns the total size of the files in a directory d, in bytes.

recursive : bool
Whether to recurse into subdirectories.
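
To illustrate what the recursive flag changes, here is a minimal sketch built on os.walk(). It is not the daul implementation; `dir_size_sketch` is a hypothetical name, and the sketch assumes that only regular file sizes are summed.

```python
import os
import tempfile


def dir_size_sketch(d, recursive=False):
    total = 0
    for root, dirs, files in os.walk(d):
        total += sum(os.path.getsize(os.path.join(root, name)) for name in files)
        if not recursive:
            break  # os.walk() is top-down, so the first item is `d` itself
    return total


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    with open(os.path.join(tmp, 'top.bin'), 'wb') as fh:
        fh.write(b'abc')            # 3 bytes in the top directory
    with open(os.path.join(tmp, 'sub', 'inner.bin'), 'wb') as fh:
        fh.write(b'12345')          # 5 bytes in a subdirectory
    flat_size = dir_size_sketch(tmp)
    full_size = dir_size_sketch(tmp, recursive=True)
```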
daul.file_utils.file_complete_ext(f)

Returns the complete extension of a file f.

>>> file_complete_ext('abc.txt.gz')
'.txt.gz'
daul.file_utils.file_ext(f)

Returns the extension of a file f.

>>> file_ext('abc.txt.gz')
'.gz'
>>> file_ext('abc')
''
daul.file_utils.file_frame(fs, bf=True, bfne=False, size=False, stat=False)

Constructs a DataFrame from file names, having selected columns.

fs : list of str
a list of file names.
bf : bool
whether to include the basename of a file (as “bf” column).
bfne : bool
whether to include the basename of a file without complete extension (as “bfne” column).
size : bool
whether to include the size of a file (as “size” column).
stat : bool
whether to include the stat of a file (as “stat” column).
daul.file_utils.file_size(f)

Returns the size of a file f, in bytes.

daul.file_utils.list_dir(d, match='*', sort=False)

Lists files and directories that match a given pattern.

The pattern is checked against the basename of the file/dir.

d : str
Directory to list.
match : str
Name pattern that the files/dirs must match to be included.
sort : bool
Indicates whether to sort files/dirs by name.
daul.file_utils.prepend_dir(path, d)

Inserts a directory d just before the last component of the path path.

>>> prepend_dir('file', 'dir')
'dir/file'
>>> prepend_dir('path/file', 'dir')
'path/dir/file'
>>> prepend_dir('/home/user/file', 'dir')
'/home/user/dir/file'
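
The three examples above are consistent with splitting off the last path component and re-joining with d in between. A minimal sketch under that assumption (`prepend_dir_sketch` is a hypothetical name, not part of daul):

```python
import posixpath


def prepend_dir_sketch(path, d):
    # Split into (everything before the last component, last component).
    head, tail = posixpath.split(path)
    # posixpath.join() skips the empty head of a bare file name.
    return posixpath.join(head, d, tail)
```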
daul.file_utils.random_hash(n=8)

Generates n random bytes and represents them in a hexadecimal format.

n : int
Number of bytes to generate.

Let us generate a random hash from 8 bytes:

>>> h = random_hash(n=8)
>>> len(h) # 16: two hex letters for a byte
16
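
One way to obtain such a value with the standard library is sketched below. The source of randomness used by daul is not documented here, so os.urandom() is an assumption, and `random_hash_sketch` is a hypothetical name.

```python
import binascii
import os


def random_hash_sketch(n=8):
    # n random bytes, rendered as 2 * n lowercase hexadecimal characters.
    return binascii.hexlify(os.urandom(n)).decode('ascii')


h = random_hash_sketch(8)
```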
daul.file_utils.read_file_first_line(f)

Reads the first line of a given file f.

daul.file_utils.read_file_lines(f, strip=False, remove_empty=False, nlines=None)

Reads all lines of a given file.

f : str
Path to the file.
strip: bool
Indicates whether to strip lines of white space.
remove_empty: bool
Indicates whether to remove empty lines.
nlines : int or None
If None, reads all the lines. If int, reads the specified number of lines.
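
How the three options might interact is sketched below. This is an illustration only: `read_file_lines_sketch` is a hypothetical name, and the sketch assumes that trailing newlines are dropped and that nlines is applied before empty lines are removed, neither of which is stated above.

```python
import itertools
import os
import tempfile


def read_file_lines_sketch(f, strip=False, remove_empty=False, nlines=None):
    with open(f) as fh:
        # islice(fh, None) iterates over the whole file.
        lines = [ln.rstrip('\n') for ln in itertools.islice(fh, nlines)]
    if strip:
        lines = [ln.strip() for ln in lines]
    if remove_empty:
        lines = [ln for ln in lines if ln]
    return lines


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, 'demo.txt')
    with open(demo, 'w') as fh:
        fh.write(' a \n\nb\n')
    raw = read_file_lines_sketch(demo)
    cleaned = read_file_lines_sketch(demo, strip=True, remove_empty=True)
    head = read_file_lines_sketch(demo, nlines=1)
```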
daul.file_utils.read_pickle_file(f)

Reads a pickled object from a file f.

Creates a relative symlink, starting at a given directory.

d : str
Directory to temporarily change the current directory to.
src : str
Name of the source file.
dst : str
Name of the destination file.
daul.file_utils.remove_complete_ext(f)

Removes the complete extension of a path f.

>>> remove_complete_ext('abc.txt.gz')
'abc'
daul.file_utils.safe_create_dir(d)

Creates a directory, if it does not already exist.

d : str
Path to the directory to create.
Raises: OSError – if d exists and is not a directory.
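
The behavior described above can be sketched as follows. `safe_create_dir_sketch` is a hypothetical name, and creating missing parent directories with os.makedirs() is an assumption of the sketch, not something the description confirms.

```python
import os
import tempfile


def safe_create_dir_sketch(d):
    try:
        os.makedirs(d)
    except OSError:
        # Re-raise unless the path already exists as a directory.
        if not os.path.isdir(d):
            raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'a', 'b')
    safe_create_dir_sketch(target)
    safe_create_dir_sketch(target)   # the second call is a no-op
    created = os.path.isdir(target)

    blocker = os.path.join(tmp, 'plain_file')
    open(blocker, 'w').close()
    try:
        safe_create_dir_sketch(blocker)
        raised = False
    except OSError:
        raised = True                # the path exists but is not a directory
```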
daul.file_utils.set_complete_ext(f, ext='')

Changes the file extension of path f by changing the text after the first period in its basename to ext.

>>> set_complete_ext('abc.txt', '.tex')
'abc.tex'
>>> set_complete_ext('abc.txt.gz', '.ext')
'abc.ext'
>>> set_complete_ext('abc.txt.gz', '')
'abc'

See also

remove_complete_ext() for removing the complete extension.
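
The "complete extension" is everything from the first period of the basename onward. A minimal sketch of that rule follows; the `_sketch` names are hypothetical, and the handling of hidden files such as '.bashrc' (whose basename starts with a period) is left unspecified here.

```python
import posixpath


def file_complete_ext_sketch(f):
    base = posixpath.basename(f)
    stem, dot, rest = base.partition('.')   # split at the FIRST period
    return dot + rest


def set_complete_ext_sketch(f, ext=''):
    head, base = posixpath.split(f)
    stem = base.partition('.')[0]
    # Periods in directory names are untouched: only the basename is split.
    return posixpath.join(head, stem + ext)
```

With ext left at its default of '', set_complete_ext_sketch() behaves like removing the complete extension.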

daul.file_utils.set_ext(f, ext='')

Changes the file extension of path f to ext by replacing the text starting at the last period.

>>> set_ext('abc.txt', '.tex')
'abc.tex'
>>> set_ext('abc', '.txt')
'abc.txt'
>>> set_ext('/home/user/file', '.cache')
'/home/user/file.cache'

The changes are applied only to the basename:

>>> set_ext('/home/user/.conf/file', '.cache')
'/home/user/.conf/file.cache'

If multiple periods are present in the basename, it changes only the last one:

>>> set_ext('file.txt.cache', '.ch')
'file.txt.ch'

See also

set_complete_ext() for changing the complete extension.
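
The examples above match what posixpath.splitext() does, so a sketch can be a one-liner. `set_ext_sketch` is a hypothetical name, and whether daul uses splitext() internally is not known.

```python
import posixpath


def set_ext_sketch(f, ext=''):
    # splitext() splits at the last period of the basename only.
    root, _old_ext = posixpath.splitext(f)
    return root + ext
```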

daul.file_utils.walk_dir(d, match='*', follow_symlinks=False)

Walks a given directory and returns files/dirs whose basename matches a given pattern.

d : str
Path to the directory to walk.
match : str
The pattern of files to return.
follow_symlinks : bool
Whether to follow links as in os.walk().
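
A walk with basename matching could look like the sketch below. It assumes that match is a shell-style pattern as understood by fnmatch (the default '*' suggests this, but it is not stated), and `walk_dir_sketch` is a hypothetical name.

```python
import fnmatch
import os
import tempfile


def walk_dir_sketch(d, match='*', follow_symlinks=False):
    found = []
    for root, dirs, files in os.walk(d, followlinks=follow_symlinks):
        for name in dirs + files:
            # The pattern is tested against the basename only.
            if fnmatch.fnmatch(name, match):
                found.append(os.path.join(root, name))
    return found


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ('a.txt', 'b.csv', os.path.join('sub', 'c.txt')):
        open(os.path.join(tmp, rel), 'w').close()
    txt = sorted(os.path.relpath(p, tmp) for p in walk_dir_sketch(tmp, '*.txt'))
    everything = walk_dir_sketch(tmp)
```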
daul.file_utils.write_lines_file(lns, f, linedel='')

Writes lines into a file.

lns : list
List of lines.
f : str
Path to the file.
linedel : str
Line delimiter.
daul.file_utils.write_pickle_file(e, f)

Writes an object e as a pickle into a file f.
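
A round trip through the two pickle helpers can be sketched with the standard pickle module. The `_sketch` names are hypothetical, and the pickle protocol used by daul is not documented here. Note that unpickling data from an untrusted source is unsafe.

```python
import os
import pickle
import tempfile


def write_pickle_file_sketch(e, f):
    with open(f, 'wb') as fh:      # pickles are binary data
        pickle.dump(e, fh)


def read_pickle_file_sketch(f):
    with open(f, 'rb') as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'obj.pkl')
    write_pickle_file_sketch({'a': [1, 2]}, path)
    restored = read_pickle_file_sketch(path)
```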

Pandas utilities

The module collects utilities for dealing with DataFrames.

Overview

Columns

rename_cols(df, d) Returns a copy of a frame df with renamed columns as defined by the dictionary d.
set_cols(df, cols) Returns a copy of a frame df with its labels being cols.
reorder_cols(df[, first_cols, last_cols]) Returns a copy of frame df with reordered columns.
prefix_cols(df, pref) Returns a copy of a frame df with prefix pref applied to column labels.
drop_cols(df, cols) Returns a copy of a frame df with columns cols dropped.
safe_drop_cols(df, cols) Returns a copy of a frame df with columns cols dropped, but without raising exceptions if they do not exist.

Column updates

update_column(df, col, values[, copy]) Returns a copy of a frame df with column col set to values.
update_tuple_col(df, cola, colb, values[, copy]) Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).
update_fixed_column(df, col, value[, copy]) Returns a copy of a frame df with a column col set to a fixed value value.
column_apply(df, col, f[, store_as, copy]) Returns a copy of a frame with a function f applied over a column col and stored as column store_as (by default, the same column col).

Transformations

expand(df, col[, exp_col, remove_col]) Expands the values of a list-like column into separate rows, while keeping the rest of the columns fixed.
expand_dicts_as_cols(df, col[, remove_col]) Returns a frame in which the values of a dict-based column col in frame df are expanded into separate columns.

Inner frames

groupby_as_frame(df, col[, df_col]) Groups a frame by a column and stores the frames as inner frames.
extract_inner_col(df, df_col, inndf_col[, aggf]) Extracts values from an inner frame’s column and applies a function over them.
include_outer_col(df, inndf_col, col) Includes column from the outer frame as a fixed column in the inner frame.
attach_inner_col(df, col, df_col, inndf_col) Attaches a column to the outer frame by aggregating the values in the inner frame.
attach_inner_cols(df, cols, df_col, aggf[, …]) Attaches selected columns from the inner frame.
rename_inner_cols(df, df_col[, d]) Renames columns of inner frames.

Row splitting

row_split_ngroups(df, n[, empty]) Row-splits the frame into n approximately equal-length frames.
row_split_group_size(df, sz[, empty]) Row-splits the frame into frames having sz rows.
row_split_sizes(df, lens) Splits a frame df according to a list lens of lengths.
row_split_bool(df, arr) Row-splits a frame df into two parts: the rows for which the arr is True and those for which it is False.

Joins

left_join(ldf, rdf, jcol[, rcols]) Returns a copy of the left frame with columns joined from the right frame on a specified column.
left_join_def(df, jdf, jcol[, cols, default]) Returns a copy of the left frame with columns joined from the right frame, providing a default value.

Conversion to dictionaries

twocol_dict(df) Creates a dictionary from the first two columns of a frame df (keys — the first column, values — the second column).
twocol_dictf(df) Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.
twocol_dictfd(df, default) Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.
twocol_listvaldict(df) Creates a dictionary from the first two columns of a frame df, (keys — the first column, values — list-aggregated values from the second column for the same key).
twocol_listvaldictf(df) Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.
twocol_listvaldictfd(df, default) Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).

Selectors

nodup(df[, cols]) Returns non-duplicated rows of a frame df.
take_middle_n_rows(df, n[, error_if_less]) Selects n middle rows of a frame.
ifst(x) Returns the first element of x using iloc of a frame, or a series.

Generic

empty_frame([cols]) Creates an empty frame with the given column labels cols.
pdmap(f, df) Maps a function f over rows of a frame df, with the function application being done over all columns as positional arguments.
renumber_index(df[, copy]) Returns a copy of a frame df, with the index having consecutive values in the range from 0 to len(df) - 1.
values_list(x) Returns the x.values of x as a list.
values_set(x) Returns the x.values of x as a set.

Empty-aware utilities

eaw_row_concat(dfs, cols) Performs a row-wise concatenation of frames dfs.

Compatibility

sort(df, col[, ascending]) Sorts frame by given col(s).
row_concat(dfs) Concatenates frames dfs, row-wise.
daul.pandas_utils.attach_inner_col(df, col, df_col, inndf_col, aggf=<type 'list'>)

Attaches a column to the outer frame by aggregating the values in the inner frame.

df : DataFrame
The frame to attach the column to.
col : str
The label of the column that will be attached to the outer frame.
df_col : str
The label of the column where the inner frames are stored.
inndf_col : str
The label of the column in the inner frame, which will be aggregated.
aggf : function
The function that aggregates the values.
daul.pandas_utils.attach_inner_cols(df, cols, df_col, aggf, namef=<function idf>)

Attaches selected columns from the inner frame.

df : DataFrame
The frame to attach the columns to.
cols : list
The labels of the columns in the inner frame to attach to the outer frame. See also namef parameter.
df_col : str
The label of the column where the inner frames are stored.
aggf : function
The function that aggregates the values from the inner frames.
namef : function
The function that maps labels of the columns from the inner frame to the labels in the outer frame.

Let us create a simple frame that we will group to obtain inner frames:

>>> df = pd.DataFrame({'a': [0, 0, 1, 2], 'b': ['a', 'b', 'c', 'd']})
>>> df
   a  b
0  0  a
1  0  b
2  1  c
3  2  d

Group it:

>>> gdf = groupby_as_frame(df, 'a') # inner frame is in 'df' column

Now we can attach the inner columns as outer ones:

>>> gdf = attach_inner_cols(gdf, ['b'], 'df', list)
>>> gdf[['a', 'b']]
   a       b
0  0  [a, b]
1  1     [c]
2  2     [d]
daul.pandas_utils.column_apply(df, col, f, store_as=None, copy=True)

Returns a copy of a frame with a function f applied over a column col and stored as column store_as (by default, the same column col).

df : DataFrame
The frame.
col :
The column over which to apply f.
f : function
The function to apply over col.
store_as : column-label or None
The label of the column to store results of applying f. If None, it is the same as col.
copy : bool
Whether to return a copy of the frame.
>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = column_apply(df, 'a', lambda x: x + 1)
>>> df
   a
0  1
1  2
daul.pandas_utils.drop_cols(df, cols)

Returns a copy of a frame df with columns cols dropped.

>>> df = empty_frame(['a', 'b', 'c'])  

We can use it for one column:

>>> list(drop_cols(df, 'a').columns)
['b', 'c']

And also for multiple columns:

>>> list(drop_cols(df, ['a', 'b']).columns)
['c']
daul.pandas_utils.eaw_groupby_agg(df, groupby, aggd)

Groups a frame and aggregates values, even for an empty frame.

df : DataFrame
Frame to group.
groupby : str
Label of column to groupby.
aggd : dict
Dictionary of column labels and the functions to apply to them, as in pandas.DataFrame.agg().
>>> df = pd.DataFrame({'x': [0, 0, 1], 'y': [1, 2, 3]})
>>> df
   x  y
0  0  1
1  0  2
2  1  3
>>> gdf = eaw_groupby_agg(df, 'x', {'y': np.max}).reset_index()
>>> gdf
   x  y
0  0  2
1  1  3
>>> df = empty_frame(['x', 'y'])
>>> gdf = eaw_groupby_agg(df, 'x', {'y': np.max}).reset_index()
>>> gdf
Empty DataFrame
Columns: [x, y]
Index: []
daul.pandas_utils.eaw_row_concat(dfs, cols)

Performs a row-wise concatenation of frames dfs. If an empty list is given, returns an empty frame with the specified columns cols.

daul.pandas_utils.empty_frame(cols=[])

Creates an empty frame with the given column labels cols.

>>> df = empty_frame(['a', 'b'])  
>>> len(df)
0
>>> list(df.columns)
['a', 'b']
daul.pandas_utils.expand(df, col, exp_col=None, remove_col=False)

Expands the values of a list-like column into separate rows, while keeping the rest of the columns fixed.

df : DataFrame
The frame to expand.
col : str
A column of df to expand.
exp_col : str
The label to store the expanded column as. If None, the name of col is used.
remove_col : bool
Indicates whether to remove the non-expanded column.
>>> df = pd.DataFrame({'a': [1, 2], 
...                    'fruits': [['orange', 'apple'], 
...                               ['kiwi']]})
>>> df
   a           fruits
0  1  [orange, apple]
1  2           [kiwi]
>>> df = expand(df, 'fruits', 'fruit')
>>> df
   a           fruits   fruit
0  1  [orange, apple]  orange
0  1  [orange, apple]   apple
1  2           [kiwi]    kiwi
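
In recent pandas versions a similar result can be obtained with DataFrame.explode(). The sketch below is not the daul implementation; `expand_sketch` is a hypothetical name, and the sketch assumes that remove_col is ignored when the expanded column overwrites col.

```python
import pandas as pd


def expand_sketch(df, col, exp_col=None, remove_col=False):
    exp_col = col if exp_col is None else exp_col
    out = df.copy()
    out[exp_col] = out[col]          # duplicate the list-like column
    out = out.explode(exp_col)       # one row per list element, index repeated
    if remove_col and exp_col != col:
        out = out.drop(columns=[col])
    return out


df = pd.DataFrame({'a': [1, 2],
                   'fruits': [['orange', 'apple'], ['kiwi']]})
expanded = expand_sketch(df, 'fruits', 'fruit')
```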
daul.pandas_utils.expand_dicts_as_cols(df, col, remove_col=False)

Returns a frame in which the values of a dict-based column col in frame df are expanded into separate columns.

remove_col : bool
Whether to remove the dict-based column.

Suppose we have a frame with a dict-like column v.

>>> df = pd.DataFrame({'i': [0, 1], 
...                    'v': [{'a': 1, 'b': 2}, 
...                          {'a': 2, 'b': 3}]})
>>> df
   i                   v
0  0  {u'a': 1, u'b': 2}
1  1  {u'a': 2, u'b': 3}

Now let us expand the dictionaries in v.

>>> df = expand_dicts_as_cols(df, 'v', remove_col=True)
>>> df[['i', 'a', 'b']]
   i  a  b
0  0  1  2
1  1  2  3
daul.pandas_utils.extract_inner_col(df, df_col, inndf_col, aggf=<type 'list'>)

Extracts values from an inner frame’s column and applies a function over them.

df : DataFrame
The frame.
df_col : str
The label of column in df that holds the inner frames.
inndf_col : str
The label of column in the inner frames.
aggf : function
The function to aggregate the columns.

Let us first create a frame and then group it (see groupby_as_frame()), so that it has an inner frame:

>>> df = pd.DataFrame({'a': [0, 0, 1, 2], 'b': [1, 3, 5, 7]})
>>> gdf = groupby_as_frame(df, 'a') # inner frame in `df`

Now let us extract the values of column b from the inner frames:

>>> extract_inner_col(gdf, 'df', 'b')
0    [1, 3]
1       [5]
2       [7]
Name: df, dtype: object

We can also apply some function, e.g., to sum the values:

>>> extract_inner_col(gdf, 'df', 'b', sum)
0    4
1    5
2    7
Name: df, dtype: int64
daul.pandas_utils.groupby_as_frame(df, col, df_col='df')

Groups a frame by a column and stores the frames as inner frames.

df : DataFrame
The frame to group.
col : label
The column by which to group.
df_col : label
The label that will hold the resulting subframes.
>>> df = pd.DataFrame({'a': [0, 0, 1], 'b': ['a', 'b', 'c']})
>>> df
   a  b
0  0  a
1  0  b
2  1  c

Let us group the frame on the a column:

>>> gdf = groupby_as_frame(df, 'a')

Let us see the groups:

>>> gdf[['a']]
   a
0  0
1  1

The first group (with a = 0):

>>> gdf['df'].iloc[0]
   a  b
0  0  a
1  0  b

The second group (with a = 1):

>>> gdf['df'].iloc[1]
   a  b
2  1  c
daul.pandas_utils.ifst(x)

Returns the first element of x using iloc of a frame, or a series.

daul.pandas_utils.include_outer_col(df, inndf_col, col)

Includes column from the outer frame as a fixed column in the inner frame.

df : DataFrame
Frame to update.
inndf_col : label
The label of the column that holds the inner frames.
col : label
The label of the column in df to include in the inner frames.
daul.pandas_utils.left_join(ldf, rdf, jcol, rcols=[])

Returns a copy of the left frame with columns joined from the right frame on a specified column.

ldf : DataFrame
The left frame.
rdf : DataFrame
The right frame.
jcol : str
The label of the column on which to perform the join.
rcols: list
The columns from rdf to join.

Note

No default behavior if the particular value in jcol is missing in the right frame; see left_join_def().

>>> ldf = pd.DataFrame({'a': [0, 0, 1, 2]})
>>> rdf = pd.DataFrame({'a': [0, 1, 2], 'b': ['zero', 'one', 'two']})
>>> left_join(ldf, rdf, 'a', ['b'])
   a     b
0  0  zero
1  0  zero
2  1   one
3  2   two
daul.pandas_utils.left_join_def(df, jdf, jcol, cols=[], default=None)

Returns a copy of the left frame with columns joined from the right frame, providing a default value.

Note

See left_join() for explanation of parameters.
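
The difference between the two joins can be illustrated with plain pandas. The sketch below is an assumption about the intended behavior, not the daul implementation: `left_join_def_sketch` is a hypothetical name, and it fills every missing value in the joined columns, including values that were already missing in the right frame.

```python
import pandas as pd


def left_join_def_sketch(df, jdf, jcol, cols, default=None):
    cols = list(cols)
    # A left merge keeps the rows (and order) of the left frame.
    out = df.merge(jdf[[jcol] + cols], on=jcol, how='left')
    if default is not None:
        out[cols] = out[cols].fillna(default)
    return out


ldf = pd.DataFrame({'a': [0, 1, 3]})
rdf = pd.DataFrame({'a': [0, 1], 'b': ['zero', 'one']})
joined = left_join_def_sketch(ldf, rdf, 'a', ['b'], default='n/a')
```

Here the key 3 has no match in the right frame, so its b value becomes the default.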

daul.pandas_utils.nodup(df, cols=None)

Returns non-duplicated rows of a frame df.

If cols is None, all columns are used.

daul.pandas_utils.pdmap(f, df)

Maps a function f over rows of a frame df, with the function application being done over all columns as positional arguments.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_column(df, 'b', df['a'] + 10)
>>> df
   a   b
0  0  10
1  1  11
>>> pdmap(lambda x, y: x + y, df)
[10, 12]
daul.pandas_utils.prefix_cols(df, pref)

Returns a copy of a frame df with prefix pref applied to column labels.

>>> df = pd.DataFrame({'a': [], 'b': []})
>>> list(prefix_cols(df, 'l:').columns)
['l:a', 'l:b']
daul.pandas_utils.rename_cols(df, d)

Returns a copy of a frame df with renamed columns as defined by the dictionary d.

Let us create an empty frame with two columns:

>>> adf = pd.DataFrame({'a': [], 'b': []})

Rename a copy of the frame:

>>> bdf = rename_cols(adf, {'b': 'c'})
>>> list(bdf.columns)
['a', 'c']

Note that the column labels of the previous frame are unchanged:

>>> list(adf.columns)
['a', 'b']
daul.pandas_utils.rename_inner_cols(df, df_col, d={})

Renames columns of inner frames.

df : DataFrame
The frame to transform.
df_col: str
The label of the column where the inner frames are stored.
d : dict
The renaming dictionary.
daul.pandas_utils.renumber_index(df, copy=True)

Returns a copy of a frame df, with the index having consecutive values in the range from 0 to len(df) - 1.

copy : bool
Whether to return a copy of the frame.
daul.pandas_utils.reorder_cols(df, first_cols=[], last_cols=[])

Returns a copy of frame df with reordered columns.

first_cols specifies the columns that will be put first, and last_cols the ones that will be put last.

>>> adf = empty_frame(['a', 'b', 'c', 'd'])
>>> bdf = reorder_cols(adf, ['b'], ['a'])
>>> list(bdf.columns)
['b', 'c', 'd', 'a']

Note that the order remains unchanged in the original frame.

>>> list(adf.columns)
['a', 'b', 'c', 'd']
daul.pandas_utils.row_concat(dfs)

Concatenates frames dfs, row-wise.

daul.pandas_utils.row_split_bool(df, arr)

Row-splits a frame df into two parts: the rows for which the arr is True and those for which it is False.

tdf, fdf : tuple
The frame with the rows for which arr is True, and the frame with the rows for which it is False.
>>> df = pd.DataFrame({'a': np.arange(0, 10)})

Split the frame into the rows with even numbers and those with odd numbers:

>>> adf, bdf = row_split_bool(df, df['a'] % 2 == 0)
>>> adf  
   a
0  0
2  2
4  4
6  6
8  8
daul.pandas_utils.row_split_group_size(df, sz, empty='zerolengroup')

Row-splits the frame into frames having sz rows.

df : DataFrame
The frame to row-split.
sz : int
The number of rows in a group.
empty : str
Applies only if the frame to split is of zero length. Using ‘zerolengroup’ returns one group of zero length (with the column labels preserved). Using ‘nogroup’ returns an empty list.
>>> df = pd.DataFrame({'x': np.arange(0, 1000)})
>>> dfs = row_split_group_size(df, 5)  
>>> len(dfs)
200
>>> len(dfs[0])
5

See also

row_split_ngroups() for examples when the frame to split has zero rows.

daul.pandas_utils.row_split_ngroups(df, n, empty='zerolengroup')

Row-splits the frame into n approximately equal-length frames.

df : DataFrame
The frame to row-split.
n : int
The number of groups.
empty : str
Applies only if the frame to split is of zero length. Using ‘zerolengroup’ returns one group of zero length (with the column labels preserved). Using ‘nogroup’ returns an empty list.

Note

Uses NumPy array_split() for splitting the frame.

Note

If the number of rows r in the frame is less than n, returns r groups.

>>> df = pd.DataFrame({'x': np.arange(0, 1000)})
>>> n = 10  
>>> dfs = row_split_ngroups(df, n)

The total number of frames:

>>> len(dfs)
10

The size of the first frame:

>>> len(dfs[0])
100

Note the behavior if the number of rows is less than the number of groups:

>>> df = pd.DataFrame({'x': np.arange(0, 5)})
>>> dfs = row_split_ngroups(df, 10)
>>> len(dfs)
5

For an empty frame, the result is, by default, a single group holding an empty frame:

>>> df = empty_frame(['a', 'b'])
>>> dfs = row_split_ngroups(df, 5, empty='zerolengroup')
>>> len(dfs)
1
>>> len(dfs[0])
0

In this case the shape of the frame is preserved and thus further processing of the frame will likely succeed:

>>> list(dfs[0].columns)
['a', 'b']

In case of ‘nogroup’, returns an empty list:

>>> row_split_ngroups(df, 5, empty='nogroup')  
[]
daul.pandas_utils.row_split_sizes(df, lens)

Splits a frame df according to a list lens of lengths.

>>> df = pd.DataFrame({'x': np.arange(0, 10)})  
>>> lens = [1, 3, 2, 4]

Split and check whether the lengths correspond.

>>> dfs = row_split_sizes(df, lens)
>>> list(map(len, dfs))
[1, 3, 2, 4]

Let us take a look at the last frame:

>>> dfs[-1]
   x
6  6
7  7
8  8
9  9
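
Splitting by a list of lengths amounts to turning the lengths into start and end offsets and slicing by position. A minimal sketch (`row_split_sizes_sketch` is a hypothetical name; the daul implementation may differ):

```python
import itertools

import numpy as np
import pandas as pd


def row_split_sizes_sketch(df, lens):
    ends = list(itertools.accumulate(lens))   # [1, 4, 6, 10] for [1, 3, 2, 4]
    starts = [0] + ends[:-1]                  # [0, 1, 4, 6]
    return [df.iloc[s:e] for s, e in zip(starts, ends)]


df = pd.DataFrame({'x': np.arange(0, 10)})
parts = row_split_sizes_sketch(df, [1, 3, 2, 4])
```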
daul.pandas_utils.safe_drop_cols(df, cols)

Returns a copy of a frame df with columns cols dropped, but without raising exceptions if they do not exist.

>>> df = empty_frame(['a', 'b', 'c'])
>>> list(safe_drop_cols(df, ['a', 'd']).columns)
['b', 'c']
>>> list(safe_drop_cols(df, 'a').columns)
['b', 'c']
daul.pandas_utils.set_cols(df, cols)

Returns a copy of a frame df with its labels being cols.

>>> adf = empty_frame(['a', 'b'])  
>>> bdf = set_cols(adf, ['c', 'd'])
>>> list(bdf.columns)
['c', 'd']

Note that the columns of adf are not changed:

>>> list(adf.columns)
['a', 'b']
daul.pandas_utils.sort(df, col, ascending=True)

Sorts frame by given col(s).

df : DataFrame
The frame to sort.
col : str or list
Columns by which to sort.
ascending : bool or list
Whether to sort in ascending order.

Note

Internally uses pd.DataFrame.sort_values(), or pd.DataFrame.sort() if the former is not available.

daul.pandas_utils.take_middle_n_rows(df, n, error_if_less=False)

Selects n middle rows of a frame.

Raises:

ValueError

If df has fewer than n rows and error_if_less is set.

>>> df = pd.DataFrame({'col': ['a', 'b', 'c']})  
>>> take_middle_n_rows(df, 1)
  col
1   b
>>> take_middle_n_rows(df, 2)
  col
0   a
1   b
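
The two examples above are consistent with starting the selection at position (len - n) // 2. The sketch below shows that rule on a plain list; for a frame, the same slice could be taken with df.iloc. `take_middle_n_sketch` is a hypothetical name, and returning all rows when there are fewer than n (and error_if_less is not set) is an assumption.

```python
def take_middle_n_sketch(rows, n, error_if_less=False):
    if len(rows) < n:
        if error_if_less:
            raise ValueError('fewer than n rows')
        return rows[:]               # assumption: return everything available
    start = (len(rows) - n) // 2     # rounds toward the beginning
    return rows[start:start + n]
```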
daul.pandas_utils.twocol_dict(df)

Creates a dictionary from the first two columns of a frame df (keys — the first column, values — the second column).

See also

twocol_listvaldict() for creating a list-valued dictionary.

>>> df = pd.DataFrame({'k': ['a', 'b'], 'v': ['x', 'y']})[['k', 'v']]
>>> df
   k  v
0  a  x
1  b  y
>>> twocol_dict(df)
{'a': 'x', 'b': 'y'}
daul.pandas_utils.twocol_dictf(df)

Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.

Note

Creates a dictionary using twocol_dict().

daul.pandas_utils.twocol_dictfd(df, default)

Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.

daul.pandas_utils.twocol_listvaldict(df)

Creates a dictionary from the first two columns of a frame df, (keys — the first column, values — list-aggregated values from the second column for the same key).

See also

twocol_dict().

>>> df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': ['c', 'd', 'e']})[['k', 'v']]
>>> df
   k  v
0  a  c
1  a  d
2  b  e
>>> twocol_listvaldict(df)
{'a': ['c', 'd'], 'b': ['e']}
daul.pandas_utils.twocol_listvaldictf(df)

Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.

See also

The function is a wrapper over twocol_listvaldict().

daul.pandas_utils.twocol_listvaldictfd(df, default)

Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).

See also

The function is a wrapper over twocol_listvaldict().

daul.pandas_utils.update_column(df, col, values, copy=True)

Returns a copy of a frame df with column col set to values.

copy: bool
Whether to return a copy.
>>> adf = pd.DataFrame({'v': [0, 1, 2]})

If a column with the desired label does not exist, a new one is created:

>>> bdf = update_column(adf, 'e', ['a', 'b', 'c'])
>>> list(bdf['e'])
['a', 'b', 'c']

If it does, the column will have new values:

>>> cdf = update_column(bdf, 'e', ['d', 'e', 'f'])
>>> list(cdf['e'])
['d', 'e', 'f']

Note that the previous frame remains the same.

>>> list(bdf['e'])
['a', 'b', 'c']
daul.pandas_utils.update_fixed_column(df, col, value, copy=True)

Returns a copy of a frame df with a column col set to a fixed value value.

See also daul.shortcuts.pdufc() for a shorter form.

copy: bool
Whether to return a copy.
>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_fixed_column(df, 'b', 'text')
>>> list(df['b'])
['text', 'text']

Warning

Be aware that the same object is assigned to every row. If it is mutable, changing it in one row changes it in all the others.

>>> df = pd.DataFrame({'a': [0, 1]})
>>> df = update_fixed_column(df, 'b', [])
>>> list(df['b'])
[[], []]
>>> df['b'].iloc[0].append(1) # change the first object
>>> list(df['b'])
[[1], [1]]
daul.pandas_utils.update_tuple_col(df, cola, colb, values, copy=True)

Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).

copy: bool
Whether to return a copy.
>>> df = pd.DataFrame({'id': [0, 1]})
>>> values = [('green', '#00ff00'), ('blue', '#0000ff')]
>>> df = update_tuple_col(df, 'name', 'hex', values)  
>>> df
   id   name      hex
0   0  green  #00ff00
1   1   blue  #0000ff
daul.pandas_utils.values_list(x)

Returns the x.values of x as a list.

>>> df = pd.DataFrame({'x': np.arange(5)})  
>>> values_list(df['x'])
[0, 1, 2, 3, 4]
daul.pandas_utils.values_set(x)

Returns the x.values of x as a set.

>>> df = pd.DataFrame({'x': ['a', 'a', 'a']})  
>>> values_set(df['x'])
set(['a'])

NumPy utilities

The module collects utilities for dealing with NumPy arrays.

Overview

Selectors

first_nrows(arr, n) Selects first n rows of a 2-dimensional array arr.
each_nth(arr[, n]) Selects each n-th element of an array arr.

Run-length encoding

rle(l) Run-length encodes a given list l.

Normalization

onesum_norm(x) Normalizes an array x to sum to one.

Rounding

floor_decimals(x[, dec]) Floors a value x to a given number dec of decimal places.

Length groups

lengroup_select(arrs, inds) Selects elements from a list of arrays using absolute indices as if the arrays were concatenated.
lengroup_starts_ends(lens) Obtains starting and ending positions of individual length-groups with lengths specified by lens.

Array construction

nonzeros_at(l, pos, val, dtype) Creates an array of a particular length (l), by specifying all non-zero values (val) and their positions (pos).

Cumulative sums

zero_cumsum_nolast(arr) Returns the cumulative sum of an array arr, but starting at zero and without the last element.

Zero/Empty-aware operations

zaw_1d_concatenate(arrs[, default, dtype]) Concatenates 1-d arrays arrs, or returns default if an empty list was provided.
eaw_loc(arr, loc) Returns the elements of an array arr specified by indices loc, and returns an empty array if the indices are empty.
daul.numpy_utils.each_nth(arr, n=1)

Selects each n-th element of an array arr.

>>> list(each_nth(np.arange(10), 3))
[0, 3, 6, 9]
daul.numpy_utils.eaw_loc(arr, loc)

Returns the elements of an array arr specified by indices loc, and returns an empty array if the indices are empty.

>>> list(eaw_loc(np.arange(0, 10), []))
[]
>>> list(eaw_loc([10, 11, 12], [1, 2]))
[11, 12]
daul.numpy_utils.first_nrows(arr, n)

Selects first n rows of a 2-dimensional array arr.

>>> arr = np.array([[1, 2], [2, 3], [3, 4]])
>>> arr  
array([[1, 2],
       [2, 3],
       [3, 4]])
>>> first_nrows(arr, 1)
array([[1, 2]])
daul.numpy_utils.floor_decimals(x, dec=0)

Floors a value x to a given number dec of decimal places.

>>> x = 123.456
>>> floor_decimals(x, 0)
123.0
>>> floor_decimals(x, 1)
123.4
>>> floor_decimals(x, 2)
123.45

Note the behavior for a negative number of decimals.

>>> floor_decimals(x, -1)
120.0
>>> floor_decimals(x, -2)
100.0
>>> floor_decimals(x, -3)
0.0
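
Flooring to a number of decimal places can be sketched by scaling, flooring, and scaling back. The version below handles negative dec with integer powers of ten to avoid dividing by 0.1, 0.01 and so on; `floor_decimals_sketch` is a hypothetical name and the daul implementation may differ. As with any floating-point scaling, a product such as x * 100 can land just below the expected value for some inputs.

```python
import math


def floor_decimals_sketch(x, dec=0):
    if dec >= 0:
        scale = 10 ** dec
        return math.floor(x * scale) / scale
    scale = 10 ** (-dec)
    return float(math.floor(x / scale) * scale)
```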
daul.numpy_utils.lengroup_outer_inner_indices(lens, lpos)

Returns the outer and inner indices given lengths lens of length-grouped arrays and the desired position lpos.

Suppose we have a length-group array with 2, 4, and 3 elements.

>>> lens = np.array([2, 4, 3])

For convenience, let us label the function with a shorter name.

>>> f = lengroup_outer_inner_indices

Now, let us obtain all possible absolute indices.

>>> lpos = np.arange(sum(lens))
>>> o, i = f(lens, lpos)

Let us take a look at the outer indices:

>>> list(o)
[0, 0, 1, 1, 1, 1, 2, 2, 2]

We see that we are getting the proper outer index.

Let us take a look at the inner (within-group) indices:

>>> list(i)
[0, 1, 0, 1, 2, 3, 0, 1, 2]

Note that an exception is raised if any index is out of range.

>>> f(lens, -1)
Traceback (most recent call last):
...
IndexError: The absolute index is not within limits.
>>> f(lens, sum(lens))
Traceback (most recent call last):
...
IndexError: The absolute index is not within limits.

If we provide empty indices, we also get empty ones.

>>> o, i = f(lens, [])
>>> len(o) == 0 and len(i) == 0
True
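
The mapping from absolute positions to (outer, inner) pairs can be sketched in pure Python with a binary search over the cumulative group ends. This is an illustration, not the daul implementation: `outer_inner_sketch` is a hypothetical name, it returns plain lists, and unlike the function above it accepts only an iterable of positions, not a single scalar.

```python
import bisect
import itertools


def outer_inner_sketch(lens, lpos):
    ends = list(itertools.accumulate(lens))   # [2, 6, 9] for lens [2, 4, 3]
    total = ends[-1] if ends else 0
    outer, inner = [], []
    for p in lpos:
        if not 0 <= p < total:
            raise IndexError('The absolute index is not within limits.')
        o = bisect.bisect_right(ends, p)      # first group whose end exceeds p
        outer.append(o)
        inner.append(p - (ends[o] - lens[o])) # offset from the group's start
    return outer, inner
```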

daul.numpy_utils.lengroup_select(arrs, inds)

Selects elements from a list of arrays using absolute indices as if the arrays were concatenated.

arrs : list of arrays
The arrays from which to select elements.
inds : array_like
Indices to choose.
>>> arrs = [np.array([10, 11, 12]), np.array([21, 22])]
>>> inds = [0, 3]  
>>> lengroup_select(arrs, inds)
[10, 21]
>>> arrs_2d = [np.array([[0, 1], [2, 3]]), np.array([[4, 5, 6], [7, 8, 9]])]
>>> lengroup_select(arrs_2d, inds)
[array([0, 1]), array([7, 8, 9])]
>>> lengroup_select(arrs_2d, [])
[]
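
As a rough illustration of selecting "as if the arrays were concatenated" without actually concatenating them, the sketch below locates each absolute index in its source array. The name lengroup_select_sketch and the approach are assumptions made for illustration.

```python
import numpy as np


def lengroup_select_sketch(arrs, inds):
    """Pick elements (or rows) by absolute index across a list of arrays."""
    lens = np.array([len(a) for a in arrs], dtype=np.intp)
    ends = np.cumsum(lens)
    starts = ends - lens
    total = int(ends[-1]) if len(ends) else 0
    selected = []
    for pos in inds:
        if not 0 <= pos < total:
            raise IndexError("The absolute index is not within limits.")
        # Group that contains pos, then the offset inside that group.
        outer = int(np.searchsorted(ends, pos, side="right"))
        selected.append(arrs[outer][pos - starts[outer]])
    return selected


arrs = [np.array([10, 11, 12]), np.array([21, 22])]
arrs_2d = [np.array([[0, 1], [2, 3]]), np.array([[4, 5, 6], [7, 8, 9]])]
flat_pick = lengroup_select_sketch(arrs, [0, 3])
row_pick = lengroup_select_sketch(arrs_2d, [0, 3])
```

For 2-dimensional inputs the length of each array is its number of rows, so an absolute index selects a whole row, which is why rows of different widths can be returned side by side.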
daul.numpy_utils.lengroup_starts_ends(lens)

Obtains starting and ending positions of individual length-groups with lengths specified by lens.

>>> lens = [2, 4, 3]
>>> s, e = lengroup_starts_ends(lens)  
>>> s 
array([0, 2, 6])
>>> e
array([2, 6, 9])
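
A minimal sketch of how these positions can be derived and then used to slice a flat array back into its groups. The helper name starts_ends_sketch is hypothetical and the computation is an assumption based on the output shown above.

```python
import numpy as np


def starts_ends_sketch(lens):
    """Return (starts, ends) for consecutive groups with the given lengths."""
    lens = np.asarray(lens)
    ends = np.cumsum(lens)   # exclusive end of each group
    starts = ends - lens     # inclusive start of each group
    return starts, ends


starts, ends = starts_ends_sketch([2, 4, 3])
flat = np.arange(9)
# Half-open slices [start, end) recover the individual groups.
groups = [flat[s:e].tolist() for s, e in zip(starts, ends)]
```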
daul.numpy_utils.nonzeros_at(l, pos, val, dtype)

Creates an array of a particular length (l), by specifying all non-zero values (val) and their positions (pos).

dtype : dtype
The dtype of the array.
>>> nonzeros_at(l=5, pos=[2, 3], val=1, dtype=int)
array([0, 0, 1, 1, 0])
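
The same result can be obtained with plain NumPy by allocating zeros and assigning at the given positions. The sketch below assumes this is all the function does; nonzeros_at_sketch is a hypothetical name.

```python
import numpy as np


def nonzeros_at_sketch(l, pos, val, dtype):
    """Build a length-l array of zeros with val written at positions pos."""
    arr = np.zeros(l, dtype=dtype)
    # val may be a scalar (broadcast) or a sequence matching pos.
    arr[np.asarray(pos, dtype=np.intp)] = val
    return arr


ones_at = nonzeros_at_sketch(l=5, pos=[2, 3], val=1, dtype=int)
mixed_at = nonzeros_at_sketch(l=5, pos=[2, 3], val=[7, 9], dtype=int)
```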
daul.numpy_utils.onesum_norm(x)

Normalizes an array x to sum to one.

>>> onesum_norm(np.array([1, 3]))
array([0.25, 0.75])
daul.numpy_utils.rle(l)

Run-length encodes a given list l.

>>> l = ['a', 'a', 'a', 'b', 'c', 'c', 'a']
>>> rle(l)
[('a', 3), ('b', 1), ('c', 2), ('a', 1)]
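
Run-length encoding of this kind can be sketched with itertools.groupby from the standard library. Both function names below are hypothetical, and the decoder is included only to show that the encoding is reversible.

```python
from itertools import groupby


def rle_sketch(items):
    """Encode consecutive runs as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(items)]


def rle_decode_sketch(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


letters = ['a', 'a', 'a', 'b', 'c', 'c', 'a']
encoded = rle_sketch(letters)
decoded = rle_decode_sketch(encoded)
```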
daul.numpy_utils.zaw_1d_concatenate(arrs, default=array([], dtype=float64))

Concatenates 1-d arrays arrs, or returns default if an empty list was provided.

>>> default = np.array([], dtype=np.int32)
>>> zaw_1d_concatenate([], default=default)
array([], dtype=int32)
>>> zaw_1d_concatenate([[1, 2], [3]])
array([1, 2, 3])
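
The default matters because np.concatenate raises a ValueError when given an empty list. A minimal sketch of the guard, assuming nothing beyond what the description states (concat_or_default_sketch is a hypothetical name):

```python
import numpy as np


def concat_or_default_sketch(arrs, default=None):
    """Concatenate 1-d arrays, falling back to default for an empty list."""
    if len(arrs) == 0:
        # Mirror the documented default: an empty float64 array.
        return np.array([], dtype=np.float64) if default is None else default
    return np.concatenate(arrs)


fallback = concat_or_default_sketch([], default=np.array([], dtype=np.int32))
joined = concat_or_default_sketch([[1, 2], [3]])
```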
daul.numpy_utils.zero_cumsum_nolast(arr)

Returns the cumulative sum of an array arr, but starting at zero and without the last element.

Note

The function is useful when dealing with a list of arrays (of varying lengths). Applying the function to a list of lengths gives the starting positions in absolute indices. Applying np.cumsum() to the same lengths then gives the ending positions.

For the sake of illustration, let us show a practical example using a list of strings.

>>> seqs = ['abc', 'defgh', 'ij', 'klmnop']  

Let us obtain their lengths:

>>> arr = list(map(len, seqs))

Now we get their starting and ending positions:

>>> starts = zero_cumsum_nolast(arr)
>>> ends = np.cumsum(arr)  

Suppose the strings are concatenated:

>>> useq = "".join(seqs)  
>>> useq  
'abcdefghijklmnop'

We can select the corresponding elements like this:

>>> i = 1
>>> s, e = starts[i], ends[i]
>>> useq[s:e]
'defgh'
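
The operation amounts to an exclusive cumulative sum. The sketch below computes it as the inclusive cumulative sum minus the input, which is an assumption about one possible implementation; zero_cumsum_nolast_sketch is a hypothetical name.

```python
import numpy as np


def zero_cumsum_nolast_sketch(arr):
    """Exclusive cumulative sum: element k is the sum of arr[:k]."""
    arr = np.asarray(arr)
    return np.cumsum(arr) - arr


seqs = ['abc', 'defgh', 'ij', 'klmnop']
lengths = [len(s) for s in seqs]
starts = zero_cumsum_nolast_sketch(lengths)
ends = np.cumsum(lengths)
useq = "".join(seqs)
# Slicing the concatenated string recovers each original piece.
pieces = [useq[s:e] for s, e in zip(starts, ends)]
```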

Shortcuts

Abbreviated names of common functions from pandas_utils.

Overview

Shortcuts: DataFrame columns

pduc(df, col, values[, copy]) Returns a copy of a frame df with column col set to values.
pdutc(df, cola, colb, values[, copy]) Returns a copy of a frame df and sets values of two columns (cola, colb) from a list of 2-len tuples (values).
pdufc(df, col, value[, copy]) Returns a copy of a frame df with a column col set to a fixed value value.
pdca(df, col, f[, store_as, copy]) Returns a copy of a frame df with a function f applied over a column col and stored as the same column col.

Shortcuts: Conversion of DataFrames to dictionaries

pdtcd(df) Creates a dictionary from the first two columns of a frame df (keys: the first column, values: the second column).
pdtcdf(df) Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column.
pdtcdfd(df, default) Creates a function that maps values from the first column of df to the values in the corresponding rows of the second column, with a default value default.
pdtclvd(df) Creates a dictionary from the first two columns of a frame df (keys: the first column, values: list-aggregated values from the second column for the same key).
pdtclvdf(df) Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key.
pdtclvdfd(df, default) Creates a function that maps keys from the first column of df into a list of values from the second column corresponding to the same key (with a default value default, if no such key is found).
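
To make the difference between the plain and the list-aggregated conversions concrete, here is a small sketch written against pandas directly. The function names are hypothetical stand-ins, and the behavior for duplicate keys in the plain dictionary (last value wins) is a property of this sketch, not a documented guarantee of pdtcd.

```python
import pandas as pd


def two_column_dict_sketch(df):
    """Map first-column values to second-column values (last key wins here)."""
    return dict(zip(df.iloc[:, 0], df.iloc[:, 1]))


def two_column_list_dict_sketch(df):
    """Map each first-column key to the list of its second-column values."""
    grouped = {}
    for key, value in zip(df.iloc[:, 0], df.iloc[:, 1]):
        grouped.setdefault(key, []).append(value)
    return grouped


def two_column_list_lookup_sketch(df, default):
    """Return a function that looks up a key, falling back to default."""
    grouped = two_column_list_dict_sketch(df)
    return lambda key: grouped.get(key, default)


frame = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})
plain = two_column_dict_sketch(frame)
listed = two_column_list_dict_sketch(frame)
lookup = two_column_list_lookup_sketch(frame, default=[])
```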

Shortcuts: Inner frames

pdioc(df, inndf_col, col) Includes a column from the outer frame as a fixed column in the inner frame.

Shortcuts: Other

pdri(df[, copy]) Returns a copy of a frame df, with the index having consecutive values in the range from 0 to len(df) - 1.