genome_kmers package

kmers

class genome_kmers.kmers.Kmers(seq_coll: SequenceCollection | None = None, min_kmer_len: int = 1, max_kmer_len: int | None = None, source_strand: str = 'forward', track_strands_separately: bool = False, method: str = 'single_pass')[source]

Bases: object

Defines memory-efficient kmer calculations on a genome.

generate_get_kmer_info_func(one_based_seq_index: bool) Callable[source]

Generate the get_kmer_info function that is used to get kmer information from a sequence byte array index.

Parameters:

one_based_seq_index (bool) – whether to return one-based sequence indices

Returns:

get_kmer_info

Return type:

Callable

get_is_less_than_func(validate_kmers: bool = True, break_ties: bool = False) Callable[source]

Returns a less than function that takes two integers as input and returns whether the kmer defined by the first index is less than the kmer defined by the second index.

NOTE: If break_ties is True, the function returns True when the first of two equal kmers has the smaller sba_start_index. This is useful for guaranteeing identical output between different runs. However, it comes at a significant performance cost due to the additional swapping required.

Parameters:
  • validate_kmers (bool, optional) – Explicitly verify that both kmers are at least of min_kmer_len. Defaults to True.

  • break_ties (bool, optional) – if two kmers are lexicographically equivalent, break the tie using the sba_start_index. Defaults to False.

Raises:

NotImplementedError – kmer_source_strand and strands_loaded must both be “forward”

Returns:

is_less_than_func

Return type:

Callable
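
The tie-breaking behaviour described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: make_is_less_than is a hypothetical helper, the sequence byte array is modelled as a bytes object, and kmers are assumed to run from their start index to the next “$” separator.

```python
def make_is_less_than(sba: bytes, break_ties: bool = False):
    """Return a less-than function over kmer start indices (illustrative only)."""
    end = ord("$")

    def kmer_at(start: int) -> bytes:
        # The kmer runs from start up to (not including) the next "$".
        stop = start
        while stop < len(sba) and sba[stop] != end:
            stop += 1
        return sba[start:stop]

    def is_less_than(idx_a: int, idx_b: int) -> bool:
        kmer_a, kmer_b = kmer_at(idx_a), kmer_at(idx_b)
        if kmer_a != kmer_b:
            return kmer_a < kmer_b
        # Equal kmers: optionally order by start index so output is reproducible.
        return break_ties and idx_a < idx_b

    return is_less_than
```

Without tie-breaking, two equal kmers are never "less than" each other, so their relative order after sorting depends on the sort algorithm. With tie-breaking, the order is fully determined by the start index.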

get_kmer_count(kmer_len: int | None, kmer_filter_func: ~typing.Callable = CPUDispatcher(<function kmer_filter_keep_all>), min_group_size: int = 1, max_group_size: int | None = None) int[source]

A customizable function to count the total number of kmers passing filters.

Examples:

Count kmers that occur exactly once:
    kmers.get_kmer_count(max_group_size=1)

Count kmers that are repeated at least 5 times and no more than 10 times:
    kmers.get_kmer_count(min_group_size=5, max_group_size=10)

Count 50-mers:
    filter = gen_kmer_length_filter_func(min_kmer_len=50)
    kmers.get_kmer_count(kmer_filter_func=filter)

Parameters:
  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.

  • min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.

  • max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.

Raises:
  • NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”

  • ValueError – kmer_len is invalid

  • ValueError – one or more group params are invalid (min_group_size, max_group_size, yield_first_n)

  • ValueError – kmer_comparison_func is provided when kmers have not been sorted

Returns:

total_kmer_count

Return type:

int

get_kmer_group_counts(kmer_len: int | None, kmer_filter_func: ~typing.Callable = CPUDispatcher(<function kmer_filter_keep_all>), min_group_size: int = 1, max_group_size: int | None = None, max_counts_bin: int = 1000000) tuple[array, int][source]

A customizable function to build a histogram of group sizes for kmers passing filters.

Examples:

Get histogram for 50-mers:
    filter = gen_kmer_length_filter_func(min_kmer_len=50)
    kmers.get_kmer_group_counts(kmer_filter_func=filter)

Parameters:
  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.

  • min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.

  • max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.

  • max_counts_bin (int, optional) – largest group_size bin in the return counts_by_group_size array. Group sizes that exceed this value will be placed in this bin. Defaults to 1000000.

Raises:
  • NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”

  • ValueError – kmer_len is invalid

  • ValueError – one or more group params are invalid (min_group_size, max_group_size, yield_first_n)

  • ValueError – kmer_comparison_func is provided when kmers have not been sorted

Returns:

counts_by_group_size (np.array): histogram of group sizes

counts_by_group_size[i] gives the number of groups of size i. Any group sizes > max_counts_bin will be placed in max_counts_bin.

total_kmer_count (int): total number of kmers

Return type:

tuple[np.array, int]
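
As a rough illustration of the histogram semantics, the sketch below counts group sizes over an already-sorted list of kmer strings and clamps oversized groups into the last bin. The function name and the use of plain Python lists are assumptions made for the example; the library operates on sequence byte arrays and returns a NumPy array.

```python
from itertools import groupby


def group_size_hist(sorted_kmers: list, max_counts_bin: int) -> tuple:
    """Histogram of group sizes for a sorted list of kmers (illustrative only)."""
    counts_by_group_size = [0] * (max_counts_bin + 1)
    total_kmer_count = 0
    for _, group in groupby(sorted_kmers):
        size = sum(1 for _ in group)
        # Groups larger than max_counts_bin are placed in the last bin.
        counts_by_group_size[min(size, max_counts_bin)] += 1
        total_kmer_count += size
    return counts_by_group_size, total_kmer_count


counts, total = group_size_hist(["AA", "AA", "AA", "AC", "CG", "CG"], max_counts_bin=2)
```

Here the groups have sizes 3, 1, and 2. With max_counts_bin=2, the group of size 3 is counted in bin 2 alongside the group of size 2.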

get_kmer_str(kmer_num: int, kmer_len: int | None = None) str[source]

Get the kmer_num’th kmer of kmer_len.

Parameters:
  • kmer_num (int) – which number kmer to return (in range [0, num_kmers - 1])

  • kmer_len (Union[int, None], optional) – length of kmer to return. If kmer_len is None, return the longest possible, which is when the segment ends or the kmer_max_len is reached. Defaults to None.

Raises:
  • NotImplementedError – kmer_source_strand and strands_loaded must both be “forward”

  • ValueError – kmer_num is invalid

  • ValueError – kmer_len is invalid

Returns:

kmer

Return type:

str
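
The "longest possible" rule for kmer_len=None can be sketched as follows. This is an assumption-laden illustration, not the library code: kmer_str_at is a hypothetical helper that takes a start index into a bytes object rather than a kmer_num, and it treats “$” as the segment terminator.

```python
def kmer_str_at(sba: bytes, start: int, kmer_len=None, max_kmer_len=None) -> str:
    """Return the kmer starting at `start` as a string (illustrative only)."""
    stop = sba.find(b"$", start)
    if stop == -1:
        stop = len(sba)
    available = stop - start  # bases left before the segment ends
    if kmer_len is None:
        # Longest possible: stop at the segment end or at max_kmer_len.
        length = available if max_kmer_len is None else min(available, max_kmer_len)
    else:
        if kmer_len > available:
            raise ValueError("kmer_len extends past the end of the segment")
        length = kmer_len
    return sba[start:start + length].decode("ascii")
```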

get_kmer_str_no_checks(kmer_num: int, kmer_strand: str, kmer_len: int) str[source]

Return a string representation of kmer_num on kmer_strand with kmer_len. The arguments are not validated. Only call this function if it is known that these checks have already been completed (e.g. when the values were yielded by get_kmers()).

Parameters:
  • kmer_num (int) – kmer number

  • kmer_strand (str) – “+” or “-”

  • kmer_len (int) – length of the kmer

Raises:
  • NotImplementedError – kmer_strand != “+”

  • ValueError – unrecognized kmer_strand

Returns:

kmer_str

Return type:

str

get_kmers(kmer_len: int | None, one_based_seq_index: bool = False, kmer_filter_func: ~typing.Callable = CPUDispatcher(<function kmer_filter_keep_all>), kmer_info_to_yield: str = 'minimum', min_group_size: int = 1, max_group_size: int | None = None, yield_first_n: int | None = None) Generator[tuple, None, None][source]

A customizable generator yielding tuples with all kmer information.

Examples:

Yield all kmers:
    kmers.get_kmers(yield_first_n=None)

Yield only the first occurrence of a kmer:
    kmers.get_kmers(yield_first_n=1)

Yield up to the first 10 occurrences of a kmer:
    kmers.get_kmers(yield_first_n=10)

Yield all kmers that occur exactly once:
    kmers.get_kmers(max_group_size=1)

Yield all kmers that are repeated at least 5 times and no more than 10 times:
    kmers.get_kmers(min_group_size=5, max_group_size=10)

NOTE: group yielding is not supported if kmers are unsorted. The user cannot provide min_group_size, max_group_size, or yield_first_n in this situation.

Parameters:
  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • one_based_seq_index (bool, optional) – whether yielded sequence indices should be 1-based. Defaults to False.

  • kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.

  • kmer_info_to_yield (str) – “minimum” or “full”. Defaults to “minimum”

  • min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.

  • max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.

  • yield_first_n (Union[int, None], optional) – yield up to this many kmer_nums. Defaults to None.

Raises:
  • NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”

  • ValueError – kmer_len is invalid

  • ValueError – one or more group params are invalid (min_group_size, max_group_size, yield_first_n)

  • ValueError – kmer_comparison_func is provided when kmers have not been sorted

Yields:

Generator[tuple, None, None] – output depends on get_kmer_info_func

load(load_file_path: Path, seq_coll: SequenceCollection | None = None, format: str = 'hdf5') None[source]

Load Kmers object from saved file.

Parameters:
  • load_file_path (Path) – path to file to load.

  • seq_coll (Union[SequenceCollection, None], optional) – If provided, seq_coll will be loaded into the kmers object rather than attempting to load from the saved file. Defaults to None.

  • format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.

Raises:

ValueError – format not recognized

save(save_file_path: Path, include_sequence_collection: bool = False, format: str = 'hdf5', mode: str = 'w') None[source]

Save Kmers object to file.

Parameters:
  • save_file_path (Path) – path for saved file

  • include_sequence_collection (bool, optional) – whether to include sequence collection. Defaults to False.

  • format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.

  • mode (str, optional) – mode with which to open file and save information. “w” for write or “a” for append. Defaults to “w”.

Raises:

ValueError – format not recognized

sort()[source]

Sort (in place) the kmer_sba_start_indices array by lexicographically comparing the kmers defined at each index.

Raises:

NotImplementedError – kmer_source_strand and strands_loaded must both be “forward”

to_csv(kmer_len, output_file_path, fields=['kmer'])[source]

Write all kmers to CSV file using a simple function.

genome_kmers.kmers.compare_sba_kmers_always_less_than(sba_a: array, sba_b: array, kmer_sba_start_idx_a: int, kmer_sba_start_idx_b: int, max_kmer_len: int | None = None) tuple[int, int][source]
genome_kmers.kmers.compare_sba_kmers_lexicographically(sba_a: array, sba_b: array, kmer_sba_start_idx_a: int, kmer_sba_start_idx_b: int, max_kmer_len: int | None = None) tuple[int, int][source]

Lexicographically compare two kmers of up to max_kmer_len bases. If max_kmer_len is None, the end of the segment defines the longest kmer.

NOTE: This function does not validate max_kmer_len. It compares up to max_kmer_len bases if required, but it returns as soon as the comparison result is known.

Parameters:
  • sba_a (np.array) – sequence byte array a

  • sba_b (np.array) – sequence byte array b

  • kmer_sba_start_idx_a (int) – index in sba that defines the start of kmer a

  • kmer_sba_start_idx_b (int) – index in sba that defines the start of kmer b

  • max_kmer_len (Union[int, None], optional) – Maximum length of the kmers to compare. If None, the end of the segment defines the longest kmers to compare. Defaults to None.

Raises:

AssertionError – there were no valid bases to compare

Returns:

comparison, last_kmer_index_compared
comparison: +1 if kmer_a > kmer_b, 0 if kmer_a == kmer_b, -1 if kmer_a < kmer_b

last_kmer_index_compared: the kmer index of the last valid comparison done between two bases. If a single base was compared, then this value will be 0.

Return type:

tuple[int, int]
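
A minimal pure-Python sketch of this early-exit comparison is shown below. It is not the library's implementation: compare_kmers is a hypothetical name, the arrays are modelled as bytes, “$” is assumed to end a segment, and the choice to sort a shorter kmer before a longer one that it prefixes is an assumption.

```python
def compare_kmers(sba_a, sba_b, start_a, start_b, max_kmer_len=None):
    """Return (comparison, last_kmer_index_compared) (illustrative only)."""
    end = ord("$")
    comparison = 0
    last_idx = -1
    i = 0
    while max_kmer_len is None or i < max_kmer_len:
        a_done = start_a + i >= len(sba_a) or sba_a[start_a + i] == end
        b_done = start_b + i >= len(sba_b) or sba_b[start_b + i] == end
        if a_done or b_done:
            # Assumption: a kmer that ends first sorts before the longer one.
            if a_done != b_done:
                comparison = -1 if a_done else 1
            break
        last_idx = i
        base_a, base_b = sba_a[start_a + i], sba_b[start_b + i]
        if base_a != base_b:
            # Result is known; stop without looking at further bases.
            comparison = -1 if base_a < base_b else 1
            break
        i += 1
    assert last_idx >= 0, "there were no valid bases to compare"
    return comparison, last_idx
```

For example, comparing the kmers at indices 0 and 5 of b"ACGT$ACGA" differs at the fourth base, so the sketch returns +1 with a last compared index of 3. Limiting the comparison to three bases reports the kmers as equal.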

genome_kmers.kmers.crispr_ngg_pam_filter(sba: array, sba_strand: str, kmer_sba_start_idx: int) bool[source]

Filter that passes for all 23-mers ending in GG.

NOTE: no other checks on kmer validity are carried out.

Raises:

ValueError – kmer extends beyond the size of the sequence byte array

Returns:

whether kmer passes filter or not

Return type:

bool
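
The check itself is small. The sketch below shows one way it could look on a plain bytes object; ngg_pam_filter is a hypothetical stand-in that omits the sba_strand argument and assumes uppercase bases.

```python
def ngg_pam_filter(sba: bytes, kmer_start: int) -> bool:
    """Pass if the 23-mer starting at kmer_start ends in GG (illustrative only)."""
    kmer_len = 23
    if kmer_start + kmer_len > len(sba):
        raise ValueError("kmer extends beyond the size of the sequence byte array")
    # The last two bases of the 23-mer sit at offsets 21 and 22.
    return sba[kmer_start + 21:kmer_start + 23] == b"GG"
```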

genome_kmers.kmers.gen_kmer_gc_content_filter_func(min_allowed_gc_frac: float, max_allowed_gc_frac: float, kmer_len: int) Callable[source]

Generate a filter function that returns True if fraction GC content is between min_allowed_gc_frac and max_allowed_gc_frac.

NOTE: this function does not do extra checks to ensure that the kmer itself is valid (i.e. that it does not overflow over a boundary or beyond the sequence byte array, is made of valid bases, etc).

Parameters:
  • min_allowed_gc_frac (float) – minimum allowed fraction GC content (inclusive). Must be in the range [0.0, 1.0] and be <= max_allowed_gc_frac.

  • max_allowed_gc_frac (float) – maximum allowed fraction GC content (inclusive). Must be in the range [0.0, 1.0] and be >= min_allowed_gc_frac.

  • kmer_len (int) – length of kmer

Raises:

ValueError – min_allowed_gc_frac or max_allowed_gc_frac is invalid

Returns:

filter

Return type:

Callable
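
The generator pattern used here is a closure: the bounds are validated once, and the returned function only does the per-kmer work. A minimal sketch, assuming a bytes sequence, uppercase bases, and a hypothetical two-argument filter signature:

```python
def make_gc_filter(min_allowed_gc_frac: float, max_allowed_gc_frac: float, kmer_len: int):
    """Build a GC-content filter (illustrative only)."""
    if not 0.0 <= min_allowed_gc_frac <= max_allowed_gc_frac <= 1.0:
        raise ValueError("min_allowed_gc_frac or max_allowed_gc_frac is invalid")
    gc_bases = frozenset(b"GC")

    def gc_filter(sba: bytes, kmer_start: int) -> bool:
        kmer = sba[kmer_start:kmer_start + kmer_len]
        gc_count = sum(1 for base in kmer if base in gc_bases)
        # Both bounds are inclusive.
        return min_allowed_gc_frac <= gc_count / kmer_len <= max_allowed_gc_frac

    return gc_filter
```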

genome_kmers.kmers.gen_kmer_homopolymer_filter_func(max_homopolymer_size: int, kmer_len: int) Callable[source]

Generate a filter function that passes if there is no homopolymer of length greater than max_homopolymer_size.

NOTE: this function does not do extra checks to ensure that the kmer itself is valid (i.e. that it does not overflow over a boundary or beyond the sequence byte array, is made of valid bases, etc).

Parameters:
  • max_homopolymer_size (int) – largest allowed homopolymer. Must be >= 1

  • kmer_len (int) – length of kmer

Raises:

ValueError – max_homopolymer_size must be >= 2

Returns:

filter

Return type:

Callable
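
A run-length scan is one straightforward way to implement such a filter. The sketch below is illustrative only: make_homopolymer_filter is a hypothetical name, the sequence is a bytes object, and argument validation is left out.

```python
def make_homopolymer_filter(max_homopolymer_size: int, kmer_len: int):
    """Build a filter rejecting kmers with long single-base runs (illustrative only)."""

    def homopolymer_filter(sba: bytes, kmer_start: int) -> bool:
        run = 1  # length of the current run of identical bases
        for i in range(kmer_start + 1, kmer_start + kmer_len):
            run = run + 1 if sba[i] == sba[i - 1] else 1
            if run > max_homopolymer_size:
                return False
        return True

    return homopolymer_filter
```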

genome_kmers.kmers.gen_kmer_length_filter_func(min_kmer_len: int) Callable[source]

Generates a filter that passes if the kmer is at least min_kmer_len long and fails otherwise.

Parameters:

min_kmer_len (int) – minimum required kmer length

Returns:

filter

Return type:

Callable

genome_kmers.kmers.gen_no_ambiguous_bases_filter(kmer_len: int) Callable[source]

Generate a filter that checks that only “A”, “T”, “G”, or “C” are present in the kmer.

Parameters:

kmer_len (int) – kmer length

Raises:

ValueError – end of segment or sequence byte array is reached before end of kmer

Returns:

no_ambiguous_bases_filter

Return type:

Callable
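
For illustration, an unambiguous-base check could be sketched as below. The helper name, the bytes representation, the “$” segment terminator, and the restriction to uppercase bases are all assumptions of the example.

```python
def make_no_ambiguous_bases_filter(kmer_len: int):
    """Build a filter that accepts only A, T, G, C (illustrative only)."""
    allowed = frozenset(b"ATGC")
    end = ord("$")

    def no_ambiguous_bases_filter(sba: bytes, kmer_start: int) -> bool:
        for i in range(kmer_start, kmer_start + kmer_len):
            if i >= len(sba) or sba[i] == end:
                raise ValueError(
                    "end of segment or sequence byte array is reached before end of kmer"
                )
            if sba[i] not in allowed:
                return False
        return True

    return no_ambiguous_bases_filter
```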

genome_kmers.kmers.get_compare_sba_kmers_func(kmer_len)[source]
genome_kmers.kmers.get_kmer_group_size_hist(sba: array, sba_strand: str, kmer_len: int | None, kmer_start_indices: array, kmer_comparison_func: Callable, kmer_filter_func: Callable, min_group_size: int = 1, max_group_size: int | None = None, max_counts_bin: int = 1000000) tuple[array, int][source]

Build a histogram of group sizes. counts_by_group_size[i] gives the number of groups of size i. Any group sizes > max_counts_bin will be placed in max_counts_bin. The total number of kmers is also calculated.

Parameters:
  • sba (np.array) – sequence byte array

  • sba_strand (str) – “forward” or “reverse_complement”

  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • kmer_start_indices (np.array) – kmer sequence byte array start indices

  • kmer_comparison_func (Callable) – function that returns the result of a two kmer comparison

  • kmer_filter_func (Callable) – function that returns true if a kmer passes all filters

  • min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.

  • max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.

  • max_counts_bin (int, optional) – largest group_size bin in the return counts_by_group_size array. Group sizes that exceed this value will be placed in this bin. Defaults to 1000000.

Raises:

ValueError – max_counts_bin is invalid

Returns:

counts_by_group_size, total_kmer_count

Return type:

tuple[np.array, int]

genome_kmers.kmers.get_kmer_info_group_size_only(kmer_num: int, kmer_sba_start_indices: array, sba: array, kmer_len: int | None, group_size_yielded: int, group_size_total: int) tuple[int, int, int][source]

Return only group_size_total without any processing.

Parameters:
  • kmer_num (int) – kmer number

  • kmer_sba_start_indices (np.array) – kmer sequence byte array start indices

  • sba (np.array) – sequence byte array

  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • group_size_yielded (int) – number of kmers in the group that will be yielded

  • group_size_total (int) – number of kmers in the group in total

Returns:

group_size_total

Return type:

int

genome_kmers.kmers.get_kmer_info_minimal(kmer_num: int, kmer_sba_start_indices: array, sba: array, kmer_len: int | None, group_size_yielded: int, group_size_total: int) tuple[int, int, int][source]

Return only basic kmer information without any processing. Used as an input to kmer_info_by_group_generator when only basic information is required.

Parameters:
  • kmer_num (int) – kmer number

  • kmer_sba_start_indices (np.array) – kmer sequence byte array start indices

  • sba (np.array) – sequence byte array

  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • group_size_yielded (int) – number of kmers in the group that will be yielded

  • group_size_total (int) – number of kmers in the group in total

Returns:

kmer_num, group_size_yielded, group_size_total

Return type:

tuple[int, int, int]

genome_kmers.kmers.kmer_filter_keep_all(sba: array, sba_strand: str, kmer_sba_start_idx: int)[source]
genome_kmers.kmers.kmer_has_required_len(sba: array, sba_start_idx: int, min_kmer_len: int) bool[source]

Checks whether the kmer is of at least min_kmer_len before reaching the end of the segment.

Parameters:
  • sba (np.array) – sequence byte array

  • sba_start_idx (int) – sequence byte array start index for the kmer

  • min_kmer_len (int) – minimum kmer length

Return type:

bool

genome_kmers.kmers.kmer_info_by_group_generator(sba: array, sba_strand: str, kmer_len: int | None, kmer_start_indices: array, kmer_comparison_func: Callable, kmer_filter_func: Callable, kmer_info_func: Callable, min_group_size: int = 1, max_group_size: int | None = None, yield_first_n: int | None = None) Generator[tuple, None, None][source]

Generator that yields the valid kmer information and total group size for all groups meeting requirements. A valid kmer is one that passes the provided kmer_filter_func. A group is defined as the set of identical kmers as defined by the kmer_comparison_func. The first “yield_first_n” kmers will be yielded if the group meets all provided requirements. It must have a total group size between min_group_size and max_group_size (inclusive). The kmer information that is yielded is customizable and defined by kmer_info_func.

Parameters:
  • sba (np.array) – sequence byte array

  • sba_strand (str) – “forward” or “reverse_complement”

  • kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.

  • kmer_start_indices (np.array) – kmer sequence byte array start indices

  • kmer_comparison_func (Callable) – function that returns the result of a two kmer comparison

  • kmer_filter_func (Callable) – function that returns true if a kmer passes all filters

  • kmer_info_func (Callable) – function that returns a tuple with all the kmer information to be yielded.

  • min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.

  • max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.

  • yield_first_n (Union[int, None], optional) – yield up to this many kmer_nums. Defaults to None.

Raises:

ValueError – invalid min_group_size, max_group_size, or yield_first_n

Yields:

Generator[tuple[list[int], int], None, None] – valid_kmer_nums_in_group, group_size
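
The group-filtering logic described above can be sketched with itertools.groupby over a sorted list. This is a simplified illustration, not the library's generator: kmers are plain (kmer_string, kmer_num) pairs, kmers_by_group is a hypothetical name, and each yielded tuple is shaped like the (kmer_num, group_size_yielded, group_size_total) output documented for get_kmer_info_minimal.

```python
from itertools import groupby


def kmers_by_group(sorted_items, min_group_size=1, max_group_size=None, yield_first_n=None):
    """Yield (kmer_num, group_size_yielded, group_size_total) (illustrative only).

    sorted_items: list of (kmer_string, kmer_num) pairs sorted by kmer_string.
    """
    for _, group in groupby(sorted_items, key=lambda item: item[0]):
        members = [kmer_num for _, kmer_num in group]
        total = len(members)
        # Skip groups whose total size is outside the inclusive bounds.
        if total < min_group_size:
            continue
        if max_group_size is not None and total > max_group_size:
            continue
        n_yield = total if yield_first_n is None else min(total, yield_first_n)
        for kmer_num in members[:n_yield]:
            yield kmer_num, n_yield, total
```

Note that the group-size bounds apply to the total group size, while yield_first_n only limits how many members of an accepted group are emitted.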

sequence_collection

class genome_kmers.sequence_collection.SequenceCollection(fasta_file_path: Path | None = None, sequence_list: list[tuple[str, str]] | None = None, strands_to_load: str = 'forward')[source]

Bases: object

Holds all the information contained within a fasta file in a format conducive to kmer sorting.

Terminology

sba: sequence byte array
revcomp: reverse complement
record: each header and its corresponding sequence is called a record.  record_num is based on
    the order that records are read in.  record_num does not change when reverse
    complemented
segment: is the same as a record except that segment_num always starts leftmost.  i.e. the
    sba end index for segment N is always less than the sba end index for segment M > N
forward_sba_idx: index in forward sequence byte array
revcomp_sba_idx: index in reverse complement sequence byte array
forward_seq_idx: 0-based index for a sequence on the forward strand
revcomp_seq_idx: 0-based index for a sequence on the reverse complement strand

Forward strand
--------------
record_num                   0           1                  2
segment_num                  0           1                  2
forward_record_name          A           B                  C
forward_sba              [-------]$[------------]$[--------------------]
                         |       | |            | |                    |
forward_sba_seg_starts   0       | 10           | 25                   |
forward_sba_seg_ends             8              23                     46

Reverse complement strand
-------------------------
revcomp_sba_seg_ends                          21             36        46
revcomp_sba_seg_starts   0                    | 23           | 38      |
                         |                    | |            | |       |
revcomp_sba              [====================]$[============]$[=======]
revcomp_record_name                C                   B           A
segment_num                        0                   1           2
record_num                         2                   1           0

Notes

- a "$" is placed between each sequence so that you can determine if you've reached the end of
    a sequence without referencing the seg_starts / seg_ends array.
- the collection must contain at least one sequence
- all sequences in the collection must have a length > 0
- duplicate record_names are not allowed

Members

forward_sba (np.array): forward sba (dtype=uint8)
_forward_sba_seg_starts (np.array): value at index i gives the sba start index for the ith
    segment on the forward strand. (dtype=uint32)
revcomp_sba (np.array): reverse complement sba (dtype=uint8)
_revcomp_sba_seg_starts (np.array): value at index i gives the sba start index for the ith
    segment on the reverse complement strand. (dtype=uint32)
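
The layout in the diagram can be reproduced with a short sketch. It is only an illustration of the “$”-separated layout, written with plain bytes and lists rather than NumPy arrays; build_forward_sba is a hypothetical helper, not part of the package.

```python
def build_forward_sba(records):
    """Concatenate sequences with "$" separators (illustrative only).

    records: list of (record_name, sequence) tuples.
    Returns (sba, seg_starts, seg_ends) with inclusive end indices.
    """
    sba = bytearray()
    seg_starts, seg_ends = [], []
    for i, (_, seq) in enumerate(records):
        if i > 0:
            sba += b"$"  # separator between adjacent segments
        seg_starts.append(len(sba))
        sba += seq.encode("ascii")
        seg_ends.append(len(sba) - 1)
    return bytes(sba), seg_starts, seg_ends


# Segment lengths 9, 14, and 22 match the forward-strand diagram.
records = [("A", "A" * 9), ("B", "C" * 14), ("C", "G" * 22)]
sba, seg_starts, seg_ends = build_forward_sba(records)
```

With these lengths the starts are 0, 10, 25 and the ends are 8, 23, 46, as in the forward-strand diagram. Feeding the records in reverse order gives the starts 0, 23, 38 shown for the reverse complement strand.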
generate_get_record_info_from_sba_index_func(one_based: bool = False) Callable[source]

Generate a function that returns record info when provided a sequence byte array index

Parameters:

one_based (bool, optional) – whether sequence indices should be one-based. Defaults to False.

Raises:

ValueError – sba_strand not recognized

Returns:

get_record_info_from_sba_index

Return type:

Callable

get_record_loc_from_sba_index(sba_idx: int, sba_strand: str = None, one_based: bool = False) tuple[str, str, int][source]

Get the sequence location (strand, record_name, seq_idx) from the sequence byte array index

Parameters:
  • sba_idx (int) – sequence byte array index

  • sba_strand (str, optional) – sequence byte array strand. If strands_loaded is “both”, then it must be specified as “forward” or “reverse_complement”. Otherwise it is inferred. Defaults to None.

  • one_based (bool, optional) – whether the returned seq index should be one-based. Defaults to False.

Raises:

ValueError – _description_

Returns:

strand, record_name, seq_idx

Return type:

tuple[str, str, int]

get_record_name_from_sba_index(sba_idx: int, sba_strand: str = None) str[source]

Get the sequence record name from the sequence byte array index.

Parameters:
  • sba_idx (int) – sequence byte array index

  • sba_strand (str, optional) – for which strand is the sba_idx defined (“forward” or “reverse_complement”). Must be defined when SequenceCollection has both strands loaded. If specified when only a single strand has been loaded, it will verify that it matches what is expected. If set to None, it will automatically detect the strand that was loaded. Defaults to None.

Return type:

record_name (str)

get_sba_start_end_indices_for_segment(segment_num: int, sba_strand: str = None) tuple[int, int][source]

Given a segment number (and optionally sba_strand), return the first sba index and last sba index of the segment.

Parameters:
  • segment_num (int) – segment number

  • sba_strand (str, optional) – sequence byte array strand (“forward”, “reverse_complement”, or “both”). Defaults to None.

Raises:

ValueError – segment_num out of bounds

Returns:

sba_start_index, sba_end_index

Return type:

tuple[int, int]

get_segment_num_from_sba_index(sba_idx: int, sba_strand: str = None) int[source]

Get the segment number from the sequence byte array index defined on sba_strand (attempt to automatically detect the strand if not specified)

Parameters:
  • sba_idx (int) – sequence byte array index

  • sba_strand (str, optional) – for which strand is the sba_idx defined (“forward” or “reverse_complement”). Must be defined when SequenceCollection has both strands loaded. If specified when only a single strand has been loaded, it will verify that it matches what is expected. If set to None, it will automatically detect the strand that was loaded. Defaults to None.

Return type:

segment_num (int)

iter_records(sba_strand: str = None)[source]

Yield (record_name, record_sba_start_idx, record_sba_end_idx) for all sequence records in order of increasing record_num. i.e. if records are ordered as “chr1”, “chr2”, “chr3”, they will be yielded in that same order regardless of whether the strand is “forward” or “reverse_complement”.

Parameters:

sba_strand (str, optional) – for which strand to yield records. If self._strands_loaded == “both”, it must be specified. Otherwise it is determined from the loaded strand (“forward” or “reverse_complement”). Defaults to None.

Yields:

(record_name, record_sba_start_idx, record_sba_end_idx)

load(load_file_path: Path, format: str = 'hdf5')[source]

Load SequenceCollection object from saved file.

Parameters:
  • load_file_path (Path) – path to file to load.

  • format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.

Raises:

ValueError – format not recognized

reverse_complement() array[source]

Reverse complement the sequence byte array. Only valid if a single strand is loaded.

save(save_file_path: Path, mode: str = 'a', format: str = 'hdf5') None[source]

Save SequenceCollection object to file.

Parameters:
  • save_file_path (Path) – path for saved file

  • format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.

  • mode (str, optional) – mode with which to open file and save information. “w” for write or “a” for append. Defaults to “a”.

Raises:

ValueError – format not recognized

sequence_length(record_num=None, record_name=None)[source]
Returns:

If record_num is specified, the length of that record. If record_name is specified, the length of the record corresponding to record_name. Otherwise, the total length of all sequences.

strands_loaded() str[source]

Returns which strands are loaded.

Returns:

which strands are loaded: “forward”, “reverse_complement”, or “both”

Return type:

str

genome_kmers.sequence_collection.bisect_right(a, x)[source]

NOTE: this is a minor modification to the copy and paste from the python standard library. It is not possible to use the python standard library version with the @jit decorator, which is required since functions that call this function need to use the @jit decorator.

Return the index where to insert item x in list a, assuming a is sorted.

The return value i is such that all e in a[:i] have e <= x, and all e in a[i:] have e > x. So if x already appears in the list, a.insert(i, x) will insert just after the rightmost x already there.

Optional args lo (default 0) and hi (default len(a)) bound the slice of a to be searched.

genome_kmers.sequence_collection.get_forward_seq_idx(sba_idx: int, sba_strand: str, seg_sba_start_idx: int, seg_sba_end_idx: int, one_based: bool = False) int[source]

Get the forward sequence index given a sequence byte array index and segment start and end indices. Optionally returns a one-based sequence index.

Parameters:
  • sba_idx (int) – sequence byte array index

  • sba_strand (str, optional) – sequence byte array strand. If strands_loaded is “both”, then it must be specified as “forward” or “reverse_complement”. Otherwise it is inferred. Defaults to None.

  • seg_sba_start_idx (int) – sequence byte array index for the start of the segment

  • seg_sba_end_idx (int) – sequence byte array index for the end of the segment

  • one_based (bool, optional) – whether the returned seq index should be one-based. Defaults to False.

Raises:
  • ValueError – if sba_idx, seg_sba_start_idx, or seg_sba_end_idx is not valid

  • ValueError – if sba_strand is not recognized

Returns:

sequence forward strand index

Return type:

int
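
One plausible form of the index arithmetic is sketched below. It rests on an assumption that the source does not state explicitly: on the reverse complement strand, the first sba index of a segment corresponds to the last base of the forward sequence, so the forward index is measured back from the segment end. The function name is hypothetical.

```python
def forward_seq_idx(sba_idx, sba_strand, seg_sba_start_idx, seg_sba_end_idx, one_based=False):
    """Map an sba index to a forward-strand sequence index (illustrative only)."""
    if not seg_sba_start_idx <= sba_idx <= seg_sba_end_idx:
        raise ValueError("sba_idx is outside the segment")
    if sba_strand == "forward":
        seq_idx = sba_idx - seg_sba_start_idx
    elif sba_strand == "reverse_complement":
        # Assumption: the segment is stored reversed, so count from its end.
        seq_idx = seg_sba_end_idx - sba_idx
    else:
        raise ValueError(f"sba_strand not recognized: {sba_strand}")
    return seq_idx + 1 if one_based else seq_idx
```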

genome_kmers.sequence_collection.get_sba_start_end_indices_for_segment(segment_num: int, sba_strand: str, sba_seg_starts: array, len_sba: int) tuple[int, int][source]

Given a segment number (and optionally sba_strand), return the first sba index and last sba index of the segment.

Parameters:
  • segment_num (int) – segment number

  • sba_strand (str, optional) – sequence byte array strand (“forward”, “reverse_complement”, or “both”). Defaults to None.

Raises:

ValueError – segment_num out of bounds

Returns:

sba_start_index, sba_end_index

Return type:

tuple[int, int]

genome_kmers.sequence_collection.get_segment_num_from_sba_index(sba_idx: int, sba_strand: str, sba_seg_starts: array) int[source]

Get the segment number from the sequence byte array index.

Parameters:
  • sba_idx (int) – sequence byte array index

  • sba_strand (str) – “forward” or “reverse_complement”

  • sba_seg_starts (np.array) – sba indices for each segment start (dtype=np.uint32)

Returns:

segment_num

Return type:

int
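
Because segment starts are sorted, the lookup is a binary search. The sketch below uses the standard library's bisect_right on a plain list to show the idea; the package ships its own jit-compatible bisect_right, and the helper name here is hypothetical. An index that lands on a “$” separator is attributed to the preceding segment in this sketch, which may differ from the library's behaviour.

```python
from bisect import bisect_right


def segment_num_from_sba_index(sba_idx: int, sba_seg_starts: list) -> int:
    """Return the segment containing sba_idx (illustrative only)."""
    if sba_idx < 0:
        raise ValueError("sba_idx must be non-negative")
    # bisect_right gives the number of segment starts <= sba_idx.
    return bisect_right(sba_seg_starts, sba_idx) - 1


seg_starts = [0, 10, 25]  # segment starts from the forward-strand diagram
```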

genome_kmers.sequence_collection.reverse_complement_sba(sba: array, complement_mapping_arr: array, inplace=False) array[source]

Reverse complement sequence byte array (sba) using the uint8 to uint8 mapping array (complement_mapping_arr). This function uses numba.jit for performance.

Parameters:
  • sba (np.array) – sequence byte array (dtype=uint8)

  • complement_mapping_arr (np.array) – maps from sequence byte array value (uint8) to complement sequence byte array value (uint8)

  • inplace (bool, optional) – whether to perform in place or return a newly created array. Defaults to False.

Returns:

reverse complemented sequence byte array

Return type:

np.array
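
The mapping-array approach can be shown with Python's built-in bytes.translate, which applies a 256-entry byte-to-byte table. This is a pure-Python analogue rather than the numba implementation, and both helper names are hypothetical. Bytes without a complement, including the “$” separator, map to themselves.

```python
def make_complement_table() -> bytes:
    """256-entry table mapping each byte to its complement (illustrative only)."""
    table = bytearray(range(256))  # identity mapping by default
    for base, comp in zip(b"ACGTacgt", b"TGCAtgca"):
        table[base] = comp
    return bytes(table)


def reverse_complement(sba: bytes, table: bytes) -> bytes:
    # Complement every byte, then reverse the whole array.
    return sba.translate(table)[::-1]


table = make_complement_table()
revcomp = reverse_complement(b"AACG$TT", table)
```

Reversing the whole array also reverses the segment order, which is why segment 0 on the reverse complement strand corresponds to the last record on the forward strand.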