Python Regex for Numeric Pattern that has Dashes

An answer to this question on Stack Overflow.

Question

I have a column in a pandas data frame called sample_id. Each entry contains a string, and from this string I'd like to pull a numeric pattern that will have one of two forms:

1-234-5-6789

or

123-4-5648

I'm having trouble defining the correct regex pattern for this. So far I have been experimenting with the following:

re.findall(pattern=r'\b2\w+', string=str(data['sample_id']))

But this only pulls values that start with 2, and only the first chunk of the numeric pattern. How do I express the above patterns with the dashes?

Answer

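A vertical pipe | acts as an OR (alternation) in a regular expression, so you can join the two forms into a single pattern:

import re
# First alternative: d-ddd-d-dddd; second alternative: ddd-d-dddd
pattern = r'[0-9]-[0-9]{3}-[0-9]-[0-9]{4}|[0-9]{3}-[0-9]-[0-9]{4}'
test1 = '123-4-5648'
test2 = '1-234-5-6789'
print(re.findall(pattern=pattern, string=test1))  # ['123-4-5648']
print(re.findall(pattern=pattern, string=test2))  # ['1-234-5-6789']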

[0-9] matches a single digit in the range 0 through 9 (inclusive); {3} and {4} require exactly three or four such digits in a row; - matches a literal hyphen; and | is the OR that separates the two patterns you mention.
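
Since the longer form is just the shorter form with one extra digit and a hyphen in front, the two alternatives can also be folded into one pattern with an optional prefix. The following is a minimal sketch of that idea, not the only way to do it. The names ID_PATTERN, sample_ids and extract_id are hypothetical, the sample strings are invented for illustration, and the lookarounds that reject neighbouring digits are an added assumption about what should count as a match. It searches each entry on its own rather than str(data['sample_id']), which is the printed representation of the whole column.

```python
import re

# Optional leading "d-" covers the longer form. The lookarounds are an added
# assumption: they stop the ID from being cut out of a longer run of digits.
ID_PATTERN = re.compile(r'(?<![0-9])(?:[0-9]-)?[0-9]{3}-[0-9]-[0-9]{4}(?![0-9])')

# Hypothetical sample_id values, invented for illustration only.
sample_ids = [
    'run_1-234-5-6789_A',
    'batch 123-4-5648',
    'no id here',
]

def extract_id(text):
    """Return the first matching ID in text, or None if there is none."""
    match = ID_PATTERN.search(text)
    return match.group(0) if match else None

extracted = [extract_id(s) for s in sample_ids]
print(extracted)  # ['1-234-5-6789', '123-4-5648', None]
```

On the real data frame, the same pattern wrapped in a capture group could likely be passed to data['sample_id'].str.extract(...), which works row by row and returns a missing value where nothing matches.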