Andrew's Blog

Regular Expressions

DBC Phase 0 Week #8

January 04, 2015

A regular expression (regex) is a string of characters and symbols that defines a search pattern. You can search for specific digits or characters, whitespace before or after characters, and many more constraints, which makes regex a very powerful tool. The syntax is largely shared across programming languages, including Perl, PHP, Ruby, JavaScript and many more. There are many ways to use regex; one common use is changing the content of strings. The examples here are written in Ruby. Here's an example of regex:

        
        greeting = "Apples socks ham sam bam am"
        /sock./.match(greeting)
        => #<MatchData "socks">

In Ruby, a regular expression literal is usually written between / / (2 forward slashes). Search constraints are placed inside the slashes. Here it will search for sock. The extra . (dot) after the search term is a wildcard. It matches any one character, so the pattern matches sock plus one extra character, which here happens to be an "s". It returned "socks".
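To make the wildcard a bit more concrete, here's a small sketch that compares the pattern with and without the dot. The variable names (with_dot, without_dot, no_match) are just ones I made up for illustration.

```ruby
greeting = "Apples socks ham sam bam am"

# The dot matches exactly one character, whatever it is.
with_dot = /sock./.match(greeting)
puts with_dot[0]      # prints the matched text: socks

# Without the dot, only the literal letters are matched.
without_dot = /sock/.match(greeting)
puts without_dot[0]   # prints: sock

# match returns nil when the pattern cannot be found.
# "sock" has no characters left for the three dots to consume.
no_match = /sock.../.match("sock")
p no_match            # prints: nil
```

Since match can return nil, it's worth checking the result before indexing into it.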

Using the same greeting string, we can find the index where a match begins.

        
        >> p greeting =~ /ham/
        => 13

Here =~ checks whether the string matches the regular expression. If it does, it returns the index where the match starts; "ham" starts at index 13 of the original string.
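A quick sketch of what =~ gives back in a few different cases, including when nothing matches. The variable names are mine, chosen only for this example.

```ruby
greeting = "Apples socks ham sam bam am"

ham_index   = (greeting =~ /ham/)     # 13, where "ham" starts
eggs_index  = (greeting =~ /eggs/)    # nil, because there is no match
apple_index = (greeting =~ /Apples/)  # 0, a match at the very start

# In Ruby, 0 is truthy, so a match at index 0 still passes an if check.
puts "found Apples" if apple_index
puts "no eggs here" if eggs_index.nil?
```

Because nil is the only "no match" result, =~ works well directly inside a conditional.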

What if we want to be more specific in the search besides the (.) wildcard? The following demonstrates a character class, which is a list of characters placed inside brackets.


        p greeting.scan(/[hb]am/)
        #=> ["ham", "bam"]

The .scan method looks through the string and places all matches into an array. Here we're looking for either "ham" or "bam". The [hb] means the character in that position must be either "h" or "b", not both and not neither. Each bracketed class matches exactly one character. Scan is useful when you want all of your matches kept in a list.
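As a rough comparison, this sketch runs scan on the same string with a character class and then with the dot wildcard, to show how much narrower the class is. The variable names are only for illustration.

```ruby
greeting = "Apples socks ham sam bam am"

# A character class only accepts the listed characters in that position.
class_matches = greeting.scan(/[hbs]am/)
p class_matches   # ["ham", "sam", "bam"]

# The dot accepts any character, including the space before the final "am".
dot_matches = greeting.scan(/.am/)
p dot_matches     # ["ham", "sam", "bam", " am"]

# scan returns an empty array when nothing matches.
p greeting.scan(/[xyz]am/)   # []
```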

Replacing certain parts of a string is one common use of regex.

        
        >> listing = "Yan,Andrew, 949-234-9932"
        >> result = listing.gsub(/\D+/, "")
        >> puts result
        => 9492349932

Here, I created a string with my name and a phone number. I want to retrieve only the digits from the string. I use the .gsub method, which finds every match and replaces it with the given replacement. \D means a non-digit character. The + after it means one or more of whatever preceded the +, which was \D. I pass "" as the replacement, so every run of non-digit characters (letters, commas, spaces and hyphens) is replaced with nothing. I end up with just the numbers.
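Here's a small sketch of gsub going both directions: stripping out the non-digits as described above, and replacing the digits instead. The masked example is my own addition, not part of the original walkthrough.

```ruby
listing = "Yan,Andrew, 949-234-9932"

# Remove every run of non-digit characters.
digits = listing.gsub(/\D+/, "")
puts digits    # 9492349932

# Flip it around: \d matches a single digit, so each digit becomes "#".
masked = listing.gsub(/\d/, "#")
puts masked    # Yan,Andrew, ###-###-####

# gsub returns a new string; the original is left unchanged.
puts listing   # Yan,Andrew, 949-234-9932
```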

There are other ways of accessing the matches found besides using .scan.

        
        >> p name = /(\w+),\s?(\w+)/.match(listing)
        #=> #<MatchData "Yan,Andrew" 1:"Yan" 2:"Andrew">
        
        >> puts "#{name[2]} #{name[1]}" 
        => Andrew Yan

Here, I'm searching for a word, a comma, a possible space, then another word. \w+ is one or more word characters (letter, number, underscore). \s is whitespace. The ? after anything means zero or one of what preceded it, so \s? means there may or may not be a space. I also placed each \w+ in parentheses so I can access them later. Each captured group can be accessed with bracket notation on the MatchData object returned by match: name[1] is the first group and name[2] is the second. Here I printed the first name followed by the last name.
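To show the optional space and the capture groups working together, here's a sketch that runs the same pattern over two inputs, one with a space after the comma and one without. The inputs array and the full_names variable are hypothetical names for this example.

```ruby
pattern = /(\w+),\s?(\w+)/
inputs  = ["Yan,Andrew", "Yan, Andrew"]

full_names = inputs.map do |text|
  m = pattern.match(text)
  # captures returns the groups in order, the same values as m[1] and m[2].
  last, first = m.captures
  "#{first} #{last}"
end

p full_names   # ["Andrew Yan", "Andrew Yan"]

# m[0] is the whole match, m[1] and m[2] are the parenthesized groups.
m = pattern.match("Yan, Andrew")
puts m[0]      # Yan, Andrew
puts m[1]      # Yan
puts m[2]      # Andrew
```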

There are many more special characters that I didn't go over. You can restrict a search to the beginning of a line or to the end of a string. Here's a cool site that lets you try out regular expressions on your own: Rubular. There's also a key at the bottom that shows a lot of the special characters and explains what they do. These examples are just a couple of small things you can do with regular expressions. Once the concepts are solid, they carry across most languages, which is very beneficial.
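As a quick sketch of the anchors mentioned above: in Ruby, ^ matches at the beginning of a line and \z matches at the very end of the string. This example assumes Ruby 2.4 or newer, since it uses String#match?, which simply returns true or false.

```ruby
greeting = "Apples socks ham sam bam am"

# ^ anchors the pattern to the beginning of a line.
starts_with_apples = greeting.match?(/^Apples/)
starts_with_ham    = greeting.match?(/^ham/)

# \z anchors the pattern to the end of the string.
ends_with_am  = greeting.match?(/am\z/)
ends_with_ham = greeting.match?(/ham\z/)

p starts_with_apples   # true
p starts_with_ham      # false, "ham" is in the middle of the line
p ends_with_am         # true
p ends_with_ham        # false
```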