It was announced recently that, starting with Crystal 1.8.0, the default Regex engine is going to be PCRE2.
After doing some digging into this thread on GitHub, it appears that the original developers of PCRE either don't remember what was done or aren't around for comment any more.
Sadly, this looks like a case of bit rot: an abandoned open source project that was poorly documented. It appears that the decision to move forward by upgrading/rewriting the library is what allowed the current maintainers to fix the known issues while making some improvements. However, there wasn't any documentation built along the way, which makes the transition a bit daunting because there is a possibility of undocumented breaking changes.
I'm in a similar situation with the Amber Framework. I'm not one of the original creators of Amber; I just took it over last year (2022). So I have a decent-sized code base to maintain, with potential for breaking changes.
Here's The Plan
Here's how I'm going through and verifying the behavior does not change:
Step 1. Identify all regexes used in the code base.
Step 2. Create tests that uniquely test those regular expressions.
Step 3. Switch to using PCRE2 in the tests and see if anything blows up.
Before We Begin
Let's make sure we have PCRE2 installed on our systems. I work on a Mac, so I'll be using Homebrew to install PCRE2.
brew install pcre2
Step 1. Identifying Regexes
My preferred editor is Visual Studio Code, which has a handy feature that allows me to search with... regexes! Since it's easy to find the syntaxes for regular expressions in Crystal, I'll summarize them here:
/123/ # Uses the slash regex literal `/` with an enclosing `/`
# Sometimes this syntax will appear with wrapping parentheses, sometimes without.
"abcd".matches? /abcd/
"abcd".matches?(/abcd/)
Regex.new("123") # Converts the string into a Regex
# The percent regex literal and the valid delimiters,
# which are: (), [], {}, <>, ||
%r((/)) # => /(\/)/
%r[[/]] # => /[\/]/
%r{{/}} # => /{\/}/
%r<</>> # => /<\/>/
%r|/| # => /\//
This gives us 7 combinations of how regular expressions can be found across the code bases.
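To make the seven forms concrete, here is a minimal sketch that writes one pattern in each syntax and checks that they all match the same input. The pattern `abc/\d+` and the sample string are made up for illustration; they don't come from the Amber code base.

```crystal
# Each entry is the same (hypothetical) pattern written with a
# different Crystal regex syntax.
patterns = {
  "slash literal" => /abc\/\d+/,
  "Regex.new"     => Regex.new("abc/\\d+"),
  "%r()"          => %r(abc/\d+),
  "%r[]"          => %r[abc/\d+],
  "%r{}"          => %r{abc/\d+},
  "%r<>"          => %r<abc/\d+>,
  "%r||"          => %r|abc/\d+|,
}

# Print whether each form matches the same sample input.
patterns.each do |syntax, regex|
  puts "#{syntax}: #{regex.matches?("abc/123")}"
end
```

Note that only the slash literal needs the forward slash escaped, which is the main reason the percent forms show up in path-heavy code.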
The reference material for how Visual Studio Code handles regular expressions left me a little frustrated at the lack of specificity or a clear exploration path for additional resources. So, through some trial and error, I managed to create this regular expression that matches all of the above syntaxes for Crystal's regular expressions:
\/.*\/ |\(\/.*\/\)|%r\(\(.*\)\)|%r\[\[.*\]\]|%r\{\{.*\}\}|%r<<.*>>|%r\|.*\||Regex.new
Important note: there is a leading space at the beginning of this regex. This is important! Make sure you add that space if it is not present when you copy/paste.
You can verify this is still working as expected by copying the regular expression syntax summary into a file in your project and using the project search feature with that expression. It should match both the %r syntax and the commented output with the double forward slashes (but ignore the # => text), the Regex.new, and the more vague double forward slash syntax /.../.
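If you'd rather not depend on the editor's search, the same idea can be sketched as a small Crystal script. This is a rough approximation, not the exact expression above: the `REGEX_USAGE` pattern, the `find_regex_usages` helper, and the `src`/`spec` directory layout are all my own assumptions, and a pattern this simple will produce some false positives and miss some cases.

```crystal
# Hypothetical pattern for spotting regex usage in Crystal source:
#   - a slash literal that follows whitespace, "(" or the line start
#   - any percent literal opener (%r followed by a valid delimiter)
#   - a call to Regex.new
REGEX_USAGE = /(?:^|[\s(])\/[^\/\s].*\/|%r[(\[{<|]|Regex\.new/

# Returns {line number, stripped line} for every line that looks like
# it contains a regex.
def find_regex_usages(source : String) : Array(Tuple(Int32, String))
  hits = [] of Tuple(Int32, String)
  source.lines.each_with_index(1) do |line, number|
    hits << {number, line.strip} if line.matches?(REGEX_USAGE)
  end
  hits
end

# Scan the (assumed) src and spec folders and print each hit.
Dir.glob("src/**/*.cr", "spec/**/*.cr").each do |path|
  find_regex_usages(File.read(path)).each do |number, text|
    puts "#{path}:#{number}: #{text}"
  end
end
```

Requiring a non-space character right after the opening slash is what keeps plain division such as `a / b / c` out of the results.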
When I ran this regex on the Amber code base (Amber v1.3) I got these results:
31 results in 10 files. Not bad, this turned out to be more manageable than I first expected it would be.
Step 2. Testing The Found Regular Expressions
I can pretty easily grab all of my results here by clicking on the "Open in editor" link just below the search fields in the project search panel.
The text document that pops up has just enough info to be dangerous. Let's copy that and make a new spec file. For Amber, I'm going to put this right in the root spec folder.
Now it's time to de-dupe my results. The results aren't large, and I can see the same regex used in a split() multiple times; otherwise everything looks unique. Now I'm down to 27 tests to make.
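I de-duped by eye, but for a larger result set the counting could be scripted. The sketch below is only an illustration: the sample lines are made up in the style of the search results, and the extraction pattern only handles slash literals wrapped in parentheses, such as `split(/\s/)`.

```crystal
# Hypothetical sample of lines copied out of the search results.
lines = [
  %q(logs = `ls tmp/*_console_result.log`.strip.split(/\s/).sort),
  %q(files = output.split(/\s/)),
  %q(if name.match(/\A[a-zA-Z]/)),
]

# Pull out the text between "(/" and "/)" on each line.
literals = lines.flat_map do |line|
  line.scan(/\(\/(.+?)\/\)/).map(&.[1])
end

# Count how often each distinct pattern appears.
counts = literals.tally
counts.each do |pattern, count|
  puts "#{count}x /#{pattern}/"
end
```

Anything that appears more than once only needs a single test.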
I'm taking a couple of approaches due to patterns that start to appear in the regexes I'm seeing. We have a group that are unique in their total contents, but they are ultimately testing the beginning and ending of the same regex, like this:
# spec/amber/cli/recipes/recipe_fetcher_spec.cr:
template.should match(/.+mydefault\/app$/)
template.should match(/.+mydefault\/controller$/)
template.should match(/.+mydefault\/model$/)
template.should match(/.+mydefault\/scaffold$/)
template.should match(/.+mydefault\/.recipes\/lib\/amber_granite\/app$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/controller$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/model$/)
template.should match(/.+\.recipes\/lib\/amber_granite\/scaffold$/)
I'll narrow this down to a test with /.+\.test\/path$/ which should be testing a string that has "test/path" in its contents.
As I worked my way through the regexes to test, I noticed a large grouping from a monkey patch module for the String class. Personally, I'm not a fan of monkey patching, and these regexes are pretty old, with other methods from the std lib now available to do some of these same things. I also noticed there's already a spec specifically for those methods, so I'm going to skip making unique tests for them.
All in all, I ended up with only 9 tests.
require "./spec_helper"

describe "Testing regular expressions" do
  # spec/amber/cli/commands/exec_spec.cr:
  # 43: logs = `ls tmp/*_console_result.log`.strip.split(/\s/).sort
  it "verifies the regex splits on a space when using `/\s/`" do
    string_array = "test string".split(/\s/).sort
    string_array.first.should eq("string")
    string_array.last.should eq("test")
  end

  # spec/amber/cli/recipes/recipe_fetcher_spec.cr:
  # 21: template.should match(/.+mydefault\/app$/)
  # 27: template.should match(/.+mydefault\/controller$/)
  # 33: template.should match(/.+mydefault\/model$/)
  # 39: template.should match(/.+mydefault\/scaffold$/)
  # 57: template.should match(/.+mydefault\/.recipes\/lib\/amber_granite\/app$/)
  # 77: template.should match(/.+\.recipes\/lib\/amber_granite\/controller$/)
  # 86: template.should match(/.+\.recipes\/lib\/amber_granite\/model$/)
  # 95: template.should match(/.+\.recipes\/lib\/amber_granite\/scaffold$/)
  it "verifies the 1 or more of any starting character, ending with a set specific string" do
    test_string = "blahblah/blah/bblahaaaa1233123412341234123423this/is/a/path"
    test_string.should match(/.+this\/is\/a\/path$/)
    test_string2 = "blahblah4321341234!!!!.this/is/a/test/path"
    test_string2.should match(/.+\.this\/is\/a\/test\/path$/)
  end

  # spec/amber/pipes/static_spec.cr:
  # 57: response_true.body.should match(/index/)
  it "verifies a basic plain set of characters in a regex works" do
    # This test is so basic it probably could have been skipped, but I kept it for consistency's sake
    "has the word index in it".matches?(/index/).should eq(true)
  end

  # spec/support/helpers/cli_helper.cr:
  # 123: route_table_text.split("\n").reject { |line| line =~ /(─┼─|═╦═|═╩═)/ }
  it "verifies the regex for removing box drawing characters" do
    test_string = "═╩═\nhere is a\ntest string with new lines\n─┼─\nanother line\n═╦═"
    split_array = test_string.split("\n").reject { |line| line =~ /(─┼─|═╦═|═╩═)/ }
    split_array.size.should eq(3)
    split_array.first.should eq("here is a")
    split_array.last.should eq("another line")
  end

  # src/amber/cli/generators.cr:
  # 216: if name.match(/\A[a-zA-Z]/)
  # src/amber/cli/recipes/recipe.cr:
  # 36: if name.match(/\A[a-zA-Z]/)
  it "verifies a string starts with A-z (upper and lowercase)" do
    string1 = "Apex Legends123123"
    string2 = "123Googal"
    string3 = "lower case"
    string1.should match(/\A[a-zA-Z]/)
    string2.should_not match(/\A[a-zA-Z]/)
    string3.should match(/\A[a-zA-Z]/)
  end

  # src/amber/cli/commands/pipelines.cr:
  # 91: pipes = pipes.split(/,\s*/).map(&.gsub(/[:\"]/, ""))
  it "verifies the string is split after a comma with multiple white spaces and removal of colons and quotes from the results" do
    string = ":split, \"this\", string, properly"
    final_array = string.split(/,\s*/).map(&.gsub(/[:\"]/, ""))
    final_array.size.should eq(4)
    final_array.first.should eq("split")
    final_array.last.should eq("properly")
    final_array.find { |r| r.matches?(/:/) }.should eq(nil)
    final_array.find { |r| r.matches?(/\"/) }.should eq(nil)
  end

  # src/amber/cli/recipes/file_entries.cr:
  # 50: if /^(.+)\.lqd$/ =~ filename || /^(.+)\.liquid$/ =~ filename
  it "verifies the line begins with any character and ends with .lqd or .liquid" do
    string1 = "blah_blah blah.lqd"
    string2 = "blah blah blah.liquid"
    string3 = "liquid.blahblah"
    string1.matches?(/^(.+)\.lqd$/).should eq(true)
    string1.matches?(/^(.+)\.liquid$/).should eq(false)
    string2.matches?(/^(.+)\.lqd$/).should eq(false)
    string2.matches?(/^(.+)\.liquid$/).should eq(true)
    string3.matches?(/^(.+)\.lqd$/).should eq(false)
    string3.matches?(/^(.+)\.liquid$/).should eq(false)
  end

  # src/amber/pipes/static.cr:
  # 191: match = range.match(/bytes=(\d{1,})-(\d{0,})/)
  it "verifies a \"Range\" header has two sets of values separated by a hyphen with 1+ values before the hyphen and 0+ values after the hyphen" do
    range1 = "bytes=1231234-12421341"
    range2 = "bytes=0-1241234"
    range1.should match(/bytes=(\d{1,})-(\d{0,})/)
    range2.should match(/bytes=(\d{1,})-(\d{0,})/)
  end
end
Everything currently passes when running crystal spec spec/pcre2_regex_upgrade_spec.cr.
Now I've ended up with only 9 tests. Not bad!
Step 3. Testing PCRE2 - Does It Explode?
Thankfully this part is pretty easy. If you already have pcre2 installed, all you have to do is add a flag to our test command:
crystal spec -Duse_pcre2 spec/pcre2_regex_upgrade_spec.cr
This is now using the PCRE2 API instead of PCRE.
Everything is still passing, that's great!
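A passing run only tells us something if the flag actually reached the compiler. As a sanity check, a compile-time flag test can report it. This is a minimal sketch: the helper name `compiled_with_pcre2_flag?` is my own, and it only reports whether `-Duse_pcre2` was passed. On Crystal versions where PCRE2 is already the default, the flag can be absent while PCRE2 is still in use.

```crystal
# Reports whether this program was compiled with -Duse_pcre2.
# The macro branch is resolved at compile time, so only one of the
# two literals ends up in the compiled method.
def compiled_with_pcre2_flag? : Bool
  {% if flag?(:use_pcre2) %}
    true
  {% else %}
    false
  {% end %}
end

puts "use_pcre2 flag set: #{compiled_with_pcre2_flag?}"
```

Dropping a line like this into the spec file temporarily makes it obvious which way the suite was built.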
Final results
I decided to re-run the tests across the entire code base by customizing the bin/amber_spec file to use the -Duse_pcre2 flag and was able to get a clean test run.
So as best I can tell, Amber v1.3 will support the migration from PCRE -> PCRE2 without any hiccups.