This application enables discovery of archival materials available in the Philadelphia area. PACSCL members, Penn units, and other regional archives make EAD files available, which are then harvested and indexed by this application. Records can be harvested from a web page or from an ArchivesSpace instance. Blacklight is used to facilitate discovery and display of archival information. EAD XML metadata is parsed at index and display time using a custom EAD parser. A shallow integration with Aeon facilitates requesting materials from certain Penn Libraries collections.
The site includes a user-facing administration area available at /admin. PennKey users can access the admin area if their PennKey has been added to the Users table. Endpoint and ArchivesSpace connection information can be modified from within the admin interface.
Any user with admin access can add another user, by PennKey, using the User Admin area. Users can also be set to "inactive" to temporarily withhold access to the admin area.
Each organization providing records has a corresponding Endpoint. All information (slug, configuration, contact persons) about these endpoints is contained in the database and editable via the admin area of the site.
Endpoints can be set to "inactive" so that they are not automatically harvested when Endpoint harvesting runs. Additionally, endpoints can be harvested on an as-needed basis using the "Run Harvest" button on the endpoint show page; even inactive Endpoints can be harvested this way.
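Illustratively, automatic harvesting amounts to filtering on the active flag first, while a manual "Run Harvest" can target any endpoint. The model and attribute names below are a simplified sketch, not the application's actual code:

```ruby
# Hypothetical, simplified stand-in for the Endpoint model; the real
# application stores Endpoints in the database.
Endpoint = Struct.new(:slug, :active)

endpoints = [
  Endpoint.new('PCA', true),
  Endpoint.new('ISM', false) # inactive: skipped by automatic harvesting
]

# Automatic harvesting only touches active endpoints...
auto_harvest = endpoints.select(&:active).map(&:slug)
# ...while "Run Harvest" can target any single endpoint, active or not.
manual_target = endpoints.find { |e| e.slug == 'ISM' }.slug
```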
For now, all production Endpoint data can be found in the endpoints CSV file.
Important Note: EAD files in the EAD 3 spec will not be harvested. An error will be shown in the harvest outcomes if an EAD 3 file is detected.
When adding a new Endpoint, ensure that the identifiers generated for the Endpoint's records will be unique. Identifiers for each EAD are generated from the endpoint slug and unit id. If they will not be unique, this is a problem the partner needs to rectify.
We generate the id by extracting the unit id from /ead/archdesc/did/unitid[not(@audience='internal')], removing any characters that aren't letters, numbers, periods, or dashes, uppercasing the value, and then prefixing it with the endpoint slug followed by an underscore. The code looks something like:
```ruby
endpoint_slug = 'EXAMPLE'
unit_id = xml.at_xpath('/ead/archdesc/did/unitid[not(@audience="internal")]').text
"#{endpoint_slug}_#{unit_id.gsub(/[^A-Za-z0-9.-]/, '').upcase}"
```

When a new organization wishes to have their EADs indexed into the application they must provide:
- An endpoint slug, which can include uppercase letters and underscores
- A technical contact email
- A public contact email
- A webpage url, if indexing from a webpage
- A repository id, if indexing from an ArchivesSpace instance. This can be found via the ASpace Admin UI.
- An aspace_instance slug, if indexing from an ArchivesSpace instance. This slug must be no more than 20 characters.
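As a sketch, the information for a new webpage-harvested organization might be collected and sanity-checked like this; the field names are illustrative, not the application's actual schema:

```ruby
# Hypothetical attribute hash for a new webpage-harvested Endpoint.
new_endpoint = {
  slug: 'EXAMPLE_ORG',                       # uppercase letters and underscores only
  tech_contact: 'systems@example.org',
  public_contact: 'archives@example.org',
  url: 'https://example.org/eads/index.html' # webpage to harvest from
}

# Check the slug against the stated convention (uppercase letters and underscores).
valid_slug = new_endpoint[:slug].match?(/\A[A-Z_]+\z/)
```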
The ArchivesSpace integration supports harvesting records directly from an ArchivesSpace instance via the ArchivesSpace API. In order to harvest from an ArchivesSpace instance:
- an `ASpaceInstance` object must be related to the `Endpoint`
- a username and password must be provided in Vault and exposed to the application as a Docker Secret
It is important to note that all Resources in a Repository that have publish set to true in ArchivesSpace will be harvested.
ArchivesSpace credentials are stored in Penn Libraries' HashiCorp Vault, in an environment-agnostic vault named aspace_credentials. ASpaceInstance application models are linked to their corresponding secrets via a naming convention: the slug value of an ASpaceInstance prefixes the secret names, with the username stored as #{slug}_aspace_username and the password as #{slug}_aspace_password. As mentioned above, the ASpace slug must be no more than 20 characters for this configuration to work properly.
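The naming convention can be sketched as follows; the slug here is an example value, not a real ASpaceInstance:

```ruby
slug = 'upenn_museum' # example ASpaceInstance slug

# The slug must be short enough for the secret-name convention to work.
raise ArgumentError, 'slug must be 20 characters or fewer' if slug.length > 20

# Vault secret names derived from the slug.
username_secret = "#{slug}_aspace_username"
password_secret = "#{slug}_aspace_password"
```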
Steps for configuring these credentials in the application environments:
- Add appropriately-named values to the `aspace_credentials` vault.
- Add credential names to the Ansible configuration for each environment, starting with the development environment (see `ansible/inventories/vagrant/group_vars/docker_swarm_manager/finding_aid_discovery.yml`).
- Re-provision your local Vagrant environment to read the secrets from Vault into Docker Secrets.
This Endpoint configuration supports the legacy application style of basic HTML pages containing an index of links to EAD XML files. This extractor will parse an HTML document and pull out any hrefs that point to .xml files.
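A minimal sketch of that extraction, using a regex in place of the application's real HTML parsing (the actual extractor may differ, and a real implementation would use an HTML parser such as Nokogiri):

```ruby
# Sample legacy-style index page (hypothetical).
html = <<~HTML
  <html><body>
    <a href="ead/collection1.xml">Collection 1</a>
    <a href="about.html">About</a>
    <a href="ead/collection2.xml">Collection 2</a>
  </body></html>
HTML

# Pull out hrefs that point to .xml files; non-XML links are ignored.
xml_links = html.scan(/href="([^"]+\.xml)"/).flatten
```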
The site provides a few points for API access:
- `/api/endpoints` gives top-level information about the Endpoints in the system, the number of records, and a link to retrieve all records for each endpoint
- `/api/repositories` gives top-level information about the repositories in the system, the number of records, and a link to retrieve all records for each repository
- `/records.json?q=__SEARCH-TERM__` can be used to conduct a search and return results, in addition to facet values and search options
All data is returned in JSON. Search and document responses make use of the JSON:API schema.
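For example, a JSON:API-style response can be consumed like this; the sample payload below is hypothetical, and the real attribute names may differ:

```ruby
require 'json'

# Hypothetical response body shaped like a JSON:API document.
body = <<~JSON
  {"data": [
    {"id": "PCA", "type": "endpoint", "attributes": {"record_count": 42}},
    {"id": "ISM", "type": "endpoint", "attributes": {"record_count": 7}}
  ]}
JSON

doc = JSON.parse(body)
# Each JSON:API resource carries its identifier at the top level...
slugs = doc['data'].map { |resource| resource['id'] }
# ...and domain data under "attributes".
total = doc['data'].sum { |resource| resource.dig('attributes', 'record_count') }
```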
The raw EAD XML can be viewed by appending `/ead` to any record page URL (e.g., https://finding-aid-discovery-dev.library.upenn.edu/records/TUBLOCKSON_BC008/ead).
The sitemap is generated via the sitemap_generator gem. It is generated at deploy time in the docker-entrypoint.sh script if one isn't present, and it is scheduled to be regenerated after each harvest. If a harvest is completed outside of the scheduled harvest, the sitemap will have to be regenerated manually in order to reflect any changes. In most cases, it's fine to wait until the next scheduled sitemap generation.
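If manual regeneration is needed, the sitemap_generator gem's standard refresh task can be run from a shell in the application container (assuming the app uses the gem's default task name):

```shell
bundle exec rake sitemap:refresh
```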
The robots.txt file is generated and added to the public folder at deploy time. A different robots.txt is generated based on the environment. To manually create the robots.txt run:

```shell
bundle exec rake tools:robotstxt
```

Our local development environment uses Vagrant in order to set up a consistent environment with the required services. Please see the root README for instructions on how to set up this environment.
The Rails application will be available at https://finding-aid-discovery-dev.library.upenn.edu.
The Solr admin console will be available at http://finding-aid-discovery-dev.library.upenn.int/solr/#/.
Once your local development environment is set up you can ssh into the vagrant box to interact with the application:
- Enter the Vagrant VM by running `vagrant ssh` in the `/vagrant` directory
- Start a shell in the `finding_aid_discovery` container:

```shell
docker exec -it fad_finding_aid_discovery.1.{whatever} sh
```
To harvest some of the endpoints in a local development environment:
- To harvest from ASpace endpoints, it is best to use the Penn GlobalProtect VPN with the `sra.vpn.upenn.edu` server.
- Start a shell in the finding aid discovery app (see interacting-with-the-application)
- Run rake tasks:

```shell
bundle exec rake tools:sync_endpoints
bundle exec rake tools:harvest_from endpoints=ISM,WFIS,ANSP,LCP,CCHS,PCA
```

To harvest from all endpoints, use the `all` argument:

```shell
bundle exec rake tools:harvest_from endpoints=all
```

The harvest_from task also supports a `limit` param that limits the harvest of each specified endpoint to a provided integer. This makes it easier to test endpoints without having to harvest all of their records:

```shell
bundle exec rake tools:harvest_from endpoints=all limit=10
```

To run the test suite:
- Start a shell in the finding aids discovery app (see interacting-with-the-application)
- Run the `rspec` command:

```shell
RAILS_ENV=test bundle exec rspec
```
This application uses Rubocop to enforce Ruby and Rails style guidelines. We centralize our UPenn-specific configuration in upennlib-rubocop.
If there are Rubocop offenses that you are not able to fix, please do not edit the Rubocop configuration; instead, regenerate the rubocop_todo.yml using the following command:

```shell
rubocop --auto-gen-config --auto-gen-only-exclude --exclude-limit 10000
```

To change our default Rubocop config, please open an MR in the upennlib-rubocop project.